CN117594157A - Method and device for generating molecules of single system based on reinforcement learning - Google Patents

Method and device for generating molecules of single system based on reinforcement learning

Info

Publication number
CN117594157A
CN117594157A · CN202410077808.3A · CN117594157B
Authority: CN (China)
Prior art keywords: model, training, molecular, SMILES, training model
Prior art date
Legal status: Granted
Application number
CN202410077808.3A
Other languages: Chinese (zh)
Other versions: CN117594157B (en)
Inventor
李中伟
谢爱峰
柳彦宏
鲍雨
Current Assignee
Yantai Guogong Intelligent Technology Co ltd
Original Assignee
Yantai Guogong Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Yantai Guogong Intelligent Technology Co ltd filed Critical Yantai Guogong Intelligent Technology Co ltd
Priority to CN202410077808.3A
Publication of CN117594157A
Application granted
Publication of CN117594157B
Legal status: Active

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/50 Molecular design, e.g. of drugs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/092 Reinforcement learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics


Abstract

A method and device for generating molecules of a single system based on reinforcement learning belong to the technical field of molecular generation and prediction. The method deduplicates collected molecular expressions to obtain a molecular data set; expands the molecular data set by atom substitution to obtain an expanded data set and deduplicates it; pre-trains a Transformer model on the deduplicated expanded data set to obtain a pre-training model V1; performs reinforcement learning on the pre-training model V1 to obtain a pre-training model V2; and fine-tunes the pre-training model V2, quantitatively selecting molecules meeting the conditions during fine-tuning to participate in its training, obtaining a pre-training model V3 after fine-tuning, and generating new molecules of a single system through the pre-training model V3. The invention markedly improves the discovery efficiency of new molecules meeting production requirements and greatly shortens the cycle of new-molecule research and development in chemical laboratories.

Description

Method and device for generating molecules of single system based on reinforcement learning
Technical Field
The invention relates to a method and a device for generating molecules of a single system based on reinforcement learning, and belongs to the technical field of molecular generation and prediction.
Background
At present, in the field of chemical molecule research and development, designing and generating new molecules is a time-consuming and difficult task; researchers often expend a great deal of effort searching a huge chemical space through traditional methods such as chemical reaction paths.
In recent years, with the vigorous development of AI technology such as deep learning, the development of AI assisted chemical molecules is receiving more and more attention, and novel innovative ideas and solutions are provided for developers. The introduction of AI technology accelerates the molecular development process, shortens the development period, reduces the development cost, provides more choices for the research personnel, and brings breakthrough progress to the chemical molecular development.
However, AI-based molecule generation is bottlenecked by the limits of public data sets, so the diversity of the generated molecules rarely escapes the constraints of those data sets. Yet generating new molecules that depart from existing patent protection is the core task of molecular discovery. Escaping the bottleneck of insufficient data and generating diverse new molecules that meet the target conditions is therefore a technical problem to be solved in the field of molecular discovery.
Disclosure of Invention
Therefore, the invention provides a method and a device for generating molecules of a single system based on reinforcement learning, which solve the problems that traditional approaches cannot escape the bottleneck of insufficient data and cannot generate diverse new molecules meeting target conditions, leading to long research and development cycles for new molecules.
In order to achieve the above object, the present invention provides the following technical solution: a method for generating molecules of a single system based on reinforcement learning, comprising:
collecting molecular expressions from a public database, and deduplicating the collected molecular expressions to obtain a molecular data set;
expanding the molecular data set by atom substitution to obtain an expanded data set, and deduplicating the expanded data set;
pre-training a Transformer model on the deduplicated expanded data set to obtain a pre-training model V1; performing reinforcement learning on the pre-training model V1 to obtain a pre-training model V2;
and fine-tuning the pre-training model V2, quantitatively selecting molecules meeting the conditions during fine-tuning to participate in the training of the pre-training model V2, obtaining a pre-training model V3 after fine-tuning, and generating new molecules of a single system through the pre-training model V3.
As a preferred scheme of the method for generating molecules of a single system based on reinforcement learning, during the expansion of the molecular data set by atom substitution, Br atoms are used to replace H atoms bonded to C atoms in the SMILES molecular expressions in the molecular data set.
As a preferred scheme of the method for generating molecules of a single system based on reinforcement learning, the step of pre-training the Transformer model on the deduplicated expanded data set to obtain a pre-training model V1 includes:
encoding the SMILES molecular expressions in the molecular data set as a matrix;
inputting the encoded matrix into the Transformer model to obtain a molecular encoding output;
calculating a loss value between the molecular encoding output and the correct SMILES molecular expression using cross-entropy loss, and updating the parameters of the Transformer model by back-propagation;
saving the current Transformer model as the pre-training model V1 when the training loss value stabilizes after several rounds of training.
As a preferred scheme of the method for generating molecules of a single system based on reinforcement learning, the step of performing reinforcement learning on the pre-training model V1 to obtain a pre-training model V2 includes:
generating the SMILES expressions of the molecules of the current batch with the pre-training model V1;
evaluating and scoring the SMILES expressions of the current batch according to the set scoring standard;
training the weights of the pre-training model V1 using the evaluation score as the reward of the pre-training model V1;
after several rounds of iterative training, saving the last-round pre-training model V1 as the pre-training model V2.
As a preferred scheme of the method for generating molecules of a single system based on reinforcement learning, the set scoring standard score is:

score = { similarity, if the SMILES expression is valid; 0, if it is invalid }

where similarity denotes the similarity between the SMILES expression of a generated molecule and the molecules of the single system;

the cross-entropy loss used to calculate the loss value Loss between a molecular encoding output and the correct SMILES molecular expression is:

Loss = -Σ_i y_i log(p_i)

where y_i is the one-hot indicator of the correct token and p_i is the predicted probability.
As a preferred scheme of the method for generating molecules of a single system based on reinforcement learning, the step of fine-tuning the pre-training model V2 includes:
assigning the parameters of the pre-training model V2 to an Agent model and a Prior model respectively, so that the Agent model participates in training and updates the parameters of the pre-training model V2, while the gradients of the Prior model are frozen and do not participate in parameter updating;
generating SMILES expressions of molecules with the Agent model, screening out the SMILES expressions meeting a set condition, and stopping generation when their number reaches a set threshold; generating the same number of SMILES expressions with the Prior model;
pooling all generated SMILES expressions and inputting them into the Agent model and the Prior model to obtain the output Loss_Agent of the Agent model and the output Loss_Prior of the Prior model respectively, and constructing a loss function from Loss_Agent and Loss_Prior;
averaging the loss values, updating the parameters of the pre-training model V2 by back-propagation, and saving the current model as the pre-training model V3 when the training loss value of the pre-training model V2 stabilizes.
As a preferred scheme of the method for generating molecules of a single system based on reinforcement learning, the loss function constructed from the output Loss_Agent of the Agent model and the output Loss_Prior of the Prior model is:

Loss = (Loss_Agent - Loss_Prior)²

where Loss_Agent is the loss value of a SMILES expression calculated by the Agent model, and Loss_Prior is the loss value of the same SMILES expression calculated by the Prior model.
The invention also provides a device for generating molecules of a single system based on reinforcement learning, comprising:
a raw data acquisition module for collecting molecular expressions from a public database and deduplicating the collected molecular expressions to obtain a molecular data set;
a data expansion module for expanding the molecular data set by atom substitution to obtain an expanded data set and deduplicating the expanded data set;
a first model training module for pre-training a Transformer model on the deduplicated expanded data set to obtain a pre-training model V1;
a second model training module for performing reinforcement learning on the pre-training model V1 to obtain a pre-training model V2;
a third model training module for fine-tuning the pre-training model V2, quantitatively selecting molecules meeting the conditions during fine-tuning to participate in the training of the pre-training model V2, and obtaining a pre-training model V3 after fine-tuning;
a molecule generation module for generating new molecules of a single system through the pre-training model V3.
As a preferred scheme of the device for generating molecules of a single system based on reinforcement learning, in the data expansion module, Br atoms are used to replace H atoms bonded to C atoms in the SMILES molecular expressions in the molecular data set;
the first model training module includes:
an encoding processing sub-module for encoding the SMILES molecular expressions in the molecular data set as a matrix;
an encoding output sub-module for inputting the encoded matrix into the Transformer model to obtain a molecular encoding output;
a loss value calculation sub-module for calculating a loss value between the molecular encoding output and the correct SMILES molecular expression using cross-entropy loss, and updating the parameters of the Transformer model by back-propagation;
a first model storage sub-module for saving the current Transformer model as the pre-training model V1 when the training loss value stabilizes after several rounds of training.
As a preferred scheme of the device for generating molecules of a single system based on reinforcement learning, the second model training module includes:
an expression generation sub-module for generating the SMILES expressions of the molecules of the current batch with the pre-training model V1;
an expression scoring sub-module for evaluating and scoring the generated SMILES expressions of the current batch according to the set scoring standard;
a reward training sub-module for training the weights of the pre-training model V1 using the evaluation score as the reward of the pre-training model V1;
a second model storage sub-module for saving the last-round pre-training model V1 as the pre-training model V2 after several rounds of iterative training;
in the expression scoring sub-module, the set scoring standard score is:

score = { similarity, if the SMILES expression is valid; 0, if it is invalid }

where similarity denotes the similarity between the SMILES expression of a generated molecule and the molecules of the single system;

the cross-entropy loss used to calculate the loss value Loss between a molecular encoding output and the correct SMILES molecular expression is:

Loss = -Σ_i y_i log(p_i)
As a preferred scheme of the device for generating molecules of a single system based on reinforcement learning, the third model training module includes:
a model parameterization sub-module for assigning the parameters of the pre-training model V2 to an Agent model and a Prior model respectively, so that the Agent model participates in training and updates the parameters of the pre-training model V2, while the gradients of the Prior model are frozen and do not participate in parameter updating;
an intermediate generation sub-module for generating SMILES expressions of molecules with the Agent model, screening out the SMILES expressions meeting a set condition, stopping generation when their number reaches a set threshold, and generating the same number of SMILES expressions with the Prior model;
a loss construction sub-module for pooling all generated SMILES expressions, inputting them into the Agent model and the Prior model to obtain the output Loss_Agent of the Agent model and the output Loss_Prior of the Prior model respectively, and constructing a loss function from Loss_Agent and Loss_Prior;
a parameter updating sub-module for averaging the loss values and updating the parameters of the pre-training model V2 by back-propagation;
a third model storage sub-module for saving the current model as the pre-training model V3 when the training loss value of the pre-training model V2 stabilizes;
in the loss construction sub-module, the loss function constructed from the output Loss_Agent of the Agent model and the output Loss_Prior of the Prior model is:

Loss = (Loss_Agent - Loss_Prior)²

where Loss_Agent is the loss value of a SMILES expression calculated by the Agent model, and Loss_Prior is the loss value of the same SMILES expression calculated by the Prior model.
The invention has the following advantages: molecular expressions are collected from a public database and deduplicated to obtain a molecular data set; the molecular data set is expanded by atom substitution to obtain an expanded data set, which is deduplicated; a Transformer model is pre-trained on the deduplicated expanded data set to obtain a pre-training model V1; reinforcement learning is performed on the pre-training model V1 to obtain a pre-training model V2; the pre-training model V2 is fine-tuned, molecules meeting the conditions are quantitatively selected during fine-tuning to participate in its training, a pre-training model V3 is obtained after fine-tuning, and new molecules of a single system are generated through the pre-training model V3. The invention markedly improves the discovery efficiency of new molecules meeting production requirements and greatly shortens the cycle of new-molecule research and development in chemical laboratories.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It will be apparent to those skilled in the art that the drawings described below are merely exemplary, and that other drawings may be derived from them without inventive effort.
FIG. 1 is a schematic flow chart of a molecular generation method of a single system based on reinforcement learning provided in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a reinforcement learning flow in a molecular generation method of a reinforcement learning-based single system according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a fine tuning flow in a molecular generation method of a single system based on reinforcement learning according to an embodiment of the present invention;
FIG. 4 is a diagram of a molecular generating device architecture for a reinforcement learning-based single system provided in an embodiment of the present invention.
Detailed Description
Other aspects and advantages of the present invention will become apparent to those skilled in the art from the following detailed description, which illustrates the invention by way of certain specific embodiments, but not all embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
Example 1
Referring to fig. 1, 2 and 3, an embodiment of the present invention provides a method for generating molecules of a single system based on reinforcement learning, including the steps of:
S1, collecting molecular expressions from a public database, and deduplicating the collected molecular expressions to obtain a molecular data set;
S2, expanding the molecular data set by atom substitution to obtain an expanded data set, and deduplicating the expanded data set;
S3, pre-training a Transformer model on the deduplicated expanded data set to obtain a pre-training model V1;
S4, performing reinforcement learning on the pre-training model V1 to obtain a pre-training model V2;
S5, fine-tuning the pre-training model V2, quantitatively selecting molecules meeting the conditions during fine-tuning to participate in the training of the pre-training model V2, and obtaining a pre-training model V3 after fine-tuning;
s6, generating new molecules of a single system through the pre-training model V3.
In this embodiment, in step S1, molecular SMILES expressions are collected from the public QM9 dataset (containing the composition, spatial information, and corresponding properties of about 130,000 organic molecules, widely used in experiments and comparisons of data-driven molecular property prediction methods) and deduplicated, yielding on the order of one hundred thousand SMILES expressions. Next, in step S2, during the expansion of the molecular data set by atom substitution, Br atoms are used to replace H atoms bonded to C atoms in the SMILES molecular expressions in the molecular data set. Deduplicating the expanded data set yields about one million non-duplicate molecular SMILES expressions, which are divided into a training set and a validation set at a 9:1 ratio.
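The deduplication and 9:1 split in the step above can be sketched in plain Python. This is a minimal sketch: the SMILES strings below are illustrative placeholders, and in practice canonicalization with a cheminformatics toolkit such as RDKit would precede deduplication so that different spellings of the same molecule collapse to one entry.

```python
import random

def dedup_and_split(smiles_list, train_ratio=0.9, seed=42):
    """Remove duplicate SMILES strings, then split 9:1 into train/validation."""
    unique = list(dict.fromkeys(smiles_list))  # order-preserving deduplication
    rng = random.Random(seed)
    rng.shuffle(unique)
    cut = int(len(unique) * train_ratio)
    return unique[:cut], unique[cut:]

# Illustrative toy data (real inputs would come from QM9)
raw = ["CCO", "CCO", "c1ccccc1", "CC(=O)O", "CCN", "CCBr", "CC(=O)O",
       "CCCl", "CCC", "CO"]
train, val = dedup_and_split(raw)
print(len(train), len(val))  # 8 unique molecules -> 7 train, 1 validation
```

The fixed seed keeps the split reproducible between runs, which matters when the validation set is used to compare training checkpoints.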
In this embodiment, in step S3, the step of pre-training the Transformer model on the deduplicated expanded data set to obtain the pre-training model V1 includes:
encoding the SMILES molecular expressions in the molecular data set as a matrix;
inputting the encoded matrix into the Transformer model to obtain a molecular encoding output;
calculating a loss value between the molecular encoding output and the correct SMILES molecular expression using cross-entropy loss, and updating the parameters of the Transformer model by back-propagation;
saving the current Transformer model as the pre-training model V1 when the training loss value stabilizes after several rounds of training.
Specifically, a Transformer model is adopted for the pre-training task, with parameters set as follows: batch_size 256; 8 heads in the Multi-Head Attention; maximum sequence length 140; the adaptive Adam optimizer; dropout 0.1; 500 warm-up steps; and a cross-entropy loss function.
Using the settings above, the Transformer model performs the pre-training task: the SMILES expressions of the molecules are first encoded into a matrix, the encoded matrix is input into the encoder part of the Transformer model and processed by the multi-head attention layers, and an encoded representation is obtained. The encoded representation is input to the decoder part to predict the molecular SMILES; cross-entropy loss is used to calculate the loss value between the predicted and correct molecular SMILES, and back-propagation is used to update the Transformer model parameters. Finally, when the loss value stabilizes after several rounds of training, the current Transformer model is saved and named V1.
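The "encode SMILES expressions into a matrix" step above can be illustrated with a minimal character-level tokenizer. The regular expression, batch-local vocabulary, and padding scheme here are assumptions for illustration rather than the patent's actual tokenizer; the maximum sequence length of 140 is taken from the parameters above, and multi-character element symbols such as `Br` are kept as single tokens.

```python
import re

# Minimal SMILES tokenizer: match two-character elements first, then single characters.
TOKEN_RE = re.compile(r"Br|Cl|[A-Za-z0-9@+\-\[\]\(\)=#/\\%]")

def encode_batch(smiles_batch, max_len=140):
    """Map each SMILES string to a fixed-length row of integer token ids.
    Id 0 is reserved for padding; ids are assigned from the batch vocabulary."""
    tokenized = [TOKEN_RE.findall(s) for s in smiles_batch]
    vocab = {tok: i + 1
             for i, tok in enumerate(sorted({t for ts in tokenized for t in ts}))}
    matrix = [[vocab[t] for t in ts][:max_len] + [0] * (max_len - len(ts))
              for ts in tokenized]
    return matrix, vocab

matrix, vocab = encode_batch(["CCBr", "CC(=O)O"])
print(len(matrix), len(matrix[0]))  # 2 rows, each padded to length 140
print("Br" in vocab)                # True: Br is kept as a single token
```

In a real pipeline the vocabulary would be fixed over the whole training set rather than rebuilt per batch, so that token ids are stable across batches.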
In this embodiment, in step S4, the step of performing reinforcement learning on the pre-training model V1 to obtain the pre-training model V2 includes:
generating the SMILES expressions of the molecules of the current batch with the pre-training model V1;
evaluating and scoring the SMILES expressions of the current batch according to the set scoring standard;
training the weights of the pre-training model V1 using the evaluation score as the reward of the pre-training model V1;
after several rounds of iterative training, saving the last-round pre-training model V1 as the pre-training model V2.
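The reward step above, which uses each evaluation score as the model's reward, can be sketched as a reward-weighted loss. This is a hedged stand-in: `reinforcement_loss` and the toy numbers are hypothetical, and in practice the per-sequence negative log-likelihoods would come from the Transformer, with the weighted loss minimized by back-propagation.

```python
def reinforcement_loss(nlls, rewards):
    """Average each per-sequence negative log-likelihood weighted by its reward:
    sequences that score higher (more valid/similar molecules) contribute more
    to the gradient, pushing the model toward generating molecules like them."""
    if len(nlls) != len(rewards):
        raise ValueError("one reward per generated sequence")
    return sum(nll * r for nll, r in zip(nlls, rewards)) / len(nlls)

# Toy batch: two valid molecules (reward = similarity) and one invalid (reward 0)
nlls = [2.0, 1.5, 3.0]
rewards = [0.8, 0.6, 0.0]
print(reinforcement_loss(nlls, rewards))  # (1.6 + 0.9 + 0.0) / 3
```

Invalid SMILES receive reward 0 and therefore contribute nothing to the update, which matches the piecewise scoring standard described below.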
Specifically, the parameters of the pre-training model V1 are imported into a Transformer model architecture with the following settings: batch_size 100; 8 heads in the Multi-Head Attention; maximum sequence length 140; the adaptive Adam optimizer; dropout 0.1; 500 warm-up steps.
A batch of molecular SMILES expressions is generated with the pre-training model V1 and then scored according to the set scoring standard score:

score = { similarity, if the SMILES expression is valid; 0, if it is invalid }

where similarity denotes the similarity between the SMILES expression of a generated molecule and the molecules of the single system; each SMILES expression is thus measured in terms of both molecular validity and molecular similarity. The model weights are then trained using the evaluation score as the model's reward, and cross-entropy loss is used to calculate the loss value Loss between the molecular encoding output and the correct SMILES molecular expression:

Loss = -Σ_i y_i log(p_i)

where y_i is the one-hot indicator of the correct token and p_i is the predicted probability.
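The piecewise scoring standard can be illustrated as follows. This is a stand-in sketch: `bigrams`, `tanimoto`, and the toy validity check are assumptions for illustration, since in practice validity would be checked by parsing the SMILES with a cheminformatics toolkit such as RDKit and similarity computed as Tanimoto similarity over molecular fingerprints.

```python
def bigrams(s):
    """Character-bigram set: a crude stand-in for a molecular fingerprint."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two fingerprint sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def score(smiles, reference, is_valid):
    """score = similarity to the single-system reference when valid, else 0."""
    if not is_valid(smiles):
        return 0.0
    return tanimoto(bigrams(smiles), bigrams(reference))

valid = lambda s: s.count("(") == s.count(")")  # toy validity check only
print(score("CC(=O)O", "CC(=O)N", valid) > 0)   # similar valid molecule
print(score("CC(=O", "CC(=O)N", valid))         # unbalanced parentheses -> 0.0
```

The zero score for invalid strings is what gives the reward signal its two-sided character: the model is pushed toward chemically parseable SMILES and, among those, toward molecules close to the target single system.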
in this embodiment, in step S5, the parameters of the pretrained model V2 are imported by using the transducer model, wherein the parameters of the transducer are modified as follows: the epoch setting is 100, the batch_size is 50, and the other parameter settings are the same as those of the training process of the pre-training model V1.
The step of fine-tuning the pre-training model V2 includes:
assigning the parameters of the pre-training model V2 to an Agent model and a Prior model respectively, so that the Agent model participates in training and updates the parameters of the pre-training model V2, while the gradients of the Prior model are frozen and do not participate in parameter updating;
generating SMILES expressions of molecules with the Agent model, screening out the SMILES expressions meeting the set condition according to the scoring standard formula, and stopping generation when their number reaches the set threshold of 30; generating the same number of SMILES expressions with the Prior model;
pooling all generated SMILES expressions and inputting them into the Agent model and the Prior model to obtain the output Loss_Agent of the Agent model and the output Loss_Prior of the Prior model respectively, and constructing a loss function from Loss_Agent and Loss_Prior;
averaging the loss values, updating the parameters of the pre-training model V2 by back-propagation, and saving the current model as the pre-training model V3 when the training loss value of the pre-training model V2 stabilizes.
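The Agent/Prior loss construction and batch averaging can be sketched as below. This is a hedged sketch: `finetune_loss` is a hypothetical helper name, the inputs stand for the per-expression loss values produced by the two models over the pooled batch, and the squared-difference form of the combined loss is an assumption rather than the patent's exact formula.

```python
def finetune_loss(agent_losses, prior_losses):
    """Combine per-SMILES Agent and Prior loss values and average over the batch.
    The Prior is frozen, so only the Agent's parameters would receive gradients;
    the squared difference (assumed form) keeps the Agent close to the Prior
    while the screened, high-scoring molecules steer what it generates."""
    terms = [(la - lp) ** 2 for la, lp in zip(agent_losses, prior_losses)]
    return sum(terms) / len(terms)

# Toy per-expression loss values from the two models over a pooled batch
agent = [1.2, 0.9, 1.5]
prior = [1.0, 1.0, 1.0]
print(round(finetune_loss(agent, prior), 4))
```

Freezing the Prior serves as a regularizer: it anchors the fine-tuned Agent to the distribution learned during pre-training so that fine-tuning on a small screened set does not collapse diversity.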
The loss function constructed from the output Loss_Agent of the Agent model and the output Loss_Prior of the Prior model is:

Loss = (Loss_Agent - Loss_Prior)²

where Loss_Agent is the loss value of a SMILES expression calculated by the Agent model, and Loss_Prior is the loss value of the same SMILES expression calculated by the Prior model.
Finally, new molecules of a single system are generated with the pre-training model V3, and the generated single-system molecules are handed to experimenters, who screen out the molecules meeting the requirements through experiments.
In summary, the invention collects molecular expressions from a public database and deduplicates them to obtain a molecular data set, then expands the molecular data set by atom substitution to obtain an expanded data set and deduplicates it. Pre-training the Transformer model on the deduplicated expanded data set to obtain the pre-training model V1 includes: encoding the SMILES molecular expressions in the molecular data set as a matrix; inputting the encoded matrix into the Transformer model to obtain a molecular encoding output; calculating a loss value between the molecular encoding output and the correct SMILES molecular expression using cross-entropy loss and updating the parameters of the Transformer model by back-propagation; and saving the current Transformer model as the pre-training model V1 when the training loss value stabilizes after several rounds of training. Performing reinforcement learning on the pre-training model V1 to obtain the pre-training model V2 includes: generating the SMILES expressions of the molecules of the current batch with the pre-training model V1; evaluating and scoring the SMILES expressions of the current batch according to the set scoring standard; training the weights of the pre-training model V1 using the evaluation score as its reward; and saving the last-round pre-training model V1 as the pre-training model V2 after several rounds of iterative training.
Fine-tuning the pre-training model V2 includes: assigning the parameters of the pre-training model V2 to an Agent model and a Prior model respectively, so that the Agent model participates in training and updates the parameters of the pre-training model V2 while the gradients of the Prior model are frozen; generating SMILES expressions of molecules with the Agent model, screening out those meeting the set condition, and stopping generation when their number reaches the set threshold; generating the same number of SMILES expressions with the Prior model; pooling all generated SMILES expressions, inputting them into the Agent model and the Prior model to obtain the outputs Loss_Agent and Loss_Prior respectively, and constructing a loss function from them; averaging the loss values, updating the parameters of the pre-training model V2 by back-propagation, and saving the current model as the pre-training model V3 when the training loss value stabilizes. The invention markedly improves the discovery efficiency of new molecules meeting production requirements and greatly shortens the cycle of new-molecule research and development in chemical laboratories.
It should be noted that the method of the embodiments of the present disclosure may be performed by a single device, such as a computer or a server. The method of this embodiment may also be applied in a distributed scenario and completed by a plurality of devices cooperating with each other. In such a distributed scenario, one of the devices may perform only one or more steps of the method of the embodiments of the present disclosure, and the devices interact with each other to complete the method.
It should be noted that the foregoing describes some embodiments of the present disclosure. In some cases, the acts or steps recited in the present disclosure may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Example 2
Referring to fig. 4, embodiment 2 of the present invention further provides a molecular generating device of a single system based on reinforcement learning, including:
the original data acquisition module 001 is used for collecting the molecular expressions from the public database, and performing de-duplication processing on the collected molecular expressions to obtain a molecular data set;
the data expansion module 002 is configured to expand the molecular data set by means of atomic replacement to obtain an expanded data set, and perform deduplication processing on the expanded data set;
the first model training module 003 is configured to pre-train the Transformer model on the deduplicated expanded data set, so as to obtain a pre-training model V1;
the second model training module 004 is used for performing reinforcement learning processing on the pre-training model V1 to obtain a pre-training model V2;
the third model training module 005 is configured to perform fine tuning on the pre-training model V2, and quantitatively select molecules meeting the conditions to participate in training of the pre-training model V2 during the fine tuning process, so as to obtain a pre-training model V3 after the fine tuning process;
a molecular generation module 006 for generating new molecules of a single system by the pre-training model V3.
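The data expansion module's atomic-substitution step can be illustrated at the SMILES-string level. The sketch below is a deliberately simplified stand-in: it appends a Br branch after each carbon token rather than operating on the molecular graph (a real pipeline would use a cheminformatics toolkit such as RDKit), and it also performs the deduplication that follows expansion:

```python
def brominate_variants(smiles):
    """Return deduplicated SMILES variants with one C-H replaced by C-Br.

    Character-level stand-in only: assumes simple SMILES in which every
    'C' token still carries at least one implicit hydrogen.
    """
    variants = []
    for i, ch in enumerate(smiles):
        if ch == "C":
            # Replace one implicit H on this carbon with a Br branch.
            variants.append(smiles[: i + 1] + "(Br)" + smiles[i + 1 :])
    # Deduplication step, mirroring the module's second responsibility.
    return sorted(set(variants))

expanded = brominate_variants("CCO")  # ethanol -> two brominated variants
```

Each input molecule thus yields several singly-brominated variants, which is how the expanded data set grows before deduplication.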
In this embodiment, in the raw data acquisition module 001, Br atoms are used to replace H atoms bonded to C atoms in the SMILES molecular expressions in the molecular data set;
the first model training module 003 includes:
a coding processing sub-module 301, configured to encode the SMILES molecular expressions in the molecular data set into a matrix;
a coding output sub-module 302, configured to input the encoding matrix into the Transformer model and obtain the molecular encoding output;
a loss value calculation sub-module 303, configured to calculate the cross-entropy loss value between the molecular encoding output and the correct SMILES molecular expression, and to update the parameters of the Transformer model by back propagation;
a first model saving sub-module 304, configured to save the current Transformer model as the pre-training model V1 when its loss value stabilizes after several rounds of training.
In this embodiment, the second model training module 004 includes:
an expression generation sub-module 401, configured to generate the SMILES expressions of the current batch of molecules using the pre-training model V1;
an expression scoring sub-module 402, configured to evaluate and score the generated SMILES expressions of the current batch against the set scoring standard;
a reward training sub-module 403, configured to train the weights of the pre-training model V1 with the evaluation score as the reward of the pre-training model V1;
a second model saving sub-module 404, configured to save the final round's pre-training model V1 as the pre-training model V2 after several rounds of iterative training;
in the expression scoring sub-module 402, the set scoring standard score is:
score = similarity, when the generated SMILES expression is valid; score = 0, when it is invalid;
wherein similarity represents the similarity of the generated molecule's SMILES expression to the molecules in the single system;
the cross-entropy loss value L between the molecular encoding output and the correct SMILES molecular expression is calculated as: L = -Σ_i y_i·log(p_i), where y_i is the one-hot encoding of the correct token i and p_i is the model's predicted probability for that token.
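The scoring standard above (score equals the similarity for a valid SMILES, and 0 for an invalid one) can be sketched as follows; the validity check and similarity measure here are toy stand-ins, since the patent does not specify how validity is tested or which similarity metric is used (a real pipeline would parse with RDKit and use, e.g., Tanimoto similarity on fingerprints):

```python
def tanimoto_like(a, b):
    """Toy character-set similarity standing in for fingerprint similarity."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def is_valid(smiles):
    """Toy validity check: non-empty string with balanced parentheses."""
    depth = 0
    for ch in smiles:
        depth += ch == "("
        depth -= ch == ")"
        if depth < 0:
            return False
    return depth == 0 and bool(smiles)

def score(generated, reference):
    """score = similarity when the SMILES is valid, 0 when invalid."""
    return tanimoto_like(generated, reference) if is_valid(generated) else 0.0
```

Invalid strings are thus driven to zero reward, while valid ones are rewarded in proportion to their closeness to the target single system.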
In this embodiment, the third model training module 005 includes:
a model assignment sub-module 501, configured to assign the parameters of the pre-training model V2 to an Agent model and a Prior model respectively, so that the Agent model participates in training and updates the parameters of the pre-training model V2 while the Prior model is gradient-frozen and does not participate in parameter updates;
an intermediate generation sub-module 502, configured to generate SMILES expressions of molecules with the Agent model, screen those satisfying a set condition, and stop generation when their number reaches a set threshold; the same number of SMILES expressions is then generated with the Prior model;
a loss construction sub-module 503, configured to pool all generated SMILES expressions, feed them to the Agent model and the Prior model to obtain the output of each, and construct a loss function from the Agent model's output and the Prior model's output;
a parameter updating sub-module 504, configured to average the loss values and update the parameters of the pre-training model V2 by back propagation;
a third model saving sub-module 505, configured to save the current model as the pre-training model V3 when the loss value of the pre-training model V2 stabilizes during training;
in the loss construction sub-module 503, the loss function loss is constructed from the output of the Agent model and the output of the Prior model, its two terms being the loss value of the SMILES expression calculated by the Agent model and the loss value of the SMILES expression calculated by the Prior model, respectively.
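The parameter handling of this module (both models initialized from the pre-training model V2, the Prior gradient-frozen, and the Agent updated by back propagation on the averaged loss) can be sketched as follows; the parameter vector, gradient stand-in, and learning rate are hypothetical illustrations, not the patent's implementation:

```python
import numpy as np

# Both models start from the pre-training model V2's parameters.
v2_params = np.array([0.5, -1.2, 3.0])
agent_params = v2_params.copy()   # participates in training
prior_params = v2_params.copy()   # gradient-frozen: never updated below

def update_agent(params, per_sample_losses, grads, lr=0.1):
    """Average the per-sample loss values, then take one toy gradient step.

    `grads` stands in for the back-propagated gradient of the mean loss;
    a real system would compute it automatically (e.g. with autograd).
    """
    mean_loss = np.mean(per_sample_losses)
    return params - lr * mean_loss * grads, mean_loss

agent_params, mean_loss = update_agent(
    agent_params, per_sample_losses=[2.0, 4.0], grads=np.array([1.0, 0.0, -1.0])
)
```

Only `agent_params` moves; `prior_params` remains the unchanged reference distribution, which is what keeps the fine-tuned model anchored to the pre-trained one.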
It should be noted that, because the content of information interaction and execution process between the modules of the above-mentioned apparatus is based on the same concept as the method embodiment in embodiment 1 of the present application, the technical effects brought by the content are the same as the method embodiment of the present application, and specific content can be referred to the description in the foregoing illustrated method embodiment of the present application, which is not repeated herein.
Example 3
Embodiment 3 of the present invention provides a non-transitory computer-readable storage medium having stored therein program code of a reinforcement learning-based single-system molecule generation method, the program code including instructions for performing the reinforcement learning-based single-system molecule generation method of embodiment 1 or any possible implementation thereof.
Computer readable storage media can be any available media that can be accessed by a computer, or data storage devices such as servers and data centers that contain an integration of one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state disk (SSD)), etc.
Example 4
Embodiment 4 of the present invention provides an electronic device, including: a memory and a processor;
the processor and the memory communicate with each other through a bus; the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform the reinforcement learning based single-system molecular generation method of embodiment 1 or any possible implementation thereof.
Specifically, the processor may be implemented by hardware or software, and when implemented by hardware, the processor may be a logic circuit, an integrated circuit, or the like; when implemented in software, the processor may be a general-purpose processor, implemented by reading software code stored in a memory, which may be integrated in the processor, or may reside outside the processor, and which may reside separately.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave).
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented with a general purpose computing device; they may be concentrated on a single computing device or distributed across a network of computing devices. Alternatively, they may be implemented in program code executable by computing devices, so that they may be stored in a memory device for execution by the computing devices; in some cases the steps shown or described may be performed in a different order than shown or described. They may also be separately fabricated into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
While the invention has been described in detail in the foregoing general description and specific examples, it will be apparent to those skilled in the art that modifications and improvements can be made thereto. Accordingly, such modifications or improvements may be made without departing from the spirit of the invention and are intended to be within the scope of the invention as claimed.

Claims (10)

1. A method for generating a single-system molecule based on reinforcement learning, comprising:
collecting molecular expressions from a public database, and performing deduplication processing on the collected molecular expressions to obtain a molecular data set;
expanding the molecular data set in an atomic substitution mode to obtain an expanded data set, and performing deduplication processing on the expanded data set;
pre-training a Transformer model on the deduplicated expanded data set to obtain a pre-training model V1; and performing reinforcement learning processing on the pre-training model V1 to obtain a pre-training model V2;
and carrying out fine adjustment treatment on the pre-training model V2, quantitatively selecting molecules meeting the conditions in the fine adjustment treatment process to participate in the training of the pre-training model V2, obtaining a pre-training model V3 after the fine adjustment treatment, and carrying out new molecule generation of a single system through the pre-training model V3.
2. The reinforcement learning-based single-system molecular generation method according to claim 1, wherein, in the process of expanding the molecular data set by atomic substitution, Br atoms are used to replace H atoms bonded to C atoms in the SMILES molecular expressions in the molecular data set.
3. The method for generating a single-system molecule based on reinforcement learning of claim 1, wherein the step of pre-training the Transformer model on the deduplicated expanded data set to obtain the pre-training model V1 comprises:
encoding the SMILES molecular expressions in the molecular data set as a matrix;
inputting the encoding matrix into the Transformer model to obtain the molecular encoding output;
calculating the cross-entropy loss value between the molecular encoding output and the correct SMILES molecular expression, and updating the parameters of the Transformer model by back propagation;
when the loss value of the Transformer model stabilizes after several rounds of training, saving the current Transformer model as the pre-training model V1.
4. The method for generating a single-system molecule based on reinforcement learning as claimed in claim 3, wherein the step of performing reinforcement learning processing on said pre-training model V1 to obtain a pre-training model V2 comprises:
generating the SMILES expressions of the current batch of molecules using the pre-training model V1;
evaluating and scoring the SMILES expressions of the current batch against a set scoring standard;
training the weights of the pre-training model V1 with the evaluation score as the reward of the pre-training model V1;
after several rounds of iterative training, saving the final round's pre-training model V1 as the pre-training model V2.
5. The method for molecular generation of a reinforcement learning-based single system according to claim 4, wherein the set scoring standard score is:
score = similarity, when the generated SMILES expression is valid; score = 0, when it is invalid;
wherein similarity represents the similarity of the generated molecule's SMILES expression to the molecules in the single system;
the cross-entropy loss value L between the molecular encoding output and the correct SMILES molecular expression is calculated as: L = -Σ_i y_i·log(p_i), where y_i is the one-hot encoding of the correct token i and p_i is the model's predicted probability for that token.
6. The reinforcement learning-based single-system molecular generation method according to claim 4, wherein the step of performing fine tuning processing on the pre-training model V2 comprises:
assigning the parameters of the pre-training model V2 to an Agent model and a Prior model respectively, so that the Agent model participates in training and updates the parameters of the pre-training model V2 while the Prior model is gradient-frozen and does not participate in parameter updates;
generating SMILES expressions of molecules with the Agent model, screening those satisfying a set condition, and stopping generation when their number reaches a set threshold; generating the same number of SMILES expressions with the Prior model;
pooling all generated SMILES expressions and feeding them to the Agent model and the Prior model to obtain the output of each, and constructing a loss function from the Agent model's output and the Prior model's output;
averaging the loss values, updating the parameters of the pre-training model V2 by back propagation, and saving the current model as the pre-training model V3 when the loss value of the pre-training model V2 stabilizes during training.
7. The method for generating a single-system molecule based on reinforcement learning of claim 6, wherein the loss function loss is constructed from the output of the Agent model and the output of the Prior model,
its two terms being the loss value of the SMILES expression calculated by the Agent model and the loss value of the SMILES expression calculated by the Prior model, respectively.
8. A single-system molecular generation device based on reinforcement learning, comprising:
the original data acquisition module is used for collecting the molecular expressions from the public database and carrying out de-duplication processing on the collected molecular expressions to obtain a molecular data set;
the data expansion module is used for expanding the molecular data set in an atomic replacement mode to obtain an expanded data set, and performing deduplication processing on the expanded data set;
the first model training module is used for pre-training the Transformer model on the deduplicated expanded data set to obtain a pre-training model V1;
the second model training module is used for performing reinforcement learning treatment on the pre-training model V1 to obtain a pre-training model V2;
the third model training module is used for carrying out fine adjustment processing on the pre-training model V2, and quantitatively selecting molecules meeting the conditions to participate in the training of the pre-training model V2 in the fine adjustment processing process to obtain a pre-training model V3 after the fine adjustment processing;
and the molecule generation module is used for generating new molecules of a single system through the pre-training model V3.
9. The reinforcement learning based single system molecular generation device of claim 8, wherein, in the raw data acquisition module, Br atoms are used to replace H atoms bonded to C atoms in the SMILES molecular expressions in the molecular data set;
the first model training module includes:
the encoding processing sub-module is used for encoding the SMILES molecular expressions in the molecular data set into a matrix;
the encoding output sub-module is used for inputting the encoding matrix into the Transformer model and obtaining the molecular encoding output;
the loss value calculation sub-module is used for calculating the cross-entropy loss value between the molecular encoding output and the correct SMILES molecular expression, and for updating the parameters of the Transformer model by back propagation;
the first model saving sub-module is used for saving the current Transformer model as the pre-training model V1 when its loss value stabilizes after several rounds of training;
the second model training module includes:
the expression generation sub-module is used for generating the SMILES expressions of the current batch of molecules using the pre-training model V1;
the expression scoring sub-module is used for evaluating and scoring the generated SMILES expressions of the current batch against the set scoring standard;
the reward training sub-module is used for training the weights of the pre-training model V1 with the evaluation score as the reward of the pre-training model V1;
the second model saving sub-module is used for saving the final round's pre-training model V1 as the pre-training model V2 after several rounds of iterative training;
in the expression scoring sub-module, the set scoring standard score is:
score = similarity, when the generated SMILES expression is valid; score = 0, when it is invalid;
wherein similarity represents the similarity of the generated molecule's SMILES expression to the molecules in the single system;
the cross-entropy loss value L between the molecular encoding output and the correct SMILES molecular expression is calculated as: L = -Σ_i y_i·log(p_i), where y_i is the one-hot encoding of the correct token i and p_i is the model's predicted probability for that token.
10. The reinforcement learning based single-system molecular generation device of claim 9, wherein the third model training module comprises:
the model assignment sub-module is used for assigning the parameters of the pre-training model V2 to an Agent model and a Prior model respectively, so that the Agent model participates in training and updates the parameters of the pre-training model V2 while the Prior model is gradient-frozen and does not participate in parameter updates;
the intermediate generation sub-module is used for generating SMILES expressions of molecules with the Agent model, screening those satisfying a set condition, and stopping generation when their number reaches a set threshold; the same number of SMILES expressions is then generated with the Prior model;
the loss construction sub-module is used for pooling all generated SMILES expressions, feeding them to the Agent model and the Prior model to obtain the output of each, and constructing a loss function from the Agent model's output and the Prior model's output;
the parameter updating sub-module is used for averaging the loss values and updating the parameters of the pre-training model V2 by back propagation;
the third model saving sub-module is used for saving the current model as the pre-training model V3 when the loss value of the pre-training model V2 stabilizes during training;
in the loss construction sub-module, the loss function loss is constructed from the output of the Agent model and the output of the Prior model, its two terms being the loss value of the SMILES expression calculated by the Agent model and the loss value of the SMILES expression calculated by the Prior model, respectively.
CN202410077808.3A 2024-01-19 2024-01-19 Method and device for generating molecules of single system based on reinforcement learning Active CN117594157B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410077808.3A CN117594157B (en) 2024-01-19 2024-01-19 Method and device for generating molecules of single system based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410077808.3A CN117594157B (en) 2024-01-19 2024-01-19 Method and device for generating molecules of single system based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN117594157A true CN117594157A (en) 2024-02-23
CN117594157B CN117594157B (en) 2024-04-09

Family

ID=89920519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410077808.3A Active CN117594157B (en) 2024-01-19 2024-01-19 Method and device for generating molecules of single system based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN117594157B (en)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111508568A (en) * 2020-04-20 2020-08-07 腾讯科技(深圳)有限公司 Molecule generation method and device, computer readable storage medium and terminal equipment
CN113707233A (en) * 2021-07-16 2021-11-26 内蒙合成化工研究所 Energetic compound molecular structure generation method based on deep reinforcement learning
US11263534B1 (en) * 2020-12-16 2022-03-01 Ro5 Inc. System and method for molecular reconstruction and probability distributions using a 3D variational-conditioned generative adversarial network
WO2022047677A1 (en) * 2020-09-02 2022-03-10 深圳晶泰科技有限公司 Drug molecule screening method and system
CN114171125A (en) * 2021-12-02 2022-03-11 中山大学 Protein degradation targeting chimera conjugate generation method based on deep reinforcement learning
CN114187978A (en) * 2021-11-24 2022-03-15 中山大学 Compound optimization method based on deep learning connection fragment
CN114627981A (en) * 2020-12-14 2022-06-14 阿里巴巴集团控股有限公司 Method and apparatus for generating molecular structure of compound, and nonvolatile storage medium
US20220351808A1 (en) * 2021-04-29 2022-11-03 Uchicago Argonne, Llc Systems and methods for reinforcement learning molecular modeling
CN115565622A (en) * 2022-09-06 2023-01-03 中国海洋大学 Marine compound molecule generation method based on deep learning and chemical reaction rules
CN115831261A (en) * 2022-11-14 2023-03-21 浙江大学杭州国际科创中心 Three-dimensional space molecule generation method and device based on multi-task pre-training inverse reinforcement learning
EP4152336A1 (en) * 2021-09-17 2023-03-22 TotalEnergies OneTech Method and computing system for molecular design via multi-task reinforcement learning
US20230290114A1 (en) * 2020-12-16 2023-09-14 Ro5 Inc. System and method for pharmacophore-conditioned generation of molecules
CN117153294A (en) * 2023-10-31 2023-12-01 烟台国工智能科技有限公司 Molecular generation method of single system
WO2023246834A1 (en) * 2022-06-24 2023-12-28 King Abdullah University Of Science And Technology Reinforcement learning (rl) for protein design
CN117334271A (en) * 2023-09-25 2024-01-02 江苏运动健康研究院 Method for generating molecules based on specified attributes
US20240021275A1 (en) * 2020-11-13 2024-01-18 Osmo Labs, Pbc Machine-learned models for sensory property prediction

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JIAYI FAN et al.: "Validity Improvement in MolGAN-Based Molecular Generation", IEEE Access, 2 June 2023 (2023-06-02), pages 58359-58366 *
XUHAN LIU et al.: "DrugEx v3: scaffold-constrained drug design with graph transformer-based reinforcement learning", Journal of Cheminformatics, 20 February 2023 (2023-02-20), pages 1-14 *
CHENG Kaiyang et al.: "Design of SOS1 Inhibitor Derivatives Based on a Molecular Generation Model", Computer Era, no. 11, 30 November 2023 (2023-11-30), pages 94-99 *

Also Published As

Publication number Publication date
CN117594157B (en) 2024-04-09

Similar Documents

Publication Publication Date Title
US11544474B2 (en) Generation of text from structured data
CN111105029B (en) Neural network generation method, generation device and electronic equipment
CN113535984A (en) Attention mechanism-based knowledge graph relation prediction method and device
US11544542B2 (en) Computing device and method
CN113312505B (en) Cross-modal retrieval method and system based on discrete online hash learning
CN116959613B (en) Compound inverse synthesis method and device based on quantum mechanical descriptor information
US20220392585A1 (en) Method for training compound property prediction model, device and storage medium
WO2024067373A1 (en) Data processing method and related apparatus
Wiegrebe et al. Deep learning for survival analysis: a review
CN110009048B (en) Method and equipment for constructing neural network model
CN115982480A (en) Sequence recommendation method and system based on cooperative attention network and comparative learning
CN117153294A (en) Molecular generation method of single system
CN114613450A (en) Method and device for predicting property of drug molecule, storage medium and computer equipment
Zhao et al. KuaiSim: A comprehensive simulator for recommender systems
Huai et al. Latency-constrained DNN architecture learning for edge systems using zerorized batch normalization
CN114463596A (en) Small sample image identification method, device and equipment of hypergraph neural network
CN117594157B (en) Method and device for generating molecules of single system based on reinforcement learning
CN117193772A (en) Deep learning code-free application layout optimization method and system based on user feedback
CN115881209B (en) RNA secondary structure prediction processing method and device
CN115080587B (en) Electronic component replacement method, device and medium based on knowledge graph
CN113779994B (en) Element extraction method, element extraction device, computer equipment and storage medium
Ma et al. CRBP-HFEF: prediction of RBP-Binding sites on circRNAs based on hierarchical feature expansion and fusion
CN114595641A (en) Method and system for solving combined optimization problem
CN111259176B (en) Cross-modal Hash retrieval method based on matrix decomposition and integrated with supervision information
Ouyang et al. Grarep++: flexible learning graph representations with weighted global structural information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant