CN117334271A

CN117334271A - Method for generating molecules based on specified attributes

Info

Publication number: CN117334271A
Application number: CN202311238924.0A
Authority: CN
Inventors: 顾忠泽; 于文龙; 丁彦
Original assignee: Jiangsu Institute Of Sports Health
Current assignee: Jiangsu Institute Of Sports Health
Priority date: 2023-09-25
Filing date: 2023-09-25
Publication date: 2024-01-02

Abstract

The invention relates to the technical field of drug molecule generation models, and in particular to a method for generating molecules based on specified attributes. The method comprises: S1, collecting drug small-molecule data and preparing a data set for the molecular generation task; S2, establishing a molecular generation model based on pre-training and model fine-tuning; S3, fine-tuning the molecular generation model with the data set prepared in S1 to obtain two different models suited to the downstream molecular generation tasks; S4, adjusting the DPMG model structure for each downstream task, namely molecular attribute prediction and generation of target molecules with specified attributes; S5, screening the obtained molecules; S6, providing indexes for evaluating molecular generation and quantifying the quality of the model. The invention introduces a pre-trained model into drug molecule generation and, by fine-tuning that model, brings deep learning into the field of molecular generation, thereby markedly accelerating drug development.

Description

Method for generating molecules based on specified attributes

Technical Field

The invention relates to the technical field of drug molecule generation models, in particular to a method for generating molecules based on specified attributes.

Background

Over the past few decades, computer science has gradually been integrated into drug development, moving from simple data entry to assisting drug design, which gave rise to computer-aided drug design (CADD). Although CADD performs well on certain tasks, such as virtual screening of drug molecules, challenges remain in the design and optimization of drug molecules; as the technology has progressed, artificial intelligence has become the preferred solution to this problem.

AI-driven drug development builds on pharmaceutical big data. By using AI techniques such as machine learning and deep learning in place of a large number of experiments, it rapidly analyzes the structure, efficacy and other characteristics of drugs, so that new drugs can be developed in less time and at lower cost. Compared with traditional computer-aided drug design, AI technology can rapidly identify drug targets, match suitable molecules from a database, design and synthesize compounds, and predict the metabolic and physicochemical properties of drugs, thereby greatly shortening development time, reducing development cost and improving the success rate. Against the broader background of AI development, the emergence and application of pre-trained models shortens this process further.

Labeled data sets, algorithm models and computing power are indispensable components of AI-driven drug development, and they are also the source of the main challenges currently facing molecular generation:

Data: high-quality data is hard to obtain, and this is a significant constraint. The data sources of drug development enterprises can be divided into public and non-public data. Public data is easy to obtain, but its quality is difficult to guarantee, so models built on it are insufficiently reliable. Non-public data is mainly accumulated from the past projects of each pharmaceutical company; its accuracy is high, but it is extremely difficult to obtain because it is a core asset of the company.

Algorithms: the algorithm must be closely matched to the application scenario. The strengths of an algorithm model in AI drug development show up in the accuracy of its results, computation speed, model size, generalization performance and so on. Different pre-trained models emphasize different aspects and therefore have different strengths, so a pre-trained model with the appropriate strengths must be chosen for the specific task requirements and application scenario.

Computing power: fine-tuning a model may require significant computing resources, especially when the model architecture and parameters need to be adjusted. This may limit the practical application of the method.

At present, AI drug research and development in China is mainly applied in the drug discovery and preclinical research stages. Limited by the inherent complexity of biological systems and the heterogeneity of diseases, AI technology has not yet brought revolutionary changes to the efficiency and success rate of drug development, and the field as a whole is still at an exploratory stage. In the future, as algorithms are updated, computing power advances and big data develops, AI technology will be applied in depth to every stage of new drug development and will play an increasingly important role in compound synthesis, efficacy prediction, automated development and other stages.

Disclosure of Invention

The invention aims to solve the problems described in the background and provides a method for generating molecules based on specified attributes. The method introduces a pre-trained model into drug molecule generation and, by fine-tuning that model, brings deep learning into the field of molecular generation, thereby markedly accelerating drug development.

The technical scheme of the invention is a method for generating molecules based on specified attributes, comprising the following specific steps:

S1, collecting drug small-molecule data and preparing a data set for the molecular generation task;

S2, establishing a molecular generation model based on pre-training and model fine-tuning;

S3, fine-tuning the molecular generation model with the data set prepared in S1 to obtain two different models suited to the downstream molecular generation tasks;

S4, adjusting the DPMG model structure for each downstream task, namely molecular attribute prediction and generation of target molecules with specified attributes;

S5, screening the obtained molecules;

S6, providing indexes for evaluating molecular generation and quantifying the quality of the model.

The small-molecule structure can be represented either as a linear SMILES string or as a two-dimensional undirected cyclic graph;

the drug small-molecule data collected in S1 comprises public data and experimentally produced data, and the collected data reaches a scale of tens of millions of entries;

the collected drug molecule data is organized into a table containing each molecule's SMILES expression, lipid partition coefficient, drug affinity and synthetic accessibility, and the RDKit toolkit is used to screen for valid molecules.

Preferably, the invention uses SMILES in text form as the input data of the model;

the properties collected alongside the SMILES expression of each molecule also include solubility and permeability estimates, molecular mass, topological polar surface area, the numbers of hydrogen-bond donors and acceptors, the number of structural alerts, and the number of rotatable bonds;

wherein molecules are generated according to three molecular attributes, namely the lipid partition coefficient, the quantitative estimate of drug-likeness and the synthetic accessibility, the target being molecules whose values for these three attributes equal the specified values;

or one or more of the six attributes of solubility and permeability estimates, molecular mass, topological polar surface area, numbers of hydrogen-bond donors and acceptors, number of structural alerts, and number of rotatable bonds are specified in combination.

Preferably, because the model must generate drug molecules conditioned on the specified attributes, the neural network needs to recognize the relationships among the attributes, the molecular structures and the SMILES text. Therefore, in S1 different data sets are prepared for different attributes of the same molecular structure, ensuring that the neural network can learn the associated information.

In the pre-training process of S3, text data is encoded in a single shared semantic space using 12 Transformer layers, a hidden size of 768 and 12 attention heads;

wherein the text sequence information and the molecular structure information are learned by specifying the direction of attention (the attention arrows).

In S4, in the molecular attribute prediction process, once molecule generation is finished the generated molecules are input into the DPMG model, which predicts the remaining attributes of each molecule so that all of its attributes are completed; the result is compared with the previously specified attributes, and the prediction deviation is calculated;

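in the molecular attribute prediction process, a regression model is used as the output stage to obtain a numerical value, and the mean square error (MSE) is used as the loss function;

for the mean square error loss, the general formula is:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

in the above formula, $y_i$ is the true measured value and $\hat{y}_i$ is the model's predicted value; letting $\hat{y}_i = a x_i + b$ in the regression model, the loss function is expressed as:

$$L(a, b) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (a x_i + b)\right)^2$$

the training objective is to find the values of $a$ and $b$ that minimize this function:

$$(a^{*}, b^{*}) = \arg\min_{a,\,b}\ \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (a x_i + b)\right)^2$$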
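As a worked illustration of the mean-square-error objective, the following minimal Python sketch computes the MSE and finds the minimizing values of a and b in closed form. It assumes the simple linear form y_hat = a * x + b for a single input variable; the function names `mse` and `fit_line` and the sample data are illustrative only, and in the actual model the regression stage sits on top of the Transformer encoding and is trained by back propagation rather than solved in closed form.

```python
from statistics import fmean


def mse(y_true, y_pred):
    """Mean squared error between measured values and predictions."""
    return fmean((y - p) ** 2 for y, p in zip(y_true, y_pred))


def fit_line(xs, ys):
    """Closed-form least-squares estimate of (a, b) for y_hat = a * x + b."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = y_bar - a * x_bar
    return a, b


# Illustrative data lying exactly on y = 2x + 1 (not taken from the patent).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

a, b = fit_line(xs, ys)
loss = mse(ys, [a * x + b for x in xs])
print(a, b, loss)
```

Because the sample points lie exactly on a line, the fitted parameters reproduce that line and the loss at the minimum is zero; with noisy attribute measurements the minimum would be positive.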

In S4, in the process of generating target molecules with specified attributes, drug molecules meeting the requirements are generated according to the given attribute parameters;

the numerical values of the molecular attributes are input into an encoder, and the word embeddings produced by the encoder are input into a decoder to obtain an output SMILES string;

teacher forcing is introduced in the output process: each time an atom is generated, the SMILES fragment produced so far is compared with the correct reference SMILES of the molecule, thereby carrying out the fine-tuning of the model;

after each atom is compared with the reference SMILES, the reference SMILES is used as the precondition for generating the next atom; for every atom generated, a loss value is calculated from the comparison and the parameters are fine-tuned through back propagation.
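The step-by-step comparison against the reference SMILES described above can be sketched as follows. This is a minimal illustration, assuming character-level tokens and a `[bos]` start token; the function name `teacher_forcing_pairs` is hypothetical, and the loss computation and back propagation performed at each step are omitted.

```python
def teacher_forcing_pairs(reference_tokens, bos="[bos]"):
    """Build (prefix, target) training pairs from a reference SMILES.

    At every step the prefix is taken from the reference sequence, not from
    whatever the model predicted, so one wrong prediction does not corrupt
    the context used for the following steps.
    """
    prefix = [bos]
    pairs = []
    for target in reference_tokens:
        pairs.append((list(prefix), target))
        prefix.append(target)  # feed the reference token forward
    return pairs


# Ethanol as an example reference, split into single characters.
reference = list("CCO")
pairs = teacher_forcing_pairs(reference)
for context, target in pairs:
    print(context, "->", target)
```

Each pair corresponds to one generation step: the model sees the context, predicts the next token, and the prediction is compared with the target to obtain that step's loss.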

Binary cross entropy is adopted as the loss function when calculating the loss:

$$\ell(\theta) = \sum_{i}\left[y_i \log\theta + (1 - y_i)\log(1 - \theta)\right]$$

the data set takes the form $\mathrm{data} = (x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4), \ldots$, wherein $x_i$ is the input variable, i.e. a character generated by the model, and $y_i$ is the observed value, i.e. the expected output of the model. Here $y$ takes the value 0 or 1, and the probability that $y = 1$ is $\theta$, i.e. $P_\theta(y = 1) = \theta$; the log-likelihood of the observed data points can then be expressed by the above equation, where the likelihood function $\ell(\theta)$ is the objective function;

if a negative sign is added in front of it, it becomes a loss function, namely the cross entropy between $y_i$ and $\theta$;

the loss function in cross-entropy form for a single sample is:

$$\mathrm{Loss} = -\left[y_i \log p + (1 - y_i)\log(1 - p)\right]$$

where $y_i$ is the observed value of the $i$-th sample and $p$ is the predicted probability.
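The single-sample cross-entropy loss can be evaluated directly, as in the following minimal sketch. The clipping constant `eps` and the function names are assumptions added for numerical safety and illustration; they are not part of the described method.

```python
import math


def binary_cross_entropy(y, p, eps=1e-12):
    """Loss = -[y * log(p) + (1 - y) * log(1 - p)] for one sample.

    y is the observed label (0 or 1) and p the predicted probability of y = 1.
    p is clipped away from 0 and 1 so that log() is always defined.
    """
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def mean_binary_cross_entropy(ys, ps):
    """Average the single-sample loss over a batch of samples."""
    return sum(binary_cross_entropy(y, p) for y, p in zip(ys, ps)) / len(ys)


# A confident correct prediction costs less than a confident wrong one.
print(binary_cross_entropy(1, 0.9))
print(binary_cross_entropy(1, 0.1))
```

The loss approaches zero as the predicted probability approaches the observed label and grows without bound as the prediction moves toward the opposite label, which is what drives the parameter updates.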

Preferably, in S5, the generated molecules are preliminarily filtered based on QED and SAscore to ensure that they have basic drug-like properties.

Preferably, in S6, the drug molecules are scored using MOSES and GuacaMol, and the molecular structures are screened accordingly;

wherein,

validity (rationality) of generated molecules: the validity of a molecule refers to whether its structure, properties and design are consistent with the expected biological activity or pharmaceutical properties, whether it obeys the basic laws of chemistry, biology and pharmacology, and whether it corresponds to a real molecule; if the molecular structure is given as SMILES, the MolFromSmiles method of the RDKit toolkit is used to check whether it can be converted from SMILES format into an RDKit Mol object, and if so, it is a valid molecule.

Compared with the prior art, the invention has the following beneficial technical effects:

1. The invention generates drug molecules directly from the required attributes; it is simple and clear, and easy to understand and operate.

2. In traditional methods, the screening and optimization stages require many rounds of experiments and computational simulation. By adding, after generation is complete, a small neural network that screens according to specified conditions, the invention accelerates the screening process and reduces the workload of simulation and experiment.

3. Pre-training targets massive unlabeled data, so it demands a large number of model parameters and increases the resources and time needed for model training. By fine-tuning an already trained pre-trained model to adapt it to downstream tasks, the invention reduces the computational cost of training a large model.

Drawings

FIG. 1 is a schematic diagram of a frame structure of a model of the present invention;

FIG. 2 is a schematic diagram of the pre-training of the model of the present invention;

FIG. 3 is a schematic representation of text data encoding of the model of the present invention;

FIG. 4 shows the fine-tuning model for the molecular property prediction task of the model of the present invention;

FIG. 5 shows the fine-tuning model for the task of generating target molecules with specified properties of the model of the present invention.

Detailed Description

Example 1

The invention provides a method for generating molecules based on specified attributes, which comprises the following specific steps:

1. Collecting drug small-molecule data and preparing a data set for the molecular generation task;

A table is built that contains the SMILES structure text of each molecule and its molecular attribute values (SAS, QED, etc.). The data obtained are shown in the following table:

Split	Data volume
Training set	207493
Test set	50000

The attributes involved in the data set of the present model include the lipid partition coefficient (LogP), the quantitative estimate of drug-likeness (QED) and the synthetic accessibility score (SAS); the goal is to generate molecules whose values for these three attributes equal the specified values. Other molecular properties may also be specified, such as solubility and permeability estimates (Lipinski), molecular weight (MW), topological polar surface area (TPSA), numbers of hydrogen-bond donors and acceptors (HBD and HBA), number of structural alerts (ALERT), and number of rotatable bonds (ROTB).
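The Lipinski-based solubility and permeability estimate listed above is commonly evaluated as a count of rule-of-five violations. The following minimal sketch assumes the widely used thresholds (molecular weight at most 500, LogP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors); the text does not state how this attribute is computed in the data set, so the function and the example inputs are illustrative only.

```python
def lipinski_violations(mw, logp, hbd, hba):
    """Count how many of the four rule-of-five criteria a molecule breaks.

    mw:   molecular weight in daltons
    logp: lipid partition coefficient
    hbd:  number of hydrogen-bond donors
    hba:  number of hydrogen-bond acceptors
    """
    checks = [mw > 500, logp > 5, hbd > 5, hba > 10]
    return sum(checks)


# Hypothetical attribute values, chosen only to exercise the function.
print(lipinski_violations(mw=320.4, logp=2.1, hbd=2, hba=5))
print(lipinski_violations(mw=650.0, logp=6.1, hbd=2, hba=12))
```

A molecule with zero or one violation is usually regarded as acceptable under this rule, so the count can be used either as a specified attribute or as a simple filter.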

It should be noted that although the present model implements two different downstream tasks through fine-tuning, the underlying fine-tuning data set is the same and differs only in how the data are organized: for molecular attribute prediction, each entry pairs a molecular SMILES with its corresponding attributes, whereas for molecular generation, each entry pairs a given set of attributes with the corresponding molecular SMILES.
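The two ways of organizing the same fine-tuning data can be sketched as follows. The record layout, the field names and the numeric values are placeholders chosen for illustration, not entries from the actual data set.

```python
ATTRIBUTES = ("logp", "qed", "sas")

# Placeholder records; the attribute values are made up for illustration.
records = [
    {"smiles": "CCO", "logp": 0.1, "qed": 0.4, "sas": 2.0},
    {"smiles": "c1ccccc1", "logp": 1.7, "qed": 0.4, "sas": 1.0},
]


def to_prediction_example(record, attributes=ATTRIBUTES):
    """Attribute prediction: SMILES is the input, attribute values the label."""
    return record["smiles"], [record[name] for name in attributes]


def to_generation_example(record, attributes=ATTRIBUTES):
    """Conditional generation: attribute values are the input, SMILES the label."""
    return [record[name] for name in attributes], record["smiles"]


prediction_set = [to_prediction_example(r) for r in records]
generation_set = [to_generation_example(r) for r in records]
print(prediction_set[0])
print(generation_set[0])
```

Both sets are derived from the same records, so only the direction of the input-to-label mapping changes between the two downstream tasks.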

2. Establishing a molecular generation model based on pre-training and model fine-tuning; as shown in FIG. 1, the overall framework is divided into two steps, molecular generation and molecular screening, and involves three technical points: the pre-trained model, fine-tuning of the pre-trained model, and molecular screening.

3. Performing fine-tuning training on the molecular generation model using the data set prepared in step S1 to obtain a model suited to the downstream molecular generation task, as shown in FIG. 2. Computer simulation and analysis of small molecules greatly speeds up drug development, and characterizing and understanding molecules is an essential step toward this goal. Various molecular representations, such as molecular descriptors and fingerprints, have been proposed. Traditionally, these descriptors are designed by domain experts on the basis of chemical and pharmaceutical knowledge to represent molecules qualitatively or quantitatively. Various shallow machine learning models are used to derive quantitative structure-activity relationships (QSARs) and quantitative structure-property relationships (QSPRs) in order to predict the activity and properties of molecules. With the rise of deep learning and representation learning in recent years, automatically representing and understanding molecules by learning high-level features from low-level data has become an effective approach to molecular modeling, making it possible to feed raw molecules directly into subsequent molecular analysis.

For encoding and decoding, in the manner of sequence models used for text, such as the recurrent neural network (RNN) and the long short-term memory network (LSTM), a Transformer is adopted to process the SMILES character sequence and perform the encoding and decoding. The model uses 12 Transformer layers, a hidden size of 768 and 12 attention heads to encode the text data. To ensure that differing codebooks do not cause confusion in the generated structures during encoding and decoding, the encoding and decoding of the model are confined to the same semantic space.

As shown in FIG. 3, in the pre-trained model some of the attention arrows (the solid lines) are bidirectional, so that each position can be connected to information both before and after it, while the connections drawn with dotted lines are not bidirectional and realize contrastive learning between molecular structure and text information, that is, a causal model within the attention mechanism. The attention arrows of the model are modified, and the pre-trained model is then fine-tuned on the data set to obtain a model suited to the downstream molecular generation task.
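The difference between bidirectional and causal attention arrows is usually expressed as an attention mask. The following minimal sketch builds both kinds of mask for a short sequence; it assumes the common convention that entry [i][j] equals 1 when position i may attend to position j, and it does not reproduce how FIG. 3 assigns the two kinds of connection to particular tokens.

```python
def bidirectional_mask(n):
    """Every position may attend to every other position (solid arrows)."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Position i may attend only to positions j <= i (one-directional arrows)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def visible_positions(mask, i):
    """Indices that position i is allowed to attend to under the given mask."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


for row in causal_mask(4):
    print(row)
print(visible_positions(bidirectional_mask(4), 1))
print(visible_positions(causal_mask(4), 1))
```

Switching a model between the two downstream tasks by "modifying the attention arrows" then amounts to swapping one mask for the other in the relevant attention layers, under the assumptions stated above.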

4. Adjusting the DPMG model structure for each downstream task, namely molecular attribute prediction and generation of target molecules with specified attributes;

As shown in FIG. 5, the fine-tuning model for the task of generating target molecules with specified attributes is used: in this generation downstream task, the model generates molecules, and the generated molecules are scored in the molecular screening stage, the scoring criteria being the validity (Validity) and the novelty (Novelty) of the generated molecules.

During the execution of this downstream task, [bos] is selected as the start token, [seq] is the separator, and [seq-len] is used to control the text length of the generated molecular SMILES; the input format of the molecular attributes is: attribute 1 [seq] attribute 2 [seq] attribute 3. It should additionally be noted that once the model has been fine-tuned, no loss is needed to update the parameters, and each output of the SMILES is fed directly back into the model as input.
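The attribute input format described above can be assembled as in the following minimal sketch. The number formatting, the dictionary layout and the function names are assumptions made for illustration; the text specifies only the start token, the separator and the role of the length control.

```python
BOS = "[bos]"  # start token for the generated SMILES
SEP = "[seq]"  # separator between attribute values


def format_attributes(values, ndigits=2):
    """Join attribute values as: attribute 1 [seq] attribute 2 [seq] attribute 3."""
    return SEP.join(f"{value:.{ndigits}f}" for value in values)


def build_generation_input(values, seq_len):
    """Bundle the attribute text, the start token and the length limit."""
    return {
        "attribute_text": format_attributes(values),
        "start_token": BOS,
        "seq_len": seq_len,  # upper bound on the generated SMILES length
    }


# Hypothetical target values for LogP, QED and SAS.
request = build_generation_input([2.5, 0.8, 3.1], seq_len=64)
print(request["attribute_text"])
```

At inference time the decoder would start from the start token and append one token at a time, feeding each output back as input until the length limit is reached or generation ends.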

The test benchmarks for this task are:

Test index	Validity	Novelty	Attribute bias
Expected value	0.87	0.99	≤ 5%

As shown in FIG. 4, the fine-tuning model for the molecular property prediction task is used:

In this downstream task, fine-tuning is performed by changing the attention arrows and using differently labeled data sets. It should be noted that although both the molecular property prediction model and the molecular generation model are called DPMG, they are essentially two different models, each suited to its own downstream task, and they differ in both structure and parameters. The input of this downstream task is the drug molecule generated from the specified attributes in the previous downstream task.

In this downstream task, the Transformer acts as an encoder; the encoded vector that has passed through the Transformer layers is finally converted to text output by a machine-learning classification model, and during training the MSE loss is back-propagated to update the model parameters.

In this downstream task, the model predicts and completes the attributes of the generated molecules; a prediction deviation of the molecular attributes within 5% is regarded as qualified, and otherwise as unqualified.
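The 5% qualification check can be sketched as follows. The text does not define how the deviation is computed, so this example assumes a relative deviation with respect to the specified value; the function names and the example numbers are illustrative.

```python
def relative_deviation(predicted, specified):
    """Absolute difference between prediction and target, relative to the target."""
    if specified == 0:
        raise ValueError("relative deviation is undefined for a specified value of 0")
    return abs(predicted - specified) / abs(specified)


def check_attributes(predicted, specified, tolerance=0.05):
    """Return, per attribute, whether the prediction is within the tolerance."""
    return {
        name: relative_deviation(predicted[name], target) <= tolerance
        for name, target in specified.items()
    }


# Hypothetical specified and predicted values for one generated molecule.
specified = {"qed": 0.80, "sas": 3.0}
predicted = {"qed": 0.82, "sas": 3.5}
report = check_attributes(predicted, specified)
print(report)
```

A molecule would be regarded as qualified only if every specified attribute passes the check; other definitions of deviation, such as an absolute difference, could be substituted without changing the structure of the check.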

5. Screening the obtained molecules;

Preliminary filtering based on QED and SAscore ensures, at the initial stage of molecular generation, that the resulting molecules possess basic drug-like properties; similarity calculation based on molecular fingerprints helps eliminate structurally redundant molecules of low intellectual-property value. In this way the resulting drug molecules are preliminarily screened.
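The two screening stages described above can be sketched as a threshold filter followed by a fingerprint-similarity check. The thresholds, the use of Tanimoto similarity and the representation of a fingerprint as a set of on-bit indices are all assumptions made for illustration; the text does not give cut-off values or name a similarity measure.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def screen(candidates, min_qed=0.5, max_sas=4.0, max_similarity=0.9):
    """Keep drug-like candidates that are not near-duplicates of kept ones."""
    kept = []
    for mol in candidates:
        # Stage 1: basic drug-likeness (higher QED and lower SAscore are better).
        if mol["qed"] < min_qed or mol["sas"] > max_sas:
            continue
        # Stage 2: drop molecules too similar to one already kept.
        if any(tanimoto(mol["fp"], other["fp"]) >= max_similarity for other in kept):
            continue
        kept.append(mol)
    return kept


# Hypothetical candidates with made-up scores and fingerprints.
candidates = [
    {"name": "A", "qed": 0.7, "sas": 2.5, "fp": {1, 2, 3, 4}},
    {"name": "B", "qed": 0.8, "sas": 3.0, "fp": {1, 2, 3, 4}},
    {"name": "C", "qed": 0.3, "sas": 2.0, "fp": {5, 6}},
    {"name": "D", "qed": 0.6, "sas": 3.5, "fp": {7, 8, 9}},
]
kept_names = [mol["name"] for mol in screen(candidates)]
print(kept_names)
```

In this example B is dropped as structurally redundant with A and C is dropped for low drug-likeness; in practice the scores and fingerprints would come from a cheminformatics toolkit rather than being entered by hand.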

6. Providing indexes for evaluating molecular generation and quantifying the quality of the model

To screen the generated molecular structures for validity, the molecules generated by the neural network need to be scored by a small neural network. MOSES and GuacaMol are two mainstream tools for scoring generated drug molecules: the former emphasizes testing general drug-likeness indexes of the model's generated molecules, such as validity, novelty and scaffold diversity, while the latter evaluates the model's multi-objective optimization capability by defining a series of tasks.

Validity of generated molecules (Validity): the validity (rationality) of a molecule refers to whether its structure, properties and design are consistent with the expected biological activity or pharmaceutical properties, whether it obeys the basic laws of chemistry, biology and pharmacology, and whether it corresponds (at least theoretically) to a real molecule. If the molecular structure is given as SMILES, the MolFromSmiles method of the RDKit toolkit is usually used to check whether it can be converted from SMILES format into an RDKit Mol object; if so, it is a valid molecule.

Novelty of generated molecules (Novelty): the novelty of a molecule refers to whether its structure is unique among known compound libraries, or whether it is innovative. The way this index is calculated can be set manually according to task requirements.
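One common way to compute these indexes is sketched below. It assumes validity is the fraction of generated strings that parse, uniqueness is the fraction of valid strings that are distinct, and novelty is the fraction of distinct valid strings absent from the training set; since the text leaves the novelty calculation configurable, this is one possible choice rather than the definition used. The validity predicate is passed in as a function: in practice it would wrap the RDKit check (`Chem.MolFromSmiles(smiles) is not None`), while the stand-in used here only checks balanced parentheses so that the sketch runs without RDKit.

```python
def generation_metrics(generated, training_set, is_valid):
    """Validity, uniqueness and novelty of a list of generated SMILES strings."""
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


def toy_is_valid(smiles):
    """Stand-in validity check: non-empty string with balanced parentheses."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return bool(smiles) and depth == 0


generated = ["CCO", "CCO", "CC(C)O", "CC(C"]
training_set = {"CCO"}
metrics = generation_metrics(generated, training_set, toy_is_valid)
print(metrics)
```

Comparing strings directly assumes that all SMILES are written in a canonical form; otherwise two spellings of the same molecule would be counted as different.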

Table 1. Comparison of model ROC-AUC on molecular property prediction tasks:

Dataset	Logreg	KernelSVM	XGBoost	IRV	Multitask	GC	Weave	DPMG
HIV	0.702	0.792	0.756	0.737	0.698	0.763	0.703	0.798
BACE	0.781	0.862	0.85	0.838	0.698	0.763	0.703	0.872
BBBP	0.699	0.729	0.696	0.7	0.688	0.69	0.671	0.962
CLINTOX	0.722	0.669	0.799	0.77	0.778	0.807	0.832	0.984

Table 2. Comparison of model performance on the molecular generation task:

Model	Validity	Uniqueness	Novelty
JT-VAE	62%	100%	100%
GCPN	20%	99.97%	100%
MRNN	65%	99.89%	100%
GraphNVP	55%	94.80%	100%
GraphAF	68%	99.10%	100%
DPMG	85.28%	99.91%	100%

As can be seen from Tables 1 and 2 above, the model built according to the invention can generate molecules when only the values of one or more attributes are specified, and achieves high validity, uniqueness and novelty when generating molecules based on specified attributes.

The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited thereto, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.

Claims

1. A method for generating molecules based on specified properties, comprising the specific steps of:

s5, screening the obtained molecules;

2. The method of claim 1, wherein the representation of the small molecule structure comprises a linear input SMILES and a two-dimensional undirected cyclic graph;

3. The method of generating molecules based on specified attributes according to claim 2, wherein in S1, textual SMILES is used as data for the input model;

4. A method of generating molecules based on specified properties according to claim 1, wherein in S1 different datasets are created for different properties of the same molecular structure.

5. The method for generating molecules based on specified attributes according to claim 1, wherein in the pre-training process of S3, text data is encoded in the same semantic space using 12 Transformer layers, a hidden size of 768 and 12 attention heads;

6. The method for generating molecules based on the specified attributes according to claim 1, wherein in the step S4, after the molecule is generated in the molecular attribute prediction process, the generated molecules are input into the DPMG model as input, and the rest of the attributes of the molecules are predicted to complement all the attributes of the molecules; comparing the obtained result with the attribute given previously, and calculating the predicted deviation degree;

for the mean square error loss, the general formula is:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

7. The method for generating molecules based on specified attributes according to claim 1, wherein in S4, in the process of generating target molecules with specified attributes, drug molecules meeting the requirements are generated according to the given attribute parameters;

8. A method of generating molecules based on specified attributes according to claim 7, wherein binary cross entropy is used as the loss function when calculating the loss:

$$\ell(\theta) = \sum_{i}\left[y_i \log\theta + (1 - y_i)\log(1 - \theta)\right]$$

the data set takes the form $\mathrm{data} = (x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4), \ldots$, wherein $x_i$ is the input variable, i.e. a character generated by the model, and $y_i$ is the observed value, i.e. the expected output of the model; here $y$ takes the value 0 or 1, and the probability that $y = 1$ is $\theta$, i.e. $P_\theta(y = 1) = \theta$; the log-likelihood of the observed data points can then be expressed by the above equation, where the likelihood function $\ell(\theta)$ is the objective function;

the loss function in cross-entropy form for a single sample is:

$$\mathrm{Loss} = -\left[y_i \log p + (1 - y_i)\log(1 - p)\right]$$

9. The method of claim 1, wherein in S5 the generated molecules are preliminarily filtered for basic drug-like properties based on QED and SAscore.

10. The method of claim 1, wherein the molecular structure is selected by scoring the drug molecules using MOSES and GuacaMol in S6;

wherein,