CN117334271A - Method for generating molecules based on specified attributes - Google Patents
Method for generating molecules based on specified attributes
- Publication number
- CN117334271A (application CN202311238924.0A)
- Authority
- CN
- China
- Prior art keywords
- molecular
- model
- molecules
- data
- generation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/50—Molecular design, e.g. of drugs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention relates to the technical field of drug molecule generation models, and in particular to a method for generating molecules based on specified attributes. The method comprises: S1, collecting drug small-molecule data and preparing a dataset for the molecule generation task; S2, establishing a molecule generation model based on pre-training and model fine-tuning; S3, fine-tuning the molecule generation model on the dataset prepared in S1 to obtain two different models suited to the downstream molecule generation tasks; S4, adjusting the structure of the DPMG model for each downstream task, namely molecular attribute prediction and generation of target molecules with specified attributes; S5, screening the generated molecules; S6, providing indexes for evaluating molecule generation and quantifying model quality. The invention introduces a pre-trained model into drug molecule generation and, by fine-tuning that model, brings deep learning into the field of molecule generation, which markedly accelerates the drug development process.
Description
Technical Field
The invention relates to the technical field of drug molecule generation models, and in particular to a method for generating molecules based on specified attributes.
Background
Over the past few decades, computer science has gradually been integrated into drug development, progressing from simple data entry to assisting drug design, which gave rise to computer-aided drug design (CADD). Although CADD performs well on certain tasks, such as virtual screening of drug molecules, challenges remain in the design and optimization of drug molecules; as technology has progressed, artificial intelligence (AI) has emerged as the preferred solution to this problem.
AI-driven drug development builds on large-scale pharmaceutical data and uses AI techniques such as machine learning and deep learning to replace a large number of experiments, rapidly analyzing the structure, efficacy and other characteristics of drugs so that new drugs can be developed in less time and at lower cost. Compared with traditional computer-aided drug design, AI techniques can rapidly identify drug targets, match suitable molecules from a database, design and synthesize compounds, and predict drug metabolism and physicochemical properties, thereby greatly shortening development time, reducing cost and improving the success rate. The emergence and application of pre-trained models, against the broader background of AI development, shortens this process further.
Labeled datasets, algorithm models and computing power are indispensable components of AI-driven drug development, and they are also the sources of the major challenges currently facing molecule generation:
Data: high-quality data is hard to acquire, which is a significant constraint. The data available to drug development enterprises can be divided into public and non-public data. Public data is easy to obtain, but its quality is difficult to guarantee, so models built on it are not sufficiently reliable. Non-public data is mainly accumulated from each pharmaceutical company's previous projects; it is highly accurate, but because it is a core asset of the company it is extremely difficult to obtain.
Algorithms: an algorithm must be well matched to its application scenario. The strengths of an algorithm model in AI drug development show in the accuracy of its results, computation speed, model size, generalization performance and so on. Different pre-trained models emphasize different aspects and therefore have different strengths, so a pre-trained model with the appropriate strengths must be selected for the specific task requirements and application scenario.
Computing power: fine-tuning a model may require a significant amount of computing resources, especially when the model architecture and parameters need to be adjusted, which may limit the practical application of the method.
At present, AI drug development in China is mainly applied to the drug discovery and preclinical research stages. Limited by the inherent complexity of biological systems and the heterogeneity of diseases, AI technology has not yet brought revolutionary changes to the efficiency and success rate of drug development, and the field as a whole is still at an exploratory stage. In the future, with improved algorithms, breakthroughs in computing power and the growth of big data, AI technology is expected to be applied in depth to every stage of new drug development and to play an increasingly important role in compound synthesis, efficacy prediction, automated development and other stages.
Disclosure of Invention
The invention aims to solve the problems described in the Background and provides a method for generating molecules based on specified attributes. The method introduces a pre-trained model into drug molecule generation and, by fine-tuning that model, brings deep learning into the field of molecule generation, which markedly accelerates the drug development process.
The technical scheme of the invention is a method for generating molecules based on specified attributes, comprising the following specific steps:
S1, collecting drug small-molecule data and preparing a dataset for the molecule generation task;
S2, establishing a molecule generation model based on pre-training and model fine-tuning;
S3, fine-tuning the molecule generation model on the dataset prepared in S1 to obtain two different models suited to the downstream molecule generation tasks;
S4, adjusting the structure of the DPMG model for each downstream task, namely molecular attribute prediction and generation of target molecules with specified attributes;
S5, screening the generated molecules;
S6, providing indexes for evaluating molecule generation and quantifying model quality.
The representations of a small-molecule structure include a linear SMILES string and a two-dimensional undirected cyclic graph;
the drug small-molecule data collected in S1 comprises public data and experimentally produced data, and the collected data reaches the scale of tens of millions of entries;
the collected drug molecule data is organized into a table comprising each molecule's SMILES string, lipid partition coefficient, quantitative estimate of drug-likeness and synthetic accessibility, and valid molecules are selected using the RDKit toolkit.
Preferably, the invention uses SMILES in text form as the input data of the model;
the collected SMILES strings and the properties of the corresponding molecules also include solubility and permeability estimates, molecular mass, topological polar surface area, number of hydrogen bond donors and acceptors, number of structural alerts, and number of rotatable bonds;
wherein molecules are generated according to three molecular attributes, namely the lipid partition coefficient, the quantitative estimate of drug-likeness and the synthetic accessibility, the target being a molecule whose values of these three attributes equal the specified values;
or one or more of the six attributes (solubility and permeability estimates, molecular mass, topological polar surface area, number of hydrogen bond donors and acceptors, number of structural alerts, number of rotatable bonds) are specified in combination.
Preferably, the model needs to generate drug molecules conditioned on the specified attributes. To ensure that the neural network can recognize the relationships among the attributes, the molecular structures and the SMILES text, different datasets are prepared in S1 for different attributes of the same molecular structure, which ensures that the neural network can learn the associated information.
In the pre-training process of S3, 12 Transformer layers, a hidden size of 768 and 12 attention heads are used to encode the text data within a single shared semantic space;
wherein the text sequence information and the molecular structure information are learned by specifying the direction of attention.
In S4, in the molecular attribute prediction process, after molecule generation is finished the generated molecules are input into the DPMG model, which predicts the remaining attributes of each molecule so that all of its attributes are filled in; the result is compared with the attributes specified beforehand, and the degree of prediction deviation is calculated;
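in the molecular attribute prediction process, a regression model is used as the final stage to obtain a numerical value, and the mean squared error (MSE) is used as the loss function.
For the mean squared error loss, the general formula is:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
In the above formula, $y_i$ is the true measured value and $\hat{y}_i$ is the value predicted by the model. Letting $\hat{y}_i = a x_i + b$ in the regression model, the loss function is expressed as:
$$L(a, b) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (a x_i + b)\right)^2$$
The training objective is to find the values of $a$ and $b$ that minimize this function:
$$(a^{*}, b^{*}) = \arg\min_{a,\,b}\ \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (a x_i + b)\right)^2$$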
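As a minimal sketch of the mean-squared-error objective described here, the following Python computes the MSE and the closed-form least-squares values of a and b for a one-variable linear model. The function names and the toy data are illustrative assumptions; the regression stage described in the patent sits on top of a Transformer encoder and would be trained by gradient descent rather than solved in closed form.

```python
def mse(y_true, y_pred):
    """Mean squared error between measured and predicted values."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n


def fit_linear(xs, ys):
    """Closed-form least-squares a, b that minimise the MSE of a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Toy data lying exactly on y = 2x + 1 (illustrative, not from the patent).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_linear(xs, ys)
loss = mse(ys, [a * x + b for x in xs])
```

Because the toy points are collinear, the fitted line reproduces them and the loss is zero up to rounding.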
In S4, in the process of generating target molecules with specified attributes, drug molecules meeting the requirements are generated according to the given attribute parameters;
the numerical values of the molecular attributes are input into the encoder, and the word embeddings obtained from the encoder are input into the decoder to obtain the output SMILES string;
a teacher (teacher-forcing) mechanism is introduced in the output process: each time an atom is generated, the SMILES generated so far is compared with the correct reference SMILES of the molecule, thereby carrying out the fine-tuning of the pre-trained model;
after each atom has been compared with the reference SMILES, the reference SMILES is used as the condition for generating the atom in the next round; for each atom generated, a loss value is calculated from the comparison and the parameters are fine-tuned through back-propagation.
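The step-by-step comparison against a reference SMILES can be sketched as follows. This is a toy illustration under stated assumptions: tokens are single characters, `teacher_forcing_steps` and `count_mismatches` are hypothetical helper names, and a simple mismatch count stands in for the loss that a real decoder would back-propagate.

```python
def teacher_forcing_steps(reference_tokens, bos="[bos]"):
    """Yield (decoder_input_prefix, target_token) pairs.

    At every step the prefix is built from the reference tokens, never
    from the model's own earlier predictions.
    """
    prefix = [bos]
    for target in reference_tokens:
        yield list(prefix), target
        prefix.append(target)  # the reference token conditions the next step


def count_mismatches(predict_next, reference_tokens):
    """Count steps where a (hypothetical) predictor disagrees with the reference."""
    return sum(
        1
        for prefix, target in teacher_forcing_steps(reference_tokens)
        if predict_next(prefix) != target
    )


def always_carbon(prefix):
    """Toy stand-in for the decoder: predicts 'C' at every step."""
    return "C"


reference = ["C", "C", "O"]  # character-level tokens of the SMILES "CCO"
steps = list(teacher_forcing_steps(reference))
errors = count_mismatches(always_carbon, reference)
```

Even though the toy predictor is wrong on the last token, every prefix it sees is still the correct reference prefix, which is the defining property of this training scheme.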
When calculating the loss, binary cross entropy is used as the loss function:
$$l(\theta) = \sum_{i}\left[y_i \log\theta + (1 - y_i)\log(1 - \theta)\right]$$
The dataset is $\mathrm{data} = \{(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4), \ldots\}$, where $x_i$ is the input variable, i.e. a character generated by the model, and $y_i$ is the observed value, i.e. the expected output of the model. Here $y$ takes the value 0 or 1. When the probability of $y = 1$ is $\theta$, i.e. $P_\theta(y = 1) = \theta$, the log-likelihood of the observed data points is given by the above equation, where the likelihood function $l(\theta)$ is the objective function;
if a negative sign is placed in front of it, it becomes a loss function, namely the cross entropy between $y_i$ and $\theta$;
the cross-entropy loss for a single sample is:
$$\mathrm{Loss} = -\left[y_i \log p + (1 - y_i)\log(1 - p)\right]$$
where $y_i$ is the observation for the $i$-th sample and $p$ is the predicted probability.
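The single-sample cross-entropy formula can be checked numerically with a short Python sketch. The clamping constant `eps` is an added assumption to avoid taking the logarithm of zero; it is not part of the formula in the text.

```python
import math


def binary_cross_entropy(y, p, eps=1e-12):
    """Loss = -[y*log(p) + (1-y)*log(1-p)] for a label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clamp so that log() stays finite
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def mean_bce(ys, ps):
    """Average the single-sample loss over a batch."""
    return sum(binary_cross_entropy(y, p) for y, p in zip(ys, ps)) / len(ys)


loss_confident = binary_cross_entropy(1, 0.9)  # correct and confident
loss_wrong = binary_cross_entropy(1, 0.1)      # confident but wrong
```

A confident correct prediction gives a small loss, while a confident wrong prediction is penalised heavily, which is the behaviour that drives the parameter updates.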
Preferably, in S5, the generated molecules are preliminarily filtered for basic drug-like properties based on QED and SAscore.
Preferably, in S6, the generated drug molecules are scored using MOSES and GuacaMol, and the molecular structures are screened;
wherein,
validity of generated molecules: the validity (rationality) of a molecule refers to whether the structure, properties and design of the molecule are consistent with the expected biological activity or pharmaceutical properties, whether the basic laws of chemistry, biology and pharmacology are met, and whether the molecule corresponds to a real molecule; if the molecular structure is given as SMILES, the MolFromSmiles method of the RDKit toolkit is used to check whether it can be converted from SMILES format into an RDKit molecule (rdmol) object, and if so, it is a valid molecule.
Compared with the prior art, the invention has the following beneficial technical effects:
1. The invention generates drug molecules directly from the required attributes, which is simple and clear and easy to understand and operate.
2. In traditional methods, the screening and optimization stages require many rounds of experiments and computational simulation. By adding, after generation is complete, a small neural network that screens according to specified conditions, the invention accelerates the screening process and reduces the simulation and experimental workload.
3. Pre-training targets massive unlabeled data, so it demands a large number of model parameters and increases the resources and time required for training. The invention adapts to downstream tasks by fine-tuning an already trained pre-trained model, which reduces the computational cost of training a large model.
Drawings
FIG. 1 is a schematic diagram of a frame structure of a model of the present invention;
FIG. 2 is a schematic diagram of the pre-training of the model of the present invention;
FIG. 3 is a schematic representation of text data encoding of the model of the present invention;
FIG. 4 is a fine-tuning model of the molecular property prediction task of the model of the present invention;
FIG. 5 is a fine-tuning model of a target molecule generation task for specified properties of the model of the present invention.
Detailed Description
Example 1
The invention provides a method for generating molecules based on specified attributes, which comprises the following specific steps:
1. Collecting drug small-molecule data and preparing a dataset for the molecule generation task;
a table is built that contains the SMILES structure text of each molecule and its attribute values (SAS, QED, etc.). The sizes of the resulting datasets are shown in the following table:
Dataset | Data volume |
Training set | 207493 |
Test set | 50000 |
The attributes involved in the dataset of the present model include the lipid partition coefficient (LogP), the quantitative estimate of drug-likeness (QED) and the synthetic accessibility score (SAS); the goal is to generate molecules whose values of these three attributes equal the specified values. Other molecular properties may also be specified: solubility and permeability estimates (Lipinski), molecular weight (MW), topological polar surface area (TPSA), number of hydrogen-bond donors and acceptors (HBD & HBA), number of structural alerts (ALERT), and number of rotatable bonds (ROTB).
It should be noted that although the present model can implement two different downstream tasks through fine-tuning, the fine-tuning dataset is the same for both; only the way the data is organized differs. For molecular attribute prediction, each entry pairs a molecule's SMILES with its corresponding attributes; for molecule generation, each entry pairs given attribute values with the corresponding molecule's SMILES.
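A minimal sketch of how one table row could be turned into the two example forms is given below. The record type, the helper names and the numeric values are all hypothetical placeholders chosen for illustration; they are not computed properties and are not taken from the patent's dataset.

```python
from dataclasses import dataclass


@dataclass
class MoleculeRecord:
    """One row of the attribute table (field names are assumptions)."""
    smiles: str
    logp: float
    qed: float
    sas: float


def attributes_of(rec):
    """Collect the three conditioning attributes of a record."""
    return {"LogP": rec.logp, "QED": rec.qed, "SAS": rec.sas}


def to_prediction_example(rec):
    """Attribute prediction: SMILES is the input, attributes are the labels."""
    return rec.smiles, attributes_of(rec)


def to_generation_example(rec):
    """Molecule generation: attributes are the input, SMILES is the target."""
    return attributes_of(rec), rec.smiles


# Placeholder values for illustration only, not measured or computed.
record = MoleculeRecord(smiles="CCO", logp=0.0, qed=0.4, sas=1.0)
pred_input, pred_label = to_prediction_example(record)
gen_input, gen_target = to_generation_example(record)
```

Both examples come from the same row; only which side is treated as input and which as target changes, which mirrors the statement that the two tasks share one fine-tuning dataset.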
2. Establishing a molecule generation model based on pre-training and model fine-tuning; as shown in FIG. 1, the overall model framework is divided into two steps, molecule generation and molecule screening, and involves three technical points: the pre-trained model, fine-tuning of the pre-trained model, and molecule screening.
3. Fine-tuning the molecule generation model on the dataset prepared in step S1 to obtain models suited to the downstream molecule generation tasks; as shown in FIG. 2, computational simulation and analysis of small molecules greatly speeds up drug development. Characterizing and understanding molecules is an essential step toward this goal. Various molecular representations, such as molecular descriptors and fingerprints, have been proposed. Traditionally, these descriptors are designed by domain experts on the basis of chemical and pharmaceutical knowledge to represent molecules qualitatively or quantitatively. Various shallow machine learning models are used to derive quantitative structure-activity relationships (QSAR) and quantitative structure-property relationships (QSPR) that predict the activity and properties of molecules. With the rise of deep learning and representation learning in recent years, automatically representing and understanding molecules by learning high-level features from low-level data has become an effective approach to molecular modeling, making it possible to feed raw molecules directly into subsequent molecular analysis.
For encoding and decoding, following sequence models for text such as the recurrent neural network (RNN) and the long short-term memory network (LSTM), a Transformer is adopted to process the character sequence of the SMILES. The model uses 12 Transformer layers, a hidden size of 768 and 12 attention heads to encode the text data. To ensure that encoding and decoding do not produce confused structures because of differing codebooks, the encoding and decoding of the model are confined to the same semantic space.
As shown in FIG. 3, in the pre-trained model some of the attention arrows (solid lines) are bidirectional, so a position can attend to information both before and after it, while the connections drawn with dotted lines are non-bidirectional, i.e. they follow a causal attention pattern, and realize contrastive learning between the molecular structure and the text information. The attention arrows of the model are modified, and the pre-trained model is then fine-tuned on the dataset to obtain a model suited to the downstream molecule generation task.
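The difference between bidirectional and causal attention can be illustrated with boolean masks, where entry [i][j] says whether position i may attend to position j. This is a generic sketch of the two standard patterns; how the patent's model mixes them across its connections is shown only in FIG. 3 and is not reproduced here.

```python
def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def causal_mask(n):
    """Position i may attend only to positions j <= i (no look-ahead)."""
    return [[j <= i for j in range(n)] for i in range(n)]


full = bidirectional_mask(3)
causal = causal_mask(3)
```

The causal mask is lower-triangular, which is what allows a decoder to generate a SMILES string one token at a time without seeing future tokens.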
4. Adjusting the structure of the DPMG model for each downstream task, namely molecular attribute prediction and generation of target molecules with specified attributes;
as shown in FIG. 5, the fine-tuned model for the task of generating target molecules with specified attributes is used: in the overall generation downstream task, the model generates molecules, and the generated molecules are scored in the molecule screening stage, the scoring criteria being the validity (Validity) and the novelty (Novelty) of the generated molecules.
During the execution of this downstream task, [bos] is used as the start token, [seq] as the separator, and [seq-len] to control the text length of the generated SMILES; the input format of the molecular attributes is: attribute 1[seq] attribute 2[seq] attribute 3. Note also that once the model has been fine-tuned, no loss is needed to update the parameters, and each output of the SMILES is fed directly back into the model as input.
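The input format and the feed-back generation loop described here can be sketched in plain Python. Several details are assumptions made for illustration: `seq_len` is treated as a maximum number of generated tokens, the `[eos]` stop token and the scripted `toy_next_token` function are hypothetical stand-ins for the fine-tuned decoder, and the attribute values are arbitrary.

```python
BOS = "[bos]"
SEP = "[seq]"


def build_attribute_prompt(values):
    """Join attribute values as: attribute 1[seq]attribute 2[seq]attribute 3."""
    return SEP.join(f"{v:g}" for v in values)


def generate(next_token, prompt, seq_len, eos="[eos]"):
    """Autoregressive loop: each output token is fed back as input."""
    tokens = [BOS]
    while len(tokens) - 1 < seq_len:
        tok = next_token(prompt, tokens)
        if tok == eos:
            break
        tokens.append(tok)
    return "".join(tokens[1:])


def toy_next_token(prompt, tokens):
    """Scripted stand-in for the decoder; ignores the prompt."""
    script = ["C", "C", "O", "[eos]"]
    return script[len(tokens) - 1]


prompt = build_attribute_prompt([2.5, 0.8, 3.0])  # arbitrary target values
smiles_full = generate(toy_next_token, prompt, seq_len=10)
smiles_capped = generate(toy_next_token, prompt, seq_len=2)
```

No loss is computed anywhere in the loop, consistent with the note that parameters are no longer updated once fine-tuning is complete.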
The test benchmarks for this task are:
Test index | Validity | Novelty | Attribute deviation |
Expected value | 0.87 | 0.99 | ≤ 5% |
As shown in FIG. 4, the fine-tuned model for the molecular property prediction task is used:
in this downstream task, fine-tuning is performed by changing the attention arrows and using differently labeled datasets. It should be noted that although both the molecular property prediction model and the molecule generation model are called DPMG, they are essentially two different models, each suited to its own downstream task, and they differ in both structure and parameters. The input of this downstream task is the drug molecules generated from the specified attributes in the previous downstream task.
In this downstream task, the Transformer acts as the encoder; the encoded vector produced by the Transformer layers is finally converted to text output by a machine-learning classification model, and during training the MSE loss is back-propagated to update the model parameters.
In this downstream task, the model predicts and fills in the attributes of the generated molecules; a molecule whose predicted attribute deviation is within 5% is regarded as qualified, and otherwise as unqualified.
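The 5% qualification rule could be checked as in the sketch below. The text does not define how the deviation is measured, so relative deviation against the specified value is an assumption here, as are the function names and the example numbers.

```python
def relative_deviation(predicted, target):
    """|predicted - target| / |target| (assumed definition of deviation)."""
    if target == 0:
        raise ValueError("relative deviation is undefined for a zero target")
    return abs(predicted - target) / abs(target)


def is_qualified(predicted_attrs, target_attrs, tolerance=0.05):
    """Qualified when every specified attribute deviates by at most 5%."""
    return all(
        relative_deviation(predicted_attrs[name], target_attrs[name]) <= tolerance
        for name in target_attrs
    )


# Illustrative numbers only.
target = {"LogP": 2.0, "QED": 0.8}
close = is_qualified({"LogP": 2.05, "QED": 0.8}, target)
far = is_qualified({"LogP": 2.5, "QED": 0.8}, target)
```

An absolute tolerance would be needed for attributes whose specified value is zero, which this relative definition deliberately rejects.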
5. Screening the generated molecules;
preliminary filtering based on QED and SAscore ensures that the molecules produced in the initial stage of molecule generation possess basic drug-like properties; similarity calculation based on molecular fingerprints helps eliminate structurally redundant molecules of low intellectual-property value. In this way the generated drug molecules are preliminarily screened.
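A rough sketch of this two-stage screen follows: a threshold filter on QED and SAscore, then removal of near-duplicates by Tanimoto similarity of fingerprints. The thresholds (`qed_min`, `sa_max`, `sim_max`), the dictionary layout and the toy candidates are hypothetical; fingerprints are represented as plain sets of on-bit indices rather than being computed from structures.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def screen(candidates, qed_min=0.5, sa_max=4.0, sim_max=0.7):
    """Keep drug-like candidates that are not too similar to ones already kept."""
    kept = []
    for cand in candidates:
        if cand["qed"] < qed_min or cand["sa"] > sa_max:
            continue  # fails the basic drug-likeness filter
        if any(tanimoto(cand["fp"], k["fp"]) > sim_max for k in kept):
            continue  # structurally redundant with a kept molecule
        kept.append(cand)
    return kept


# Toy candidates with made-up scores and fingerprints.
candidates = [
    {"name": "mol_a", "qed": 0.8, "sa": 2.0, "fp": {1, 2, 3, 4}},
    {"name": "mol_b", "qed": 0.7, "sa": 2.5, "fp": {1, 2, 3, 4, 5}},
    {"name": "mol_c", "qed": 0.2, "sa": 2.0, "fp": {7, 8, 9}},
    {"name": "mol_d", "qed": 0.6, "sa": 3.0, "fp": {10, 11}},
]
kept_names = [c["name"] for c in screen(candidates)]
```

Here `mol_b` is dropped as a near-duplicate of `mol_a` (similarity 0.8) and `mol_c` fails the QED threshold, so only `mol_a` and `mol_d` remain.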
6. Providing indexes for evaluating molecule generation and quantifying model quality
To screen the generated molecular structures for validity, the molecules generated by the neural network need to be scored by a small neural network. MOSES and GuacaMol are two mainstream tools for scoring generated drug molecules: the former emphasizes testing general drug-likeness indexes of the generated molecules such as validity, novelty and scaffold diversity, while the latter evaluates the multi-objective optimization capability of the model by defining a series of tasks.
Validity of generated molecules (Validity): the validity (rationality) of a molecule refers to whether the structure, properties and design of the molecule are consistent with the expected biological activity or pharmaceutical properties, whether the basic laws of chemistry, biology and pharmacology are met, and whether it corresponds (at least theoretically) to a real molecule. If the molecular structure is given as SMILES, the MolFromSmiles method of the RDKit toolkit is usually used to check whether it can be converted from SMILES format into an RDKit molecule (rdmol) object; if so, it is a valid molecule.
Novelty of generated molecules (Novelty): the novelty of a molecule refers to whether its structure is unique with respect to known compound libraries, or whether it is innovative. How the index is calculated can be set manually according to task requirements.
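The validity, uniqueness and novelty indexes reported in Table 2 can be computed along the following lines. This sketch follows the common convention of measuring uniqueness among valid molecules and novelty among unique valid molecules, which is an assumption about how the patent computes them. The validity check is injected as a function: with RDKit installed it would typically be `Chem.MolFromSmiles(s) is not None`, whereas the parenthesis check below is only a toy stand-in so the example runs without RDKit.

```python
def generation_metrics(generated, training_set, is_valid):
    """Return (validity, uniqueness, novelty) for a list of SMILES strings."""
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    validity = len(valid) / len(generated) if generated else 0.0
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = len(novel) / len(unique) if unique else 0.0
    return validity, uniqueness, novelty


def balanced_parentheses(smiles):
    """Toy validity check: branch parentheses must be balanced."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


generated = ["CCO", "CCO", "C(C", "CCN"]
training = ["CCO"]
validity, uniqueness, novelty = generation_metrics(
    generated, training, balanced_parentheses
)
```

In practice the SMILES strings would be canonicalised before the uniqueness and novelty comparisons, since one molecule can be written as several different SMILES strings.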
Table 1. ROC-AUC comparison of models on molecular property prediction tasks:
Dataset | Logreg | KernelSVM | XGBoost | IRV | Multitask | GC | Weave | DPMG |
HIV | 0.702 | 0.792 | 0.756 | 0.737 | 0.698 | 0.763 | 0.703 | 0.798 |
BACE | 0.781 | 0.862 | 0.85 | 0.838 | 0.698 | 0.763 | 0.703 | 0.872 |
BBBP | 0.699 | 0.729 | 0.696 | 0.7 | 0.688 | 0.69 | 0.671 | 0.962 |
CLINTOX | 0.722 | 0.669 | 0.799 | 0.77 | 0.778 | 0.807 | 0.832 | 0.984 |
Table 2. Performance comparison of models on the molecular generation task:
Model | Validity | Uniqueness | Novelty |
---|---|---|---|
JT-VAE | 62% | 100% | 100% |
GCPN | 20% | 99.97% | 100% |
MRNN | 65% | 99.89% | 100% |
GraphNVP | 55% | 94.80% | 100% |
GraphAF | 68% | 99.10% | 100% |
DPMG | 85.28% | 99.91% | 100% |
As can be seen from Tables 1 and 2 above, the model built according to the present invention can generate molecules when only the values of one or more attributes are specified, and achieves high validity, uniqueness and novelty when generating molecules based on the specified attributes.
The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited thereto, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.
Claims (10)
1. A method for generating molecules based on specified attributes, comprising the following specific steps:
S1, collecting drug small-molecule data and preparing a data set for the molecular generation task;
S2, establishing a molecular generation model based on pre-training and model fine-tuning;
S3, performing fine-tuning training on the molecular generation model using the data set prepared in S1 to obtain two different models suited to the downstream tasks of molecular generation;
S4, adjusting the DPMG model structure separately for the two downstream tasks, namely molecular attribute prediction and generation of target molecules with specified attributes;
S5, screening the obtained molecules;
S6, providing indexes for evaluating molecular generation and quantifying the quality of the model.
2. The method of claim 1, wherein the representations of the small-molecule structure comprise the linear SMILES notation and a two-dimensional undirected cyclic graph;
the drug small-molecule data collected in S1 comprise public data and experimentally produced data, and the scale of the collected data reaches the tens of millions;
the collected drug molecule data are organized into a table comprising the SMILES expression of each molecule, its lipid partition coefficient, drug affinity and synthetic accessibility; and reasonable (valid) molecules are retained by filtering with the RDKit toolkit.
3. The method of generating molecules based on specified attributes according to claim 2, wherein in S1, textual SMILES is used as the input data of the model;
the attributes collected together with the SMILES expressions of the corresponding molecules further include solubility and permeability estimates, molecular mass, topological polar surface area, number of hydrogen bond donors and acceptors, number of structural alerts, and number of rotatable bonds;
wherein molecules are generated according to the three molecular attributes of lipid partition coefficient, quantitative estimate of drug-likeness (QED) and synthetic accessibility, the target being a molecule whose values for these three attributes equal the specified values;
or according to one or more of the six attributes of solubility and permeability estimates, molecular mass, topological polar surface area, number of hydrogen bond donors and acceptors, number of structural alerts, and number of rotatable bonds.
4. A method of generating molecules based on specified properties according to claim 1, wherein in S1 different datasets are created for different properties of the same molecular structure.
5. The method for generating molecules based on specified attributes according to claim 1, wherein, in the pre-training process of S3, text data are encoded in the same semantic space using 12 Transformer layers, a hidden size of 768 and 12 attention heads;
wherein the text sequence information and the molecular structure information are learned by specifying the direction of attention.
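As an illustration only, one common way to "specify the direction" of attention in a Transformer is an attention mask that decides which positions each token may look at. Whether the claimed model uses exactly this mechanism is an assumption; the names `ENCODER_CONFIG`, `HEAD_DIM` and `attention_mask` are hypothetical, and only the three numbers 12, 768 and 12 come from the claim.

```python
# Hyperparameters as stated in claim 5; the key names are illustrative.
ENCODER_CONFIG = {"num_layers": 12, "hidden_size": 768, "num_attention_heads": 12}

# Per-head width, assuming the usual even split of the hidden size across heads.
HEAD_DIM = ENCODER_CONFIG["hidden_size"] // ENCODER_CONFIG["num_attention_heads"]


def attention_mask(seq_len, direction):
    """Return mask[i][j] == True if position i may attend to position j."""
    if direction == "bidirectional":
        # Every token sees the whole sequence (typical for encoding or prediction).
        return [[True] * seq_len for _ in range(seq_len)]
    if direction == "left_to_right":
        # Each token sees only itself and earlier tokens (typical for generation).
        return [[j <= i for j in range(seq_len)] for i in range(seq_len)]
    raise ValueError(f"unknown direction: {direction!r}")


causal = attention_mask(3, "left_to_right")
full = attention_mask(3, "bidirectional")
```

Under this reading, a bidirectional mask would suit attribute prediction, where the whole SMILES string is available, while a left-to-right mask would suit character-by-character generation.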
6. The method for generating molecules based on the specified attributes according to claim 1, wherein in S4, in the molecular attribute prediction process, after the molecules are generated, the generated molecules are fed into the DPMG model as input, and the remaining attributes of the molecules are predicted so as to complete all attributes of the molecules; the result obtained is compared with the previously given attributes, and the degree of prediction deviation is calculated;
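in the molecular attribute prediction process, a regression model is used as the final stage to obtain a numerical value, and MSE is used as the loss function:
for the mean square error loss, the general formula is:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
in the above formula, $y_i$ is the true measured value and $\hat{y}_i$ is the model's predicted value; letting $\hat{y}_i = a x_i + b$ in the regression model, the loss function is expressed as:
$$L(a, b) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (a x_i + b)\right)^2$$
the training objective is to find the values of $a$ and $b$ at which the following function attains its minimum:
$$\frac{1}{n}\sum_{i=1}^{n}\left(y_i - a x_i - b\right)^2$$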
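The MSE objective of claim 6 can be illustrated with a small sketch. It assumes a one-variable linear regression, with the prediction equal to a times x plus b, which is our reading of the parameters a and b named in the claim. In the actual model the regression stage sits on top of a neural network and would be trained by gradient descent; the closed-form least-squares solution below is used only because it makes the minimum easy to verify. The function names and the data are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean squared error between measured values and predictions."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def fit_line(xs, ys):
    """Return the (a, b) that minimize the MSE of the line a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx            # slope
    b = mean_y - a * mean_x  # intercept
    return a, b


# Toy data lying exactly on y = 2x + 1 (illustrative, not from the patent).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(xs, ys)
fitted_error = mse(ys, [a * x + b for x in xs])
shifted_error = mse(ys, [a * x + (b + 1.0) for x in xs])
```

Moving either parameter away from the fitted values increases the error, which is what "searching for the minimum" means here: shifting b by 1 raises the MSE from 0 to 1 on this data.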
7. The method for generating molecules based on the specified attributes according to claim 1, wherein in S4, in the step of generating target molecules with specified attributes, drug molecules meeting the requirements are generated according to the given attribute parameters;
the numerical values of the molecular attributes are input into an encoder, and the word encoding obtained in the encoder is input into a decoder to obtain an output SMILES file;
a teacher model is introduced in the output process: each time an atom is generated, the SMILES portion generated so far is compared with the correct SMILES molecular file serving as a reference, thereby completing the fine-tuning of the trained model;
after each atom is compared with the reference SMILES file, the reference SMILES file is used as the precondition for generating the next atom; each time an atom is generated, a loss value is calculated through the comparison, and the parameters are fine-tuned through back-propagation.
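The procedure in claim 7 resembles what is usually called teacher forcing: at every step the decoder is conditioned on the reference sequence rather than on its own earlier output. The sketch below shows only that control flow, under several assumptions: the model is abstracted as a function `next_token_probs(prefix)` returning a probability for each candidate token, the per-step loss is a plain negative log-likelihood rather than the binary cross entropy of claim 8, and back-propagation is omitted because it needs a deep-learning framework. All names, tokens and the uniform toy model are hypothetical.

```python
import math


def teacher_forced_loss(next_token_probs, reference_tokens, start_token="<s>"):
    """Average per-token negative log-likelihood under teacher forcing."""
    prefix = [start_token]
    total = 0.0
    for target in reference_tokens:
        # The model predicts the next token from the REFERENCE prefix.
        probs = next_token_probs(prefix)
        # Compare the prediction with the reference token to obtain a loss value.
        total += -math.log(max(probs.get(target, 0.0), 1e-12))
        # Feed the reference token, not the model's own guess, into the next step.
        prefix.append(target)
    return total / len(reference_tokens)


VOCAB = ["C", "O", "N", "</s>"]


def uniform_model(prefix):
    """Toy stand-in for a decoder: every token is equally likely."""
    return {token: 1.0 / len(VOCAB) for token in VOCAB}


reference = ["C", "C", "O", "</s>"]  # tokens of the reference SMILES "CCO"
loss = teacher_forced_loss(uniform_model, reference)
```

For the uniform toy model the loss equals the logarithm of the vocabulary size; a trained decoder would assign higher probability to the reference tokens and so obtain a lower value, and in real training this loss would be back-propagated after each step or each sequence.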
8. A method of generating molecules based on specified attributes according to claim 7, wherein binary cross entropy is used as the loss function when calculating the loss:
$$l(\theta) = \sum_{i}\left[y_i \log\theta + (1 - y_i)\log(1 - \theta)\right]$$
the data set takes the form $\mathrm{data} = (x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4), \dots$, wherein $x_i$ is the input variable, i.e. a character generated by the model, and $y_i$ is the observed value, i.e. the expected output of the model; here $y$ takes the value 0 or 1, and when the probability of $y = 1$ is $\theta$, i.e. $P_\theta(y = 1) = \theta$, the log-likelihood of the observed data points can be expressed by the above equation, where the likelihood function $l(\theta)$ is the objective function;
if a negative sign is added in front of it, it is converted into a loss function, which is the cross entropy between $y_i$ and $\theta$;
the loss function in cross-entropy form for a single sample is:
$$\mathrm{Loss} = -\left[y_i \log p + (1 - y_i)\log(1 - p)\right]$$
where $y_i$ is the observed value of the $i$-th sample and $p$ is the predicted probability.
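The single-sample cross-entropy loss of claim 8 can be written out directly. The sketch below averages it over a batch and clips the predicted probability away from 0 and 1 to avoid taking the logarithm of zero; the clipping constant `eps` and the function name are our own additions, not part of the claim.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# An uninformative prediction of 0.5 costs log(2) per sample.
uninformative = binary_cross_entropy([1, 0], [0.5, 0.5])
# Confident, correct predictions cost much less.
confident = binary_cross_entropy([1, 0], [0.9, 0.1])
```

The loss falls as the predicted probability moves toward the observed label, which is why minimizing it is equivalent to maximizing the log-likelihood described in the claim.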
9. The method of claim 1, wherein in S5 the generated molecules are preliminarily filtered for basic drug-like properties based on QED and SAscore.
10. The method of claim 1, wherein in S6 the molecular structures are screened by scoring the drug molecules using MOSES and GuacaMol;
wherein,
rationality of generated molecules: the rationality of a molecule refers to whether the structure, properties and design of the molecule are consistent with the expected biological activity or pharmaceutical properties, whether the basic laws of chemistry, biology and pharmacology are met, and whether the molecule corresponds to a real molecule; if the molecular structure is given as SMILES, the MolFromSmiles method of the RDKit toolkit is used to check whether it can be converted from the SMILES format into an RDKit Mol object, and if so, the molecule is a reasonable molecule.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311238924.0A CN117334271A (en) | 2023-09-25 | 2023-09-25 | Method for generating molecules based on specified attributes |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311238924.0A CN117334271A (en) | 2023-09-25 | 2023-09-25 | Method for generating molecules based on specified attributes |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117334271A (en) | 2024-01-02 |
Family
ID=89289519
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311238924.0A Pending CN117334271A (en) | 2023-09-25 | 2023-09-25 | Method for generating molecules based on specified attributes |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117334271A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117594157A (en) * | 2024-01-19 | 2024-02-23 | 烟台国工智能科技有限公司 | Method and device for generating molecules of single system based on reinforcement learning |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110459274A (en) * | 2019-08-01 | 2019-11-15 | 南京邮电大学 | A kind of small-molecule drug virtual screening method and its application based on depth migration study |
CN110534164A (en) * | 2019-09-26 | 2019-12-03 | 广州费米子科技有限责任公司 | Drug molecule generation method based on deep learning |
CN112071373A (en) * | 2020-09-02 | 2020-12-11 | 深圳晶泰科技有限公司 | Drug molecule screening method and system |
CN112270951A (en) * | 2020-11-10 | 2021-01-26 | 四川大学 | Brand-new molecule generation method based on multitask capsule self-encoder neural network |
WO2022047677A1 (en) * | 2020-09-02 | 2022-03-10 | 深圳晶泰科技有限公司 | Drug molecule screening method and system |
CN114220497A (en) * | 2021-12-14 | 2022-03-22 | 中国科学院过程工程研究所 | Ionic liquid type antibiotic drug property prediction method based on transfer learning and graph neural network and high-throughput screening platform |
CN115240787A (en) * | 2022-07-26 | 2022-10-25 | 四川大学 | Brand-new molecule generation method based on deep conditional recurrent neural network |
CN115240782A (en) * | 2022-06-23 | 2022-10-25 | 中国科学院自动化研究所 | Drug attribute prediction method, device, electronic device and storage medium |
CN115359856A (en) * | 2022-07-25 | 2022-11-18 | 杭州碳硅智慧科技发展有限公司 | Training method and device of molecular generation model |
WO2023029351A1 (en) * | 2021-08-30 | 2023-03-09 | 平安科技(深圳)有限公司 | Self-supervised learning-based method, apparatus and device for predicting properties of drug small molecules |
CN115881244A (en) * | 2022-11-04 | 2023-03-31 | 中国药科大学 | Drug molecular skeleton replacement and screening method based on deep migration learning model |
CN116779059A (en) * | 2023-07-07 | 2023-09-19 | 北京迈高材云科技有限公司 | Molecular property prediction method based on attention mechanism migration learning |
CN117275609A (en) * | 2023-10-16 | 2023-12-22 | 深度感知生物医学科技(广州)有限公司 | Molecular design method based on variation self-encoder and transducer model |
Non-Patent Citations (2)
Title |
---|
HOP P: "Geometric Deep Learning Autonomously Learns Chemical Features That Outperform Those Engineered by Domain Experts", MOLECULAR PHARMACEUTICS, vol. 15, no. 10, 31 December 2018 (2018-12-31), pages 4371 - 4377 * |
Liu Jingtao; Liu Yingxue: "Principles and Applications of Computer-Aided Drug Design", Technology Innovation and Application, no. 33, 28 November 2016 (2016-11-28), pages 58 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117594157A (en) * | 2024-01-19 | 2024-02-23 | 烟台国工智能科技有限公司 | Method and device for generating molecules of single system based on reinforcement learning |
CN117594157B (en) * | 2024-01-19 | 2024-04-09 | 烟台国工智能科技有限公司 | Method and device for generating molecules of single system based on reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||