CN117334271A - Method for generating molecules based on specified attributes - Google Patents

Info

Publication number
CN117334271A
Authority
CN
China
Prior art keywords
molecular
model
molecules
data
generation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311238924.0A
Other languages
Chinese (zh)
Inventor
顾忠泽
于文龙
丁彦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Institute Of Sports Health
Original Assignee
Jiangsu Institute Of Sports Health
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Institute Of Sports Health
Priority to CN202311238924.0A
Publication of CN117334271A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/50 Molecular design, e.g. of drugs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of drug molecule generation models, and in particular to a method for generating molecules based on specified attributes. The method comprises: S1, collecting drug small-molecule data and preparing a dataset for the molecular generation task; S2, establishing a molecular generation model based on pre-training and fine-tuning; S3, fine-tuning the molecular generation model on the dataset prepared in S1 to obtain two different models suited to the downstream molecular generation tasks; S4, adjusting the DPMG model structure separately for the two downstream tasks, namely molecular attribute prediction and generation of target molecules with specified attributes; S5, screening the generated molecules; S6, providing indexes for evaluating molecular generation and quantifying the quality of the model. By fine-tuning a pre-trained model, the invention brings pre-trained deep learning models into the generation of drug molecules, which significantly accelerates the drug development process.

Description

Method for generating molecules based on specified attributes
Technical Field
The invention relates to the technical field of drug molecule generation models, in particular to a method for generating molecules based on specified attributes.
Background
Over the past few decades, computer science has gradually been integrated into drug development, progressing from simple data entry to assisting drug design, which gave rise to computer-aided drug design (CADD). Although CADD performs well on certain tasks, such as virtual screening of drug molecules, challenges remain in the design and optimization of drug molecules, and as technology has progressed, artificial intelligence has become the optimal solution to this problem.
AI pharmacy builds on large-scale pharmaceutical data and uses AI techniques such as machine learning and deep learning in place of a large number of experiments to rapidly analyze the structure, efficacy and other characteristics of drugs, so that new drugs can be developed in a short time and at low cost. Compared with traditional computer-aided drug design, AI techniques can rapidly identify drug targets, match suitable molecules from a database, design and synthesize compounds, and predict drug metabolism and physicochemical properties, thereby greatly shortening development time, reducing development cost and improving the success rate. The emergence and application of pre-trained models, against the broader background of artificial intelligence development, shortens this process further.
Labeled datasets, algorithm models and computing power are indispensable components of AI pharmacy, and they are also the sources of the major challenges currently facing molecular generation:
Data: high-quality data is hard to obtain, and this is a significant constraint. The data sources available to drug development enterprises can be divided into public and non-public data. Public data is easy to obtain, but its quality is difficult to guarantee, so models built on it are not sufficiently reliable. Non-public data mainly consists of results accumulated from the past projects of individual pharmaceutical companies; it is highly accurate but extremely difficult to obtain, because it is a core asset of those companies.
Algorithms: the algorithm must be closely matched to the application scenario. The strengths of an algorithm model in AI drug development show in the accuracy of its results, its computation speed, its model scale, its generalization performance and so on. Different pre-trained models emphasize different aspects and therefore have different strengths, so a pre-trained model with the appropriate strengths should be selected according to the specific task requirements and application scenario.
Computing power: fine-tuning the model may require a significant amount of computing resources, especially when the model architecture and parameters need to be adjusted. This may limit the practical application of the method.
At present, AI drug research and development in China is mainly applied to the drug discovery and preclinical research stages. Limited by the inherent complexity of biological systems and the heterogeneity of diseases, AI technology has not yet brought revolutionary changes to the efficiency and success rate of drug development, and the field as a whole is still at an exploratory stage. In the future, with updated algorithms, breakthroughs in computing power and the development of big data, AI technology will be applied in depth to every stage of new drug development and will play an increasingly important role in compound synthesis, efficacy prediction, automated development and other stages.
Disclosure of Invention
The invention aims to solve the problems described in the background and provides a method for generating molecules based on specified attributes. By fine-tuning a pre-trained model, the method brings pre-trained deep learning models into the generation of drug molecules, which significantly accelerates the drug development process.
The technical scheme of the invention is a method for generating molecules based on specified attributes, comprising the following specific steps:
S1, collecting drug small-molecule data and preparing a dataset for the molecular generation task;
S2, establishing a molecular generation model based on pre-training and fine-tuning;
S3, fine-tuning the molecular generation model on the dataset prepared in S1 to obtain two different models suited to the downstream molecular generation tasks;
S4, adjusting the DPMG model structure separately for the two downstream tasks, namely molecular attribute prediction and generation of target molecules with specified attributes;
S5, screening the generated molecules;
S6, providing indexes for evaluating molecular generation and quantifying the quality of the model.
The small-molecule structure can be represented either as a linear SMILES string or as a two-dimensional undirected cyclic graph;
the drug small-molecule data collected in S1 comprise public data and experimentally produced data, and the collected data reach a scale of tens of millions of records;
the collected drug molecule data are organized into a table comprising the SMILES expression of each molecule, its lipid partition coefficient, its drug-likeness and its synthetic accessibility, and valid molecules are selected by filtering with the RDKit tool.
Preferably, the invention uses SMILES in text form as the input data of the model;
the collected SMILES expressions are accompanied by further properties of the corresponding molecules, including solubility and permeability estimates, molecular mass, topological polar surface area, number of hydrogen bond donors and acceptors, number of structural alerts, and number of rotatable bonds;
wherein molecules are generated according to three molecular attributes, namely the lipid partition coefficient, the quantitative estimate of drug-likeness and the synthetic accessibility, the target being a molecule whose values for these three attributes equal the specified values;
or one or more of the six attributes of solubility and permeability estimates, molecular mass, topological polar surface area, number of hydrogen bond donors and acceptors, number of structural alerts, and number of rotatable bonds are specified in combination.
Preferably, because the model must generate drug molecules conditioned on specified attributes, the neural network has to recognize the relationships among the attributes, the molecular structures and the SMILES text; therefore, in S1, different datasets are prepared for different attributes of the same molecular structure, so that the neural network can learn the associated information.
In the pre-training process of S3, 12 Transformer layers, a hidden size of 768 and 12 attention heads are used to encode the text data within a single shared semantic space;
wherein the text sequence information and the molecular structure information are learned by specifying the direction of attention (the attention arrows of the model).
In S4, in the molecular attribute prediction process, once molecule generation is finished the generated molecules are input into the DPMG model, which predicts their remaining attributes so that all attributes of each molecule are filled in; the results are compared with the previously specified attributes, and the deviation of the prediction is calculated;
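in the molecular attribute prediction process, a regression model is used as the final stage to obtain a numerical value, and the mean square error (MSE) is used as the loss function.
For the mean square error loss, the general formula is:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
In the above formula, $y_i$ is the true measured value and $\hat{y}_i$ is the model's predicted value. Letting $\hat{y}_i = a x_i + b$ in the regression model, the loss function is expressed as:
$$L(a, b) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (a x_i + b)\right)^2$$
The training objective is to find the values of $a$ and $b$ that minimize the following function:
$$(a^{*}, b^{*}) = \arg\min_{a,\,b}\ \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (a x_i + b)\right)^2$$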
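As a worked illustration of the mean-square-error objective described above, the following minimal sketch fits the parameters a and b of a one-variable linear regression in closed form and evaluates the MSE. The toy data and the function names are hypothetical, and the patent does not state how the regression stage of the DPMG model is actually optimized (in a neural network this would normally be done by gradient descent through back propagation rather than in closed form).

```python
def mse(y_true, y_pred):
    """Mean square error: (1/n) * sum((y_i - y_hat_i)^2)."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n


def fit_line(xs, ys):
    """Closed-form least-squares estimate of a and b in y_hat = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Hypothetical toy data generated from y = 2x + 1 (not taken from the patent).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(xs, ys)
loss = mse(ys, [a * x + b for x in xs])
```

Because the toy data lie exactly on a line, the fitted parameters recover the generating slope and intercept and the loss vanishes; with noisy attribute measurements the minimum of the objective would be strictly positive.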
In S4, in the process of generating target molecules with specified attributes, drug molecules meeting the requirements are generated according to the given attribute parameters;
the numerical values of the molecular attributes are input into an encoder, and the embedding produced by the encoder is input into a decoder to obtain the output SMILES;
a teacher model (teacher forcing) is introduced in the output process: each time an atom is generated, the SMILES fragment produced so far is compared with the correct reference SMILES, which completes the fine-tuning of the model;
after each atom has been compared with the reference SMILES, the reference SMILES is used as the precondition for generating the next atom; for each generated atom, a loss value is calculated from the comparison and the parameters are fine-tuned through back propagation.
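The step-by-step comparison against a reference SMILES described above corresponds to what is usually called teacher forcing. The sketch below shows only the control flow: at every step the model is conditioned on the reference prefix rather than on its own earlier output, and a per-step loss is accumulated. The uniform toy "decoder", the vocabulary and the `[eos]` token are hypothetical placeholders, and the per-step loss is taken to be the negative log-probability of the reference token for brevity; the patent itself specifies binary cross entropy and does not disclose its decoder.

```python
import math


def toy_next_token_probs(prefix, vocab):
    """Hypothetical stand-in for the decoder: a uniform distribution over vocab."""
    return {tok: 1.0 / len(vocab) for tok in vocab}


def teacher_forced_loss(reference, vocab, next_token_probs=toy_next_token_probs):
    """Sum of per-step losses, each step conditioned on the reference prefix."""
    total = 0.0
    for t, target in enumerate(reference):
        prefix = reference[:t]              # ground-truth tokens, not model output
        probs = next_token_probs(prefix, vocab)
        total += -math.log(probs[target])   # penalty for the correct token at step t
    return total


vocab = ["C", "O", "N", "(", ")", "=", "[eos]"]
reference = ["C", "C", "O", "[eos]"]        # ethanol as SMILES tokens plus an end marker
loss = teacher_forced_loss(reference, vocab)
```

In a real training loop the accumulated loss would be back-propagated to update the decoder parameters; that part is omitted here because it depends on the deep learning framework in use.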
Binary cross entropy is adopted as the loss function.
The dataset takes the form data = {(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4), ...}, where $x_i$ is the input variable, i.e. a character generated by the model, and $y_i$ is the observed value, i.e. the expected output of the model. Here y takes the value 0 or 1; when the probability that y = 1 is θ, i.e. $P_\theta(y = 1) = \theta$, the log-likelihood of the observed data points can be expressed as
$$l(\theta) = \sum_{i}\left[\,y_i \log\theta + (1 - y_i)\log(1 - \theta)\,\right]$$
where the likelihood function $l(\theta)$ is the objective function;
if a negative sign is added in front of it, it becomes a loss function, namely the cross entropy between $y_i$ and θ;
the loss function in cross-entropy form for a single sample is:
$$\mathrm{Loss} = -\left[\,y_i \log p + (1 - y_i)\log(1 - p)\,\right]$$
where $y_i$ is the observation of the i-th sample and $p$ is the predicted probability.
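The single-sample cross-entropy loss above can be evaluated directly. The following minimal sketch implements it and shows that a confident wrong prediction is penalized more heavily than a confident correct one. The clipping constant `eps` and the function names are assumptions added for numerical safety and illustration; they do not come from the patent.

```python
import math


def binary_cross_entropy(y, p, eps=1e-12):
    """Loss = -[y*log(p) + (1-y)*log(1-p)] for a label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)   # keep log() away from 0 (assumed safeguard)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def mean_bce(ys, ps):
    """Average the single-sample loss over a batch of labels and predictions."""
    return sum(binary_cross_entropy(y, p) for y, p in zip(ys, ps)) / len(ys)


confident_right = binary_cross_entropy(1, 0.9)   # true label 1, predicted 0.9
confident_wrong = binary_cross_entropy(1, 0.1)   # true label 1, predicted 0.1
```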
Preferably, in S5, the generated molecules are preliminarily filtered for basic drug-like properties based on QED and SAscore.
Preferably, in S6, the generated drug molecular structures are scored using MOSES and GuacaMol, and the molecular structures are screened;
wherein,
rationality (validity) of generated molecules: the rationality of a molecule refers to whether the structure, properties and design of the molecule are consistent with the expected biological activity or pharmaceutical properties, whether the basic laws of chemistry, biology and pharmacology are met, and whether the molecule corresponds to a real molecule; if the molecular structure is given as SMILES, the MolFromSmiles method of the RDKit toolkit is used to check whether it can be converted from the SMILES format into an RDKit Mol object, and if so, it is a reasonable molecule.
Compared with the prior art, the invention has the following beneficial technical effects:
1. The invention generates drug molecules according to their required attributes, which is simple, clear, and easy to understand and operate.
2. In traditional methods, the screening and optimization stages require repeated experiments and computational simulation; by adding, after generation, a small neural network that screens according to specified conditions, the invention accelerates the screening process and reduces the simulation and experimental workload.
3. The pre-training task targets massive unlabeled data, which places high demands on the number of model parameters and increases the resources and time required for model training. The invention adapts to downstream tasks by fine-tuning an already trained pre-trained model, thereby reducing the computational cost of training a large model.
Drawings
FIG. 1 is a schematic diagram of a frame structure of a model of the present invention;
FIG. 2 is a schematic diagram of the pre-training of the model of the present invention;
FIG. 3 is a schematic representation of text data encoding of the model of the present invention;
FIG. 4 is a fine-tuning model of the molecular property prediction task of the model of the present invention;
FIG. 5 is a fine-tuning model of a target molecule generation task for specified properties of the model of the present invention.
Detailed Description
Example 1
The invention provides a method for generating molecules based on specified attributes, which comprises the following specific steps:
1. Collecting drug small-molecule data and preparing a dataset for the molecular generation task;
a table is built that includes the molecular SMILES structure text and the molecular attribute values (SAS, QED, etc.). The size of the resulting dataset is shown in the following table:
Split          Data volume
Training set   207493
Test set       50000
The attributes involved in the dataset of this model include the lipid partition coefficient (LogP), the quantitative estimate of drug-likeness (QED) and the synthetic accessibility score (SAS); the generation target is a molecule whose values for these three attributes equal the specified values. Other molecular properties may also be specified, such as solubility and permeability estimates (Lipinski), molecular weight (MW), topological polar surface area (TPSA), number of hydrogen-bond donors and acceptors (HBD & HBA), number of structural alerts (ALERT), and number of rotatable bonds (ROTB).
It should be noted that although the model can implement two different downstream tasks through fine-tuning, the fine-tuning dataset is the same for both; only the way the data are organized differs. For molecular attribute prediction, each entry is a molecular SMILES together with its corresponding attributes; for molecular generation, each entry is a set of attribute values together with the corresponding molecular SMILES.
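To make the two organizations of the same fine-tuning data concrete, the following minimal sketch derives both views from one list of records. The field names, the helper functions and the attribute values are hypothetical placeholders (the numbers are not computed descriptors), and the actual file layout used for DPMG is not disclosed in the source.

```python
# Order of the three conditioning attributes (LogP, QED, SAS) used in this sketch.
ATTRIBUTES = ("logp", "qed", "sas")

# Hypothetical records; the attribute values are placeholders, not real descriptors.
records = [
    {"smiles": "CCO", "logp": -0.1, "qed": 0.41, "sas": 1.0},
    {"smiles": "CC(=O)O", "logp": 0.1, "qed": 0.43, "sas": 1.3},
]


def prediction_view(rows):
    """Attribute prediction: the input is a SMILES, the target is its attributes."""
    return [(r["smiles"], [r[a] for a in ATTRIBUTES]) for r in rows]


def generation_view(rows):
    """Molecule generation: the input is the attributes, the target is the SMILES."""
    return [([r[a] for a in ATTRIBUTES], r["smiles"]) for r in rows]


pred_pairs = prediction_view(records)
gen_pairs = generation_view(records)
```

Both views contain exactly the same information; only the roles of input and target are swapped, which is why a single dataset can serve both fine-tuning tasks.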
2. Establishing a molecular generation model based on pre-training and model fine tuning; as shown in fig. 1, the overall model framework is divided into two steps of molecular generation and molecular screening, and the three technical points of pre-training a model, pre-training fine tuning and molecular screening are involved.
3. Fine-tuning the molecular generation model on the dataset prepared in step S1 to obtain a model suited to the downstream molecular generation task; as shown in FIG. 2, computational simulation and analysis of small molecules greatly speeds up the process of drug development. Characterization and understanding of molecules is an essential step in achieving this goal. Various molecular characterizations, such as molecular descriptors and fingerprints, have been proposed. Traditionally, these descriptors are designed by domain experts based on chemical and pharmaceutical knowledge to represent molecules qualitatively or quantitatively. Various shallow machine learning models are used to obtain quantitative structure-activity relationships (QSAR) and quantitative structure-property relationships (QSPR) in order to predict the activity and properties of molecules. With the rise of deep learning and representation learning in recent years, automatically representing and understanding molecules by learning high-level features from low-level data has become an effective approach to molecular modeling, making it possible to input raw molecules directly for subsequent molecular analysis.
For encoding and decoding, the model draws on sequence models for text, such as the recurrent neural network (RNN) and the long short-term memory network (LSTM), and adopts a Transformer to process the character sequence of the SMILES. The model uses 12 Transformer layers, a hidden size of 768 and 12 attention heads to encode the text data. To ensure that differing codebooks do not cause confusion in the generated structures, the encoding and decoding of the model are confined to the same semantic space.
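Before a Transformer can process a SMILES string, the string has to be split into tokens. The sketch below shows one common way of doing this with a regular expression that keeps bracket atoms and the two-letter halogens Cl and Br together and otherwise splits character by character. This tokenization scheme is an assumption for illustration; the source only says that the character sequence of the SMILES is processed and does not describe its tokenizer.

```python
import re

# Bracket atoms first, then two-letter halogens, then any single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens suitable for a sequence model."""
    return SMILES_TOKEN.findall(smiles)


tokens = tokenize_smiles("CC(=O)Cl")   # acetyl chloride
```

Joining the tokens back together reproduces the original string, so the split is lossless, which matters when the decoder output has to be converted back into a SMILES for validity checking.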
As shown in FIG. 3, in the pre-training model some of the attention arrows (solid lines) are bidirectional and can attend to both the preceding and the following information, whereas the connections drawn with dotted lines are not bidirectional and realize contrastive learning of molecular structure and text information, i.e. a causal model in the attention mechanism. After the attention arrows of the model are modified, the pre-training model is fine-tuned on the dataset to obtain a model suited to the downstream molecular generation task.
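The difference between bidirectional and causal attention can be expressed as an attention mask, where entry (i, j) is 1 if position i may attend to position j. The following minimal sketch builds both mask types as nested lists. Which positions of the DPMG input use which mask is not specified in the source, so the sketch only illustrates the two mask shapes; the function names are hypothetical.

```python
def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Position i may attend only to positions j <= i (no look-ahead)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


mask = causal_mask(4)
```

Switching a layer from the first mask to the second is what turns an encoder-style model that sees the whole sequence into a decoder-style model that can generate a SMILES one token at a time.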
4. Adjusting the DPMG model structure of the downstream task based on molecular attribute prediction and target molecular generation of the designated attribute;
as shown in FIG. 5, the fine-tuned model for the task of generating target molecules with specified attributes is used: in this generation downstream task, the model generates molecules, and the generated molecules are scored in the molecular screening stage; the scoring criteria are the rationality (Validity) and the novelty (Novelty) of the generated molecules.
During the execution of this downstream task, [bos] is used as the start token, [seq] as the separator, and [seq-len] to control the text length of the generated molecular SMILES; the input format of the molecular attributes is: attribute 1 [seq] attribute 2 [seq] attribute 3. Note in addition that once the model has been fine-tuned, no loss is needed to update the parameters, and each piece of SMILES output is fed directly back into the model as input.
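The input format and the feed-back of outputs at inference time can be sketched as follows. The exact placement of the start token, the number formatting, the `[eos]` stop token, the scripted stand-in for the model and the use of a plain `max_len` argument in place of [seq-len] are all assumptions for illustration; the source names the special tokens but does not give the precise serialization or decoding procedure.

```python
BOS, SEP = "[bos]", "[seq]"


def build_condition(attributes, bos=BOS, sep=SEP):
    """Serialize attribute values as '[bos]attr1[seq]attr2[seq]attr3'."""
    return bos + sep.join(f"{v:g}" for v in attributes)


def generate(prompt_tokens, next_token, max_len):
    """Autoregressive decoding: each output token is fed back as input."""
    out = []
    while len(out) < max_len:
        tok = next_token(prompt_tokens + out)
        if tok == "[eos]":                 # hypothetical end-of-sequence token
            break
        out.append(tok)
    return out


def scripted_next_token(context, prompt_len=1, script=("C", "C", "O", "[eos]")):
    """Hypothetical stand-in for the fine-tuned model: replays a fixed script."""
    return script[len(context) - prompt_len]


prompt = build_condition([2.5, 0.8, 3.1])  # e.g. LogP, QED, SAS targets
smiles_tokens = generate([prompt], scripted_next_token, max_len=10)
```

No loss is computed anywhere in this loop, in contrast to the teacher-forced training phase; the model parameters stay fixed and only the growing output sequence changes from step to step.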
The test benchmarks for this task are:
Test index       Validity   Novelty   Attribute deviation
Expected value   0.87       0.99      ≤ 5%
As shown in FIG. 4, the fine-tuned model for the molecular property prediction task is used:
in this downstream task, fine-tuning is performed by changing the attention arrows and using differently labeled datasets. It should be noted that although both the molecular property prediction model and the molecular generation model are called DPMG, they are essentially two different models, each suited to its own downstream task, and they differ in both structure and parameters. The input of this downstream task is the drug molecule generated from the specified attributes in the preceding downstream task.
In this downstream task, the Transformer acts as an encoder; the encoded vector produced by the Transformer layers is finally converted to text output by a machine-learned classification model, and during training the MSE loss is back-propagated to update the model parameters.
In this downstream task, the model predicts and fills in the attributes of the generated molecules; a prediction deviation of the molecular attributes within 5% is regarded as qualified, and otherwise as unqualified.
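The 5% qualification rule can be written as a short check. The sketch below interprets "deviation" as the relative error between the specified and the predicted attribute value, which is an assumption because the source does not define the deviation formula; the fallback to absolute error for a zero target is likewise an assumed convention, and the function names are hypothetical.

```python
def relative_deviation(target, predicted):
    """Relative error |predicted - target| / |target| (assumed definition)."""
    if target == 0:
        return abs(predicted)          # assumed fallback: absolute error
    return abs(predicted - target) / abs(target)


def is_qualified(targets, predictions, tolerance=0.05):
    """A molecule qualifies if every attribute deviates by at most `tolerance`."""
    return all(
        relative_deviation(t, p) <= tolerance
        for t, p in zip(targets, predictions)
    )


ok = is_qualified([2.0, 0.8], [2.05, 0.79])    # deviations of 2.5% and 1.25%
bad = is_qualified([2.0, 0.8], [2.2, 0.79])    # first attribute deviates by 10%
```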
5. Screening the obtained molecules;
preliminary filtering based on QED and SAscore ensures that the generated molecules possess basic drug-like properties at the initial stage of molecular generation; similarity calculation based on molecular fingerprints helps eliminate structurally redundant molecules with low intellectual-property value. In this way the generated drug molecules are preliminarily screened.
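The two screening criteria can be sketched as a threshold filter followed by a fingerprint-similarity check. In the sketch below, fingerprints are represented as sets of "on" bit indices and compared with the Tanimoto coefficient. The thresholds, the candidate records and the choice to compare each candidate against the molecules already kept are all assumptions for illustration; the source gives neither cut-off values nor the reference set used for the similarity comparison.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def screen(candidates, min_qed=0.5, max_sas=4.0, max_similarity=0.7):
    """Keep drug-like, synthesizable molecules that are not near-duplicates."""
    kept = []
    for cand in candidates:
        if cand["qed"] < min_qed or cand["sas"] > max_sas:
            continue                     # fails the basic drug-likeness filter
        if any(tanimoto(cand["fp"], k["fp"]) > max_similarity for k in kept):
            continue                     # structurally redundant with a kept molecule
        kept.append(cand)
    return kept


# Hypothetical candidates; scores and fingerprint bits are made-up placeholders.
candidates = [
    {"id": "m1", "qed": 0.71, "sas": 2.4, "fp": {1, 4, 9, 12}},
    {"id": "m2", "qed": 0.69, "sas": 2.6, "fp": {1, 4, 9, 12, 20}},  # close to m1
    {"id": "m3", "qed": 0.30, "sas": 2.0, "fp": {2, 3}},             # low QED
    {"id": "m4", "qed": 0.65, "sas": 3.0, "fp": {5, 6, 7}},
]
kept_ids = [c["id"] for c in screen(candidates)]
```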
6. Providing indexes for evaluating molecular generation and quantifying the quality of the model
To screen the generated molecular structures for rationality, the molecules generated by the neural network need to be scored by means of a small neural network. MOSES and GuacaMol are two mainstream tools for scoring generated drug molecules: the former emphasizes testing general drug-like indexes of the molecules generated by the model, such as rationality, novelty and scaffold diversity, while the latter evaluates the multi-objective optimization capability of the model by defining a series of tasks.
Rationality of generated molecules (Validity): the rationality of a molecule refers to whether the structure, properties and design of the molecule are consistent with the expected biological activity or pharmaceutical properties, whether the basic laws of chemistry, biology and pharmacology are met, and whether it corresponds (at least theoretically) to a real molecule. If the molecular structure is given as SMILES, the MolFromSmiles method of the RDKit toolkit is usually used to check whether it can be converted from the SMILES format into an RDKit Mol object; if so, it is a reasonable molecule.
Novelty of generated molecules (Novelty): the novelty of a molecule refers to whether the molecular structure is unique among known libraries of compounds, i.e. whether it is innovative. The way this index is calculated can be set manually according to task requirements.
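The evaluation indexes can be computed with a few lines of code. The sketch below follows one commonly used convention (validity over all generated strings, uniqueness over the valid ones, novelty over the unique valid ones that are absent from the training set); this convention is an assumption, since the source notes that the novelty calculation can be set according to task requirements. The parser is passed in as an argument: with RDKit installed it could be `rdkit.Chem.MolFromSmiles`, which returns `None` for an unparsable SMILES, whereas the `toy_parse` function used here is a hypothetical stand-in that only checks parentheses.

```python
def generation_metrics(generated, training_set, parse):
    """Validity, uniqueness and novelty of a list of generated SMILES strings."""
    valid = [s for s in generated if parse(s) is not None]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


def toy_parse(smiles):
    """Hypothetical parser: rejects strings with unbalanced parentheses."""
    depth = 0
    for ch in smiles:
        depth += ch == "("
        depth -= ch == ")"
        if depth < 0:
            return None
    return smiles if depth == 0 else None


metrics = generation_metrics(
    ["CCO", "CCO", "CC(=O)O", "CC(C"],   # one duplicate, one malformed string
    {"CCO"},                             # hypothetical training set
    toy_parse,
)
```

In practice the SMILES strings would be canonicalized before the uniqueness and novelty comparisons, because the same molecule can be written as several different SMILES strings.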
Table 1. ROC-AUC comparison of models on the molecular property prediction task:
Dataset   Logreg  KernelSVM  XGBoost  IRV    Multitask  GC     Weave  DPMG
HIV       0.702   0.792      0.756    0.737  0.698      0.763  0.703  0.798
BACE      0.781   0.862      0.85     0.838  0.698      0.763  0.703  0.872
BBBP      0.699   0.729      0.696    0.7    0.688      0.69   0.671  0.962
CLINTOX   0.722   0.669      0.799    0.77   0.778      0.807  0.832  0.984
Table 2. Performance comparison of models on the molecular generation task:
Model      Validity  Uniqueness  Novelty
JT-VAE     62%       100%        100%
GCPN       20%       99.97%      100%
MRNN       65%       99.89%      100%
GraphNVP   55%       94.80%      100%
GraphAF    68%       99.10%      100%
DPMG       85.28%    99.91%      100%
As can be seen from Tables 1 and 2 above, the model built according to the invention can generate molecules when only the values of one or more attributes are specified, and it achieves high validity, uniqueness and novelty when generating molecules based on the specified attributes.
The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited thereto, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.

Claims (10)

1. A method for generating molecules based on specified properties, comprising the specific steps of:
S1, collecting drug small-molecule data and preparing a dataset for the molecular generation task;
S2, establishing a molecular generation model based on pre-training and fine-tuning;
S3, fine-tuning the molecular generation model on the dataset prepared in S1 to obtain two different models suited to the downstream molecular generation tasks;
S4, adjusting the DPMG model structure separately for the two downstream tasks, namely molecular attribute prediction and generation of target molecules with specified attributes;
S5, screening the generated molecules;
S6, providing indexes for evaluating molecular generation and quantifying the quality of the model.
2. The method for generating molecules based on specified attributes according to claim 1, wherein the small-molecule structure is represented either as a linear SMILES string or as a two-dimensional undirected cyclic graph;
the drug small-molecule data collected in S1 comprise public data and experimentally produced data, and the collected data reach a scale of tens of millions of records;
the collected drug molecule data are organized into a table comprising the SMILES expression of each molecule, its lipid partition coefficient, its drug-likeness and its synthetic accessibility, and valid molecules are selected by filtering with the RDKit tool.
3. The method for generating molecules based on specified attributes according to claim 2, wherein in S1, SMILES in text form is used as the input data of the model;
the collected SMILES expressions are accompanied by further properties of the corresponding molecules, including solubility and permeability estimates, molecular mass, topological polar surface area, number of hydrogen bond donors and acceptors, number of structural alerts, and number of rotatable bonds;
wherein molecules are generated according to three molecular attributes, namely the lipid partition coefficient, the quantitative estimate of drug-likeness and the synthetic accessibility, the target being a molecule whose values for these three attributes equal the specified values;
or one or more of the six attributes of solubility and permeability estimates, molecular mass, topological polar surface area, number of hydrogen bond donors and acceptors, number of structural alerts, and number of rotatable bonds are specified in combination.
4. A method of generating molecules based on specified properties according to claim 1, wherein in S1 different datasets are created for different properties of the same molecular structure.
5. The method for generating molecules based on specified attributes according to claim 1, wherein in the pre-training process of S3, 12 Transformer layers, a hidden size of 768 and 12 attention heads are used to encode the text data within a single shared semantic space;
wherein the text sequence information and the molecular structure information are learned by specifying the direction of attention (the attention arrows of the model).
6. The method for generating molecules based on specified attributes according to claim 1, wherein in S4, in the molecular attribute prediction process, once molecule generation is finished the generated molecules are input into the DPMG model, which predicts their remaining attributes so that all attributes of each molecule are filled in; the results are compared with the previously specified attributes, and the deviation of the prediction is calculated;
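in the molecular attribute prediction process, a regression model is used as the final stage to obtain a numerical value, and the mean squared error (MSE) is used as the loss function;
for the mean squared error loss, the general formula is:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
in the above formula, $y_i$ is the true measured value and $\hat{y}_i$ is the model's predicted value; letting $\hat{y}_i = a x_i + b$ in the regression model, the loss function is expressed as:
$$L(a, b) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (a x_i + b)\right)^2$$
the training objective is to find the values of $a$ and $b$ that minimize the following function:
$$(a^{*}, b^{*}) = \arg\min_{a,\,b}\ \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (a x_i + b)\right)^2$$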
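As a hedged numerical illustration of the objective in claim 6, the sketch below computes the mean squared error and finds the parameters a and b of a one-dimensional linear model by the closed-form least-squares solution. The linear form (prediction = a·x + b) is assumed from the claim's reference to the values of a and b; the function names and the toy data are hypothetical, and a real regression head would be trained by gradient descent on many features.

```python
def mse(y_true, y_pred):
    """Mean squared error between measured and predicted values."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n


def fit_line(xs, ys):
    """Return (a, b) minimizing sum_i (y_i - (a * x_i + b)) ** 2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx            # slope
    b = mean_y - a * mean_x  # intercept
    return a, b


# Toy data lying exactly on y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

a, b = fit_line(xs, ys)
loss = mse(ys, [a * x + b for x in xs])
```

Because the toy points lie exactly on a line, the fitted parameters recover the slope 2 and intercept 1 and the loss is zero; with noisy property measurements the minimum loss would be positive.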
7. The method for generating molecules based on specified attributes according to claim 1, wherein in the step of generating target molecules with specified attributes in S4, drug molecules meeting the requirements are generated according to the given attribute parameters;
the numerical values of the molecular attributes are input into an encoder, and the word embeddings obtained from the encoder are input into a decoder to obtain an output SMILES file;
a teacher model is introduced in the output process: each time an atom is generated, the SMILES portion generated so far is compared with a correct SMILES molecular file serving as a reference, thereby fine-tuning the trained model;
after each atom is compared with the reference SMILES file, the reference SMILES file is used as the precondition for generating the next atom; each time an atom is generated, a loss value is calculated through the comparison, and the parameters are fine-tuned through back propagation.
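The procedure in claim 7 resembles what is commonly called teacher forcing: at every step the decoder is conditioned on the reference prefix rather than on its own earlier output, and a loss is computed against the reference token. The following is a minimal sketch under that assumption. The `next_token_probs` interface, the toy vocabulary and the uniform stand-in model are all hypothetical, and the back-propagation step is omitted because it depends on the training framework.

```python
import math


def teacher_forced_loss(next_token_probs, reference):
    """Average negative log-likelihood of `reference` under teacher forcing.

    next_token_probs(prefix) is a hypothetical model interface returning a
    dict that maps each candidate token to its probability given `prefix`.
    """
    total = 0.0
    for t, target in enumerate(reference):
        prefix = reference[:t]            # reference tokens, not model samples
        probs = next_token_probs(prefix)  # distribution over the next token
        total += -math.log(probs[target])
    return total / len(reference)


# Toy vocabulary and a stand-in "model" that ignores the prefix.
VOCAB = ["C", "O", "N", "(", ")", "=", "<eos>"]


def uniform_model(prefix):
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}


reference = ["C", "C", "O", "<eos>"]  # tokens of the SMILES string "CCO"
loss = teacher_forced_loss(uniform_model, reference)
```

For the uniform stand-in the per-token loss equals the logarithm of the vocabulary size; a trained decoder would assign higher probability to the reference tokens and therefore a lower loss.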
8. The method of generating molecules based on specified attributes according to claim 7, wherein binary cross entropy is used as the loss function when calculating the loss:
$$l(\theta) = \sum_{i}\left[y_i \log\theta + (1 - y_i)\log(1 - \theta)\right]$$
the dataset takes the form $\text{data} = \{(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4), \ldots\}$, wherein $x_i$ is the input variable, i.e. a character generated by the model, and $y_i$ is the observed value, i.e. the expected output of the model; here $y$ takes the value 0 or 1, and when the probability of $y = 1$ is $\theta$, i.e. $P_\theta(y = 1) = \theta$, the log-likelihood of the observed data points can be expressed by the above equation, where the likelihood function $l(\theta)$ is the objective function;
if a negative sign is added in front of it, it is converted into a loss function, which is the cross entropy between $y_i$ and $\theta$;
the loss function in cross-entropy form for a single sample is:
$$\text{Loss} = -\left[y_i \log p + (1 - y_i)\log(1 - p)\right]$$
wherein $y_i$ is the observation of the $i$-th sample, and $p$ is the predicted probability.
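The single-sample loss in claim 8 can be checked numerically with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, and the clamping constant `eps` is an added safeguard against taking the logarithm of zero, not something recited in the claim.

```python
import math


def binary_cross_entropy(y, p, eps=1e-12):
    """Loss = -[y * log(p) + (1 - y) * log(1 - p)] for one sample.

    y is the observed label (0 or 1) and p is the predicted probability
    that y == 1. p is clamped to [eps, 1 - eps] so the logs stay finite.
    """
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def mean_bce(ys, ps):
    """Average the single-sample loss over a dataset."""
    return sum(binary_cross_entropy(y, p) for y, p in zip(ys, ps)) / len(ys)


loss_half = binary_cross_entropy(1, 0.5)  # maximally uncertain prediction
loss_good = binary_cross_entropy(1, 0.9)  # confident and correct
loss_bad = binary_cross_entropy(1, 0.1)   # confident and wrong
```

The loss shrinks as the predicted probability approaches the observed label and grows sharply for confident wrong predictions, which is the behaviour that minimizing the negative log-likelihood is meant to produce.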
9. The method of claim 1, wherein in S5 the generated molecules undergo a preliminary basic screening based on QED and SAscore.
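A preliminary screen of the kind recited in claim 9 can be sketched as a simple threshold filter over precomputed scores. QED lies between 0 and 1, with higher values indicating greater drug-likeness, and the synthetic accessibility score runs from roughly 1 (easy to synthesize) to 10 (hard). The thresholds, the `ScoredMolecule` type and the candidate scores below are hypothetical values chosen only for illustration; the claims do not specify cut-offs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredMolecule:
    name: str        # placeholder label, not a real compound
    qed: float       # 0..1, higher means more drug-like
    sa_score: float  # roughly 1 (easy) to 10 (hard to synthesize)


def preliminary_filter(molecules, min_qed=0.5, max_sa=4.0):
    """Keep molecules that are drug-like enough and easy enough to make.

    min_qed and max_sa are assumed thresholds, inclusive at the boundary.
    """
    return [m for m in molecules
            if m.qed >= min_qed and m.sa_score <= max_sa]


# Made-up scores for illustration only.
candidates = [
    ScoredMolecule("candidate_a", qed=0.72, sa_score=2.8),
    ScoredMolecule("candidate_b", qed=0.35, sa_score=2.1),  # QED too low
    ScoredMolecule("candidate_c", qed=0.81, sa_score=6.4),  # too hard to make
    ScoredMolecule("candidate_d", qed=0.50, sa_score=4.0),  # on the boundary
]

kept = [m.name for m in preliminary_filter(candidates)]
```

In practice the scores would come from a cheminformatics toolkit, and the thresholds would be tuned to the project rather than fixed in advance.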
10. The method of claim 1, wherein in S6 the molecular structures are selected by scoring the drug molecules using MOSES and GuacaMol;
wherein,
rationality of generated molecules: the rationality of a molecule refers to whether its structure, properties and design are consistent with the expected biological activity or pharmaceutical properties, whether it obeys the basic laws of chemistry, biology and pharmacology, and whether it corresponds to a real molecule; if the molecular structure is given as SMILES, the MolFromSmiles function of the RDKit toolkit is used to check whether the structure can be converted from the SMILES format into an RDKit molecule (rdmol) object, and if so, the structure is regarded as a reasonable molecule.
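The SMILES check described in claim 10 can be sketched as follows, assuming RDKit is installed. `Chem.MolFromSmiles` returns `None` when a string cannot be parsed, which is the test used here. The helper names `rdkit_is_valid` and `validity_fraction` are hypothetical, and the validator is passed in as a parameter so the counting logic can be exercised without RDKit.

```python
def rdkit_is_valid(smiles: str) -> bool:
    """True when RDKit can parse the SMILES string into a Mol object."""
    # Imported lazily so the module loads even when RDKit is absent.
    from rdkit import Chem
    return Chem.MolFromSmiles(smiles) is not None


def validity_fraction(smiles_list, is_valid=rdkit_is_valid):
    """Fraction of strings accepted by the validator (0.0 for an empty list)."""
    if not smiles_list:
        return 0.0
    accepted = sum(1 for s in smiles_list if is_valid(s))
    return accepted / len(smiles_list)
```

Calling `validity_fraction(generated_smiles)` on a batch of generated strings gives the share that RDKit accepts; a string with an unclosed ring, such as `C1CC`, is expected to be rejected. Passing this parsing check is a necessary condition only and does not by itself show that a molecule is drug-like or synthesizable.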
CN202311238924.0A 2023-09-25 2023-09-25 Method for generating molecules based on specified attributes Pending CN117334271A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311238924.0A CN117334271A (en) 2023-09-25 2023-09-25 Method for generating molecules based on specified attributes

Publications (1)

Publication Number Publication Date
CN117334271A true CN117334271A (en) 2024-01-02

Family

ID=89289519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311238924.0A Pending CN117334271A (en) 2023-09-25 2023-09-25 Method for generating molecules based on specified attributes

Country Status (1)

Country Link
CN (1) CN117334271A (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110459274A (en) * 2019-08-01 2019-11-15 南京邮电大学 Small-molecule drug virtual screening method based on deep transfer learning and application thereof
CN110534164A (en) * 2019-09-26 2019-12-03 广州费米子科技有限责任公司 Drug molecule generation method based on deep learning
CN112071373A (en) * 2020-09-02 2020-12-11 深圳晶泰科技有限公司 Drug molecule screening method and system
CN112270951A (en) * 2020-11-10 2021-01-26 四川大学 Brand-new molecule generation method based on multitask capsule autoencoder neural network
WO2022047677A1 (en) * 2020-09-02 2022-03-10 深圳晶泰科技有限公司 Drug molecule screening method and system
CN114220497A (en) * 2021-12-14 2022-03-22 中国科学院过程工程研究所 Ionic liquid type antibiotic drug property prediction method based on transfer learning and graph neural network and high-throughput screening platform
CN115240787A (en) * 2022-07-26 2022-10-25 四川大学 Brand-new molecule generation method based on deep conditional recurrent neural network
CN115240782A (en) * 2022-06-23 2022-10-25 中国科学院自动化研究所 Drug attribute prediction method, device, electronic device and storage medium
CN115359856A (en) * 2022-07-25 2022-11-18 杭州碳硅智慧科技发展有限公司 Training method and device of molecular generation model
WO2023029351A1 (en) * 2021-08-30 2023-03-09 平安科技(深圳)有限公司 Self-supervised learning-based method, apparatus and device for predicting properties of drug small molecules
CN115881244A (en) * 2022-11-04 2023-03-31 中国药科大学 Drug molecular skeleton replacement and screening method based on deep transfer learning model
CN116779059A (en) * 2023-07-07 2023-09-19 北京迈高材云科技有限公司 Molecular property prediction method based on attention-mechanism transfer learning
CN117275609A (en) * 2023-10-16 2023-12-22 深度感知生物医学科技(广州)有限公司 Molecular design method based on variational autoencoder and Transformer model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HOP P: "Geometric Deep Learning Autonomously Learns Chemical Features That Outperform Those Engineered by Domain Experts", MOLECULAR PHARMACEUTICS, vol. 15, no. 10, 31 December 2018 (2018-12-31), pages 4371 - 4377 *
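刘景陶; 刘映雪: "Principles and Applications of Computer-Aided Drug Design" (计算机辅助药物设计的原理及应用), 科技创新与应用 (Technology Innovation and Application), no. 33, 28 November 2016 (2016-11-28), pages 58 *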

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117594157A (en) * 2024-01-19 2024-02-23 烟台国工智能科技有限公司 Method and device for generating molecules of single system based on reinforcement learning
CN117594157B (en) * 2024-01-19 2024-04-09 烟台国工智能科技有限公司 Method and device for generating molecules of single system based on reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination