CN112116963A - Automated drug design method, system, computing device and computer-readable storage medium - Google Patents
- Publication number: CN112116963A (application CN202011020214.7A)
- Authority: CN (China)
- Prior art keywords: drug design, model, compound, machine learning, l2f2l
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G16C20/50 — Molecular design, e.g. of drugs (G—Physics; G16—Information and communication technology specially adapted for specific application fields; G16C—Computational chemistry; chemoinformatics; computational materials science; G16C20/00—Chemoinformatics)
- G16C20/70 — Machine learning, data mining or chemometrics
Abstract
The invention discloses an automated drug design method, system, computing device and computer-readable storage medium. The method comprises the following steps: decomposing a target lead compound into fragments with synthesizable handles, and sequentially inputting the fragments into a trained drug-design machine learning model for sampling; and reassembling the new fragments output by the drug-design machine learning model to obtain a new lead compound. The automated drug design of the invention greatly improves the validity and uniqueness of molecule generation, and can generate molecules with high novelty, strong synthesizability and strong drug-likeness; it can readily generate molecules in the high-molecular-weight region; once trained with a specific data set, it can be reused across lead-compound generation scenarios for different targets; and it can easily anchor a local substructure of a compound while optimizing the rest.
Description
Technical Field
The invention relates to the field of computer technology, and in particular to an automated drug design method, system, computing device and computer-readable storage medium.
Background
Designing lead compounds with desirable properties is a central task in the drug discovery phase. In fast-follow and me-too drug design scenarios, the traditional process requires collecting a large number of papers and patents; on the basis of medicinal chemists' reading and understanding, a compound that is structurally novel, synthesizable and highly drug-like is designed and then verified through chemical synthesis and biological characterization.
Molecular Generation is an automated drug design approach based on deep generative learning that has developed rapidly in recent years. By having a model learn the SMILES data (compound structures represented as character strings) or Molecular Graph data (atoms and chemical bonds represented as a graph) of input compounds and master their statistical regularities, compounds with new structures are generated automatically, which can greatly improve the efficiency of lead-compound design. Common molecular generation algorithms include Recurrent Neural Networks (RNN), Generative Adversarial Networks (GAN), Variational Autoencoders (VAE), and so on. Whatever the algorithm, a large number of molecular structure examples must be fed to the model, and the neural network must be trained sufficiently to master the essentials of compound structure design.
Referring to fig. 1A and 1B, conventional molecular generation algorithms that take a SMILES string as input generally adopt a Lead-to-Lead (L2L) framework: the training stage takes the structural data of whole lead compounds as input, and the sampling stage does the same, yielding new-structure lead compounds with properties similar to the target molecule. Application models under this framework, such as GENTRL developed by Insilico Medicine, Inc., while achieving compelling success, exhibit several significant problems, including:
The above problems limit the practical application value of the molecular generation model using the L2L framework.
Disclosure of Invention
The present invention provides an automated drug design method comprising: decomposing a target lead compound into fragments with synthesizable handles, and sequentially inputting the fragments into a trained drug-design machine learning model for sampling; and reassembling the new fragments output by the drug-design machine learning model to obtain a new lead compound.
In one embodiment of the invention, the training data set of the drug-design machine learning model is obtained by: filtering the active compounds in the CHEMBL25 data set according to predetermined rules to obtain an initial data set; decomposing each initial compound in the initial data set into fragments with synthesizable handles and de-duplicating them to obtain a set of non-redundant fragments; and amplifying the non-redundant fragments by a predetermined factor using random SMILES generation, with the resulting SMILES strings serving as the training data set.
In one embodiment of the invention, the predetermined rules comprise: the active compound's target belongs to one of the human protein families GPCR A, Hydrolase, Kinase, Ligand-gated Ion Channel, Oxidoreductase, Protease, Transferase, Transporter and Voltage-gated Ion Channel; the activity test type is SINGLE PROTEIN; compounds containing disconnected fragments are removed; compounds with molecular weight greater than 500 are removed; and low-activity compounds with PCHEMBL < 6 are removed.
In one embodiment of the invention, each input SMILES string is converted to a string of fixed length 120: if the SMILES string is shorter than 120 characters, it is padded with blank spaces; if it exceeds 120 characters, it is filtered out. Each character is then converted into a one-hot vector according to the character vocabulary used to encode SMILES strings, so that one SMILES string is finally converted into a 120 × 43 matrix serving as the input of the drug-design machine learning model.
In one embodiment of the invention, the VAE model is based on the open-source Molecular VAE model; molecular structure checking, fragment decomposition and fragment assembly use the RDKit toolkit; the deep learning framework is PyTorch 1.5.1 with CUDA version 10.1.105 on Ubuntu 18.04 LTS, and all computation is performed on a server with 4× GeForce RTX 2080 Ti GPUs.
The present invention also provides an automated drug design system comprising: a drug-design machine learning model; an input module for decomposing a target lead compound into fragments with synthesizable handles and sequentially inputting the fragments into the trained drug-design machine learning model for sampling; and an output module for reassembling the new fragments output by the drug-design machine learning model to obtain a new lead compound.
The invention also provides a computing device comprising a memory and a processor, wherein the memory stores a program, and the processor implements the automatic drug design method when executing the program.
The present invention also provides a computer-readable storage medium having stored thereon a program which, when executed by a processor, implements the above-described automated drug design method.
The automated drug design of the invention greatly improves the validity and uniqueness of molecule generation, and can generate molecules with high novelty, strong synthesizability and strong drug-likeness; it can readily generate molecules in the high-molecular-weight region; once trained with a specific data set, it can be reused across lead-compound generation scenarios for different targets; and it can easily anchor a local substructure of a compound while optimizing the rest.
Drawings
Fig. 1A and 1B are schematic diagrams of model input, training and sampling under the L2L framework of the prior art.
Fig. 2A and 2B are schematic diagrams of model input, training and sampling under the L2F2L framework according to an embodiment of the present invention.
FIGS. 3A and 3B are statistical comparisons of training data set properties under the L2L and L2F2L frameworks; FIG. 3A compares the probability density distributions of SMILES string lengths, and FIG. 3B compares the molecular weight distributions.
Fig. 4 is a diagram of the VAE model structure adopted in the embodiment of the present invention; the model consists of three parts: Encoder, latent space, Decoder.
Fig. 5A and 5B show the VAE model loss function as a function of the number of training epochs under the L2L and L2F2L frameworks, respectively, during training.
FIGS. 6A-F show the distributions of KRAS compounds, of novel and highly novel compounds generated by the L2L model, and of novel and highly novel compounds generated by the L2F2L model, in synthesizability (SA) and drug-likeness (QED) space; the dotted boxes mark the region with SA < 5 and QED > 0.2. The larger the SA value, the more difficult the synthesis; the larger the QED value, the more drug-like the molecule.
FIGS. 7A and 7B show the size distributions of novel molecules obtained by sampling with the L2L-framework and L2F2L-framework VAE models; FIG. 7A shows the SMILES string length distribution and FIG. 7B the molecular weight distribution.
FIG. 8 shows a comparison of the L2L and L2F2L models sampling with a fixed local structure. The L2F2L model can exclude the α,β-unsaturated amide from sampling, thereby preserving this partial structure in the newly designed molecules; the L2L model cannot do this.
FIG. 9 is a flow chart of an automated drug design method of an embodiment of the present invention.
Figure 10 is a block diagram of an automated drug design system according to an embodiment of the present invention.
FIG. 11 is an internal block diagram of a computing device of an embodiment of the invention.
Detailed Description
The inventors believe that, given limited training data and computing power, having a machine model correctly output a grammatically valid SMILES string of dozens or even hundreds of characters (i.e., a complete lead compound) is a very difficult task. Even a huge generative model such as GPT-3 — trained on nearly 45 TB of corpus with 175 billion parameters on a high-performance computing platform — still finds it extremely difficult to generate a text paragraph whose characters, grammar, content and logic are all correct. Therefore, addressing the problems of the prior art and drawing on medicinal chemists' practical experience in drug design, this application takes a new approach and proposes a new molecular generation algorithm framework that takes molecular fragments as input, called Lead-to-Fragment-to-Lead (L2F2L). The difference from the L2L framework, referring to FIGS. 2A and 2B, is that during the training stage the whole is broken up: the lead compound is decomposed by the BRICS algorithm into fragments with synthesizable handles (fragments that can be synthesized from existing chemical intermediates through simple reactions; BRICS cuts bonds that are retrosynthetically cleavable and returns a list of fragments whose dummy-atom numbers correspond to specific reaction types), and the VAE model learns correct syntax from these shorter SMILES strings. In the sampling stage, the target compound is likewise decomposed into fragments, the fragments are sequentially input into the trained VAE model for sampling, and finally the new fragments are reassembled into a lead compound with a new structure.
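The BRICS decomposition step described above can be sketched with RDKit, the toolkit this application itself uses. Aspirin is used here only as a stand-in input molecule; it is not one of the patent's KRAS leads:

```python
from rdkit import Chem
from rdkit.Chem import BRICS

# An arbitrary stand-in lead compound (aspirin).
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# BRICSDecompose returns fragment SMILES; dummy atoms such as [3*] mark the
# broken bonds, and their numbers encode the BRICS reaction type that would
# be used to reconnect (or substitute) the fragments.
fragments = sorted(BRICS.BRICSDecompose(mol))
print(fragments)

# Every fragment is itself a valid SMILES string — these shorter strings are
# what the L2F2L framework feeds to the VAE instead of whole-molecule SMILES.
```

The reassembly direction (`BRICS.BRICSBuild`) takes such fragments and enumerates recombined molecules, which corresponds to the "reassembling the new fragments" step of the method.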
In practice, the L2F2L model exhibits significant advantages over the L2L model: (1) greatly improved validity and uniqueness of molecule generation; (2) generation of molecules with high novelty, strong synthesizability and strong drug-likeness; (3) easy generation of molecules in the high-molecular-weight region; (4) reusability across lead-compound generation scenarios for different targets after a single training on a specific data set; (5) easy anchoring of a local substructure of a compound while the rest is optimized.
Method, material, data set
The application used the CHEMBL25 data set and filtered the active compounds as follows: (1) the active compound's target belongs to one of the human protein families GPCR A, Hydrolase, Kinase, Ligand-gated Ion Channel, Oxidoreductase, Protease, Transferase, Transporter and Voltage-gated Ion Channel; (2) the activity test type is SINGLE PROTEIN; (3) compounds containing disconnected fragments (e.g., Na+, Cl−, OH− counterions) are removed; (4) compounds with molecular weight greater than 500 are removed; (5) low-activity compounds are removed (PCHEMBL < 6; PCHEMBL is the value in the ChEMBL database that characterizes compound activity). This yielded 153,498 highly active compounds as the initial data set, named CHEMBL25L.
Under the L2L framework, random SMILES generation was used to expand CHEMBL25L tenfold, giving 1,534,980 SMILES strings as the training data set. Under the L2F2L framework, each initial compound in CHEMBL25L was decomposed into fragments, which were de-duplicated to yield 22,581 non-redundant fragments; these were likewise amplified tenfold by random SMILES generation to 225,810 SMILES strings as the training data set. The SMILES string length and molecular weight distributions of the training data differ markedly between the two frameworks: the SMILES strings of the L2L training set are significantly longer than those of the L2F2L training set (FIG. 3A), and the compound molecular weights of the L2L training set are also significantly higher (FIG. 3B). In addition, the application collected 267 active compounds targeting KRAS, of which 84 have molecular weight below 500 and 183 above 500; these 267 compounds were processed with the same method and added to the training data as needed.
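The tenfold random-SMILES amplification can be sketched with RDKit's `doRandom` option, which emits a different valid atom ordering of the same molecule on each call. The helper name and the oversampling factor are illustrative, not from the patent:

```python
from rdkit import Chem

def augment_smiles(smiles: str, n: int = 10) -> list:
    """Generate up to n distinct random (non-canonical) SMILES for one
    molecule, mirroring the 10x data amplification described above."""
    mol = Chem.MolFromSmiles(smiles)
    out = set()
    for _ in range(n * 5):           # oversample, then deduplicate
        out.add(Chem.MolToSmiles(mol, canonical=False, doRandom=True))
        if len(out) >= n:
            break
    return sorted(out)

variants = augment_smiles("c1ccccc1CC(N)C(=O)O")   # phenylalanine
print(len(variants), "variants")
```

Every variant parses back to the same canonical structure, so the augmentation enlarges the training set without changing its chemistry.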
Each input SMILES string is converted to a string of fixed length 120: if the SMILES string is shorter than 120 characters, it is padded with blank spaces; if it exceeds 120 characters, it is filtered out. Each character is then converted into a one-hot vector, using the characters that make up SMILES strings (e.g., C/N/H/O) as tokens. Finally, one SMILES string is converted into a 120 × 43 matrix as input to the model.
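A minimal sketch of this pad-and-encode step follows. The patent only gives the vocabulary size (43 tokens including the padding space), not the actual character set, so `CHARSET` below is a hypothetical 43-character vocabulary:

```python
import numpy as np

# Hypothetical 43-token SMILES vocabulary; only its size matches the text.
CHARSET = list(" #%()+-./0123456789=@BCFHINOPS[\\]aceilnoprs")
assert len(CHARSET) == 43
INDEX = {c: i for i, c in enumerate(CHARSET)}

def smiles_to_onehot(smiles: str, maxlen: int = 120):
    """Pad to 120 with blank spaces (or drop if longer), then one-hot
    encode each character, giving the 120 x 43 matrix described above."""
    if len(smiles) > maxlen:
        return None                    # strings over 120 chars are filtered out
    padded = smiles.ljust(maxlen)      # pad with blank spaces
    mat = np.zeros((maxlen, len(CHARSET)), dtype=np.float32)
    for row, ch in enumerate(padded):
        mat[row, INDEX[ch]] = 1.0
    return mat

mat = smiles_to_onehot("CC(=O)O")      # acetic acid, well under 120 chars
print(mat.shape)
```

Each of the 120 rows contains exactly one 1, so the matrix is a lossless encoding that the decoder can invert by taking a per-row argmax.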
VAE model structure, loss function and sampling strategy
For a fair comparison, the application uses exactly the same VAE model under both the L2L and L2F2L frameworks (FIG. 4). The model consists of three parts: Encoder, latent space, Decoder. The encoder consists of three convolutional layers plus one linear layer; the mean (μ) and variance (σ) output by the encoder are each represented by a 1 × 292 vector. The decoder consists of a GRU layer plus two linear layers.
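A PyTorch sketch of this architecture is given below. The three-conv/one-linear encoder, the 292-dimensional latent vectors, and the GRU-plus-two-linear decoder come from the description; the specific kernel sizes and hidden widths (9/9/11 kernels, 435 and 501 units, 3 GRU layers) are assumptions taken from the open-source molecular-VAE code the patent says it adapted:

```python
import torch
import torch.nn as nn

class MolecularVAE(nn.Module):
    """Sketch of the shared VAE; hyperparameters are assumptions."""
    def __init__(self, max_len=120, n_chars=43, latent=292):
        super().__init__()
        # Encoder: 3 conv layers over the 120 sequence positions + 1 linear.
        self.conv1 = nn.Conv1d(max_len, 9, kernel_size=9)
        self.conv2 = nn.Conv1d(9, 9, kernel_size=9)
        self.conv3 = nn.Conv1d(9, 10, kernel_size=11)
        conv_out = 10 * (n_chars - 9 + 1 - 9 + 1 - 11 + 1)
        self.enc_fc = nn.Linear(conv_out, 435)
        self.fc_mu = nn.Linear(435, latent)      # 1 x 292 mean vector
        self.fc_logvar = nn.Linear(435, latent)  # 1 x 292 (log-)variance
        # Decoder: one linear, a GRU, and an output linear per time step.
        self.dec_fc = nn.Linear(latent, 292)
        self.gru = nn.GRU(292, 501, num_layers=3, batch_first=True)
        self.out_fc = nn.Linear(501, n_chars)
        self.max_len = max_len

    def encode(self, x):
        h = torch.relu(self.conv1(x))
        h = torch.relu(self.conv2(h))
        h = torch.relu(self.conv3(h))
        h = torch.relu(self.enc_fc(h.flatten(1)))
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def decode(self, z):
        h = torch.relu(self.dec_fc(z))
        h = h.unsqueeze(1).repeat(1, self.max_len, 1)  # z fed at every step
        h, _ = self.gru(h)
        return torch.softmax(self.out_fc(h), dim=-1)   # per-position char dist

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar
```

The decoder emits a 120 × 43 matrix of per-position character probabilities, matching the one-hot input format, so a generated SMILES string is read off by a per-row argmax.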
The classical VAE loss function is adopted: the loss consists of a generation (reconstruction) loss and a latent loss (formula 1). The generation loss compares the model's output with its input, reflecting the accuracy of the autoencoder; binary cross-entropy is used as the generation loss. The latent loss reflects the difference between the latent-space distribution and a standard Gaussian distribution; the KL divergence is used here. The model uses Adam as the parameter optimizer.
Loss = Binary cross entropy(output SMILES vector, input SMILES vector) + KL Divergence(latent vector, unit Gaussian)
(formula 1)
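Formula 1 can be written directly in PyTorch; the closed-form KL term below is the standard expression for the divergence between a diagonal Gaussian and the unit Gaussian:

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_x, x, mu, logvar):
    """Formula 1: binary cross-entropy reconstruction loss plus the KL
    divergence between the latent distribution N(mu, sigma^2) and a
    unit Gaussian N(0, I), both summed over the batch."""
    bce = F.binary_cross_entropy(recon_x, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + kld
```

When `mu = 0` and `logvar = 0` the latent distribution already equals the unit Gaussian and the KL term vanishes, leaving only the reconstruction loss.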
There are three sampling strategies for a pre-trained VAE model: (1) decoding random vectors; (2) perturbing known molecules; (3) interpolation. Here, to facilitate comparison of the two model frameworks, the application uses the second strategy, known-molecule perturbation. This also matches the way most human medicinal chemists design drugs, i.e., by modifying existing drug structures.
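Known-molecule perturbation amounts to encoding a molecule once and decoding several noise-jittered copies of its latent vector. The sketch below makes this concrete; `TinyVAE` and the `noise_scale` parameter are stand-ins so the sketch runs on its own, not details from the patent:

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Stand-in model exposing encode()/decode(), just to run the sketch."""
    def __init__(self, dim=8, latent=4):
        super().__init__()
        self.enc = nn.Linear(dim, latent)
        self.dec = nn.Linear(latent, dim)
    def encode(self, x):
        return self.enc(x)
    def decode(self, z):
        return torch.sigmoid(self.dec(z))

def perturb_sample(model, x, noise_scale=0.5, n_samples=10):
    """Strategy (2): encode x, add Gaussian noise to its latent vector,
    and decode each perturbed point into a candidate output."""
    with torch.no_grad():
        z = model.encode(x)                                   # (1, latent)
        zs = z + noise_scale * torch.randn(n_samples, z.shape[-1])
        return model.decode(zs)                               # n candidates

model = TinyVAE()
x = torch.rand(1, 8)
candidates = perturb_sample(model, x)
print(candidates.shape)
```

In the L2F2L setting the same loop runs per fragment, and a fragment can simply be skipped (left unperturbed) to anchor that substructure, as FIG. 8 illustrates.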
Model evaluation
In order to evaluate model performance differences under different frameworks, the present application uses the following indices to evaluate the performance of the model-generated molecules: validity (validity), uniqueness (uniqueness), novelty (novelty), and high novelty (high novelty).
Validity = (# of valid compounds) / (# of samplings)
(formula 2)
Uniqueness = (# of unique compounds) / (# of valid compounds)
(formula 3)
Novelty = (# of novel compounds (Tc < 1.0)) / (# of unique compounds)
(formula 4)
High novelty = (# of highly novel compounds (Tc < 0.4)) / (# of novel compounds (Tc < 1.0))
(formula 5)
Here validity is the proportion of valid generated molecules (i.e., grammatically correct SMILES strings) among the samples drawn (formula 2); uniqueness is the ratio of unique molecules to valid molecules (formula 3); novelty is the ratio of structurally novel molecules (similarity to the known compound used as sampling input less than 1.0) to unique molecules (formula 4); and high novelty is the ratio of highly novel molecules (similarity to the known compound used as sampling input less than 0.4) to novel molecules (formula 5).
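Formulas 2-5 form a simple funnel of successive ratios, as the following sketch shows. The counts used in the example are illustrative values chosen to roughly match one reported L2F2L batch of 2,670 samplings; the exact per-batch counts are not given in the text:

```python
def sampling_metrics(n_sampling, n_valid, n_unique, n_novel, n_high_novel):
    """Formulas 2-5: each metric divides a count by the previous
    stage of the sampling funnel."""
    return {
        "validity": n_valid / n_sampling,        # formula 2
        "uniqueness": n_unique / n_valid,        # formula 3
        "novelty": n_novel / n_unique,           # formula 4
        "high_novelty": n_high_novel / n_novel,  # formula 5
    }

# Illustrative counts approximating a 2,670-sample L2F2L batch.
m = sampling_metrics(2670, 2526, 2525, 2518, 1031)
print({k: round(v, 4) for k, v in m.items()})
```

Because each denominator is the previous numerator, the product of the first three ratios times the sampling count recovers the number of novel molecules directly.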
To evaluate the similarity of the model-generated molecules to the 267 KRAS-targeting active compounds described above, this application calculated Morgan fingerprints of the compounds using RDKit and then measured the similarity between two molecules by the Tanimoto coefficient (Tc) (formula 6).
Tc = N(fp1 ∩ fp2) / N(fp1 ∪ fp2)
(formula 6)
Wherein fp1 is the fingerprint of compound 1 and fp2 is the fingerprint of compound 2.
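The Morgan-fingerprint Tanimoto calculation of formula 6 can be sketched with RDKit as follows; the fingerprint radius and bit width are common defaults, not values stated in the text:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto(smiles1: str, smiles2: str, radius: int = 2, n_bits: int = 2048):
    """Formula 6: Tc = N(fp1 AND fp2) / N(fp1 OR fp2), computed on Morgan
    bit-vector fingerprints (radius/bit-width are assumed defaults)."""
    fp1 = AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(smiles1), radius, nBits=n_bits)
    fp2 = AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(smiles2), radius, nBits=n_bits)
    return DataStructs.TanimotoSimilarity(fp1, fp2)

print(tanimoto("CCO", "CCO"))   # identical molecules give Tc = 1.0
```

A generated molecule counts as novel when its maximum Tc against the sampling inputs is below 1.0, and as highly novel when that maximum is below 0.4.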
To evaluate the synthesizability of model-generated molecules, the Synthetic Accessibility (SA) score is used to quantify how difficult a given compound is to synthesize: the larger the SA value, the greater the difficulty; conversely, the lower the difficulty. In addition, the Quantitative Estimate of Drug-likeness (QED) is used to estimate the drug-likeness of generated molecules.
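The QED side of this evaluation is available directly in RDKit; the SA score lives in RDKit's separate Contrib `sascorer` module, whose import path varies by installation, so only QED is shown here:

```python
from rdkit import Chem
from rdkit.Chem import QED

# Drug-likeness for aspirin as an arbitrary example molecule.
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
score = QED.qed(mol)
print(round(score, 3))
```

QED lies in (0, 1], with larger values meaning more drug-like; the QED > 0.2 cutoff in FIGS. 6A-F marks the "more drug-like" half of the analysis space.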
Software and hardware
The VAE model starting code used in this application is from the open-source Molecular VAE model (https://github.com/topazope/Molecular-VAE), modified on that basis: the SMILES-string-to-one-hot-vector encoding was changed, and the special characters needed to mark fragment connection points were added. Molecular structure checking, fragment decomposition and fragment assembly use the RDKit toolkit. The deep learning framework is PyTorch 1.5.1 with CUDA (Compute Unified Device Architecture, the computing platform from the graphics-card vendor Nvidia) version 10.1.105 on Ubuntu 18.04 LTS. All computation was completed on a server with 4× Nvidia GeForce RTX 2080 Ti GPUs.
Results
1) Both L2F2L and L2L model training converged well
First, using CHEMBL25L + KRAS as the training data set, VAE models were trained under the L2L and L2F2L frameworks respectively, and both converged well after a certain number of iterations. Under the L2L framework, only a small number of epochs are needed to reach stable convergence, with the loss function stabilizing at about 30 (FIG. 5A). Under the L2F2L framework, the model requires more epochs to reach stable convergence, with the loss stabilizing at about 10 (FIG. 5B). Because the L2F2L training data volume is small — about one tenth that of the L2L model — the time required for the two to converge is similar. To further compare their molecule-generation capabilities, the application used the KRAS data set as the initial molecular input to obtain new-structure compounds by molecular perturbation sampling.
2) Molecules generated by the L2F2L model have better validity, uniqueness and novelty
Molecular perturbation sampling was performed with the trained L2L and L2F2L models, using the 267 compounds of the KRAS data set as input. To eliminate random-factor interference, 3 independent sampling batches were run for each model; in each batch, 10 samples were drawn per input molecule, for 2,670 samples in total. The L2F2L model greatly surpassed the L2L model on every sampling performance indicator (Table 1): its sampling validity is 94.59 ± 0.37%, a 2080% improvement over the L2L model; its uniqueness is 99.95 ± 0.02%, a 62% improvement; its novelty is 99.76 ± 0.04%, a 41% improvement; and its high novelty is 40.93 ± 1.26%, a 265% improvement (Table 1). After 2,670 samplings the L2F2L model yielded 2518.33 ± 10.60 novel-structure molecules on average, versus only 50.67 ± 3.79 for the L2L model. By this last index, the L2F2L model's efficiency at generating new molecules is improved nearly 50-fold over the L2L model.
TABLE 1 Sampling performance comparison of the L2L and L2F2L models trained on the CHEMBL + KRAS dataset.
3) The L2F2L model generates highly novel molecules with synthesizability and druggability
Taking the molecules generated in the first sampling batch as an example, the present application further analyzes the synthesizability and druggability of the highly novel molecules generated by the L2L and L2F2L models. Highly novel molecules are those whose similarity (Tanimoto coefficient, Tc) to any known compound is less than 0.4; such compounds are more innovative. Finding compounds with synthesizability and druggability among the highly novel molecules produced by an AI generative model is the ultimate goal of AI automated drug design. The present application therefore analyzes the distribution of the KRAS compounds, as well as of the novel and highly novel compounds generated by the L2L and L2F2L models, in synthetic accessibility (SA) and drug-likeness (QED) space. As shown in FIGS. 6A-F, the KRAS compounds are mostly distributed in the range SA < 5 and QED > 0.2 (FIG. 6B), i.e., the region of "easier to synthesize" and "more drug-like" chemical space. Among the highly novel molecules (Tc < 0.4) generated by the L2L model, only 2 compounds fall in this region, accounting for 4% of all novel molecules (FIG. 6E). In contrast, 330 of the highly novel molecules generated by the L2F2L model fall in this region (FIG. 6F), accounting for 13% of all the molecules generated. This analysis shows that the L2F2L model not only has higher molecular generation efficiency, but is also more likely to produce high-value designed molecules that are highly novel, easy to synthesize (smaller SA), and more drug-like (larger QED).
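The SA < 5 and QED > 0.2 filter described above can be sketched with RDKit's built-in QED score. This is an illustrative sketch only: the patent does not specify its scoring code, and the SA score (normally computed with RDKit's Contrib `sascorer`, not imported here) is assumed to be supplied by the caller.

```python
from rdkit import Chem
from rdkit.Chem import QED


def is_high_value(smiles, sa_score, sa_max=5.0, qed_min=0.2):
    """Return True if the molecule falls in the 'easier to synthesize,
    more drug-like' region (SA < 5 and QED > 0.2) used in FIGS. 6A-F.

    sa_score: synthetic-accessibility score supplied externally
              (an assumption; e.g. from RDKit Contrib's sascorer).
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:          # invalid SMILES can never be high-value
        return False
    return sa_score < sa_max and QED.qed(mol) > qed_min
```

Applied to each highly novel molecule, this filter yields the counts compared in FIGS. 6E-F (2 for L2L vs. 330 for L2F2L).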
4) The L2F2L model can generate novel molecules of high molecular weight
The inability to generate novel molecules of high molecular weight (MW > 500) is an inherent problem of the L2L model, mainly because the VAE model cannot correctly output long SMILES strings. Since the L2F2L model shortens the strings the VAE must output (from whole molecules to fragments), it can solve the problem of generating high-molecular-weight compounds. With the KRAS dataset as the perturbation input, and referring to FIGS. 7A-B, the present application analyzes the SMILES string lengths and molecular weight distributions of the novel molecules generated by the two models. The L2L model cannot output new structural molecules with SMILES lengths and molecular weights comparable to the KRAS dataset; in contrast, the L2F2L model outputs new structural molecules whose SMILES length and molecular weight distributions are similar to those of the KRAS dataset. These results indicate that the L2F2L model can produce novel molecules of high molecular weight.
5) The L2F2L model is reusable across targets
Under the L2L framework, to ensure sampling performance, a corresponding active-compound dataset must be added to a background dataset (e.g., CHEMBL, ZINC) for each new target, and the model must be retrained. This lack of cross-target reusability makes the L2L model cumbersome to use. From the principles of medicinal chemistry, a considerable number of fragments are shared between different drugs, which suggests that the L2F2L model can achieve cross-target reuse. To demonstrate this, the present application removed the KRAS dataset from the training data and trained the L2L and L2F2L models again. Molecular perturbation sampling was then performed with the 267 compounds in the KRAS dataset as input. As before, to eliminate interference from random factors, 3 independent sampling batches were run for each model; in each batch, 10 samples were drawn for each input molecule, for a total of 2,670 samples.
TABLE 2 Sampling performance comparison of the L2L and L2F2L models trained on the CHEMBL dataset.
In this experiment, the sampling performance indicators of the L2F2L model again greatly surpass those of the L2L model (Table 2). The sampling validity of the L2F2L model is 68.28 ± 0.21%, a 4,700% improvement over the L2L model; its uniqueness is 99.84 ± 0.17%, a 16% improvement; its novelty is 99.43 ± 0.03%, a 36% improvement; and its high novelty is 33.56 ± 1.91%, comparable to that of the L2L model. After 2,670 samplings each, the L2F2L model yielded 1,809.67 ± 5.51 molecules with novel structures, whereas the L2L model yielded only 24.00 ± 2.65.
Compared with the L2F2L model trained on the CHEMBL + KRAS dataset, the L2F2L model trained on the CHEMBL dataset alone shows only reduced validity (from 94.59 ± 0.37% to 68.28 ± 0.21%); its uniqueness and novelty are comparable, and its high novelty is slightly reduced (from 40.93 ± 1.26% to 33.56 ± 1.91%). In terms of the number of novel molecules ultimately produced, the L2F2L model trained on the CHEMBL dataset alone yielded on average 1,809.67 ± 5.51 new molecules, 28% fewer than the L2F2L model trained on the CHEMBL + KRAS dataset. These results indicate that the L2F2L model has cross-target reusability.
6) The L2F2L model supports local sampling optimization of a compound
In the drug design process, the following scenario is often encountered: part of the target compound is kept unchanged while the remaining parts are modified. This requirement typically arises in Hit-to-Lead or Lead Optimization work. Under the L2L framework, this requirement cannot be met, because there is no way to control which part of the SMILES string is preserved during sampling. Under the L2F2L framework, however, the model's sampling unit is the fragment, so the requirement is easily satisfied: the fixed fragments are left untouched in the sampling phase, while the remaining fragments are perturbed by sampling.
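The fixed-fragment scheme above reduces to skipping the perturbation step for protected fragments. A minimal sketch, assuming the actual VAE perturbation is wrapped in a caller-supplied `perturb` callable (a hypothetical stand-in; the patent's sampler is not reproduced here):

```python
def perturb_fragments(fragments, fixed, perturb):
    """Sample a new fragment sequence while preserving protected fragments.

    fragments: ordered fragment SMILES of the decomposed lead compound
    fixed:     set of fragments that must survive unchanged
               (e.g. a covalent warhead such as an alpha,beta-unsaturated amide)
    perturb:   callable fragment -> new fragment, e.g. a VAE latent-space
               perturbation (hypothetical; stubbed by the caller)
    """
    return [f if f in fixed else perturb(f) for f in fragments]
```

With a toy perturbation such as `str.lower`, `perturb_fragments(["C=CC(=O)N", "CCN"], {"C=CC(=O)N"}, str.lower)` leaves the warhead fragment intact and perturbs only the rest.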
Referring to FIG. 8, taking KRAS compound design as an example: for the KRAS G12C mutation, the designed compound needs to carry a conserved group, an α,β-unsaturated amide (C=CC(=O)N), in order to form a covalent bond with the mutated cysteine. During sampling with the L2L model, this group is easily destroyed because no local part of the molecule can be fixed; during sampling with the L2F2L model, the fragment containing the group is fixed and left unperturbed, yielding new structural molecules that satisfy the requirement. These experimental results indicate that the L2F2L model has the ability to sample compounds locally.
The application of Artificial Intelligence (AI) technology to new drug discovery has been a focus of the pharmaceutical industry in recent years, and molecular generative models have been one of the centers of attention. A generative model constructs a latent space by converting SMILES strings, represented in a high-dimensional space, into low-dimensional vectors; vectors sampled from the latent space can be decoded to recover SMILES strings in the high-dimensional space. The potential of this technology was first demonstrated in a paper published by Merk et al. in 2018: by training on more than 500,000 biologically active compounds from the ChEMBL database, they generated potential agonists of the RXR or PPAR receptors. Of the 5 compounds synthesized, 2 were PPAR agonists and 2 others were PPAR/RXR dual agonists, while the 5th compound was inactive. In 2019, Zhavoronkov et al. published a study in Nature Biotechnology in which the authors trained a DDR1 inhibitor generative model using compounds from the literature and patents. Of the 6 molecules synthesized, 4 were active; the best compound (compound 1) had an enzymatic IC50 of 10 nM and a cellular activity of 10.3 nM, with reasonably suitable drug metabolism properties.
However, existing generative models have exposed some obvious shortcomings in practical application. Walters and Murcko point out that the biggest problem with current generative models is the lack of novelty of AI-designed molecules: judging from the compounds reported to date, they are too similar to known compounds or to compounds in the training datasets, and their modifications are too simple to compete with those of human medicinal chemists. The present application regards the generative model of the existing Lead-to-Lead (L2L) framework as essentially a language model, with no fundamental difference from any Natural Language Processing (NLP) model. Considering that the molecular weight of a lead compound is between 300 and 500, the average length of the corresponding SMILES string is about 50 characters. Given limited training data and computation, it is clearly very challenging for a machine model to continuously output 50 or more characters that combine into a valid, synthesizable SMILES string. Because of this, the molecules generated by the L2L model can only be slight modifications of the training compounds within a short span; after larger modifications, it cannot be guaranteed that the SMILES string still corresponds to a valid (structurally correct) compound, let alone that it is synthesizable and drug-like.
The present application proposes a Lead-to-Fragment-to-Lead (L2F2L) strategy: molecules are decomposed into fragments, the fragments are sampled and innovated, and the fragments are then recombined. Referring to FIG. 9, an automated drug design method provided by an embodiment of the present invention includes: decomposing a target lead compound into fragments with synthesizable modules, and inputting the fragments in sequence into a trained drug design machine learning model for sampling; and reassembling the new fragments output by the drug design machine learning model to obtain a new lead compound. Since the average length of the SMILES strings corresponding to fragments is about 20 characters, the difficulty for the generative model is reduced in both training and sampling; correspondingly, the validity, uniqueness, and novelty of the molecules generated by the L2F2L model are greatly improved over those of the L2L model.
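The decompose-then-recombine pipeline can be sketched with RDKit, which claim 6 names for fragment decomposition and assembly. BRICS is used here as one plausible decomposition scheme (an assumption: the patent does not name the exact rule set), aspirin stands in as a toy lead compound, and the VAE perturbation step between decomposition and recombination is omitted.

```python
from rdkit import Chem
from rdkit.Chem import BRICS

# Toy lead compound (aspirin); a real run would use e.g. a KRAS inhibitor.
lead = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# Step 1: decompose into fragments whose dummy atoms mark synthesizable
# attachment points (BRICS rules; an assumed stand-in for the patent's scheme).
fragments = sorted(BRICS.BRICSDecompose(lead))

# Step 2 (omitted here): perturb each fragment with the trained VAE sampler.

# Step 3: recombine fragments into candidate molecules at compatible
# attachment points; BRICSBuild is a generator, so take a handful.
frag_mols = [Chem.MolFromSmiles(f) for f in fragments]
candidates = []
for i, mol in enumerate(BRICS.BRICSBuild(frag_mols)):
    if i >= 5:
        break
    mol.UpdatePropertyCache(strict=False)
    candidates.append(Chem.MolToSmiles(mol))
```

In the full method, step 2 is where novelty enters: each non-fixed fragment is encoded, perturbed in latent space, and decoded before reassembly.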
Another benefit of the L2F2L model is that the generative model's ability to sample chemical space far from the training compounds, i.e., to produce highly novel molecules, is greatly enhanced. In the KRAS case presented in this application, after 10 samples per prototype molecule, the L2F2L model yielded 330 highly novel compounds with high drug-forming potential, whereas the L2L model yielded only 2, a nearly 165-fold increase in relative efficiency. For the L2F2L generative model, such an improvement is of great significance for drug design in Fast-follow and Me-too scenarios.
In summary, the present application proposes a new molecular generative model framework, L2F2L, to improve on the currently used L2L framework. The L2F2L model has the following advantages: (1) excellent molecular sampling validity, uniqueness, and novelty; (2) it can generate a considerable number of highly novel molecules with strong synthesizability and druggability; (3) it can generate high-molecular-weight molecules; (4) it can be reused across targets; (5) it can fix the local structure of a compound and perform sampling optimization on the remainder, making it suitable for Hit-to-Lead and Lead Optimization design scenarios.
The automated design of lead compounds that are structurally novel, synthesizable, and drug-like using AI molecular generative models is of great significance for the intelligentization of the pharmaceutical industry. Existing molecular generative models are mainly based on the Lead-to-Lead (L2L) framework, which generates new lead compound SMILES strings by latent-space sampling after learning the SMILES strings of lead compounds. However, because the SMILES strings of lead compounds are long, the L2L model suffers from low training and sampling performance, low novelty of generated molecules, inability to generate high-molecular-weight molecules, inability to be reused across targets, and inability to sample local structures, all of which limit its application value. The present application proposes a new generative model framework, Lead-to-Fragment-to-Lead (L2F2L): the generative model designs new structural lead compounds by learning and sampling compound fragments with shorter SMILES strings. In the KRAS inhibitor case, the sampling validity of the L2F2L model is 94.59 ± 0.37%, the uniqueness 99.95 ± 0.02%, the novelty 99.76 ± 0.04%, and the high novelty 40.93 ± 1.26%, all greatly improved over the L2L model. The L2F2L model can generate a considerable number of highly novel molecules with synthesizability and druggability; it enables cross-target reuse; and it allows local modification and innovation of compounds. The L2F2L framework therefore has clear advantages over the L2L framework and has potential application value in scenarios such as lead compound design, Hit-to-Lead, and Lead Optimization. The molecular generative model designed in this application has the potential to generate lead compounds that are structurally novel, synthesizable, and drug-like.
Referring to FIG. 10, an automated drug design system according to an embodiment of the present invention comprises: a drug design machine learning model; an input module for decomposing a target lead compound into fragments with synthesizable modules and inputting the fragments in sequence into the trained drug design machine learning model for sampling; and an output module for reassembling the new fragments output by the drug design machine learning model to obtain a new lead compound.
The automated drug design method of embodiments of the present invention may be implemented in a computing device. An exemplary internal structure diagram of a computing device may be shown in fig. 11, which may include a processor, memory, an external interface, a display, and an input device connected by a system bus. Wherein the processor is configured to provide computational and control capabilities. The memory includes a nonvolatile storage medium, an internal memory. The nonvolatile storage medium stores an operating system, an application program, a database, and the like. The internal memory provides an environment for the operation of the operating system and programs in the nonvolatile storage medium. The external interface includes, for example, a network interface for communicating with an external terminal through a network connection. The external interface may also include a USB interface, etc. The display of the computing device may be a liquid crystal display or an electronic ink display, and the input device may be a touch layer covered on the display, or may be, for example, a key, a trackball, or a touch pad arranged on a casing of the computing device, or may be an external keyboard, a touch pad, or a mouse.
A program stored on a non-volatile storage medium in a computing device, when executed by a processor, may implement the automated drug design method described above. Alternatively, the non-volatile storage medium may exist in a separate physical form, such as a USB flash drive; when it is connected to a processor, the program stored on it is executed to implement the automated drug design method described above. The method of the invention may also be implemented as an app distributed through the Apple or Android application markets, which users download to their mobile terminals to run.
Those skilled in the art will appreciate that the architecture shown in FIG. 11 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects may be applied, and that a particular computing device may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
As described above, it can be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by the related hardware instructed by the computer program, which can be stored in a non-volatile computer readable storage medium, and when executed, the computer program can include the processes of the above embodiments of the methods. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The computer according to the present invention is a computing device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance, and the hardware thereof may include at least one memory, at least one processor, and at least one communication bus. Wherein the communication bus is used for realizing connection communication among the elements. The processor may include, but is not limited to, a microprocessor. The computer hardware may also include Application-Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs), Digital Signal Processors (DSPs), embedded devices, etc. The computer may also include a network device and/or a user device. Wherein the network device includes, but is not limited to, a single network server, a server group consisting of a plurality of network servers, or a Cloud Computing based cloud consisting of a large number of hosts or network servers, wherein Cloud Computing is one form of distributed computing: a super virtual computer consisting of a collection of loosely coupled computers.
The computing device may be, but is not limited to, any terminal such as a personal computer, a server, etc. capable of human-computer interaction with a user through a keyboard, a touch pad, a voice control device, etc. The computing device herein may also include a mobile terminal, which may be, but is not limited to, any electronic device capable of human-computer interaction with a user through a keyboard, a touch pad, or a voice control device, for example, a tablet computer, a smart phone, a Personal Digital Assistant (PDA), a smart wearable device, and other terminals. The Network in which the computing device is located includes, but is not limited to, the internet, a wide area Network, a metropolitan area Network, a local area Network, a Virtual Private Network (VPN), and the like.
The memory is for storing program code. The Memory may be a circuit without a physical form and having a Memory function In an integrated circuit, such as a RAM (Random-Access Memory), a fifo (First In First out), and the like. Alternatively, the memory may be a memory in a physical form, such as a memory bank, a TF Card (Trans-flash Card), a smart media Card (smart media Card), a secure digital Card (secure digital Card), a flash memory Card (flash Card), and so on.
The processor may include one or more microprocessors, digital processors. The processor may call program code stored in the memory to perform the associated functions. For example, the various modules depicted in fig. 10 are program code stored in the memory and executed by the processor to implement the above-described methods. The processor is also called a Central Processing Unit (CPU), and may be an ultra-large scale integrated circuit, which is an operation Core (Core) and a Control Core (Control Unit).
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implementing, for example, a plurality of units or elements may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the claims. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (9)
1. An automated drug design method, comprising:
decomposing a target lead compound into fragments with synthesizable modules, and inputting the fragments in sequence into a trained drug design machine learning model for sampling;
and reassembling the new fragments output by the drug design machine learning model to obtain a new lead compound.
2. The automated drug design method of claim 1, wherein the training dataset of the drug design machine learning model is obtained by:
filtering the active compounds in the CHEMBL25 data set according to a predetermined rule to obtain an initial data set;
decomposing each initial compound in the initial data set into fragments with synthesizable modules and de-duplicating to obtain a plurality of non-redundant fragments;
and augmenting the non-redundant fragments by a preset multiple using a random SMILES generation method, the plurality of SMILES strings obtained after augmentation serving as the training dataset.
3. The automated drug design method of claim 2, wherein the predetermined rules comprise:
the target on which the active compound acts belongs to one of the human protein families GPCR A, Hydrolase, Kinase, Ligand-gated Ion Channel, Oxidoreductase, Protease, Transferase, Transporter, and Voltage-gated Ion Channel;
the activity test type is SINGLE PROTEIN;
removing compounds that contain multiple fragments;
removing compounds with molecular weight greater than 500; and
removing low-activity compounds with pChEMBL < 6.
4. The automated drug design method of claim 2, wherein each input SMILES string for training is converted to a string of fixed length 120: if the SMILES string is shorter than 120 characters, it is padded with spaces; if it exceeds 120 characters, it is filtered out; each character is then converted into a one-hot vector according to the character set used to encode SMILES strings, so that one SMILES string is finally converted into a 120 × 43 matrix serving as the input of the drug design machine learning model.
5. The automated drug design method of claim 1, wherein the drug design machine learning model is a VAE model, the VAE model comprising: an encoder, a latent space, and a decoder; wherein the encoder comprises three convolutional layers and one linear layer, and the mean (μ) and variance (σ) output by the encoder are each represented by a 1 × 292 vector; the decoder comprises one GRU layer and two linear layers.
6. The automated drug design method of claim 5, wherein the VAE model is based on the open-source Molecular VAE model, and molecular structure inspection, fragment decomposition, and fragment assembly use the RDKit toolkit; the deep learning framework is PyTorch 1.5.1 with CUDA version 10.1.105, the operating system is Ubuntu 18.04 LTS, and all computation is performed on a server with 4× GeForce RTX 2080 Ti GPUs.
7. An automated drug design system, comprising:
a drug design machine learning model;
the input module is used for decomposing a target lead compound into fragments with synthesizable modules and sequentially inputting the fragments into the trained drug design machine learning model for sampling;
and the output module is used for reassembling the new segments output by the drug design machine learning model to obtain a new lead compound.
8. A computing device comprising a memory and a processor, the memory storing a program, wherein the processor implements the method of any of claims 1-6 when executing the program.
9. A computer-readable storage medium, on which a program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011020214.7A CN112116963A (en) | 2020-09-24 | 2020-09-24 | Automated drug design method, system, computing device and computer-readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112116963A true CN112116963A (en) | 2020-12-22 |
Family
ID=73801585
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011020214.7A Pending CN112116963A (en) | 2020-09-24 | 2020-09-24 | Automated drug design method, system, computing device and computer-readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112116963A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- 2020-09-24: CN application CN202011020214.7A filed, published as CN112116963A/en, status active Pending
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170161635A1 (en) * | 2015-12-02 | 2017-06-08 | Preferred Networks, Inc. | Generative machine learning systems for drug design |
US20200168302A1 (en) * | 2017-07-20 | 2020-05-28 | The University Of North Carolina At Chapel Hill | Methods, systems and non-transitory computer readable media for automated design of molecules with desired properties using artificial intelligence |
WO2019097014A1 (en) * | 2017-11-16 | 2019-05-23 | Institut Pasteur | Method, device, and computer program for generating protein sequences with autoregressive neural networks |
US20190304568A1 (en) * | 2018-03-30 | 2019-10-03 | Board Of Trustees Of Michigan State University | System and methods for machine learning for drug design and discovery |
US20190325983A1 (en) * | 2018-04-24 | 2019-10-24 | Samsung Electronics Co., Ltd. | Method and system for performing molecular design using machine learning algorithms |
KR20190130446A (en) * | 2018-04-24 | 2019-11-22 | 삼성전자주식회사 | Method and system for performing molecular design using machine learning algorithms |
US20200082916A1 (en) * | 2018-09-06 | 2020-03-12 | Insilico Medicine Hong Kong Limited | Entangled conditional adversarial autoencoder for drug discovery |
CN111128314A (en) * | 2018-10-30 | 2020-05-08 | 深圳市云网拜特科技有限公司 | Drug discovery method and system |
CN111462833A (en) * | 2019-01-20 | 2020-07-28 | 深圳智药信息科技有限公司 | Virtual drug screening method and device, computing equipment and storage medium |
CN110706756A (en) * | 2019-09-03 | 2020-01-17 | 兰州大学 | 3D drug design method for targeting receptor based on artificial intelligence |
CN110534164A (en) * | 2019-09-26 | 2019-12-03 | 广州费米子科技有限责任公司 | Drug molecule generation method based on deep learning |
Non-Patent Citations (2)
Title |
---|
Arús-Pous et al.: "Randomized SMILES strings improve the quality of molecular generative models", J Cheminform, vol. 11, 21 November 2019 (2019-11-21), pages 1-13 * |
Dong Guoqiang et al.: "Research progress in fragment-based drug design methods", Chinese Journal of Medicinal Chemistry, vol. 20, no. 3, 20 June 2010 (2010-06-20), pages 226-232 * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112760831A (en) * | 2020-12-30 | 2021-05-07 | 西安标准工业股份有限公司 | Intelligent piece counting method and system based on sewing equipment |
WO2022188643A1 (en) * | 2021-03-10 | 2022-09-15 | 腾讯科技(深圳)有限公司 | Method and apparatus for reconstructing molecular structure, and device, storage medium and program product |
CN113488116A (en) * | 2021-07-09 | 2021-10-08 | 中国海洋大学 | Drug molecule intelligent generation method based on reinforcement learning and docking |
CN113488116B (en) * | 2021-07-09 | 2023-03-10 | 中国海洋大学 | Drug molecule intelligent generation method based on reinforcement learning and docking |
CN113707234A (en) * | 2021-08-27 | 2021-11-26 | 中南大学 | Lead compound pharmacy optimization method based on machine translation model |
CN113707234B (en) * | 2021-08-27 | 2023-09-05 | 中南大学 | Lead compound patent drug property optimization method based on machine translation model |
CN114171134A (en) * | 2021-11-26 | 2022-03-11 | 北京晶泰科技有限公司 | Molecule generation method, device, equipment and storage medium |
CN114464269A (en) * | 2022-04-07 | 2022-05-10 | 国家超级计算天津中心 | Virtual medicine generation method and device and computer equipment |
CN115762661A (en) * | 2022-11-21 | 2023-03-07 | 苏州沃时数字科技有限公司 | Molecular design and structure optimization method, system, device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112116963A (en) | Automated drug design method, system, computing device and computer-readable storage medium | |
Xu et al. | Deep learning for molecular generation | |
US20230317202A1 (en) | Computer implemented method and system for small molecule drug discovery | |
Horvath et al. | Deep hedging under rough volatility | |
Jin et al. | Predicting new protein conformations from molecular dynamics simulation conformational landscapes and machine learning | |
Cofré et al. | Information entropy production of maximum entropy Markov chains from spike trains | |
CN112966517B (en) | Training method, device, equipment and medium for named entity recognition model | |
McPartlon et al. | An end-to-end deep learning method for protein side-chain packing and inverse folding | |
Lin et al. | PanGu Drug Model: learn a molecule like a human | |
Lee et al. | Generative adversarial networks for de novo molecular design | |
Pang et al. | Deep generative models in de novo drug molecule generation | |
Su et al. | Hierarchical gated recurrent unit with semantic attention for event prediction | |
Yan et al. | Predictive intelligence powered attentional stacking matrix factorization algorithm for the computational drug repositioning | |
Allen et al. | Theoretical and empirical differences between diagonal and full BEKK for risk management | |
Liu et al. | The Seven-League Scheme: Deep learning for large time step Monte Carlo simulations of stochastic differential equations | |
Deng et al. | Towards better opioid antagonists using deep reinforcement learning | |
Chakraborty et al. | Utilizing deep learning to explore chemical space for drug lead optimization | |
Sharma et al. | Drugs–Protein affinity‐score prediction using deep convolutional neural network | |
Nakagawa et al. | GO-GJRSK model with application to higher order risk-based portfolio | |
Suda et al. | Analysis and comparison of bitcoin and S&P 500 market features using HMMs and HSMMs |
Duan et al. | An abstract summarization method combining global topics | |
Engkvist et al. | Molecular De Novo Design Through Deep Generative Models | |
Kibria et al. | Predicting efficacy of drug-carrier nanoparticle designs for cancer treatment: a machine learning-based solution | |
Christov | Spatial entanglement of fermions in one-dimensional quantum dots | |
Piao et al. | SELF-EdiT: Structure-constrained molecular optimisation using SELFIES editing transformer |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||