CN115050428B

CN115050428B - Drug property prediction method and system based on deep learning fusion molecular graph and fingerprint

Info

Publication number: CN115050428B
Application number: CN202210654644.7A
Authority: CN
Inventors: 蔡涵萱; 王领; 巫景行; 李奕锐; 罗海林; 刘政豪
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2022-06-10
Filing date: 2022-06-10
Publication date: 2024-06-14
Anticipated expiration: 2042-06-10
Also published as: CN115050428A

Abstract

The invention discloses a drug property prediction method and system based on deep learning that fuses molecular graphs and fingerprints. The prediction method comprises the following steps: obtaining a targeted data set for each drug property to be predicted; constructing a deep learning model suitable for drug property prediction; selecting a splitting mode according to the requirements of model construction, and splitting the data set into a training set, a validation set and a test set; inputting the data set into the network model, training and updating the network parameters according to the difference between the predictions and the true values on the training set, determining the optimal network parameters according to the best result on the validation set, and evaluating on the test set; determining the optimal hyperparameter combination of the model according to a hyperparameter optimization strategy; and, for each drug property, generating a targeted optimal model for subsequent prediction of the properties of new small-molecule drugs. By incorporating classical molecular fingerprint features, the invention addresses the weakness of deep learning that important features cannot be extracted effectively from small data sets.

Description

Drug property prediction method and system based on deep learning fusion molecular graph and fingerprint

Technical Field

The invention relates to the technical field of deep learning prediction of drug properties, in particular to a drug property prediction method and system based on a deep learning fusion molecular graph and fingerprint.

Background

Cancer is one of the major diseases that currently endanger human health and life. According to the World Cancer Report 2020 issued by the International Agency for Research on Cancer (IARC), an agency of the World Health Organization, there were 18.1 million new cancer cases and 9.6 million cancer deaths worldwide in 2018. Global cancer statistics for 2018 show that China ranks first in the world in both cancer incidence and mortality: of the 18.1 million new cases, China accounts for 3.804 million; of the 9.6 million deaths, China accounts for 2.296 million. Cancer prevention and treatment have become an important public health problem in China. Therefore, successive national major new drug development programs have emphasized the urgent need for anticancer drug development.

From the perspective of traditional drug molecule design, accurate prediction of molecular properties, including physicochemical properties, bioactivity, and ADME/T (absorption, distribution, metabolism, excretion and toxicity) properties, is a fundamental challenge. Since the concept of computer-aided drug design was proposed, quantitative structure-activity (property) relationship (QSAR/QSPR) modeling has become one of the most widely used and well-established computational methods for molecular property prediction. It fits known data with empirical, linear or nonlinear functions to estimate the activity or properties of unfamiliar chemical structures, and the resulting models are then used to predict and design new molecules with desired properties. QSAR/QSPR models, the precursors of artificial intelligence in today's drug development field, could not be widely applied thirty years ago because computing hardware and experimental data were limited. With the continual accumulation of experimental data (such as chemical, biological and pharmacological data) and improvements in hardware, artificial intelligence (AI) and machine learning (ML) algorithms have produced many successful cases in drug development and are regarded as indispensable tools for building QSAR/QSPR models, helping to predict and evaluate the physicochemical, biological and ADME/T characteristics of small molecules rapidly and reliably in drug development practice.

Generally, ML-based QSAR/QSPR modeling depends heavily on a suitable molecular representation. Commonly used molecular representations fall into three main categories: molecular descriptors, molecular fingerprints and molecular graphs. Molecular descriptors and fingerprints are derived from the domain knowledge of human experts and describe the structural, physicochemical and topological characteristics of molecules. Molecular graph representations are typically used in deep learning (DL) based methods: the atoms and bonds of a molecule are treated as nodes and edges, and the combined node and edge information is fed into a deep neural network as the material from which the machine learns. Both traditional ML-based approaches and the DL-based approaches proposed in recent years have produced many successful cases in drug development, but it remains controversial whether graph-based DL models are superior to traditional descriptor-based ML models. Studies report that graph-based DL models can still be limited when data are insufficient. During its development, the present invention hypothesized and verified that the information captured by graph-based and fingerprint-based molecular representations is different and complementary.

Deep learning in pharmaceutical research draws its advantages from data and is also limited by data. Drug development data are hard to accumulate because experimental conditions and standards differ between sources and the data are noisy. After decades of high-throughput screening and combinatorial chemistry, data in the drug development field have begun to reach the threshold of big data; however, owing to the nature of these data, after standardization artificial intelligence in the biomedical industry more often faces a small-data problem. Therefore, in this transitional period for biomedical data, the invention combines the advantages of traditional machine learning and deep learning and the complementary information each captures. Research and validation show that this strategy, proposed here for the first time, achieves higher prediction accuracy than existing algorithms.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a drug property prediction method and a drug property prediction system based on deep learning fusion of a molecular graph and a fingerprint.

The object of the invention can be achieved by the following technical scheme.

The drug property prediction method based on deep learning fusion of a molecular graph and a fingerprint is used for rapid property prediction of small-molecule drugs and comprises the following steps:

1) For each drug property to be predicted, obtaining a targeted data set containing a large amount of small-molecule drug data;

2) Constructing a deep learning model suitable for drug property prediction, wherein the model processes the two kinds of features, the molecular graph and the molecular fingerprint, in separate modules, and finally assembles a fused neural network framework using fully connected layers;

3) Selecting a splitting mode according to the requirements of model construction, and splitting the data set into a training set, a validation set and a test set;

4) Inputting the data set into the network model, training and updating the network parameters according to the difference between the predictions and the true values on the training set, determining the optimal network parameters according to the best result on the validation set, and evaluating on the test set;

5) Determining the optimal hyperparameter combination of the model according to a hyperparameter optimization strategy;

6) Generating, for each drug property, a targeted optimal model for subsequent prediction of the properties of new small-molecule drugs;

7) Providing an interpretability analysis of the generated optimal model for reference in subsequent drug design.

In step 1), obtaining the data set for training specifically includes the following steps:

1-1) for drug property fields in which an industry-accepted classical data set exists, the classical data set is used for model construction;

1-2) for drug property fields in which no industry-accepted classical data set exists, targeted drug activity data are collected from experimental records kept in medicinal chemistry or biology laboratories, from compound activity data provided by medicinal chemistry databases published online, or from databases obtained through other channels, and are used for model construction after preprocessing.

In step 1-2), preprocessing the obtained raw drug activity data set specifically comprises the following steps:

1-2-1) obtaining targeted raw drug activity data from the various sources;

1-2-2) checking for duplicate small drug molecules, and averaging the activity data of repeated molecules;

1-2-3) removing hydrogen ions and salt ions from the small drug molecules, optimizing their structures with a force field, and the like;

1-2-4) for regression tasks, retaining the specific activity values; for classification tasks, labeling small drug molecules as negative or positive according to a specified threshold;

1-2-5) presenting the data set as the simplified molecular-input line-entry system (SMILES) strings of the small drug molecules together with the corresponding target values.
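Steps 1-2-2), 1-2-4) and 1-2-5) can be illustrated with a minimal sketch. It is not the patented implementation: the function name `build_dataset` and the toy records are hypothetical, the SMILES strings are assumed to be canonical already (so identical molecules share one string), and a molecule is assumed to be positive when its averaged activity is at or below the threshold, as is usual for concentration-type measurements such as IC50. Ion removal and force field optimization (step 1-2-3) are not shown.

```python
from collections import defaultdict
from statistics import mean


def build_dataset(records, threshold=None):
    """Deduplicate (smiles, activity) records and attach a target value.

    Assumption: SMILES strings are already canonical, so duplicates of a
    molecule have identical strings. Canonicalization is not shown here.
    """
    grouped = defaultdict(list)
    for smiles, activity in records:
        grouped[smiles].append(activity)

    dataset = []
    for smiles, values in grouped.items():
        avg = mean(values)  # step 1-2-2: average repeated measurements
        if threshold is None:
            target = avg  # regression task: keep the activity value
        else:
            # classification task (assumed direction: lower value = active)
            target = 1 if avg <= threshold else 0
        dataset.append((smiles, target))  # step 1-2-5: SMILES plus target
    return dataset


# Toy records: (SMILES, activity in micromolar). Values are made up.
raw = [("CCO", 4.0), ("CCO", 8.0), ("c1ccccc1", 25.0)]
labeled = build_dataset(raw, threshold=10.0)
regression = build_dataset(raw)
```

With these toy inputs the two ethanol records collapse into a single entry whose averaged activity is compared against the threshold.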

In step 2), the constructed model specifically has the following features:

2-1) the feature extraction part of the model combines two modules, one extracting molecular graph features and one extracting molecular fingerprint features; each extracts features of the small drug molecule and generates a corresponding feature vector;

2-2) the molecular-graph feature extraction module adopts a graph attention network; a molecular graph is constructed from the input SMILES string: the atoms of the molecule are mapped to nodes of the graph and the chemical bonds to edges, and physicochemical properties of the atoms and bonds are calculated and used as the initial feature vectors of the nodes and edges; the attention mechanism attends to the influence between adjacent atoms and iteratively updates the feature vectors of the atoms in the molecule; after the iterative updating is finished, the feature vectors of all atoms are aggregated and output as the feature vector of the molecular graph;

2-3) the molecular-fingerprint feature extraction module adopts several fully connected layers; three different types of molecular fingerprints are generated from the input SMILES string: the substructure-based MACCS FP, the substructure-based PubChem FP, and the pharmacophore-based Pharmacophore ErG FP; the concatenation of the three fingerprints is input into the fully connected network of the module to obtain the feature vector of the molecular fingerprint;

2-4) the model concatenates the feature vectors generated by the two modules and inputs them into several fully connected layers to predict the properties of the small drug molecule and generate the final prediction result.
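The data flow of steps 2-3) and 2-4) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the patented network: the layer sizes, the random weights, the ReLU activation, and the helper names `dense`, `init_layer` and `fused_forward` are all hypothetical, each module is reduced to a single fully connected layer, and the graph feature vector is taken as given.

```python
import math
import random


def dense(vec, weights, bias):
    """One fully connected layer followed by ReLU (activation is assumed)."""
    return [max(0.0, sum(w * x for w, x in zip(row, vec)) + b)
            for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for an n_in -> n_out layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def fused_forward(graph_vec, maccs, pubchem, erg, params):
    # Step 2-3: concatenate the three fingerprints, then embed them.
    fp_input = list(maccs) + list(pubchem) + list(erg)
    fp_vec = dense(fp_input, *params["fp"])
    # Step 2-4: concatenate both feature vectors and map them to one logit.
    fused = list(graph_vec) + fp_vec
    hidden = dense(fused, *params["fusion"])
    w_out, b_out = params["out"]
    logit = sum(w * h for w, h in zip(w_out[0], hidden)) + b_out[0]
    return 1.0 / (1.0 + math.exp(-logit))  # probability for classification


rng = random.Random(0)
# Toy inputs; real fingerprints are far longer bit or count vectors.
maccs, pubchem, erg = [1, 0, 1], [0, 1, 1, 0], [0.3, 0.0]
graph_vec = [0.2, -0.1, 0.5, 0.0]
params = {
    "fp": init_layer(len(maccs) + len(pubchem) + len(erg), 8, rng),
    "fusion": init_layer(len(graph_vec) + 8, 6, rng),
    "out": init_layer(6, 1, rng),
}
prob = fused_forward(graph_vec, maccs, pubchem, erg, params)
```

The point of the sketch is the order of operations: the fingerprints are concatenated before the fingerprint module, and the two module outputs are concatenated before the fusion layers.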

In step 2-2), the molecular-graph feature extraction module of the model extracts the molecular graph features through the following steps:

2-2-1) calculating the physicochemical properties of each atom as the initial feature vector of the corresponding node in the molecular graph; the physicochemical properties specifically include: the atom type (carbon, nitrogen, oxygen or another type, covering elements with atomic numbers up to one hundred, such as carbon, nitrogen, oxygen and fluorine), the number of attached chemical bonds, the charge, the chirality, the number of attached hydrogen atoms, the orbital hybridization, the atomic mass, whether the atom is aromatic, etc.

2-2-2) Calculating the attention between adjacent atoms and iteratively updating the atoms according to the attention, with the expression:

$$e_{ij} = \mathrm{LeakyReLU}\left(a \cdot [W_1 h_i \,\Vert\, W_1 h_j]\right)$$

Wherein $h_i$ and $h_j$ are the feature vectors of adjacent atoms $i$ and $j$ at the current iteration, $W_1$ is a weight matrix, and $\alpha_{ij}$ is the attention weight; $e_{ij}$ is the attention value calculated between adjacent atoms $i$ and $j$; the attention values $e_{ik}$ between atom $i$ and each of its adjacent atoms $k$ are summed so that the attention effect of atom $j$ on atom $i$ can be calculated; before atom $i$ is updated, the attention values of all neighbors of atom $i$ are normalized to obtain $\alpha_{ij}$; the number of attention heads is $K$: the multi-head attention mechanism calculates the attention $K$ times and takes the average of the results to update atom $i$, giving the updated feature vector $h_i'$;

2-2-3) computing the feature vector of the molecular graph, with the expression:

$$h_{\mathrm{mol}} = \frac{1}{N} \sum_{i=1}^{N} h_i'$$

Wherein $N$ is the total number of atoms in the molecule and $h_i'$ is the feature vector of atom $i$ after the iterative update; the feature vectors of all atoms are averaged and used as the feature vector of the molecule.
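The attention update and the averaging readout of steps 2-2-2) and 2-2-3) can be sketched as below. Several points are assumptions rather than statements of the patent: the normalization is taken to be the usual softmax over neighbors, $\alpha_{ij} = \exp(e_{ij}) / \sum_k \exp(e_{ik})$; no activation is applied after aggregation; edge features are ignored; every atom is assumed to have at least one neighbor; and the function names and toy inputs are hypothetical.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def matvec(W, h):
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def attention_head(h, neighbors, W, a):
    """One attention head; returns the updated vector of every atom."""
    Wh = [matvec(W, hi) for hi in h]
    updated = []
    for i, nbrs in enumerate(neighbors):
        # e_ij = LeakyReLU(a . [W h_i || W h_j]) for each neighbor j of i
        e = [leaky_relu(sum(x * y for x, y in zip(a, Wh[i] + Wh[j])))
             for j in nbrs]
        # Assumed softmax normalization over the neighbors of atom i.
        m = max(e)
        exp_e = [math.exp(v - m) for v in e]
        total = sum(exp_e)
        alpha = [v / total for v in exp_e]
        dim = len(Wh[i])
        updated.append([sum(al * Wh[j][d] for al, j in zip(alpha, nbrs))
                        for d in range(dim)])
    return updated


def gat_layer(h, neighbors, heads):
    """Average the outputs of K heads, given as (W, a) pairs."""
    outputs = [attention_head(h, neighbors, W, a) for W, a in heads]
    k = len(outputs)
    return [[sum(out[i][d] for out in outputs) / k
             for d in range(len(outputs[0][i]))]
            for i in range(len(h))]


def readout(h):
    """Molecule vector: mean of the N updated atom vectors."""
    n = len(h)
    return [sum(atom[d] for atom in h) / n for d in range(len(h[0]))]


# Toy three-atom chain 0-1-2 with two-dimensional atom features.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = [[1], [0, 2], [1]]
identity = [[1.0, 0.0], [0.0, 1.0]]
heads = [(identity, [0.0] * 4), (identity, [0.5] * 4)]  # K = 2 heads
mol_vec = readout(gat_layer(h, neighbors, heads))
```

A head whose vector `a` is all zeros assigns equal weight to every neighbor, which makes the update easy to check by hand: atom 1 then becomes the mean of atoms 0 and 2.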

In step 3), splitting the data set required for constructing the model specifically comprises the following:

3-1) the splitting mode and splitting proportion of the model can be customized by the user;

3-2) the built-in splitting modes of the model are random splitting and scaffold splitting; random splitting shuffles the data set and splits it at random; scaffold splitting first calculates the scaffolds of the small drug molecules in the data set and the number of molecules belonging to each scaffold, then assigns scaffolds with few molecules, together with their molecules, to the validation set and the test set in order until those sets contain enough molecules, and assigns all remaining molecules to the training set; scaffold splitting ensures that the molecular scaffolds in the training set, validation set and test set do not overlap, which places higher demands on the predictive ability of the model and helps the model discover drug molecules with novel scaffolds.
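The skeleton-based split described above (commonly called a scaffold split) can be sketched as follows. The sketch assumes that a scaffold identifier has already been computed for every molecule, for example with a cheminformatics toolkit, and that the validation set is filled before the test set; the function name, the fill order and the toy scaffold labels are illustrative choices, not details given in the text.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_valid=0.1, frac_test=0.1):
    """Split molecule indices so that no scaffold spans two sets.

    scaffolds[i] is the (precomputed, assumed) scaffold id of molecule i.
    Returns (train, valid, test) lists of indices.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Scaffolds with the fewest molecules are assigned first.
    ordered = sorted(groups.values(), key=len)
    n = len(scaffolds)
    n_valid, n_test = int(frac_valid * n), int(frac_test * n)

    train, valid, test = [], [], []
    for group in ordered:
        if len(valid) + len(group) <= n_valid:
            valid.extend(group)
        elif len(test) + len(group) <= n_test:
            test.extend(group)
        else:
            train.extend(group)  # everything left over goes to training
    return train, valid, test


# Toy data: scaffold "A" has six molecules, "B" two, "C" and "D" one each.
scaffolds = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
train, valid, test = scaffold_split(scaffolds)
```

Because whole scaffold groups are moved at once, the realized set sizes can fall slightly short of the requested fractions when no remaining group fits.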

In step 5), optimizing the hyperparameters of the model specifically comprises the following:

5-1) six hyperparameters are built into the model: the dropout rate of the molecular-graph module, the number of attention heads of the molecular-graph module, the number of attention iterations of the molecular-graph module, the dropout rate of the molecular-fingerprint module, the feature vector dimension of the molecular-fingerprint module, and the ratio of the molecular graph vector to the molecular fingerprint vector at the input of the fully connected layers of the fusion module;

5-2) performing hyperparameter optimization on the model by Bayesian optimization for 20 rounds, and selecting the group of hyperparameters with the best evaluation score on the test set.

In the step 6), the prediction application of the model specifically includes the following steps:

6-1) generating an optimal prediction model aiming at specific drug properties according to the optimal super-parameter combination screened in the step 5);

6-2) when predicting a drug molecule with unknown properties, loading a corresponding optimal model, and inputting the SMILES format of the molecule into the model to obtain a predicted result of the drug molecule;

6-3) the model supports batch prediction of drug molecules with unknown properties, enabling rapid and efficient assessment of molecular properties.

In step 7), the interpretability analysis of the model specifically comprises the following steps:

7-1) providing two model interpretation functions, fingerprint interpretation and molecular graph interpretation, based on the optimal prediction model for a specific drug property generated in step 6) and on the user's request;

7-2) when the user requests fingerprint interpretation, calculating importance scores for the different fingerprint bits in the model; the higher the score, the greater the role the bit played in building the model, and the intramolecular information represented by the bit is important for designing drug molecules with the specific drug property;

7-3) when the user requests molecular graph interpretation, calculating the attention values of the molecular graph in the model and mapping the attention values of the atoms onto the molecular graph; the higher the attention value of a group of atoms, the greater the role that structure played in building the model, and the structure is important for designing drug molecules with the specific drug property.

A drug property prediction system based on deep learning fusion of a molecular graph and a fingerprint comprises the following modules. A data preprocessing module preprocesses the collected raw data set of chemical molecule activity, so that the model can be applied to data sets constructed for new drug molecule properties. A model construction module models the processed samples with a deep learning model based on the molecular graph and the molecular fingerprint; this model comprises a molecular-graph feature extraction module, a molecular-fingerprint feature extraction module and a fusion module. The molecular-graph feature extraction module adopts a graph attention network and focuses on the influence of the relationships between adjacent atoms on molecular properties. The molecular-fingerprint feature extraction module extracts the influence of molecular structures and pharmacophores on molecular properties from three different types of molecular fingerprints. The fusion module concatenates the feature vectors obtained by the two feature extraction modules and inputs them into a multi-layer fully connected network. A prediction module predicts the properties of new small drug molecules using the optimal model generated by the model construction module, so that the model is applied to the prediction of new drug molecules. An interpretation module performs interpretability analysis on small drug molecules using the optimal model generated by the model construction module, so that the model can provide users with drug design suggestions for specific drug properties.

A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the program, when executed by a processor, implements the drug property prediction method based on deep learning fusion of a molecular graph and a fingerprint and its application.

A computer device comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, characterized in that the processor, when executing the program, implements the drug property prediction method based on deep learning fusion of a molecular graph and a fingerprint and its application.

Compared with the prior art, the invention has the beneficial effects that:

Compared with traditional deep-learning-based methods for predicting the properties of small drug molecules, the invention fuses classical molecular fingerprint features and overcomes the weakness that deep learning cannot effectively extract important features from small data sets. Compared with methods based on traditional machine learning and manually extracted molecular features, the invention exploits the advantages of deep learning: the computer autonomously calculates and extracts structural features from the relationships between atoms in the molecular graph, remedying the shortcomings of manual feature extraction. The invention can be used in property fields where no recognized classical data set exists: it provides a data preprocessing function that collects and processes raw molecular activity data and constructs a sample set usable for modeling. Once the targeted optimal model has been established, the method can predict the specified property of molecules accurately and efficiently, thereby improving the efficiency of drug research and development and accelerating the virtual screening of small-molecule drugs.

Drawings

FIG. 1 is a flow chart of the drug property prediction method based on deep learning fusion of a molecular graph and a fingerprint;

FIG. 2 is a schematic diagram of the preset network structure of the drug property prediction deep learning model fusing a molecular graph and a fingerprint;

FIG. 3 is a schematic diagram of the training process of the drug property prediction deep learning model fusing a molecular graph and a fingerprint;

FIG. 4 is a graph comparing the accuracy of the drug property prediction method based on deep learning fusion of a molecular graph and a fingerprint with that of other artificial-intelligence-based drug property prediction methods;

FIG. 5 is a graph of the results of ablation experiments in which the drug property prediction method uses only the molecular graph part or only the molecular fingerprint part of the neural network;

FIG. 6 is a graph comparing the accuracy of the drug property prediction method with that of several currently leading drug property prediction methods on the classical unbiased data set LIT-PCBA;

FIG. 7 is a graph comparing the model accuracy of the drug property prediction method when different molecular fingerprints are used;

FIG. 8a shows, from the protein experiments on a drug molecule predicted by the method to be positive for CDK9 inhibitory activity, the effect on the expression levels of proteins downstream of the CDK9 pathway;

FIG. 8b is a grayscale statistical analysis of the downstream protein p-RNAPII CTD (Ser2) in the protein experiments on the drug molecule predicted to be positive for CDK9 inhibitory activity;

FIG. 8c is a grayscale statistical analysis of the downstream protein Mcl-1 in the protein experiments on the drug molecule predicted to be positive for CDK9 inhibitory activity;

FIG. 8d is a grayscale statistical analysis of the downstream protein cleaved PARP in the protein experiments on the drug molecule predicted to be positive for CDK9 inhibitory activity;

FIG. 8e shows the results of an apoptosis experiment with the drug molecule predicted to be positive for CDK9 inhibitory activity on MOLM-13 cells containing the CDK9 target;

FIG. 8f shows the results of an apoptosis experiment with pyrithione as a control on MOLM-13 cells containing the CDK9 target, together with its quantitative analysis;

FIG. 9a is an interpretability analysis and verification graph for a small molecule predicted by the method to be negative for blood-brain barrier permeability;

FIG. 9b is an interpretability analysis and verification graph for a small molecule predicted by the method to be positive for blood-brain barrier permeability;

FIG. 10a is an interpretability analysis of small molecule 1, predicted by the method to have inhibitory activity against the mammalian target of rapamycin (mTOR); the molecule contains morpholine ring and ureido pharmacophores that are known to play a key role and that receive higher attention;

FIG. 10b is an interpretability analysis of small molecule 2, predicted by the method to have inhibitory activity against mTOR; the molecule contains a bridged morpholine ring pharmacophore that is known to play a key role and that receives higher attention;

FIG. 10c is an interpretability analysis of small molecule 3, predicted by the method to have inhibitory activity against mTOR; the molecule contains a morpholine ring pharmacophore on a pyrazolopyrimidine scaffold that is known to play a key role and that receives higher attention.

Detailed Description

The following is a specific example of the application as used in the development team's laboratory. It illustrates how the application is used and the logic behind it and does not limit the scope of the application; equivalent modifications made by those skilled in the art fall within the scope of the application as defined in the appended claims.

Example 1

This embodiment provides a drug property prediction method based on deep learning fusion of a molecular graph and a fingerprint. Taking as an example the prediction of the inhibitory activity of small molecules against cyclin-dependent kinase family members (CDK1-9, 14 and 19), the method comprises the following steps:

1) Obtaining inhibitory activity data of a large number of small drug molecules against cyclin-dependent kinase family members (CDK1-9, 14 and 19) and constructing a data set, which comprises the following steps:

1-2) Since no industry-accepted classical data set exists for this task, data must be collected from experimental records kept in medicinal chemistry or biology laboratories, or from compound activity data provided by medicinal chemistry databases published online, and then preprocessed.

1-2-1) In this embodiment, all activity records for the targets are collected from the medicinal chemistry database ChEMBL according to the target identifiers of CDK1-9, 14 and 19.

1-2-2) Data screening. Only bioactivity records with assay type B and reported activity type IC50, EC50, Ki or Kd are retained; small drug molecules with multiple activity records are deduplicated, and the activity data of repeated molecules are averaged.

1-2-3) Data normalization. Hydrogen ions and salt ions are removed from the small-molecule structures, and the structures are optimized with a force field.

1-2-4) Data annotation. In this embodiment the task is a classification task, and the molecules are labeled according to a specified threshold. For this example, the threshold is 10 μM: small molecules with a measured activity less than or equal to 10 μM are labeled as inhibitors, and small molecules with a measured activity greater than 10 μM are labeled as non-inhibitors.

1-2-5) Obtaining standardized data. After standardization, a total of 12532 compounds and their enzyme inhibition activity data for 11 CDK subtype proteins were obtained. For example, 1871 compounds have test records for the CDK1 target, of which 883 are labeled as non-inhibitors and 988 as inhibitors; 4305 compounds have usable test data for the CDK2 target, of which 1598 are labeled as non-inhibitors and 2707 as inhibitors; and 1330 compounds have activity test data for the CDK9 target, of which 243 are labeled as non-inhibitors and 1087 as inhibitors. The resulting standardized data set consists of 12532 pairs of small-drug-molecule SMILES strings and the corresponding CDK subtype protein inhibitory activity labels.

2) Constructing a deep learning model suitable for drug property prediction, wherein the model processes the two kinds of features, the molecular graph and the molecular fingerprint, in separate modules, and finally assembles a fused neural network framework using fully connected layers. Specifically, in this embodiment, the molecular-graph feature extraction module adopts a deep learning network based on the graph attention mechanism; in the molecular-fingerprint feature extraction module, the concatenated bit strings of the three fingerprints MACCS FP, PubChem FP and Pharmacophore ErG FP are used as the input of the molecular fingerprint representation. The output vectors of the two feature extraction modules are concatenated and input into a fusion module of several fully connected layers, which predicts the CDK subtype protein inhibitory activity of the small-molecule drug.

3) For data set construction, random splitting is selected, and the data are randomly divided into a training set, a test set and a validation set in the ratio 8:1:1.

4) Inputting the divided data set into a network model, training and updating parameters in the network according to the difference between the predicted result of the training set and the true value of the training set, determining optimal network parameters according to the optimal result on the verification set, and detecting the data of the test set.

Specifically, in this embodiment, when predicting the CDK subtype protein inhibitory activity of a drug small molecule, a binary cross-entropy (BCE) loss preceded by a sigmoid function is used to compute the loss between the predictions and the true values. Gradients are then computed by backpropagation, and the Adam optimizer updates the network parameters. Training runs for 20 rounds, and the network parameters of the round with the best ROC-AUC on the validation set are kept as the final model.
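As an illustration of a BCE loss with the sigmoid folded in, the following sketch computes the loss directly from raw network scores. The numerically stable formulation and the names `bce_with_logits` and `mean_bce` are assumptions; the patent does not specify how the loss is implemented.

```python
import math

def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw score z with label y in {0, 1}.

    Equivalent to -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(z), written as
    max(z, 0) - z*y + log(1 + exp(-|z|)) so that large |z| cannot overflow.
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def mean_bce(logits, targets):
    """Average loss over a batch of raw scores and labels."""
    return sum(bce_with_logits(z, y) for z, y in zip(logits, targets)) / len(logits)

# Hypothetical batch: three raw scores and their true labels
loss = mean_bce([2.0, -1.0, 0.0], [1, 0, 1])
```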

5) Determining the optimal hyperparameter combination of the model according to the hyperparameter optimization strategy.

Specifically, in this embodiment, six hyperparameters and their search ranges are set for predicting the CDK subtype protein inhibitory activity of drug small molecules: the dropout rate of the molecular graph extraction module ([0, 0.05, …, 0.6]), the number of attention heads of the molecular graph extraction module ([2, 3, …, 8]), the attention iteration number of the molecular graph extraction module ([40, 45, …, 80]), the dropout rate of the molecular fingerprint extraction module ([0, 0.05, …, 0.6]), the feature vector dimension of the molecular fingerprint extraction module ([300, 350, …, 600]), and the ratio of the molecular graph vector to the molecular fingerprint vector at the input of the fully connected layers of the fusion module ([0, 0.1, …, 1]).
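The six ranges above can be written out as an explicit search space, which also shows why exhaustive search is impractical. In this sketch the dictionary keys are hypothetical labels for the six hyperparameters (the "loss rate" is interpreted as a dropout rate), and the helper `grid` is an assumption.

```python
def grid(start, stop, step, digits=2):
    """Inclusive arithmetic grid, rounded to avoid floating-point drift."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, digits) for i in range(count)]

# Hypothetical key names; the value ranges are those listed in the text
search_space = {
    "graph_dropout": grid(0.0, 0.6, 0.05),
    "graph_attention_heads": grid(2, 8, 1),
    "graph_attention_iterations": grid(40, 80, 5),
    "fingerprint_dropout": grid(0.0, 0.6, 0.05),
    "fingerprint_dim": grid(300, 600, 50),
    "graph_fingerprint_ratio": grid(0.0, 1.0, 0.1),
}

# Size of the full grid: product of the number of values per hyperparameter
n_combinations = 1
for values in search_space.values():
    n_combinations *= len(values)
# n_combinations is 13 * 7 * 9 * 13 * 7 * 11 = 819819
```

With several hundred thousand combinations, each requiring a full training run, a guided search strategy is the practical choice.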

To find a good hyperparameter combination as efficiently and accurately as possible, this embodiment adopts a Bayesian optimization strategy to explore the six hyperparameters jointly over their ranges. Based on the hyperparameter combinations evaluated so far and their results, the Bayesian optimization strategy computes the posterior distribution through Gaussian process regression, obtains the expected mean and variance for each possible value of the six hyperparameters, and uses both to decide which combination of values to evaluate in the next optimization step. Because the CDK subtype protein inhibitory activity data sets contain relatively few molecules and differ in chemical distribution, the influence of random splitting is reduced as follows: for each hyperparameter combination, ten random seeds are used to split ten versions of the data set, and the mean of the ten training results serves as the evaluation value of that optimization step. In this embodiment, Bayesian optimization is run for 15 steps in total, and the hyperparameter combination with the best evaluation index on the validation set is selected as the final combination.
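The seed-averaged evaluation can be sketched as follows. The function `fake_train_and_validate` is a hypothetical stand-in for a complete split, train and validate run and returns made-up scores; only the averaging over ten seeds reflects the text.

```python
from statistics import mean

def seed_averaged_score(params, train_and_validate, seeds=range(10)):
    """Average the validation score over several random splits of the data."""
    return mean(train_and_validate(params, seed) for seed in seeds)

# Hypothetical stand-in for a full training run: returns a fake ROC-AUC
def fake_train_and_validate(params, seed):
    return 0.80 + 0.01 * (seed % 3) - 0.1 * abs(params["graph_dropout"] - 0.2)

score = seed_averaged_score({"graph_dropout": 0.2}, fake_train_and_validate)
```

The optimizer then sees one averaged value per hyperparameter combination, so a single lucky or unlucky split has less influence on the search.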

6) For the prediction of different drug properties, generating a targeted optimal model for subsequent application to property prediction of new small molecule drugs.

Specifically, in this embodiment, a data set of small molecule inhibitory activity against cyclin-dependent kinase family members (CDKs 1-9, 14, 19) was used to construct a total of 11 optimal models, one each for CDKs 1-9, 14 and 19, which are provided to users for predicting the properties of new drug small molecules targeting the CDK family.

In this embodiment, the prediction application is performed using 11 optimal models for the CDK family, specifically including the following steps:

6-1) Selecting an existing library containing the desired group of compounds, or the compounds whose target values are to be predicted.

In this embodiment, the SPECS compound library (containing 208670 compounds, https://www.specs.net/) was selected for mining CDK9 inhibitors. The SPECS library was processed with the same standardization protocol as in step S103 and filtered by Lipinski's rule of five, yielding a CDK9 inhibitor screening library of about 194916 compounds.
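A minimal sketch of a rule-of-five filter is shown here. The thresholds are the standard Lipinski criteria (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen bond donors, at most 10 hydrogen bond acceptors). The patent does not state how many violations are tolerated, so `max_violations` is a hypothetical parameter, and the descriptor values are assumed to have been computed beforehand with a cheminformatics toolkit.

```python
def passes_rule_of_five(mol_weight, logp, h_donors, h_acceptors, max_violations=0):
    """Count violations of Lipinski's four criteria and compare with the allowance."""
    violations = sum([
        mol_weight > 500,     # molecular weight in Da
        logp > 5,             # octanol-water partition coefficient
        h_donors > 5,         # hydrogen bond donors
        h_acceptors > 10,     # hydrogen bond acceptors
    ])
    return violations <= max_violations

# Hypothetical precomputed descriptors: (id, MW, logP, donors, acceptors)
library = [
    ("cpd_1", 350.4, 2.1, 2, 5),
    ("cpd_2", 720.9, 6.3, 4, 12),
]
kept = [cid for cid, mw, logp, hbd, hba in library
        if passes_rule_of_five(mw, logp, hbd, hba)]
```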

6-2) Inputting the compound library into the optimal deep learning prediction model constructed for CDK9. The SMILES string of each molecule in the compound library is input into the optimal model built for CDK9 inhibitory activity prediction with the drug property prediction algorithm based on deep learning fusion of molecular graph and fingerprint. After computation through each node, the model outputs the degree of inhibition of CDK9 kinase by the molecule; the closer the output value is to 1, the stronger the predicted inhibition of CDK9 kinase.

6-3) Ranking the compound library from high to low by the degree of CDK9 kinase inhibition computed by the CDK9 optimal model, and selecting the top 1000 molecules out of the 194916 compounds for further analysis. Finally, based on visual analysis of ligand-protein interactions using molecular docking software, 19 compounds were selected and purchased for biological experimental verification. In the biological experiments, the cell-level results showed that 6 of the 19 compounds had obvious anticancer activity at the cancer cell level, and the in vitro CDK9 kinase inhibition test showed that 5 compounds had obvious inhibitory activity against the target.
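The ranking and selection step can be sketched as a top-k selection over predicted scores. The compound identifiers and scores below are made up, and `top_k` is a hypothetical helper; in the embodiment k would be 1000.

```python
import heapq

def top_k(scored_compounds, k):
    """Return the k (compound_id, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scored_compounds, key=lambda pair: pair[1])

# Hypothetical model outputs: values closer to 1 mean stronger predicted inhibition
scores = [("cpd_a", 0.12), ("cpd_b", 0.97), ("cpd_c", 0.55), ("cpd_d", 0.91)]
shortlist = top_k(scores, k=2)
```

`heapq.nlargest` avoids sorting the whole library when k is much smaller than the number of compounds.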

This embodiment shows that the drug property predictions obtained from big data and the deep learning neural network prediction model are correct and consistent with practice. The drug property prediction algorithm based on deep learning fusion of molecular graph and fingerprint provided by the invention is advanced and practical, and offers medicinal chemists and practitioners in related fields fast and efficient screening through drug molecule property prediction.

Corresponding to the embodiment, the invention also provides a computer device.

The computer device of this embodiment includes a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor; when executing the program, the processor implements the drug property prediction method based on deep learning fusion of molecular graph and fingerprint, and its application, as described in the embodiments. When running the computer program, the computer device of this embodiment preprocesses the acquired data set of small molecule inhibitory activity against cyclin-dependent kinase family members (CDKs 1-9, 14, 19), generates a total of 11 optimal models for the different kinases, and uses the optimal models to predict the properties of new drug small molecules. The computer device of this embodiment can rapidly and efficiently predict inhibitory activity against 11 CDK kinase family members, which improves efficiency in the research and development of drugs related to CDK kinase family members and accelerates virtual screening.

The present invention also proposes a non-transitory computer-readable storage medium corresponding to the above-described embodiments.

The non-transitory computer-readable storage medium of this embodiment stores a computer program which, when executed by a processor, performs the drug property prediction method based on deep learning fusion of molecular graph and fingerprint. The storage medium of this embodiment contains the acquired data set of small molecule inhibitory activity against cyclin-dependent kinase family members (CDKs 1-9, 14, 19), the preprocessed sample set, and the 11 optimal models for the different kinases generated from the sample set. With this storage medium, the optimal models generated in this embodiment can be used directly, which saves model generation time; a user can rapidly and efficiently predict inhibitory activity against 11 CDK kinase family members, which improves efficiency in the research and development of drugs related to CDK kinase family members and accelerates virtual screening.

Example 2

The present example provides a drug property prediction method, with explanatory analysis of the molecular graph, based on deep learning fusion of molecular graphs and molecular fingerprints. Taking the prediction of the blood brain barrier permeability activity of small molecules as an example, the method comprises the following steps:

1) A data set of the blood brain barrier permeability activity of a large number of drug small molecules was obtained. The data set is derived from a publicly available medicinal chemistry data set and had already been preprocessed, giving a total of 2039 compounds, of which 1560 are positive and 479 are negative; the negative compounds account for 23.49% of the data set. The standardized data set consists of 2039 drug small molecules in SMILES format, each with its corresponding positive or negative blood brain barrier permeability label.

2) Constructing a deep learning model suitable for drug property prediction. The model processes the molecular graph and the molecular fingerprint in separate modules, and fully connected layers finally assemble them into a fusion neural network framework. Specifically, in this embodiment, the molecular-graph feature extraction module uses a deep learning network based on a graph attention mechanism; the molecular-fingerprint feature extraction module takes as input the concatenated bit strings of three fingerprints: MACCS FP, PubChem FP and Pharmacophore ErG FP. The output vectors of the two feature extraction modules are concatenated and fed into a fusion module consisting of several fully connected layers, which predicts the blood brain barrier permeability activity value of the small molecule drug.

Specifically, in this embodiment, when predicting the blood brain barrier permeability activity of a drug small molecule, a binary cross-entropy (BCE) loss preceded by a sigmoid function is used to compute the loss between the predictions and the true values. Gradients are then computed by backpropagation, and the Adam optimizer updates the network parameters. Training runs for 20 rounds, and the network parameters of the round with the best ROC-AUC on the validation set are kept as the final model.
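Model selection by ROC-AUC can be sketched as follows. ROC-AUC is computed here by its pairwise definition (the fraction of positive and negative pairs ranked correctly, with ties counted as one half), which is simple but quadratic in the data set size; the function names and example numbers are hypothetical.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_round(validation_aucs):
    """Index of the training round with the highest validation ROC-AUC."""
    return max(range(len(validation_aucs)), key=validation_aucs.__getitem__)

# Hypothetical validation labels and predicted probabilities
auc = roc_auc([1, 0, 1, 0], [0.9, 0.4, 0.35, 0.2])
# Hypothetical per-round validation scores; the parameters of round 1 would be kept
chosen = best_round([0.81, 0.88, 0.86])
```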

In this embodiment, six hyperparameters and their search ranges are set for predicting the blood brain barrier permeability activity of drug small molecules: the dropout rate of the molecular graph extraction module ([0, 0.05, …, 0.6]), the number of attention heads of the molecular graph extraction module ([2, 3, …, 8]), the attention iteration number of the molecular graph extraction module ([40, 45, …, 80]), the dropout rate of the molecular fingerprint extraction module ([0, 0.05, …, 0.6]), the feature vector dimension of the molecular fingerprint extraction module ([300, 350, …, 600]), and the ratio of the molecular graph vector to the molecular fingerprint vector at the input of the fully connected layers of the fusion module ([0, 0.1, …, 1]).

To find a good hyperparameter combination as efficiently and accurately as possible, this embodiment adopts a Bayesian optimization strategy to explore the six hyperparameters jointly over their ranges. Based on the hyperparameter combinations evaluated so far and their results, the Bayesian optimization strategy computes the posterior distribution through Gaussian process regression, obtains the expected mean and variance for each possible value of the six hyperparameters, and uses both to decide which combination of values to evaluate in the next optimization step. Because the blood brain barrier permeability activity data set contains relatively few molecules and data sets differ in chemical distribution, the influence of random splitting is reduced as follows: for each hyperparameter combination, ten random seeds are used to split ten versions of the data set, and the mean of the ten training results serves as the evaluation value of that optimization step. In this embodiment, Bayesian optimization is run for 20 steps in total, and the hyperparameter combination with the best evaluation index on the validation set is selected as the final combination.

6) For the prediction of different drug properties, a targeted optimal model is generated for subsequent application to new small molecule drug property prediction and interpretation analysis.

Specifically, in this embodiment, 1 optimal model for blood brain barrier permeability is constructed from a data set of small molecule blood brain barrier permeability activity, and the model is provided to users for predicting the blood brain barrier permeability of new drug small molecules.

7) In this embodiment, an explanatory analysis is performed using an optimal model for blood brain barrier permeability, specifically including the steps of:

7-1) Generating the molecular data set that requires explanatory analysis, here by selecting 2 drug small molecules;

7-2) Loading the pre-generated optimal model for blood brain barrier permeability, and performing property prediction and molecular graph module explanatory analysis on the drug small molecules;

7-3) Generating the property prediction results and the molecular graph module analysis results (see FIG. 9a and FIG. 9b);

7-4) The SMILES format of molecule 1 is: [ C@H ]1CN (C [ C @ H ] (C) N1) C2C (F) C (N) C3C (=o) C (=cn (C4 CC 4) C3C 2F) C (O) =o. Its property prediction result is 0.134, so it is predicted to be a negative molecule, consistent with the actual blood brain barrier permeability of the molecule. As shown in FIG. 9a, a higher computed attention value means that the model pays more attention to that part of the structure during prediction. The circled part has a relatively high computed attention value, meaning that the model considers the substructure around those edges to play an important role in the molecule's inability to penetrate the blood brain barrier. The square frame is darker in color than the round frame, meaning that the substructure inside the square frame plays the more important role in the molecule's inability to penetrate the blood brain barrier. The blood brain barrier is a membrane structure, and whether a molecule can penetrate it is related to the molecule's lipophilicity and polarity; for negative molecules, the greater the polarity and the lower the ClogP value, the less able the molecule is to penetrate the blood brain barrier. The molecule was analyzed with the software ChemBioDraw, which gave a ClogP value of -0.905 for the square region and 0.934 for the round region. Comparing the two, the square region has the lower ClogP value and the greater polarity and therefore plays an important role in the molecule's failure to penetrate the blood brain barrier; the model's high attention to the square region is consistent with its judgment that the molecule as a whole cannot penetrate the blood brain barrier.

The SMILES format of molecule 2 is: c1CCN (CC 1) CC1cccc (C1) OCCCNC (=o) C. Its property prediction result is 0.988, so it is predicted to be a positive molecule, consistent with the actual blood brain barrier permeability of the molecule. The molecular graph analysis is shown in FIG. 9b: the circled part has a relatively high computed attention value, meaning that the model pays more attention to that part of the structure during prediction and considers the substructure around those edges to play an important role in the molecule's penetration of the blood brain barrier. The substructure enclosed by the square frame is darker in color than the substructure enclosed by the round frame, meaning that the substructure inside the square frame plays the more important role in enabling the molecule to penetrate the blood brain barrier. For positive molecules, the lower the polarity and the higher the ClogP value, the more able the molecule is to penetrate the blood brain barrier. Quantitative analysis with the software gave a ClogP value of 2.142 for the square region and 1.389 for the round region. Comparing the two, the square region has the higher ClogP value and the lower polarity and therefore plays an important role in the molecule's penetration of the blood brain barrier; the model's high attention to the square region is consistent with its judgment that the molecule as a whole can penetrate the blood brain barrier. Comparing the two molecules, the ClogP value of the lower-attention round region in positive molecule 2 (1.389) is higher than that of the lower-attention round region in negative molecule 1 (0.934), which indirectly indicates that molecule 2 is less polar overall and more likely to cross the blood brain barrier.

The model's predictions and explanatory analyses of the two molecules are consistent with the actual situation, showing that the optimal model constructed by the algorithm can predict molecular properties correctly and analyze them reasonably, and thus provides strong support for chemists designing drug molecules.

Corresponding to the embodiment, the invention also provides a computer device.

The computer device of this embodiment includes a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor; when executing the program, the processor implements the drug property prediction method based on deep learning fusion of molecular graph and fingerprint, and its application, as described in the embodiments. When running the computer program, the computer device of this embodiment generates 1 optimal model from the acquired data set of small molecule blood brain barrier permeability activity, and uses the optimal model for property prediction and explanatory analysis of new drug small molecules. The computer device of this embodiment can rapidly and efficiently predict blood brain barrier permeability activity, which improves efficiency in the research and development of drugs that need to penetrate the blood brain barrier and accelerates virtual screening.

The non-transitory computer-readable storage medium of this embodiment stores a computer program which, when executed by a processor, performs the drug property prediction method based on deep learning fusion of molecular graph and fingerprint. The storage medium of this embodiment contains the acquired data set of small molecule blood brain barrier permeability activity and the 1 optimal model generated from that data set. With this storage medium, the optimal model generated in this embodiment can be used directly, which saves model generation time; a user can rapidly and efficiently predict the permeability activity and analyze the molecular structure of small molecules capable of penetrating the blood brain barrier, which improves efficiency in the research and development of drugs that need to penetrate the blood brain barrier and accelerates virtual screening.

Example 3

The present embodiment provides a drug property prediction method based on deep learning fusion of molecular graph and fingerprint. Taking the prediction of the inhibitory activity of small molecules against the target of rapamycin protein (mTOR) as an example, the method comprises the following steps:

1) Obtaining inhibitory activity data of a large number of drug small molecules against the target of rapamycin protein (mTOR) for constructing a data set, wherein the data set is constructed by the following steps:

1-2-1) In this embodiment, the relevant protein-level information is obtained from the protein database UniProt, and all activity data records for mTOR kinase are collected from the medicinal chemistry database ChEMBL according to the UniProt ID of the kinase.

1-2-4) Data annotation. In this embodiment, the task is a classification task, so the molecules must be labeled systematically according to a specified threshold. For this example the threshold is 1 μM: small molecules with a test activity of 1 μM or less are labeled as inhibitors, and small molecules with a test activity above 1 μM are labeled as non-inhibitors.
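The labeling rule can be written as a one-line threshold test. In this sketch the record names and activity values are made up, and `label_activity` is a hypothetical helper; only the 1 μM threshold and the direction of the comparison come from the text.

```python
THRESHOLD_UM = 1.0  # activity threshold from the text, in micromolar

def label_activity(activity_um, threshold_um=THRESHOLD_UM):
    """Return 1 (inhibitor) if the measured activity is <= threshold, else 0 (non-inhibitor)."""
    return 1 if activity_um <= threshold_um else 0

# Hypothetical records: (molecule name, measured activity in micromolar)
records = [("mol_1", 0.05), ("mol_2", 1.0), ("mol_3", 12.5)]
labels = {name: label_activity(value) for name, value in records}
```

Note that a value exactly at the threshold is labeled as an inhibitor, matching the "less than or equal to" wording.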

1-2-5) Obtaining standardized data. After standardization, a total of 4104 compounds with kinase inhibition activity data against mTOR kinase were obtained, of which 565 are labeled as non-inhibitors and 3539 as inhibitors, giving a positive ratio of 86.23%. The resulting standardized data set consists of 4104 pairs, each comprising a drug small molecule in SMILES format and its corresponding mTOR kinase inhibitory activity label.

2) Constructing a deep learning model suitable for drug property prediction. The model processes the molecular graph and the molecular fingerprint in separate modules, and fully connected layers finally assemble them into a fusion neural network framework. Specifically, in this embodiment, the molecular-graph feature extraction module uses a deep learning network based on a graph attention mechanism; the molecular-fingerprint feature extraction module takes as input the concatenated bit strings of three fingerprints: MACCS FP, PubChem FP and Pharmacophore ErG FP. The output vectors of the two feature extraction modules are concatenated and fed into a fusion module consisting of several fully connected layers, which predicts the mTOR kinase inhibitory activity of the small molecule drug.

Specifically, in this embodiment, when predicting the mTOR kinase inhibitory activity of a drug small molecule, a binary cross-entropy (BCE) loss preceded by a sigmoid function is used to compute the loss between the predictions and the true values. Gradients are then computed by backpropagation, and the Adam optimizer updates the network parameters. Training runs for 40 rounds, and the network parameters of the round with the best ROC-AUC on the validation set are kept as the final model.

To find a good hyperparameter combination as efficiently and accurately as possible, this embodiment adopts a Bayesian optimization strategy to explore the six hyperparameters jointly over their ranges. Based on the hyperparameter combinations evaluated so far and their results, the Bayesian optimization strategy computes the posterior distribution through Gaussian process regression, obtains the expected mean and variance for each possible value of the six hyperparameters, and uses both to decide which combination of values to evaluate in the next optimization step. Because the mTOR kinase inhibitory activity data set contains relatively few molecules and data sets differ in chemical distribution, the influence of random splitting is reduced as follows: for each hyperparameter combination, ten random seeds are used to split ten versions of the data set, and the mean of the ten training results serves as the evaluation value of that optimization step. In this embodiment, Bayesian optimization is run for 20 steps in total, and the hyperparameter combination with the best evaluation index on the validation set is selected as the final combination.
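How a posterior mean and variance might be combined to choose the next candidate can be sketched as follows. The text does not name the acquisition rule, so the upper confidence bound used here (mean plus `kappa` times the standard deviation) is an assumption, as are the function name `pick_next`, the value of `kappa` and the posterior numbers.

```python
import math

def pick_next(candidates, posterior, kappa=2.0):
    """Choose the candidate that maximises mean + kappa * standard deviation."""
    def ucb(candidate):
        mu, var = posterior[candidate]
        return mu + kappa * math.sqrt(var)
    return max(candidates, key=ucb)

# Hypothetical posterior (mean, variance) of validation ROC-AUC per dropout value
posterior = {0.1: (0.85, 0.0001), 0.3: (0.83, 0.0025), 0.5: (0.80, 0.0004)}
next_value = pick_next(list(posterior), posterior)
```

In this toy case the candidate 0.3 is chosen even though its mean is not the highest, because its larger variance leaves more room for improvement; this is the exploration and exploitation trade-off that the mean and variance together allow.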

Specifically, in this embodiment, 1 optimal model was constructed using a data set of small molecule inhibitory activity against mTOR kinase, and it is provided to users for property prediction and molecular graph module explanatory analysis of new drug small molecules targeting mTOR kinase.

7) In this example, using an optimal model for mTOR kinase, predictive and analytical applications were performed, specifically comprising the steps of:

7-1) Generating the molecular data set that requires explanatory analysis, here by selecting 3 drug small molecules;

7-2) Loading the pre-generated optimal model for mTOR kinase, and performing property prediction and molecular graph module explanatory analysis on the drug small molecules;

7-3) Generating the property prediction results and the molecular graph module analysis results (see FIG. 10a, FIG. 10b and FIG. 10c);

7-4) The predicted mTOR inhibitory activity values of the three molecules are 0.984, 0.949 and 0.964, respectively; all three are predicted to be positive molecules, consistent with their actual inhibition of mTOR kinase. The inhibitory activity of a small molecule against mTOR kinase depends largely on the molecule's ability to form hydrogen bonds. In the boxed regions of the three molecules, the morpholine ring and the ureido group are hydrogen bond acceptors and can form hydrogen bonds, so the inhibitory activity against mTOR kinase is very high, consistent with the positive predictions. On the pre-generated optimal model, the predictions and molecular graph explanatory analyses of the three small molecules agree with the actual inhibitory activity of the molecules and the actual hydrogen bonding capability of the substructures, showing that the model constructed according to the invention can both predict mTOR kinase inhibitory activity and explain the prediction. The invention can effectively help chemists carry out large-scale drug screening and drug molecule design for mTOR kinase.

Corresponding to the embodiment, the invention also provides a computer device.

The computer device of this embodiment includes a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor; when executing the program, the processor implements the drug property prediction method based on deep learning fusion of molecular graph and fingerprint, and its application, as described in the embodiments. When running the computer program, the computer device preprocesses the acquired data set of inhibitory activity against mTOR kinase, generates 1 optimal model, and uses the optimal model for property prediction and explanatory analysis of new drug small molecules. The computer device of this embodiment can rapidly and efficiently predict inhibitory activity against mTOR kinase, which improves efficiency in the research and development of drugs that target mTOR kinase and accelerates virtual screening.

Corresponding to the above embodiment, this embodiment further provides a drug property prediction system based on deep learning fusion of molecular graph and fingerprint, comprising:

a data preprocessing module, used for preprocessing the collected raw chemical molecule activity data set, so that the model can be applied to the construction of new drug molecule property data sets;

a model construction module, used for modeling the processed samples with a deep learning model based on the molecular graph and the molecular fingerprint; the deep learning model comprises a molecular-graph feature extraction module, a molecular-fingerprint feature extraction module and a fusion module; the molecular-graph feature extraction module uses a graph attention network and focuses on the influence of relationships between adjacent atoms on molecular properties; the molecular-fingerprint feature extraction module extracts the influence of molecular structures and pharmacophores on molecular properties from three different types of molecular fingerprints; the fusion module merges the feature vectors obtained by the two feature extraction modules and inputs them into a multi-layer fully connected network;

a prediction module, used for predicting new drug small molecules with the optimal model generated by the model construction module, so that the model is applied to the prediction of new drug molecules; and

an explanatory module, used for performing explanatory analysis of drug small molecules with the optimal model generated by the model construction module, so that the model can give users drug design suggestions for specific drug properties.

The invention also proposes a non-transitory computer readable storage medium.

The non-transitory computer-readable storage medium of this embodiment stores a computer program which, when executed by a processor, performs the drug property prediction method based on deep learning fusion of molecular graph and fingerprint. The storage medium of this embodiment contains the acquired data set of small molecule inhibitory activity against mTOR kinase and the 1 optimal model generated from that data set. With this storage medium, the optimal model generated in this embodiment can be used directly, which saves model generation time; a user can rapidly and efficiently predict the activity and analyze the molecular structure of small molecules capable of inhibiting mTOR kinase, which improves efficiency in the research and development of drugs that target mTOR kinase and accelerates virtual screening.

Claims

1. A drug property prediction method based on deep learning fusion of molecular graph and fingerprint, characterized by comprising the following steps:

7) For the generated optimal model, providing an explanatory analysis for reference of subsequent drug design;

The step 2) specifically comprises the following steps:

2-3) a module for extracting molecular fingerprint characteristics from the model, wherein a plurality of full-connection layers are adopted; three different types of molecular fingerprints are generated according to the inputted SMILES format: molecular fingerprint MACCS FP based on substructure, molecular fingerprint PubChem FP based on substructure, molecular fingerprint Pharmacophore ErG FP based on pharmacophore; inputting the serial connection of the three fingerprints into a full-connection layer network of the module to obtain a characteristic vector of the molecular fingerprint;

2-4) the model concatenates the feature vectors generated by the two modules and inputs the result into a plurality of fully connected layers for predicting the properties of the drug small molecules, generating the final prediction result;
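As an illustration of steps 2-3) and 2-4), the following minimal Python sketch concatenates three fingerprint vectors, passes them through one fully connected layer, joins the result with a molecular graph feature vector, and applies further fully connected layers to obtain a prediction. It is a sketch under stated assumptions and not the patented implementation: the function names (`dense`, `init_layer`, `fuse_and_predict`), the layer widths 8 and 4, the ReLU activation, the random untrained weights, and the toy fingerprint lengths are all hypothetical, and the fingerprints are assumed to have been computed elsewhere.

```python
import random


def dense(x, weights, bias):
    """One fully connected layer with ReLU: y = max(0, W x + b)."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    """Random weights for illustration only; a trained model would learn these."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def fuse_and_predict(graph_vec, maccs, pubchem, erg, rng):
    # Step 2-3): concatenate the three fingerprints, then one fully connected layer.
    fp_input = list(maccs) + list(pubchem) + list(erg)
    w_fp, b_fp = init_layer(len(fp_input), 8, rng)   # width 8 is an assumption
    fp_vec = dense(fp_input, w_fp, b_fp)

    # Step 2-4): join graph and fingerprint feature vectors, then predict.
    fused = list(graph_vec) + fp_vec
    w_h, b_h = init_layer(len(fused), 4, rng)        # width 4 is an assumption
    hidden = dense(fused, w_h, b_h)
    w_out, b_out = init_layer(4, 1, rng)
    # Linear output (no ReLU), as for a regression-style target.
    prediction = sum(w * h for w, h in zip(w_out[0], hidden)) + b_out[0]
    return prediction, fused


rng = random.Random(0)
pred, fused = fuse_and_predict([0.5, -0.2, 0.1], [1, 0, 1], [0, 1], [0.0, 2.0], rng)
print(len(fused), pred)
```

The fused vector keeps the graph features first and the fingerprint features after them, so either branch can be inspected separately when the model is interpreted.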

the step 2-2) specifically comprises the following steps:

2-2-1) calculating the physicochemical properties of each atom as the initial feature vector of the corresponding node in the molecular graph; the physicochemical properties specifically include: atom type, number of attached chemical bonds, charge number, chirality, number of attached hydrogens, orbital hybridization, atomic mass, and aromaticity; the atom type covers atoms with an atomic number within one hundred;

Wherein $h_i$ and $h_j$ are the iterative feature vectors of adjacent atoms i and j, $W$ is a weight matrix, and $a$ is the weight; the calculated attention value between adjacent atoms i and j is $e_{ij}$; before atom i is updated, the attention values corresponding to all neighbors of atom i are normalized to obtain $\alpha_{ij}$; multiple attention values are calculated repeatedly using a multi-head attention mechanism, and atom i is updated with the average of the multiple attention results to obtain the iterative feature vector $h_i'$;

2-2-3) computing the feature vector of the molecular graph, the expression being:

$$h_G = \frac{1}{N}\sum_{i=1}^{N} h_i$$

Wherein $h_i$ is the feature vector of atom i after the iterative update is finished, $N$ is the number of atoms in the molecule, and $h_G$ is the feature vector of the molecule, taken as the average of the feature vectors of all atoms;
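The attention update and the averaging readout described above can be sketched in plain Python as follows. This is a hedged illustration, not the claimed formula: the scoring function (a LeakyReLU applied to a weight vector dotted with the concatenated transformed features) follows the common graph attention network form and is an assumption, as is the reading that the multi-head step averages the per-head updated vectors. All function names are hypothetical.

```python
import math


def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def attention_head(features, neighbors, W, a):
    """One attention head; returns an updated feature vector for every atom."""
    z = [matvec(W, h) for h in features]
    updated = []
    for i, nbrs in enumerate(neighbors):
        if not nbrs:                      # isolated atom: keep its transformed features
            updated.append(z[i])
            continue
        # Raw attention value between atom i and each neighbor j (assumed GAT form).
        scores = [leaky_relu(sum(w * x for w, x in zip(a, z[i] + z[j]))) for j in nbrs]
        # Normalize over all neighbors of atom i (softmax, shifted for stability).
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        # Attention-weighted sum of the neighbors' transformed features.
        updated.append([sum(al * z[j][k] for al, j in zip(alphas, nbrs))
                        for k in range(len(z[i]))])
    return updated


def multi_head_update(features, neighbors, heads):
    """Average the updated vectors produced by several (W, a) heads."""
    outs = [attention_head(features, neighbors, W, a) for W, a in heads]
    return [[sum(o[i][k] for o in outs) / len(outs) for k in range(len(outs[0][i]))]
            for i in range(len(features))]


def mean_readout(features):
    """Molecule vector = average of all atom vectors."""
    n = len(features)
    return [sum(h[k] for h in features) / n for k in range(len(features[0]))]


# Toy three-atom chain 0-1-2 with two-dimensional atom features.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
nbrs = [[1], [0, 2], [1]]
identity = [[1.0, 0.0], [0.0, 1.0]]
heads = [(identity, [0.0] * 4), (identity, [0.0] * 4)]
print(mean_readout(multi_head_update(feats, nbrs, heads)))
```

With an all-zero weight vector every neighbor receives the same normalized attention, which makes the toy example easy to check by hand; a trained model would learn non-uniform values.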

the step 5) specifically comprises the following steps:

5-2) performing hyperparameter optimization on the model by Bayesian optimization for 20 rounds, and selecting the group of hyperparameters with the best evaluation score on the test set;

Step 6) specifically comprises the following steps:

6-3) the model supports batch prediction of drug molecules with unknown properties, enabling quick and efficient judgment of molecular properties;

The step 7) specifically comprises the following steps:

7-1) providing fingerprint interpretation and molecular graph interpretation functions according to the optimal prediction model for the specific drug property generated in step 6) and the input requirements of the user;

7-3) when the user requests molecular graph interpretation, calculating the attention values of the molecular graph within the model and mapping them onto the molecular graph; the larger the attention value of a group of atoms, the greater the effect of that structure in the model generation process, and the more important its role in designing drug molecules for the specific drug property.

2. The drug property prediction method based on deep learning fusion of a molecular graph and a fingerprint according to claim 1, wherein step 1) specifically comprises the following steps:

3. The drug property prediction method based on deep learning fusion of a molecular graph and a fingerprint according to claim 2, wherein step 1-2) specifically comprises the following steps:

1-2-3) removing hydrogen ions and salt ions from the drug small molecules and performing structural force field optimization on them;

1-2-5) presenting the data set as the SMILES (Simplified Molecular Input Line Entry System) strings of the drug small molecules and their corresponding target values.

4. The drug property prediction method based on deep learning fusion of a molecular graph and a fingerprint according to claim 1, wherein step 3) specifically comprises the following steps:

3-1) the model supports a user-defined splitting mode and splitting ratio;

3-2) the built-in splitting modes of the model are random splitting and scaffold splitting; in random splitting, the data set is shuffled and split at random; in scaffold splitting, the number of scaffolds of the drug small molecules in the data set and the number of molecules corresponding to each scaffold are first calculated, the scaffolds with few corresponding molecules, together with their molecules, are assigned in order to the validation set and the test set until both sets contain enough molecules, and the remaining molecules are all assigned to the training set.
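The scaffold splitting rule of step 3-2) can be sketched as follows. The sketch assumes the scaffold of each molecule has already been computed by a cheminformatics toolkit and is supplied as a string; the function name `scaffold_split`, the 10% validation and test fractions, and the choice to fill the validation set before the test set are hypothetical details that the claim does not specify.

```python
from collections import defaultdict


def scaffold_split(scaffolds, valid_frac=0.1, test_frac=0.1):
    """Split molecule indices by scaffold.

    scaffolds[i] is the (precomputed) scaffold string of molecule i.
    Returns three lists of indices: train, valid, test.
    """
    # Count the molecules that belong to each scaffold.
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Smallest scaffold groups first; ties are broken by first occurrence.
    ordered = sorted(groups.values(), key=lambda g: (len(g), g[0]))

    n_valid = int(round(valid_frac * len(scaffolds)))
    n_test = int(round(test_frac * len(scaffolds)))

    train, valid, test = [], [], []
    for group in ordered:
        if len(valid) < n_valid:       # fill the validation set first (assumption)
            valid.extend(group)
        elif len(test) < n_test:       # then the test set
            test.extend(group)
        else:                          # everything else goes to training
            train.extend(group)
    return train, valid, test


# Ten molecules: scaffold A (6 molecules), B (2), C (1), D (1).
scaffolds = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
train, valid, test = scaffold_split(scaffolds)
print(train, valid, test)
```

Because whole scaffold groups are assigned together, no scaffold appears in more than one subset, so the validation and test sets contain structures the training set has not seen.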

5. The system for realizing the drug property prediction method based on the deep learning fusion molecular graph and the fingerprint according to any one of claims 1-4 is characterized by comprising a data preprocessing module, a model construction module, a model prediction module and a model interpretation module;

The data preprocessing module is used for preprocessing the collected chemical molecular activity original data set so that the model can be applied to the construction of a new drug molecular property data set;

The model construction module is used for modeling the processed samples through a deep learning model based on a molecular graph and molecular fingerprints; the deep learning model comprises a feature extraction module based on the molecular graph, a feature extraction module based on the molecular fingerprints, and a fusion module; the feature extraction module based on the molecular graph adopts a graph attention network and focuses on judging the influence of the relationships between adjacent atoms on molecular properties; the feature extraction module based on the molecular fingerprints extracts the influence of molecular structures and pharmacophores on molecular properties from three different types of molecular fingerprints; the fusion module is used for concatenating the feature vectors obtained by the two feature extraction modules and inputting them into a multi-layer fully connected network;

The model prediction module is used for predicting new drug small molecules according to the optimal model generated by the model construction module, so that the model is applied to the prediction of new drug molecules;

the model interpretation module is used for performing interpretation analysis on the small drug molecules according to the optimal model generated by the model construction module, so that the model can provide drug design suggestions aiming at specific drug properties for users.