CN114386694A

CN114386694A - Drug molecule property prediction method, device and equipment based on comparative learning

Info

Publication number: CN114386694A
Application number: CN202210026795.8A
Authority: CN
Inventors: 王俊; 叶贤斌; 高鹏; 谢国彤
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2022-01-11
Filing date: 2022-01-11
Publication date: 2022-04-22
Anticipated expiration: 2042-01-11
Also published as: CN114386694B; WO2023134063A1

Abstract

The application discloses a drug molecule property prediction method, device and equipment based on contrastive learning. It relates to the technical field of artificial intelligence and addresses the low efficiency and poor prediction performance of current drug molecule property prediction. The method comprises: generating a target molecular graph structure of a target drug molecule according to its chemical molecular structure, and generating a target three-dimensional conformation of the target drug molecule; determining a first feature vector corresponding to the target molecular graph structure by using a pre-trained graph neural network model; determining a second feature vector corresponding to the target three-dimensional conformation by using a pre-trained convolutional neural network model, wherein the graph neural network model and the convolutional neural network model are jointly trained through contrastive learning on positive sample pairs and negative sample pairs; and constructing a third feature vector from the first feature vector and the second feature vector, and inputting the third feature vector into a trained property prediction model to obtain a property prediction result for the target drug molecule.

Description

Drug molecule property prediction method, device and equipment based on comparative learning

Technical Field

The application relates to the technical field of artificial intelligence, and in particular to a method, a device and equipment for predicting drug molecule properties based on contrastive learning.

Background

Drug research and development is slow, expensive and risky. To uncover the rules that govern drug molecules and accelerate drug discovery, researchers in drug development have been introducing machine learning methods into pharmaceutical chemistry since the beginning of this century. An accurate and efficient molecular property prediction model can greatly reduce the dependence on experiments, lower costs and speed up progress.

At present, drug molecule properties can be predicted with methods based on molecular fingerprints and molecular descriptors. However, these methods require extensive expert knowledge for their design and optimization, and they lack generality and extensibility. Selecting molecular descriptors is tedious and time-consuming, and the selected descriptors impose a strong prior on the model, which biases the model and in turn degrades its prediction performance.

Disclosure of Invention

In view of this, the present application provides a method, an apparatus and a device for predicting drug molecule properties based on contrastive learning, which can be used to solve the technical problems of low efficiency and poor prediction performance in current drug molecule property prediction.

According to one aspect of the present application, there is provided a method for predicting a property of a drug molecule based on contrastive learning, the method comprising:

generating a target molecular graph structure of a target drug molecule according to the chemical molecular structure, and generating a target three-dimensional conformation of the target drug molecule;

determining a first feature vector corresponding to the target molecular graph structure by using a pre-trained graph neural network model;

determining a second feature vector corresponding to the target three-dimensional conformation by using a pre-trained convolutional neural network model, wherein the graph neural network model and the convolutional neural network model are jointly trained through contrastive learning on positive sample pairs and negative sample pairs;

and constructing a third feature vector according to the first feature vector and the second feature vector, and inputting the third feature vector into a pre-trained property prediction model to obtain a property prediction result of the target drug molecule.

According to another aspect of the present application, there is provided a drug molecule property prediction apparatus based on contrastive learning, the apparatus including:

the first generation module is used for generating a target molecular graph structure of a target drug molecule according to a chemical molecular structure and generating a target three-dimensional conformation of the target drug molecule;

the first determining module is used for determining a first feature vector corresponding to the target molecular graph structure by using a pre-trained graph neural network model;

the second determining module is used for determining a second feature vector corresponding to the target three-dimensional conformation by using a pre-trained convolutional neural network model, wherein the graph neural network model and the convolutional neural network model are jointly trained through contrastive learning on positive sample pairs and negative sample pairs;

and the input module is used for constructing a third feature vector according to the first feature vector and the second feature vector, and inputting the third feature vector into a pre-trained property prediction model to obtain a property prediction result of the target drug molecule.

According to yet another aspect of the present application, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described method for predicting a property of a drug molecule based on contrastive learning.

According to yet another aspect of the present application, there is provided a computer device comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, the processor implementing the above-mentioned method for predicting a property of a drug molecule based on contrastive learning when executing the program.

Compared with predicting drug molecule properties from molecular fingerprints and molecular descriptors, the method, device and equipment for predicting drug molecule properties based on contrastive learning provided by the present application first construct positive and negative sample pairs, use them to jointly train the graph neural network model and the convolutional neural network model through dual-view contrastive learning, and then apply the two pre-trained models to drug molecule property prediction. Specifically, when a prediction is made, a target molecular graph structure and a target three-dimensional conformation of the target drug molecule are generated according to its chemical molecular structure; a first feature vector corresponding to the target molecular graph structure is determined by the pre-trained graph neural network model, and a second feature vector corresponding to the target three-dimensional conformation is determined by the pre-trained convolutional neural network model; finally, a third feature vector is constructed from the first and second feature vectors and input into a pre-trained property prediction model to obtain a property prediction result for the target drug molecule. The technical scheme thus provides a pre-training strategy that jointly trains on two views, 2D molecular graph data and 3D conformations, so that key 2D and 3D structural information is learned while computation remains efficient. Because pre-training relies only on constructed positive and negative sample pairs, information about both the planar structure and the three-dimensional structure of compounds can be learned from large-scale unlabeled data, and the resulting model generally generalizes better. When a specific downstream task needs to be solved, the pre-trained model can be fine-tuned directly, which avoids training a new model from scratch for each downstream task and alleviates the poor generalization of deep learning models trained in scenarios that lack labeled drug molecules. This improves the efficiency of drug molecule property prediction and helps ensure the accuracy of the predictions.

The foregoing is only an overview of the technical solutions of the present application. So that the technical means of the present application can be understood more clearly and implemented according to the content of the description, and so that the above and other objects, features and advantages of the present application are more readily apparent, specific embodiments of the present application are set out below.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application to the disclosed embodiment. In the drawings:

FIG. 1 is a schematic flow chart of a drug molecule property prediction method based on contrastive learning provided in an embodiment of the present application;

FIG. 2 is a schematic flow chart of another drug molecule property prediction method based on contrastive learning provided in an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a drug molecule property prediction device based on contrastive learning provided in an embodiment of the present application;

FIG. 4 is a schematic structural diagram of another drug molecule property prediction device based on contrastive learning provided in an embodiment of the present application.

Detailed Description

The embodiment of the application can realize the prediction of the drug molecule property based on the artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.

The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.

In recent years, graph neural networks have emerged as a deep learning technique that performs very well on graph data. Graph neural networks based on supervised learning have had great success over the past few years, but they rely on large amounts of manually labeled graph data to learn expressive representations. Large-scale labeled graph data, especially in pharmaceutical chemistry, is often difficult to obtain, and labeling such data usually requires expert knowledge in biochemistry. When labeled data is scarce, supervised graph neural networks can hardly exploit their learning capacity. How to pre-train on large-scale unlabeled molecular data so that a graph network learns latent features and information is therefore both an active research topic and an open difficulty.

Similar to the pre-training tasks of the BERT language model, many researchers have proposed pre-training strategies for molecular graph data. Such a graph-network pre-training strategy first performs self-supervised pre-training at the node level of the graph and then performs multi-task supervised pre-training at the graph level. After a Graph Neural Network (GNN) model has been pre-trained on a large amount of unlabeled graph data, the pre-trained GNN model is fine-tuned on a downstream task. Specifically, a linear classifier is added on top of the graph-level representation to predict the downstream graph labels. The entire model, i.e., the pre-trained GNN and the downstream linear classifier, is then fine-tuned end-to-end.

The pre-training strategy described above treats a molecule as a 2D planar structure, i.e., as molecular graph data with atoms as the nodes of the graph and chemical bonds as the edges. Graph data is constructed for each of a large number of unlabeled molecules and fed into the GNN model for pre-training. However, molecular graph data based on the 2D planar structure ignores the 3D structural information, i.e., the stereochemical information, of the molecule. A pre-training strategy of this kind therefore lacks stereochemical information, and the GNN model does not capture general chemical information well.

Therefore, to address the drawbacks of the above pre-training strategy, the present application proposes a pre-training strategy based on dual-view contrastive learning over molecular 3D conformations and molecular 2D graph data. Specifically, for each unlabeled molecule, three-dimensional (3D) conformation data of its optimal conformation and graph data are constructed; features are extracted from the 3D conformation data and the graph data by a simple Convolutional Neural Network (CNN) model and a Graph Neural Network (GNN) model, respectively; and the models are trained following the idea of contrastive learning. Once pre-training is complete, the resulting CNN model and GNN model are used for a specific downstream task: the features of the 3D conformation and of the 2D graph data are fused, and a property prediction model predicts the properties of the drug molecule from the fused features.

The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

To address the low efficiency and poor prediction performance of current drug molecule property prediction, the present application provides a drug molecule property prediction method based on contrastive learning. As shown in FIG. 1, the method comprises the following steps:

101. Generating a target molecular graph structure of the target drug molecule according to the chemical molecular structure, and generating a target three-dimensional conformation of the target drug molecule.

Drugs can generally be classified into small-molecule chemical drugs and biological macromolecule drugs. Small-molecule drugs are chemically synthesized active small molecules; how a small-molecule drug affects its receptor depends on its affinity and efficacy, and these properties are determined by its chemical structure. In the present application, the target drug molecule is such a chemically synthesized active small molecule with a relative molecular mass of 200 to 700, and unknown drug properties are predicted through intelligent analysis of the chemical structure of the small molecule.

In a specific application scenario, before this step is executed, the chemical molecular structure of the target drug molecule may be extracted in advance, and the target molecular graph structure and the target three-dimensional conformation of the target drug molecule are then generated from it. In this embodiment, each atom in a drug molecule can be represented as a node in the molecular graph structure, and the interaction between atoms (for example, a chemical bond) is represented by an edge between nodes. Nodes can carry different information to express different atom types, and edges can likewise carry different information to express different types of interaction, so that the chemical molecular structure is represented in the computer as a molecular graph structure. The target three-dimensional conformation can be generated with the conventional Distance Geometry method, as follows: generating a distance bounds matrix for the molecule from the connection table information in the chemical molecular structure of the target drug molecule; smoothing the bounds matrix with a triangle bounds smoothing algorithm; randomly generating a distance matrix that satisfies the bounds matrix; mapping the generated distance matrix into three-dimensional space and computing coordinates for each atom; and coarsely optimizing the computed coordinates using a force field and the bounds matrix to obtain the target three-dimensional conformation of the target drug molecule.

The execution subject may be a device for predicting drug molecule properties, which can be deployed on the client side or the server side. The device may, in advance, jointly train a graph neural network model and a convolutional neural network model through contrastive learning on positive sample pairs and negative sample pairs. Then, after the target molecular graph structure and the target three-dimensional conformation of the target drug molecule have been determined, it determines a first feature vector corresponding to the target molecular graph structure by using the pre-trained graph neural network model; determines a second feature vector corresponding to the target three-dimensional conformation by using the pre-trained convolutional neural network model; and finally constructs a third feature vector from the first feature vector and the second feature vector and inputs it into a pre-trained property prediction model to obtain a property prediction result for the target drug molecule.
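The first two Distance Geometry steps above (building a bounds matrix from the connection table and smoothing it with the triangle inequality) can be sketched as follows. This is a minimal illustration, not the application's implementation: the function names `initial_bounds` and `triangle_smooth`, the single shared bond length, and the default bounds for non-bonded atoms are all assumptions made for the example. The later steps (sampling a distance matrix, embedding it in 3D, and force-field refinement) are not shown.

```python
from itertools import product

# All three constants are illustrative assumptions, not values from the application.
BOND_LENGTH = 1.5   # one shared bond length, in angstroms
MIN_CONTACT = 2.0   # lower bound for atoms that are not directly bonded
NO_BOUND = 1000.0   # placeholder upper bound meaning "not yet known"


def initial_bounds(n_atoms, bonds):
    """Build lower and upper distance-bound matrices from a connection table."""
    lower = [[0.0 if i == j else MIN_CONTACT for j in range(n_atoms)]
             for i in range(n_atoms)]
    upper = [[0.0 if i == j else NO_BOUND for j in range(n_atoms)]
             for i in range(n_atoms)]
    for i, j in bonds:
        lower[i][j] = lower[j][i] = BOND_LENGTH
        upper[i][j] = upper[j][i] = BOND_LENGTH
    return lower, upper


def triangle_smooth(lower, upper):
    """Tighten both matrices in place using the triangle inequality."""
    n = len(upper)
    # Upper bounds: d(i, j) <= d(i, k) + d(k, j), a shortest-path style update.
    for k, i, j in product(range(n), repeat=3):
        upper[i][j] = min(upper[i][j], upper[i][k] + upper[k][j])
    # Lower bounds: d(i, j) >= d(i, k) - d(k, j), using the smoothed upper bounds.
    for k, i, j in product(range(n), repeat=3):
        candidate = lower[i][k] - upper[k][j]
        if candidate > lower[i][j]:
            lower[i][j] = lower[j][i] = candidate
    return lower, upper


# Three atoms bonded in a chain: 0-1-2.
lower, upper = initial_bounds(3, [(0, 1), (1, 2)])
lower, upper = triangle_smooth(lower, upper)
print(lower[0][2], upper[0][2])  # distance range allowed between the two end atoms
```

In this toy chain, the two end atoms start with no useful upper bound, and smoothing limits their distance to at most the sum of the two bond lengths.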

102. Determining a first feature vector corresponding to the target molecular graph structure by using a pre-trained graph neural network model.

In this embodiment, a graph neural network can be used to extract the first feature vector of the target drug molecule. The input to a graph neural network is typically a graph structure with node or edge attributes as described above, i.e., the adjacency matrix A of the graph together with the corresponding attribute information X. Its final output depends on the specific task: node classification outputs node labels, graph classification outputs graph labels, and link prediction outputs whether a link exists. Taking a chemical molecular graph as an example, in each iteration the GNN uses the adjacency matrix, the attributes of each node (atom) and the information on the edges (chemical bonds) between nodes to update each node by aggregating the previous-layer features of its neighboring nodes with its own previous-layer features, and usually applies a nonlinear transformation to the aggregated information. By stacking multiple layers, each node can acquire information from neighbors within the corresponding number of hops. For a chemical molecular graph, the hidden vector of an individual node does not represent the molecule well. To represent the molecule as a whole from the topology of the graph, a vector representation of the entire graph can be obtained by, for example, average pooling, so that the graph as a whole is represented by a hidden vector rich in structural information.

Before the graph neural network is applied, it needs to be pre-trained for the task scenario. In the past, training a graph neural network for a pharmaceutical chemistry task depended on the specific task and on a large amount of corresponding labeled data. Drug molecule discovery based on supervised learning has had success in recent years, and several studies have shown that graph neural networks can process drug molecule data well and extract the corresponding features. However, large-scale labeled data, especially in pharmaceutical chemistry, is often difficult to obtain, and labeling it requires expert knowledge of the relevant biochemistry. The fields of natural language processing and computer vision face the same problem.

Fortunately, a large amount of raw, unlabeled chemical molecule data is relatively easy to obtain; because the data carries no labels, learning from it falls under unsupervised learning. How to train on unlabeled chemical molecule data and obtain a pre-trained model with strong generalization ability is a difficulty of current research. To address this, the present application obtains a pre-trained model by self-supervised learning: during learning, a supervision signal is constructed from the input data itself and used to supervise the model, so that latent features and information in the data are learned effectively. Methodologically, current mainstream self-supervised pre-training methods fall into two main categories, generative and contrastive. The present application adopts contrastive learning, whose main idea is to construct positive and negative samples from the input data and have the model distinguish them in the latent representation space. The pre-training task, i.e., the supervision signal, is thus constructed from unlabeled input data, and the graph neural network model is trained in a self-supervised manner using the positive and negative samples.

Correspondingly, in this embodiment, after the graph neural network model has been trained, the molecular graph structure of the target drug molecule can be input into it to obtain the corresponding molecule-level first feature vector.
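The aggregation, nonlinear transformation and average pooling described above can be illustrated with a small pure-Python sketch. The specific choices here (mean aggregation over a node and its neighbors, a ReLU nonlinearity, identity weights, one-hot atom attributes) are assumptions for readability; the application does not fix a particular GNN architecture, and the names `gnn_layer` and `mean_pool` are hypothetical.

```python
def gnn_layer(adjacency, features, weight):
    """One message-passing round: mean over self and neighbors, linear map, ReLU."""
    n = len(adjacency)
    in_dim, out_dim = len(weight), len(weight[0])
    updated = []
    for i in range(n):
        # The node itself plus every node it shares an edge with.
        group = [i] + [j for j in range(n) if adjacency[i][j] and j != i]
        mean = [sum(features[j][d] for j in group) / len(group)
                for d in range(in_dim)]
        updated.append([max(0.0, sum(mean[d] * weight[d][o] for d in range(in_dim)))
                        for o in range(out_dim)])
    return updated


def mean_pool(features):
    """Average all node vectors into one graph-level vector."""
    n = len(features)
    return [sum(row[d] for row in features) / n for d in range(len(features[0]))]


# Heavy atoms of ethanol, C-C-O, as adjacency matrix A and one-hot attributes X.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
X = [[1.0, 0.0],   # carbon
     [1.0, 0.0],   # carbon
     [0.0, 1.0]]   # oxygen
W = [[1.0, 0.0],
     [0.0, 1.0]]   # identity weights, chosen only to keep the numbers readable

h1 = gnn_layer(A, X, W)    # each atom has seen its 1-hop neighbors
h2 = gnn_layer(A, h1, W)   # each atom has seen its 2-hop neighbors
first_feature_vector = mean_pool(h2)
print(h1[0], h2[0], first_feature_vector)
```

After one layer the terminal carbon still carries no oxygen signal, because the oxygen is two bonds away; after the second layer it does, which is the hop-count behavior the paragraph describes.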

103. Determining a second feature vector corresponding to the target three-dimensional conformation by using a pre-trained convolutional neural network model, wherein the graph neural network model and the convolutional neural network model are jointly trained through contrastive learning on positive sample pairs and negative sample pairs.

In this embodiment, a convolutional neural network model can be used to extract the second feature vector of the target drug molecule. A convolutional neural network is a feedforward neural network with a deep structure that involves convolution operations. Its hidden layers commonly include three kinds of structure: convolutional layers, pooling layers and fully connected layers. Some more recent architectures also include more complex structures such as Inception modules and residual blocks.

Before steps 102 and 103 are performed, positive sample pairs and negative sample pairs may be constructed in advance from unlabeled drug molecules. A positive sample pair consists of a molecular graph structure and a three-dimensional conformation that correspond to the same drug molecule, and a negative sample pair consists of a molecular graph structure and a three-dimensional conformation that correspond to different drug molecules. The graph neural network model and the convolutional neural network model can then be jointly trained through contrastive learning on the positive and negative sample pairs, so that the distance between the embedding vectors produced by the two models is smaller for a positive sample pair and larger for a negative sample pair. The purpose of contrastive learning is to pull similar samples closer together and push dissimilar samples further apart; for both positive and negative sample pairs, the distance between embedding vectors is measured by the inner product of the vectors. Through joint training, the positions in vector space of the hidden vectors output by the two models are adjusted so that the distance between homologous vectors decreases and the distance between non-homologous vectors increases.
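As a rough illustration of how a convolutional layer might act on a three-dimensional conformation, the sketch below first turns atom coordinates into an occupancy grid and then applies a single 3D convolution. The voxelization scheme, grid size, all-ones kernel and the helper names `voxelize`, `conv3d` and `flatten` are assumptions for this example; the application does not specify how the conformation is encoded for the CNN, and a practical model would stack several convolutional, pooling and fully connected layers with learned kernels.

```python
from itertools import product


def voxelize(coords, grid_size=4, spacing=1.0):
    """Mark each grid cell that contains at least one atom with 1.0."""
    grid = [[[0.0] * grid_size for _ in range(grid_size)] for _ in range(grid_size)]
    for point in coords:
        ix, iy, iz = (int(c // spacing) for c in point)
        if all(0 <= v < grid_size for v in (ix, iy, iz)):
            grid[ix][iy][iz] = 1.0
    return grid


def conv3d(grid, kernel):
    """Valid 3D convolution (no padding, stride 1) of a cubic grid and cubic kernel.

    As in most deep learning libraries, the kernel is not flipped
    (this is technically cross-correlation).
    """
    g, k = len(grid), len(kernel)
    size = g - k + 1
    out = [[[0.0] * size for _ in range(size)] for _ in range(size)]
    for x, y, z in product(range(size), repeat=3):
        out[x][y][z] = sum(grid[x + a][y + b][z + c] * kernel[a][b][c]
                           for a, b, c in product(range(k), repeat=3))
    return out


def flatten(volume):
    """Unroll a 3D feature map into a flat feature vector."""
    return [v for plane in volume for row in plane for v in row]


# Three atoms in a row along the x axis (coordinates in arbitrary units).
coords = [(0.5, 0.5, 0.5), (1.5, 0.5, 0.5), (2.5, 0.5, 0.5)]
kernel = [[[1.0] * 2 for _ in range(2)] for _ in range(2)]  # counts atoms per 2x2x2 window
feature_map = conv3d(voxelize(coords), kernel)
second_feature_vector = flatten(feature_map)
print(len(second_feature_vector), max(second_feature_vector))
```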

104. Constructing a third feature vector according to the first feature vector and the second feature vector, and inputting the third feature vector into a pre-trained property prediction model to obtain a property prediction result of the target drug molecule.

The property prediction model may be any existing machine learning model, for example, a linear regression model, a decision tree model, a neural network model, a support vector machine model or a hidden Markov model, and is not specifically limited in the present application. The property prediction result may include one or more of target binding prediction, activity prediction, toxicity prediction, efficacy prediction, water solubility prediction, adverse reaction prediction, prediction of the treatment effect for a given disease, and the like; the type of property prediction can be set according to the actual application scenario and is not specifically limited in this scheme. It should be noted that, before this step is executed, the property prediction model needs to be trained in advance on labeled samples so that the pre-trained property prediction model can be used to predict the properties of the target drug molecule.

In this embodiment, after the first feature vector for the target molecular graph structure and the second feature vector for the target three-dimensional conformation of the target drug molecule have been obtained in steps 102 and 103, the two vectors are fused, and the resulting third feature vector is input into the pre-trained property prediction model to obtain the property prediction result for the target drug molecule.
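The fusion and prediction step can be sketched as below. Concatenation is assumed as the fusion operation and a linear model as the property prediction model; the application only says the vectors are fused and lists linear regression as one admissible model. The function names, weights and bias are made up for the example.

```python
def fuse(first_vector, second_vector):
    """Build the third feature vector by concatenation (an assumed fusion choice)."""
    return list(first_vector) + list(second_vector)


def linear_property_model(third_vector, weights, bias):
    """A linear regression head standing in for the property prediction model."""
    if len(third_vector) != len(weights):
        raise ValueError("weights must match the fused vector length")
    return sum(w * v for w, v in zip(weights, third_vector)) + bias


first_vector = [0.5, 0.25]   # e.g. output of the graph neural network
second_vector = [1.0]        # e.g. output of the convolutional neural network
third_vector = fuse(first_vector, second_vector)
# Illustrative parameters; in practice they come from training on labeled samples.
prediction = linear_property_model(third_vector, weights=[2.0, 4.0, -1.0], bias=0.5)
print(third_vector, prediction)
```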

With the drug molecule property prediction method based on contrastive learning of this embodiment, positive and negative sample pairs are constructed first and used to jointly train the graph neural network model and the convolutional neural network model through dual-view contrastive learning, after which the two pre-trained models are applied to drug molecule property prediction. Specifically, when a prediction is made, a target molecular graph structure and a target three-dimensional conformation of the target drug molecule are generated according to its chemical molecular structure; a first feature vector corresponding to the target molecular graph structure is determined by the pre-trained graph neural network model, and a second feature vector corresponding to the target three-dimensional conformation is determined by the pre-trained convolutional neural network model; finally, a third feature vector is constructed from the first and second feature vectors and input into a pre-trained property prediction model to obtain a property prediction result for the target drug molecule. The technical scheme thus provides a pre-training strategy that jointly trains on two views, 2D molecular graph data and 3D conformations, so that key 2D and 3D structural information is learned while computation remains efficient. Because pre-training relies only on constructed positive and negative sample pairs, information about both the planar structure and the three-dimensional structure of compounds can be learned from large-scale unlabeled data, and the resulting model generally generalizes better. When a specific downstream task needs to be solved, the pre-trained model can be fine-tuned directly, which avoids training a new model from scratch for each downstream task and alleviates the poor generalization of deep learning models trained in scenarios that lack labeled drug molecules. This improves the efficiency of drug molecule property prediction and helps ensure the accuracy of the predictions.

Further, as a refinement and an extension of the embodiments of the above embodiments, in order to fully illustrate the implementation process in this embodiment, another drug molecule property prediction method based on comparative learning is provided, as shown in fig. 2, the method includes:

201. and on the basis of the positive sample pair and the negative sample pair, the neural network model of the joint training graph and the convolutional neural network model are trained through comparison and learning.

In a specific application scenario, for the present embodiment, the embodiment step 202 may specifically include: obtaining a first drug molecule and a second drug molecule which are not marked, wherein the first drug molecule and the second drug molecule have different corresponding chemical molecular structures; generating a first molecular graph structure and a first three-dimensional conformation of a first drug molecule, and generating a second molecular graph structure and a second three-dimensional conformation of a second drug molecule; constructing a positive sample pair by using the first molecular graph structure and the first three-dimensional conformation and/or by using the second molecular graph structure and the second three-dimensional conformation, and constructing a negative sample pair by using the first molecular graph structure and the second three-dimensional conformation and/or by using the second molecular graph structure and the first three-dimensional conformation; and training the graph neural network model and the convolutional neural network model by utilizing the positive sample pair and the negative sample pair in a combined manner, and adjusting model parameters of the graph neural network model and/or the convolutional neural network model, so that the embedded vector distance of the graph neural network model and the convolutional neural network model under the positive sample pair is smaller than a first preset threshold value, and the embedded vector distance under the negative sample pair is larger than a second preset threshold value, wherein the second preset threshold value is larger than the first preset threshold value.
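The pair construction and the threshold condition described above can be sketched as follows. The string placeholders stand in for real molecular graph structures and conformations, and the names `build_pairs` and `thresholds_met` as well as the numeric distances and thresholds are hypothetical.

```python
def build_pairs(graphs, conformations):
    """graphs[i] and conformations[i] must come from the same molecule i."""
    if len(graphs) != len(conformations):
        raise ValueError("need one conformation per molecular graph")
    n = len(graphs)
    # Same molecule on both sides: positive pair.
    positives = [(graphs[i], conformations[i]) for i in range(n)]
    # Graph of one molecule with the conformation of another: negative pair.
    negatives = [(graphs[i], conformations[j])
                 for i in range(n) for j in range(n) if i != j]
    return positives, negatives


def thresholds_met(positive_distances, negative_distances,
                   first_threshold, second_threshold):
    """Check the stopping condition on embedding distances stated in the text."""
    if second_threshold <= first_threshold:
        raise ValueError("the second threshold must exceed the first")
    return (all(d < first_threshold for d in positive_distances)
            and all(d > second_threshold for d in negative_distances))


positives, negatives = build_pairs(["graph_A", "graph_B"], ["conf_A", "conf_B"])
print(positives)
print(negatives)
print(thresholds_met([0.1, 0.2], [0.9, 1.1], first_threshold=0.3, second_threshold=0.8))
```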

In this embodiment, besides judging whether the two members of a sample pair are homologous (derived from the same drug molecule), the training objective is based on maximizing mutual information. InfoNCE is additionally used as the loss function to estimate the distance between the embedded vectors of the positive sample pairs and of the negative sample pairs, with the distance computed from the inner product of the vectors. Minimizing the distance of the positive sample pairs and maximizing the distance of the negative sample pairs improves the generalization performance of the model and lets it fully learn the mutual information between the two views of a molecule (2D graph and 3D conformation).

The formula of the InfoNCE loss function is as follows:

where f refers to a graph neural network with trainable parameters, x refers to the raw data (the raw molecular graph data, also commonly called the anchor data point), x⁺ refers to data similar or equal to x (the 3D voxel data), x^j refers to the j-th constructed negative sample, and N refers to the number of negative samples. The purpose of contrastive learning is to shorten the distance between similar samples and to increase the distance between dissimilar samples, where the distance from a positive or negative sample to the anchor data point is measured by the inner product of the vectors. Minimizing the InfoNCE loss function is equivalent to maximizing a lower bound on the mutual information between the positive sample and the anchor data, thereby enabling the graph neural network to learn the information shared between the local 3D conformation and the 2D planar graph data.
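As an illustration only, the loss described above can be sketched in Python. The formula is assumed here to take the commonly used InfoNCE form, L = -log( exp(f(x)·f(x⁺)) / ( exp(f(x)·f(x⁺)) + Σ_{j=1..N} exp(f(x)·f(x^j)) ) ), which is consistent with the symbols defined above. The function name info_nce and the example embeddings are hypothetical and do not come from the application.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(anchor, positive, negatives):
    """InfoNCE loss for one anchor embedding (assumed standard form).

    Similarity is the inner product, as stated in the description.
    """
    pos_score = dot(anchor, positive)
    neg_scores = [dot(anchor, n) for n in negatives]
    # Subtract the largest score before exponentiating for numerical stability.
    m = max([pos_score] + neg_scores)
    denom = math.exp(pos_score - m) + sum(math.exp(s - m) for s in neg_scores)
    return -((pos_score - m) - math.log(denom))


# Hypothetical embeddings: the anchor plays the role of f(x) for a molecular graph.
anchor = [1.0, 0.0]
negatives = [[-1.0, 0.0], [0.0, -1.0]]
loss_matched = info_nce(anchor, [0.9, 0.1], negatives)      # positive close to anchor
loss_mismatched = info_nce(anchor, [-0.9, 0.1], negatives)  # positive far from anchor
```

In this sketch the loss is lower when the positive embedding lies close to the anchor, which is the behaviour the training objective relies on.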

In addition, in order to construct positive and negative sample pairs, this embodiment designs a corresponding self-supervised training task. In the preprocessing step, the graph data of the first drug molecule and/or the second drug molecule and the corresponding 3D conformation data form a positive sample pair; correspondingly, the graph data of the first drug molecule and the 3D conformation data of the second drug molecule form a negative sample pair, or the 3D conformation data of the first drug molecule and the graph data of the second drug molecule form a negative sample pair. As an alternative, when generating negative sample pairs, with 50% probability a corresponding number of different graph data are randomly selected from the graph database and paired with the 3D conformation data of the first drug molecule and/or the second drug molecule; with the other 50% probability, a corresponding number of different 3D conformations are randomly selected from the 3D conformation database and paired with the graph data of the first drug molecule and/or the second drug molecule. The way negative samples are constructed is thus adjusted dynamically and randomly during training. Specifically, one original molecular graph and its corresponding 3D conformation data form a positive sample pair, and a corresponding number of randomly selected graph data or 3D conformation data are used to form the negative sample pairs.
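The sampling scheme above can be sketched as follows. This is a minimal illustration under stated assumptions: graphs[i] and conformations[i] are taken to belong to the same molecule, strings stand in for real graph and voxel data, and the function name build_pairs and the label convention (1 for positive, 0 for negative) are hypothetical.

```python
import random


def build_pairs(graphs, conformations, num_negatives, rng):
    """Return (graph, conformation, label) triples; label 1 = positive, 0 = negative."""
    pairs = []
    n = len(graphs)
    for i in range(n):
        # Positive pair: graph and 3D conformation of the same molecule.
        pairs.append((graphs[i], conformations[i], 1))
        others = [j for j in range(n) if j != i]
        chosen = rng.sample(others, num_negatives)
        if rng.random() < 0.5:
            # Keep the conformation of molecule i and pair it with other graphs.
            for j in chosen:
                pairs.append((graphs[j], conformations[i], 0))
        else:
            # Keep the graph of molecule i and pair it with other conformations.
            for j in chosen:
                pairs.append((graphs[i], conformations[j], 0))
    return pairs


graphs = ["g0", "g1", "g2", "g3"]          # placeholders for molecular graphs
conformations = ["c0", "c1", "c2", "c3"]   # placeholders for 3D conformations
pairs = build_pairs(graphs, conformations, num_negatives=2, rng=random.Random(0))
```

Because the 50% choice is redrawn for every anchor, the mix of the two negative types changes from call to call, which mirrors the dynamic adjustment described above.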

It should be noted that, when the graph neural network model and the convolutional neural network model are jointly trained, both the correctly paired positive sample pairs and the incorrectly paired negative sample pairs contain a molecular graph structure and a three-dimensional conformation. During contrastive learning, the molecular graph structure of a sample pair is input into the graph neural network model while the three-dimensional conformation of the same pair is input into the convolutional neural network model, and each model outputs an embedding vector for the pair. Because the pairing of each sample pair is known (correct or incorrect), the state of the joint training can be judged from the distance between the two embedding vectors, and the model parameters of both models are adjusted accordingly, so that the finally trained graph neural network model and convolutional neural network model minimize the embedding vector distance for positive sample pairs and maximize it for negative sample pairs.

202. Generating a target molecular graph structure of the target drug molecule according to the chemical molecular structure, and generating a target three-dimensional conformation of the target drug molecule.

The target molecular graph structure carries an adjacency matrix and attribute information. The adjacency matrix is an n x n matrix representing the connection relations between nodes, in which an element is 1 if the two corresponding nodes are connected and 0 otherwise, and n is the number of nodes contained in the target small molecule. The attribute information includes a node (atom) initial feature vector and an edge initial feature vector, which are determined according to preset vector generation rules. The node initial feature vector is generated according to a first preset vector generation rule, shown in table 1; it may be a 27-bit feature vector composed of 6 bits for the number of chemical bonds + 5 bits for the formal charge + 4 bits for atomic chirality + 5 bits for the number of bound hydrogen atoms + 5 bits for the atomic orbital + 1 bit for aromaticity + 1 bit for atomic mass. The edge initial feature vector is generated according to a second preset vector generation rule, shown in table 2; it may be a 12-bit feature vector composed of 4 bits for the chemical bond type + 1 bit for conjugation + 1 bit for ring membership + 6 bits for stereochemistry.

TABLE 1

TABLE 2
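A minimal sketch of the data carried by the molecular graph structure is given below. The per-field bit widths are taken from the description above; the field names, the choice list passed to one_hot, and the three-atom example (the heavy atoms C-C-O of ethanol) are illustrative assumptions and are not the rules of tables 1 and 2.

```python
def adjacency_matrix(num_atoms, bonds):
    """Build a symmetric n x n 0/1 matrix from a list of (v, w) bonds."""
    adj = [[0] * num_atoms for _ in range(num_atoms)]
    for v, w in bonds:
        adj[v][w] = 1
        adj[w][v] = 1
    return adj


def one_hot(value, choices):
    """Encode value as a 0/1 vector with one position per allowed choice."""
    return [1 if value == c else 0 for c in choices]


# Bit widths per field as listed in the description (field names are hypothetical).
NODE_FIELD_BITS = {
    "bond_count": 6, "formal_charge": 5, "chirality": 4,
    "num_hydrogens": 5, "orbital": 5, "aromatic": 1, "mass": 1,
}
EDGE_FIELD_BITS = {"bond_type": 4, "conjugated": 1, "in_ring": 1, "stereo": 6}

# Heavy-atom skeleton of ethanol: atom 0 (C) - atom 1 (C) - atom 2 (O).
ethanol_adj = adjacency_matrix(3, [(0, 1), (1, 2)])

# Example of one 6-bit field: an atom with 2 bonds, assuming choices 0..5.
bond_count_bits = one_hot(2, list(range(6)))
```

The totals of the two dictionaries reproduce the 27-bit node vector and the 12-bit edge vector stated in the text.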

203. Inputting the target molecular graph structure and the adjacency matrix and attribute information carried in the target molecular graph structure into a graph neural network model which is trained in advance, and obtaining node hidden vectors of all nodes in the target molecular graph structure.

In this embodiment, the target molecular graph structure, together with the adjacency matrix and attribute information it carries, may be input into the graph neural network model, and the node hidden vector of each node in the target molecular graph structure is obtained through iterative learning of the graph neural network model.

In particular, the main process of graph neural network model learning is to iteratively aggregate and update the neighbor information of nodes in graph data. In one iteration, each node updates its own information by aggregating the characteristics of neighboring nodes and the characteristics of its previous layer, and usually performs nonlinear transformation on the aggregated information. By stacking the multi-layer network, each node can acquire neighbor node information within a corresponding hop count.

The learning of the graph neural network model can be understood in terms of node message passing, which involves two phases: a message passing phase and a readout phase. The message passing phase is a forward propagation phase that runs for T steps; in each step, messages are obtained through a message function M_t and the nodes are updated through an update function U_t.

The message function M_t and the update function U_t are characterized by the formula:

wherein e_vw is the feature vector of the edge from node v to node w.

The readout phase computes a feature vector that represents the whole graph, implemented using a function R whose formula is characterized by:

wherein T denotes the total number of time steps. The functions M_t, U_t and R may use different model settings, such as a Graph Convolutional Network (GCN), a Graph Attention Network (GAT), and the like.
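The message passing and readout phases described above can be sketched as follows. The sketch assumes the widely used message passing formulation, in which m_v = Σ_{w in N(v)} M_t(h_v, h_w, e_vw), the update is h_v ← U_t(h_v, m_v), and the graph vector is R applied to the final node vectors. The concrete choices here (scalar edge weights, M as a weighted copy of the neighbour vector, U as tanh of a sum, R as a sum) are hypothetical simplifications, not the functions used in the application.

```python
import math


def message_passing(h, neighbors, edge_weight, steps):
    """Run `steps` rounds of sum-aggregation message passing.

    h: dict node -> feature list; neighbors: dict node -> list of nodes;
    edge_weight: dict frozenset({v, w}) -> scalar edge feature e_vw.
    """
    for _ in range(steps):
        new_h = {}
        for v, h_v in h.items():
            # Message phase: accumulate M(h_v, h_w, e_vw) = e_vw * h_w over neighbours.
            m_v = [0.0] * len(h_v)
            for w in neighbors[v]:
                e_vw = edge_weight[frozenset((v, w))]
                for k, x in enumerate(h[w]):
                    m_v[k] += e_vw * x
            # Update phase: U(h_v, m_v) = tanh(h_v + m_v).
            new_h[v] = [math.tanh(a + b) for a, b in zip(h_v, m_v)]
        h = new_h
    return h


def readout(h):
    """R: sum the final node vectors into one graph-level vector."""
    dim = len(next(iter(h.values())))
    return [sum(vec[k] for vec in h.values()) for k in range(dim)]


# Path graph 0 - 1 - 2 with hypothetical initial node features.
h0 = {0: [1.0, 0.0], 1: [0.0, 0.0], 2: [0.0, 1.0]}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
edge_weight = {frozenset((0, 1)): 1.0, frozenset((1, 2)): 1.0}
after_one = message_passing(h0, neighbors, edge_weight, steps=1)
after_two = message_passing(h0, neighbors, edge_weight, steps=2)
graph_vector = readout(after_two)
```

After one step, node 0 has seen only its direct neighbour; after two steps it also carries information that originated at node 2, which illustrates how stacking steps widens the hop range.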

The central idea of using a graph neural network model to learn a molecular representation can be understood as follows: if different nodes and different edges are each expressed by initial feature vectors, a final, stable feature vector for each node can be found through iterative message propagation. After a fixed number of steps, for example T steps, the feature vector of each node reaches a certain equilibrium and no longer changes. Compared with the original node feature vector, the final stable feature vector of each node therefore also contains information about its neighboring nodes and the entire graph (for example, if certain atomic nodes in a chemical molecule contribute most to a certain property of the molecule, this is reflected more specifically in their final feature vectors).

204. Generating a first feature vector of the target drug molecule by using the node hidden vectors of all the nodes.

In this embodiment, after the node hidden vector of each node in the target molecular graph structure has been determined in step 203, a vector representation of the whole target molecular graph structure can be obtained from the node hidden vectors (for example, a molecule-level representation of the whole compound is extracted from the features of the atomic nodes and the chemical bond information of the edges connecting the atoms). As a preferred mode, step 204 may specifically include: calculating the average of the node hidden vectors and determining this average as the first feature vector of the target drug molecule; or, determining the node hidden vector with the maximum hidden vector value as the first feature vector.
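The two options in step 204 can be sketched as below. The text does not define how the "maximum hidden vector value" is measured, so this sketch assumes, purely for illustration, that it means the node vector with the largest Euclidean norm; the function names are hypothetical.

```python
import math


def mean_readout(node_vectors):
    """Average the node hidden vectors element-wise."""
    n = len(node_vectors)
    dim = len(node_vectors[0])
    return [sum(v[k] for v in node_vectors) / n for k in range(dim)]


def max_norm_readout(node_vectors):
    """Pick the node hidden vector with the largest Euclidean norm (assumed criterion)."""
    return max(node_vectors, key=lambda v: math.sqrt(sum(x * x for x in v)))


node_vectors = [[1.0, 2.0], [3.0, 4.0]]   # hypothetical node hidden vectors
first_feature_mean = mean_readout(node_vectors)
first_feature_max = max_norm_readout(node_vectors)
```

Either result has the same dimension as a single node hidden vector, so the downstream model sees a fixed-size input regardless of how many atoms the molecule has.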

205. Determining a second feature vector corresponding to the target three-dimensional conformation by using the pre-trained convolutional neural network model.

The convolutional neural network model comprises a data input layer, a convolution calculation layer, a pooling layer and a fully-connected layer. In this embodiment, the target three-dimensional conformation is passed through the data input layer to the convolution calculation layer, where a convolution operation produces a feature map; the pooling layer then performs a pooling operation on the feature map. After iterative convolution and pooling through multiple convolution calculation layers and pooling layers, a plurality of (e.g., 5) feature maps, i.e., a plurality of (e.g., 5) matrices, are obtained. These matrices are then expanded by rows, concatenated into a vector, and passed to the fully-connected layer, which is a back-propagation (BP) neural network; each feature map can be regarded as neurons arranged in matrix form. After the multilayer convolution, a hidden variable representation of the 3D conformation (the second feature vector) is obtained, which represents the features of the conformation well. Correspondingly, step 205 may specifically include: inputting the target three-dimensional conformation through the data input layer into the pre-trained convolutional neural network model, and performing iterative convolution and pooling in the convolution calculation layer and the pooling layer to obtain feature maps; and expanding the feature maps by rows and passing them into the fully-connected layer to obtain the second feature vector corresponding to the target three-dimensional conformation.
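The convolution, pooling, flattening and fully-connected stages described above can be sketched in plain Python. This is a toy illustration under stated assumptions: the conformation is represented as a cubic voxel grid, a single convolution and pooling stage is used, and the kernels and fully-connected weights are fixed hypothetical values rather than trained parameters.

```python
def conv3d(vol, kernel):
    """Valid 3D cross-correlation of a cubic volume with a cubic kernel, then ReLU."""
    n, k = len(vol), len(kernel)
    out_n = n - k + 1
    out = [[[0.0] * out_n for _ in range(out_n)] for _ in range(out_n)]
    for x in range(out_n):
        for y in range(out_n):
            for z in range(out_n):
                s = 0.0
                for i in range(k):
                    for j in range(k):
                        for l in range(k):
                            s += vol[x + i][y + j][z + l] * kernel[i][j][l]
                out[x][y][z] = max(0.0, s)
    return out


def max_pool3d(vol, size=2):
    """Non-overlapping 3D max pooling with a cubic window."""
    n = len(vol) // size
    return [[[max(vol[x * size + i][y * size + j][z * size + l]
                  for i in range(size) for j in range(size) for l in range(size))
              for z in range(n)] for y in range(n)] for x in range(n)]


def flatten(vol):
    """Expand a 3D feature map row by row into a flat list."""
    return [v for plane in vol for row in plane for v in row]


def fully_connected(vec, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]


def cube(value, n):
    """Helper: n x n x n volume filled with a constant."""
    return [[[value] * n for _ in range(n)] for _ in range(n)]


voxels = cube(1.0, 4)                     # hypothetical 4x4x4 occupancy grid
kernels = [cube(1.0, 3), cube(-1.0, 3)]   # two fixed 3x3x3 kernels
features = []
for kernel in kernels:
    features += flatten(max_pool3d(conv3d(voxels, kernel)))
embedding = fully_connected(
    features, weights=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], bias=[0.0, 0.0, 0.0]
)
```

Here `embedding` plays the role of the second feature vector; a practical model would use several stacked stages and learned weights.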

206. Constructing a third feature vector according to the first feature vector and the second feature vector, and inputting the third feature vector into a pre-trained property prediction model to obtain a property prediction result of the target drug molecule.

In a specific application scenario, before step 206 is executed, this embodiment further includes: training a preset property prediction model by using, as training samples, sample feature vectors that match the preset property prediction task corresponding to the target drug molecule; and calculating a loss function of the property prediction model, and determining that training of the property prediction model is complete when the loss function is smaller than a third preset threshold. The loss function represents the prediction error of the model's prediction result relative to the labeled result of the sample. The third preset threshold is a value between 0 and 1 that characterizes the training precision of the property prediction model; the closer the threshold is to 1, the higher the training precision. Its specific value can be set according to the actual application scenario and is not specifically limited here. The property prediction model may be any existing machine learning model, for example a linear regression model, a decision tree model, a neural network model, a support vector machine model, or a hidden Markov model, and may be selected according to the actual application requirements, which is not specifically limited in this application.
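The stopping rule described above (training is considered complete once the loss falls below a preset threshold) can be sketched with a linear regression model, one of the model types the text lists. The mean squared error loss, the learning rate, the epoch cap and the toy data are assumptions made for illustration only.

```python
def train_until_threshold(samples, labels, threshold, lr=0.05, max_epochs=10000):
    """Fit y = w.x + b by gradient descent; stop when the MSE drops below threshold."""
    dim = len(samples[0])
    w = [0.0] * dim
    b = 0.0
    loss = float("inf")
    for _ in range(max_epochs):
        preds = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in samples]
        errors = [p - y for p, y in zip(preds, labels)]
        loss = sum(e * e for e in errors) / len(errors)
        if loss < threshold:
            break  # training is judged complete
        # Gradients of the MSE with respect to each weight and the bias.
        for k in range(dim):
            grad_k = 2 * sum(e * x[k] for e, x in zip(errors, samples)) / len(errors)
            w[k] -= lr * grad_k
        b -= lr * 2 * sum(errors) / len(errors)
    return w, b, loss


# Toy data generated from y = 2x + 1; each sample stands in for a fused feature vector.
samples = [[0.0], [1.0], [2.0], [3.0]]
labels = [1.0, 3.0, 5.0, 7.0]
weights, bias, final_loss = train_until_threshold(samples, labels, threshold=1e-3)
```

The epoch cap is a safeguard added in this sketch so that the loop terminates even if the threshold is never reached.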

Correspondingly, as a preferred mode, step 206 of this embodiment may specifically include: performing vector fusion on the first feature vector and the second feature vector according to a preset vector fusion rule to obtain the third feature vector; and inputting the third feature vector as an input feature into the pre-trained property prediction model to obtain the property prediction result of the target drug molecule. The preset vector fusion rule may include: concatenating the first feature vector after the second feature vector to obtain the third feature vector; or concatenating the second feature vector after the first feature vector to obtain the third feature vector; or adding the first feature vector and the second feature vector to obtain the third feature vector; and the like.
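The three fusion rules listed above can be sketched as follows. The function name fuse and the rule labels are hypothetical; element-wise addition is assumed to require vectors of equal length.

```python
def fuse(first, second, rule="first_then_second"):
    """Build the third feature vector from the first and second feature vectors."""
    if rule == "first_then_second":
        return first + second          # second vector appended after the first
    if rule == "second_then_first":
        return second + first          # first vector appended after the second
    if rule == "add":
        if len(first) != len(second):
            raise ValueError("element-wise addition needs equal-length vectors")
        return [a + b for a, b in zip(first, second)]
    raise ValueError(f"unknown fusion rule: {rule}")


graph_vec = [0.1, 0.2]        # hypothetical first feature vector (graph model)
conformation_vec = [0.5, 0.5]  # hypothetical second feature vector (CNN model)
third_concat = fuse(graph_vec, conformation_vec)
third_sum = fuse(graph_vec, conformation_vec, rule="add")
```

Concatenation keeps the two views separate at the cost of a longer input, while addition keeps the dimension fixed but requires both models to emit vectors of the same size.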

With the drug molecule property prediction method based on contrastive learning provided in this embodiment, positive and negative sample pairs are first constructed and used to jointly train the graph neural network model and the convolutional neural network model through dual-view contrastive learning, so that the pre-trained graph neural network model and convolutional neural network model can then be used for drug molecule property prediction. During prediction, a target molecular graph structure and a target three-dimensional conformation of the target drug molecule are generated according to its chemical molecular structure; a first feature vector corresponding to the target molecular graph structure is determined by the pre-trained graph neural network model, and a second feature vector corresponding to the target three-dimensional conformation is determined by the pre-trained convolutional neural network model; finally, a third feature vector is constructed from the first feature vector and the second feature vector and input into the pre-trained property prediction model to obtain the property prediction result of the target drug molecule. The technical scheme of the application provides a pre-training strategy that jointly trains on 2D molecular graph structure data and 3D conformations from two views, so that key 2D and 3D structural information can be learned while computation remains efficient. Because pre-training uses constructed positive and negative sample pairs, information about the planar structure and the three-dimensional structure of compounds can be learned from large-scale unlabeled data, and the resulting model generally generalizes better. When a specific downstream task needs to be solved, the pre-trained model can be fine-tuned directly, which avoids training a brand-new model from scratch for each downstream task, alleviates the insufficient generalization of deep learning models trained in scenarios that lack labeled drug molecules, improves the efficiency of drug molecule property prediction, and ensures the accuracy of the predicted properties.

Further, as a specific implementation of the method shown in fig. 1 and fig. 2, an embodiment of the present application provides a drug molecule property prediction device based on comparative learning, as shown in fig. 3, the device includes: a first generation module 31, a first determination module 32, a second determination module 33, and an input module 34;

a first generation module 31, configured to generate a target molecular graph structure of the target drug molecule according to the chemical molecular structure, and generate a target three-dimensional conformation of the target drug molecule;

a first determining module 32, configured to determine a first feature vector corresponding to the target molecular graph structure by using a pre-trained graph neural network model;

a second determining module 33, configured to determine a second feature vector corresponding to the target three-dimensional conformation by using a pre-trained convolutional neural network model, where the graph neural network model and the convolutional neural network model are obtained by joint training through contrastive learning on positive sample pairs and negative sample pairs;

and the input module 34 is configured to construct a third feature vector according to the first feature vector and the second feature vector, and input the third feature vector into a pre-trained property prediction model to obtain a property prediction result of the target drug molecule.

In a specific application scenario, in order to implement joint training of the graph neural network model and the convolutional neural network model through contrastive learning, as shown in fig. 4, the apparatus further includes: an obtaining module 35, a second generation module 36, a construction module 37 and a first training module 38;

the obtaining module 35 may be configured to obtain a first drug molecule and a second drug molecule without labels, where the first drug molecule and the second drug molecule have different corresponding chemical molecular structures;

a second generation module 36 operable to generate a first molecular graph structure and a first three-dimensional conformation of the first drug molecule, and to generate a second molecular graph structure and a second three-dimensional conformation of the second drug molecule;

a construction module 37, operable to construct a positive sample pair using the first molecular graph structure and the first three-dimensional conformation, and/or using the second molecular graph structure and the second three-dimensional conformation, and to construct a negative sample pair using the first molecular graph structure and the second three-dimensional conformation, and/or using the second molecular graph structure and the first three-dimensional conformation;

the first training module 38 is configured to jointly train the graph neural network model and the convolutional neural network model by using the positive sample pair and the negative sample pair, and adjust model parameters of the graph neural network model and/or the convolutional neural network model, so that an embedded vector distance of the graph neural network model and the convolutional neural network model under the positive sample pair is smaller than a first preset threshold, and an embedded vector distance under the negative sample pair is larger than a second preset threshold, where the second preset threshold is larger than the first preset threshold.

In a specific application scenario, the target molecular graph structure carries an adjacency matrix and attribute information, wherein the attribute information comprises a node initial feature vector and an edge initial feature vector, and the node initial feature vector and the edge initial feature vector are determined according to a preset vector generation rule; correspondingly, the first determining module 32 is specifically configured to input the target molecular graph structure, the adjacency matrix and the attribute information into a pre-trained graph neural network model to obtain the node hidden vector of each node in the target molecular graph structure, and to generate the first feature vector of the target drug molecule by using the node hidden vectors of all the nodes.

In a specific application scenario, when the node hidden vectors of each node are used to generate the first feature vector of the target drug molecule, the first determining module 32 may be specifically configured to calculate a hidden vector average value of the node hidden vectors, and determine the hidden vector average value as the first feature vector of the target drug molecule; or, determining the node hidden vector with the maximum corresponding hidden vector value as the first feature vector.

In a specific application scenario, the convolutional neural network model comprises a data input layer, a convolution calculation layer, a pooling layer and a fully-connected layer; correspondingly, the second determining module 33 is specifically configured to input the target three-dimensional conformation through the data input layer into the pre-trained convolutional neural network model, and perform iterative convolution and pooling in the convolution calculation layer and the pooling layer to obtain feature maps; and to expand the feature maps by rows and pass them into the fully-connected layer to obtain the second feature vector corresponding to the target three-dimensional conformation.

In a specific application scenario, when a third feature vector is constructed according to the first feature vector and the second feature vector and is input into a pre-trained property prediction model to obtain a property prediction result of the target drug molecule, the input module 34 is specifically configured to perform vector fusion processing on the first feature vector and the second feature vector according to a preset vector fusion rule to obtain the third feature vector; and inputting the third feature vector serving as an input feature into a pre-trained property prediction model to obtain a property prediction result of the target drug molecule.

In a specific application scenario, to implement pre-training of the property prediction model, as shown in fig. 4, the apparatus further includes: a second training module 39, a calculation module 310;

the second training module 39 is configured to train a preset property prediction model by using the sample feature vector matched with the preset property prediction task corresponding to the target drug molecule as a training sample;

and the calculating module 310 may be configured to calculate a loss function of the property prediction model, and when the loss function is smaller than a third preset threshold, determine that the property prediction model training is completed.

It should be noted that other corresponding descriptions of the functional units related to the drug molecule property prediction apparatus based on contrast learning provided in this embodiment may refer to the corresponding descriptions in fig. 1 to fig. 2, and are not repeated herein.

Based on the method shown in fig. 1 to 2, correspondingly, the present embodiment further provides a storage medium, which may be volatile or nonvolatile, and has computer readable instructions stored thereon, and when the computer readable instructions are executed by a processor, the method for predicting the property of the drug molecule based on the comparative learning shown in fig. 1 to 2 is implemented.

Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and includes several instructions that enable a computer device (such as a personal computer, a server, or a network device) to execute the method of the embodiments of the present application.

Based on the method shown in fig. 1 to fig. 2 and the virtual device embodiments shown in fig. 3 and fig. 4, in order to achieve the above object, the present embodiment further provides a computer device, where the computer device includes a storage medium and a processor; a storage medium for storing a computer program; a processor for executing a computer program to implement the method for predicting a property of a drug molecule based on comparative learning as described above with reference to fig. 1 to 2.

Optionally, the computer device may further include a user interface, a network interface, a camera, Radio Frequency (RF) circuitry, a sensor, audio circuitry, a Wi-Fi module, and so forth. The user interface may include a display screen and an input unit such as a keyboard, and may optionally also include a USB interface, a card reader interface, and the like. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a Wi-Fi interface), and the like.

It will be understood by those skilled in the art that the computer device structure provided in this embodiment does not limit the physical device, which may include more or fewer components, combine some components, or arrange the components differently.

The storage medium may further include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the computer device described above, supporting the operation of information handling programs and other software and/or programs. The network communication module is used for realizing communication among components in the storage medium and communication with other hardware and software in the information processing entity device.

Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus a necessary general hardware platform, and can also be implemented by hardware.

Those skilled in the art will appreciate that the figures are merely schematic representations of one preferred implementation scenario and that the blocks or flow diagrams in the figures are not necessarily required to practice the present application. Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.

The above serial numbers of the present application are for description only and do not represent the superiority or inferiority of the implementation scenarios. The above disclosure covers only a few specific implementation scenarios of the present application, but the present application is not limited thereto, and any variation that can be conceived by those skilled in the art is intended to fall within the scope of the present application.

Claims

1. A method for predicting the property of a drug molecule based on comparative learning is characterized by comprising the following steps:

2. The method of claim 1, wherein before determining the first feature vector corresponding to the target molecular graph structure using the pre-trained graph neural network model, the method further comprises:

obtaining an unlabeled first drug molecule and an unlabeled second drug molecule, wherein the first drug molecule and the second drug molecule have different chemical molecular structures;

generating a first molecular graph structure and a first three-dimensional conformation of the first drug molecule, and generating a second molecular graph structure and a second three-dimensional conformation of the second drug molecule;

constructing a positive sample pair using the first molecular graph structure and the first three-dimensional conformation, and/or using the second molecular graph structure and the second three-dimensional conformation; and constructing a negative sample pair using the first molecular graph structure and the second three-dimensional conformation, and/or using the second molecular graph structure and the first three-dimensional conformation;

jointly training a graph neural network model and a convolutional neural network model using the positive sample pair and the negative sample pair, and adjusting model parameters of the graph neural network model and/or the convolutional neural network model, so that the distance between the embedding vectors output by the graph neural network model and the convolutional neural network model is smaller than a first preset threshold value for the positive sample pair and larger than a second preset threshold value for the negative sample pair, wherein the second preset threshold value is larger than the first preset threshold value.
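The pair construction and the two distance thresholds of claim 2 can be illustrated with a minimal sketch. The function names `build_pairs` and `pair_loss`, the use of the Euclidean distance, and the hinge-style loss are assumptions made for illustration only; the claim states the two threshold conditions but does not fix a distance measure or a loss function.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def build_pairs(graph_1, conf_1, graph_2, conf_2):
    """Pair each molecular graph with a 3D conformation.

    A graph paired with the conformation of the same molecule is a positive
    pair; a graph paired with the conformation of the other molecule is a
    negative pair.
    """
    positives = [(graph_1, conf_1), (graph_2, conf_2)]
    negatives = [(graph_1, conf_2), (graph_2, conf_1)]
    return positives, negatives

def pair_loss(graph_emb, conf_emb, is_positive, t_pos, t_neg):
    """Hinge-style penalty (an assumed loss form, not taken from the claim).

    The penalty is zero once a positive pair is no farther apart than t_pos,
    or once a negative pair is at least t_neg apart, with t_neg > t_pos.
    """
    d = euclidean(graph_emb, conf_emb)
    if is_positive:
        return max(0.0, d - t_pos)
    return max(0.0, t_neg - d)
```

In such a sketch, the parameters of both encoders would be updated to reduce the summed penalty over all pairs, which pushes positive pairs below the first threshold and negative pairs above the second.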

3. The method according to claim 1, wherein the target molecular graph structure carries an adjacency matrix and attribute information, and the attribute information includes a node initial feature vector and an edge initial feature vector, wherein the node initial feature vector and the edge initial feature vector are determined according to a preset vector generation rule;

the determining a first feature vector corresponding to the target molecular graph structure by using the pre-trained graph neural network model comprises:

inputting the target molecular graph structure, the adjacency matrix and the attribute information into a pre-trained graph neural network model to obtain node hidden vectors of each node in the target molecular graph structure;

and generating a first feature vector of the target drug molecule by using the node hidden vectors of all the nodes.
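As a rough illustration of how an adjacency matrix, node initial feature vectors and edge initial feature vectors can be combined into node hidden vectors, the following sketch performs one round of neighbor aggregation without learned weights. It is a simplified stand-in for a trained graph neural network layer; the function name `aggregate_once`, the summation rule, and the requirement that edge and node features share one dimension are assumptions made for illustration.

```python
def aggregate_once(adjacency, node_feats, edge_feats):
    """One round of neighbor aggregation (no learned weights; illustrative).

    adjacency  : n x n list of 0/1 entries, symmetric.
    node_feats : list of n node initial feature vectors, each of length dim.
    edge_feats : dict mapping (i, j) with i < j to an edge feature vector of
                 length dim (same dimension as node features, by assumption).
    Returns a list of n node hidden vectors.
    """
    n = len(node_feats)
    dim = len(node_feats[0])
    hidden = []
    for i in range(n):
        h = list(node_feats[i])  # start from the node's own features
        for j in range(n):
            if adjacency[i][j]:
                e = edge_feats[(min(i, j), max(i, j))]
                for k in range(dim):
                    # add the neighbor's features and the connecting edge's features
                    h[k] += node_feats[j][k] + e[k]
        hidden.append(h)
    return hidden

# A three-atom chain 0 - 1 - 2 with two-dimensional toy features.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
node_feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
edge_feats = {(0, 1): [0.5, 0.5], (1, 2): [0.5, 0.5]}
node_hidden = aggregate_once(adjacency, node_feats, edge_feats)
# node_hidden == [[2.5, 0.5], [3.0, 2.0], [1.5, 1.5]]
```

A trained model would apply learned transformations and typically several such rounds; the sketch only shows how the three inputs named in the claim enter the computation.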

4. The method of claim 3, wherein the generating a first feature vector of the target drug molecule using the node hidden vectors of the respective nodes comprises:

calculating a hidden vector average value of the node hidden vectors, and determining the hidden vector average value as the first feature vector of the target drug molecule; or

determining the node hidden vector with the maximum corresponding hidden vector value as the first feature vector.
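The two readout alternatives of claim 4 can be sketched as follows. The claim does not define how the "maximum corresponding hidden vector value" is measured, so the second function selects the node hidden vector with the largest Euclidean norm; this is one possible reading, and element-wise max pooling over all nodes would be another. The function names are hypothetical.

```python
import math

def mean_readout(node_hidden):
    """Average the node hidden vectors component by component."""
    n = len(node_hidden)
    dim = len(node_hidden[0])
    return [sum(h[k] for h in node_hidden) / n for k in range(dim)]

def max_norm_readout(node_hidden):
    """Select the node hidden vector with the largest Euclidean norm.

    Using the norm as the 'hidden vector value' is an assumption.
    """
    return max(node_hidden, key=lambda h: math.sqrt(sum(x * x for x in h)))

node_hidden = [[1.0, 2.0], [3.0, 4.0], [2.0, 0.0]]
first_vec_mean = mean_readout(node_hidden)      # [2.0, 2.0]
first_vec_max = max_norm_readout(node_hidden)   # [3.0, 4.0]
```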

5. The method of claim 1, wherein the convolutional neural network model comprises a data input layer, a convolutional computation layer, a pooling layer, and a fully-connected layer;

the determining a second feature vector corresponding to the target three-dimensional conformation by using the pre-trained convolutional neural network model comprises:

inputting the target three-dimensional conformation into the pre-trained convolutional neural network model through the data input layer, and iteratively performing convolution and pooling in the convolutional computation layer and the pooling layer to obtain a feature map;

and flattening the feature map row by row and passing it to the fully-connected layer to obtain a second feature vector corresponding to the target three-dimensional conformation.
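The convolution, pooling, row-wise flattening and fully-connected steps of claim 5 can be sketched on a small grid. For brevity the sketch works on a single two-dimensional slice with one channel and performs only one convolution and pooling round, whereas a three-dimensional conformation would normally be processed as a volume over several rounds. The kernel, weights and function names are illustrative assumptions, not values from the application.

```python
def conv2d_valid(image, kernel):
    """Single-channel 'valid' convolution (cross-correlation, as in most CNN libraries)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def max_pool_2x2(fmap):
    """Non-overlapping 2x2 max pooling; a trailing odd row or column is dropped."""
    return [[max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, len(fmap[0]) - 1, 2)]
            for r in range(0, len(fmap) - 1, 2)]

def flatten_rows(fmap):
    """Flatten the feature map row by row into one vector."""
    return [x for row in fmap for x in row]

def dense(vector, weights, bias):
    """Fully-connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

# Toy 5x5 input grid with entries 0..24 and a 2x2 kernel.
image = [[5 * r + c for c in range(5)] for r in range(5)]
kernel = [[1, 0], [0, 1]]
feature_map = max_pool_2x2(conv2d_valid(image, kernel))   # [[18, 22], [38, 42]]
flat = flatten_rows(feature_map)                          # [18, 22, 38, 42]
second_vec = dense(flat, [[1, 0, 0, 0], [0, 0, 0, 1]], [0, 1])  # [18, 43]
```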

6. The method of claim 1, wherein the constructing a third feature vector according to the first feature vector and the second feature vector, and inputting the third feature vector into a pre-trained property prediction model to obtain the property prediction result of the target drug molecule comprises:

performing vector fusion processing on the first feature vector and the second feature vector according to a preset vector fusion rule to obtain a third feature vector;

and inputting the third feature vector serving as an input feature into a pre-trained property prediction model to obtain a property prediction result of the target drug molecule.
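One simple vector fusion rule is concatenation, followed by a linear property predictor. Both choices are assumptions made for illustration; the claim only requires a preset fusion rule and a pre-trained property prediction model, and the weights below are arbitrary.

```python
def fuse(first_vec, second_vec):
    """Fuse two feature vectors by concatenation (an assumed fusion rule)."""
    return list(first_vec) + list(second_vec)

def predict(third_vec, weights, bias):
    """Linear property predictor standing in for a trained prediction model."""
    return sum(w * x for w, x in zip(weights, third_vec)) + bias

third_vec = fuse([1.0, 2.0], [3.0])                   # [1.0, 2.0, 3.0]
score = predict(third_vec, [0.5, 0.5, 1.0], -1.0)     # 3.5
```

Other fusion rules, such as element-wise addition or weighted summation, would require the two feature vectors to have the same dimension, which concatenation does not.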

7. The method of claim 1, further comprising:

taking a sample feature vector matched with a preset property prediction task corresponding to the target drug molecule as a training sample, and training a preset property prediction model;

and calculating a loss function of the property prediction model, and judging that the property prediction model is trained completely when the loss function is smaller than a third preset threshold value.
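The stopping criterion of claim 7 can be illustrated with a small training loop that ends once the loss falls below a threshold. The linear model, mean squared error loss, learning rate and step limit are assumptions chosen to keep the sketch short; the claim does not specify the model type or the loss function.

```python
def train_until_threshold(samples, labels, threshold, lr=0.1, max_steps=10000):
    """Fit a linear model by gradient descent on the mean squared error.

    Training stops as soon as the loss is below `threshold` (the role of the
    third preset threshold) or after `max_steps` updates.
    Returns the weights, the bias and the last computed loss.
    """
    dim = len(samples[0])
    w = [0.0] * dim
    b = 0.0
    loss = float("inf")
    for _ in range(max_steps):
        preds = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in samples]
        errors = [p - y for p, y in zip(preds, labels)]
        loss = sum(e * e for e in errors) / len(errors)
        if loss < threshold:
            break  # training is judged complete
        for k in range(dim):
            grad_k = 2 * sum(e * x[k] for e, x in zip(errors, samples)) / len(errors)
            w[k] -= lr * grad_k
        b -= lr * 2 * sum(errors) / len(errors)
    return w, b, loss

# Toy data following y = 2x + 1.
weights, bias, final_loss = train_until_threshold(
    [[0.0], [1.0], [2.0]], [1.0, 3.0, 5.0], threshold=1e-4)
```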

8. A drug molecule property prediction device based on comparative learning, comprising:

9. A storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method for predicting the property of a drug molecule based on comparative learning of any one of claims 1 to 7.

10. A computer device comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, wherein the processor implements the method for drug molecule property prediction based on comparative learning of any one of claims 1 to 7 when executing the program.