CN111524557B - Inverse synthesis prediction method, device, equipment and storage medium based on artificial intelligence - Google Patents

Inverse synthesis prediction method, device, equipment and storage medium based on artificial intelligence

Info

Publication number
CN111524557B
CN111524557B (application CN202010332227.1A)
Authority
CN
China
Prior art keywords
character string
product molecule
molecule
sample
product
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010332227.1A
Other languages
Chinese (zh)
Other versions
CN111524557A (en)
Inventor
Peilin Zhao (赵沛霖)
Yang Yu (于洋)
Junzhou Huang (黄俊洲)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010332227.1A priority Critical patent/CN111524557B/en
Publication of CN111524557A publication Critical patent/CN111524557A/en
Application granted granted Critical
Publication of CN111524557B publication Critical patent/CN111524557B/en
Legal status: Active

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C 20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C 20/10: Analysis or design of chemical reactions, syntheses or processes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C 20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C 20/20: Identification of molecular entities, parts thereof or of chemical compositions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiment of the application discloses an artificial-intelligence-based inverse synthesis prediction method, device, equipment and storage medium. The method comprises: obtaining the graph structure of a product molecule and the attribute characteristics of the atoms in the product molecule; predicting, by a graph neural network model, a broken chemical bond in the product molecule based at least on the graph structure of the product molecule and the attribute characteristics of its atoms; performing bond-breaking treatment on the product molecule based on the broken chemical bond to obtain at least one synthon; and predicting, by a sequence learning model, the reactant molecules at least according to the character strings corresponding to the inverse synthesis reaction type, the product molecule, and the at least one synthon. The method can effectively improve the prediction accuracy of the inverse synthetic reaction of organic compounds and makes the prediction process easier to visualize and more interpretable.

Description

Inverse synthesis prediction method, device, equipment and storage medium based on artificial intelligence
Technical Field
The present application relates to the field of artificial intelligence (AI) technology, and in particular to a method, apparatus, device and storage medium for predicting the inverse synthesis of an organic compound molecule.
Background
Nowadays, new organic compounds play an increasingly important role in fields such as materials, agriculture, environmental science and medicine. Given a new organic compound, technicians need to determine its synthetic route in order to produce the compound efficiently and accurately; rapidly and accurately determining the synthetic route of a given organic compound is therefore a very important task. The process of deducing the corresponding reactants for a given organic compound is called the inverse synthesis (retrosynthesis) reaction.
In recent years, with the rise and rapid development of artificial intelligence technology, inverse synthetic reaction prediction has gradually come to be treated as a deep learning problem. In particular, since a molecule can be expressed as a uniquely determined string, such as a Simplified Molecular Input Line Entry System (SMILES) string, both a product molecule and a reactant molecule can be converted into corresponding SMILES strings; accordingly, inverse synthetic reaction prediction can be regarded as a sequence prediction task from a product SMILES string to a reactant SMILES string. Current sequence-learning-based inverse synthesis prediction methods mainly include the Seq2Seq method and the SCROP method.
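As a concrete illustration of the string representation mentioned above, the sketch below shows a minimal SMILES tokenizer of the kind commonly used to prepare input for sequence models. The token pattern is an assumption made for illustration; the patent does not specify any particular tokenization scheme.

```python
import re

# Minimal SMILES tokenizer (an illustrative sketch, not the patent's method).
# Multi-character tokens (bracket atoms, two-letter elements, %NN ring
# closures) must be tried before single-character tokens.
TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|%\d{2}|[BCNOPSFIbcnops]|\d|[=#\-+()/\\@.])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into model-ready tokens."""
    tokens = TOKEN_RE.findall(smiles)
    # findall silently drops unmatched characters; verify nothing was lost.
    assert "".join(tokens) == smiles, f"untokenizable characters in {smiles!r}"
    return tokens

# Tokens of aspirin's SMILES string:
print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
```

Each resulting token (atom, bond symbol, ring-closure digit, branch parenthesis) would then be mapped to a vocabulary index before being fed to a sequence model.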
The inventors of the present application found that although the above prediction methods can predict the reactant SMILES string from the product SMILES string through a sequence learning model, such a model generally cannot capture the graph structure information of the molecules from the SMILES string, and this graph structure information often plays a very important role in inverse synthesis prediction. In summary, the prediction accuracy of current inverse synthesis prediction methods still needs to be improved.
Disclosure of Invention
The embodiment of the application provides an inverse synthesis prediction method, device, equipment and storage medium based on artificial intelligence, which can effectively improve the prediction precision of an inverse synthesis reaction.
In view of this, the present application provides, in a first aspect, an artificial intelligence based inverse synthetic prediction method, the method comprising:
obtaining a graph structure of a product molecule and attribute characteristics of atoms in the product molecule, wherein the product molecule is an organic compound molecule;
predicting, by a graph neural network model, a broken chemical bond in the product molecule based at least on the graph structure of the product molecule and the attribute characteristics of atoms in the product molecule;
performing bond breaking treatment on the product molecules based on the broken chemical bonds to obtain at least one synthon;
and predicting, by a sequence learning model, the reactant molecules at least according to the character string corresponding to the inverse synthesis reaction type, the character string corresponding to the product molecule, and the character string corresponding to the at least one synthon.
A second aspect of the present application provides an artificial intelligence based inverse synthetic prediction apparatus, the apparatus comprising:
the acquisition module is used for acquiring the graph structure of a product molecule and the attribute characteristics of atoms in the product molecule, wherein the product molecule is an organic compound molecule;
a first prediction module, configured to predict, by a graph neural network model, a broken chemical bond in the product molecule based at least on the graph structure of the product molecule and the attribute characteristics of atoms in the product molecule;
the bond breaking module is used for breaking bonds of the product molecules based on the broken chemical bonds to obtain at least one synthon;
and a second prediction module, configured to predict, by a sequence learning model, the reactant molecules at least according to the character string corresponding to the inverse synthesis reaction type, the character string corresponding to the product molecule, and the character string corresponding to the at least one synthon.
A third aspect of the present application provides an apparatus comprising a processor and a memory:
The memory is used for storing a computer program;
the processor is configured to execute the steps of the artificial intelligence based inverse synthetic prediction method according to the first aspect described above according to the computer program.
A fourth aspect of the present application provides a computer readable storage medium storing a computer program for performing the steps of the artificial intelligence based inverse synthetic prediction method of the first aspect described above.
A fifth aspect of the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the steps of the artificial intelligence based inverse synthetic prediction method of the first aspect described above.
From the above technical solutions, the embodiments of the present application have the following advantages:
the embodiment of the application provides an artificial-intelligence-based inverse synthesis prediction method. In the process of predicting reactant molecules from a product molecule (i.e., an organic compound molecule), a graph neural network (Graph Neural Network, GNN) model first predicts potential bond-breaking positions in the product molecule according to the graph structure of the product molecule and the attribute characteristics of its atoms. The product molecule then undergoes bond-breaking treatment at the predicted positions to obtain the synthons of the product molecule. Further, a sequence learning model predicts the reactant molecules of the inverse synthetic reaction based on the character strings corresponding to the inverse synthesis reaction type, the product molecule, and the synthons. Compared with related-art implementations that predict reactant molecules only from the character string of the product molecule through a sequence learning model, the method provided by the embodiment of the application not only utilizes the character string information of the product molecule, but also fuses molecular graph structure information, which is of high reference value for inverse synthesis prediction, into the prediction process through the graph neural network model, so that the prediction accuracy of the inverse synthetic reaction of organic compounds can be improved to a certain extent. In addition, the method predicts the bond-breaking position through the graph neural network model and then completes the synthons obtained after bond-breaking treatment through the sequence learning model; this processing mode makes the prediction process of the inverse synthetic reaction of the organic compound easier to visualize and highly interpretable.
Drawings
Fig. 1 is an application scenario schematic diagram of an inverse synthetic prediction method provided in an embodiment of the present application;
FIG. 2 is a schematic flow chart of an inverse synthetic prediction method according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of a training method of the graph neural network model according to an embodiment of the present application;
fig. 4 is a flow chart of a training method of a sequence learning model according to an embodiment of the present application;
fig. 5 is a schematic diagram of an implementation of a first stage in the inverse synthetic prediction method provided in the embodiment of the present application;
FIG. 6 is a diagram of an implementation architecture of a second stage in the inverse synthetic prediction method provided in the embodiments of the present application;
fig. 7 is a schematic structural diagram of an inverse synthesis prediction apparatus according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of another inverse synthetic prediction apparatus according to an embodiment of the present disclosure;
FIG. 9 is a schematic structural diagram of yet another inverse synthetic prediction device according to an embodiment of the present disclosure;
fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
To enable those skilled in the art to better understand the solution of the present application, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Apparently, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art from the present disclosure without creative effort fall within the scope of the present disclosure.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims of this application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Artificial intelligence is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, and machine learning/deep learning.
Among them, Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
The scheme provided by the embodiment of the application relates to an artificial intelligence inverse synthesis prediction technology, and is specifically described by the following embodiment.
In the related art, methods for predicting the inverse synthetic reaction of an organic compound generally predict the reactant molecules from the character string corresponding to the product molecule through a sequence learning model. Such a sequence learning model has difficulty capturing the molecular graph structure information that is important for inverse synthesis prediction, so the prediction accuracy is often not ideal.
Aiming at the problems of the related art, the embodiment of the application provides an inverse synthesis prediction method based on artificial intelligence, which can effectively utilize the graph structure information of product molecules in the prediction process of the inverse synthesis reaction of the organic compound and improve the prediction accuracy of the inverse synthesis reaction.
Specifically, in the artificial intelligence-based inverse synthesis prediction method provided in the embodiments of the present application, a graph structure of a product molecule (i.e., an organic compound molecule) and an attribute feature of an atom in the product molecule are obtained first; then, predicting a broken chemical bond in the product molecule according to the graph structure of the product molecule and the attribute characteristics of atoms in the product molecule through a pre-trained graph neural network model; then, performing bond breaking treatment on the product molecule based on the predicted broken chemical bond to obtain at least one synthon of the product molecule; further, by means of a pre-trained sequence learning model, the reactant molecules of the inverse synthetic reaction are predicted according to the character strings corresponding to the type of the inverse synthetic reaction, the character strings corresponding to the product molecules and the character strings corresponding to at least one synthon.
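The bond-breaking step described above (removing the predicted broken chemical bond and collecting the resulting fragments as synthons) can be sketched as a connected-components computation on the molecular graph. This is an illustrative plain-Python sketch under the assumption that atoms are modeled as integer node indices; it is not the patent's implementation.

```python
from collections import defaultdict

def break_bond(n_atoms, bonds, broken_bond):
    """Remove the predicted broken bond and return the synthons as the
    connected components of the remaining molecular graph.

    n_atoms     -- number of atoms (nodes), indexed 0..n_atoms-1
    bonds       -- list of (i, j) chemical bonds (undirected edges)
    broken_bond -- the (i, j) bond predicted as breaking
    """
    adj = defaultdict(list)
    for i, j in bonds:
        if (i, j) != broken_bond and (j, i) != broken_bond:
            adj[i].append(j)
            adj[j].append(i)
    seen, synthons = set(), []
    for start in range(n_atoms):       # depth-first search per component
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        synthons.append(sorted(comp))
    return synthons

# Toy ester skeleton C-C(=O)-O-C (atoms 0..4): breaking the C-O bond (1, 3)
# yields two synthons, one per fragment.
print(break_bond(5, [(0, 1), (1, 2), (1, 3), (3, 4)], (1, 3)))
# → [[0, 1, 2], [3, 4]]
```

Each synthon (set of atom indices) would then be converted to its character string and completed into a full reactant molecule by the sequence learning model.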
Compared with the implementation mode of predicting reactant molecules according to the corresponding character strings of the product molecules only through a sequence learning model in the related art, the method provided by the embodiment of the application not only utilizes the character string information of the product molecules in the prediction process of the reverse synthesis reaction of the organic compound, but also integrates the molecular diagram structure information which is important for the prediction of the reverse synthesis reaction into the prediction process through a graph neural network model, so that the accuracy of the prediction of the reverse synthesis reaction of the organic compound can be improved to a certain extent. In addition, the method provided by the embodiment of the application predicts the bond breaking position through the graph neural network model, and then complements the synthons after bond breaking treatment through the sequence learning model, and the treatment mode enables the prediction process of the inverse synthetic reaction of the organic compound to be easier to visualize and has extremely high interpretability.
It should be understood that, in practical application, the artificial intelligence-based inverse synthesis prediction method provided in the embodiments of the present application may be applied to an electronic device, such as a terminal device, a server, etc., capable of supporting operation of a neural network model. The terminal device may be a computer, a smart phone, a tablet computer, a personal digital assistant (Personal Digital Assistant, PDA), etc. The server can be an application server or a Web server; in actual deployment, the server may be an independent server, or may be a cluster server or a cloud server.
In order to facilitate understanding of the technical solution provided by the embodiments of the present application, an application scenario of the artificial intelligence-based inverse synthesis prediction method is described below by taking an example of a scenario in which the artificial intelligence-based inverse synthesis prediction method provided by the embodiments of the present application is applied to interaction between a terminal device and a server.
Referring to fig. 1, fig. 1 is a schematic application scenario diagram of an inverse synthetic prediction method provided in an embodiment of the present application. As shown in fig. 1, the application scenario includes a terminal device 110 and a server 120, where the terminal device 110 and the server 120 communicate through a network. The terminal device 110 is configured to provide basic information required for predicting the inverse synthetic reaction of the organic compound, such as product molecule information (including a graph structure of a product molecule, an attribute feature of an atom in the product molecule, a character string corresponding to the product molecule, and the like) and inverse synthetic reaction type information (such as a character string corresponding to the inverse synthetic reaction type, and the like) to the server 120. The server 120 is configured to perform the inverse synthesis prediction method provided in the embodiments of the present application, and predict reactant molecules of the inverse synthesis reaction based on the basic information provided by the terminal device 110.
Specifically, when the terminal device 110 transmits the basic information required for inverse synthesis prediction to the server 120 through the network, the server 120 may first call the pre-trained graph neural network model 121 and predict potential bond-breaking positions in the product molecule according to the graph structure of the product molecule and the attribute characteristics of its atoms, that is, predict the chemical bonds in the product molecule that are most likely to break as the broken chemical bonds. The server 120 may then perform a bond-breaking process 122 on the product molecule based on the broken chemical bonds predicted by the graph neural network model to obtain several synthons of the product molecule. Further, the server 120 may call the pre-trained sequence learning model 123, predict the reactant molecules of the inverse synthetic reaction according to the character strings corresponding to the inverse synthesis reaction type, the product molecule, and the synthons, and transmit the predicted reactant molecules to the terminal device 110 through the network.
In practical applications, when the graph neural network model 121 predicts broken chemical bonds, it may use not only the graph structure of the product molecule and the attribute characteristics of its atoms but also the attribute characteristics of the chemical bonds in the product molecule, features of the inverse synthesis reaction type, and so on; this application does not limit the information used when the graph neural network model 121 predicts broken chemical bonds.
In addition, when the sequence learning model 123 predicts the reactant molecules, the input sequence information (i.e., the character string information) may be enhanced with the graph structure information of the product molecule and of each synthon, in addition to the character strings corresponding to the inverse synthesis reaction type, the product molecule, and each synthon; this application does not limit the information used when the sequence learning model 123 predicts the reactant molecules.
It should be understood that the basic information provided by the terminal device 110 to the server 120 is merely an example, and in practical applications, the terminal device 110 may provide less or more basic information to the server 120, for example, the terminal device 110 may only provide the product molecular formula to the server 120, and further, the server 120 itself determines other information required in the inverse synthetic reaction prediction process based on the product molecular formula, and no limitation is made on the basic information provided by the terminal device 110 to the server 120.
It should be understood that, the application scenario shown in fig. 1 is merely an example, and in practical application, the terminal device may independently execute the inverse synthesis prediction method provided in the embodiment of the present application, the server may independently execute the inverse synthesis prediction method provided in the embodiment of the present application, and the terminal device and the server may cooperatively execute the inverse synthesis prediction method provided in the embodiment of the present application, which does not make any limitation on the application scenario of the inverse synthesis prediction method provided in the embodiment of the present application.
The artificial intelligence-based inverse synthetic prediction method provided in the present application is described in detail below by way of examples.
Referring to fig. 2, fig. 2 is a flow chart of an inverse synthetic prediction method according to an embodiment of the present application. For convenience of description, the following embodiments will be described taking a server as an execution subject. As shown in fig. 2, the inverse synthetic prediction method includes the steps of:
step 201: obtaining the graph structure of a product molecule and the attribute characteristics of atoms in the product molecule, wherein the product molecule is an organic compound molecule.
In the technical solution provided in the embodiments of the present application, when the server predicts the inverse synthesis reaction of the organic compound molecule (i.e., the product molecule) and determines the corresponding reactant molecule, it is necessary to obtain the graph structure of the product molecule and the attribute characteristics of the atoms in the product molecule.
The graph structure of the product molecule may be a uniquely determined graph structure obtained by converting the product molecule in a predetermined order. For example, each atom in the product molecule can be regarded as a node and each chemical bond as an edge, so that the product molecule can be converted into an adjacency matrix A ∈ R^(n×n) representing the structure of the product molecule, where n is the number of atoms in the product molecule.
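The adjacency-matrix conversion just described can be sketched in plain Python. The `bonds` edge list here is an assumed input format; a real pipeline would typically derive it from a cheminformatics toolkit such as RDKit.

```python
def adjacency_matrix(n_atoms, bonds):
    """Build the n x n adjacency matrix A described above:
    A[i][j] = 1 iff atoms i and j share a chemical bond."""
    A = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        A[i][j] = A[j][i] = 1  # chemical bonds are undirected
    return A

# Ethanol's heavy-atom skeleton C-C-O as atoms 0, 1, 2:
print(adjacency_matrix(3, [(0, 1), (1, 2)]))
# → [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
```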
The attribute characteristics of the atoms in the product molecule consist of the feature vector of each atom in the product molecule. Specifically, a feature vector can be extracted for each atom; it is determined based on the key features of the corresponding atom and represents the original properties of that atom. Assuming the feature vector length is d and the product molecule contains n atoms, a feature matrix X ∈ R^(n×d) can represent the attribute characteristics of the atoms in the product molecule, where each row of X is the feature vector of one atom.
For example, in practical applications, the feature vector of each atom in a product molecule may be determined based on any one or more of the following key features:
Atomic type: the atomic number of the atom (i.e., the element's ordinal position in the periodic table);
Number of linked bonds: the number of different chemical bonds the atom participates in;
Formal charge: the charge assigned to the atom in the product molecule;
Chirality: whether the molecule cannot be superimposed on its mirror image (for example, one's left and right hands are mirror images of each other but do not coincide);
Number of attached hydrogen atoms: the number of hydrogen atoms attached to the atom;
Atomic hybridization: sp, sp2, sp3, sp3d or sp3d2;
Aromaticity: whether the atom lies within an aromatic ring system;
Atomic weight: the weight of the atom;
High-frequency reaction center feature: whether the atom belongs to a molecular subgraph that is a high-frequency reaction center, i.e., a frequently reacting center extracted from the products of the inverse synthesis training set;
Reaction type: the chemical reaction type of the inverse synthetic reaction may also serve as an atom feature.
It should be understood that in practical applications, in addition to determining the feature vector of each atom in the product molecule according to the above key features, the feature vector of an atom may be determined according to other features that can reflect the attribute of the atom.
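A hedged sketch of how such an atom feature vector might be assembled from a few of the key features listed above (atom type, number of bonds, formal charge, attached hydrogens, aromaticity). The `Atom` record and the one-hot width are illustrative assumptions, not the patent's exact encoding.

```python
from dataclasses import dataclass

@dataclass
class Atom:
    atomic_num: int      # atomic type (element number in the periodic table)
    degree: int          # number of chemical bonds the atom participates in
    formal_charge: int   # formal charge assigned to the atom
    num_h: int           # number of attached hydrogen atoms
    aromatic: bool       # whether the atom lies in an aromatic ring system

MAX_ATOMIC_NUM = 100  # one-hot width; an assumed cutoff

def atom_features(a: Atom) -> list[float]:
    """Concatenate a one-hot element encoding with scalar key features."""
    onehot = [0.0] * MAX_ATOMIC_NUM
    onehot[a.atomic_num - 1] = 1.0
    return onehot + [float(a.degree), float(a.formal_charge),
                     float(a.num_h), float(a.aromatic)]

# The carbon of a methyl group: element 6, one heavy-atom bond, three hydrogens.
v = atom_features(Atom(6, 1, 0, 3, False))
print(len(v), v[5], v[-4:])  # → 104 1.0 [1.0, 0.0, 3.0, 0.0]
```

Stacking one such vector per atom yields the feature matrix X ∈ R^(n×d) described above, here with d = 104.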
Optionally, to further ensure that the subsequent graph neural network model can accurately predict potential bond-breaking positions in the product molecule, the server may also acquire the attribute characteristics of the chemical bonds in the product molecule in addition to the graph structure and the atom attribute characteristics; when the graph neural network model subsequently predicts broken chemical bonds, it can then comprehensively refer to the attribute characteristics of the chemical bonds and predict the broken chemical bonds in the product molecule more accurately.
The attribute features of the chemical bonds in the product molecule consist of the feature vector of each chemical bond in the product molecule. Specifically, a feature vector may be extracted for each chemical bond in the product molecule; the feature vector is determined based on key features of the corresponding chemical bond and can represent the original properties of that bond. Assuming that the length of the feature vector of a chemical bond is p and the product molecule contains m chemical bonds, a feature matrix Z ∈ R^(m×p) can be used to represent the attribute features of the chemical bonds in the product molecule, where each row of Z corresponds to the feature vector of one chemical bond in the product molecule.
For example, in practical applications, for each chemical bond in a product molecule, the feature vector for that chemical bond may be determined based on any one or more of the following key features:
Bond type: represents the type of the chemical bond, such as single bond, double bond, triple bond, aromatic bond, etc.;
conjugation feature: indicates whether the chemical bond is conjugated;
ring bond feature: indicates whether the chemical bond is part of a ring;
stereochemistry: no chirality, any chirality, or the stereochemistry of a double bond, etc.
It should be understood that in practical applications, in addition to determining the feature vector of each chemical bond in the product molecule according to the above key features, the feature vector of the chemical bond may be determined according to other features that can reflect the attribute of the chemical bond.
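Analogously, a feature vector can be assembled for each chemical bond from the bond features above; the vocabularies below are illustrative assumptions, not the exact encoding used in this application.

```python
# Illustrative vocabularies (assumed, not prescribed by the text)
BOND_TYPES = ["single", "double", "triple", "aromatic"]
STEREO = ["none", "any", "E", "Z"]               # assumed stereochemistry categories

def one_hot(value, choices):
    return [1.0 if value == c else 0.0 for c in choices]

def bond_feature_vector(bond):
    """bond: dict carrying the key features described in the text."""
    vec = []
    vec += one_hot(bond["bond_type"], BOND_TYPES)    # bond type
    vec.append(1.0 if bond["conjugated"] else 0.0)   # conjugation feature
    vec.append(1.0 if bond["in_ring"] else 0.0)      # ring bond feature
    vec += one_hot(bond["stereo"], STEREO)           # stereochemistry
    return vec

# Example: an aromatic ring bond
z = bond_feature_vector({"bond_type": "aromatic", "conjugated": True,
                         "in_ring": True, "stereo": "none"})
```

Stacking one such vector per bond yields the matrix Z ∈ R^(m×p) described above (here p = 10).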
In one possible implementation, the server may interact with other devices (e.g., terminal devices, servers, etc.), and obtain the above-described graph structure of the product molecule, the attribute characteristics of atoms in the product molecule, and the attribute characteristics of chemical bonds in the product molecule from the other devices. That is, the other device may determine the attribute characteristics of the corresponding graph structure, atom and the attribute characteristics of the chemical bond in advance with respect to the product molecule to be subjected to the inverse synthesis reaction prediction, and further, the determined graph structure, atom attribute characteristics and chemical bond attribute characteristics may be provided to the server, which performs the inverse synthesis reaction prediction based on the graph structure, atom attribute characteristics and chemical bond attribute characteristics.
In another possible implementation manner, the server itself may determine, for the product molecules to be predicted by the inverse synthetic reaction, the attribute features of the corresponding graph structure, the atom, and the attribute features of the chemical bond, and further perform the inverse synthetic reaction prediction based on the determined graph structure, the determined attribute features of the atom, and the determined attribute features of the chemical bond.
It should be understood that, in practical applications, the server may also obtain the graph structure of the product molecule, the attribute characteristics of the atoms in the product molecule, and the attribute characteristics of the chemical bonds in the product molecule by other methods, and the manner in which the server obtains these information is not limited in this application.
Step 202: predicting, by a graph neural network model, the broken chemical bonds in the product molecule based at least on the graph structure of the product molecule and the attribute features of atoms in the product molecule.
After the server obtains the graph structure of the product molecule and the attribute features of its atoms, it can input them into a pre-trained graph neural network model; the graph neural network model analyzes and processes the input graph structure and atom attribute features and determines the potential bond-breaking positions in the product molecule, i.e., determines the chemical bonds in the product molecule that are most likely to break, as the broken chemical bonds.
In one possible implementation, the server can predict the fracture probability of each chemical bond in the product molecule according to the graph structure of the product molecule and the attribute characteristics of atoms in the product molecule through a graph neural network model; further, a chemical bond having a chemical bond breakage probability larger than a preset threshold value is determined as the above-mentioned broken chemical bond.
Specifically, the graph neural network model can predict the fracture probability of each chemical bond in the product molecule according to the graph structure of the product molecule and the attribute characteristics of atoms in the product molecule; taking the graph neural network model as an example of the graph attention model (Graph Attention Networks, GAT), the server may map the input original features to advanced features through the GAT, as shown in formula (1):
H=f(X,A;θ) (1)
wherein X is the attribute feature matrix of atoms in the input product molecule, A is the graph structure of the input product molecule, and θ is the parameter set of the GAT model (learned in advance from training samples); H ∈ R^(n×b) is the high-level feature matrix that the GAT maps from the input original features, where each row corresponds to the high-level feature vector of one atom in the product molecule, n is the number of atoms in the product molecule, and b is the dimension of the high-level feature vector (typically set manually).
After H is obtained, the GAT can further predict the probability that each chemical bond in the product molecule breaks in the inverse synthesis reaction. Specifically, the breaking probability of the chemical bond (x_i, x_j) between atom x_i and atom x_j can be predicted by formula (2):
c_ij = σ(w^T [h_i, h_j])   (2)
wherein c_ij represents the breaking probability of the chemical bond (x_i, x_j); σ is the sigmoid function; w ∈ R^(2b); [h_i, h_j] denotes the concatenation of the high-level feature h_i and the high-level feature h_j, where h_i is the high-level feature vector of atom x_i in H and h_j is the high-level feature vector of atom x_j in H. For atom pairs (x_i, x_j) that are not connected in the graph structure of the product molecule, no breaking probability needs to be predicted.
After determining the breaking probability c_ij of each chemical bond in the product molecule through the above process, the graph neural network model outputs the determined breaking probability c_ij of each chemical bond; the server may then judge, for each breaking probability c_ij, whether it is greater than a preset threshold (e.g., 0.5), and if so, determine the chemical bond corresponding to that breaking probability c_ij to be a broken chemical bond in the product molecule.
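The first implementation can be sketched as follows: given the high-level atom features H produced by the model and the adjacency structure A, score every bonded atom pair and keep the bonds whose probability exceeds the threshold. The random H and w stand in for a trained model, and the use of a sigmoid to squash the linear score w·[h_i, h_j] into a probability is an assumption consistent with the surrounding description.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bond_break_probs(H, A, w):
    """Breaking probability c_ij for every bonded pair (i, j), i < j, scored
    as sigmoid(w . [h_i, h_j]) over the concatenated atom features."""
    probs = {}
    n = len(A)
    for i in range(n):
        for j in range(i + 1, n):
            if A[i][j]:  # only pairs joined by a chemical bond are scored
                hij = H[i] + H[j]  # concatenation of the two feature vectors
                probs[(i, j)] = sigmoid(sum(wk * hk for wk, hk in zip(w, hij)))
    return probs

random.seed(0)
n, b = 4, 8
H = [[random.gauss(0, 1) for _ in range(b)] for _ in range(n)]  # stand-in for GNN features
w = [random.gauss(0, 1) for _ in range(2 * b)]                  # stand-in for learned weights
A = [[0, 1, 0, 0],   # adjacency of a 4-atom chain
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]

probs = bond_break_probs(H, A, w)
broken = [bond for bond, p in probs.items() if p > 0.5]  # threshold rule
```

Only the three bonded pairs of the chain receive a probability; unconnected pairs are never scored, matching the remark above.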
In another possible implementation, to further improve the accuracy of broken-bond prediction, the server may use the graph neural network model to predict, from the graph structure of the product molecule and the attribute features of its atoms, both the breaking probability of each chemical bond in the product molecule and the number of chemical bonds that break in the product molecule; the breaking probabilities of the chemical bonds are then sorted in descending order, and the chemical bonds corresponding to the top probabilities, up to the predicted number of broken bonds, are determined as the broken chemical bonds.
Specifically, the graph neural network model can predict the breaking probability of each chemical bond in the product molecule and the number of chemical bond breaks in the product molecule according to the graph structure of the product molecule and the attribute characteristics of atoms in the product molecule; taking the graph neural network model as the GAT as an example, the server can map the input original features into advanced features through the GAT, as shown in the formula (3):
H=f(X,A;θ) (3)
wherein X is the attribute feature matrix of atoms in the input product molecule, A is the graph structure of the input product molecule, and θ is the parameter set of the GAT model (learned in advance from training samples); H ∈ R^(n×b) is the high-level feature matrix that the GAT maps from the input original features, where each row corresponds to the high-level feature vector of one atom in the product molecule, n is the number of atoms in the product molecule, and b is the dimension of the high-level feature vector (typically set manually).
After H is obtained, the GAT can further predict the number of chemical bonds in the product molecule that may break in the inverse synthesis reaction. Specifically, the probabilities of the various possible numbers of broken bonds can be predicted by formula (4):
a = softmax(W Σ_{i=1}^{n} h_i)   (4)
wherein a represents the occurrence probabilities corresponding to the possible numbers of broken bonds; W ∈ R^((N+1)×b), where N represents the maximum number of broken bonds in the training set; and h_i is the high-level feature vector of atom x_i in H. Accordingly, the GAT model can predict the number of bond breaks k in the product molecule by formula (5):
k = argmax_j a_j - 1   (5)
For example, assuming that a_2 is the largest element in the probability vector a, the number of chemical bond breaks in the product molecule is k = 2 - 1 = 1.
Furthermore, the GAT also needs to predict the probability that each chemical bond in the product molecule breaks in the inverse synthesis reaction. Specifically, the breaking probability of the chemical bond (x_i, x_j) between atom x_i and atom x_j can be predicted by formula (6):
c_ij = σ(w^T [h_i, h_j])   (6)
wherein c_ij represents the breaking probability of the chemical bond (x_i, x_j); σ is the sigmoid function; w ∈ R^(2b); [h_i, h_j] denotes the concatenation of the high-level feature h_i and the high-level feature h_j, where h_i is the high-level feature vector of atom x_i in H and h_j is the high-level feature vector of atom x_j in H. For atom pairs (x_i, x_j) that are not connected in the graph structure of the product molecule, no breaking probability needs to be predicted.
After determining the breaking probability c_ij of each chemical bond in the product molecule and the number k of chemical bond breaks through the above process, the graph neural network model outputs them; the server may then sort the breaking probabilities c_ij in descending order and determine the chemical bonds corresponding to the top k breaking probabilities as the broken chemical bonds in the product molecule.
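The second implementation's selection rule can be sketched as below: predict the break count k from the pooled atom features, then keep the k bonds with the highest breaking probabilities. The sum pooling, the toy weight matrix W, and the example probabilities are illustrative assumptions.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict_num_breaks(H, W):
    """a = softmax(W * sum_i h_i); the index of the largest a_j encodes the
    break count (the patent's 1-based 'k = argmax_j a_j - 1')."""
    pooled = [sum(col) for col in zip(*H)]      # sum-pool the atom features
    scores = [sum(wk * hk for wk, hk in zip(row, pooled)) for row in W]
    a = softmax(scores)
    return a.index(max(a))

def top_k_bonds(probs, k):
    """Sort bonds by breaking probability, descending, and keep the top k."""
    return sorted(probs, key=probs.get, reverse=True)[:k]

# Toy numbers: 2 atoms with 3-dim features, N + 1 = 3 possible counts (0, 1, 2).
H = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
W = [[0.1, 0.1, 0.0],   # score for "0 bonds break"
     [2.0, 2.0, 0.0],   # score for "1 bond breaks" (largest here)
     [0.5, 0.5, 0.0]]   # score for "2 bonds break"
k = predict_num_breaks(H, W)

probs = {(0, 1): 0.91, (1, 2): 0.15, (2, 3): 0.67}
broken = top_k_bonds(probs, k)
```

With these toy weights the model predicts one break, so only the highest-probability bond (0, 1) is selected.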
The inventors of the present application tested the above two implementations on the public dataset USPTO_50k and found that the second implementation predicts broken chemical bonds more accurately than the first. Specifically, the first implementation achieves 74% accuracy in predicting broken chemical bonds on USPTO_50k, while the second implementation achieves 86%.
Note that, if the server obtains the attribute characteristic Z of the chemical bond in the product molecule in step 201, the graph neural network model used in step 202 should be a graph neural network capable of compatibly processing the attribute characteristic Z of the chemical bond, such as a message passing neural network (Message Passing Neural Networks, MPNN) model.
In this case, the server can predict the broken chemical bonds in the product molecule from the graph structure of the product molecule, the attribute features of its atoms, and the attribute features of its chemical bonds through the graph neural network model. For example, the graph neural network model may predict the breaking probability of each chemical bond in the product molecule, and the server then determines the chemical bonds whose breaking probability exceeds a preset threshold as the broken chemical bonds; as another example, the graph neural network model may predict both the breaking probability of each chemical bond and the number of chemical bond breaks in the product molecule, and the server then determines the chemical bonds with the highest breaking probabilities, up to the predicted number of broken bonds, as the broken chemical bonds.
It should be understood that the above-mentioned graph neural network model may be any graph neural network, and the graph neural network model applied in the present application is not specifically limited herein.
Step 203: and performing bond breaking treatment on the product molecules based on the broken chemical bonds to obtain at least one synthon.
After the server predicts the broken chemical bonds in the product molecule through the graph neural network model, it can perform bond-breaking processing on the product molecule based on the predicted broken chemical bonds to obtain the synthons of the product molecule. In general, breaking the bonds of a product molecule may yield one synthon or several synthons; the number of synthons obtained is not limited in any way here.
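One way to realize the bond-breaking step: delete the predicted broken bonds from the adjacency structure and take the connected components that remain, each component being one synthon's set of atoms. This is a sketch of the idea under that assumption, not the exact procedure of this application.

```python
def break_bonds(adj, broken_bonds):
    """Remove broken bonds from an adjacency matrix (list of lists) and return
    the connected components; each component is one synthon's atom indices."""
    n = len(adj)
    adj = [row[:] for row in adj]                # work on a copy
    for i, j in broken_bonds:
        adj[i][j] = adj[j][i] = 0
    seen, synthons = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, component = [start], []
        while stack:                             # depth-first search
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            component.append(v)
            stack.extend(u for u in range(n) if adj[v][u] and u not in seen)
        synthons.append(sorted(component))
    return synthons

# A 4-atom chain 0-1-2-3; breaking the bond (1, 2) yields two synthons.
chain = [[0, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 0]]
parts = break_bonds(chain, [(1, 2)])
```

Breaking one bond of the chain gives two synthons, {0, 1} and {2, 3}; breaking no bonds leaves the molecule as a single synthon.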
Step 204: predicting, by a sequence learning model, the reactant molecules at least according to the character string corresponding to the inverse synthesis reaction type, the character string corresponding to the product molecule, and the character string corresponding to the at least one synthon.
After the server performs the bond-breaking process on the product molecule based on the broken chemical bonds predicted by the graph neural network model to obtain the synthons, it can determine the character string corresponding to the inverse synthesis reaction type, the character string corresponding to the product molecule, and the character string corresponding to each synthon obtained in step 203. The server then inputs the character string corresponding to the inverse synthesis reaction type, the character string corresponding to the product molecule, and the character string corresponding to each synthon into a pre-trained sequence learning model; the sequence learning model performs corresponding analysis processing on these character strings and predicts the character string corresponding to the reactant molecules, and the reactant molecules of the inverse synthesis reaction can be obtained by converting that character string.
In a specific implementation, the server may first combine the character string corresponding to the inverse synthesis reaction type, the character string corresponding to the product molecule, and the character strings corresponding to the synthons to obtain a character string to be processed. Each reaction type may be represented as a character string RXN; for example, the i-th reaction type may be represented by the character string RXN_i. Each Product molecule can be represented as a SMILES character string, denoted Product; each Synthon can likewise be represented as a SMILES character string, denoted Synthon.
Assuming that two synthons are obtained through the bond-breaking process in step 203, with corresponding SMILES character strings Synthon1 and Synthon2 respectively, the character string RXN_i corresponding to the inverse synthesis reaction type, the character string Product corresponding to the product molecule, and the character strings Synthon1 and Synthon2 corresponding to the synthons may be combined by formula (7) to obtain the character string to be processed U:
U=<RXN_i>Product<LINK>Synthon1.Synthon2 (7)
Then, the target character string S is predicted from the character string to be processed U by a sequence learning model, which may be any of various types of sequence learning models, such as a Long Short-Term Memory (LSTM) model, a Transformer model, and the like. Taking the Transformer model as an example, the Transformer model can process the character string to be processed U through formula (8) to obtain the target character string S:
S=g(U;φ) (8)
where g represents the Transformer network, φ represents the parameters of the Transformer network, and the target character string S is the SMILES character string of all reactants predicted by the sequence learning model.
After the sequence learning model determines the target character string S, the server may split the target character string S into at least one target sub-character-string based on the separator in the target character string S, and then convert each target sub-character-string into the corresponding reactant molecule. In general, the character strings corresponding to the individual reactant molecules are concatenated in the target character string S predicted by the sequence learning model, so the target character string S needs to be split on the separator (e.g., ".") in the target character string S to obtain multiple SMILES character string segments, where each segment corresponds to one reactant molecule.
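The string handling around the sequence model can be sketched as below: assemble the input string U of formula (7) from the reaction-type token, the product SMILES, and the synthon SMILES, and split the predicted target string S on the "." separator into one SMILES segment per reactant. The token spellings follow the text; the concrete SMILES strings and the "." separator assumption are illustrative.

```python
def build_input_string(rxn_index, product, synthons):
    """U = <RXN_i>Product<LINK>Synthon1.Synthon2 ... (formula (7))."""
    return "<RXN_{}>{}<LINK>{}".format(rxn_index, product, ".".join(synthons))

def split_reactants(target):
    """Split the predicted target string S on '.' into reactant SMILES."""
    return target.split(".")

# Hypothetical ester product with two synthons (illustrative SMILES)
U = build_input_string(3, "CCOC(C)=O", ["CC[O]", "CC(=O)[OH]"])

# A hypothetical model output S containing two reactant SMILES
reactants = split_reactants("CCO.CC(=O)O")
```

The same "." that joins the synthons in U is the separator on which the predicted reactant string is split back apart.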
Optionally, in order to further improve accuracy of reactant molecule prediction, the graph structure information of the product molecule and the synthon may be comprehensively referenced in the process of predicting the reactant molecule by using the sequence learning model, and the input sequence information (i.e., character string information) may be enhanced by using the graph structure information of the product molecule and the synthon. That is, after the server performs the bond breaking process on the product molecule to obtain at least one synthon, the graph structure of the at least one synthon can be determined accordingly, and then the reactant molecule of the inverse synthetic reaction is predicted according to the graph structure of the product molecule, the graph structure of the at least one synthon, the character string corresponding to the type of the inverse synthetic reaction, the character string corresponding to the product molecule, and the character string corresponding to the at least one synthon through the sequence learning model.
Specifically, the server may convert the synthons into uniquely determined graph structures according to a predetermined sequence, to obtain graph structures of the synthons; further, the graph structure of the product molecule, the graph structure of the synthon, and the character string corresponding to the type of the inverse synthetic reaction, the character string corresponding to the product molecule, and the character string corresponding to the synthon obtained in step 201 are input into a sequence learning model, and the sequence learning model performs corresponding analysis processing on the input graph structure of the product molecule, the graph structure of the synthon, the character string corresponding to the type of the inverse synthetic reaction, the character string corresponding to the product molecule, and the character string corresponding to the synthon, so as to predict the reactant molecule of the inverse synthetic reaction.
According to the inverse synthesis prediction method provided by the embodiment of the application, in the prediction process of the inverse synthesis reaction of the organic compound, not only is the character string information of the product molecules utilized, but also molecular diagram structure information with higher reference value for the prediction of the inverse synthesis reaction of the organic compound is fused into the prediction process through a graph neural network model, so that the prediction accuracy of the inverse synthesis reaction of the organic compound can be improved to a certain extent; in addition, the method provided by the embodiment of the application predicts the bond breaking position through the graph neural network model, and then complements the synthons after bond breaking treatment through the sequence learning model, and the treatment mode enables the prediction process of the inverse synthetic reaction of the organic compound to be easier to visualize and has extremely high interpretability.
The embodiment of the application also provides a training method of the graph neural network model, and the training method of the graph neural network model is described in detail through the embodiment.
Referring to fig. 3, fig. 3 is a flowchart of a training method of the graph neural network model according to an embodiment of the present application. For convenience of description, the following embodiments are described taking a server as the execution subject.
As shown in fig. 3, the training method of the neural network model includes the following steps:
step 301: acquiring a first training sample; the first training sample includes a graphic structure of a sample product molecule, an attribute feature of an atom in the sample product molecule, and a true breaking chemical bond in the sample product molecule, the sample product molecule being an organic compound molecule.
Before the server trains the graphic neural network model to be trained, a plurality of first training samples are usually required to be acquired, wherein each first training sample comprises a graphic structure of a sample product molecule, an attribute characteristic of an atom in the sample product molecule and a real breaking chemical bond in the sample product molecule.
The graph structure of the sample product molecule is a uniquely determined graph structure into which the sample product molecule is converted in a predetermined order; for example, the sample product molecule can be converted in a predetermined order into an adjacency matrix A ∈ R^(n×n) as the graph structure of the sample product molecule, where n is the number of atoms in the sample product molecule. The attribute features of the atoms in the sample product molecule consist of the feature vector of each atom in the sample product molecule; the specific manner of determining the feature vector of an atom is described in detail in step 201 of the embodiment shown in fig. 2 and is not repeated here. Assuming that the length of the feature vector of an atom is d and the number of atoms in the sample product molecule is n, a feature matrix X ∈ R^(n×d) can be used to represent the attribute features of the atoms in the sample product molecule. The truly broken chemical bonds in the sample product molecule may be expressed in the form of configured labels; for example, the label "1" may be configured for the truly broken chemical bonds in the sample product molecule, and the label "0" may be configured for the other chemical bonds in the sample product molecule.
Optionally, if the graph neural network model to be trained can take into account the attribute features of chemical bonds in the product molecule when predicting broken chemical bonds, the first training sample may further include the attribute features of chemical bonds in the sample product molecule. The attribute features of the chemical bonds in the sample product molecule consist of the feature vector of each chemical bond in the sample product molecule; the specific manner of determining the feature vector of a chemical bond is described in detail in step 201 of the embodiment shown in fig. 2 and is not repeated here. Assuming that the length of the feature vector of a chemical bond is p and the number of chemical bonds in the sample product molecule is m, a feature matrix Z ∈ R^(m×p) can be used to represent the attribute features of the chemical bonds in the sample product molecule.
Optionally, if the graph neural network model to be trained can predict the number of chemical bond breaks in the product molecule, the first training sample may further include the number of real chemical bond breaks in the sample product molecule.
Step 302: and determining the predicted fracture probability of each chemical bond in the sample product molecule according to the graph structure of the sample product molecule in the first training sample and the attribute characteristics of atoms in the sample product molecule through a graph neural network model to be trained.
After the server acquires the first training sample, the graph structure of the sample product molecules in the first training sample and the attribute characteristics of atoms in the sample product molecules can be input into a graph neural network model to be trained, and the graph neural network model to be trained can determine the predicted fracture probability of each chemical bond in the sample product molecules after corresponding analysis processing is performed on the input graph structure of the sample product molecules and the attribute characteristics of the atoms in the sample product molecules.
It should be understood that the graphic neural network model to be trained herein may be any graphic neural network model, such as GAT, MPNN, etc., and the graphic neural network model to be trained herein is not limited in any way.
In one possible implementation, the graphic neural network model to be trained may be used only to predict the probability of cleavage of each chemical bond in the product molecule. Assuming that the graphic neural network model to be trained is GAT, the GAT to be trained can map the inputted graphic structure a of the sample product molecule and the attribute feature X of the atoms in the sample product molecule to advanced features by the formula (9):
H=f(X,A;θ) (9)
wherein H ∈ R^(n×b) is the high-level feature matrix that the GAT to be trained maps from the input graph structure A of the sample product molecule and the attribute features X of the atoms in the sample product molecule, where each row corresponds to the high-level feature vector of one atom in the sample product molecule, n is the number of atoms in the sample product molecule, and b is the dimension of the high-level feature vector (typically set manually); θ is a parameter in the GAT to be trained.
After H is obtained, the GAT to be trained can further predict the probability that each chemical bond in the sample product molecule breaks in the inverse synthesis reaction. Specifically, the predicted breaking probability of the chemical bond (x_i, x_j) between atom x_i and atom x_j can be determined by formula (10):
c_ij = σ(w^T [h_i, h_j])   (10)
wherein c_ij represents the predicted breaking probability of the chemical bond (x_i, x_j); σ is the sigmoid function; w ∈ R^(2b); [h_i, h_j] denotes the concatenation of the high-level feature h_i and the high-level feature h_j, where h_i is the high-level feature vector of atom x_i in H and h_j is the high-level feature vector of atom x_j in H. For atom pairs (x_i, x_j) that are not connected in the graph structure of the sample product molecule, no predicted breaking probability needs to be determined.
In another possible implementation, the graph neural network model to be trained can be used to predict the probability of cleavage of each chemical bond in a product molecule and the number of chemical bond breaks in the product molecule. Still assume that the graphic neural network model to be trained is a GAT, and when the GAT to be trained is trained by using the first training sample, the way that the GAT to be trained determines the predicted fracture probability of each chemical bond in the sample product molecule is the same as the previous implementation way.
When determining the predicted number of chemical bond breaks in the sample product molecule, the GAT to be trained can, on the basis of the above implementation, do so through formulas (12) and (13). Specifically, the GAT to be trained can predict the probabilities of the various possible numbers of broken bonds in the sample product molecule by formula (12):
a = softmax(W Σ_{i=1}^{n} h_i)   (12)
wherein a represents the occurrence probabilities corresponding to the possible numbers of broken bonds; W ∈ R^((N+1)×b), where N represents the preset maximum number of broken bonds; and h_i is the high-level feature vector of atom x_i in H.
Furthermore, the GAT to be trained can determine the predicted number of chemical bond breaks k in the sample product molecule by formula (13) based on the calculation result of formula (12):
k = argmax_j a_j - 1   (13)
For example, assuming that a_2 is the largest element in the probability vector a, the predicted number of chemical bond breaks in the sample product molecule is k = 2 - 1 = 1.
It should be noted that, if the first training sample further includes an attribute feature of a chemical bond in a sample product molecule, the aforementioned graph neural network model to be trained should be a graph neural network that can be compatible with the attribute feature Z of the chemical bond, such as an MPNN model. And when training the graphic neural network model to be trained, the attribute characteristics of chemical bonds in sample product molecules included in the first training sample are required to be input into the graphic neural network model to be trained, so that the graphic neural network model to be trained learns the attribute characteristics of the chemical bonds.
Step 303: a first target loss function is determined based on the predicted fracture probability for each chemical bond in the sample product molecule and the true fracture chemical bonds in the sample product molecule.
And the graph neural network model to be trained correspondingly processes the graph structure of the input sample product molecule and the attribute characteristics of atoms in the sample product molecule, and then outputs the predicted fracture probability of each chemical bond in the sample product molecule. Further, the server may determine the first objective loss function based on the predicted fracture probability for each chemical bond in the sample product molecule and the true fracture chemical bonds in the sample product molecule in the first training sample.
In the case where the graph neural network model to be trained determines only the predicted fracture probability of each chemical bond in the sample product molecule, the first objective loss function constructed may be as shown in equation (14):
L_1 = Σ_{s=1}^{N} Σ_{(x_i,x_j)} l(c_ij^(s), y_ij^(s))   (14)
where N is the total number of first training samples used for training, s is the index of the first training sample currently used, c_ij^(s) is the predicted breaking probability of the chemical bond (x_i, x_j) in the sample product molecule determined by the graph neural network model to be trained, y_ij^(s) is the true label included in the first training sample characterizing whether the chemical bond (x_i, x_j) breaks (1 if it breaks, 0 if it does not), and l may be any suitable loss function, such as a cross-entropy loss function.
Under the condition that the graph neural network model to be trained can determine not only the predicted breaking probability of each chemical bond in the sample product molecule, but also the number of predicted chemical bond breaks in the sample product molecule, the constructed first objective loss function can be shown as a formula (15):
L_2 = Σ_{s=1}^{N} ( l_1(k^(s), k_true^(s)) + Σ_{(x_i,x_j)} l_2(c_ij^(s), y_ij^(s)) )   (15)
wherein N is the total number of first training samples used for training, s is the index of the first training sample currently used, k^(s) is the predicted number of chemical bond breaks in the sample product molecule determined by the graph neural network model to be trained, k_true^(s) is the number of truly broken chemical bonds in the sample product molecule, c_ij^(s) is the predicted breaking probability of the chemical bond (x_i, x_j) in the sample product molecule determined by the graph neural network model to be trained, y_ij^(s) is the true label included in the first training sample characterizing whether the chemical bond (x_i, x_j) breaks (1 if it breaks, 0 if it does not), and l_1 and l_2 may be any suitable loss functions, such as cross-entropy loss functions.
It should be understood that the first objective loss functions shown in the above formulas (14) and (15) are only examples, and in practical applications, other functional forms may be adopted as the first objective loss function, and the specific form of the first objective loss function is not limited in this application.
Step 304: and training the graph neural network model to be trained based on the first target loss function.
The server can continuously update the parameters of the graph neural network model to be trained by optimizing the first target loss function, thereby training the graph neural network model to be trained. When the graph neural network model to be trained meets a training end condition, for example, when the prediction error for broken chemical bonds is smaller than a preset threshold, or the number of training iterations reaches a preset count, training of the graph neural network model can be considered complete.
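The stopping logic described above can be sketched framework-agnostically as follows, where `loss_fn` and `update_fn` are hypothetical placeholders for evaluating the target loss and performing one parameter-update step (neither name appears in the patent):

```python
def train_until_done(params, loss_fn, update_fn, max_iters=1000, err_threshold=1e-3):
    """Optimize until the prediction error drops below a preset threshold,
    or the preset number of training iterations is reached."""
    loss = loss_fn(params)
    for it in range(1, max_iters + 1):
        if loss < err_threshold:       # end condition 1: error small enough
            return params, loss, it - 1
        params = update_fn(params)     # one parameter-update step
        loss = loss_fn(params)
    return params, loss, max_iters     # end condition 2: iteration budget spent
```

For example, with a toy quadratic loss and a gradient-descent update, the loop returns as soon as the loss falls under the threshold.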
The graph neural network model in the embodiment of the application can accurately predict potential bond breaking positions in the product molecules based on graph structure information of the product molecules and attribute characteristics of atoms in the product molecules; therefore, the molecular diagram structure information is effectively integrated into the process of predicting the inverse synthetic reaction of the organic compound, and the accuracy of predicting the inverse synthetic reaction of the organic compound is improved. In addition, the potential bond breaking position in the product molecule is predicted, which is helpful to realize the visualization of the prediction process of the reverse synthesis reaction of the organic compound, so that the prediction process of the reverse synthesis reaction of the organic compound is more interpretable.
The embodiment of the application also provides a training method of the sequence learning model, and the training method of the sequence learning model is described in detail through the embodiment.
Referring to fig. 4, fig. 4 is a flow chart of a training method of a sequence learning model according to an embodiment of the present application. For convenience of description, the following embodiments will be described taking a server as an execution subject. As shown in fig. 4, the training method of the sequence learning model includes the following steps:
step 401: acquiring a second training sample; the second training sample comprises a first character string and a second character string corresponding to the same inverse synthetic reaction, wherein the first character string is obtained by combining a character string corresponding to the type of the inverse synthetic reaction, a character string corresponding to a sample product molecule and a character string corresponding to a synthon of the sample product molecule, the second character string is obtained by combining a character string corresponding to a reactant molecule of the inverse synthetic reaction, and the sample product molecule is an organic compound molecule.
Before the server trains the sequence learning model to be trained, a plurality of second training samples usually need to be acquired. Each second training sample comprises a first character string and a second character string corresponding to the same inverse synthetic reaction, wherein the first character string is obtained by combining the character string corresponding to the type of the inverse synthetic reaction, the character string corresponding to the sample product molecule, and the character string corresponding to a synthon of the sample product molecule, and the second character string is obtained by combining the character strings corresponding to the reactant molecules of the inverse synthetic reaction.
Specifically, each reaction type may be represented as a string RXN accordingly, e.g., the ith reaction type may be represented by a string rxn_i; each Product molecule can be represented as a corresponding SMILES string, denoted Product; each Synthon may be represented as a SMILES string, respectively, denoted Synthon; the corresponding molecule (i.e., reactant molecule) of each synthon can also be correspondingly represented as a SMILES string, denoted as a Reactant.
Assuming that two synthons of the sample Product molecule exist, and the strings corresponding to the two synthons are Synthon1 and Synthon2 respectively, the string RXN_i corresponding to the type of inverse synthesis reaction, the string Product corresponding to the sample product molecule, and the strings Synthon1 and Synthon2 corresponding to the synthons of the sample product molecule can be combined by formula (16) to obtain the first string U in the second training sample:

U=<RXN_i>Product<LINK>Synthon1.Synthon2 (16)
Assuming that there are two correct Reactant molecules corresponding to the sample product molecules, the character strings corresponding to the two Reactant molecules are respectively Reactant1 and Reactant2, the character strings corresponding to the two Reactant molecules can be combined by the formula (17) to obtain a second character string V in the second training sample:
V=Reactant1.Reactant2 (17)
at this time, the first character string U and the second character string V may be used to form a second training sample (U, V).
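The string compositions of formulas (16) and (17) amount to simple concatenation; a sketch follows, where the reaction-type tag and the SMILES strings are placeholder values invented for illustration:

```python
def make_first_string(rxn_type, product, synthons):
    """Formula (16): <RXN_i> + Product + <LINK> + dot-joined synthons."""
    return "<{}>{}<LINK>{}".format(rxn_type, product, ".".join(synthons))

def make_second_string(reactants):
    """Formula (17): dot-joined reactant strings."""
    return ".".join(reactants)

# one (U, V) training pair, with placeholder reaction type and SMILES strings
U = make_first_string("RXN_3", "CC(=O)OC", ["CC(=O)[O]", "[CH3:1]O"])
V = make_second_string(["CC(=O)O", "CO"])
```

The "." joining character matches the standard SMILES dot-disconnection notation for separating molecules within one string.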
Optionally, in order to further improve the robustness of the sequence learning model to be trained, information about the synthons predicted by the graph neural network model for the sample product molecule can be blended into the second training sample. That is, a third character string corresponding to the same inverse synthetic reaction is added to the second training sample, wherein the third character string is obtained by combining the character string corresponding to the type of the inverse synthetic reaction, the character string corresponding to the sample product molecule, and the character string corresponding to a predicted synthon, and the predicted synthon is obtained by prediction, based on the graph neural network model, according to the graph structure of the sample product molecule and the attribute characteristics of the atoms in the sample product molecule.
Specifically, the graph structure of the sample product molecule and the attribute characteristics of atoms in the sample product molecule can be processed by using the graph neural network model, the breaking chemical bond of the sample product molecule is determined, then the breaking process is performed on the sample product molecule based on the breaking chemical bond to obtain a predicted synthon, and the predicted synthon is further converted into a corresponding SMILES character string.
Assume that the sample product molecule in the second training sample is processed to obtain three predicted synthons, whose corresponding character strings are Synthon1', Synthon2' and Synthon3' respectively. At this time, the string RXN_i corresponding to the inverse synthesis reaction type, the string Product corresponding to the sample product molecule, and the strings Synthon1', Synthon2' and Synthon3' corresponding to the predicted synthons can be combined by formula (18) to obtain the third character string U':

U'=<RXN_i>Product<LINK>Synthon1'.Synthon2'.Synthon3' (18)

At this time, the first character string U, the second character string V and the third character string U' can be used to compose a second training sample (U, V, U').
Step 402: and determining a first reactant prediction character string according to the first character string in the second training sample through a sequence learning model to be trained.
After the server acquires the second training sample, the first character string in the second training sample can be input into a sequence learning model to be trained, and after the sequence learning model to be trained carries out corresponding analysis processing on the input first character string, a corresponding first reactant prediction character string can be determined.
It should be understood that the sequence learning model to be trained here may be any sequence learning model, such as an LSTM or a Transformer; this application places no limitation on the sequence learning model to be trained.
Taking the sequence learning model to be trained as a Transformer model as an example, after the first character string U is input into the Transformer model to be trained, the Transformer model to be trained can process the first character string U through formula (19) to obtain the first reactant prediction character string S:
S=g(U;φ) (19)
where g represents the Transformer model to be trained, φ represents the parameters of the Transformer model, and the first reactant prediction string S is the SMILES string of all reactants predicted by the Transformer model.
If the second training sample further includes a third character string, the server also needs to determine a second reactant prediction character string according to the third character string through the sequence learning model to be trained. Specifically, taking the sequence learning model to be trained as a Transformer model as an example, after the third character string U' is input into the Transformer model to be trained, the Transformer model to be trained can process the third character string U' through formula (20) to obtain a second reactant prediction character string S':

S'=g(U';φ) (20)

where g represents the Transformer model to be trained, φ represents the parameters of the Transformer model, and the second reactant prediction string S' is the SMILES string of all reactants predicted by the Transformer model.
Step 403: a second target loss function is determined based on an error between the first reactant predictive string and the second string in the second training sample.
After the sequence learning model to be trained processes the input first character string, it outputs the first reactant prediction character string. The server may then determine the error of the first reactant prediction character string relative to the second character string in the second training sample, and determine the second target loss function based on this error.
In the case where only the first character string and the second character string are included in the second training sample, the second objective loss function constructed by the server may be as shown in formula (21):

L_2 = (1/N) Σ_{s=1}^{N} l( S_s, V_s ) (21)

where N is the total number of second training samples used for training, s is the index of the second training sample currently used, S_s is the first reactant prediction character string determined by the sequence learning model to be trained based on the first character string U, V_s is the second character string in the second training sample, and l may be any suitable loss function, such as a negative log-likelihood function.
In the case that the second training sample includes the first character string, the second character string and the third character string, and a second reactant prediction character string is determined based on the third character string using the sequence learning model to be trained in step 402, the server may determine the second objective loss function according to the error between the first reactant prediction character string and the second character string in the second training sample, together with the error between the second reactant prediction character string and the second character string in the second training sample. Specifically, the second objective loss function constructed by the server may be as shown in formula (22):

L_2 = (1/N) Σ_{s=1}^{N} [ l( S_s, V_s ) + l( S'_s, V_s ) ] (22)

where N is the total number of second training samples used for training, s is the index of the second training sample currently used, S_s is the first reactant prediction character string determined by the sequence learning model to be trained based on the first character string U, S'_s is the second reactant prediction character string determined by the sequence learning model to be trained based on the third character string U', V_s is the second character string in the second training sample, and l may be any suitable loss function, such as a negative log-likelihood function.
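As a toy illustration of formula (22), assume the per-sample loss l is a negative log-likelihood and, purely for illustration, that the model exposes a probability for each character of the reference string (this data layout and the function names are assumptions, not part of the patent):

```python
import math

def seq_nll(char_probs, target):
    """Negative log-likelihood of a target string under per-position
    character distributions -- a stand-in for any suitable sequence loss."""
    return -sum(math.log(dist[ch]) for dist, ch in zip(char_probs, target))

def second_objective(first_losses, second_losses):
    """Formula (22): sum the two per-sample loss terms, then average over N."""
    return sum(a + b for a, b in zip(first_losses, second_losses)) / len(first_losses)
```

Here `first_losses[s]` and `second_losses[s]` would hold l(S_s, V_s) and l(S'_s, V_s) respectively for sample s.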
It should be understood that the second objective loss function shown in the above formulas (21) and (22) is only an example, and in practical application, other functional forms may be adopted as the second objective loss function, and the specific form of the second objective loss function is not limited in this application.
The inventors tested the sequence learning models trained in the above two ways (the way whose training samples do not contain the third character string is called the first way, and the way whose training samples contain the third character string is called the second way) on the public dataset USPTO_50K. The inverse synthesis prediction method provided by the embodiments of the present application achieves a prediction accuracy of 63% with the sequence learning model trained in the first way, and 70% with the sequence learning model trained in the second way. It can therefore be seen that fusing the third character string when training the sequence learning model can effectively improve the accuracy of inverse synthesis prediction.
Step 404: and training the sequence learning model to be trained based on the second target loss function.
The server can continuously update the parameters of the sequence learning model to be trained by optimizing the second target loss function, thereby training the sequence learning model to be trained. When the sequence learning model to be trained meets a training end condition, for example, when the prediction error for reactant molecules is smaller than a preset threshold, or the number of training iterations reaches a preset count, training of the sequence learning model can be considered complete.
The sequence learning model in the embodiment of the application can accurately predict the reactant molecules based on the character strings corresponding to the reverse synthesis reaction types, the character strings corresponding to the product molecules and the character strings corresponding to the synthons; on the basis that the information of a molecular diagram structure is introduced into the graph neural network model, the accuracy of the inverse synthesis reaction prediction of the organic compound can be improved in a synergistic auxiliary manner.
In order to further understand the artificial intelligence-based inverse synthesis prediction method provided in the embodiments of the present application, an overall exemplary description of the artificial intelligence-based inverse synthesis prediction method provided in the embodiments of the present application is provided below with reference to fig. 5 and 6.
The overall flow of the inverse synthesis prediction method provided by the embodiment of the application is mainly divided into two stages. The first stage is used for predicting potential bond breaking positions in product molecules and carrying out bond breaking treatment on the product molecules to obtain synthons; the second stage is used for complementing the synthons by using the information of the product molecules, thereby obtaining reactant molecules of the inverse synthetic reaction.
Fig. 5 is a schematic diagram of the implementation architecture of the first stage. As shown in fig. 5, the graph neural network model 501 is a model utilized in the first stage, whose inputs include a graph structure 502 of a product molecule and attribute features 503 of atoms in the product molecule, the attribute features 503 of atoms in the product molecule being determined from key features of each atom in the product molecule. Optionally, the input of the graph neural network model 501 may further include an attribute feature 504 of a chemical bond in a product molecule and a reaction type feature 505 of the inverse synthetic reaction, where the attribute feature 504 of the chemical bond in the product molecule is determined according to a key feature of each chemical bond in the product molecule, and the reaction type feature 505 may be set to 0 without knowing the reaction type of the inverse synthetic reaction.
The graph neural network model 501 can predict the fracture probability of each chemical bond in a product molecule and the number of the fracture chemical bonds in the product molecule according to the input information; assuming that the graph neural network model 501 predicts that the number of broken bonds in the product molecule is k, the predicted broken bonds in the product molecule should be the top k bonds with the highest probability of breaking. Furthermore, the product molecule is subjected to bond breaking treatment based on the k broken chemical bonds, and accordingly several synthons are obtained.
Fig. 6 is a schematic diagram of an implementation architecture of the second stage. As shown in fig. 6, the synthon-to-molecule (Synthon to Molecule, Syn2Mol) model 601 is the model utilized in the second stage, and the Syn2Mol model 601 may be any sequence learning model. Its inputs include a string 602 corresponding to the type of inverse synthetic reaction, a string 603 corresponding to the product molecule, and a string 604 corresponding to the synthons. Optionally, the inputs to the Syn2Mol model 601 may also include the graph structure of the product molecule and the graph structure of the synthons.
The Syn2Mol model 601 can complement the synthons accordingly according to the input information to obtain the SMILES character strings corresponding to the reactant molecules, and further, can convert the SMILES character strings corresponding to the reactant molecules into corresponding reactant molecules, so as to complete the prediction of the inverse synthesis reaction.
For the artificial intelligence-based inverse synthesis prediction method described above, the application also provides a corresponding artificial intelligence-based inverse synthesis prediction device, so that the above-mentioned inverse synthesis prediction method is practically applied and implemented.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an inverse synthesis prediction apparatus 700 corresponding to the artificial intelligence-based inverse synthesis prediction method shown in fig. 2 above, the inverse synthesis prediction apparatus comprising:
An obtaining module 701, configured to obtain a graph structure of a product molecule and an attribute feature of an atom in the product molecule, where the product molecule is an organic compound molecule;
a first prediction module 702, configured to predict, by using a graph neural network model, a broken chemical bond in the product molecule based at least on a graph structure of the product molecule and a property feature of an atom in the product molecule;
a bond breaking module 703, configured to perform bond breaking treatment on the product molecule based on the broken chemical bond, so as to obtain at least one synthon;
the second prediction module 704 is configured to predict, through a sequence learning model, the reactant molecules according to at least a character string corresponding to the inverse synthetic reaction type, a character string corresponding to the product molecule, and a character string corresponding to the at least one synthon.
Optionally, on the basis of the inverse synthetic prediction apparatus shown in fig. 7, the first prediction module 702 is specifically configured to:
predicting the breaking probability of each chemical bond in the product molecule and the number of chemical bond breaks in the product molecule at least according to the graph structure of the product molecule and the attribute characteristics of atoms in the product molecule through the graph neural network model;
and sorting the breaking probabilities of the chemical bonds in the product molecule in descending order, and determining the chemical bonds corresponding to the top-ranked breaking probabilities, up to the predicted number of chemical bond breaks, as the broken chemical bonds.
Optionally, on the basis of the inverse synthetic prediction apparatus shown in fig. 7, the first prediction module 702 is specifically configured to:
predicting, by the graph neural network model, a fracture probability of each chemical bond in the product molecule based at least on a graph structure of the product molecule and an attribute feature of an atom in the product molecule;
and determining a chemical bond with a fracture probability larger than a preset threshold value as the fracture chemical bond.
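The two selection strategies used by the first prediction module can be sketched as follows, assuming for illustration that the model's output is available as a dict mapping atom-pair bonds to predicted breaking probabilities (the data layout is an assumption, not from the patent):

```python
def select_bonds_topk(break_probs, k):
    """First strategy: sort bonds by predicted breaking probability in
    descending order and keep the top k as the broken chemical bonds."""
    ranked = sorted(break_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [bond for bond, _ in ranked[:k]]

def select_bonds_threshold(break_probs, threshold):
    """Second strategy: keep every bond whose predicted breaking
    probability exceeds the preset threshold."""
    return [bond for bond, p in break_probs.items() if p > threshold]
```

The first strategy requires the model to also predict the break count k; the second needs only the per-bond probabilities plus a fixed threshold.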
Optionally, on the basis of the inverse synthesis prediction apparatus shown in fig. 7, the obtaining module 701 is further configured to: acquiring attribute characteristics of chemical bonds in the product molecules;
the first prediction module 702 is specifically configured to:
predicting a broken chemical bond in the product molecule according to the graph structure of the product molecule, the attribute characteristics of atoms in the product molecule and the attribute characteristics of chemical bonds in the product molecule through the graph neural network model.
Optionally, on the basis of the inverse synthesis prediction device shown in fig. 7, the attribute features of the atoms in the product molecule include any one or more of the following:
Atom type, number of linkages, formal charge, chirality, number of attached hydrogen atoms, atomic hybridization, aromaticity, atomic weight, high frequency reaction center characteristics, reverse synthetic reaction type;
the attribute characteristics of the chemical bonds in the product molecules include any one or more of the following:
bond type, conjugated character, cyclic bond character, molecular stereochemistry character.
Optionally, on the basis of the inverse synthetic prediction device shown in fig. 7, the second prediction module 704 is specifically configured to:
combining the character string corresponding to the inverse synthesis reaction type, the character string corresponding to the product molecule and the character string corresponding to the at least one synthon to obtain a character string to be processed;
predicting a target character string according to the character string to be processed through the sequence learning model;
dividing the target character string into at least one target substring based on separators in the target character string;
converting the target substring into the corresponding reactant molecule.
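The separator-based division of the target string can be sketched in a few lines; the "." separator below is the standard SMILES dot-disconnection character, while the function name is illustrative:

```python
def split_reactants(target_string, separator="."):
    """Divide the predicted target string into substrings at each separator;
    each non-empty substring is the SMILES of one reactant molecule."""
    return [part for part in target_string.split(separator) if part]
```

Each resulting substring would then be converted into the corresponding reactant molecule, for example by a SMILES parser.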
Optionally, on the basis of the inverse synthetic prediction device shown in fig. 7, the second prediction module 704 is specifically configured to:
determining a graph structure of the at least one synthon;
And predicting the reactant molecules according to the graph structure of the product molecules, the graph structure of the at least one synthon, the character string corresponding to the reverse synthesis reaction type, the character string corresponding to the product molecules and the character string corresponding to the at least one synthon through the sequence learning model.
Alternatively, referring to fig. 8, fig. 8 is a schematic structural diagram of another inverse synthesis prediction apparatus 800 according to an embodiment of the present application, based on the inverse synthesis prediction apparatus shown in fig. 7. As shown in fig. 8, the inverse synthesis prediction apparatus 800 further includes: a first training module 801, where the first training module 801 is configured to:
acquiring a first training sample; the first training sample comprises a graph structure of a sample product molecule, an attribute characteristic of atoms in the sample product molecule and a real fracture chemical bond in the sample product molecule, wherein the sample product molecule is an organic compound molecule;
determining the predicted fracture probability of each chemical bond in the sample product molecule according to the graph structure of the sample product molecule in the first training sample and the attribute characteristics of atoms in the sample product molecule through a graph neural network model to be trained;
Determining a first target loss function based on the predicted fracture probability of each chemical bond in the sample product molecule and the true fracture chemical bonds in the sample product molecule;
and training the graph neural network model to be trained based on the first target loss function.
Alternatively, referring to fig. 9, fig. 9 is a schematic structural diagram of another inverse synthesis prediction apparatus 900 according to an embodiment of the present application, based on the inverse synthesis prediction apparatus shown in fig. 7. As shown in fig. 9, the inverse synthesis prediction apparatus 900 further includes: a second training module 901, the second training module 901 is configured to:
acquiring a second training sample; the second training sample comprises a first character string and a second character string corresponding to the same inverse synthetic reaction, the first character string is obtained by combining a character string corresponding to the type of the inverse synthetic reaction, a character string corresponding to a sample product molecule and a character string corresponding to a synthon of the sample product molecule, the second character string is obtained by combining a character string corresponding to a reactant molecule of the inverse synthetic reaction, and the sample product molecule is an organic compound molecule;
determining a first reactant prediction character string according to the first character string in the second training sample through a sequence learning model to be trained;
Determining a second target loss function based on an error between the first reactant predictive string and the second string in the second training sample;
and training the sequence learning model to be trained based on the second target loss function.
Optionally, on the basis of the inverse synthetic prediction apparatus shown in fig. 9, the second training sample further includes: a third character string corresponding to the inverse synthetic reaction, the third character string being obtained by combining a character string corresponding to the type of inverse synthetic reaction, a character string corresponding to the sample product molecule, and a character string corresponding to a prediction synthon predicted from a graph structure of the sample product molecule and attribute characteristics of atoms in the sample product molecule based on the graph neural network model;
the second training module 901 is specifically configured to:
determining a second reactant prediction character string according to the third character string in the second training sample through the sequence learning model to be trained;
and determining the second target loss function according to the error between the first reactant prediction string and the second string in the second training sample and the error between the second reactant prediction string and the second string in the second training sample.
In the inverse synthesis reaction prediction process, the inverse synthesis prediction device provided by the embodiments of the present application not only utilizes the character string information of the product molecule, but also fuses molecular graph structure information, which has high reference value for organic compound inverse synthesis reaction prediction, into the prediction process through the graph neural network model, so the prediction accuracy of the organic compound inverse synthesis reaction can be improved to a certain extent. In addition, the device provided by the embodiments of the present application predicts the bond breaking position through the graph neural network model, and then complements the synthons obtained after bond breaking through the sequence learning model; this processing makes the prediction process of the organic compound inverse synthesis reaction easier to visualize and highly interpretable.
The embodiments of the present application also provide a device for predicting an inverse synthesis reaction, which may specifically be a server or a terminal device. The server and the terminal device provided by the embodiments of the present application are introduced below from the perspective of hardware implementation.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a server 1000 according to an embodiment of the present application. The server 1000 may vary considerably in configuration or performance and may include one or more central processing units (central processing units, CPU) 1022 (e.g., one or more processors) and memory 1032, one or more storage media 1030 (e.g., one or more mass storage devices) storing applications 1042 or data 1044. Wherein memory 1032 and storage medium 1030 may be transitory or persistent. The program stored on the storage medium 1030 may include one or more modules (not shown), each of which may include a series of instruction operations on a server. Further, central processor 1022 may be configured to communicate with storage medium 1030 to perform a series of instruction operations in storage medium 1030 on server 1000.
The server 1000 may also include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input/output interfaces 1058, and/or one or more operating systems 1041, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 10.
Wherein, the CPU 1022 is configured to perform the following steps:
obtaining a graph structure of a product molecule and attribute characteristics of atoms in the product molecule, wherein the product molecule is an organic compound molecule;
predicting, by a graph neural network model, a broken chemical bond in the product molecule based at least on the graph structure of the product molecule and the attribute characteristics of atoms in the product molecule;
performing bond breaking treatment on the product molecules based on the broken chemical bonds to obtain at least one synthon;
and predicting the reactant molecules at least according to the character strings corresponding to the reverse synthesis reaction types, the character strings corresponding to the product molecules and the character strings corresponding to the at least one synthons through a sequence learning model.
Optionally, the CPU 1022 may also be configured to perform the steps of any implementation of the artificial intelligence-based inverse synthetic prediction method provided in embodiments of the present application.
Referring to fig. 11, fig. 11 is a schematic structural diagram of a terminal device according to an embodiment of the present application. For convenience of explanation, only the portions relevant to the embodiments of the present application are shown; for specific technical details not disclosed, refer to the method portions of the embodiments of the present application. The terminal can be any terminal device, including a smart phone, a computer, a tablet computer, a personal digital assistant, and the like; the following takes the terminal being a computer as an example:
fig. 11 is a block diagram showing a part of the structure of a computer related to a terminal provided in an embodiment of the present application. Referring to fig. 11, a computer includes: radio Frequency (RF) circuitry 1111, memory 1120, input unit 1130, display unit 1140, sensor 1150, audio circuit 1160, wireless fidelity (wireless fidelity, wiFi) module 1170, processor 1180, power supply 1190, and the like. Those skilled in the art will appreciate that the computer architecture shown in fig. 11 is not limiting and that more or fewer components than shown may be included, or that certain components may be combined, or that different arrangements of components may be utilized.
The memory 1120 may be used to store software programs and modules, and the processor 1180 executes the software programs and modules stored in the memory 1120 to perform various functional applications and data processing of the computer. The memory 1120 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the computer (such as audio data, phonebooks, etc.), and the like. In addition, the memory 1120 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
The processor 1180 is the control center of the computer: it connects the various parts of the entire computer using various interfaces and lines, and performs the various functions of the computer and processes data by running or executing the software programs and/or modules stored in the memory 1120 and calling the data stored in the memory 1120. Optionally, the processor 1180 may include one or more processing units; preferably, the processor 1180 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interfaces, applications, etc., and the modem processor mainly handles wireless communications. It will be appreciated that the modem processor may alternatively not be integrated into the processor 1180.
In the embodiment of the present application, the processor 1180 included in the terminal further has the following functions:
obtaining a graph structure of a product molecule and attribute characteristics of atoms in the product molecule, wherein the product molecule is an organic compound molecule;
predicting, by a graph neural network model, a broken chemical bond in the product molecule according to at least the graph structure of the product molecule and the attribute characteristics of atoms in the product molecule;
performing bond-breaking processing on the product molecule based on the broken chemical bond to obtain at least one synthon;
and predicting, by a sequence learning model, the reactant molecules according to at least the character string corresponding to the inverse synthesis reaction type, the character string corresponding to the product molecule, and the character string corresponding to the at least one synthon.
Optionally, the processor 1180 is further configured to perform steps of any implementation of the artificial intelligence-based inverse synthetic prediction method provided in the embodiments of the present application.
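The string handling around the sequence learning model (combining the reaction-type, product, and synthon character strings into one input, and dividing the predicted target string into per-reactant substrings at separators) can be sketched as follows. This is a hedged illustration: the separator token ".", the reaction-type token format, and the SMILES-like strings are assumptions for the example, not details fixed by this application.

```python
# Illustrative sketch (assumptions: "." as separator, SMILES-like strings,
# a "<RX_n>" token for the inverse synthesis reaction type).

def build_input(reaction_type, product, synthons, sep="."):
    """Combine the reaction-type string, the product-molecule string, and
    the synthon strings into a single character string to be processed."""
    return sep.join([reaction_type, product] + list(synthons))

def split_reactants(target, sep="."):
    """Divide the predicted target string into one substring per
    reactant molecule, based on the separators it contains."""
    return target.split(sep)

inp = build_input("<RX_1>", "CC(=O)Oc1ccccc1C(=O)O",
                  ["CC(=O)O", "Oc1ccccc1C(=O)O"])
print(inp)
print(split_reactants("CC(=O)Cl.Oc1ccccc1C(=O)O"))
```

Each substring produced by `split_reactants` would then be converted back into the corresponding reactant molecule.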
The embodiments of the present application further provide a computer readable storage medium storing a computer program for executing any one of the methods for artificial intelligence-based inverse synthetic prediction according to the foregoing embodiments.
The present embodiments also provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform any one of the methods of artificial intelligence based inverse synthetic prediction described in the various embodiments above.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing a computer program.
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A, only B, or both A and B, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following items" or similar expressions means any combination of these items, including any combination of a single item or plural items. For example, "at least one of a, b, or c" may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may be singular or plural.
The above embodiments are merely intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.
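The first target loss recited in the claims below combines a cross-entropy term over the predicted number of bond breaks with per-bond cross-entropy terms over the predicted breaking probabilities. A minimal sketch, assuming a discrete probability distribution over the break count and per-bond binary labels (all names and the exact reduction are hypothetical, not taken from this application):

```python
# Illustrative sketch of a two-part training loss: cross-entropy over the
# number of bond breaks (l1) plus per-bond binary cross-entropy over the
# predicted breaking probabilities (l2). Names are assumptions.
import math

def binary_cross_entropy(p, y, eps=1e-12):
    """Binary cross-entropy between predicted probability p and label y."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def first_target_loss(samples):
    """samples: list of dicts with keys
    'k_probs'    : predicted distribution over the number of bond breaks,
    'k_true'     : real number of bond breaks,
    'bond_probs' : {bond: predicted breaking probability},
    'bond_labels': {bond: 1 if the bond really breaks else 0}."""
    total = 0.0
    for s in samples:
        total += -math.log(max(s["k_probs"][s["k_true"]], 1e-12))  # l1 term
        for bond, p in s["bond_probs"].items():                    # l2 terms
            total += binary_cross_entropy(p, s["bond_labels"][bond])
    return total
```

A sample whose predictions match the labels yields a loss near zero, while mispredicted bonds contribute large per-bond terms, which is the behavior a cross-entropy objective of this shape is meant to provide.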

Claims (9)

1. An artificial intelligence based inverse synthetic prediction method, the method comprising:
obtaining a graph structure of a product molecule and attribute characteristics of atoms in the product molecule, wherein the product molecule is an organic compound molecule;
predicting, by a graph neural network model, a broken chemical bond in the product molecule according to at least the graph structure of the product molecule and the attribute characteristics of atoms in the product molecule, which specifically includes: predicting, by the graph neural network model, the breaking probability of each chemical bond in the product molecule and the number of broken chemical bonds in the product molecule according to at least the graph structure of the product molecule and the attribute characteristics of atoms in the product molecule, sorting the breaking probabilities of the chemical bonds in the product molecule in descending order, and determining, as the broken chemical bonds, the chemical bonds corresponding to the top-ranked breaking probabilities, the number of the determined chemical bonds being equal to the number of broken chemical bonds;
performing bond-breaking processing on the product molecule based on the broken chemical bonds to obtain at least one synthon;
predicting, by a sequence learning model, a reactant molecule according to at least a character string corresponding to an inverse synthesis reaction type, a character string corresponding to the product molecule, and a character string corresponding to the at least one synthon;
wherein the sequence learning model is trained by:
acquiring a second training sample;
if the second training sample only comprises a first character string and a second character string corresponding to the same inverse synthetic reaction, determining a first reactant prediction character string according to the first character string in the second training sample through a sequence learning model to be trained; determining a second target loss function based on an error between the first reactant predictive string and the second string in the second training sample; the first character string is obtained by combining a character string corresponding to the type of the inverse synthetic reaction, a character string corresponding to a sample product molecule and a character string corresponding to a synthon of the sample product molecule, the second character string is obtained by combining a character string corresponding to a reactant molecule of the inverse synthetic reaction, and the sample product molecule is an organic compound molecule; wherein the second objective loss function is represented by the formula:
$$L = \sum_{s=1}^{N} l\left(S_s, V_s\right)$$
wherein N is the total number of second training samples used for training, s is the index of the second training sample currently used, $S_s$ is the first reactant prediction character string determined by the sequence learning model to be trained based on the first character string, $V_s$ is the second character string in the s-th second training sample, and l is a loss function;
if the second training sample comprises a first character string, a second character string and a third character string which correspond to the same inverse synthetic reaction, determining a second reactant prediction character string according to the third character string in the second training sample through the sequence learning model to be trained; determining the second target loss function based on an error between the first reactant predicted string and the second string in the second training sample, and an error between the second reactant predicted string and the second string in the second training sample; the third character string is obtained by combining a character string corresponding to the inverse synthetic reaction type, a character string corresponding to the sample product molecule and a character string corresponding to a prediction synthon, and the prediction synthon is obtained by predicting according to the graph structure of the sample product molecule and the attribute characteristics of atoms in the sample product molecule based on the graph neural network model; wherein the second objective loss function is represented by the formula:
$$L = \sum_{s=1}^{N} \left[ l\left(S_s, V_s\right) + l\left(\hat{S}_s, V_s\right) \right]$$
wherein N is the total number of second training samples used for training, s is the index of the second training sample currently used, $S_s$ is the first reactant prediction character string determined by the sequence learning model to be trained based on the first character string, $\hat{S}_s$ is the second reactant prediction character string determined by the sequence learning model to be trained based on the third character string, $V_s$ is the second character string in the s-th second training sample, and l is a loss function;
training the sequence learning model to be trained based on the second target loss function;
wherein the graph neural network model is trained by:
acquiring a first training sample; the first training sample comprises a graph structure of a sample product molecule, an attribute characteristic of atoms in the sample product molecule and a real fracture chemical bond in the sample product molecule, wherein the sample product molecule is an organic compound molecule;
determining the predicted breaking probability of each chemical bond in the sample product molecule and the predicted number of chemical bond breaks in the sample product molecule according to the graph structure of the sample product molecule in the first training sample and the attribute characteristics of atoms in the sample product molecule through a graph neural network model to be trained;
Determining a first target loss function according to the predicted breaking probability of each chemical bond in the sample product molecule, the predicted number of chemical bond breaks in the sample product molecule and the actual broken chemical bonds in the sample product molecule; wherein the first target loss is represented by the following formula:
$$L = \sum_{s=1}^{N} \left[ l_1\left(k, k_s^{*}\right) + \sum_{(x_i, x_j)} l_2\left(p_{(x_i, x_j)}^{s}, y_{(x_i, x_j)}^{s}\right) \right]$$
wherein N is the total number of first training samples used for training, s is the index of the first training sample currently used, k is the predicted number of chemical bond breaks in the sample product molecule determined by the graph neural network model to be trained, $k_s^{*}$ is the real number of chemical bond breaks in the sample product molecule, $p_{(x_i, x_j)}^{s}$ is the predicted breaking probability of the chemical bond $(x_i, x_j)$ in the sample product molecule determined by the graph neural network model to be trained, and $y_{(x_i, x_j)}^{s}$ is the real label included in the first training sample for characterizing whether the chemical bond $(x_i, x_j)$ in the sample product molecule breaks; $y_{(x_i, x_j)}^{s} = 1$ indicates that the chemical bond $(x_i, x_j)$ will break, and $y_{(x_i, x_j)}^{s} = 0$ indicates that it will not break; $l_1$ and $l_2$ are cross-entropy loss functions;
and training the graph neural network model to be trained based on the first target loss function.
2. The method according to claim 1, wherein the method further comprises:
Acquiring attribute characteristics of chemical bonds in the product molecules;
predicting, by the graph neural network model, a broken chemical bond in the product molecule based at least on the graph structure of the product molecule and the attribute characteristics of atoms in the product molecule, further comprising:
predicting a broken chemical bond in the product molecule according to the graph structure of the product molecule, the attribute characteristics of atoms in the product molecule and the attribute characteristics of chemical bonds in the product molecule through the graph neural network model.
3. The method of claim 2, wherein the attribute characteristics of atoms in the product molecule include any one or more of:
atom type, number of linkages, formal charge, chirality, number of attached hydrogen atoms, atomic hybridization, aromaticity, atomic weight, high frequency reaction center characteristics, reverse synthetic reaction type;
the attribute characteristics of the chemical bonds in the product molecules include any one or more of the following:
bond type, conjugated character, cyclic bond character, molecular stereochemistry character.
4. The method of claim 1, wherein predicting, by the sequence learning model, the reactant molecules based at least on the string corresponding to the type of inverse synthetic reaction, the string corresponding to the product molecule, and the string corresponding to the at least one synthon, comprises:
combining the character string corresponding to the inverse synthesis reaction type, the character string corresponding to the product molecule, and the character string corresponding to the at least one synthon to obtain a character string to be processed;
predicting a target character string according to the character string to be processed through the sequence learning model;
dividing the target character string into at least one target substring based on separators in the target character string;
converting the target substring into the corresponding reactant molecule.
5. The method according to claim 1, wherein the method further comprises:
determining a graph structure of the at least one synthon;
the predicting, by the sequence learning model, the reactant molecule at least according to the string corresponding to the type of the inverse synthetic reaction, the string corresponding to the product molecule, and the string corresponding to the at least one synthon, includes:
and predicting the reactant molecules according to the graph structure of the product molecules, the graph structure of the at least one synthon, the character string corresponding to the reverse synthesis reaction type, the character string corresponding to the product molecules and the character string corresponding to the at least one synthon through the sequence learning model.
6. An artificial intelligence based inverse synthetic prediction apparatus, the apparatus comprising:
the acquisition module is used for acquiring the graph structure of a product molecule and the attribute characteristics of atoms in the product molecule, wherein the product molecule is an organic compound molecule;
the first prediction module is configured to predict, by a graph neural network model, a broken chemical bond in the product molecule according to at least the graph structure of the product molecule and the attribute characteristics of atoms in the product molecule, which specifically includes: predicting, by the graph neural network model, the breaking probability of each chemical bond in the product molecule and the number of broken chemical bonds in the product molecule according to at least the graph structure of the product molecule and the attribute characteristics of atoms in the product molecule, sorting the breaking probabilities of the chemical bonds in the product molecule in descending order, and determining, as the broken chemical bonds, the chemical bonds corresponding to the top-ranked breaking probabilities, the number of the determined chemical bonds being equal to the number of broken chemical bonds;
the bond breaking module is used for breaking bonds of the product molecules based on the broken chemical bonds to obtain at least one synthon;
the second prediction module is configured to predict, by a sequence learning model, a reactant molecule according to at least the character string corresponding to the inverse synthesis reaction type, the character string corresponding to the product molecule, and the character string corresponding to the at least one synthon;
Wherein the sequence learning model is trained by:
acquiring a second training sample;
if the second training sample only comprises a first character string and a second character string corresponding to the same inverse synthetic reaction, determining a first reactant prediction character string according to the first character string in the second training sample through a sequence learning model to be trained; determining a second target loss function based on an error between the first reactant predictive string and the second string in the second training sample; the first character string is obtained by combining a character string corresponding to the type of the inverse synthetic reaction, a character string corresponding to a sample product molecule and a character string corresponding to a synthon of the sample product molecule, the second character string is obtained by combining a character string corresponding to a reactant molecule of the inverse synthetic reaction, and the sample product molecule is an organic compound molecule; wherein the second objective loss function is represented by the formula:
$$L = \sum_{s=1}^{N} l\left(S_s, V_s\right)$$
wherein N is the total number of second training samples used for training, s is the index of the second training sample currently used, $S_s$ is the first reactant prediction character string determined by the sequence learning model to be trained based on the first character string, $V_s$ is the second character string in the s-th second training sample, and l is a loss function;
if the second training sample comprises a first character string, a second character string and a third character string which correspond to the same inverse synthetic reaction, determining a second reactant prediction character string according to the third character string in the second training sample through the sequence learning model to be trained; determining the second target loss function based on an error between the first reactant predicted string and the second string in the second training sample, and an error between the second reactant predicted string and the second string in the second training sample; the third character string is obtained by combining a character string corresponding to the inverse synthetic reaction type, a character string corresponding to the sample product molecule and a character string corresponding to a prediction synthon, and the prediction synthon is obtained by predicting according to the graph structure of the sample product molecule and the attribute characteristics of atoms in the sample product molecule based on the graph neural network model; wherein the second objective loss function is represented by the formula:
$$L = \sum_{s=1}^{N} \left[ l\left(S_s, V_s\right) + l\left(\hat{S}_s, V_s\right) \right]$$
wherein N is the total number of second training samples used for training, s is the index of the second training sample currently used, $S_s$ is the first reactant prediction character string determined by the sequence learning model to be trained based on the first character string, $\hat{S}_s$ is the second reactant prediction character string determined by the sequence learning model to be trained based on the third character string, $V_s$ is the second character string in the s-th second training sample, and l is a loss function;
training the sequence learning model to be trained based on the second target loss function;
wherein the graph neural network model is trained by:
acquiring a first training sample; the first training sample comprises a graph structure of a sample product molecule, an attribute characteristic of atoms in the sample product molecule and a real fracture chemical bond in the sample product molecule, wherein the sample product molecule is an organic compound molecule;
determining the predicted breaking probability of each chemical bond in the sample product molecule and the predicted number of chemical bond breaks in the sample product molecule according to the graph structure of the sample product molecule in the first training sample and the attribute characteristics of atoms in the sample product molecule through a graph neural network model to be trained;
determining a first target loss function according to the predicted breaking probability of each chemical bond in the sample product molecule, the predicted number of chemical bond breaks in the sample product molecule and the actual broken chemical bonds in the sample product molecule; wherein the first target loss is represented by the following formula:
$$L = \sum_{s=1}^{N} \left[ l_1\left(k, k_s^{*}\right) + \sum_{(x_i, x_j)} l_2\left(p_{(x_i, x_j)}^{s}, y_{(x_i, x_j)}^{s}\right) \right]$$
wherein N is the total number of first training samples used for training, s is the index of the first training sample currently used, k is the predicted number of chemical bond breaks in the sample product molecule determined by the graph neural network model to be trained, $k_s^{*}$ is the real number of chemical bond breaks in the sample product molecule, $p_{(x_i, x_j)}^{s}$ is the predicted breaking probability of the chemical bond $(x_i, x_j)$ in the sample product molecule determined by the graph neural network model to be trained, and $y_{(x_i, x_j)}^{s}$ is the real label included in the first training sample for characterizing whether the chemical bond $(x_i, x_j)$ in the sample product molecule breaks; $y_{(x_i, x_j)}^{s} = 1$ indicates that the chemical bond $(x_i, x_j)$ will break, and $y_{(x_i, x_j)}^{s} = 0$ indicates that it will not break; $l_1$ and $l_2$ are cross-entropy loss functions;
and training the graph neural network model to be trained based on the first target loss function.
7. The apparatus of claim 6, wherein the second prediction module is specifically configured to:
combining the character string corresponding to the inverse synthesis reaction type, the character string corresponding to the product molecule, and the character string corresponding to the at least one synthon to obtain a character string to be processed;
predicting a target character string according to the character string to be processed through the sequence learning model;
Dividing the target character string into at least one target substring based on separators in the target character string;
converting the target substring into the corresponding reactant molecule.
8. An apparatus for performing a predictive reverse synthetic reaction, the apparatus comprising a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to perform the artificial intelligence based inverse synthetic prediction method of any one of claims 1 to 5 according to the computer program.
9. A computer readable storage medium for storing a computer program for executing the artificial intelligence based inverse synthetic prediction method according to any one of claims 1 to 5.
CN202010332227.1A 2020-04-24 2020-04-24 Inverse synthesis prediction method, device, equipment and storage medium based on artificial intelligence Active CN111524557B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010332227.1A CN111524557B (en) 2020-04-24 2020-04-24 Inverse synthesis prediction method, device, equipment and storage medium based on artificial intelligence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010332227.1A CN111524557B (en) 2020-04-24 2020-04-24 Inverse synthesis prediction method, device, equipment and storage medium based on artificial intelligence

Publications (2)

Publication Number Publication Date
CN111524557A CN111524557A (en) 2020-08-11
CN111524557B true CN111524557B (en) 2024-04-05

Family

ID=71904489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010332227.1A Active CN111524557B (en) 2020-04-24 2020-04-24 Inverse synthesis prediction method, device, equipment and storage medium based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN111524557B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112199884A (en) * 2020-09-07 2021-01-08 深圳先进技术研究院 Article molecule generation method, device, equipment and storage medium
CN111933225B (en) * 2020-09-27 2021-01-05 平安科技(深圳)有限公司 Drug classification method and device, terminal equipment and storage medium
CN112037868B (en) * 2020-11-04 2021-02-12 腾讯科技(深圳)有限公司 Training method and device for neural network for determining molecular reverse synthetic route
CN112530516B (en) * 2020-12-18 2023-12-26 深圳先进技术研究院 Metabolic pathway prediction method, system, terminal equipment and readable storage medium
CN112509644A (en) * 2020-12-18 2021-03-16 深圳先进技术研究院 Molecular optimization method, system, terminal equipment and readable storage medium
CN114822703A (en) * 2021-01-27 2022-07-29 腾讯科技(深圳)有限公司 Inverse synthesis prediction method of compound molecule and related device
CN113255769B (en) * 2021-05-26 2024-03-29 北京百度网讯科技有限公司 Training method of compound attribute prediction model and compound attribute prediction method
CN113838536B (en) * 2021-09-13 2022-06-10 烟台国工智能科技有限公司 Translation model construction method, product prediction model construction method and prediction method
CN115206451A (en) * 2022-07-14 2022-10-18 腾讯科技(深圳)有限公司 Prediction of reactant molecules, training method of model, device, equipment and medium
CN115240786A (en) * 2022-08-09 2022-10-25 腾讯科技(深圳)有限公司 Method for predicting reactant molecules, method for training reactant molecules, device for performing the method, and electronic apparatus
CN115761250B (en) * 2022-11-21 2023-10-10 北京科技大学 Compound reverse synthesis method and device

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1514359A (en) * 2003-07-16 2004-07-21 中国科学院上海有机化学研究所 Classfication of chemical reaction and knowledge stratification model establishment and its visible method
CN101789047A (en) * 2010-02-05 2010-07-28 四川大学 Method for evaluating synthesization of organic small-molecule compounds based on reverse synthesis
CN102117370A (en) * 2011-03-25 2011-07-06 西安近代化学研究所 Method for virtually synthesizing azacyclo-energetic compound based on MOL (machine-oriented language) file format
CN102198388A (en) * 2011-04-02 2011-09-28 楚士晋 Method and device for synthesizing compound by solid phase reaction
WO2019055499A1 (en) * 2017-09-12 2019-03-21 Massachusetts Institute Of Technology Systems and methods for predicting chemical reactions
WO2019085329A1 (en) * 2017-11-02 2019-05-09 平安科技(深圳)有限公司 Recurrent neural network-based personal character analysis method, device, and storage medium
CN109872780A (en) * 2019-03-14 2019-06-11 北京深度制耀科技有限公司 A kind of determination method and device of chemical synthesis route
CN110534087A (en) * 2019-09-04 2019-12-03 清华大学深圳研究生院 A kind of text prosody hierarchy Structure Prediction Methods, device, equipment and storage medium
CN110598845A (en) * 2019-08-13 2019-12-20 中国平安人寿保险股份有限公司 Data processing method, data processing device, computer equipment and storage medium
CN110600085A (en) * 2019-06-01 2019-12-20 重庆大学 Organic matter physicochemical property prediction method based on Tree-LSTM
WO2020023650A1 (en) * 2018-07-25 2020-01-30 Wuxi Nextcode Genomics Usa, Inc. Retrosynthesis prediction using deep highway networks and multiscale reaction classification
CN110827925A (en) * 2018-08-07 2020-02-21 国际商业机器公司 Intelligent personalized chemical synthesis planning
CN110841577A (en) * 2019-12-06 2020-02-28 大连海事大学 Device for simultaneously preparing hydrogen-rich synthesis gas and carbon nanoparticles
CN110918251A (en) * 2019-10-31 2020-03-27 昆明理工大学 Method and device for removing impurities in phosphogypsum by high gradient magnetic field

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10726944B2 (en) * 2016-10-04 2020-07-28 International Business Machines Corporation Recommending novel reactants to synthesize chemical products

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1514359A (en) * 2003-07-16 2004-07-21 中国科学院上海有机化学研究所 Classfication of chemical reaction and knowledge stratification model establishment and its visible method
CN101789047A (en) * 2010-02-05 2010-07-28 四川大学 Method for evaluating synthesization of organic small-molecule compounds based on reverse synthesis
CN102117370A (en) * 2011-03-25 2011-07-06 西安近代化学研究所 Method for virtually synthesizing azacyclo-energetic compound based on MOL (machine-oriented language) file format
CN102198388A (en) * 2011-04-02 2011-09-28 楚士晋 Method and device for synthesizing compound by solid phase reaction
WO2019055499A1 (en) * 2017-09-12 2019-03-21 Massachusetts Institute Of Technology Systems and methods for predicting chemical reactions
WO2019085329A1 (en) * 2017-11-02 2019-05-09 平安科技(深圳)有限公司 Recurrent neural network-based personal character analysis method, device, and storage medium
WO2020023650A1 (en) * 2018-07-25 2020-01-30 Wuxi Nextcode Genomics Usa, Inc. Retrosynthesis prediction using deep highway networks and multiscale reaction classification
CN110827925A (en) * 2018-08-07 2020-02-21 International Business Machines Corp. Intelligent personalized chemical synthesis planning
CN109872780A (en) * 2019-03-14 2019-06-11 Beijing Shendu Zhiyao Technology Co., Ltd. Method and device for determining a chemical synthesis route
CN110600085A (en) * 2019-06-01 2019-12-20 Chongqing University Organic matter physicochemical property prediction method based on Tree-LSTM
CN110598845A (en) * 2019-08-13 2019-12-20 Ping An Life Insurance Company of China, Ltd. Data processing method, data processing device, computer equipment and storage medium
CN110534087A (en) * 2019-09-04 2019-12-03 Tsinghua Shenzhen International Graduate School Text prosodic hierarchy structure prediction method, device, equipment and storage medium
CN110918251A (en) * 2019-10-31 2020-03-27 Kunming University of Science and Technology Method and device for removing impurities from phosphogypsum with a high-gradient magnetic field
CN110841577A (en) * 2019-12-06 2020-02-28 Dalian Maritime University Device for simultaneously preparing hydrogen-rich synthesis gas and carbon nanoparticles

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Interpretable Retrosynthesis Prediction in Two Steps"; Chaochao Yan et al.; ChemRxiv; 2020-02-24; Sections 2-3, pp. 2-7 *
"Predicting Retrosynthetic Reactions Using Self-Corrected Transformer Neural Networks"; Shuangjia Zheng et al.; Journal of Chemical Information and Modeling; 2019-12-31; Vol. 60, No. 1; pp. 1-15 *
Chaochao Yan et al.; "Interpretable Retrosynthesis Prediction in Two Steps"; ChemRxiv; 2020; Sections 2-3, pp. 2-7 *

Also Published As

Publication number Publication date
CN111524557A (en) 2020-08-11

Similar Documents

Publication Publication Date Title
CN111524557B (en) Inverse synthesis prediction method, device, equipment and storage medium based on artificial intelligence
Zeng et al. Graphsaint: Graph sampling based inductive learning method
CN113707235B (en) Drug micromolecule property prediction method, device and equipment based on self-supervision learning
Ding et al. Predicting protein-protein interactions via multivariate mutual information of protein sequences
US11087861B2 (en) Creation of new chemical compounds having desired properties using accumulated chemical data to construct a new chemical structure for synthesis
CN113160894A (en) Method, device, equipment and storage medium for predicting interaction between medicine and target
Mamani Machine Learning techniques and Polygenic Risk Score application to prediction genetic diseases
CN112395487B (en) Information recommendation method and device, computer readable storage medium and electronic equipment
CN115240786A (en) Method for predicting reactant molecules, training method therefor, device for performing the methods, and electronic apparatus
EP3869513A1 (en) De novo generation of molecules using manifold traversal
CN111428078B (en) Audio fingerprint coding method, device, computer equipment and storage medium
Park et al. Transfer learning compensates limited data, batch effects and technological heterogeneity in single-cell sequencing
Younis et al. A new sequential forward feature selection (SFFS) algorithm for mining best topological and biological features to predict protein complexes from protein–protein interaction networks (PPINs)
CN112905809B (en) Knowledge graph learning method and system
US20200279148A1 (en) Material structure analysis method and material structure analyzer
Nakariyakul Suboptimal branch and bound algorithms for feature subset selection: A comparative study
Sidorova et al. NLP-inspired structural pattern recognition in chemical application
Hamla et al. Comparative study of embedded feature selection methods on microarray data
WO2023132029A1 (en) Information processing device, information processing method, and program
Subramanian et al. Musical Instrument Identification using Supervised Learning
Wang et al. Sparse imbalanced drug-target interaction prediction via heterogeneous data augmentation and node similarity
Meher et al. Evaluating the performance of sequence encoding schemes and machine learning methods for splice sites recognition
Yusup et al. Feature selection with harmony search for classification: A review
Peterson et al. Ranked sparsity: a cogent regularization framework for selecting and estimating feature interactions and polynomials
Qiang et al. Bridging the gap between chemical reaction pretraining and conditional molecule generation with a unified model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40028372
Country of ref document: HK

GR01 Patent grant