CN111986740B - Method for classifying compounds and related equipment - Google Patents

Method for classifying compounds and related equipment

Info

Publication number
CN111986740B
Authority
CN
China
Prior art keywords
vector
atom
compound
atomic
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010917059.2A
Other languages
Chinese (zh)
Other versions
CN111986740A (en)
Inventor
李恬静
朱威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Saiante Technology Service Co Ltd
Original Assignee
Shenzhen Saiante Technology Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Saiante Technology Service Co Ltd
Priority to CN202010917059.2A
Publication of CN111986740A
Application granted
Publication of CN111986740B

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/90 Programming languages; Computing architectures; Database systems; Data warehousing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of artificial intelligence and provides a compound classification method and related equipment. The compound classification method comprises the following steps: obtaining a first tag vector of a sample compound based on a compound property; converting a first atomic representation of the sample compound into a first atomic vector sequence, and converting the missing atom corresponding to the first atomic representation into a second tag vector of the first atomic representation; training a property classification model formed by a feature extraction model and a first classification model according to the first tag vector and the property feature vector, and training a missing atom prediction model formed by the feature extraction model and a second classification model according to the second tag vector and the missing atom vector; and classifying the target compound through the trained property classification model with a second atomic vector of the target compound as input. The invention improves the efficiency of classifying compounds.

Description

Method for classifying compounds and related equipment
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a compound classification method, a device, computer equipment and a computer readable storage medium.
Background
Compound classification is the basis of much work in biology and chemistry. Conventional compound classification methods require biologists and chemists to classify compounds using expert knowledge.
How to classify compounds based on artificial intelligence, so as to improve classification efficiency, is a problem to be solved.
Disclosure of Invention
In view of the foregoing, there is a need for a compound classification method, apparatus, computer device, and computer-readable storage medium that can improve the efficiency of classifying compounds.
A first aspect of the present application provides a compound classification method comprising:
obtaining a first atomic representation of a sample compound, a first tag vector of the sample compound based on a compound property, and a missing atom corresponding to the first atomic representation;
converting the first atomic representation into a first atomic vector sequence, and converting the missing atom into a second tag vector of the first atomic representation;
extracting the atomic characteristics of the compound through a feature extraction model with the first atomic vector sequence as input, to obtain a feature vector sequence of the sample compound;
calculating a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculating a missing atom vector of the sample compound according to the feature vector sequence through a second classification model;
Training a property classification model formed by the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and training a missing atom prediction model formed by the feature extraction model and the second classification model according to the second tag vector and the missing atom vector;
Acquiring a second atomic representation of the target compound to be classified;
Converting the second atomic representation into a second atomic vector sequence;
And classifying the target compound by taking the second atomic vector as input through the trained property classification model.
In another possible implementation, the obtaining the first atomic representation of the sample compound includes:
obtaining a simplified molecular input line entry specification (SMILES) representation of the sample compound; or
obtaining a molecular fingerprint representation of the sample compound; or
obtaining an International Chemical Identifier (InChI) representation of the sample compound.
In another possible implementation, the converting the first atomic representation into the first atomic vector sequence includes:
acquiring a coding sub-vector, a position sub-vector and a graph structure sub-vector of each atom in the first atom representation;
Splicing the coding sub-vector, the position sub-vector and the graph structure sub-vector of each atom in the first atom representation to obtain a first atom vector of each atom in the first atom representation;
And combining first atom vectors of a plurality of atoms in the first atom representation to obtain the first atom vector sequence.
In another possible implementation, the feature extraction model includes a BERT model, an RNN model, or a Transformer model.
In another possible implementation manner, the calculating, by the first classification model, the property feature vector of the sample compound according to the feature vector sequence includes:
calculating a water-soluble feature vector of the sample compound from the sequence of feature vectors using a water-soluble classification sub-model in the first classification model;
And calculating a toxicity characteristic vector of the sample compound according to the characteristic vector sequence by using a toxicity classification sub-model in the first classification model.
In another possible implementation manner, the training a property classification model formed by the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and training a missing atom prediction model formed by the feature extraction model and the second classification model according to the second tag vector and the missing atom vector includes:
calculating a first difference vector between the first tag vector and the property feature vector;
Calculating a second difference vector between the second tag vector and the missing atom vector;
Splicing the first difference vector and the second difference vector to obtain a third difference vector;
and optimizing parameters in the property classification model and the missing atom prediction model according to the first difference vector, the second difference vector and the third difference vector by adopting a back propagation algorithm.
In another possible implementation manner, the optimizing parameters in the property classification model and the missing atom prediction model according to the first difference vector, the second difference vector and the third difference vector by adopting a back propagation algorithm includes:
synchronously optimizing parameters in the property classification model and the missing atom prediction model according to the third difference vector by adopting a back propagation algorithm; or
optimizing parameters in the property classification model according to the first difference vector by adopting a back propagation algorithm, and asynchronously optimizing parameters in the missing atom prediction model according to the second difference vector by adopting a back propagation algorithm.
A second aspect of the present application provides a compound classification device comprising:
a first obtaining module, configured to obtain a first atomic representation of a sample compound, a first tag vector of the sample compound based on a compound property, and a missing atom corresponding to the first atomic representation;
the first conversion module is used for converting the first atom representation into a first atom vector sequence and converting the missing atom into a second label vector of the first atom representation;
The extraction module is used for extracting the atomic characteristics of the compound through a feature extraction model with the first atomic vector sequence as input, to obtain a feature vector sequence of the sample compound;
A calculation module, configured to calculate a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculate a missing atom vector of the sample compound according to the feature vector sequence through a second classification model;
The training module is used for training a property classification model formed by the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and training a missing atom prediction model formed by the feature extraction model and the second classification model according to the second tag vector and the missing atom vector;
A second acquisition module for acquiring a second atomic representation of the target compound to be classified;
A second conversion module for converting the second atomic representation into a second atomic vector sequence;
And the classification module is used for classifying the target compound by taking the second atomic vector as input through the trained property classification model.
A third aspect of the application provides a computer device comprising a processor for implementing the compound classification method when executing computer readable instructions stored in a memory.
A fourth aspect of the application provides a computer readable storage medium having stored thereon computer readable instructions which when executed by a processor implement the method of compound classification.
The method and the device pretrain the feature extraction model shared in the two models through the missing atom prediction model and the property classification model so as to improve the extraction effect of the feature extraction model on the atomic features of the compound and further improve the accuracy of classifying the compound by the property classification model formed by the feature extraction model and the first classification model. Meanwhile, the trained property classification model takes the second atomic vector as input to classify the target compound, so that the classification of the target compound by an expert is avoided, and the efficiency of classifying the compound is improved.
Drawings
FIG. 1 is a flow chart of a method for classifying compounds according to an embodiment of the present invention.
Fig. 2 is a block diagram of a device for classifying compounds according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a computer device according to an embodiment of the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present application will be more clearly understood, a more particular description of the application will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. It should be noted that, without conflict, the embodiments of the present application and features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, and the described embodiments are merely some, rather than all, embodiments of the present invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Preferably, the compound classification method of the present invention is applied to one or more computer devices. A computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
Example 1
Fig. 1 is a flow chart of a method for classifying compounds according to an embodiment of the present invention. The compound classification method is applied to computer equipment and is used for classifying the compounds, so that the efficiency of classifying the compounds is improved.
As shown in fig. 1, the compound classification method includes:
101, obtaining a first atomic representation of a sample compound, a first tag vector of the sample compound based on a compound property, and a missing atom corresponding to the first atomic representation.
In a specific embodiment, the obtaining a first atomic representation of a sample compound comprises:
obtaining a Simplified Molecular Input Line Entry Specification (SMILES) representation of the sample compound; or
obtaining a molecular fingerprint (Extended-Connectivity Fingerprint, ECFP) representation of the sample compound; or
obtaining an International Chemical Identifier (InChI) representation of the sample compound.
The sample compound is a compound in which atoms are randomly missing, the missing atoms being represented by mask tags. For example, if the complete compound is "CCCC(=O)NC1=CC=C(OCC(O)CNC(C)C)C(=C1)C(C)=O", the SMILES representation of the sample compound is "[cls]CCCC(=[mask])NC1=CC=C(OCC(O)CNC(C)C)C(=C1)C(C)=O[sep]", and the missing atom is "O". Here, "[cls]" is a start identifier and "[sep]" is an end identifier.
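As an illustrative sketch only (not taken from the patent), the random masking described above might look like the following; the single-character tokenization and the helper name mask_random_atom are assumptions made for the example:

```python
import random

def mask_random_atom(smiles, atom_symbols=("C", "N", "O", "S", "P")):
    """Replace one randomly chosen atom in a SMILES string with a [mask] tag.

    Returns the masked SMILES wrapped in [cls]...[sep] and the missing atom.
    Single-character tokenization is a simplification for illustration.
    """
    atom_positions = [i for i, ch in enumerate(smiles) if ch in atom_symbols]
    pos = random.choice(atom_positions)
    missing_atom = smiles[pos]
    masked = smiles[:pos] + "[mask]" + smiles[pos + 1:]
    return "[cls]" + masked + "[sep]", missing_atom

full = "CCCC(=O)NC1=CC=C(OCC(O)CNC(C)C)C(=C1)C(C)=O"
masked_smiles, missing = mask_random_atom(full)
print(masked_smiles)   # the masked SMILES used as the sample compound
print(missing)         # the atom the missing atom prediction model must recover
```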
102, Converting the first atomic representation into a first atomic vector sequence, and converting the missing atoms into a second tag vector of the first atomic representation.
The first atomic representation and the missing atoms are converted into a vector sequence for processing and feature extraction by vector conversion.
In a specific embodiment, the converting the first atomic representation into a first atomic vector sequence includes:
acquiring a coding sub-vector, a position sub-vector and a graph structure sub-vector of each atom in the first atom representation;
Splicing the coding sub-vector, the position sub-vector and the graph structure sub-vector of each atom in the first atom representation to obtain a first atom vector of each atom in the first atom representation;
And combining first atom vectors of a plurality of atoms in the first atom representation to obtain the first atom vector sequence.
The coding sub-vector of each atom, which uniquely identifies the atom, may be looked up in a preset coding table. The position sub-vector of each atom may be the position of the atom in the SMILES representation of the sample compound. The graph structure sub-vector of each atom may include atom structure information and/or connection information of the atom in the sample compound.
The coding sub-vector of the missing atom is looked up in the preset coding table and determined to be the second tag vector.
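The splicing of the three sub-vectors can be pictured with the following minimal sketch, assuming toy embedding dimensions, a hypothetical coding table, and an adjacency-row placeholder for the graph structure sub-vector (whose exact construction the description leaves open):

```python
import torch

CODING_TABLE = {"C": 0, "N": 1, "O": 2, "[mask]": 3}   # hypothetical preset coding table
EMB_DIM = 8

torch.manual_seed(0)
atom_embeddings = torch.nn.Embedding(len(CODING_TABLE), EMB_DIM)   # encoding sub-vectors
position_embeddings = torch.nn.Embedding(128, EMB_DIM)             # position sub-vectors

def graph_structure_subvector(atom_index, adjacency):
    # Placeholder: simply the (padded or truncated) adjacency row of the atom.
    row = adjacency[atom_index]
    out = torch.zeros(EMB_DIM)
    n = min(EMB_DIM, row.numel())
    out[:n] = row[:n]
    return out

def atom_vector_sequence(atoms, adjacency):
    """Splice encoding, position and graph-structure sub-vectors into one vector per atom."""
    vectors = []
    for i, atom in enumerate(atoms):
        enc = atom_embeddings(torch.tensor(CODING_TABLE[atom]))
        pos = position_embeddings(torch.tensor(i))
        graph = graph_structure_subvector(i, adjacency)
        vectors.append(torch.cat([enc, pos, graph]))
    return torch.stack(vectors)                          # (num_atoms, 3 * EMB_DIM)

atoms = ["C", "C", "[mask]", "O"]
seq = atom_vector_sequence(atoms, torch.eye(len(atoms)))
print(seq.shape)   # torch.Size([4, 24])
```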
103, extracting the atomic characteristics of the compound through a feature extraction model with the first atomic vector sequence as input, to obtain a feature vector sequence of the sample compound.
In a specific embodiment, the feature extraction model includes a BERT model, an RNN model, or a Transformer model.
BERT is a deep bidirectional language representation model based on the Transformer, which uses the Transformer structure to build a multi-layer bidirectional encoder network. The Transformer is a deep model based on the self-attention mechanism for processing NLP tasks; it handles NLP tasks better than an RNN and trains faster.
Specifically, the BERT model includes a plurality of neural network layers, each including a preset number of coding modules, and each coding module includes the encoder structure of a bidirectional Transformer. Each encoder structure includes a multi-head attention network, a first residual network, a first feedforward neural network, and a second residual network.
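As a rough sketch under stated assumptions, PyTorch's built-in Transformer encoder can stand in for the BERT-style feature extraction model; the dimensions are illustrative and chosen to match the 24-dimensional atom vectors sketched above:

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """BERT-style encoder stand-in: maps an atom vector sequence to a feature vector sequence."""
    def __init__(self, in_dim=24, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)   # lift raw atom vectors to the model width
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, atom_vectors):                   # (batch, num_atoms, in_dim)
        return self.encoder(self.proj(atom_vectors))   # (batch, num_atoms, d_model)

extractor = FeatureExtractor()
feature_seq = extractor(torch.randn(1, 4, 24))   # a batch with one 4-atom compound
print(feature_seq.shape)                         # torch.Size([1, 4, 64])
```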
104, calculating a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculating a missing atom vector of the sample compound according to the feature vector sequence through a second classification model.
In another embodiment, the computing, by a first classification model, a property feature vector of the sample compound from the sequence of feature vectors comprises:
calculating a water-soluble feature vector of the sample compound from the sequence of feature vectors using a water-soluble classification sub-model in the first classification model;
And calculating a toxicity characteristic vector of the sample compound according to the characteristic vector sequence by using a toxicity classification sub-model in the first classification model.
The water-soluble classification sub-model includes a first fully-connected network. On the basis that the feature vector sequence already carries the atomic characteristics of the compound, the sub-model fine-tunes the feature vector sequence with the parameters of the first fully-connected network to obtain the water-soluble feature vector, by which the sample compound is classified. The parameters of the first fully-connected network may be optimized through supervised water-solubility training to increase the accuracy with which the sample compound is classified by the water-soluble feature vector. The water-soluble classification sub-model may also include a second feedforward neural network, a first convolutional neural network, and the like.
The toxicity classification sub-model includes a second fully-connected network. On the basis that the feature vector sequence already carries the atomic characteristics of the compound, the sub-model fine-tunes the feature vector sequence with the parameters of the second fully-connected network to obtain the toxicity characteristic vector, by which the sample compound is classified. The parameters of the second fully-connected network may be optimized through supervised toxicity training to increase the accuracy with which the sample compound is classified by the toxicity characteristic vector. The toxicity classification sub-model may also include a third feedforward neural network, a second convolutional neural network, and the like.
The first classification model may further include a melting point classification sub-model, a half-maximal inhibitory concentration classification sub-model, and the like.
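A minimal sketch of the two classification models follows; mean pooling over atom features, binary property heads, and a four-atom vocabulary are illustrative assumptions rather than choices fixed by the description:

```python
import torch
import torch.nn as nn

class PropertyClassifier(nn.Module):
    """First classification model: one fully-connected sub-model per compound property."""
    def __init__(self, d_model=64, n_classes=2):
        super().__init__()
        self.water_solubility_head = nn.Linear(d_model, n_classes)   # water-soluble sub-model
        self.toxicity_head = nn.Linear(d_model, n_classes)           # toxicity sub-model

    def forward(self, feature_seq):               # (batch, num_atoms, d_model)
        pooled = feature_seq.mean(dim=1)          # pool atom features (an illustrative choice)
        return self.water_solubility_head(pooled), self.toxicity_head(pooled)

class MissingAtomPredictor(nn.Module):
    """Second classification model: predicts the masked atom from its feature vector."""
    def __init__(self, d_model=64, atom_vocab_size=4):
        super().__init__()
        self.head = nn.Linear(d_model, atom_vocab_size)

    def forward(self, feature_seq, mask_position):
        return self.head(feature_seq[:, mask_position, :])   # logits over the coding table
```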
105, training a property classification model formed by the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and training a missing atom prediction model formed by the feature extraction model and the second classification model according to the second tag vector and the missing atom vector.
In a specific embodiment, the training a property classification model composed of the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and training a missing atom prediction model composed of the feature extraction model and the second classification model according to the second tag vector and the missing atom vector includes:
calculating a first difference vector between the first tag vector and the property feature vector;
Calculating a second difference vector between the second tag vector and the missing atom vector;
Splicing the first difference vector and the second difference vector to obtain a third difference vector;
and optimizing parameters in the property classification model and the missing atom prediction model according to the first difference vector, the second difference vector and the third difference vector by adopting a back propagation algorithm.
A first difference vector between the first tag vector and the property feature vector may be calculated with a cross-entropy loss function, and a second difference vector between the second tag vector and the missing atom vector may likewise be calculated with the cross-entropy loss function.
In a specific embodiment, said optimizing parameters in said property classification model and said missing atom prediction model according to said first difference vector, said second difference vector, and said third difference vector using a back propagation algorithm comprises:
synchronously optimizing parameters in the property classification model and the missing atom prediction model according to the third difference vector by adopting a back propagation algorithm; or
optimizing parameters in the property classification model according to the first difference vector by adopting a back propagation algorithm, and asynchronously optimizing parameters in the missing atom prediction model according to the second difference vector by adopting a back propagation algorithm.
The first difference vector expresses the distance between the first tag vector and the property feature vector, and the second difference vector expresses the distance between the second tag vector and the missing atom vector.
When the property classification model and the missing atom prediction model are treated as one overall model, the first tag vector and the second tag vector are spliced into an overall tag vector of the overall model, and the property feature vector and the missing atom vector are spliced into an overall output vector of the overall model. The third difference vector expresses the distance between the overall tag vector and the overall output vector.
After the parameters in the property classification model and the missing atom prediction model are synchronously optimized according to the third difference vector by adopting a back propagation algorithm, the overall output vector of the overall model is recalculated; the distance between the recalculated overall output vector and the overall tag vector is smaller, that is, the overall model classifies more accurately.
Alternatively, a back propagation algorithm is adopted to optimize parameters in the feature extraction model and the first classification model according to the first difference vector, and to asynchronously optimize parameters in the feature extraction model and the second classification model according to the second difference vector; performing the two optimizations asynchronously increases the speed of training the feature extraction model. Optimizing the parameters in the property classification model according to the first difference vector makes the distance between the first tag vector and the property feature vector recalculated by the property classification model based on the optimized parameters smaller, that is, the property classification model classifies the compound more accurately based on compound properties. Optimizing the parameters in the missing atom prediction model according to the second difference vector makes the distance between the second tag vector and the missing atom vector recalculated by the missing atom prediction model based on the optimized parameters smaller, that is, the missing atom prediction model predicts missing atoms in an input compound more accurately.
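Under the assumptions of the earlier sketches (reusing the extractor, PropertyClassifier, and MissingAtomPredictor defined there), one synchronous training step might look like the following; cross-entropy stands in for the difference vectors, and summing the two losses plays the role of splicing them into the third difference term:

```python
import torch
import torch.nn as nn

property_model = PropertyClassifier()
atom_model = MissingAtomPredictor()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(
    list(extractor.parameters()) + list(property_model.parameters()) + list(atom_model.parameters()),
    lr=1e-4,
)

def train_step(atom_vectors, mask_position, solubility_label, toxicity_label, missing_atom_label):
    """One synchronous step: both tasks back-propagate through the shared feature extraction model."""
    feature_seq = extractor(atom_vectors)
    sol_logits, tox_logits = property_model(feature_seq)
    atom_logits = atom_model(feature_seq, mask_position)

    loss_property = criterion(sol_logits, solubility_label) + criterion(tox_logits, toxicity_label)
    loss_missing = criterion(atom_logits, missing_atom_label)
    loss = loss_property + loss_missing   # joint objective; asynchronous training would instead
                                          # alternate separate backward passes for the two terms
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

loss_value = train_step(torch.randn(1, 4, 24), mask_position=2,
                        solubility_label=torch.tensor([1]),
                        toxicity_label=torch.tensor([0]),
                        missing_atom_label=torch.tensor([3]))
```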
106, Obtaining a second atomic representation of the target compound to be classified.
A second atomic representation of the target compound may be obtained from a database, or received as input from a user.
The obtaining a second atomic representation of the target compound to be classified comprises:
obtaining a SMILES representation of the target compound; or
obtaining a molecular fingerprint representation of the target compound; or
obtaining an International Chemical Identifier (InChI) representation of the target compound.
The type of the second atomic representation is consistent with the type of the first atomic representation. For example, the type of the second atomic representation and the type of the first atomic representation are both SMILES representations.
107, converting the second atomic representation into a second atomic vector sequence.
The second atomic representation is converted by vector conversion into a vector sequence that facilitates processing and extraction of features.
The converting the second atomic representation into a second atomic vector sequence includes:
acquiring a coding sub-vector, a position sub-vector and a graph structure sub-vector of each atom in the second atom representation;
Splicing the coding sub-vector, the position sub-vector and the graph structure sub-vector of each atom in the second atom representation to obtain a second atom vector of each atom in the second atom representation;
and combining a second atom vector of a plurality of atoms in the second atom representation to obtain the second atom vector sequence.
108, Classifying the target compound by taking the second atomic vector as an input through the trained property classification model.
After optimizing parameters in the property classification model by training, the trained property classification model may classify the target compound based on compound property characteristics.
The trained property classification model may include a water-soluble classification sub-model, a toxicity classification sub-model, a melting point classification sub-model, a half-maximal inhibitory concentration classification sub-model, and the like.
The trained property classification model may classify the target compound according to one or more properties of the compound based on one or more sub-models. For example, the trained property classification model comprises a water-soluble classification sub-model and a toxicity classification sub-model, and the trained property classification model can classify the target compound according to the water solubility and toxicity of the compound, so that the type of the target compound is a water-soluble compound and/or a non-toxic compound.
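Continuing the same illustrative sketches, and assuming the extractor and property_model have already been trained, inference on a target compound might look like this; which logit index means "water-soluble" or "toxic" is an assumption:

```python
import torch

def classify_target(atom_vectors):
    """Classify a target compound from its second atom vector sequence."""
    extractor.eval()
    property_model.eval()
    with torch.no_grad():
        feature_seq = extractor(atom_vectors)            # feature vector sequence
        sol_logits, tox_logits = property_model(feature_seq)
    return {
        "water_soluble": sol_logits.argmax(dim=-1).item() == 1,   # assumed label convention
        "toxic": tox_logits.argmax(dim=-1).item() == 1,           # assumed label convention
    }

result = classify_target(torch.randn(1, 4, 24))
print(result)   # e.g. {'water_soluble': True, 'toxic': False}
```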
The compound classification method of the first embodiment obtains a first atomic representation of a sample compound, a first tag vector of the sample compound based on a compound property, and a missing atom corresponding to the first atomic representation; converts the first atomic representation into a first atomic vector sequence, and converts the missing atom into a second tag vector of the first atomic representation; extracts the atomic characteristics of the compound through a feature extraction model with the first atomic vector sequence as input, to obtain a feature vector sequence of the sample compound; calculates a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculates a missing atom vector of the sample compound according to the feature vector sequence through a second classification model; trains a property classification model formed by the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and trains a missing atom prediction model formed by the feature extraction model and the second classification model according to the second tag vector and the missing atom vector; acquires a second atomic representation of the target compound to be classified; converts the second atomic representation into a second atomic vector sequence; and classifies the target compound through the trained property classification model with the second atomic vector as input. In the first embodiment, the feature extraction model shared by the two models is pre-trained through the missing atom prediction model and the property classification model, which improves the extraction of compound atomic features by the feature extraction model and, in turn, the accuracy with which the property classification model formed by the feature extraction model and the first classification model classifies compounds. Meanwhile, the trained property classification model classifies the target compound with the second atomic vector as input, so classification of the target compound by an expert is avoided and the efficiency of classifying compounds is improved.
Example 2
Fig. 2 is a block diagram of a device for classifying compounds according to a second embodiment of the present invention. The compound classifying device 20 is applied to a computer apparatus. The compound classification device 20 is used for classifying the compounds, and improves the efficiency of classifying the compounds.
As shown in fig. 2, the compound classification device 20 may include a first acquisition module 201, a first conversion module 202, an extraction module 203, a calculation module 204, a training module 205, a second acquisition module 206, a second conversion module 207, and a classification module 208.
A first obtaining module 201, configured to obtain a first atomic representation of a sample compound, obtain a first tag vector of the sample compound based on a compound property, and obtain a corresponding missing atom of the first atomic representation.
In a specific embodiment, the obtaining a first atomic representation of a sample compound comprises:
obtaining a Simplified Molecular Input Line Entry Specification (SMILES) representation of the sample compound; or
obtaining a molecular fingerprint (Extended-Connectivity Fingerprint, ECFP) representation of the sample compound; or
obtaining an International Chemical Identifier (InChI) representation of the sample compound.
The sample compound is a compound in which atoms are randomly missing, the missing atoms being represented by mask tags. For example, if the complete compound is "CCCC(=O)NC1=CC=C(OCC(O)CNC(C)C)C(=C1)C(C)=O", the SMILES representation of the sample compound is "[cls]CCCC(=[mask])NC1=CC=C(OCC(O)CNC(C)C)C(=C1)C(C)=O[sep]", and the missing atom is "O". Here, "[cls]" is a start identifier and "[sep]" is an end identifier.
A first conversion module 202, configured to convert the first atomic representation into a first atomic vector sequence and convert the missing atom into a second tag vector of the first atomic representation.
The first atomic representation and the missing atoms are converted into a vector sequence for processing and feature extraction by vector conversion.
In a specific embodiment, the converting the first atomic representation into a first atomic vector sequence includes:
acquiring a coding sub-vector, a position sub-vector and a graph structure sub-vector of each atom in the first atom representation;
Splicing the coding sub-vector, the position sub-vector and the graph structure sub-vector of each atom in the first atom representation to obtain a first atom vector of each atom in the first atom representation;
And combining first atom vectors of a plurality of atoms in the first atom representation to obtain the first atom vector sequence.
The coding sub-vector of each atom, which uniquely identifies the atom, may be looked up in a preset coding table. The position sub-vector of each atom may be the position of the atom in the SMILES representation of the sample compound. The graph structure sub-vector of each atom may include atom structure information and/or connection information of the atom in the sample compound.
The coding sub-vector of the missing atom is looked up in the preset coding table and determined to be the second tag vector.
The extraction module 203 is configured to extract the atomic characteristics of the compound through a feature extraction model with the first atomic vector sequence as input, to obtain a feature vector sequence of the sample compound.
In a specific embodiment, the feature extraction model includes a BERT model, an RNN model, or a Transformer model.
BERT is a deep bidirectional language representation model based on the Transformer, which uses the Transformer structure to build a multi-layer bidirectional encoder network. The Transformer is a deep model based on the self-attention mechanism for processing NLP tasks; it handles NLP tasks better than an RNN and trains faster.
Specifically, the BERT model includes a plurality of neural network layers, each including a preset number of coding modules, and each coding module includes the encoder structure of a bidirectional Transformer. Each encoder structure includes a multi-head attention network, a first residual network, a first feedforward neural network, and a second residual network.
A calculation module 204, configured to calculate, by using a first classification model, a property feature vector of the sample compound according to the feature vector sequence, and calculate, by using a second classification model, a missing atom vector of the sample compound according to the feature vector sequence.
In another embodiment, the computing, by a first classification model, a property feature vector of the sample compound from the sequence of feature vectors comprises:
calculating a water-soluble feature vector of the sample compound from the sequence of feature vectors using a water-soluble classification sub-model in the first classification model;
And calculating a toxicity characteristic vector of the sample compound according to the characteristic vector sequence by using a toxicity classification sub-model in the first classification model.
The water-soluble classification sub-model includes a first fully-connected network. On the basis that the feature vector sequence already carries the atomic characteristics of the compound, the sub-model fine-tunes the feature vector sequence with the parameters of the first fully-connected network to obtain the water-soluble feature vector, by which the sample compound is classified. The parameters of the first fully-connected network may be optimized through supervised water-solubility training to increase the accuracy with which the sample compound is classified by the water-soluble feature vector. The water-soluble classification sub-model may also include a second feedforward neural network, a first convolutional neural network, and the like.
The toxicity classification sub-model includes a second fully-connected network. On the basis that the feature vector sequence already carries the atomic characteristics of the compound, the sub-model fine-tunes the feature vector sequence with the parameters of the second fully-connected network to obtain the toxicity characteristic vector, by which the sample compound is classified. The parameters of the second fully-connected network may be optimized through supervised toxicity training to increase the accuracy with which the sample compound is classified by the toxicity characteristic vector. The toxicity classification sub-model may also include a third feedforward neural network, a second convolutional neural network, and the like.
The first classification model may further include a melting point classification sub-model, a half-maximal inhibitory concentration classification sub-model, and the like.
A training module 205, configured to train a property classification model formed by the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and train a missing atom prediction model formed by the feature extraction model and the second classification model according to the second tag vector and the missing atom vector.
In a specific embodiment, the training a property classification model composed of the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and training a missing atom prediction model composed of the feature extraction model and the second classification model according to the second tag vector and the missing atom vector includes:
calculating a first difference vector between the first tag vector and the property feature vector;
Calculating a second difference vector between the second tag vector and the missing atom vector;
Splicing the first difference vector and the second difference vector to obtain a third difference vector;
and optimizing parameters in the property classification model and the missing atom prediction model according to the first difference vector, the second difference vector and the third difference vector by adopting a back propagation algorithm.
A first difference vector between the first tag vector and the property feature vector may be calculated with a cross-entropy loss function, and a second difference vector between the second tag vector and the missing atom vector may likewise be calculated with the cross-entropy loss function.
In a specific embodiment, said optimizing parameters in said property classification model and said missing atom prediction model according to said first difference vector, said second difference vector, and said third difference vector using a back propagation algorithm comprises:
synchronously optimizing parameters in the property classification model and the missing atom prediction model according to the third difference vector by adopting a back propagation algorithm; or
optimizing parameters in the property classification model according to the first difference vector by adopting a back propagation algorithm, and asynchronously optimizing parameters in the missing atom prediction model according to the second difference vector by adopting a back propagation algorithm.
The first difference vector expresses the distance between the first tag vector and the property feature vector, and the second difference vector expresses the distance between the second tag vector and the missing atom vector.
When the property classification model and the missing atom prediction model are treated as one overall model, the first tag vector and the second tag vector are spliced into an overall tag vector of the overall model, and the property feature vector and the missing atom vector are spliced into an overall output vector of the overall model. The third difference vector expresses the distance between the overall tag vector and the overall output vector.
After the parameters in the property classification model and the missing atom prediction model are synchronously optimized according to the third difference vector by adopting a back propagation algorithm, the overall output vector of the overall model is recalculated; the distance between the recalculated overall output vector and the overall tag vector is smaller, that is, the overall model classifies more accurately.
Alternatively, a back propagation algorithm is adopted to optimize parameters in the feature extraction model and the first classification model according to the first difference vector, and to asynchronously optimize parameters in the feature extraction model and the second classification model according to the second difference vector; performing the two optimizations asynchronously increases the speed of training the feature extraction model. Optimizing the parameters in the property classification model according to the first difference vector makes the distance between the first tag vector and the property feature vector recalculated by the property classification model based on the optimized parameters smaller, that is, the property classification model classifies the compound more accurately based on compound properties. Optimizing the parameters in the missing atom prediction model according to the second difference vector makes the distance between the second tag vector and the missing atom vector recalculated by the missing atom prediction model based on the optimized parameters smaller, that is, the missing atom prediction model predicts missing atoms in an input compound more accurately.
A second obtaining module 206 is configured to obtain a second atomic representation of the target compound to be classified.
A second atomic representation of the target compound may be obtained from a database, or received as input from a user.
The obtaining a second atomic representation of the target compound to be classified comprises:
obtaining a SMILES representation of the target compound; or
obtaining a molecular fingerprint representation of the target compound; or
obtaining an International Chemical Identifier (InChI) representation of the target compound.
The type of the second atomic representation is consistent with the type of the first atomic representation. For example, the type of the second atomic representation and the type of the first atomic representation are both SMILES representations.
A second conversion module 207 for converting the second atomic representation into a second atomic vector sequence.
The second atomic representation is converted by vector conversion into a vector sequence that facilitates processing and extraction of features.
The converting the second atomic representation into a second atomic vector sequence includes:
acquiring a coding sub-vector, a position sub-vector and a graph structure sub-vector of each atom in the second atom representation;
Splicing the coding sub-vector, the position sub-vector and the graph structure sub-vector of each atom in the second atom representation to obtain a second atom vector of each atom in the second atom representation;
and combining a second atom vector of a plurality of atoms in the second atom representation to obtain the second atom vector sequence.
And a classification module 208, configured to classify the target compound by using the trained property classification model and the second atomic vector as input.
After optimizing parameters in the property classification model by training, the trained property classification model may classify the target compound based on compound property characteristics.
The trained property classification model may include a water-soluble classification sub-model, a toxicity classification sub-model, a melting point classification sub-model, a half-maximal inhibitory concentration classification sub-model, and the like.
The trained property classification model may classify the target compound according to one or more properties of the compound based on one or more sub-models. For example, the trained property classification model comprises a water-soluble classification sub-model and a toxicity classification sub-model, and the trained property classification model can classify the target compound according to the water solubility and toxicity of the compound, so that the type of the target compound is a water-soluble compound and/or a non-toxic compound.
The compound classification device 20 of the second embodiment obtains a first atomic representation of a sample compound, a first tag vector of the sample compound based on a compound property, and a missing atom corresponding to the first atomic representation; converts the first atomic representation into a first atomic vector sequence, and converts the missing atom into a second tag vector of the first atomic representation; extracts the atomic characteristics of the compound through a feature extraction model with the first atomic vector sequence as input, to obtain a feature vector sequence of the sample compound; calculates a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculates a missing atom vector of the sample compound according to the feature vector sequence through a second classification model; trains a property classification model formed by the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and trains a missing atom prediction model formed by the feature extraction model and the second classification model according to the second tag vector and the missing atom vector; acquires a second atomic representation of the target compound to be classified; converts the second atomic representation into a second atomic vector sequence; and classifies the target compound through the trained property classification model with the second atomic vector as input. In the second embodiment, the feature extraction model shared by the two models is pre-trained through the missing atom prediction model and the property classification model, which improves the extraction of compound atomic features by the feature extraction model and, in turn, the accuracy with which the property classification model formed by the feature extraction model and the first classification model classifies compounds. Meanwhile, the trained property classification model classifies the target compound with the second atomic vector as input, so classification of the target compound by an expert is avoided and the efficiency of classifying compounds is improved.
Example 3
The present embodiment provides a computer readable storage medium having stored thereon computer readable instructions which when executed by a processor perform the steps of the above-described embodiment of a method for classifying compounds, for example, steps 101-108 shown in fig. 1:
101, acquiring a first atomic representation of a sample compound, a first tag vector of the sample compound based on a compound property, and a missing atom corresponding to the first atomic representation;
102, converting the first atom representation into a first atom vector sequence, and converting the missing atom into a second label vector of the first atom representation;
103, extracting the atomic characteristics of the compound through a feature extraction model with the first atomic vector sequence as input, to obtain a feature vector sequence of the sample compound;
104, calculating a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculating a missing atom vector of the sample compound according to the feature vector sequence through a second classification model;
105, training a property classification model composed of the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and training a missing atom prediction model composed of the feature extraction model and the second classification model according to the second tag vector and the missing atom vector;
106, obtaining a second atomic representation of the target compound to be classified;
107, converting the second atomic representation into a second atomic vector sequence;
108, classifying the target compound by taking the second atomic vector as an input through the trained property classification model.
Alternatively, the computer readable instructions, when executed by a processor, implement the functions of the modules in the above apparatus embodiment, for example modules 201-208 in fig. 2:
A first obtaining module 201, configured to obtain a first atomic representation of a sample compound, obtain a first tag vector of the sample compound based on a compound property, and obtain a corresponding missing atom of the first atomic representation;
A first conversion module 202, configured to convert the first atomic representation into a first atomic vector sequence, and convert the missing atom into a second tag vector of the first atomic representation;
The extracting module 203 is configured to extract, by using the feature extraction model and using the first atomic vector sequence as input, an atomic feature of the compound, to obtain a feature vector sequence of the sample compound;
a calculation module 204, configured to calculate, by using a first classification model, a property feature vector of the sample compound according to the feature vector sequence, and calculate, by using a second classification model, a missing atom vector of the sample compound according to the feature vector sequence;
A training module 205, configured to train a property classification model formed by the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and train a missing atom prediction model formed by the feature extraction model and the second classification model according to the second tag vector and the missing atom vector;
a second obtaining module 206, configured to obtain a second atomic representation of the target compound to be classified;
a second conversion module 207 for converting the second atomic representation into a second atomic vector sequence;
And a classification module 208, configured to classify the target compound by using the trained property classification model and the second atomic vector as input.
Example 4
Fig. 3 is a schematic diagram of a computer device according to a fourth embodiment of the present invention. The computer device 30 includes a memory 301, a processor 302, and computer readable instructions, such as a compound classification program, stored in the memory 301 and executable on the processor 302. The processor 302, when executing the computer readable instructions, implements the steps of the compound classification method embodiment described above, such as steps 101-108 shown in fig. 1:
101, acquiring a first atomic representation of a sample compound, a first tag vector of the sample compound based on a compound property, and a missing atom corresponding to the first atomic representation;
102, converting the first atom representation into a first atom vector sequence, and converting the missing atom into a second label vector of the first atom representation;
103, extracting the atomic characteristics of the compound through a feature extraction model with the first atomic vector sequence as input, to obtain a feature vector sequence of the sample compound;
104, calculating a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculating a missing atom vector of the sample compound according to the feature vector sequence through a second classification model;
105, training a property classification model composed of the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and training a missing atom prediction model composed of the feature extraction model and the second classification model according to the second tag vector and the missing atom vector;
106, obtaining a second atomic representation of the target compound to be classified;
107, converting the second atomic representation into a second atomic vector sequence;
108, classifying the target compound by taking the second atomic vector as an input through the trained property classification model.
Alternatively, the computer readable instructions, when executed by a processor, implement the functions of the modules in the above apparatus embodiment, for example modules 201-208 in fig. 2:
A first obtaining module 201, configured to obtain a first atomic representation of a sample compound, obtain a first tag vector of the sample compound based on a compound property, and obtain a corresponding missing atom of the first atomic representation;
A first conversion module 202, configured to convert the first atomic representation into a first atomic vector sequence, and convert the missing atom into a second tag vector of the first atomic representation;
an extraction module 203, configured to extract the atomic features of the sample compound through the feature extraction model by taking the first atomic vector sequence as input, to obtain a feature vector sequence of the sample compound;
a calculation module 204, configured to calculate, by using a first classification model, a property feature vector of the sample compound according to the feature vector sequence, and calculate, by using a second classification model, a missing atom vector of the sample compound according to the feature vector sequence;
a training module 205, configured to train a property classification model formed by the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and train a missing atom prediction model formed by the feature extraction model and the second classification model according to the second tag vector and the missing atom vector;
a second obtaining module 206, configured to obtain a second atomic representation of the target compound to be classified;
a second conversion module 207 for converting the second atomic representation into a second atomic vector sequence;
and a classification module 208, configured to classify the target compound by taking the second atomic vector sequence as input through the trained property classification model.
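By way of illustration only, the following Python sketch shows one way the first conversion module 202 could assemble a first atomic vector sequence from a per-atom coding sub-vector, position sub-vector and graph structure sub-vector. The preset coding table, the fixed maximum atom count, the use of an adjacency row as the graph structure sub-vector, and the example molecule are assumptions made for the sketch, not details taken from the embodiments.

# Minimal sketch of the first conversion module (assumed coding table,
# one-hot position encoding and adjacency-row graph structure summary).
import numpy as np

CODING_TABLE = {"C": 0, "N": 1, "O": 2, "S": 3, "H": 4, "[MASK]": 5}  # assumed preset coding table
CODE_DIM = len(CODING_TABLE)
MAX_ATOMS = 16                                                        # assumed padding length

def coding_sub_vector(symbol):
    vec = np.zeros(CODE_DIM)
    vec[CODING_TABLE[symbol]] = 1.0
    return vec

def position_sub_vector(index):
    vec = np.zeros(MAX_ATOMS)
    vec[index] = 1.0
    return vec

def graph_structure_sub_vector(index, adjacency):
    # One (padded) adjacency-matrix row as a simple summary of the atom's bonds.
    vec = np.zeros(MAX_ATOMS)
    for neighbor in adjacency.get(index, []):
        vec[neighbor] = 1.0
    return vec

def to_atomic_vector_sequence(symbols, adjacency):
    # Splice the three sub-vectors of each atom and stack them into a sequence.
    return np.stack([
        np.concatenate([coding_sub_vector(s),
                        position_sub_vector(i),
                        graph_structure_sub_vector(i, adjacency)])
        for i, s in enumerate(symbols)])

# Example: a three-atom C-C-O fragment in which the oxygen atom is masked out;
# the coding sub-vector of the true atom serves as the second tag vector.
symbols = ["C", "C", "[MASK]"]
adjacency = {0: [1], 1: [0, 2], 2: [1]}
first_atomic_vector_sequence = to_atomic_vector_sequence(symbols, adjacency)
second_tag_vector = coding_sub_vector("O")
print(first_atomic_vector_sequence.shape)   # (3, 38): 6 + 16 + 16 per atom

The second conversion module 207 would apply the same construction to the second atomic representation of the target compound before the trained property classification model classifies it.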
Illustratively, the computer readable instructions may be partitioned into one or more modules that are stored in the memory 301 and executed by the processor 302 to carry out the present method. The one or more modules may be a series of computer readable instruction segments capable of performing a particular function, the instruction segments describing the execution process of the computer readable instructions in the computer device 30. For example, the computer readable instructions may be divided into the first obtaining module 201, the first conversion module 202, the extraction module 203, the calculation module 204, the training module 205, the second obtaining module 206, the second conversion module 207, and the classification module 208 in Fig. 2; for the specific function of each module, see Embodiment Two.
Those skilled in the art will appreciate that Fig. 3 is merely an example of the computer device 30 and does not constitute a limitation of the computer device 30; the computer device 30 may include more or fewer components than shown, may combine certain components, or may have different components; for example, the computer device 30 may also include input and output devices, network access devices, buses, and the like.
The processor 302 may be a central processing unit (CPU), or may be another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor 302 may be any conventional processor. The processor 302 is the control center of the computer device 30 and connects the various parts of the entire computer device 30 through various interfaces and lines.
The memory 301 may be used to store the computer readable instructions, and the processor 302 implements the various functions of the computer device 30 by running or executing the computer readable instructions or modules stored in the memory 301 and invoking the data stored in the memory 301. The memory 301 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required for at least one function (such as a sound playing function or an image playing function), and the data storage area may store data created according to the use of the computer device 30. In addition, the memory 301 may include a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, a read-only memory (ROM), a random access memory (RAM), or other non-volatile/volatile storage device.
If the modules integrated in the computer device 30 are implemented in the form of software functional modules and sold or used as a stand-alone product, they may be stored in a computer readable storage medium. Based on this understanding, the present invention may implement all or part of the processes in the methods of the above embodiments by instructing the relevant hardware through computer readable instructions, which may be stored in a computer readable storage medium; when executed by a processor, the computer readable instructions implement the steps of the respective method embodiments described above. The computer readable instructions comprise computer readable instruction code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer readable medium may include any entity or device capable of carrying the computer readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), and so on.
In the several embodiments provided in the present invention, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in each embodiment of the present invention may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in hardware plus software functional modules.
The integrated modules, if implemented in the form of software functional modules, may be stored in a computer readable storage medium. The software functional modules are stored in a storage medium and include several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform part of the steps of the compound classification method according to the embodiments of the present invention.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned. Furthermore, it is evident that the word "comprising" does not exclude other modules or steps, and that the singular does not exclude a plurality. A plurality of modules or means recited in the system claims can also be implemented by means of one module or means in software or hardware. The terms first, second, etc. are used to denote a name, but not any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims (8)

1. A method of classifying a compound, the method comprising:
obtaining a first atomic representation of a sample compound, obtaining a first tag vector of the sample compound based on a compound property, and obtaining a missing atom corresponding to the first atomic representation;
converting the first atomic representation into a first atomic vector sequence, wherein a coding sub-vector of each atom is obtained from a preset coding table, and the first atomic vector of each atom comprises the coding sub-vector, a position sub-vector and a graph structure sub-vector of the atom; and converting the missing atom into a second tag vector of the first atomic representation, wherein the coding sub-vector of the missing atom is obtained from the preset coding table and is determined as the second tag vector;
extracting the atomic features of the compound through a feature extraction model by taking the first atomic vector sequence as input, to obtain a feature vector sequence of the sample compound;
calculating a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculating a missing atom vector of the sample compound according to the feature vector sequence through a second classification model;
training a property classification model composed of the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and training a missing atom prediction model composed of the feature extraction model and the second classification model according to the second tag vector and the missing atom vector, comprising: calculating a first difference vector between the first tag vector and the property feature vector; calculating a second difference vector between the second tag vector and the missing atom vector; splicing the first difference vector and the second difference vector to obtain a third difference vector; and synchronously optimizing parameters in the property classification model and the missing atom prediction model according to the third difference vector by adopting a back propagation algorithm; or, optimizing parameters in the property classification model according to the first difference vector by adopting a back propagation algorithm, and asynchronously optimizing parameters in the missing atom prediction model according to the second difference vector by adopting a back propagation algorithm;
acquiring a second atomic representation of a target compound to be classified;
converting the second atomic representation into a second atomic vector sequence; and
classifying the target compound by taking the second atomic vector sequence as input through the trained property classification model and the trained missing atom prediction model.
2. The method of classifying compounds according to claim 1, wherein said obtaining a first atomic representation of a sample compound comprises:
obtaining a simplified molecular-input line-entry system (SMILES) representation of the sample compound; or
obtaining a molecular fingerprint representation of the sample compound; or
obtaining an International Chemical Identifier (InChI) representation of the sample compound.
3. The method of classifying a compound as recited in claim 1, wherein said converting said first atomic representation into a first sequence of atomic vectors comprises:
acquiring a coding sub-vector, a position sub-vector and a graph structure sub-vector of each atom in the first atomic representation;
splicing the coding sub-vector, the position sub-vector and the graph structure sub-vector of each atom in the first atomic representation to obtain a first atomic vector of each atom in the first atomic representation; and
combining the first atomic vectors of a plurality of atoms in the first atomic representation to obtain the first atomic vector sequence.
4. The method of compound classification according to claim 1, wherein the feature extraction model comprises a BERT model, an RNN model, or a Transformer model.
5. The method of classifying compounds according to claim 1, wherein said calculating, by a first classification model, a characteristic feature vector of said sample compound from said sequence of feature vectors comprises:
calculating a water solubility feature vector of the sample compound according to the feature vector sequence by using a water solubility classification sub-model in the first classification model; and
calculating a toxicity feature vector of the sample compound according to the feature vector sequence by using a toxicity classification sub-model in the first classification model.
6. A compound classification apparatus for implementing the compound classification method according to any one of claims 1 to 5, comprising:
a first obtaining module, configured to obtain a first atomic representation of a sample compound, obtain a first tag vector of the sample compound based on a compound property, and obtain a missing atom corresponding to the first atomic representation;
a first conversion module, configured to convert the first atomic representation into a first atomic vector sequence, and convert the missing atom into a second tag vector of the first atomic representation;
an extraction module, configured to extract the atomic features of the compound through a feature extraction model by taking the first atomic vector sequence as input, to obtain a feature vector sequence of the sample compound;
a calculation module, configured to calculate a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculate a missing atom vector of the sample compound according to the feature vector sequence through a second classification model;
a training module, configured to train a property classification model composed of the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and train a missing atom prediction model composed of the feature extraction model and the second classification model according to the second tag vector and the missing atom vector;
a second obtaining module, configured to obtain a second atomic representation of the target compound to be classified;
a second conversion module, configured to convert the second atomic representation into a second atomic vector sequence; and
a classification module, configured to classify the target compound by taking the second atomic vector sequence as input through the trained property classification model.
7. A computer device comprising a processor for executing computer readable instructions stored in a memory to implement the compound classification method of any one of claims 1 to 5.
8. A computer readable storage medium having computer readable instructions stored thereon, which when executed by a processor, implement the method of classifying a compound according to any one of claims 1 to 5.
CN202010917059.2A 2020-09-03 2020-09-03 Method for classifying compounds and related equipment Active CN111986740B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010917059.2A CN111986740B (en) 2020-09-03 2020-09-03 Method for classifying compounds and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010917059.2A CN111986740B (en) 2020-09-03 2020-09-03 Method for classifying compounds and related equipment

Publications (2)

Publication Number Publication Date
CN111986740A CN111986740A (en) 2020-11-24
CN111986740B true CN111986740B (en) 2024-05-14

Family

ID=73448044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010917059.2A Active CN111986740B (en) 2020-09-03 2020-09-03 Method for classifying compounds and related equipment

Country Status (1)

Country Link
CN (1) CN111986740B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018098588A1 (en) * 2016-12-02 2018-06-07 Lumiant Corporation Computer systems for and methods of identifying non-elemental materials based on atomistic properties
CN109493922A (en) * 2018-11-19 2019-03-19 大连思利科环境科技有限公司 A method of prediction chemicals molecular structural parameter
CN109658989A (en) * 2018-11-14 2019-04-19 国网新疆电力有限公司信息通信公司 Class drug compound toxicity prediction method based on deep learning
CN110428864A (en) * 2019-07-17 2019-11-08 大连大学 Method for constructing the affinity prediction model of protein and small molecule
CN110751230A (en) * 2019-10-30 2020-02-04 深圳市太赫兹科技创新研究院有限公司 Substance classification method, substance classification device, terminal device and storage medium
CN110767271A (en) * 2019-10-15 2020-02-07 腾讯科技(深圳)有限公司 Compound property prediction method, device, computer device and readable storage medium
CN110867254A (en) * 2019-11-18 2020-03-06 北京市商汤科技开发有限公司 Prediction method and device, electronic device and storage medium
CN110957012A (en) * 2019-11-28 2020-04-03 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for analyzing properties of compound

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU1067699A (en) * 1997-10-07 1999-04-27 New England Medical Center Hospitals, Inc., The Structure-based rational design of compounds to inhibit papillomavirus infe ction

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018098588A1 (en) * 2016-12-02 2018-06-07 Lumiant Corporation Computer systems for and methods of identifying non-elemental materials based on atomistic properties
CN109658989A (en) * 2018-11-14 2019-04-19 国网新疆电力有限公司信息通信公司 Class drug compound toxicity prediction method based on deep learning
CN109493922A (en) * 2018-11-19 2019-03-19 大连思利科环境科技有限公司 A method of prediction chemicals molecular structural parameter
CN110428864A (en) * 2019-07-17 2019-11-08 大连大学 Method for constructing the affinity prediction model of protein and small molecule
CN110767271A (en) * 2019-10-15 2020-02-07 腾讯科技(深圳)有限公司 Compound property prediction method, device, computer device and readable storage medium
CN110751230A (en) * 2019-10-30 2020-02-04 深圳市太赫兹科技创新研究院有限公司 Substance classification method, substance classification device, terminal device and storage medium
CN110867254A (en) * 2019-11-18 2020-03-06 北京市商汤科技开发有限公司 Prediction method and device, electronic device and storage medium
CN110957012A (en) * 2019-11-28 2020-04-03 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for analyzing properties of compound

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Representation of compounds for machine-learning prediction of physical properties; Atsuto Seko et al.; Physical Review B; Vol. 95, No. 14; 144110 (1-11) *
Compound Analysis Based on Machine Learning; An Qiangqiang; Contemporary Chemical Industry; Vol. 47, No. 1; pp. 38-40, 52 *
Support Vector Machine Classification Study on Aquatic Toxicity Modes of Action of Organic Compounds; Yi Zhongsheng et al.; Guangxi Sciences; Vol. 13, No. 1; pp. 31-34 *

Also Published As

Publication number Publication date
CN111986740A (en) 2020-11-24

Similar Documents

Publication Publication Date Title
CN111259142B (en) Specific target emotion classification method based on attention coding and graph convolution network
CN112131383B (en) Specific target emotion polarity classification method
CN111461168A (en) Training sample expansion method and device, electronic equipment and storage medium
CN112086144A (en) Molecule generation method, molecule generation device, electronic device, and storage medium
CN111639500A (en) Semantic role labeling method and device, computer equipment and storage medium
US20210374517A1 (en) Continuous Time Self Attention for Improved Computational Predictions
CN113111190A (en) Knowledge-driven dialog generation method and device
CN113239702A (en) Intention recognition method and device and electronic equipment
Shiloh-Perl et al. Introduction to deep learning
CN114528835A (en) Semi-supervised specialized term extraction method, medium and equipment based on interval discrimination
CN114896067A (en) Automatic generation method and device of task request information, computer equipment and medium
CN115222443A (en) Client group division method, device, equipment and storage medium
WO2022068314A1 (en) Neural network training method, neural network compression method and related devices
CN112036439B (en) Dependency relationship classification method and related equipment
CN114428860A (en) Pre-hospital emergency case text recognition method and device, terminal and storage medium
CN117634459A (en) Target content generation and model training method, device, system, equipment and medium
CN117829122A (en) Text similarity model training method, device and medium based on conditions
CN117094451A (en) Power consumption prediction method, device and terminal
CN111986740B (en) Method for classifying compounds and related equipment
CN111931841A (en) Deep learning-based tree processing method, terminal, chip and storage medium
CN114998041A (en) Method and device for training claim settlement prediction model, electronic equipment and storage medium
KR20210035622A (en) Time series data similarity calculation system and method
Sevakula et al. Improving Classifier Generalization: Real-Time Machine Learning Based Applications
CN113284256B (en) MR (magnetic resonance) mixed reality three-dimensional scene material library generation method and system
CN113160795B (en) Language feature extraction model training method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20210202

Address after: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Applicant after: Shenzhen saiante Technology Service Co.,Ltd.

Address before: 1-34 / F, Qianhai free trade building, 3048 Xinghai Avenue, Mawan, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong 518000

Applicant before: Ping An International Smart City Technology Co.,Ltd.

GR01 Patent grant
GR01 Patent grant