CN114548308A - Deep learning method and device for identifying persistent organic pollutants - Google Patents

Info

Publication number
CN114548308A
Authority
CN
China
Prior art keywords
persistent organic
layer
compound
identified
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210190955.2A
Other languages
Chinese (zh)
Other versions
CN114548308B (en
Inventor
孙翔飞
曾永平
麦磊
谢梦仪
江瑞芬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan University
Original Assignee
Jinan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan University
Priority to CN202210190955.2A
Publication of CN114548308A
Application granted
Publication of CN114548308B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of environmental monitoring, and discloses a deep learning method and device for identifying persistent organic pollutants. The method comprises the following steps: extracting a plurality of molecular descriptors for a compound to be identified, wherein the number of the molecular descriptors is greater than or equal to a molecular descriptor threshold, and the molecular descriptor threshold is 2201; arranging the plurality of molecular descriptors in a preset mode to obtain a two-dimensional structural feature description matrix; and processing the two-dimensional structural feature description matrix by using a pre-trained deep convolutional neural network model to determine whether the compound to be identified is a persistent organic pollutant. The device comprises an extraction module, an obtaining module and a determining module. With this scheme, the identification precision of potential persistent organic pollutants in commercial chemicals can be improved, the robustness of the deep convolutional neural network model is greatly improved, and complex organic compounds with different chemical structures and element compositions can be identified more quickly and effectively.

Description

Deep learning method and device for identifying persistent organic pollutants
Technical Field
The invention belongs to the field of environmental monitoring, and particularly relates to a deep learning method and device for identifying persistent organic pollutants.
Background
Over the past fifty years, persistent organic pollutants have been produced and used in large quantities as industrial additives (such as flame retardants, surface treatment agents and stabilizers) and pesticides (such as lindane, DDTs and mirex). This has caused a series of regional environmental pollution events around the world, and the harm these chemicals cause to the environment and human health has gradually been recognized. These compounds are highly persistent, bioaccumulative and toxic.
Traditional laboratory identification methods are expensive and time-consuming, and cannot meet the practical need to screen massive numbers of chemicals rapidly. Existing model-based prediction methods are mainly weighted scoring algorithms, which are relatively coarse: the training data usually contain only a few hundred chemicals, and only a few carefully selected molecular descriptors are used. Because a single molecular descriptor captures only the specific chemical structure it was extracted for, identification ability degrades significantly for chemicals with dissimilar structures, particularly new alternatives.
Disclosure of Invention
The invention aims to provide a deep learning method and a deep learning device for identifying persistent organic pollutants, which can effectively improve the identification precision of persistent organic pollutants, greatly improve the robustness of the model, and enable the model to identify complex organic compounds with different chemical structures and element compositions more quickly and effectively.
In order to achieve the above object, the present invention provides a deep learning method for identifying persistent organic pollutants, which comprises the following steps:
extracting a plurality of molecular descriptors for a compound to be identified, wherein the number of the molecular descriptors is greater than or equal to a molecular descriptor threshold, and the molecular descriptor threshold is 2201; arranging the plurality of molecular descriptors in a preset mode to obtain a two-dimensional structural feature description matrix; processing the two-dimensional structural feature description matrix by using a pre-trained deep convolutional neural network model to determine whether the compound to be identified is a persistent organic pollutant; wherein the preset mode is the same as the arrangement mode of all the molecular descriptors extracted from each sample in the training data set used for training the deep convolutional neural network model.
In the deep learning method as described above, optionally, the deep convolutional neural network model includes: an input layer, a hidden layer and an output layer; along the processing direction of data, the hidden layer sequentially comprises: a first convolutional layer, a second convolutional layer, a first pooling layer, a third convolutional layer, a fourth convolutional layer, a second pooling layer and a fully connected layer, wherein the first pooling layer is connected with a first forgetting gate, the second pooling layer is connected with a second forgetting gate, and the fully connected layer is connected with a third forgetting gate.
In the deep learning method as described above, optionally, before the extracting a plurality of molecular descriptors for the compound to be identified, the deep learning method further includes: constructing the training data set, the training data set comprising: a plurality of persistent organic pollutants and a plurality of non-persistent organic pollutants.
In the deep learning method as described above, optionally, there are 1309 kinds of persistent organic pollutants and 9990 kinds of non-persistent organic pollutants; the number of the molecular descriptors is 2201.
In the deep learning method as described above, optionally, the processing the two-dimensional structural feature description matrix using a pre-trained deep convolutional neural network model to determine whether the compound to be identified is a persistent organic pollutant includes: judging, according to the two-dimensional structural feature description matrix, whether the similarity score of the compound to be identified is lower than a set threshold; if so, determining that the compound to be identified is not a persistent organic pollutant; if not, determining that the compound to be identified is a persistent organic pollutant.
In another aspect, a deep learning apparatus for identifying persistent organic pollutants is provided, which includes: an extraction module, used for extracting a plurality of molecular descriptors for a compound to be identified, wherein the number of the molecular descriptors is greater than or equal to a molecular descriptor threshold, and the molecular descriptor threshold is 2201; an obtaining module, used for arranging the plurality of molecular descriptors in a preset mode to obtain a two-dimensional structural feature description matrix; and a determining module, used for processing the two-dimensional structural feature description matrix by using a pre-trained deep convolutional neural network model to determine whether the compound to be identified is a persistent organic pollutant, wherein the preset mode is the same as the arrangement mode of all molecular descriptors extracted from each sample in the training data set used for training the deep convolutional neural network model.
In the deep learning apparatus as described above, optionally, the deep convolutional neural network model in the determining module includes: an input layer, a hidden layer and an output layer; along the processing direction of data, the hidden layer sequentially comprises: a first convolutional layer, a second convolutional layer, a first pooling layer, a third convolutional layer, a fourth convolutional layer, a second pooling layer and a fully connected layer, wherein the first pooling layer is connected with a first forgetting gate, the second pooling layer is connected with a second forgetting gate, and the fully connected layer is connected with a third forgetting gate.
In the deep learning apparatus as described above, optionally, the deep learning apparatus further includes: a training data set construction module for constructing a training data set, the training data set comprising: a plurality of persistent organic pollutants and a plurality of non-persistent organic pollutants.
In the deep learning apparatus as described above, optionally, there are 1309 kinds of persistent organic pollutants and 9990 kinds of non-persistent organic pollutants; the number of the molecular descriptors is 2201.
In the deep learning apparatus as described above, optionally, the determining module is configured to judge, according to the two-dimensional structural feature description matrix, whether the similarity score of the compound to be identified is lower than a set threshold; if so, to determine that the compound to be identified is not a persistent organic pollutant; if not, to determine that the compound to be identified is a persistent organic pollutant.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
A plurality of molecular descriptors are extracted for a compound to be identified and arranged in a preset mode to obtain a two-dimensional structural feature description matrix, and the matrix is processed by a pre-trained deep convolutional neural network model to determine whether the compound to be identified is a persistent organic pollutant. In this way, the identification precision of potential persistent organic pollutants in commercial chemicals can be improved, the robustness of the deep convolutional neural network model is greatly improved, and complex organic compounds with different chemical structures and element compositions can be identified more quickly and effectively.
Drawings
FIG. 1 is a schematic flowchart of a deep learning method for identifying persistent organic pollutants according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a deep learning method for identifying persistent organic pollutants according to another embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a deep learning apparatus for identifying persistent organic pollutants according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a deep convolutional neural network model according to an embodiment of the present invention;
FIG. 5 is a schematic representation of two-dimensional structural feature description matrices for certain compounds provided by an embodiment of the present invention;
FIG. 6 is a training convergence curve of a deep convolutional neural network model according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings. The various examples are provided by way of explanation of the invention, and not limitation of the invention.
Referring to FIG. 1, an embodiment of the present invention provides a deep learning method for identifying persistent organic pollutants, which includes the following steps:
Step 101, extracting a plurality of molecular descriptors for a compound to be identified, wherein the number of the molecular descriptors is greater than or equal to a molecular descriptor threshold, and the molecular descriptor threshold is 2201.
Step 102, arranging the plurality of molecular descriptors in a preset mode to obtain a two-dimensional structural feature description matrix.
Step 103, processing the two-dimensional structural feature description matrix by using a pre-trained deep convolutional neural network model to determine whether the compound to be identified is a persistent organic pollutant, wherein the arrangement mode of all molecular descriptors extracted from each sample in a training data set is the same as the preset mode, and the training data set is the data set used in training the deep convolutional neural network.
In this embodiment, a plurality of molecular descriptors are extracted for the compound to be identified, the number of the molecular descriptors being greater than or equal to a molecular descriptor threshold of 2201; the molecular descriptors are arranged in a preset mode to obtain a two-dimensional structural feature description matrix; and the matrix is processed by a pre-trained deep convolutional neural network model to determine whether the compound to be identified is a persistent organic pollutant. The model can thus use a large number of molecular descriptors, which improves the accuracy and effectiveness of theoretical prediction of persistent organic pollutants, improves the identification precision of potential persistent organic pollutants in commercial chemicals, and greatly improves the robustness of the model, so that complex organic compounds with different chemical structures and element compositions can be identified more quickly and effectively.
Referring to FIG. 2, another embodiment of the present invention provides a deep learning method for identifying persistent organic pollutants, which includes the following steps:
Step 201, constructing a training data set.
Specifically, lists of industrial chemicals are collected, and a training data set is constructed by cross-matching them against a plurality of pollutant databases and control lists of POPs (Persistent Organic Pollutants) and PBTs (Persistent, Bioaccumulative and Toxic substances). The training data set includes: a plurality of persistent organic pollutants (positive samples) and a plurality of non-persistent organic pollutants (negative samples). In this embodiment, there may be 1309 kinds of positive samples and 9990 kinds of negative samples; in other embodiments, other numbers may be used, which is not limited in this embodiment. When there are fewer kinds of positive samples than negative samples, that is, when the ratio of positive to negative samples is smaller than 1, the relative weight of the positive and negative samples is adjusted by a downsampling method, which improves the sensitivity of the deep convolutional neural network model to the small-proportion samples and improves the overall training effect of the model.
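The text does not spell out the downsampling procedure. As a minimal sketch, assuming it means randomly discarding negative samples until a chosen positive-to-negative ratio is reached, the step could look as follows; the function name `downsample_negatives` and its `ratio` and `seed` parameters are hypothetical and not taken from the source.

```python
import random

def downsample_negatives(positives, negatives, ratio=1.0, seed=0):
    """Randomly keep a subset of negatives so that
    len(positives) / len(kept_negatives) is about `ratio`.

    Assumption: plain random undersampling of the majority class.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    keep = min(len(negatives), round(len(positives) / ratio))
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return list(positives), rng.sample(list(negatives), keep)

# Class sizes from the embodiment: 1309 positive and 9990 negative kinds.
pos = [f"pop_{i}" for i in range(1309)]
neg = [f"non_pop_{i}" for i in range(9990)]
kept_pos, kept_neg = downsample_negatives(pos, neg)
print(len(kept_pos), len(kept_neg))
```

With `ratio=1.0` the two classes end up the same size; a smaller ratio keeps more negatives.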
The sample data are preferably taken from official industrial chemical databases published by the United States, China, the European Union, Japan, Korea, Australia, Canada, New Zealand, Vietnam, Malaysia, Russia and others, and from third-party chemical databases such as that of NORMAN (Network of reference laboratories, research centers and related organizations for monitoring emerging environmental pollutants).
Step 202, constructing a deep convolutional neural network model.
Specifically, a deep convolutional neural network is a type of feed-forward neural network that includes convolution calculations and has a deep structure, and is one of the representative algorithms of deep learning. It has strong representation learning capability, and can learn and store a large number of input-output mapping relations without the mathematical equations describing those mappings being specified in advance. Learning uses the steepest descent method: the weights and thresholds of the network are adjusted continuously through back propagation so as to minimize the sum of squared errors of the network.
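To make the learning rule concrete, the following toy sketch applies steepest descent to the sum of squared errors of a single linear unit y = w*x + b. It is an illustration only, not the patent's training code: the "threshold" of the text is assumed to correspond to the bias term `b`, and the learning rate, step count and data are arbitrary illustrative choices.

```python
def steepest_descent(xs, ts, lr=0.02, steps=2000):
    """Fit y = w*x + b by steepest descent on E = sum((y - t)^2)."""
    w, b = 0.0, 0.0
    history = []  # value of E before each update
    for _ in range(steps):
        errors = [w * x + b - t for x, t in zip(xs, ts)]
        history.append(sum(e * e for e in errors))
        # Gradients of E with respect to w and b.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 * sum(errors)
        # Move against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, history

# Targets generated from t = 2*x + 1, so w should approach 2 and b approach 1.
w, b, history = steepest_descent([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(round(w, 4), round(b, 4), history[0], history[-1])
```

In a multi-layer network the same update is applied to every weight, with back propagation supplying the gradients layer by layer.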
Referring to FIG. 4, a twelve-layer deep convolutional neural network model is established, which includes an input layer, a hidden layer and an output layer. The hidden layer includes four convolutional neural network layers, two pooling layers, three forgetting neural gates (or forgetting gates) and a fully connected layer. The four convolutional neural network layers are the first, second, third and fourth convolutional neural network layers (or convolutional layers 1 to 4). The two pooling layers are a first pooling layer (or pooling layer 1) and a second pooling layer (or pooling layer 2). The three forgetting neural gates are a first, a second and a third forgetting neural gate (or forgetting gates 1 to 3): the first forgetting gate is connected with the first pooling layer, the second forgetting gate is connected with the second pooling layer, and the third forgetting gate is connected with the fully connected layer. The convolutional layers are used for feature extraction, the pooling layers are used for summarizing and generalizing features, and the forgetting gates are used for preventing the model from over-fitting and keeping the focus on overall differences. This structure improves the model's ability to recognize complex parameters, and also helps avoid model distortion and vanishing or exploding gradients.
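The layer ordering can be written down and checked with a small shape tracer. This is a sketch only: the source gives the layer sequence but no kernel sizes, channel counts, pooling windows or input size, so the 47x47 input, 3x3 same-padded convolutions, 2x2 pooling, channel widths and 128 fully connected units below are all assumptions. Placing each forgetting gate directly after the layer it is "connected with", and treating it as shape-preserving, is likewise an assumption.

```python
# (name, kind) pairs in data-flow order; twelve entries, as in the text.
LAYERS = [
    ("input", "input"),
    ("conv1", "conv"), ("conv2", "conv"), ("pool1", "pool"), ("forget1", "gate"),
    ("conv3", "conv"), ("conv4", "conv"), ("pool2", "pool"), ("forget2", "gate"),
    ("fc", "fc"), ("forget3", "gate"),
    ("output", "output"),
]

def trace_shapes(side=47, channels=(16, 16, 32, 32), kernel=3, pool=2, fc_units=128):
    """Return [(layer name, output shape)] under the assumed hyperparameters."""
    shape = (1, side, side)  # one input channel: the descriptor matrix
    widths = iter(channels)
    trace = []
    for name, kind in LAYERS:
        if kind == "conv":
            _, h, w = shape
            pad = kernel // 2  # "same" padding for odd kernel sizes
            shape = (next(widths), h + 2 * pad - kernel + 1, w + 2 * pad - kernel + 1)
        elif kind == "pool":
            c, h, w = shape
            shape = (c, h // pool, w // pool)
        elif kind == "fc":
            shape = (fc_units,)  # features are flattened, then projected
        elif kind == "output":
            shape = (1,)  # single similarity score
        # "input" and "gate" layers leave the shape unchanged
        trace.append((name, shape))
    return trace

for name, shape in trace_shapes():
    print(f"{name:8s} {shape}")
```

Counting the entries of `LAYERS` reproduces the twelve layers named in the text: input, four convolutional layers, two pooling layers, three forgetting gates, one fully connected layer and output.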
Step 203, training a deep convolutional neural network model.
Specifically, a two-dimensional structural feature description matrix is obtained first.
The chemical structure characteristics of all chemicals in the training data set are analyzed by using molecular structure analysis software, such as Dragon, alvaDesc, MoDEL, MolGen or Cerius2. For each chemical, a plurality of compound structure characteristics, namely molecular descriptors, are extracted; a molecular descriptor is quantitative data representing a specific structural feature of the compound molecule. The number of the molecular descriptors is greater than or equal to a molecular descriptor threshold, and the molecular descriptor threshold is 2201, so that the chemical structure characteristics of the chemicals can be fully mined and used by the deep convolutional neural network model for feature mining. The molecular descriptor structure data include: chemical element types, bonding modes, spatial topological structure characteristics and connection modes. It should be noted that the molecular descriptor threshold can be reduced moderately: removing some of the molecular descriptors (for example, a few tens) does not affect the best performance of the deep convolutional neural network model, but reduces the overall stability of the model. Preferably, the molecular descriptor threshold is 2201.
All the molecular descriptors of each chemical are then arranged in a preset mode to construct a two-dimensional structural feature description matrix, that is, the list of molecular descriptors is converted into a two-dimensional structural feature description matrix. Before each training, the molecular descriptors are arranged randomly. The precision of the deep convolutional neural network model is not affected by the particular arrangement, but the arrangement of the molecular descriptors must be the same for all compounds. Specifically, starting from the initial molecular descriptors, the molecular descriptors are randomly arranged and the resulting order is recorded; the molecular descriptors of all subsequent compounds are reordered according to that recorded order, which is the preset mode.
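The recorded-order idea can be sketched as follows. The source does not state the matrix dimensions, so zero-padding the 2201 values to the smallest square (47x47) is an assumption, and the helper names `make_preset_order` and `to_matrix` are hypothetical.

```python
import math
import random

N_DESCRIPTORS = 2201  # descriptor count stated in the text

def make_preset_order(n, seed):
    """Shuffle the descriptor indices once and record the order."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order

def to_matrix(descriptors, order):
    """Reorder one compound's descriptors by the recorded order and fold
    them row by row into a square matrix, zero-padding the tail.
    Assumption: the matrix shape is not given in the source."""
    if len(descriptors) != len(order):
        raise ValueError("descriptor count does not match the preset order")
    side = math.isqrt(len(order) - 1) + 1  # smallest side with side*side >= n
    flat = [descriptors[i] for i in order]
    flat += [0.0] * (side * side - len(flat))
    return [flat[r * side:(r + 1) * side] for r in range(side)]

# The same recorded order is applied to every compound, in training and inference.
order = make_preset_order(N_DESCRIPTORS, seed=42)
compound_a = [float(i) for i in range(N_DESCRIPTORS)]
matrix_a = to_matrix(compound_a, order)
print(len(matrix_a), len(matrix_a[0]))
```

Because `order` is stored rather than regenerated, a given descriptor always lands in the same cell of the matrix for every compound, which is the consistency the paragraph above requires.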
The two-dimensional structural feature description matrix is the input to the deep convolutional neural network model. Referring to FIG. 5, two-dimensional structural feature description matrices for typical compounds are illustrated. The typical compounds include typical persistent organic pollutants and typical non-persistent organic pollutants. The typical persistent organic pollutants are, from left to right: benzo(a)pyrene, PCB-126 and o,p'-DDT. The typical non-persistent organic pollutants are, from left to right: fenoxaprop-P-ethyl, timolol maleate and 1,5-pentanethiol.
For convenience of calculation, after the extracted molecular descriptors are arranged into the two-dimensional structural feature description matrix in the preset mode and before the matrix is input into the deep convolutional neural network model, all data are standardized, that is, all values are normalized to between 0 and 1 using the mean and variance.
The obtained two-dimensional structural feature description matrix is used as the input of the deep convolutional neural network model. The output of the model is a similarity between 0 and 1, generally expressed as a percentage, which is compared with a set threshold: if the similarity is lower than the threshold, the model outputs that the compound is not a persistent organic pollutant; if not, it outputs that the compound is a persistent organic pollutant. The higher the similarity, the higher the likelihood that the compound is actually a persistent organic pollutant.
Parameters of the deep convolutional neural network model, such as the connection weights and node thresholds, are obtained by repeated training on the training data set. The training endpoint depends on the model's performance on the test data set: once the accuracy on the test data set exceeds 90%, the training process may be terminated. This typically requires more than 300 training cycles. The model can be trained by a supervised learning method to obtain a parameter-optimized deep convolutional neural network model, namely the trained deep convolutional neural network model. The training convergence curve is shown in FIG. 6.
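The stopping rule described here can be sketched as a small loop. The callbacks `train_one_cycle` and `evaluate` are hypothetical placeholders for one pass over the training set and an accuracy measurement on the test set; the `max_cycles` cap is an added safeguard that is not in the source.

```python
def train_until(train_one_cycle, evaluate, target=0.90, max_cycles=1000):
    """Run training cycles until test accuracy exceeds `target`.

    Returns (cycles_run, last_accuracy).
    """
    if max_cycles < 1:
        raise ValueError("max_cycles must be at least 1")
    accuracy = 0.0
    for cycle in range(1, max_cycles + 1):
        train_one_cycle()
        accuracy = evaluate()
        if accuracy > target:  # strictly greater than 90%, as in the text
            return cycle, accuracy
    return max_cycles, accuracy

# Toy run with a scripted accuracy curve standing in for a real model.
curve = iter([0.50, 0.80, 0.90, 0.93, 0.99])
cycles, acc = train_until(lambda: None, lambda: next(curve))
print(cycles, acc)
```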
Step 204, extracting a plurality of molecular descriptors for the compound to be identified, wherein the number of the molecular descriptors is greater than or equal to a molecular descriptor threshold, and the molecular descriptor threshold is 2201.
The molecular structure analysis software used to extract the molecular descriptors may be alvaDesc, as shown in FIG. 4.
Step 205, arranging a plurality of molecular descriptors in a preset mode to obtain a two-dimensional structural feature description matrix.
The preset mode is the same as the arrangement mode of all the molecular descriptors extracted from each sample in the training data set for training the deep convolutional neural network model.
Step 206, processing the two-dimensional structural feature description matrix by using a pre-trained deep convolutional neural network model to determine whether the compound to be identified is a persistent organic pollutant, wherein the preset mode is the same as the arrangement mode of all the molecular descriptors extracted from each sample in the training data set.
The pre-trained deep convolutional neural network model is the model trained in the preceding steps. In each convolutional neural network layer, the model scans segments of the molecular descriptor matrix of preset length and width in sequence along a preset route, such as from top to bottom and from left to right, so as to obtain the general structural features of persistent organic pollutants. The general structural features obtained are output to the fully connected layer, which judges the likelihood that the target compound is a persistent organic pollutant with reference to the differences between the positive and negative samples in the training set, and outputs a similarity score. The similarity score is between 0 and 1, typically expressed as a percentage, and indicates the degree of similarity of the target compound to persistent organic pollutants. Specifically, it is judged whether the similarity score of the compound to be identified is lower than a set threshold: if so, the compound to be identified is determined not to be a persistent organic pollutant (non-POPs); if not, it is determined to be a persistent organic pollutant (POPs). The higher the similarity, the higher the likelihood that the compound is actually a persistent organic pollutant. In practical application, the set threshold may be 0.95; in other embodiments, it may also be 0.93, which is not specifically limited in this embodiment.
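The threshold decision is simple enough to state directly. The sketch below uses the two threshold values mentioned in the text; the function name `classify` and the label strings are illustrative choices.

```python
def classify(similarity, threshold=0.95):
    """Map a similarity score in [0, 1] to a label.

    Scores below the threshold give "non-POPs"; scores at or above it
    give "POPs", following the rule in the text.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie between 0 and 1")
    return "non-POPs" if similarity < threshold else "POPs"

print(classify(0.97))                  # at or above 0.95
print(classify(0.94))                  # below 0.95
print(classify(0.94, threshold=0.93))  # the alternative threshold from the text
```

Lowering the threshold from 0.95 to 0.93 turns a score of 0.94 from a negative into a positive call, which is the trade-off the choice of threshold controls.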
In this embodiment, a plurality of molecular descriptors are extracted for the compound to be identified, the number of the molecular descriptors being greater than or equal to a molecular descriptor threshold of 2201; the molecular descriptors are arranged in a preset mode to obtain a two-dimensional structural feature description matrix; and the matrix is processed by a pre-trained deep convolutional neural network model to determine whether the compound to be identified is a persistent organic pollutant. The model can thus use a large number of molecular descriptors, which improves the accuracy and effectiveness of theoretical prediction of persistent organic pollutants, improves the identification precision of potential persistent organic pollutants in commercial chemicals, and greatly improves the robustness of the model, so that complex organic compounds with different chemical structures and element compositions can be identified more quickly and effectively.
Referring to FIG. 3, an embodiment of the present invention provides a deep learning apparatus for identifying persistent organic pollutants, which includes: an extraction module 301, an obtaining module 302, and a determining module 303.
Specifically, the extraction module 301 is configured to extract a plurality of molecular descriptors for a compound to be identified, where the number of the molecular descriptors is greater than or equal to a molecular descriptor threshold, and the molecular descriptor threshold is 2201. The obtaining module 302 is configured to arrange the plurality of molecular descriptors in a preset mode to obtain a two-dimensional structural feature description matrix. The determining module 303 is configured to process the two-dimensional structural feature description matrix by using a pre-trained deep convolutional neural network model to determine whether the compound to be identified is a persistent organic pollutant, where the arrangement mode of all the molecular descriptors extracted from each sample in the training data set is the same as the preset mode.
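One way to picture how the three modules fit together is a small container that chains them. This is a sketch under assumptions, not the patented apparatus: the class name `PopIdentifier`, its field names and the stub callables in the usage example are all hypothetical, and a real system would plug in descriptor software and a trained model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Matrix = List[List[float]]

@dataclass
class PopIdentifier:
    extract: Callable[[str], Sequence[float]]     # role of extraction module 301
    arrange: Callable[[Sequence[float]], Matrix]  # role of obtaining module 302
    score: Callable[[Matrix], float]              # model used by determining module 303
    threshold: float = 0.95                       # one of the thresholds named in the text
    min_descriptors: int = 2201                   # molecular descriptor threshold

    def identify(self, compound: str) -> bool:
        """Return True if the compound is judged a persistent organic pollutant."""
        descriptors = self.extract(compound)
        if len(descriptors) < self.min_descriptors:
            raise ValueError("too few molecular descriptors")
        matrix = self.arrange(descriptors)
        return self.score(matrix) >= self.threshold

# Usage with stand-in stubs in place of real descriptor software and a trained model.
identifier = PopIdentifier(
    extract=lambda compound: [0.0] * 2201,
    arrange=lambda d: [list(d)],
    score=lambda m: 0.97,
)
print(identifier.identify("example compound"))
```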
Optionally, the deep convolutional neural network model in the determining module includes: an input layer, a hidden layer and an output layer. The hidden layer includes four convolutional neural network layers, two pooling layers, three forgetting neural gates and a fully connected layer. Along the processing direction of data, the hidden layer sequentially includes: a first convolutional layer, a second convolutional layer, a first pooling layer, a third convolutional layer, a fourth convolutional layer, a second pooling layer and a fully connected layer, wherein the first pooling layer, the second pooling layer and the fully connected layer are connected with a first forgetting gate, a second forgetting gate and a third forgetting gate respectively.
Optionally, the deep learning apparatus further comprises: a training data set construction module for constructing a training data set, the training data set comprising: a plurality of persistent organic pollutants and a plurality of non-persistent organic pollutants.
Optionally, there are 1309 kinds of persistent organic pollutants and 9990 kinds of non-persistent organic pollutants; the number of the molecular descriptors is 2201.
Optionally, the determining module is configured to determine, according to the two-dimensional structural feature description matrix, whether the value output for the compound to be identified is lower than a set threshold; if so, determine that the compound to be identified is not a persistent organic pollutant; if not, determine that the compound to be identified is a persistent organic pollutant.
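The threshold rule above can be sketched as a single comparison. The sketch assumes that the quantity compared is a scalar score produced by the network for the compound; the function name and the default threshold of 0.5 are hypothetical, since the text does not give the threshold value.

```python
def is_persistent_organic_pollutant(score, threshold=0.5):
    """Apply the decision rule from the text to a scalar model score.

    Assumption: `score` is the network output for the compound and
    `threshold` is the "set threshold"; 0.5 is a placeholder value.
    """
    # Lower than the threshold -> not a POP; otherwise -> a POP.
    return score >= threshold


for score in (0.12, 0.50, 0.93):
    print(score, is_persistent_organic_pollutant(score))
```

Note that a score exactly equal to the threshold is classified as a persistent organic pollutant, because the rule only excludes values strictly lower than the threshold.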
It should be noted that, for specific descriptions of the extraction module 301, the obtaining module 302, and the determining module 303, reference may be made to the related contents of steps 101 to 103 and steps 201 to 206 in the foregoing embodiments, which are not repeated here.
An embodiment of the present invention provides a deep learning apparatus for identifying persistent organic pollutants, including: a memory and a processor. The processor is coupled to the memory and configured to execute the above deep learning method for identifying persistent organic pollutants based on instructions stored in the memory.
An embodiment of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described deep learning method for identifying persistent organic pollutants.
The description and applications of the invention herein are illustrative and are not intended to limit the scope of the invention to the embodiments described above. Variations and modifications of the embodiments disclosed herein are possible, and alternatives to and equivalents of the various components of the embodiments will be apparent to those skilled in the art. It will be clear to those skilled in the art that the present invention may be embodied in other forms, structures, arrangements, and proportions, and with other components, materials, and parts, without departing from the spirit or essential characteristics thereof. Other variations and modifications of the embodiments disclosed herein may be made without departing from the scope and spirit of the invention.

Claims (10)

1. A deep learning method for identifying persistent organic pollutants, the deep learning method comprising:
extracting a plurality of molecular descriptors for a compound to be identified, wherein the number of the molecular descriptors is greater than or equal to a molecular descriptor threshold, and the molecular descriptor threshold is 2201;
arranging the plurality of molecular descriptors in a preset manner to obtain a two-dimensional structural feature description matrix;
processing the two-dimensional structural feature description matrix by using a pre-trained deep convolutional neural network model to determine whether the compound to be identified is a persistent organic pollutant;
wherein the preset manner is the same as the arrangement manner of all the molecular descriptors extracted from each sample in the training data set used for training the deep convolutional neural network model.
2. The deep learning method of claim 1, wherein the deep convolutional neural network model comprises: an input layer, a hidden layer and an output layer;
along the processing direction of data, the hidden layer sequentially comprises: a first convolution layer, a second convolution layer, a first pooling layer, a third convolution layer, a fourth convolution layer, a second pooling layer and a fully connected layer, wherein the first pooling layer is connected with a first forgetting gate, the second pooling layer is connected with a second forgetting gate, and the fully connected layer is connected with a third forgetting gate.
3. The deep learning method of claim 1, wherein prior to the extracting a plurality of molecular descriptors for the compound to be identified, the deep learning method further comprises:
constructing the training data set, the training data set comprising: a plurality of persistent organic pollutants and a plurality of non-persistent organic pollutants.
4. The deep learning method of claim 3, wherein the number of kinds of persistent organic pollutants is 1309, and the number of kinds of non-persistent organic pollutants is 9990;
the number of the molecular descriptors is 2201.
5. The deep learning method of claim 1, wherein the processing the two-dimensional structural feature description matrix using the pre-trained deep convolutional neural network model to determine whether the compound to be identified is a persistent organic pollutant comprises:
determining, according to the two-dimensional structural feature description matrix, whether the value output for the compound to be identified is lower than a set threshold; if so, determining that the compound to be identified is not a persistent organic pollutant; if not, determining that the compound to be identified is a persistent organic pollutant.
6. A deep learning apparatus for identifying persistent organic pollutants, the deep learning apparatus comprising:
an extraction module, configured to extract a plurality of molecular descriptors for a compound to be identified, wherein the number of the molecular descriptors is greater than or equal to a molecular descriptor threshold, and the molecular descriptor threshold is 2201;
an obtaining module, configured to arrange the plurality of molecular descriptors in a preset manner to obtain a two-dimensional structural feature description matrix;
and a determining module, configured to process the two-dimensional structural feature description matrix by using a pre-trained deep convolutional neural network model to determine whether the compound to be identified is a persistent organic pollutant, wherein the preset manner is the same as the arrangement manner of all the molecular descriptors extracted from each sample in the training data set used for training the deep convolutional neural network model.
7. The deep learning apparatus of claim 6, wherein the deep convolutional neural network model in the determining module comprises: an input layer, a hidden layer and an output layer;
along the processing direction of data, the hidden layer sequentially comprises: a first convolution layer, a second convolution layer, a first pooling layer, a third convolution layer, a fourth convolution layer, a second pooling layer and a fully connected layer, wherein the first pooling layer is connected with a first forgetting gate, the second pooling layer is connected with a second forgetting gate, and the fully connected layer is connected with a third forgetting gate.
8. The deep learning apparatus according to claim 6, further comprising:
a training data set construction module for constructing a training data set, the training data set comprising: a plurality of persistent organic pollutants and a plurality of non-persistent organic pollutants.
9. The deep learning apparatus of claim 8, wherein the number of kinds of persistent organic pollutants is 1309, and the number of kinds of non-persistent organic pollutants is 9990;
the number of the molecular descriptors is 2201.
10. The deep learning apparatus of claim 6, wherein the determining module is configured to:
determine, according to the two-dimensional structural feature description matrix, whether the value output for the compound to be identified is lower than a set threshold; if so, determine that the compound to be identified is not a persistent organic pollutant; if not, determine that the compound to be identified is a persistent organic pollutant.
CN202210190955.2A 2022-02-25 2022-02-25 Deep learning method and device for identifying persistent organic pollutants Active CN114548308B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210190955.2A CN114548308B (en) 2022-02-25 2022-02-25 Deep learning method and device for identifying persistent organic pollutants

Publications (2)

Publication Number Publication Date
CN114548308A true CN114548308A (en) 2022-05-27
CN114548308B CN114548308B (en) 2024-07-16

Family

ID=81661596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210190955.2A Active CN114548308B (en) 2022-02-25 2022-02-25 Deep learning method and device for identifying persistent organic pollutants

Country Status (1)

Country Link
CN (1) CN114548308B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115274002A (en) * 2022-06-13 2022-11-01 中国科学院广州地球化学研究所 Compound persistence screening method based on machine learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101477597A (en) * 2009-01-15 2009-07-08 浙江大学 Natural product active ingredient computation and recognition method based compound characteristic
CN108062556A (en) * 2017-11-10 2018-05-22 广东药科大学 A kind of drug-disease relationship recognition methods, system and device
US20190304568A1 (en) * 2018-03-30 2019-10-03 Board Of Trustees Of Michigan State University System and methods for machine learning for drug design and discovery
KR20210026543A (en) * 2019-08-30 2021-03-10 주식회사 에일론 A system of predicting biological activity for compound with target protein using geometry images and artificial neural network
US20210065913A1 (en) * 2019-09-04 2021-03-04 University Of Central Florida Research Foundation, Inc. Artificial intelligence-based methods for early drug discovery and related training methods

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
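Zhang Ruilin; Ding Yanrui: "Structural optimization of 3D convolutional neural networks and identification of central nervous system drugs", Journal of Northwest University (Natural Science Edition), no. 01, 9 January 2020 (2020-01-09) *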

Also Published As

Publication number Publication date
CN114548308B (en) 2024-07-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant