CN110890137A - Modeling method, device and application of compound toxicity prediction model

Info

Publication number: CN110890137A
Application number: CN201911128499.3A
Authority: CN
Inventors: 桑运霞; 史伟; 宋青芳; 臧卫东; 刘强
Original assignee: Shanghai Eryun Information Technology Co Ltd
Current assignee: Shanghai Eryun Information Technology Co Ltd
Priority date: 2019-11-18
Filing date: 2019-11-18
Publication date: 2020-03-17

Abstract

The invention provides a modeling method of a compound toxicity prediction model, which at least comprises the following steps: step S101, providing toxicity classification labels of candidate modeling compounds; step S102, providing molecular descriptors of each candidate modeling compound; step S103, providing a target protein descriptor of each candidate modeling compound; step S104, providing quantitative high-throughput screening analysis descriptors of the candidate modeling compounds, wherein the quantitative high-throughput screening analysis descriptors refer to PubChem activity scores of quantitative high-throughput screening; and step S105, constructing and training a compound toxicity prediction model. The invention can fully utilize the physicochemical property, the biological activity and the target protein action property of the drug candidate compound, and simultaneously utilizes the statistical modeling advantage of the machine learning algorithm based on the ensemble learning to construct a prediction system of the drug toxicity, so that the model has interpretability and prediction performance, and has better physicochemical and biological significance and research value.

Description

Modeling method, device and application of compound toxicity prediction model

Technical Field

The invention relates to the field of chemical informatics and bioinformatics, in particular to a modeling method and device of a compound toxicity prediction model and application of the modeling method and device.

Background

Modern drug development is a process of searching for compounds that interact with specific therapeutic targets and have good absorption, distribution, metabolism and excretion properties. Statistically, about 30% of new drug development failures are due to safety issues. To improve the speed and success rate of drug development, the toxicity of a compound should be evaluated at an early stage so that strongly toxic compounds can be excluded as soon as possible. Traditional drug toxicity prediction relies mainly on toxicology experiments performed in vivo on animals. Because the effect of a drug has to be verified on living animals, the traditional approach suffers from long cycles, high cost and heavy consumption of living animals. In addition, regulatory requirements on safety, environmental protection and animal protection are increasingly strict, global market competition keeps shortening the drug development cycle, and the resources invested in drug development keep growing. The shortcomings of the traditional approach, together with these trends, pose challenges for drug developers. Research on efficient and accurate computer-based drug toxicity prediction methods is therefore of great significance for improving the success rate of new drug research and development and for reducing its cost, and has become a frontier topic of common concern in disciplines such as toxicology, pharmaceutical analysis, computational chemistry and systems biology.

Disclosure of Invention

In view of the above-mentioned shortcomings of the prior art, the present invention aims to provide a modeling method and device for a compound toxicity prediction model and application thereof.

The invention provides a modeling method of a compound toxicity prediction model, which at least comprises the following steps:

step S101, providing toxicity classification labels of candidate modeling compounds;

step S102, providing molecular descriptors of each candidate modeling compound;

step S103, providing a target protein descriptor of each candidate modeling compound;

step S104, providing quantitative high-throughput screening analysis descriptors of the candidate modeling compounds, wherein the quantitative high-throughput screening analysis descriptors refer to PubChem activity scores of quantitative high-throughput screening;

step S105, constructing and training a compound toxicity prediction model: reserving each candidate modeling compound simultaneously provided with all descriptors and toxicity classification labels as the modeling compound, constructing a model input training data set, wherein the input training data set comprises all descriptor characteristics and toxicity classification labels of the modeling compound, and constructing and training a compound toxicity prediction model by utilizing a machine learning algorithm based on ensemble learning; the overall descriptors refer to molecular descriptors, target protein descriptors, and quantitative high throughput screening assay descriptors.

The second aspect of the present invention provides a modeling apparatus for a compound toxicity prediction model, comprising at least the following modules:

a toxicity classification label providing module: for providing a toxicity classification label of each candidate modeling compound;

a molecular descriptor providing module: for providing a molecular descriptor of each candidate modeling compound;

a target protein descriptor providing module: for providing a target protein descriptor of each candidate modeling compound;

a quantitative high-throughput screening analysis descriptor providing module: for providing a quantitative high-throughput screening analysis descriptor of each candidate modeling compound, the quantitative high-throughput screening analysis descriptor being a PubChem activity score of quantitative high-throughput screening;

a compound toxicity prediction model construction and training module: for retaining, as modeling compounds, the candidate modeling compounds that have all descriptors and a toxicity classification label at the same time, constructing a model input training data set comprising all descriptor features and toxicity classification labels of the modeling compounds, and constructing and training a compound toxicity prediction model by using a machine learning algorithm based on ensemble learning; "all descriptors" refers to the molecular descriptors, the target protein descriptors and the quantitative high-throughput screening analysis descriptors.

A third aspect of the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the aforementioned modeling method for a compound toxicity prediction model.

A fourth aspect of the present invention provides a computer processing device, comprising a processor and the aforementioned computer readable storage medium, wherein the processor executes a computer program on the computer readable storage medium to implement the steps of the aforementioned modeling method for a compound toxicity prediction model.

A fifth aspect of the present invention provides an electronic terminal, comprising: a processor, a memory, and a communicator; the memory is used for storing a computer program, the communicator is used for being in communication connection with an external device, and the processor is used for executing the computer program stored by the memory so as to enable the terminal to execute the modeling method of the compound toxicity prediction model.

The sixth aspect of the present invention provides a method for predicting drug toxicity, comprising the step of: carrying out toxicity prediction on the drug to be tested by using a compound toxicity prediction model, wherein the compound toxicity prediction model is obtained by the aforementioned modeling method of the compound toxicity prediction model or by the aforementioned modeling device of the compound toxicity prediction model.

As described above, the modeling method, device and application of the compound toxicity prediction model of the present invention have the following beneficial effects:

1) The invention uses molecular descriptors, target protein descriptors and quantitative high-throughput screening analysis features as toxicity prediction indexes, so that drug toxicity prediction results can be explained from multiple aspects such as physicochemical properties, target protein action and biological activity, which helps relevant personnel understand drug toxicity mechanisms for subsequent improvement and research.

2) The invention evaluates drug toxicity with three independent types of feature indexes, thereby obtaining a more reliable toxicity prediction result and reducing the false positive rate and the false negative rate.

3) The invention constructs an efficient, high-performance drug toxicity prediction system by using an advanced machine learning algorithm based on ensemble learning, together with feature screening and hyper-parameter optimization of the prediction model.

The invention provides a method for predicting drug toxicity that integrates various physicochemical and biological properties of a compound with a machine learning algorithm based on ensemble learning. The prediction efficiency and prediction accuracy of the drug toxicity prediction system are improved, the future development requirements of the pharmaceutical industry are met, and the development cycle and development cost can be controlled more effectively.

The invention can fully utilize the physicochemical property, the biological activity and the target protein action property of the drug candidate compound, and simultaneously utilizes the statistical modeling advantage of the machine learning algorithm based on the ensemble learning to construct a prediction system of the drug toxicity, so that the model has interpretability and prediction performance, and has better physicochemical and biological significance and research value.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.

FIG. 1 is a flow chart of a method of an embodiment of the present invention.

FIG. 2 is a diagram of an apparatus according to an embodiment of the present invention.

FIG. 3 is a schematic diagram of an electronic terminal according to an embodiment of the present invention.

FIG. 4 shows the statistical results of the area under the receiver operating characteristic curve (AUC), average accuracy, specificity, sensitivity and correct classification rate for the prediction model of each individual descriptor and for the integrated prediction model according to the inventive method.

FIG. 5 is a graph showing the statistical analysis of the average accuracy for the prediction model of each individual descriptor and for the integrated prediction model according to the inventive method.

FIG. 6 is a graph showing the statistical analysis of the area under the receiver operating characteristic curve (AUC) for the prediction model of each individual descriptor and for the integrated prediction model according to the inventive method.

FIG. 7 is a feature importance ranking of the drug toxicity prediction model according to the inventive method.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.

Furthermore, it is to be understood that one or more method steps mentioned in the present invention does not exclude that other method steps may also be present before or after the combined steps or that other method steps may also be inserted between these explicitly mentioned steps, unless otherwise indicated; it is also to be understood that a combination of one or more steps as referred to in the present invention does not exclude that further steps may be present before or after said combination step or that further steps may be inserted between these two explicitly referred to steps, unless otherwise indicated. Moreover, unless otherwise indicated, the numbering of the various method steps is merely a convenient tool for identifying the various method steps, and is not intended to limit the order in which the method steps are arranged or the scope of the invention in which the invention may be practiced, and changes or modifications in the relative relationship may be made without substantially changing the technical content.

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be designed in a variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present invention, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.

As shown in fig. 1, the modeling method of the compound toxicity prediction model of the present invention at least includes the following steps:

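step S101, providing toxicity classification labels of candidate modeling compounds;

step S102, providing molecular descriptors of each candidate modeling compound;

step S103, providing a target protein descriptor of each candidate modeling compound;

step S104, providing quantitative high-throughput screening analysis descriptors of the candidate modeling compounds, wherein the quantitative high-throughput screening analysis descriptors refer to PubChem activity scores of quantitative high-throughput screening;

step S105, constructing and training a compound toxicity prediction model.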

The compound toxicity prediction model takes all descriptor characteristics of a modeling compound as input, and finally outputs a toxicity classification label to predict the compound toxicity.

The toxicity classification labels comprise two types, wherein GHS toxicity classification labels 1-4 are defined as toxic class labels, and GHS toxicity classification labels 5 or 6 are defined as non-toxic class labels.
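The two-class labelling rule above can be sketched as a small mapping. This is only an illustration of the stated rule; the function and constant names are hypothetical and do not come from the original disclosure.

```python
# Categories as stated in the text: 1-4 are toxic, 5 or 6 are non-toxic.
TOXIC_CATEGORIES = {1, 2, 3, 4}
NON_TOXIC_CATEGORIES = {5, 6}


def binarize_ghs_label(category):
    """Map a GHS toxicity category to 1 (toxic) or 0 (non-toxic)."""
    if category in TOXIC_CATEGORIES:
        return 1
    if category in NON_TOXIC_CATEGORIES:
        return 0
    # Anything else is treated as a data error rather than silently labelled.
    raise ValueError(f"unexpected GHS category: {category!r}")
```

A compound whose category falls outside 1 to 6 raises an error here; whether such entries should instead be dropped is a design choice the text does not specify.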

The GHS toxicity classification labels are the toxicity classification labels assigned to the corresponding compounds under the GHS (Globally Harmonized System of Classification and Labelling of Chemicals), taken from the public databases of the European Chemicals Agency (ECHA), the New Zealand Environmental Protection Authority (NZEPA), the National Institute of Technology and Evaluation of Japan (JPNITE) and Safe Work Australia (SWA).

The data source of the molecular descriptors is the ToxCast-DSSTox dataset of the Tox21 public database.

Specifically, the DSSTox (Distributed Structure-Searchable Toxicity) database maps chemical-related bioassay data and physicochemical property data precisely onto the corresponding chemical structures. DSSTox provides a high-quality public chemistry resource for supporting improved predictive toxicology. The DSSTox database incorporates up-to-date cheminformatics workflows and provides the chemical infrastructure for the EPA's research on safer chemicals, including the ToxCast and Tox21 high-throughput toxicology efforts.

The Tox21 consortium is a federal collaboration that includes the National Toxicology Program (NTP), the National Center for Advancing Translational Sciences (NCATS) and the Food and Drug Administration (FDA), and aims to focus and accelerate progress in this area. ToxCast is short for Toxicity Forecaster, a project that is one of the EPA's major contributions to the Tox21 collaborative effort. The ToxCast project uses high-throughput in vitro screening to identify compounds with potential signs of toxicity; these compounds are then prioritized for in-depth study. Since the U.S. Environmental Protection Agency (EPA) launched the ToxCast™ project in 2007, this chemical screening project has produced a huge amount of data.

The published ToxCast-DSSTox dataset is downloaded from the official website of the U.S. Environmental Protection Agency (https:// epa.

Data cleaning and filling is performed first: compound entries with missing critical data are removed, certain compound types are removed, and missing data are filled in (missing SMILES data are filled in using the PubChem package and the QC detail annotation files).

Data standardization is then carried out. The structural data of the compounds are standardized with the external tool ChemAxon JChem for data integration; an alternative tool is eTox (Python package "standby"). The module outputs four columns of data: SMILES, ID, accession number and molecular analysis.

In step S102, the structure of each candidate modeling compound is converted into a tautomer that stably exists at pH 7.4, and a molecular descriptor is calculated using an open source drug molecular descriptor calculation tool.

Specifically, the ChemAxon JChem 17.25.0 tool can be used to convert the compound structure into the tautomer that stably exists at pH 7.4, and the open source drug molecular descriptor calculation tools PaDEL-Descriptor and RDKit are used to calculate the molecular descriptors, with SMILES, ID, accession number and molecular analysis as input data.

The molecular descriptors comprise the most common physicochemical properties in a machine learning method for establishing a drug toxicity prediction model, including a lipid-water partition coefficient, an apparent partition coefficient, molecular solubility, molecular weight, the number of hydrogen bond donors, the number of hydrogen bond acceptors, the number of rotatable bonds, the number of rings, the number of aromatic rings, the sum of the numbers of oxygen atoms and hydrogen atoms, polar surface area, molecular part polar surface area and molecular surface area, and the like.

All of the above molecular descriptors can be calculated with the open source software tool PaDEL-Descriptor or RDKit. The compound structure is converted into the major tautomer at pH 7.4 with the ChemAxon JChem 17.25.0 tool, 2D molecular descriptors are calculated with PaDEL-Descriptor and RDKit, and molecular descriptors that are all zero or have zero variance are deleted.
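The final cleaning step (deleting descriptors that are all zero or have zero variance) can be sketched without any chemistry toolkit. The sketch assumes the descriptors have already been computed and are held as a mapping from descriptor name to a column of values; the function name and the example descriptor names are hypothetical.

```python
def drop_constant_descriptors(table):
    """Remove descriptor columns that carry no information.

    table: dict mapping descriptor name -> list of values, one per compound.
    A column has zero variance exactly when all of its values are identical,
    which also covers the all-zero case.
    """
    return {
        name: values
        for name, values in table.items()
        if len(set(values)) > 1
    }


# Illustrative values only (not taken from the ToxCast-DSSTox dataset).
example = {
    "MolWt": [180.2, 46.1, 78.1],
    "nBoron": [0, 0, 0],          # all zeros: dropped
    "ConstFlag": [1.5, 1.5, 1.5],  # zero variance: dropped
}
cleaned = drop_constant_descriptors(example)
```

Only the informative column survives in `cleaned`; the same column list would then have to be applied to any held-out compounds.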

In step S103, the target protein descriptors are calculated using an in-house random forest ligand-target prediction algorithm. In one embodiment, the in-house random forest ligand-target prediction algorithm is PIDGIN v2.

PIDGIN v2 (https://github.com/lhm30/PIDGINv2) is a protein target prediction tool trained with the random forest algorithm on PubChem (21/06/16) and ChEMBL 21; for each input compound it provides a Platt-scaled probability of affinity for each target. The target protein descriptors of the candidate modeling compounds are calculated with PIDGIN v2. The target proteins are then screened by requiring a recall greater than 0.5 and a Tanimoto similarity coefficient greater than 0.25, and 109 target protein descriptors are retained.

In step S104, the quantitative high-throughput screening analysis descriptors are derived from the PubChem database, specifically from the Tox21 quantitative high-throughput screening (qHTS) assay data published by PubChem (https://pubchem.ncbi.nlm.nih.gov/assay). qHTS is a major source of computational toxicology data and measures the biological activity of compounds at seven or more concentration levels spanning four orders of magnitude. An initial retrieval yielded PubChem assay data for 192 assays. Assays less relevant to drug toxicity prediction (counter-screen assays, autofluorescence assays and assays of type "other") were filtered out, leaving 76 assays. For these assays, the PubChem activity score (qHTS activity score) provides the quantitative high-throughput screening analysis feature; the score is a continuous numerical descriptor that summarizes the behavior of a compound in the assay record. The activity score ranges from 0 to 100: inactive compounds score 0, active compounds score 40 to 100, and inconclusive compounds score in between. If there are multiple scores due to repeated measurements, the median score is used. A missing value is treated as inactive and assigned a score of 0.
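The two aggregation rules stated above (median over repeated measurements, 0 for a missing assay) can be sketched as follows. The data layout, function name and assay identifiers are assumptions made for illustration; they are not PubChem API objects.

```python
from statistics import median


def qhts_descriptor(scores_by_assay, assay_ids):
    """Build one compound's qHTS feature vector.

    scores_by_assay: dict mapping assay ID -> list of replicate activity
        scores (each in the range 0-100) recorded for this compound.
    assay_ids: ordered list of the retained assay IDs.
    """
    features = []
    for assay_id in assay_ids:
        replicates = scores_by_assay.get(assay_id, [])
        if replicates:
            # Repeated measurements are collapsed to their median.
            features.append(float(median(replicates)))
        else:
            # A missing value is treated as inactive.
            features.append(0.0)
    return features


# Hypothetical assay IDs and scores.
vector = qhts_descriptor({"A1": [40, 80, 60], "A2": [0]}, ["A1", "A2", "A3"])
```

With 76 retained assays, each compound would receive a 76-element vector built this way.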

In step S105, the three sets of descriptor data and the toxicity classification labels may be integrated through matching compound identifiers, and the compounds that have all four types of data are retained as modeling compounds. The compound identifier can be any of various identifiers, such as the CHID in the ToxCast database, the CID in the PubChem database, or the CAS registry number.
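The integration step amounts to an inner join on a shared compound identifier. A minimal sketch, assuming each data source has already been keyed by the same identifier; the function name and the identifiers "C1" to "C3" are hypothetical.

```python
def integrate(labels, mol_desc, target_desc, qhts_desc):
    """Keep only compounds present in all four sources.

    labels: dict compound ID -> toxicity class label
    mol_desc, target_desc, qhts_desc: dict compound ID -> list of features
    Returns dict compound ID -> (concatenated feature list, label).
    """
    shared = set(labels) & set(mol_desc) & set(target_desc) & set(qhts_desc)
    rows = {}
    for cid in sorted(shared):
        features = mol_desc[cid] + target_desc[cid] + qhts_desc[cid]
        rows[cid] = (features, labels[cid])
    return rows


# Toy data: only C1 has all four types of data.
dataset = integrate(
    labels={"C1": 1, "C2": 0, "C3": 1},
    mol_desc={"C1": [1.0, 2.0], "C2": [3.0, 4.0]},
    target_desc={"C1": [0.9], "C2": [0.1], "C3": [0.5]},
    qhts_desc={"C1": [60.0], "C3": [0.0]},
)
```

In practice the identifiers from different databases would first need to be mapped onto one common key, which this sketch takes as given.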

For example, the GHS toxicity classification label is a GHS acute oral toxicity classification label.

In step S105, a toxicity prediction model is constructed and trained by applying the Catboost algorithm.

To check the reliability of the constructed compound toxicity prediction model, in step S105 a part of the compounds having all four types of data can be used as test compounds. In this case, a preliminary data set comprising the three types of descriptor features and the toxicity classification labels is constructed first, and the preliminary data set is randomly divided into the input training data set and a test data set with the train_test_split function of the model_selection module of the open source software tool scikit-learn. The test data set can be used to test the reliability of the compound toxicity prediction model after the model has been constructed. For example, the proportion of the test set is set to 20% and the random seed is set to 2019; for the model verification experiment, the data set is divided 20 times at random with random seeds 0 to 19, giving 20 training data sets and the corresponding test data sets.
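The repeated 80/20 partitioning with seeds 0 to 19 can be illustrated with the standard library alone. This is a stand-in for the scikit-learn function named in the text, not a reimplementation of it: it follows the same idea (seeded shuffle, then cut) but will not reproduce scikit-learn's exact partitions. Function names are hypothetical.

```python
import random


def split_once(compound_ids, test_fraction=0.2, seed=0):
    """Seeded random split into (train_ids, test_ids)."""
    ids = list(compound_ids)
    random.Random(seed).shuffle(ids)  # local RNG, so the split is reproducible
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]


def repeated_splits(compound_ids, n_repeats=20, test_fraction=0.2):
    """One split per seed 0 .. n_repeats-1, as in the verification experiment."""
    return [
        split_once(compound_ids, test_fraction, seed)
        for seed in range(n_repeats)
    ]


splits = repeated_splits(range(100))
```

Each of the 20 (train, test) pairs can then be used to fit and evaluate one model, and the metrics averaged over the repeats.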

In step S105, a machine learning algorithm based on ensemble learning is applied: feature importance is calculated for the input training data set by the prediction value change method, descriptor features with zero importance are deleted, and the features then selected by the recursive feature elimination (RFE) algorithm are used as the final features of the model input training data set.

The features refer to the features of the molecular descriptors, target protein descriptors, and quantitative high throughput screening assay descriptors of the modeled compounds.

Correspondingly, when the test data set is constructed, the same pruning process is performed on the test data set characteristics.

Specifically, the ensemble-learning-based machine learning algorithm CatBoost is applied, and feature importance is calculated for the input training data set by the prediction value change method, as shown in FIG. 7. Descriptor features with zero importance are deleted, the remaining features are screened by the recursive feature elimination (RFE) algorithm to obtain the final features of the model input training data set, and the same deletions are applied to the features of the test data set. CatBoost is a machine learning library open-sourced in 2017 by the Russian search company Yandex; its name combines "Categorical Features" and "Gradient Boosting", and it is a machine learning framework based on gradient boosting decision trees. CatBoost adopts an effective strategy that reduces overfitting while allowing the whole data set to be used for learning: the data set is randomly permuted, and when the average label value of the samples sharing a category value is calculated for a given sample, only the label values of the samples that precede it in the permutation are included.
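The two-stage feature screening (drop zero-importance features, then recursive feature elimination) can be sketched generically. In this sketch `importance_fn` is a hypothetical callback standing in for "refit the model on the current features and return their importances"; the target number of features `n_keep` is likewise an assumption, since the text does not state how many features RFE retains.

```python
def drop_zero_importance(importances):
    """importances: dict feature name -> importance. Keep non-zero ones."""
    return [name for name, value in importances.items() if value != 0]


def recursive_feature_elimination(features, importance_fn, n_keep, step=1):
    """Repeatedly re-score the current feature set and drop the weakest.

    importance_fn(selected) must return a dict feature name -> importance
    for exactly the features in `selected`.
    """
    selected = list(features)
    while len(selected) > n_keep:
        scores = importance_fn(selected)
        n_drop = min(step, len(selected) - n_keep)
        weakest = set(sorted(selected, key=lambda f: scores[f])[:n_drop])
        selected = [f for f in selected if f not in weakest]
    return selected


# Toy importances (illustrative numbers and feature names only).
toy = {"logP": 0.5, "MolWt": 0.0, "Target_17": 0.3, "Assay_7": 0.1, "nRing": 0.05}
nonzero = drop_zero_importance(toy)
final = recursive_feature_elimination(
    nonzero, lambda feats: {f: toy[f] for f in feats}, n_keep=2
)
```

Here the importances are fixed for simplicity; in a real run they would change after each refit, which is the reason for eliminating recursively rather than cutting once. The column list in `final` is what would also be applied to the test set.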

In step S105, before the compound toxicity prediction model is constructed, hyper-parameter optimization is performed on the model parameters involved in constructing the prediction model by using the open source software package Hyperopt, which is based on Bayesian optimization, so as to obtain an optimal model parameter set. The average AUC on the validation set is maximized, and the resulting optimal model parameter set for the input training data set is output and used as the parameters for constructing the compound toxicity prediction model.

Specifically, the training set is first further divided into an optimization training set and an optimization validation set, and the numerical ranges of parameters such as iterations, learning_rate, depth and l2_leaf_reg are set. The average AUC on the optimization validation set is maximized with Hyperopt, and the optimal model parameter set suited to the input training data set is output and used as the parameters for constructing the compound toxicity prediction model. Hyperopt is a tool for tuning parameters through Bayesian optimization, and is fast and effective. In addition, Hyperopt can be combined with MongoDB for distributed parameter tuning, so that relatively good parameters can be found quickly. The dev version can also use simulated annealing, and strategies such as brute-force search and random search are supported as well. Bayesian optimization, also called sequential model-based optimization (SMBO), is one of the most effective function optimization methods. Compared with standard optimization strategies such as the conjugate gradient method, SMBO has the following advantages: it exploits smoothness without calculating gradients; it can handle real-valued, discrete and conditional variables; and it can optimize a large number of variables in parallel.
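The quantity being maximized during hyper-parameter search, the average AUC over validation sets, can be computed as sketched below. The sketch uses the pairwise (Mann-Whitney) formulation of AUC and covers only the objective, not the optimizer; the function names are hypothetical. Since Hyperopt's search minimizes its objective, the negative of this value would be what is handed to it.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_validation_auc(folds):
    """folds: list of (y_true, y_score) pairs, one per validation set."""
    return sum(roc_auc(y, s) for y, s in folds) / len(folds)


def objective_loss(folds):
    """Value to minimize: the negated average validation AUC."""
    return -mean_validation_auc(folds)
```

The pairwise form is quadratic in the number of samples, which is fine for illustration; a rank-based implementation would be preferable for large validation sets.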

In step S105, the compound toxicity prediction model is constructed by applying the CatBoost algorithm: the prediction model is fitted and trained with the final features of the model input training data set and the optimal model parameters, and a five-fold cross-validation procedure within the input training set is used to determine the optimal probability threshold as the decision boundary. The optimal probability threshold is the threshold that maximizes the correct classification rate (CCR).

Further, the prediction model is fitted and trained with the CatBoostClassifier function, using the final features of the model input training data set and the optimal model parameters, and a five-fold cross-validation procedure within the input training set is used to determine the optimal probability threshold as the decision boundary. The optimal probability threshold is the threshold that maximizes the correct classification rate (CCR).
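Threshold selection can be sketched as a scan over candidate probability cut-offs. Two assumptions are made here and are not stated in the text: CCR is taken as the mean of sensitivity and specificity (its usual definition in the QSAR literature), and `proba` is assumed to hold out-of-fold predicted probabilities pooled from the five cross-validation folds. Function names and the candidate grid are hypothetical.

```python
def ccr(y_true, y_pred):
    """Correct classification rate, assumed = (sensitivity + specificity) / 2."""
    pos = sum(1 for y in y_true if y == 1)
    neg = sum(1 for y in y_true if y == 0)
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


def best_threshold(y_true, proba, candidates=None):
    """Return (threshold, CCR) for the cut-off with the highest CCR.

    On ties, the lowest candidate threshold is kept.
    """
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    best_t, best_score = None, -1.0
    for t in candidates:
        preds = [1 if p >= t else 0 for p in proba]
        score = ccr(y_true, preds)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Toy out-of-fold probabilities (illustrative only).
threshold, score = best_threshold(
    [0, 0, 0, 1, 1, 1], [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
)
```

The chosen threshold then replaces the default 0.5 cut-off when the trained model labels new compounds as toxic or non-toxic.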

As shown in fig. 2, the modeling apparatus of the compound toxicity prediction model of the present invention at least includes the following modules:

The molecular descriptor is obtained by adopting the following method: and (3) converting the structure of each candidate modeling compound into a tautomer stably existing at pH 7.4, and calculating by using an open source drug molecular descriptor calculation tool to obtain the molecular descriptor.

The quantitative high throughput screening assay descriptor is derived from the PubChem database.

And in the target protein descriptor calculation module, calculating the target protein descriptor by using an internal random forest ligand-target prediction algorithm.

And in the compound toxicity prediction model construction and training module, constructing and training a toxicity prediction model by applying a Catboost algorithm.

The compound toxicity prediction model construction and training module further comprises a feature screening submodule for the final input training data set: the ensemble-learning-based machine learning algorithm CatBoost is applied, the feature importance of the input training data set is calculated by the prediction value change method, descriptor features with zero importance are deleted, and the features selected by the recursive feature elimination (RFE) algorithm are used as the final features of the model input training data set.

The module further comprises a hyper-parameter optimization submodule, which applies a software package based on Bayesian optimization to perform hyper-parameter optimization on the model parameters involved in constructing the prediction model with the CatBoost algorithm, so as to obtain an optimal model parameter set.

The compound toxicity prediction model construction and training module can also be used for constructing the compound toxicity prediction model by applying the CatBoost algorithm, fitting and training the prediction model with the final features of the model input training data set and the optimal model parameters, and determining the optimal probability threshold as the decision boundary by a five-fold cross-validation procedure within the input training set. The optimal probability threshold is the threshold that maximizes the correct classification rate (CCR).

Since the principle of the apparatus in this embodiment is basically the same as that of the foregoing method embodiment, in the foregoing method and apparatus embodiment, the definitions of the same features, the calculation method, the enumeration of the embodiments, and the enumeration and description of the preferred embodiments may be used interchangeably, and are not repeated again.

It should be noted that the division of the modules of the above apparatus is only a logical division, and the actual implementation may be wholly or partially integrated into one physical entity, or may be physically separated. These modules may all be implemented in software invoked by a processing element; or may be implemented entirely in hardware; and part of the modules can be realized in the form of calling software by the processing element, and part of the modules can be realized in the form of hardware. For example, the obtaining module may be a processing element that is set up separately, or may be implemented by being integrated in a certain chip, or may be stored in a memory in the form of program code, and the certain processing element calls and executes the functions of the obtaining module. Other modules are implemented similarly. In addition, all or part of the modules can be integrated together or can be independently realized. The processing element described herein may be an integrated circuit having signal processing capabilities. In implementation, each step of the above method or each module above may be implemented by an integrated logic circuit of hardware in a processor element or an instruction in the form of software.

For example, the above modules may be one or more integrated circuits configured to implement the above methods, such as one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs), among others. For another example, when one of the above modules is implemented in the form of program code called by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor capable of calling program code. For another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).

In some embodiments of the present invention, there is also provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the aforementioned method of modeling a compound toxicity prediction model.

In some embodiments of the present invention, there is also provided a computer processing apparatus comprising a processor and the aforementioned computer readable storage medium, the processor executing a computer program on the computer readable storage medium to implement the steps of the aforementioned modeling method for a compound toxicity prediction model.

In some embodiments of the present invention, there is also provided an electronic terminal, including: a processor, a memory, and a communicator; the memory is used for storing a computer program, the communicator is used for communicative connection with an external device, and the processor is used for executing the computer program stored in the memory so that the terminal executes the aforementioned modeling method of the compound toxicity prediction model.

Fig. 3 is a schematic diagram of an electronic terminal provided by the present invention. The electronic terminal comprises a processor 31, a memory 32, a communicator 33, a communication interface 34 and a system bus 35; the memory 32 and the communication interface 34 are connected with the processor 31 and the communicator 33 through the system bus 35 so as to communicate with one another, the memory 32 is used for storing computer programs, the communicator 33 and the communication interface 34 are used for communicating with other devices, and the processor 31 and the communicator 33 are used for running the computer programs so that the electronic terminal executes the steps of the modeling method of the compound toxicity prediction model.

The above-mentioned system bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The system bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or one type of bus. The communication interface is used for communication between the electronic terminal and other equipment (such as a client, a read-write library and a read-only library). The memory may include a Random Access Memory (RAM), and may further include a non-volatile memory, such as at least one disk memory.

The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

Those of ordinary skill in the art will understand that all or part of the steps for implementing the above method embodiments may be performed by hardware associated with a computer program. The aforementioned computer program may be stored in a computer readable storage medium. When executed, the program performs the steps of the method embodiments described above. The computer-readable storage medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs (compact disc read-only memories), magneto-optical disks, ROMs (read-only memories), RAMs (random access memories), EPROMs (erasable programmable read-only memories), EEPROMs (electrically erasable programmable read-only memories), magnetic or optical cards, flash memory, or other types of media or machine-readable media suitable for storing machine-executable instructions. The computer readable storage medium may be a product that has not yet been connected to a computer device, or a component of a computer device to which it is already connected and by which it is used.

In particular implementations, the computer programs are routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.

The method for predicting drug toxicity provided by the invention comprises the following step: predicting the toxicity of the drug to be tested by using a compound toxicity prediction model, wherein the compound toxicity prediction model is constructed by the aforementioned modeling method of the compound toxicity prediction model or by the aforementioned modeling device of the compound toxicity prediction model.

Further, the constructed prediction model is applied to the feature data of the test set to predict the toxicity classification labels of the corresponding drug candidate compounds, and a bar chart is drawn of the following statistics: the area under the receiver operating characteristic (ROC) curve (AUC), the average precision, the specificity, the sensitivity and the correct classification rate (CCR).
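The five statistics named above can be computed from test-set labels and predicted probabilities as in the following sketch. It is illustrative only: CCR is assumed to be the mean of sensitivity and specificity, the function names are hypothetical, and both classes are assumed to be present in the test set. Ties between scores are not treated specially in the average precision.

```python
from typing import Dict, Sequence


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (toxic, non-toxic) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values measured at the rank of each toxic compound."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def evaluate(y_true: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> Dict[str, float]:
    """Collect the five statistics for one test set (1 = toxic, 0 = non-toxic)."""
    y_pred = [int(s >= threshold) for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "auc": roc_auc(y_true, scores),
        "average_precision": average_precision(y_true, scores),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ccr": 0.5 * (sensitivity + specificity),  # assumed CCR definition
    }


report = evaluate([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1], threshold=0.5)
```

In practice the `threshold` argument would be the optimal probability threshold determined on the training set rather than the default of 0.5.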

The prediction model integrating the three descriptors was compared with prediction models that use only one descriptor for drug toxicity. As shown in fig. 4, mc_tp_qHTS represents the prediction model integrating the three descriptors, molecular represents the prediction model using only molecular descriptors, target protein represents the prediction model using only target protein descriptors, and qHTS represents the prediction model using only quantitative high-throughput screening analysis data. The comparison of the receiver operating characteristic curves is shown in fig. 5, and the comparison of the average precision curves is shown in fig. 6. To further verify the reliability of the prediction model construction method of the present invention, the workflow following the data set division was repeated on 20 random divisions of the input data set, and the mean values of the area under the receiver operating characteristic curve, the average precision, the specificity, the sensitivity and the correct classification rate were computed, as shown in table 1. The prediction model integrating the three descriptors equals or exceeds every single-descriptor model on each performance index in table 1. The embodiment of the method thus gives the prediction model higher prediction performance and improves the interpretability of the prediction result.

Table 1. Mean performance over 20 random divisions of the input data set

Descriptors	AUC of ROC	Average precision	Sensitivity	Specificity	CCR
mc_tp_qHTS	0.92	0.83	0.90	0.81	0.84
molecular	0.92	0.83	0.89	0.76	0.83
target protein	0.85	0.71	0.69	0.75	0.78
qHTS	0.60	0.40	0.58	0.58	0.58
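The repeated evaluation behind the averaged figures can be sketched as below. This is a hedged illustration: the test fraction of 0.2 is an assumption (the division ratio is not restated here), and `evaluate_split` is a hypothetical callable that would train the model on the training indices and return a dictionary of statistics for the test indices.

```python
import random
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: (train_indices, test_indices) -> {statistic name: value}.
EvaluateSplit = Callable[[List[int], List[int]], Dict[str, float]]


def random_split(n: int, test_fraction: float, seed: int) -> Tuple[List[int], List[int]]:
    """Randomly divide n sample indices into disjoint training and test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]


def repeated_evaluation(n_samples: int, evaluate_split: EvaluateSplit,
                        n_repeats: int = 20,
                        test_fraction: float = 0.2) -> Dict[str, float]:
    """Average each statistic over n_repeats independent random divisions."""
    runs = [evaluate_split(*random_split(n_samples, test_fraction, seed))
            for seed in range(n_repeats)]
    return {name: mean(run[name] for run in runs) for name in runs[0]}


# Toy stand-in that only reports split sizes, to show the calling convention.
def toy_evaluate(train: List[int], test: List[int]) -> Dict[str, float]:
    return {"n_test": len(test), "overlap": len(set(train) & set(test))}


summary = repeated_evaluation(100, toy_evaluate)
```

Using a different seed for each repetition gives 20 distinct divisions while keeping the whole procedure reproducible.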

In conclusion, the invention provides a method for predicting drug toxicity that integrates multiple physicochemical and biological properties of a compound and uses a machine learning algorithm based on ensemble learning. The method comprises the following steps: acquiring the physicochemical structure data, the target protein data and the quantitative high-throughput screening analysis data of the compound corresponding to a drug, together with the corresponding GHS (Globally Harmonized System of Classification and Labelling of Chemicals) toxicity classification labels; processing the data and calculating the corresponding descriptor data respectively; integrating the three sets of descriptor data by means of multiple compound identifiers; performing feature screening on the descriptor data by applying the ensemble-learning-based machine learning algorithm Catboost and a recursive feature elimination algorithm; after model hyper-parameter optimization, constructing and training a toxicity prediction model by using the Catboost algorithm; and, after the data of the drug to be tested is acquired, processing it in the same way and classifying it with the trained toxicity prediction model, wherein the classification result represents the toxicity of the drug to be tested. The method integrates the molecular descriptors, the target protein descriptors and the quantitative high-throughput screening analysis features, so that the model has both interpretability and prediction performance and has better chemical and biological significance and value.
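The feature screening step summarized above (removing zero-importance descriptors, then recursive feature elimination) can be sketched as follows. This is an illustration under assumptions rather than the patented procedure: `importance_fn` is a hypothetical callable standing in for refitting a Catboost model on the given features and reading back its prediction-value-change importances, the stopping size `n_keep` is an assumed parameter, and the feature names are invented for the example.

```python
from typing import Callable, Dict, List

# Hypothetical interface: refit on the given features, return importance per feature.
ImportanceFn = Callable[[List[str]], Dict[str, float]]


def screen_features(features: List[str], importance_fn: ImportanceFn,
                    n_keep: int) -> List[str]:
    """Drop zero-importance features, then eliminate the weakest one per round."""
    # Step 1: remove descriptor features whose importance is zero.
    scores = importance_fn(list(features))
    kept = [f for f in features if scores[f] > 0.0]
    # Step 2: recursive feature elimination; importances are recomputed on the
    # surviving features each round because they can shift after a removal.
    while len(kept) > n_keep:
        scores = importance_fn(kept)
        weakest = min(kept, key=lambda f: scores[f])
        kept.remove(weakest)
    return kept


# Toy importance table with invented feature names (not from the source).
toy_table = {"logP": 3.0, "mol_weight": 1.5, "qhts_assay_1": 0.0, "target_A": 0.7}


def toy_importance(feats: List[str]) -> Dict[str, float]:
    return {f: toy_table[f] for f in feats}


selected = screen_features(list(toy_table), toy_importance, n_keep=2)
```

A fixed target size is only one possible stopping rule; recursive elimination can equally stop when a cross-validated score no longer improves.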

The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims

1. A method of modeling a compound toxicity prediction model, comprising at least the steps of:

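step S101, providing toxicity classification labels of candidate modeling compounds;

step S102, providing molecular descriptors of each candidate modeling compound;

step S103, providing a target protein descriptor of each candidate modeling compound;

step S104, providing quantitative high-throughput screening analysis descriptors of the candidate modeling compounds, wherein the quantitative high-throughput screening analysis descriptors refer to PubChem activity scores of quantitative high-throughput screening;

step S105, constructing and training a compound toxicity prediction model.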

2. A method of modeling a compound toxicity prediction model according to claim 1, further comprising one or more of the following features:

a. the toxicity classification labels comprise two types, wherein GHS toxicity classification labels 1-4 are defined as toxic labels, and GHS toxicity classification labels 5 or 6 are defined as non-toxic labels;

b. the data source of the molecular descriptor is a ToxConst-DSTox data set of a Tox21 public database;

c. the quantitative high throughput screening analysis descriptor is derived from a PubChem database;

d. in step S103, calculating a target protein descriptor by using an internal random forest ligand-target prediction algorithm;

e. in step S105, a toxicity prediction model is constructed and trained by applying the Catboost algorithm.

3. A method of modeling a compound toxicity prediction model according to claim 2, wherein the molecular descriptors are obtained by the following method: converting the structure of each candidate modeling compound into the tautomer that is stable at pH 7.4, and calculating the molecular descriptors with an open source drug molecular descriptor calculation tool.

4. A method of modeling a compound toxicity prediction model according to claim 3, further comprising one or more of the following features in step S105:

f. applying the ensemble-learning-based machine learning algorithm Catboost, calculating the feature importance of the input training data set by the prediction value change method, deleting descriptor features with zero importance, and taking the features screened out by the recursive feature elimination (RFE) algorithm as the features of the final input training data set of the model;

g. applying a Bayesian-optimization-based software package to perform hyper-parameter optimization on the model parameters involved in constructing the prediction model with the Catboost algorithm, so as to obtain an optimal model parameter set.

5. The modeling method of the compound toxicity prediction model according to claim 4, wherein the Catboost algorithm is applied to construct the compound toxicity prediction model, the prediction model is fitted and trained with the features of the final input training data set and the optimal model parameters, and a five-fold cross-validation procedure on the input training set is used to determine the optimal probability threshold as the decision boundary.

6. A modeling apparatus for a compound toxicity prediction model, comprising at least the following modules:

7. The modeling apparatus of a compound toxicity prediction model of claim 6, further comprising one or more of the following features:

d. in the target protein descriptor calculation module, calculating a target protein descriptor by using an internal random forest ligand-target prediction algorithm;

e. in the compound toxicity prediction model construction and training module, constructing and training a toxicity prediction model by applying the Catboost algorithm.

8. The modeling apparatus of a compound toxicity prediction model according to claim 6, wherein the molecular descriptors are obtained by the following method: converting the structure of each candidate modeling compound into the tautomer that is stable at pH 7.4, and calculating the molecular descriptors with an open source drug molecular descriptor calculation tool.

9. The modeling apparatus for a compound toxicity prediction model according to any one of claims 6-8, wherein the compound toxicity prediction model construction and training module further comprises one or more sub-modules for the following features:

f. a feature screening submodule, used for applying the ensemble-learning-based machine learning algorithm Catboost, calculating the feature importance of the input training data set by the prediction value change method, deleting descriptor features with zero importance, and taking the features screened out by the recursive feature elimination (RFE) algorithm as the features of the final input training data set of the model;

g. a hyper-parameter optimization submodule, used for applying a Bayesian-optimization-based software package to perform hyper-parameter optimization on the model parameters involved in constructing the prediction model with the Catboost algorithm, so as to obtain an optimal model parameter set.

10. The modeling apparatus for a compound toxicity prediction model according to claim 9, wherein the compound toxicity prediction model construction and training module is further configured to apply the Catboost algorithm to construct the compound toxicity prediction model, to fit and train the prediction model with the features of the final input training data set and the optimal model parameters, and to determine the optimal probability threshold as the decision boundary by using a five-fold cross-validation procedure on the input training set.

11. A computer-readable storage medium on which a computer program is stored, which program, when executed by a processor, implements a method of modeling a compound toxicity prediction model according to any one of claims 1 to 5.

12. A computer processing apparatus comprising a processor and a computer readable storage medium according to claim 11, wherein the processor executes a computer program on the computer readable storage medium to perform the steps of the method for modeling a compound toxicity prediction model according to any one of claims 1-5.

13. An electronic terminal, comprising: a processor, a memory, and a communicator; the memory is configured to store a computer program, the communicator is configured to communicatively couple with an external device, and the processor is configured to execute the computer program stored in the memory to cause the terminal to perform the method of modeling a compound toxicity prediction model according to any one of claims 1-5.

14. A method for predicting drug toxicity, comprising the steps of: predicting the toxicity of a test drug by using a compound toxicity prediction model constructed by the method for modeling a compound toxicity prediction model according to any one of claims 1 to 5 or the device for modeling a compound toxicity prediction model according to any one of claims 6 to 10.