CN110277173A

CN110277173A - BiGRU drug toxicity prediction system and prediction method based on Smi2Vec

Info

Publication number: CN110277173A
Application number: CN201910423330.4A
Authority: CN
Inventors: 全哲; 林轩; 阳王东; 陈岑; ***; 李肯立; 李克勤
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2019-05-21
Filing date: 2019-05-21
Publication date: 2019-09-24

Abstract

The present invention provides a BiGRU drug toxicity prediction system and prediction method based on Smi2Vec. The system comprises: a Smi2Vec module for converting molecular features into atom vectors; a BiGRU drug toxicity classification model for training on the atom vectors, arranged at the output of the Smi2Vec module and comprising 1 embedding layer, 1 BiGRU layer, 2 pooling layers and 2 dense layers; and a classifier for generating the output labels of the classification task, arranged at the output of the BiGRU drug toxicity classification model. Compared with the related art, the BiGRU drug toxicity prediction system and prediction method based on Smi2Vec provided by the invention can meet requirements of high stability and high accuracy.

Description

BiGRU drug toxicity prediction system and prediction method based on Smi2Vec

[technical field]

The present invention relates to the field of drug property prediction, and in particular to a BiGRU drug toxicity prediction system and prediction method based on Smi2Vec.

[background technique]

Drug design and development consumes large amounts of manpower, material and financial resources. Even when biological or chemical research demonstrates that a specific molecule achieves a certain therapeutic effect, newly discovered molecules often cannot be developed into new drugs because of problems such as toxicity, low activity and low solubility, so that all previous effort is wasted.

Traditional neural networks were once widely used to predict drug properties such as bioactivity, toxicity and water solubility. However, these methods suffer from inefficient algorithms, difficulty with batch training and a tendency to overfit.

In the related art, computer methods are used to help screen out molecular structures that clearly fail to meet requirements. Because computer virtual screening is not limited by the availability of samples, performing virtual screening early in drug development and only then carrying out pharmacological testing makes the development process more scientific and rational than conventional approaches, greatly shortens the development cycle of new drugs and reduces development cost.

The mainstream direction in lead compound discovery is the study of quantitative structure-activity relationships (QSAR) of molecules. The QSAR methods in common use today are mainly two-dimensional QSAR (2D-QSAR), three-dimensional QSAR (3D-QSAR) and four-dimensional QSAR (4D-QSAR), and all three are limited by their own characteristics. Methods based on big data analysis and machine learning require large amounts of data and place high demands on the distribution of positive and negative samples; conventional machine learning methods spend a great deal of time on sample collection, classification and training; and the supervised and unsupervised machine learning algorithms above not only require large amounts of data but also need chemometrics software to compute molecular features, which likewise takes considerable time.

Therefore, it is necessary to provide a new BiGRU drug toxicity prediction system and prediction method based on Smi2Vec to solve the above problems.

[summary of the invention]

The technical problem to be solved by the present invention is that the various drug property prediction methods in the prior art are all limited by their own characteristics: methods based on big data analysis and machine learning require large amounts of data and place high demands on the distribution of positive and negative samples; conventional machine learning methods spend a great deal of time on sample collection, classification and training; and the supervised and unsupervised machine learning algorithms above not only require large amounts of data but also need chemometrics software to compute molecular features, which likewise takes considerable time.

The present invention solves the above technical problem through the following technical solutions:

The present invention provides a BiGRU drug toxicity prediction system based on Smi2Vec, comprising:

a Smi2Vec module for converting molecular features into atom vectors;

a BiGRU drug toxicity classification model for training on the atom vectors, arranged at the output of the Smi2Vec module and comprising 1 embedding layer, 1 BiGRU layer, 2 pooling layers and 2 dense layers;

and a classifier for generating the output labels of the classification task, arranged at the output of the BiGRU drug toxicity classification model.

Preferably, the embedding layer is arranged at the output of the Smi2Vec module, and the classifier is arranged at the output of the dense layers.

The present invention also provides a BiGRU drug toxicity prediction method based on Smi2Vec, comprising:

Step S1: constructing a data set, the data set comprising a training set, a test set and a development set;

Step S2: Smi2Vec conversion: converting, by the Smi2Vec module, the molecular features in SMILES format in the training set into atom vectors;

Step S3: constructing a BiGRU drug toxicity classification model comprising 1 embedding layer, 1 BiGRU layer, 2 pooling layers and 2 dense layers;

Step S4: inputting the atom vectors into the BiGRU drug toxicity classification model to train the model;

Step S5: sending the training result of the BiGRU drug toxicity classification model to the classifier, the classifier optimizing the loss function and then feeding the training result back into the BiGRU drug toxicity classification model for further training;

Step S6: completing the training of the BiGRU drug toxicity classification model after multiple iterations;

Step S7: applying the Smi2Vec conversion to the data in the test set and inputting the converted result into the BiGRU drug toxicity classification model to obtain a test result;

Step S8: analyzing and discussing the test result.

Preferably, the data set consists of the training set (80%) and the test set (20%).

Preferably, the data set consists of the training set (80%), the test set (10%) and the development set (10%).

Preferably, step S2 comprises the following steps:

Step S21: splitting the molecules in SMILES format into individual atoms and extracting the features of the atoms;

Step S22: encoding the split atoms one by one with the one-hot encoding method, thereby converting the SMILES molecules into atom vectors;

Step S23: constructing a mapping function: pre-training on the SMILES molecules in the training set with the open-source Word2Vec tool to generate a dictionary, and finding the corresponding sample vector by dictionary lookup; if the dictionary lacks the corresponding sample vector, a vector is randomly generated to match it.

Preferably, in step S4, the atom vectors pass in sequence through the embedding layer, the BiGRU layer, the pooling layers and the dense layers so as to train the BiGRU drug toxicity classification model.

Preferably, in step S5, the training result of the dense layers is sent to the classifier, and after optimizing the loss function the classifier feeds the training result back into the embedding layer to continue training the BiGRU drug toxicity classification model.

Preferably, in step S6, training of the BiGRU drug toxicity classification model is complete when 100 iterations have been carried out or when the result of the iterative calculation no longer changes for 5 consecutive iterations.

Compared with the related art, the present invention proposes a BiGRU drug toxicity prediction system and prediction method based on Smi2Vec. The Smi2Vec module converts SMILES molecular features into atom vectors, offering an approach to molecular feature conversion with short conversion time and high conversion efficiency. In addition, in a comparison with several common conventional machine learning models on the Tox21 data set, the system provided by the invention outperforms the conventional models and can meet requirements of high stability and high accuracy. Furthermore, the system places low demands on the distribution of positive and negative samples and requires little time for sample collection, classification and training.

[brief description of the drawings]

To describe the technical solutions in the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. Evidently, the drawings described below are only some embodiments of the invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort. In the drawings:

Fig. 1 is a block diagram of the BiGRU drug toxicity prediction system based on Smi2Vec provided by the present invention;

Fig. 2 is a flow chart of the BiGRU drug toxicity prediction method based on Smi2Vec corresponding to Fig. 1;

Fig. 3 is a computation block diagram of Smi2Vec;

Fig. 4 is a diagram of the working principle of the atom vectors;

Fig. 5 is a diagram of the main framework of the BiGRU drug toxicity classification model;

Fig. 6 compares the BiGRU drug toxicity prediction method based on Smi2Vec provided by the present invention with the conventional molecular characterization method ECFP.

[specific embodiment]

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Evidently, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

Referring to Fig. 1, which is a block diagram of the BiGRU drug toxicity prediction system based on Smi2Vec provided by the present invention, the system comprises a Smi2Vec module, a BiGRU (bidirectional gated recurrent neural network) drug toxicity classification model and a classifier, wherein:

The Smi2Vec module converts molecular features into atom vectors; specifically, it converts molecular features in SMILES (Simplified Molecular Input Line Entry Specification) format into atom vectors;

The BiGRU drug toxicity classification model is used for training on the atom vectors and is arranged at the output of the Smi2Vec module; it comprises 1 embedding layer, 1 BiGRU layer, 2 pooling layers and 2 dense layers;

The classifier generates the output labels of the classification task and is arranged at the output of the BiGRU drug toxicity classification model.

Specifically, the embedding layer is arranged at the output of the Smi2Vec module, and the classifier is arranged at the output of the dense layers.

Referring to Figs. 2-5: Fig. 2 is a flow chart of the BiGRU drug toxicity prediction method based on Smi2Vec corresponding to Fig. 1; Fig. 3 is a computation block diagram of Smi2Vec; Fig. 4 is a diagram of the working principle of the atom vectors; Fig. 5 is a diagram of the main framework of the BiGRU drug toxicity classification model.

The present invention also provides a BiGRU drug toxicity prediction method based on Smi2Vec, comprising:

Step S1: constructing a data set: the data set consists of a training set (80%) and a test set (20%); alternatively, it may consist of a training set (80%), a development set (10%) and a test set (10%). Drugs in the data set that affect a receptor are labeled 1 and treated as positive samples, and drugs with no effect are treated as negative samples; the negative samples are removed so as to reject interfering data and reduce the influence of noise in the data set;
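The split described in step S1 can be illustrated with a minimal sketch. The function name `split_dataset`, the fixed random seed and the toy records are assumptions made for illustration; the patent does not state how the split is drawn.

```python
import random

def split_dataset(records, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle records and split them into train/dev/test parts by the given ratios."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n = len(shuffled)
    n_train = int(n * ratios[0])
    n_dev = int(n * ratios[1])
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]      # whatever remains goes to the test set
    return train, dev, test

# Hypothetical toy records: (SMILES string, binary label).
records = [("C" * (i + 1), i % 2) for i in range(10)]
train, dev, test = split_dataset(records)                        # 80/10/10
train2, dev2, test2 = split_dataset(records, (0.8, 0.0, 0.2))    # 80/20, no development set
```

Passing a zero ratio for the development set reproduces the two-way 80/20 variant with the same function.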

Step S2: Smi2Vec conversion: converting, by the Smi2Vec module, the molecules in SMILES format in the training set into vectors;

The specific Smi2Vec conversion process is as follows:

Step 2.1: the molecules in SMILES format in the training set are split into individual atoms; atomic groups that occur as functional groups are extracted by matching and lookup and are treated as single atoms in the calculation.

The data for the split atoms is then collected and counted, giving the following features: ['c', 'C', '(', ')', 'O', '=', 'N', '[', ']', 'n', 'H', '/', '-', 'S', 'Cl', '@@', '@', 'F', '+', ' ', 's', '#', 'o', 'Br', 'P', ' ', 'I', 'Si', '%', 'Sn', 'As', 'Se', '*', 'Hg', 'B', 'Pt', 'e', 'Au', 'Ge', 'Cu', 'Na', 'Fe', 'Sb', 'T', 'R', 'Co', 'i', 'Pd', 'Zn', 'Pb', 'M', 'a', 'Cd', 'Ni', 'A', 'V', 'd', 'Ag', 'K', 'G', 'r', 'Al', 'p', 'L', 'u', 'Ca', 't', 'Cr', 'Mn', 'h', 'Li', 'Mg', 'Tl', 'Ti', 'W', 'In', 'Zr', 'b']. These features cover the common elements together with symbols for special bonds, brackets, special molecules and ions; digits and the decimal point are ignored. The result is a dictionary containing all the counted features in the molecules, whose values are the number of occurrences of each molecule or character;
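As a rough illustration of the splitting and counting in step 2.1, the sketch below tokenizes SMILES strings with a greedy longest-match rule and builds the frequency dictionary. The names `tokenize_smiles` and `token_frequencies` and the short `MULTI_CHAR_TOKENS` list are assumptions for illustration; the patent does not disclose its exact splitting rules.

```python
from collections import Counter

# Multi-character tokens are tried before single characters (greedy longest match).
# This short list is an illustrative subset, not the patent's full inventory.
MULTI_CHAR_TOKENS = ("@@", "Cl", "Br", "Si", "Se", "Na")

def tokenize_smiles(smiles):
    """Split a SMILES string into atom/symbol tokens, skipping digits and '.'."""
    tokens = []
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch.isdigit() or ch == ".":
            i += 1  # digits and the dot are ignored, as the description states
            continue
        two = smiles[i:i + 2]
        if two in MULTI_CHAR_TOKENS:
            tokens.append(two)
            i += 2
        else:
            tokens.append(ch)
            i += 1
    return tokens

def token_frequencies(smiles_list):
    """Count how often each token occurs across a list of SMILES strings."""
    counts = Counter()
    for smiles in smiles_list:
        counts.update(tokenize_smiles(smiles))
    return counts

acetyl_chloride = tokenize_smiles("CC(=O)Cl")
frequencies = token_frequencies(["CCO", "CCl"])
```

The keys of `frequencies` play the role of the feature list, and its values are the occurrence counts.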

Step 2.2: the split atoms are encoded one by one with the one-hot encoding method, converting the SMILES molecules into atom vectors;

Step 2.3: a mapping function is constructed: the molecules that appear as SMILES strings in the training set are pre-trained with the open-source Word2Vec tool to generate a dictionary, and the corresponding atom vector is found by dictionary lookup; if the dictionary lacks the corresponding atom vector, an atom vector is randomly generated to match it;
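Steps 2.2 and 2.3 can be sketched as follows. The sketch does not run Word2Vec itself: the `pretrained` mapping stands in for the dictionary that pre-training would produce, and the class name `EmbeddingDictionary`, the vector dimension and the uniform random range are assumptions for illustration.

```python
import random

def one_hot(tokens, vocabulary):
    """Encode each token as a one-hot list over the given vocabulary."""
    index = {tok: i for i, tok in enumerate(vocabulary)}
    vectors = []
    for tok in tokens:
        vec = [0] * len(vocabulary)
        vec[index[tok]] = 1
        vectors.append(vec)
    return vectors

class EmbeddingDictionary:
    """Token -> vector lookup that generates a random vector for missing tokens."""

    def __init__(self, pretrained, dim, seed=0):
        self.vectors = dict(pretrained)   # stand-in for the pre-trained dictionary
        self.dim = dim
        self.rng = random.Random(seed)

    def lookup(self, token):
        if token not in self.vectors:
            # Missing entry: generate a random vector and remember it, so the
            # same token maps to the same vector on later lookups.
            self.vectors[token] = [self.rng.uniform(-1.0, 1.0) for _ in range(self.dim)]
        return self.vectors[token]

    def encode(self, tokens):
        return [self.lookup(t) for t in tokens]

codes = one_hot(["C", "O"], ["C", "O", "N"])
dictionary = EmbeddingDictionary({"C": [0.1, 0.2]}, dim=2)
encoded = dictionary.encode(["C", "Zr", "C"])   # "Zr" is absent, so it gets a random vector
```

Caching the random vector is a design choice of this sketch; the description only says that a vector is generated at random when the dictionary has no entry.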

Step S3: constructing the BiGRU drug toxicity classification model, which comprises 1 embedding layer, 1 BiGRU layer, 2 pooling layers and 2 dense layers;

Step S4: inputting the vectors from step S2 into the BiGRU drug toxicity classification model for training;

Specifically, the atom vectors pass in sequence through the embedding layer, the BiGRU layer, the pooling layers and the dense layers so as to train the BiGRU drug toxicity classification model.
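The description does not say which kinds of pooling the two pooling layers perform. Pooling is what turns a variable-length sequence of BiGRU hidden states into a fixed-length vector that dense layers can consume; the sketch below assumes, purely for illustration, one global max pooling and one global average pooling whose outputs are concatenated. All function names are hypothetical.

```python
def global_max_pool(sequence):
    """Element-wise maximum over time steps; sequence is a list of equal-length vectors."""
    return [max(column) for column in zip(*sequence)]

def global_average_pool(sequence):
    """Element-wise mean over time steps."""
    return [sum(column) / len(sequence) for column in zip(*sequence)]

def pooled_features(sequence):
    """Concatenate both pooled vectors into one fixed-length feature vector."""
    return global_max_pool(sequence) + global_average_pool(sequence)

# Three time steps with hidden size 2 (toy values).
hidden_states = [[0.0, 2.0], [1.0, -1.0], [2.0, 5.0]]
features = pooled_features(hidden_states)
```

Whatever the sequence length, `features` has twice the hidden size, which is what lets molecules of different lengths share the same dense layers.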

The specific training process is as follows:

Step S41: the input x is the atom vector of a drug;

Step S42: the ground-truth output y is encoded so that [1,0] represents 0 and [0,1] represents 1; each training or test result is a pair of probability values a and b with a+b=1, forming an array [a, b];

Step S43: the BiGRU layer is the key component of the BiGRU drug toxicity classification model. For an input sequence $X = (x_1, x_2, \ldots, x_t)$, the current hidden state of each GRU unit at time $t$ is jointly determined by three parts: the current input $x_t$, the forward hidden-state output at time $t-1$, and the backward hidden-state output. Because a BiGRU can be regarded as two unidirectional GRUs, the hidden state of the BiGRU at time $t$ is obtained as a weighted sum of the forward hidden state and the backward hidden state.

In the corresponding formulas, $\Phi$ and $\sigma$ denote different activation functions; $W$, $W_z$ and $W_r$ denote the corresponding weight matrices; and $b_z$ and $b_r$ denote the biases of the update gate and the reset gate respectively. The update gate $z_t$ controls how the recurrent unit computes its hidden state. When the value of the reset gate $r_t$ is 0, the recurrent unit performs a reset operation and forgets the previously computed state.
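The formulas this step refers to are not reproduced in this text. For orientation, one commonly used GRU formulation that is consistent with the symbols named here (activation functions Φ and σ, update-gate and reset-gate biases b_z and b_r) is sketched below. The recurrent matrices U, U_z and U_r, the combination weights w_t and v_t, and the bias b_t are assumptions introduced for illustration and may differ from the patent's own notation.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) \\
\tilde{h}_t &= \Phi\left(W x_t + U \left(r_t \odot h_{t-1}\right)\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \\
\overrightarrow{h}_t &= \mathrm{GRU}\left(x_t, \overrightarrow{h}_{t-1}\right), \qquad
\overleftarrow{h}_t = \mathrm{GRU}\left(x_t, \overleftarrow{h}_{t+1}\right) \\
h_t^{\mathrm{BiGRU}} &= w_t \overrightarrow{h}_t + v_t \overleftarrow{h}_t + b_t
\end{aligned}
```

In this form, setting the reset gate to 0 removes the previous state from the candidate state, which matches the forgetting behaviour described for the reset gate, and the last line is the weighted sum of the forward and backward hidden states.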

Step S5: the training result of the BiGRU drug toxicity classification model is sent to the classifier, and after optimizing the loss function the classifier feeds it back into the BiGRU drug toxicity classification model for further training.

Specifically, the training result of the dense layers is sent to the classifier, and after optimizing the loss function the classifier feeds the training result back into the embedding layer to continue training the BiGRU drug toxicity classification model.

Preferably, the classification probability $y_i$ is computed with the sigmoid function and compared with the original label to obtain the objective function LOSS, where:

$y_i = \mathrm{sigmoid}(W^i h_t + b_i)$
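The expression for LOSS itself is not reproduced in this text. The sketch below computes the sigmoid probability as in the formula for y_i, arranges it as the [a, b] pair described in the training steps, and compares it with a one-hot label using cross-entropy. Cross-entropy is an assumption here (a common choice for sigmoid classifiers), not a confirmed detail of the patent, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_pair(weights, hidden, bias):
    """Return [a, b] with a + b = 1, where b = sigmoid(w . h + bias) is P(label = 1)."""
    z = sum(w * h for w, h in zip(weights, hidden)) + bias
    b = sigmoid(z)
    return [1.0 - b, b]

def cross_entropy(pair, one_hot_label, eps=1e-12):
    """Cross-entropy between a predicted [a, b] pair and a one-hot label ([1, 0] or [0, 1])."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(pair, one_hot_label))

pair = predict_pair([0.0, 0.0], [1.0, 2.0], 0.0)   # zero weights give an undecided prediction
loss = cross_entropy(pair, [0, 1])                  # true label is 1
```

A confident correct prediction such as [0.1, 0.9] against the label [0, 1] gives a smaller loss than a confident wrong one such as [0.9, 0.1], which is the property an optimizer exploits.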

Step S6: the final trained model is obtained after multiple iterations; specifically, training of the BiGRU drug toxicity classification model is complete when 100 iterations have been carried out or when the result of the iterative calculation no longer changes for 5 consecutive iterations;
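The stopping rule in this step can be sketched as a small loop. The function name `train_with_stopping`, the `run_iteration` callback and the reading of "no longer changes" as "equal to the previous value within a tolerance" are assumptions for illustration.

```python
def train_with_stopping(run_iteration, max_iterations=100, patience=5, tolerance=0.0):
    """Call run_iteration(i) until max_iterations is reached or the returned
    value has stayed unchanged for `patience` consecutive iterations."""
    previous = None
    unchanged = 0
    history = []
    for i in range(max_iterations):
        value = run_iteration(i)
        history.append(value)
        if previous is not None and abs(value - previous) <= tolerance:
            unchanged += 1      # same result as the previous iteration
        else:
            unchanged = 0       # any change resets the counter
        previous = value
        if unchanged >= patience:
            break
    return history

# Hypothetical metric that falls from 10 to 3 and then stays flat.
plateau_history = train_with_stopping(lambda i: float(max(10 - i, 3)))
# Hypothetical metric that keeps changing, so the iteration cap applies.
capped_history = train_with_stopping(lambda i: float(i))
```

In practice `run_iteration` would run one training pass and return the monitored quantity, for example the loss on the development set.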

Step S7: the Smi2Vec conversion is applied to the data in the test set, or in the test set and the development set, and the converted result is input into the trained BiGRU drug toxicity classification model for calculation to obtain a test result;

Step S8: the test result obtained in step S7 is analyzed and discussed.

The performance of the proposed BiGRU drug toxicity prediction system and prediction method based on Smi2Vec is evaluated below.

It should be noted that the data set used in this embodiment for the performance evaluation is the Tox21 data set (Tox21 Data Challenge), which contains data on 8013 compounds that may affect 12 human receptor targets (NR-AR, NR-AR-LBD, NR-AhR, NR-Aromatase, NR-ER, NR-ER-LBD, NR-PPAR-gamma, SR-ARE, SR-ATAD5, SR-HSE, SR-MMP, SR-p53).

First, the performance of the BiGRU drug toxicity prediction system based on Smi2Vec provided by the present invention is evaluated; in this embodiment, every task in the Tox21 data set is tested. This group of experiments mainly presents the results of the conventional Random Forest and SVM machine learning models, because these two models perform better on the Tox21 data set than other conventional methods. Specifically, the Tox21 data set has 12 tasks. As the following table shows, the proposed system overall achieves the best performance on the Tox21 data set. Specifically, across all tasks, the system improves on the conventional Random Forest and SVM models by 12.74%-32.75% on the validation set and by 5%-40.4% on the test set, achieving a high standard of classification.

Next, refer to Fig. 6, which compares the BiGRU drug toxicity prediction method based on Smi2Vec provided by the present invention with the conventional molecular characterization method ECFP on the RF (Random Forest), LR (Logistic Regression), DT (Decision Tree) and KN (K-Nearest Neighbor) models. To show how the molecular characterization method of the invention and the conventional method ECFP perform when trained on the same machine learning models, the two were compared on the Tox21 data set; the results show that the ROC-AUC scores of the proposed method on all 4 models are higher than those of the conventional method.
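The ROC-AUC score used in this comparison can be computed directly from labels and predicted scores. The sketch below uses the pairwise definition (the fraction of positive/negative pairs in which the positive sample receives the higher score, with ties counted as one half); the function name `roc_auc` and the toy data are illustrative and are not taken from the patent's experiments.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive is scored above a random
    negative (ties count one half). labels are 0/1, scores are real numbers."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example: 3 of the 4 positive/negative pairs are ranked correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The double loop is quadratic in the number of samples, which is fine for a sketch; rank-based implementations compute the same value more efficiently. For a multi-task data set such as Tox21, the score would be computed per task.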

The above are only embodiments of the present invention. It should be noted that a person of ordinary skill in the art can make improvements without departing from the inventive concept, and such improvements fall within the protection scope of the present invention.

Claims

1. A BiGRU drug toxicity prediction system based on Smi2Vec, characterized by comprising:

a BiGRU drug toxicity classification model for training on the atom vectors, arranged at the Smi2Vec output and comprising, arranged in sequence, 1 embedding layer, 1 BiGRU layer, 2 pooling layers and 2 dense layers; and

a classifier for generating the output labels of the classification task, arranged at the output of the BiGRU drug toxicity classification model.

2. The BiGRU drug toxicity prediction system based on Smi2Vec according to claim 1, characterized in that the embedding layer is arranged at the output of the Smi2Vec module, and the classifier is arranged at the output of the dense layers.

3. A BiGRU drug toxicity prediction method based on Smi2Vec, characterized by comprising:

Step S2: Smi2Vec conversion: converting, by the Smi2Vec module, the molecular features in SMILES format in the training set into atom vectors;

Step S3: constructing a BiGRU drug toxicity classification model comprising, arranged in sequence, 1 embedding layer, 1 BiGRU layer, 2 pooling layers and 2 dense layers;

Step S4: inputting the atom vectors into the BiGRU drug toxicity classification model to train the model;

Step S5: sending the training result of the BiGRU drug toxicity classification model to the classifier, the classifier optimizing the loss function and then feeding the training result back into the BiGRU drug toxicity classification model for further training;

Step S6: completing the training of the BiGRU drug toxicity classification model after multiple iterations;

Step S7: applying the Smi2Vec conversion to the data in the test set and inputting the converted result into the BiGRU drug toxicity classification model to obtain a test result;

Step S8: analyzing and discussing the test result.

4. The BiGRU drug toxicity prediction method based on Smi2Vec according to claim 3, characterized in that the data set consists of the training set (80%) and the test set (20%).

5. The BiGRU drug toxicity prediction method based on Smi2Vec according to claim 3, characterized in that the data set consists of the training set (80%), the test set (10%) and the development set (10%).

6. The BiGRU drug toxicity prediction method based on Smi2Vec according to claim 3, characterized in that step S2 comprises the following steps:

Step S21: splitting the molecules in SMILES format into individual atoms and extracting the features of the atoms;

Step S22: encoding the split atoms one by one with the one-hot encoding method, thereby converting the SMILES molecules into atom vectors;

Step S23: constructing a mapping function: pre-training on the SMILES molecules in the training set with the open-source Word2Vec tool to generate a dictionary, and finding the corresponding sample vector by dictionary lookup; if the dictionary lacks the corresponding sample vector, a vector is randomly generated to match it.

7. The BiGRU drug toxicity prediction method based on Smi2Vec according to claim 3, characterized in that, in step S4, the atom vectors pass in sequence through the embedding layer, the BiGRU layer, the pooling layers and the dense layers so as to train the BiGRU drug toxicity classification model.

8. The BiGRU drug toxicity prediction method based on Smi2Vec according to claim 3, characterized in that, in step S5, the training result of the dense layers is sent to the classifier, and after optimizing the loss function the classifier feeds the training result back into the embedding layer to continue training the BiGRU drug toxicity classification model.

9. The BiGRU drug toxicity prediction method based on Smi2Vec according to claim 3, characterized in that, in step S6, training of the BiGRU drug toxicity classification model is complete when 100 iterations have been carried out or when the result of the iterative calculation no longer changes for 5 consecutive iterations.