CN112164426A - Drug small molecule target activity prediction method and device based on TextCNN - Google Patents

Drug small molecule target activity prediction method and device based on TextCNN

Info

Publication number
CN112164426A
Authority
CN
China
Prior art keywords
drug
textcnn
layer
model
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010998886.9A
Other languages
Chinese (zh)
Inventor
宋怡然
马元巍
李泽朋
顾徐波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changzhou Weiyizhi Technology Co Ltd
Original Assignee
Changzhou Weiyizhi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou Weiyizhi Technology Co Ltd filed Critical Changzhou Weiyizhi Technology Co Ltd
Priority to CN202010998886.9A priority Critical patent/CN112164426A/en
Publication of CN112164426A publication Critical patent/CN112164426A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30 Prediction of properties of chemical compounds, compositions or mixtures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 Drug targeting using structural data; Docking or binding prediction
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics

Landscapes

  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • General Physics & Mathematics (AREA)
  • Biotechnology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Bioethics (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a method and a device for predicting the activity of drug small molecule targets based on TextCNN, wherein the method comprises the following steps: preprocessing the drug small molecule data to be trained and tested to obtain a natural-language, encoded drug text data set; in the training stage, importing the drug text data in the training set and vectorizing the encoded natural language of the drug text data set at the embedding layer; performing convolution operations on the embedding layer output at the convolutional layer to extract features; applying max pooling to the features extracted by the convolutional layer at the max pooling layer; performing a fully connected operation on the feature vectors obtained from the max pooling layer at the fully connected layer to obtain a classification result; in the testing stage, testing with the drug text data in the test set and adjusting the TextCNN model according to the test results to obtain the final TextCNN model; and predicting the activity of drug small molecule targets with the final TextCNN model, which greatly improves the efficiency of drug research and development.

Description

Drug small molecule target activity prediction method and device based on TextCNN
Technical Field
The invention relates to the technical field of molecular targets, and in particular to a method for predicting the activity of drug small molecule targets based on TextCNN, a device for predicting the activity of drug small molecule targets based on TextCNN, a computer device, and a non-transitory computer-readable storage medium.
Background
The development cycle of a new drug is very long: it comprises target selection and validation, discovery of hit compounds (Hit), discovery of lead compounds (Lead), and discovery and optimization of candidate drugs (Candidate), with layer-by-layer screening and high cost. Virtual screening by computer aims to fully mine the rules behind drug molecules from existing drug-related biochemical databases and accelerate drug discovery and development, for example the discovery and evaluation of hit compounds for new target proteins.
The conventional approach to predicting drug properties performs quantitative structure-activity or structure-property relationship (QSAR/QSPR) modeling using molecular descriptors (1D/2D/3D/high-dimensional descriptors, such as molecular weight and other physicochemical properties) as input features, and more than 5000 molecular descriptors have been developed. However, the predictive performance of algorithmic models built on large numbers of molecular descriptors depends heavily on whether effective descriptor features can be selected. The feature engineering is time-consuming and labor-intensive, and the generalization and universality of the algorithmic model also depend greatly on the quality of the trained model. In addition, the following problems exist: (1) with small data sets, training is prone to overfitting and the generalization ability of the model is weak; the quality of the feature representation determines the performance of a machine learning method, and interdisciplinary bioinformatics research requires not only knowledge of artificial intelligence but also domain experts to help design features, so there is a high entry threshold, and the feature design process consumes time and effort because information with strong discriminative power is very difficult to extract and organize from the data; (2) drug data often suffer from class imbalance.
Disclosure of Invention
To solve the above technical problems, the invention provides a TextCNN-based method for predicting the activity of drug small molecule targets, which can improve the accuracy of virtual screening for drug target prediction and effectively perform drug molecule screening and property prediction; the model has strong generalization and universality and great practical value.
The technical scheme adopted by the invention is as follows:
a drug small molecule target activity prediction method based on TextCNN comprises the following steps: preprocessing drug micromolecule data to be trained and tested to obtain a natural language and coded drug text data set which is divided into a training set and a testing set; in the training stage, importing the medicine text data in a training set, and vectorizing the coded natural language in the medicine text data set in an embedding layer of the TextCNN model; performing convolution operation on the embedded layer at the convolution layer of the TextCNN model to extract features; performing maximum pooling on the features extracted by the convolutional layer at a maximum pooling layer of the TextCNN model; performing full-connection operation on the feature vectors obtained after the maximum pooling layer in a full-connection layer of the TextCNN model to obtain a classification result; in the testing stage, testing is carried out by adopting the text data of the concentrated testing medicines, and the TextCNN model is adjusted according to the testing result to obtain a final TextCNN model; and predicting the activity of the drug small molecule target according to the final TextCNN model.
According to one embodiment of the invention, preprocessing the drug small molecule data to be trained and tested to obtain a natural-language, encoded drug text data set divided into a training set and a test set comprises: acquiring the SMILES strings of all drug small molecules in the drug small molecule data; encoding the SMILES strings of the drug small molecules using one-hot encoding, and dividing the encoded drug text data set into a training set and a test set.
According to an embodiment of the invention, the TextCNN-based method for predicting the activity of drug small molecule targets further comprises: oversampling the minority-class samples in the training set with the SMOTE method to obtain a processed training set; and training the TextCNN model on the processed training set.
According to one embodiment of the invention, vectorizing the encoded natural language of the drug text data set at the embedding layer of the TextCNN model comprises: encoding the SMILES strings into dense vectors so that the embedding layer can be trained.
According to one embodiment of the invention, performing convolution operations on the embedding layer output at the convolutional layer of the TextCNN model to extract features comprises: obtaining a two-dimensional matrix after the SMILES string is processed by the embedding layer; and performing convolution operations on the two-dimensional matrix with a plurality of convolution kernels to obtain a plurality of one-dimensional vectors.
According to one embodiment of the invention, applying max pooling to the features extracted by the convolutional layer at the max pooling layer of the TextCNN model comprises: taking the maximum value of each of the one-dimensional vectors obtained by the convolution operations and concatenating these maxima as the output of the max pooling layer, thereby extracting the features obtained by the convolutional layer and reducing their dimensionality.
According to one embodiment of the invention, in the testing stage, the number of iterations is set and an Adam optimizer is used to optimize the network parameters; in the training stage, training is performed with five-fold cross-validation to reduce model overfitting.
Corresponding to the above method, the invention also provides a TextCNN-based device for predicting the activity of drug small molecule targets, comprising: a preprocessing module, used for preprocessing the drug small molecule data to be trained and tested to obtain a natural-language, encoded drug text data set, which is divided into a training set and a test set; a model training module, used for, in the training stage, importing the drug text data in the training set, vectorizing the encoded natural language of the drug text data set at the embedding layer, performing convolution operations on the embedding layer output at the convolutional layer to extract features, applying max pooling to the features extracted by the convolutional layer at the max pooling layer, and performing a fully connected operation on the feature vectors obtained from the max pooling layer at the fully connected layer to obtain a classification result; a model testing module, used for testing with the drug text data in the test set and adjusting the TextCNN model according to the test results to obtain the final TextCNN model; and a prediction module, used for predicting the activity of drug small molecule targets with the final TextCNN model.
Corresponding to the above method, the invention further provides a computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the above TextCNN-based method for predicting the activity of drug small molecule targets.
Corresponding to the above method, the invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the above TextCNN-based method for predicting the activity of drug small molecule targets.
The invention has the beneficial effects that:
the method processes the characteristics of molecules by a one-hot single-hot coding method, greatly shortens the characteristic extraction time and the model training cost on the premise of not calculating additional molecular structures or physicochemical characteristics and performing characteristic engineering, innovatively uses a textCNN method for text classification, is favorable for improving the accuracy of virtual screening of drug target prediction, effectively performs drug molecule screening and characteristic prediction, has strong generalization and universal capability of the model, and has great practical value.
Drawings
FIG. 1 is a flowchart of a method for predicting the activity of a drug small molecule target based on TextCNN according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating predicted results according to an embodiment of the present invention;
fig. 3 is a block diagram of a device for predicting the activity of a target of a TextCNN-based drug small molecule according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a method for predicting the activity of a drug small molecule target based on TextCNN according to an embodiment of the present invention.
As shown in fig. 1, the method for predicting the activity of a TextCNN-based drug small molecule target of an embodiment of the present invention may include the following steps:
s1, preprocessing the drug small molecule data to be trained and tested to obtain a natural language and coded drug text data set, and dividing the natural language and coded drug text data set into a training set and a testing set.
According to one embodiment of the invention, preprocessing the drug small molecule data to be trained and tested to obtain a natural-language, encoded drug text data set divided into a training set and a test set comprises: acquiring the SMILES strings of all drug small molecules in the drug small molecule data; encoding the SMILES strings of the drug small molecules using one-hot encoding, and dividing the encoded drug text data set into a training set and a test set.
Specifically, the characters of all drug small molecule SMILES strings in the current data set are obtained, the maximum string length and the number of character types are counted, the SMILES structural formula of each drug molecule is encoded with one-hot encoding, and the character positions are used to construct the final vector of the drug molecule.
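As a rough illustration of this preprocessing step, the sketch below builds a character vocabulary from the data set, determines the maximum string length, and one-hot encodes each SMILES string with implicit tail zero-padding. The helper name, the example molecules and the array layout are illustrative assumptions rather than details taken from the patent.

```python
import numpy as np

def one_hot_encode_smiles(smiles_list):
    """One-hot encode a list of SMILES strings (illustrative sketch).

    The character vocabulary and the maximum string length are derived from
    the data set itself, as described above; shorter strings are implicitly
    zero-padded at the tail because the output array starts as all zeros.
    """
    charset = sorted({ch for s in smiles_list for ch in s})   # number of character types
    char_to_idx = {ch: i for i, ch in enumerate(charset)}
    max_len = max(len(s) for s in smiles_list)                # maximum string length

    encoded = np.zeros((len(smiles_list), max_len, len(charset)), dtype=np.float32)
    for i, s in enumerate(smiles_list):
        for j, ch in enumerate(s):
            encoded[i, j, char_to_idx[ch]] = 1.0              # mark the character position
    return encoded, char_to_idx, max_len

# Example with two hypothetical drug-like molecules (ethanol and phenol).
X, vocab, max_len = one_hot_encode_smiles(["CCO", "c1ccccc1O"])
print(X.shape)   # (2, maximum length, vocabulary size)
```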
Further, according to an embodiment of the present invention, the TextCNN-based method for predicting the activity of drug small molecule targets further comprises: oversampling the minority-class samples in the training set with the SMOTE method to obtain a processed training set; and training the TextCNN model on the processed training set.
Specifically, with small data sets training is prone to overfitting, so the minority-class samples in the training set are oversampled with the SMOTE method to obtain the processed training set. In detail, for the training set, when a sample belongs to the minority class, for each minority-class sample x the Euclidean distance from x to all other minority-class samples is computed to obtain the k nearest neighbours of x; a sampling ratio N is set according to the class imbalance ratio, and several samples are randomly selected from the k nearest neighbours of each minority-class sample x to form a set X; a new sample x' is constructed by combining each randomly chosen neighbour with the sample x; for each newly generated sample, the Euclidean distance to all samples is computed to obtain its k nearest neighbours, and if the majority class among these neighbours is not the minority class that generated the sample, the sample is deleted; each sample in the training set X is normalized; and the tail of each individual sample in the training set is zero-padded to fill the sample feature sequence, yielding the processed training set. When a sample does not belong to the minority class, each sample in the training set X is normalized directly, and the tail of each individual sample in the training set is zero-padded to fill the sample feature sequence, yielding the processed training set.
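A minimal sketch of the oversampling step, using the off-the-shelf SMOTE implementation from the imbalanced-learn library as a stand-in for the procedure described above (the patent additionally describes a neighbour-based cleaning step, normalization and zero-padding, which are omitted here); the feature matrix, labels, neighbour count and sampling ratio are placeholders:

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# Placeholder data: one row of flattened features per molecule and an
# imbalanced binary activity label (the minority class has 20 samples).
rng = np.random.default_rng(0)
X_train = rng.random((200, 50)).astype(np.float32)
y_train = np.array([1] * 20 + [0] * 180)

# k_neighbors corresponds to the k nearest neighbours found with the Euclidean
# distance; sampling_strategy plays the role of the sampling ratio N.
smote = SMOTE(k_neighbors=5, sampling_strategy=1.0, random_state=0)
X_res, y_res = smote.fit_resample(X_train, y_train)
print(np.bincount(y_res))   # both classes now have 180 samples
```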
S2, in the training stage, importing the drug text data in the training set and vectorizing the encoded natural language of the drug text data set at the embedding layer of the TextCNN model.
According to one embodiment of the invention, vectorizing the encoded natural language of the drug text data set at the embedding layer of the TextCNN model comprises: encoding the SMILES strings into dense vectors so that the embedding layer can be trained.
Specifically, the input natural language, i.e. the SMILES string, is encoded into a dense vector. Comparing the use of pre-trained word vectors with word vectors trained directly during TextCNN training, the former performs better than the latter. If pre-trained word vectors are used, there are a static approach and a non-static approach: in the former, the word vector parameters are not adjusted while training the TextCNN, while in the latter the word vector parameters are adjusted during training, so the non-static approach gives better results than the static one. A more common practice is not to tune the embedding layer at every batch but once every 100 batches, which reduces training time while still fine-tuning the word vectors.
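A minimal PyTorch sketch of the static and non-static embedding variants described above; the vocabulary size, embedding dimension and the "pre-trained" vectors are placeholders, not values from the patent:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 40, 128                      # assumed |V| and |d|

# Non-static variant: the embedding weights are updated during TextCNN training.
embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)

# Static variant: load pre-trained vectors and freeze them (random stand-ins here).
pretrained = torch.randn(vocab_size, embed_dim)
static_embedding = nn.Embedding.from_pretrained(pretrained, freeze=True)

# A batch of two encoded SMILES strings (character indices, zero-padded).
tokens = torch.tensor([[5, 3, 8, 0, 0], [2, 2, 9, 4, 1]])
dense = embedding(tokens)                            # shape: (batch, |T|, |d|)
print(dense.shape)
```

The schedule of tuning the embedding only once every 100 batches can be approximated by periodically toggling the embedding weights' requires_grad flag at the desired interval.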
S3, performing convolution operations on the embedding layer output at the convolutional layer of the TextCNN model to extract features.
According to one embodiment of the invention, performing convolution operations on the embedding layer output at the convolutional layer of the TextCNN model to extract features comprises the following steps: obtaining a two-dimensional matrix after the SMILES string is processed by the embedding layer; and performing convolution operations on the two-dimensional matrix with a plurality of convolution kernels to obtain a plurality of one-dimensional vectors.
Specifically, the input sentence or text is transformed into a two-dimensional matrix after passing through the embedding layer. Assuming the text length is |T| and the word vector size is |d|, the matrix has size |T| × |d|, and the following convolution operations are performed on this |T| × |d| matrix. The convolution kernel size is generally set to n × |d|, where n is the kernel length and |d| is the kernel width, which equals the word vector dimension, i.e. the convolution slides only along the text sequence; n can be chosen in many ways, for example 2, 3, 4, 5, etc. For a |T| × |d| text, if the kernel size is n × |d|, the result of the convolution is a vector of size (|T| - n + 1) × 1. The TextCNN model uses several kernels of different sizes simultaneously, and there may be multiple kernels of each size. Our model uses one-dimensional convolutional layers with kernel_sizes (2, 3, 4), each with two output channels.
S4, applying max pooling to the features extracted by the convolutional layer at the max pooling layer of the TextCNN model.
According to one embodiment of the invention, applying max pooling to the features extracted by the convolutional layer at the max pooling layer of the TextCNN model comprises the following steps: taking the maximum value of each of the one-dimensional vectors obtained by the convolution operations and concatenating these maxima as the output of the max pooling layer, so as to extract the features obtained by the convolutional layer and reduce their dimensionality.
Specifically, the maximum value is taken from each of the one-dimensional vectors obtained by convolution, and these maxima are concatenated as the output of this layer. If the kernel sizes are 2, 3, 4 and 5 and there are 128 kernels of each size, then 4 × 128 one-dimensional vectors are obtained after convolution; max pooling yields 4 × 128 values, which are concatenated into a final 512 × 1 vector. The significance of the max pooling layer is to extract, from the features produced by the convolutional layer, the features with the highest activation.
S5, performing a fully connected operation on the feature vectors obtained from the max pooling layer at the fully connected layer of the TextCNN model to obtain a classification result.
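Steps S3 to S5 can be sketched as a small PyTorch module, assuming the kernel_sizes (2, 3, 4) and the two output channels per kernel size mentioned above; the vocabulary size, embedding dimension, sequence length and number of classes are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Embedding -> parallel 1-D convolutions with kernel sizes (2, 3, 4) ->
    global max pooling -> concatenation -> fully connected classifier."""

    def __init__(self, vocab_size=40, embed_dim=128, num_channels=2, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_channels, kernel_size=n) for n in (2, 3, 4)]
        )
        self.fc = nn.Linear(num_channels * 3, num_classes)

    def forward(self, tokens):                       # tokens: (batch, |T|)
        x = self.embedding(tokens)                   # (batch, |T|, |d|)
        x = x.transpose(1, 2)                        # Conv1d expects (batch, |d|, |T|)
        feats = []
        for conv in self.convs:
            c = F.relu(conv(x))                      # (batch, channels, |T| - n + 1)
            feats.append(c.max(dim=2).values)        # global max pooling per channel
        x = torch.cat(feats, dim=1)                  # concatenated pooled features
        return self.fc(x)                            # classification logits

model = TextCNN()
logits = model(torch.randint(1, 40, (8, 120)))       # batch of 8 encoded SMILES
print(logits.shape)                                  # torch.Size([8, 2])
```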
S6, in the testing stage, testing with the drug text data in the test set and adjusting the TextCNN model according to the test results to obtain the final TextCNN model. It should be noted that the number of training epochs is determined from the adjusted model performance and the learning curve to prevent the model from overfitting. Assuming the set number of training epochs is 10, training can be stopped early once the model performance and learning curve meet the requirements, in order to prevent overfitting.
In particular, according to the test results, the weights and biases can be adjusted by back-propagation to adjust the TextCNN model. The model is tested with the test set; there is an error between the test results and the actual results, and this error is measured with the following cost function:
C(w, b) = (1 / 2n) Σ_x || y(x) - a ||²
where w denotes the weights, b the biases, n the number of input samples, a the actual output of the network when the input is x (a is a function of w and b), y(x) the expected output for input x, and || · || the norm of a vector. To reduce the error, the cost function must be minimized, i.e. its change must be negative, so that the output predicted value approaches the actual value and eventually gets arbitrarily close to it. The change of the cost function is approximately the partial derivative of the cost with respect to the weights times the change in the weights, plus the partial derivative with respect to the biases times the change in the biases; by setting the changes in the weights and biases to a negative constant times the corresponding partial derivatives, the change in the cost is guaranteed to be negative, i.e. the cost function keeps decreasing, so that the actual output approaches the expected output. The absolute value of that negative constant is the learning rate η. After every set number of training iterations, the currently trained model is tested with the test set, the partial derivatives of the cost function are computed from the test results and the actual results, and the weights and biases are adjusted as:
w → w' = w - η ∂C/∂w
b → b' = b - η ∂C/∂b
thereby adjusting the TextCNN model.
In some embodiments of the invention, the evaluation metrics of the model include one or more of accuracy, precision, recall, area under the ROC curve and the confusion matrix; that is, the number of model adjustments is determined according to these evaluation metrics.
Accuracy is the proportion of correctly predicted samples among all samples. In the embodiment of the invention, the actual prediction data in the historical prediction database are evaluated, i.e. the samples involved in model evaluation are the set of historical actual prediction data. The formula for Accuracy is as follows:
Accuracy=(TP+TN)/(TP+TN+FP+FN)
where TP (True Positive) denotes positive samples predicted as positive by the model, FP (False Positive) negative samples predicted as positive, FN (False Negative) positive samples predicted as negative, and TN (True Negative) negative samples predicted as negative.
Precision, also called the precision ratio, is defined with respect to the prediction results: it is the probability that a sample predicted as positive is actually positive, i.e. how much confidence one can have in a positive prediction. The formula for Precision is as follows:
Precision=TP/(TP+FP)
recall, called Recall, refers to the probability of being predicted as a positive sample among the actual positive samples for the original sample. The formula for Recall rate recalls is as follows:
Recall=TP/(TP+FN)
The ROC (Receiver Operating Characteristic) curve involves two main indicators: the true positive rate TPR, plotted on the ordinate, and the false positive rate FPR, plotted on the abscissa.
The true positive rate TPR (True Positive Rate), also called sensitivity, is as follows:
TPR=TP/(TP+FN)
false Positive rate FPR (false Positive Rat), also called specificity, is as follows:
FPR=FP/(TN+FP)
the ROC curve has the good characteristic that when the distribution of the positive and negative samples changes, the ROC curve can be kept unchanged, and the influence of the imbalance of the sample types on the index result can be well eliminated.
The area under the ROC curve, AUC (Area Under Curve), is the size of the area beneath the ROC curve. The larger this area, the better the model performance; AUC is the evaluation metric derived from it. Typically the AUC ranges from 0.5 to 1.0, with a larger AUC indicating better prediction performance. If the model is perfect its AUC is 1, which proves that all positive examples are ranked before negative examples; if the model is a simple random-guess binary classifier its AUC is 0.5; and if one model is better than another, its area under the curve is relatively larger and its AUC value is correspondingly larger.
The Confusion Matrix, also referred to as an error matrix, allows the effect of the algorithm to be observed directly. Each of its columns is the predicted class of the samples and each row is the true class (or vice versa), reflecting the degree of confusion in the classification results.
The embodiment of the invention can evaluate the quality of the historical true prediction performance with any single evaluation metric, or with several evaluation metrics together.
In the preferred embodiment of the invention, the quality of the historical true prediction performance is evaluated with the area under the ROC curve (AUC); the prediction model with the largest AUC has the best true prediction effect.
When several evaluation metrics are used, each metric can carry a corresponding weight, and the weighted average of the results of the metrics serves as the basis for evaluating the historical true prediction performance of the various prediction modes; the prediction model with the largest weighted average has the best true prediction effect. Preferably, the area under the ROC curve, AUC, carries the highest weight.
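The evaluation metrics above, and a weighted combination with AUC weighted highest, can be computed with scikit-learn roughly as follows; the labels, predicted probabilities and weights are illustrative placeholders, not values from the patent:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, roc_auc_score)

# y_true: actual activity labels from the test set; y_prob: predicted probability
# of the positive class from the model (placeholder values here).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.3, 0.8, 0.6])
y_pred = (y_prob >= 0.5).astype(int)

print("Accuracy :", accuracy_score(y_true, y_pred))    # (TP+TN)/(TP+TN+FP+FN)
print("Precision:", precision_score(y_true, y_pred))   # TP/(TP+FP)
print("Recall   :", recall_score(y_true, y_pred))      # TP/(TP+FN)
print("AUC      :", roc_auc_score(y_true, y_prob))     # area under the ROC curve
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))

# Weighted combination of several metrics with AUC weighted highest.
weights = {"auc": 0.5, "accuracy": 0.25, "recall": 0.25}
score = (weights["auc"] * roc_auc_score(y_true, y_prob)
         + weights["accuracy"] * accuracy_score(y_true, y_pred)
         + weights["recall"] * recall_score(y_true, y_pred))
print("Weighted score:", score)
```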
And S7, predicting the activity of the small molecule target of the drug according to the final TextCNN model.
According to one embodiment of the invention, in the testing stage, the number of iterations is set and an Adam optimizer is used to optimize the network parameters; in the training stage, training is performed with five-fold cross-validation to reduce model overfitting.
That is, after the model is built, several important hyperparameters, such as the learning rate, the activation function and the dimensionality, are optimized by a full-space (grid) search guided by the performance of the parameter-tuning optimization, yielding better model prediction ability and performance. Dropout and regularization (five-fold cross-validation) are introduced into the model to reduce overfitting; an exponentially decaying learning rate is chosen for the Adam optimizer, which to some extent avoids the situation where the loss stops decreasing as the number of iterations grows because of the learning rate, and prevents under-fitting of the model. FIG. 2 is a schematic diagram of a prediction result obtained with this method.
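A minimal sketch of five-fold cross-validation with the Adam optimizer and an exponentially decaying learning rate; a small stand-in network is used instead of the TextCNN sketched earlier, and the data, fold seed, decay factor and epoch count are illustrative assumptions:

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import StratifiedKFold

# Placeholder feature matrix and activity labels; in the real pipeline X would
# be the encoded SMILES fed to the TextCNN model sketched earlier.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 50)).astype(np.float32)
y = rng.integers(0, 2, 500)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)     # five-fold CV
for fold, (tr, va) in enumerate(skf.split(X, y)):
    model = nn.Sequential(nn.Linear(50, 32), nn.ReLU(), nn.Dropout(0.5), nn.Linear(32, 2))
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

    X_tr, y_tr = torch.from_numpy(X[tr]), torch.from_numpy(y[tr])
    X_va, y_va = torch.from_numpy(X[va]), torch.from_numpy(y[va])

    for epoch in range(10):                      # iteration count is an assumption
        optimizer.zero_grad()
        loss = criterion(model(X_tr), y_tr)
        loss.backward()
        optimizer.step()
        scheduler.step()                         # exponentially decay the learning rate

    model.eval()                                 # disable dropout for evaluation
    with torch.no_grad():
        acc = (model(X_va).argmax(dim=1) == y_va).float().mean().item()
    print(f"fold {fold}: validation accuracy {acc:.3f}")
```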
In conclusion, by effectively combining artificial intelligence deep learning algorithms with drug research and development, properties of drug molecules such as target binding, activity and toxicity can be effectively predicted from the structural formula features of drug small molecules; this provides a theoretical basis and technical support for solving the time-consuming, labor-intensive and poorly universal algorithm construction involved in virtual drug screening, and further provides more effective property prediction schemes for drug research and development.
The invention also provides a device for predicting the activity of the drug small molecule target based on the TextCNN, which corresponds to the method for predicting the activity of the drug small molecule target based on the TextCNN in the embodiment.
Fig. 3 is a block diagram of a device for predicting the activity of a target of a TextCNN-based drug small molecule according to an embodiment of the present invention.
As shown in fig. 3, the apparatus for predicting the activity of a TextCNN-based small molecule target of a drug according to an embodiment of the present invention may include: a pre-processing module 10, a model training module 20, a model testing module 30, and a prediction module 40.
The preprocessing module 10 is configured to preprocess the drug small molecule data to be trained and tested to obtain a natural-language, encoded drug text data set, which is divided into a training set and a test set. The model training module 20 is configured to, in the training stage, import the drug text data in the training set, vectorize the encoded natural language of the drug text data set at the embedding layer, perform convolution operations on the embedding layer output at the convolutional layer to extract features, apply max pooling to the features extracted by the convolutional layer at the max pooling layer, and perform a fully connected operation on the feature vectors obtained from the max pooling layer at the fully connected layer to obtain a classification result. The model testing module 30 is configured to test with the drug text data in the test set and adjust the TextCNN model according to the test results to obtain the final TextCNN model. The prediction module 40 is configured to predict the activity of drug small molecule targets with the final TextCNN model.
According to one embodiment of the invention, when the preprocessing module 10 preprocesses the drug small molecule data to be trained and tested to obtain a natural-language, encoded drug text data set divided into a training set and a test set, it is specifically configured to acquire the SMILES strings of all drug small molecules in the drug small molecule data, encode the SMILES strings of the drug small molecules using one-hot encoding, and divide the encoded drug text data set into a training set and a test set.
According to an embodiment of the present invention, the preprocessing module 10 is further configured to oversample the minority-class samples in the training set with the SMOTE method to obtain a processed training set; the model training module 20 trains the TextCNN model on the processed training set.
According to an embodiment of the present invention, when vectorizing the encoded natural language of the drug text data set at the embedding layer of the TextCNN model, the model training module 20 is specifically configured to encode the SMILES strings into dense vectors so that the embedding layer can be trained.
According to an embodiment of the invention, when performing convolution operations on the embedding layer output at the convolutional layer of the TextCNN model to extract features, the model training module 20 is specifically configured to obtain a two-dimensional matrix after the SMILES string is processed by the embedding layer, and to perform convolution operations on the two-dimensional matrix with a plurality of convolution kernels to obtain a plurality of one-dimensional vectors.
According to an embodiment of the present invention, when the max pooling layer of the TextCNN model applies max pooling to the features extracted by the convolutional layer, the model training module 20 is specifically configured to take the maximum value of each of the one-dimensional vectors obtained by the convolution operations and concatenate these maxima as the output of the max pooling layer, thereby extracting the features obtained by the convolutional layer and reducing their dimensionality.
According to one embodiment of the invention, the model testing module 30 sets the number of iterations in the testing stage and uses an Adam optimizer to optimize the network parameters; the model training module 20 performs training in the training stage with five-fold cross-validation to reduce model overfitting.
It should be noted that details not disclosed in the apparatus for predicting the target activity of a TextCNN-based small drug molecule according to the embodiment of the present invention refer to details disclosed in the method for predicting the target activity of a TextCNN-based small drug molecule according to the embodiment of the present invention, and are not repeated herein.
In conclusion, by effectively combining artificial intelligence deep learning algorithms with drug research and development, properties of drug molecules such as target binding, activity and toxicity can be effectively predicted from the structural formula features of drug small molecules; this provides a theoretical basis and technical support for solving the time-consuming, labor-intensive and poorly universal algorithm construction involved in virtual drug screening, and further provides more effective property prediction schemes for drug research and development.
The invention further provides a computer device corresponding to the embodiment.
The computer device of the embodiment of the invention comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, and when the processor executes the computer program, the method for predicting the target activity of the drug small molecule based on the TextCNN can be realized according to the embodiment of the invention.
According to the computer device of the embodiment of the invention, when the processor executes the computer program stored in the memory, the drug small molecule data to be trained and tested are first preprocessed to obtain a natural-language, encoded drug text data set divided into a training set and a test set; model training and model testing are then performed; and finally the activity of drug small molecule targets is predicted with the finally tested TextCNN model. This can improve the accuracy of virtual screening for drug target prediction and effectively perform drug molecule screening and property prediction; the model has strong generalization and universality and great practical value.
The invention also provides a non-transitory computer readable storage medium corresponding to the above embodiment.
A non-transitory computer readable storage medium of an embodiment of the present invention, on which a computer program is stored, which when executed by a processor, can implement the method for predicting the activity of a target of a TextCNN-based small molecule of a drug according to the above embodiment of the present invention.
According to the non-transitory computer-readable storage medium of the embodiment of the invention, when a processor executes the computer program stored thereon, the drug small molecule data to be trained and tested are first preprocessed to obtain a natural-language, encoded drug text data set divided into a training set and a test set; model training and model testing are then performed; and finally the activity of drug small molecule targets is predicted with the finally tested TextCNN model. This can improve the accuracy of virtual screening for drug target prediction and effectively perform drug molecule screening and property prediction; the model has strong generalization and universality and great practical value.
According to the computer program product of the embodiment of the invention, when the processor executes the instructions therein, the TextCNN-based method for predicting the activity of drug small molecule targets of the above embodiment can be implemented.
In the description of the present invention, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. The meaning of "plurality" is two or more unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; either directly or indirectly through intervening media, either internally or in any other relationship. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the present invention, unless otherwise expressly stated or limited, the first feature "on" or "under" the second feature may be directly contacting the first and second features or indirectly contacting the first and second features through an intermediate. Also, a first feature "on," "over," and "above" a second feature may be directly or diagonally above the second feature, or may simply indicate that the first feature is at a higher level than the second feature. A first feature being "under," "below," and "beneath" a second feature may be directly under or obliquely under the first feature, or may simply mean that the first feature is at a lesser elevation than the second feature.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A drug small molecule target activity prediction method based on TextCNN is characterized by comprising the following steps:
preprocessing the drug small molecule data to be trained and tested to obtain a natural-language, encoded drug text data set, which is divided into a training set and a test set;
in the training stage, importing the drug text data in the training set, and vectorizing the encoded natural language of the drug text data set at the embedding layer of the TextCNN model;
performing convolution operations on the embedding layer output at the convolutional layer of the TextCNN model to extract features;
applying max pooling to the features extracted by the convolutional layer at the max pooling layer of the TextCNN model;
performing a fully connected operation on the feature vectors obtained from the max pooling layer at the fully connected layer of the TextCNN model to obtain a classification result;
in the testing stage, testing with the drug text data in the test set, and adjusting the TextCNN model according to the test results to obtain a final TextCNN model;
and predicting the activity of the drug small molecule target according to the final TextCNN model.
2. The TextCNN-based drug small molecule target activity prediction method according to claim 1, wherein preprocessing the drug small molecule data to be trained and tested to obtain a natural-language, encoded drug text data set divided into a training set and a test set comprises:
acquiring the SMILES strings of all drug small molecules in the drug small molecule data;
encoding the SMILES strings of the drug small molecules using one-hot encoding, and dividing the encoded drug text data set into a training set and a test set.
3. The method for predicting the activity of a TextCNN-based drug small molecule target according to claim 2, further comprising:
oversampling the minority-class samples in the training set with the SMOTE method to obtain a processed training set;
and training the TextCNN model according to the processed training set.
4. The method of claim 2, wherein vectorizing the encoded natural language in a drug text dataset at an embedding layer of the TextCNN model comprises:
encoding the SMILES strings into dense vectors so that the embedding layer can be trained.
5. The TextCNN-based drug small molecule target activity prediction method according to claim 4, wherein performing convolution operations on the embedding layer output at the convolutional layer of the TextCNN model to extract features comprises the following steps:
obtaining a two-dimensional matrix after the SMILES string is processed by the embedding layer;
performing convolution operations on the two-dimensional matrix with a plurality of convolution kernels to obtain a plurality of one-dimensional vectors.
6. The method for predicting TextCNN-based drug small molecule target activity according to claim 5, wherein maximally pooling features extracted from the convolutional layer in a maximal pooling layer of the TextCNN model comprises:
taking the maximum value of each of the one-dimensional vectors obtained by the convolution operations and concatenating these maxima as the output of the max pooling layer, thereby extracting the features obtained by the convolutional layer and reducing their dimensionality.
7. The TextCNN-based drug small molecule target activity prediction method according to claim 1, wherein in the testing stage, the number of iterations is set, and an Adam optimizer is used to optimize the network parameters;
in the training stage, training is carried out through a five-fold cross validation method so as to reduce model overfitting.
8. A device for predicting the activity of a drug small molecule target based on TextCNN is characterized by comprising:
the preprocessing module is used for preprocessing the drug small molecule data to be trained and tested to obtain a natural-language, encoded drug text data set, which is divided into a training set and a test set;
the model training module is used for, in the training stage, importing the drug text data in the training set, vectorizing the encoded natural language of the drug text data set at the embedding layer, performing convolution operations on the embedding layer output at the convolutional layer to extract features, applying max pooling to the features extracted by the convolutional layer at the max pooling layer, and performing a fully connected operation on the feature vectors obtained from the max pooling layer at the fully connected layer to obtain a classification result;
the model testing module is used for testing with the drug text data in the test set and adjusting the TextCNN model according to the test results to obtain a final TextCNN model;
and the prediction module is used for predicting the activity of the drug small molecule target according to the final TextCNN model.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements a TextCNN-based drug small molecule target activity prediction method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, implements a TextCNN-based drug small molecule target activity prediction method according to any one of claims 1 to 7.
CN202010998886.9A 2020-09-22 2020-09-22 Drug small molecule target activity prediction method and device based on TextCNN Pending CN112164426A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010998886.9A CN112164426A (en) 2020-09-22 2020-09-22 Drug small molecule target activity prediction method and device based on TextCNN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010998886.9A CN112164426A (en) 2020-09-22 2020-09-22 Drug small molecule target activity prediction method and device based on TextCNN

Publications (1)

Publication Number Publication Date
CN112164426A true CN112164426A (en) 2021-01-01

Family

ID=73863182

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010998886.9A Pending CN112164426A (en) 2020-09-22 2020-09-22 Drug small molecule target activity prediction method and device based on TextCNN

Country Status (1)

Country Link
CN (1) CN112164426A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112786120A (en) * 2021-01-26 2021-05-11 云南大学 Method for synthesizing chemical material with assistance of neural network
CN112820359A (en) * 2021-02-24 2021-05-18 北京中医药大学东直门医院 Liver injury prediction method, apparatus, device, medium, and program product
CN112966111A (en) * 2021-03-19 2021-06-15 北京星汉博纳医药科技有限公司 AI-based automatic classification method and system for object attribute text
CN114020910A (en) * 2021-11-03 2022-02-08 北京中科凡语科技有限公司 Medical text feature extraction method and device based on TextCNN
WO2023029351A1 (en) * 2021-08-30 2023-03-09 平安科技(深圳)有限公司 Self-supervised learning-based method, apparatus and device for predicting properties of drug small molecules

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107862173A (en) * 2017-11-15 2018-03-30 南京邮电大学 A kind of lead compound virtual screening method and device
US20180172667A1 (en) * 2015-06-17 2018-06-21 Uti Limited Partnership Systems and methods for predicting cardiotoxicity of molecular parameters of a compound based on machine learning algorithms
CN110459274A (en) * 2019-08-01 2019-11-15 南京邮电大学 A kind of small-molecule drug virtual screening method and its application based on depth migration study
CN111090749A (en) * 2019-12-23 2020-05-01 福州大学 Newspaper and periodical publication classification method and system based on TextCNN
CN111640471A (en) * 2020-05-27 2020-09-08 牛张明 Method and system for predicting activity of drug micromolecules based on two-way long-short memory model
CN111640470A (en) * 2020-05-27 2020-09-08 牛张明 Method for predicting toxicity of drug small molecules based on syntactic pattern recognition

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180172667A1 (en) * 2015-06-17 2018-06-21 Uti Limited Partnership Systems and methods for predicting cardiotoxicity of molecular parameters of a compound based on machine learning algorithms
CN107862173A (en) * 2017-11-15 2018-03-30 南京邮电大学 A kind of lead compound virtual screening method and device
CN110459274A (en) * 2019-08-01 2019-11-15 南京邮电大学 A kind of small-molecule drug virtual screening method and its application based on depth migration study
CN111090749A (en) * 2019-12-23 2020-05-01 福州大学 Newspaper and periodical publication classification method and system based on TextCNN
CN111640471A (en) * 2020-05-27 2020-09-08 牛张明 Method and system for predicting activity of drug micromolecules based on two-way long-short memory model
CN111640470A (en) * 2020-05-27 2020-09-08 牛张明 Method for predicting toxicity of drug small molecules based on syntactic pattern recognition

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112786120A (en) * 2021-01-26 2021-05-11 云南大学 Method for synthesizing chemical material with assistance of neural network
CN112786120B (en) * 2021-01-26 2022-07-05 云南大学 Method for synthesizing chemical material with assistance of neural network
CN112820359A (en) * 2021-02-24 2021-05-18 北京中医药大学东直门医院 Liver injury prediction method, apparatus, device, medium, and program product
CN112966111A (en) * 2021-03-19 2021-06-15 北京星汉博纳医药科技有限公司 AI-based automatic classification method and system for object attribute text
WO2023029351A1 (en) * 2021-08-30 2023-03-09 平安科技(深圳)有限公司 Self-supervised learning-based method, apparatus and device for predicting properties of drug small molecules
CN114020910A (en) * 2021-11-03 2022-02-08 北京中科凡语科技有限公司 Medical text feature extraction method and device based on TextCNN

Similar Documents

Publication Publication Date Title
CN112164426A (en) Drug small molecule target activity prediction method and device based on TextCNN
Kuhn Building predictive models in R using the caret package
US20210081798A1 (en) Neural network method and apparatus
CN108399201A (en) A kind of Web user access path prediction technique based on Recognition with Recurrent Neural Network
CN112164427A (en) Method and device for predicting activity of small drug molecule target based on deep learning
CN112102899A (en) Construction method of molecular prediction model and computing equipment
CN113488104A (en) Cancer driver gene prediction method and system based on local and global network centrality analysis
CN115394383A (en) Method and system for predicting luminescence wavelength of phosphorescent material
CN116467141A (en) Log recognition model training, log clustering method, related system and equipment
Kotanchek et al. Symbolic regression via genetic programming as a discovery engine: Insights on outliers and prototypes
CN112164428B (en) Method and device for predicting properties of small drug molecules based on deep learning
CN111582315A (en) Sample data processing method and device and electronic equipment
CN114444668A (en) Network quantization method, network quantization system, network quantization apparatus, network quantization medium, and image processing method
CN111048145B (en) Method, apparatus, device and storage medium for generating protein prediction model
CN116646001B (en) Method for predicting drug target binding based on combined cross-domain attention model
Ma et al. Kernel soft-neighborhood network fusion for MiRNA-disease interaction prediction
CN110837853A (en) Rapid classification model construction method
CN115936773A (en) Internet financial black product identification method and system
CN115908909A (en) Evolutionary neural architecture searching method and system based on Bayes convolutional neural network
CN111026661B (en) Comprehensive testing method and system for software usability
KR20220110620A (en) Neural Hashing for Similarity Search
CN113724779A (en) SNAREs protein identification method, system, storage medium and equipment based on machine learning technology
US12013938B2 (en) Apparatus for generating a signature that reflects the similarity of a malware detection and classification system based on deep neural networks, method therefor, and computer-readable recording medium recorded with a program for performing the method
CN111091865B (en) Method, device, equipment and storage medium for generating MoRFs prediction model
CN117689011B (en) Model adjustment method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20210101