CN110020671A

CN110020671A - Method for constructing a drug relationship classification model based on a dual-channel CNN-LSTM network, and classification method

Info

Publication number: CN110020671A
Application number: CN201910174269.4A
Authority: CN
Inventors: 孙霞; 马龙; 张蕾; 冯筠; 吴楠楠
Original assignee: Northwest University
Current assignee: Northwest University
Priority date: 2019-03-08
Filing date: 2019-03-08
Publication date: 2019-07-16
Anticipated expiration: 2039-03-08
Also published as: CN110020671B

Abstract

The invention discloses a method for constructing a drug relationship classification model based on a dual-channel CNN-LSTM network. An original drug text set is pre-processed; each pre-processed drug text in the pre-processed drug text set is reversed to obtain a reverse-order text set; the pre-processed drug text set is used as the forward-order text set; and a neural network is trained to obtain the drug relationship classification model. The neural network comprises a forward-order text feature extraction layer and a reverse-order text feature extraction layer arranged in parallel, followed by a feature fusion layer and a classification layer. The forward-order and reverse-order text feature extraction layers each comprise convolution blocks and a long short-term memory (LSTM) neural network block arranged in sequence. By constructing a dual-channel CNN-LSTM network, the invention uses the CNN to extract local features of the drug text and the LSTM to extract its global features, so that the extracted drug relationship features are richer and classification accuracy improves.

Description

Method for constructing a drug relationship classification model based on a dual-channel CNN-LSTM network, and classification method

Technical field

The present invention relates to methods for constructing a drug relationship classification model and for classifying drug relationships, and in particular to a method for constructing a drug relationship classification model based on a dual-channel CNN-LSTM network and a corresponding classification method.

Background technique

A drug relationship refers to the combined effect produced when two or more drugs are taken simultaneously or within a certain period of time. This effect can be divided into synergistic effects, antagonistic effects and no interaction. Antagonistic effects between drugs can pose serious health risks to patients. The drug relationship extraction (DDIE) task is a typical relation extraction task in natural language processing. It aims to detect and identify the semantic relation of a drug pair, and is of great significance for reducing drug safety incidents and promoting the development of biomedical technology.

In recent years, experts in biomedicine and text mining have put great effort into the DDIE task and have devised many methods, which fall mainly into three categories: rule-based (pattern-based) methods, methods based on statistical machine learning, and methods based on deep learning. Although rule-based methods can identify entity relations in target text in a highly targeted way, they have three serious drawbacks: (1) a great deal of manpower and material resources must be spent studying the target text, otherwise the extraction quality of the resulting rules cannot be guaranteed; (2) formulating the rules requires domain experts to supply a large amount of prior knowledge, and different experts may, for subjective reasons, produce inconsistent rule sets; (3) because the rules are highly specific to the domain knowledge, they are only suitable for information extraction in that domain and generally generalize poorly. For these reasons rule-based methods have not attracted wide attention from researchers. Methods based on statistical machine learning perform well, but they need fine-grained and tedious feature engineering to extract a suitable feature set. The quality of these features depends on existing natural language processing tools and is therefore adversely affected by the noise and cost of those tools; the extracted features are unordered and their quality is hard to guarantee, so classification accuracy is low.

Summary of the invention

The purpose of the present invention is to provide a method for constructing a drug relationship classification model based on a dual-channel CNN-LSTM network, and a classification method, so as to solve the problem that the disorder of the features extracted by prior-art drug relationship classification methods leads to low drug relationship classification accuracy.

To accomplish the above task, the invention adopts the following technical solution:

A method for constructing a drug relationship classification model based on a dual-channel CNN-LSTM network, the method being carried out according to the following steps:

Step 1, obtain an original drug text set;

Label the drug relationship in each original drug text of the original drug text set to obtain a drug relationship label set;

Step 2, pre-process the original drug text set to obtain a pre-processed drug text set;

The pre-processing includes text normalization, fixing the text length, and text vector mapping;

Step 3, reverse each pre-processed drug text in the pre-processed drug text set to obtain a reverse-order text set;

Use the pre-processed drug text set as the forward-order text set;

Step 4, with the forward-order text set and the reverse-order text set as input and the drug relationship label set as output, train a neural network to obtain the drug relationship classification model;

The neural network comprises a forward-order text feature extraction layer and a reverse-order text feature extraction layer arranged in parallel, followed in sequence by a feature fusion layer and a classification layer;

The forward-order text feature extraction layer and the reverse-order text feature extraction layer each comprise convolution blocks and a long short-term memory (LSTM) neural network block arranged in sequence.

Further, four convolution blocks are provided.

Further, each convolution block comprises, arranged in sequence, a batch normalization sublayer, a convolution sublayer, an activation function sublayer, a convolution sublayer, an activation function sublayer and a pooling sublayer.

Further, the activation function in the activation function sublayer is the ReLU function.

Further, the feature fusion layer comprises a fully connected layer.

Further, the classification layer comprises a Softmax function layer.

A drug relationship classification method based on a dual-channel CNN-LSTM network, in which a drug text to be classified is processed according to the following steps:

Step A, pre-process the drug text to be classified using the method of step 2 in claim 1 to obtain a pre-processed drug text;

Step B, input the pre-processed drug text into the drug relationship classification model described in any one of claims 1-6 to obtain a classification result.

Compared with the prior art, the present invention has the following technical features:

1. The construction and classification methods based on a dual-channel CNN-LSTM network provided by the invention build a dual-channel CNN-LSTM network, using the CNN to extract local features of the drug text and the LSTM to extract its global features, so that the extracted drug relationship features are richer and classification accuracy improves;

2. The construction and classification methods based on a dual-channel CNN-LSTM network provided by the invention complete feature extraction by feeding the forward-order text and the reverse-order text of the drug relationship text into separate CNN-LSTM networks; compared with a single-channel LSTM network, the extracted drug relationship features are more comprehensive, so classification accuracy improves;

3. The construction and classification methods based on a dual-channel CNN-LSTM network provided by the invention, by extracting drug text feature vectors, simplify the extraction of drug feature vectors and improve the accuracy of drug relationship classification;

4. The construction and classification methods based on a dual-channel CNN-LSTM network provided by the invention take as input the original drug relationship text containing multiple drug entities, require no manual intervention or domain expertise, and do not need complex text features to be extracted manually, so they generalize well.

Description of the drawings

Fig. 1 is a structural diagram of the drug classification model provided in one embodiment of the present invention;

Fig. 2 is a diagram of the internal structure of a convolution block provided in one embodiment of the present invention.

Detailed description of the embodiments

The terms that appear in the embodiments are explained first:

Long short-term memory neural network (LSTM): an LSTM network consists of an input gate, a forget gate, an output gate and a memory cell. Through this gating mechanism an LSTM can effectively learn long-term dependencies in its input data, and it is widely used for processing sequential information such as text data and trajectory data.
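
As a rough illustration of the gating mechanism described above, the following sketch computes a few steps of a scalar LSTM cell in plain Python. The helper name `lstm_step` and all weight values are assumptions made only so the example runs; the patent does not specify them, and a real network would use vectors and learned weights.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell.

    w maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(name):
        wx, wh, b = w[name]
        return wx * x + wh * h_prev + b

    i = sigmoid(pre("input"))    # input gate: how much new content to store
    f = sigmoid(pre("forget"))   # forget gate: how much old memory to keep
    o = sigmoid(pre("output"))   # output gate: how much memory to expose
    g = math.tanh(pre("cell"))   # candidate memory content
    c = f * c_prev + i * g       # updated memory cell
    h = o * math.tanh(c)         # new hidden state
    return h, c


# Hypothetical weights, identical for every gate, for illustration only.
weights = {name: (0.5, 0.5, 0.0) for name in ("input", "forget", "output", "cell")}

h, c = 0.0, 0.0
for x in [0.21, 0.35, 0.62]:
    h, c = lstm_step(x, h, c, weights)
```

The memory cell `c` is carried from step to step, which is what lets the network retain information over long sequences.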

Convolutional neural network (CNN): a feedforward neural network that involves convolution computations and has a deep structure.

Embodiment one

As shown in Figure 1, this embodiment discloses a method for constructing a drug relationship classification model based on a dual-channel CNN-LSTM network, the method being carried out according to the following steps:

Step 1, obtain an original drug text set;

The biomedical text in this embodiment can be collected from sources such as biomedical literature and papers. The collected text may be part or all of a document or paper, but its semantic content must be complete.

Each original drug text must contain at least two target drug name words, which are the drug words involved in drug relationship classification; the remaining words are other words. For example, an original drug text in this embodiment is: "Some quinolones, including ciprofloxacin, have been associated with transient elevations in serum creatinine in patients receiving cyclosporine concomitantly", in which "quinolones", "ciprofloxacin" and "cyclosporine" are drug name words and the remaining words are other words.

In the data set used here, text length ranges from 0 to 150 words, with most texts between 20 and 60 words long, and backward dependencies (grammatical phenomena such as attributive clauses) account for 46% of the data set.

There are 5 drug relationship labels: advice (recommendation), effect (effect), mechanism (mechanism of drug action), int (positive) and false (unrelated).

In this embodiment, the original drug text set is pre-processed using the drug text set and the processing approach of the patent "Drug relationship classification method based on a multilayer convolutional neural network". The original drug texts in the set differ in format and length, and drug name words are complex and uncommon, so errors are easily introduced when classifying with a neural network. The collected original drug texts therefore first need to be pre-processed: all words in the original drug text are morphologically normalized, i.e. their word forms are unified, and the target drug name words are renamed under a unified naming scheme, with the renamed forms replacing the original target drug name words. The specific operations are as follows:

Step 2.1, normalize all words in the original drug text set, rename the target drug name words in a unified form, and replace the target drug name words with the renamed forms, obtaining the normalized drug text set;

Here, normalization includes morphological normalization and name normalization;

So that drug texts can be classified more accurately and with fewer errors introduced, every word in the original drug text is morphologically normalized, i.e. converted to a unified format. Every word in an original drug text is normalized to obtain a normalized original drug text, and this is repeated until every word of every original drug text in the set has been morphologically normalized, yielding the normalized original drug text set.

To improve the generalization of the neural network, all target drug name words in the drug text are first renamed in a unified form, "X + ordinal", where X can be any English word, such as "day" or "interaction", and the ordinal is a sequence number written as an English word, such as "one", "two", "three". The unified names then replace the original target drug words; after replacement the target drug name words are "drugone", "drugtwo", "drugthree" and so on. The drug texts do not affect one another, and the pre-processed drug text set is obtained.
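
The renaming step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `rename_drugs`, the use of regular expressions, and the assumption that the target drug names are already known and listed in order are all hypothetical.

```python
import re

# English ordinals used as suffixes; the list length is an arbitrary limit
# for this sketch.
ORDINALS = ["one", "two", "three", "four", "five",
            "six", "seven", "eight", "nine", "ten"]


def rename_drugs(text, drug_names, prefix="drug"):
    """Replace each known target drug name with prefix + ordinal.

    drug_names is assumed to be given in the order the names should be
    numbered; whole-word, case-insensitive matching is used.
    """
    for index, name in enumerate(drug_names):
        pattern = r"\b" + re.escape(name) + r"\b"
        text = re.sub(pattern, prefix + ORDINALS[index], text, flags=re.IGNORECASE)
    return text


text = ("Some quinolones, including ciprofloxacin, have been associated with "
        "transient elevations in serum creatinine in patients receiving "
        "cyclosporine concomitantly")
renamed = rename_drugs(text, ["quinolones", "ciprofloxacin", "cyclosporine"])
```

Replacing rare drug names with a small fixed vocabulary means the network sees the same tokens for every drug pair, which is the generalization benefit the paragraph describes.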

Step 2.2, unify the length of each drug text in the normalized drug text set to obtain the fixed-length drug text set;

The length of each drug text in the pre-processed drug text set is fixed to n; texts shorter than n are padded, and the padding may consist of all zeros or of random numbers. A drug text can then be expressed as:

S = w₁w₂w₃...wₙ
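
A minimal sketch of the length-fixing step is shown below. The padding token `<pad>`, the function name `fix_length` and the truncation of texts longer than n are assumptions; the source only states that shorter texts are padded.

```python
def fix_length(tokens, n, pad_token="<pad>"):
    """Return a copy of tokens with exactly n entries.

    Shorter texts are padded at the end; longer texts are truncated
    (truncation is an assumption, the source only describes padding).
    """
    if len(tokens) >= n:
        return list(tokens[:n])
    return list(tokens) + [pad_token] * (n - len(tokens))


short_text = ["drugone", "may", "increase", "drugtwo", "levels"]
fixed = fix_length(short_text, n=8)
```

If the padding token is later mapped to an all-zero word vector, this corresponds to the zero-padding option mentioned in the text.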

Step 2.3, apply vector mapping to each fixed-length drug text in the fixed-length drug text set to obtain the pre-processed drug text set;

Since a neural network cannot directly take natural-language text as input, the drug text is mapped to a numerical text vector, through the following steps:

A. Build a word vector table consisting of words and their corresponding numerical word vectors;

The word vector table consists of words and their corresponding numerical word vectors; each word corresponds to a unique numerical word vector in the table. As many words as possible are put into the table so that it covers more words.

To obtain more meaningful word vectors, this embodiment uses the GloVe (Global Vectors for Word Representation) word vector table provided by the Stanford University NLP group, which contains 2196016 word vectors, each of dimension 300. If a word in the input original text is not in this word vector table, every dimension of that word's vector is initialized to 0.

B. Map each fixed-length drug text in the drug text set by table lookup to obtain the pre-processed drug text set.

Each word in a drug text of length n is mapped, by looking it up in the word vector table, to a d-dimensional vector. In this way every word of the length-n drug text becomes a d-dimensional word vector, so an original drug text S of length n is mapped to a text vector of dimension (n × d).

A drug text set containing m original drug texts S of length n is thus mapped to a text vector set of size m × (n × d); that is, the drug text set contains m text vectors of dimension (n × d).
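
The table lookup can be sketched as follows. The three-dimensional toy table stands in for the 300-dimensional GloVe table and its values are invented for illustration; `text_to_matrix` is a hypothetical helper name.

```python
def text_to_matrix(tokens, table, d):
    """Map n tokens to an n x d list of word vectors.

    Words missing from the table get an all-zero vector, as described
    in the text above.
    """
    zero = [0.0] * d
    return [list(table.get(token, zero)) for token in tokens]


# Toy word vector table (d = 3); the values are made up for this example.
toy_table = {
    "drugone": [0.1, 0.2, 0.3],
    "and": [0.4, 0.5, 0.6],
    "drugtwo": [0.7, 0.8, 0.9],
}

matrix = text_to_matrix(["drugone", "and", "drugtwo", "interact"], toy_table, d=3)
```

Here a text of n = 4 words becomes a 4 × 3 matrix, and the out-of-vocabulary word "interact" is represented by zeros.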

Use the pre-processed drug text set as the forward-order text set;

Natural language also contains reverse-order constructions, such as postposed attributives. To make the extracted features more comprehensive, this solution trains the neural network on both the forward-order drug text and the reverse-order drug text to obtain the classification model.

When a drug text is reversed, the order of the elements in its text vector is reversed. For example, the 1-dimensional vector [0.21 0.35 0.62 0.85 0.96] becomes [0.96 0.85 0.62 0.35 0.21] after reversal.
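
The reversal can be written in one line of Python. For an (n × d) text, this sketch assumes that the word order is reversed while each word vector is left intact; the source only gives the 1-dimensional example.

```python
def reverse_text(sequence):
    """Return a new list with the elements in reverse order."""
    return sequence[::-1]


# The 1-dimensional example from the text.
forward = [0.21, 0.35, 0.62, 0.85, 0.96]
backward = reverse_text(forward)

# An (n x d) text: rows (words) are reversed, each row is unchanged.
forward_text = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
backward_text = reverse_text(forward_text)
```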

The neural network comprises a forward-order text feature extraction layer and a reverse-order text feature extraction layer arranged in parallel, a feature fusion layer and a classification layer;

The forward-order text feature extraction layer has the same structure as the reverse-order text feature extraction layer; each comprises convolution blocks and an LSTM neural network block arranged in sequence.

In this embodiment, the structure of the neural network was redesigned to improve the accuracy of drug relationship classification. As shown in Figure 1, convolution blocks first extract the local features of the text, and these local features are then passed to an LSTM model, which supplements them with the global and temporal features of the text. This alone, however, can only handle forward-order text; when it meets text with backward modification, such as attributive clauses, its processing ability is still weak. Two identical feature extraction layers are therefore used to process the forward order and the reverse order of the input text separately, and the extracted forward-order and reverse-order features are then combined to obtain the final text feature, which is passed to the classification layer for classification.

In this embodiment, with fewer than 4 convolution blocks the extracted local features are not accurate enough, while with more than 4 overfitting occurs and feature extraction fails. As a preferred embodiment, therefore, 4 convolution blocks are provided.

Optionally, each convolution block comprises, arranged in sequence, a batch normalization sublayer, a convolution sublayer, an activation function sublayer, a convolution sublayer, an activation function sublayer and a pooling sublayer.

In this embodiment, as shown in Fig. 2, after the forward-order drug text and the reverse-order drug text are passed to a convolution block they first enter the batch normalization layer. The batch normalization layer makes the input data conform more closely to a normal distribution; samples that conform to a normal distribution train much faster, and accuracy also improves.
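
The normalization performed by this layer can be sketched as a shift to zero mean and a rescaling to unit variance. This simplified version omits the learned scale and shift parameters that a trainable layer normally has, and the small constant `eps` is a conventional assumption, not a value from the source.

```python
import math


def batch_normalize(values, eps=1e-5):
    """Standardize a batch of values to zero mean and (almost) unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(variance + eps) for v in values]


batch = [0.21, 0.35, 0.62, 0.85, 0.96]
normalized = batch_normalize(batch)
```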

In this embodiment, the normalized data is passed to the convolutional layer for the convolution operation; the convolutional layer is configured with 128 convolution filters.

The data then enters the activation function, which removes the values that carry no meaning after convolution. As a preferred embodiment, the activation function is the ReLU function.
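
A convolution followed by ReLU can be sketched for a single filter as follows. The embodiment uses 128 filters; the window of 2 words, the filter weights and the toy input here are assumptions chosen to keep the example small.

```python
def relu(x):
    """ReLU keeps positive values and sets everything else to zero."""
    return x if x > 0 else 0.0


def conv1d(matrix, kernel, bias=0.0):
    """Slide one filter (k x d weights) over an (n x d) text.

    Returns n - k + 1 values, one per window position.
    """
    k = len(kernel)
    outputs = []
    for start in range(len(matrix) - k + 1):
        total = bias
        for row, weights in zip(matrix[start:start + k], kernel):
            total += sum(x * w for x, w in zip(row, weights))
        outputs.append(total)
    return outputs


text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # n = 4 words, d = 2
kernel = [[1.0, -1.0], [-1.0, 1.0]]                      # hypothetical filter, k = 2

raw = conv1d(text, kernel)             # [2.0, -1.0, 0.0]
features = [relu(v) for v in raw]      # [2.0, 0.0, 0.0]
```

The negative response at the second window position is set to zero, which is the sense in which the activation discards values that carry no useful signal.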

The convolution and activation operations above are repeated, and the resulting data is passed to the pooling layer, which uses max pooling. For example, with a pooling window of size 2*2, the window slides over the data produced by convolution and activation, and at each position the largest number in the window is selected as a representative. Each slide yields one representative, and the representatives then stand in for the original data. The benefit is that, provided the text features are not lost, the amount of data is reduced and network training is accelerated.
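
The 2*2 max pooling described above can be sketched as follows. The stride of 2 (non-overlapping windows) is an assumption, since the source does not state how far the window moves at each step, and the input values are invented.

```python
def max_pool_2d(data, size=2, stride=2):
    """Keep the largest value in each size x size window."""
    rows, cols = len(data), len(data[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled_row = []
        for c in range(0, cols - size + 1, stride):
            window = [data[r + i][c + j] for i in range(size) for j in range(size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


activated = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 6],
]
pooled = max_pool_2d(activated)   # [[4, 5], [3, 7]]
```

Sixteen values are reduced to four representatives, which illustrates the reduction in data volume that speeds up training.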

After passing through 4 identical convolution blocks of this kind, the forward-order drug text and the reverse-order drug text enter the LSTM neural network block, which captures the global and temporal features present in the drug relationship text.

In this embodiment, the number of nodes in the LSTM neural network block is set to 64.

Optionally, the feature fusion layer comprises a fully connected layer.

In this embodiment, after the forward-order drug text and the reverse-order drug text are fed into their respective CNN-LSTM networks, a forward-order text feature and a reverse-order text feature are obtained, and both are passed to the fully connected layer at the same time. For example, if the forward-order and reverse-order text features each have 100 elements, a fully connected layer is built with 200 nodes in the first layer and 100 nodes in the second layer; the forward-order text feature is fed to the first 100 nodes of the first layer and the reverse-order text feature to the last 100 nodes, and the 200 features are fused together in this way.
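
The 200-to-100 fusion in this example can be sketched as a concatenation followed by one dense layer. The random feature values and weights are placeholders for illustration; in the trained network they would come from the two CNN-LSTM channels and from learning, respectively.

```python
import random


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum per output node."""
    return [sum(x * w for x, w in zip(inputs, row)) + b
            for row, b in zip(weights, biases)]


rng = random.Random(0)

# Placeholder features standing in for the outputs of the two channels.
forward_features = [rng.random() for _ in range(100)]
backward_features = [rng.random() for _ in range(100)]

# First layer: forward features occupy nodes 0-99, backward features 100-199.
fused_input = forward_features + backward_features

# Second layer: 100 nodes, each connected to all 200 first-layer nodes.
weights = [[rng.uniform(-0.1, 0.1) for _ in range(200)] for _ in range(100)]
biases = [0.0] * 100
fused = dense(fused_input, weights, biases)
```

Because every second-layer node sees all 200 inputs, each fused feature can combine evidence from both reading directions.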

Optionally, the classification layer comprises a Softmax function layer.

In this embodiment, the fully connected layer and the Softmax function layer form the final part of the drug relationship classification algorithm. They output the drug relationship label as a numerical vector whose length equals the number of classes, which determines the final drug relationship classification result. Each output node of the fully connected layer and the Softmax function layer represents one drug relationship class, and the label finally output by the classifier gives the probability that the given drug entity pair belongs to each class, with each probability in [0,1]. For example, suppose there are 2 drug relationships, related and unrelated. The Softmax function layer then has 2 output nodes, representing positive and negative respectively. If the numerical drug relationship label output by the Softmax function layer is p[positive, negative]=[0.1,0.9], the probability of positive is 0.1 and the probability of negative is 0.9, and the judgment is made on that basis. In this embodiment there are 5 drug relationships: advice (recommendation), effect (effect), mechanism (mechanism of drug action), int (positive) and false (unrelated).
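
The Softmax function turns the raw outputs of the last fully connected layer into probabilities that lie in [0,1] and sum to 1. In the sketch below the input scores are invented for illustration; only the five label names come from the text.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


LABELS = ["mechanism", "advice", "effect", "int", "false"]
scores = [0.1, 1.2, 0.3, 2.5, 0.9]   # hypothetical outputs of the last layer
probabilities = softmax(scores)
```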

The hierarchical convolutional recurrent neural network is trained with the above input and output to obtain the drug relationship classification hierarchical convolutional recurrent neural network; the drug relationship text and each drug relationship label are in numerical vector form. The hierarchical convolutional recurrent neural network is trained N times, and the network that performs best over these N training runs is used as the drug relationship classification hierarchical convolutional recurrent neural network, where N ≥ 1.
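
Selecting the best of N training runs can be sketched as follows. `train_once` is a hypothetical stand-in that returns a dummy model and a random score; how the best run is measured is not stated in the source, so scoring on held-out data is an assumption.

```python
import random


def train_once(seed):
    """Hypothetical stand-in for one full training run.

    Returns (model, score); the score here is random, whereas a real run
    would evaluate the trained network on held-out data.
    """
    rng = random.Random(seed)
    return {"seed": seed}, rng.random()


def best_of_n(n):
    """Train n times and keep the run with the highest score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    runs = [train_once(seed) for seed in range(n)]
    return max(runs, key=lambda run: run[1])


best_model, best_score = best_of_n(10)
```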

The training set of the classification hierarchical convolutional recurrent neural network comprises two parts: first, the pre-processed drug text set that is input to the network; second, the drug relationship labels between the target drug name words in the original drug text corresponding to each drug text of the pre-processed set. The drug relationship label set corresponding to the drug texts serves as the target output of the multilayer convolutional network. Likewise, the test set also comprises two parts. The difference is that during testing only the pre-processed drug text set is input to the trained network, which produces the model's predicted classification result set from the input drug text data and the trained model parameters; the predicted result set is then compared with the true drug relationship labels, and the comparison is used to evaluate the performance of the network.

In this example, the DDIExtraction 2013 drug relationship data set is used as the drug relationship text to train and test the classification hierarchical convolutional recurrent neural network. 80% of the whole data set is used as the training set and 20% as the test set; that is, the training set consists of 27792 drug relationship text samples and the test set of 6409 drug relationship text samples. The hierarchical convolutional recurrent neural network is then trained 10 times on the training set, and the model that performs best among the 10 runs is chosen as the final drug relationship hierarchical convolutional recurrent neural network.
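
An 80/20 split can be sketched as follows. Random shuffling, the fixed seed and the function name `split_dataset` are assumptions; the source does not say how the samples were assigned to the two sets.

```python
import random


def split_dataset(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the samples and cut it into training and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


samples = list(range(1000))          # stand-in for labelled drug texts
train_set, test_set = split_dataset(samples)
```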

Embodiment two

Step A, the drug text to be sorted is pre-processed using the method for step 2 in embodiment 1, is obtained pre- Treated drug text；

Step B, by drug relationship disaggregated model described in the pretreated drug text input embodiment 1 In, obtain classification results.

After training final drug relationship stratification convolution loop neural network, model can predict any drug Drug relationship involved in relational text, by the drug text input drug relationship stratification convolution loop that drug relationship is unknown Neural network, the drug that maximum probability is chosen from the digital vectors that drug relationship stratification convolution loop neural network exports close It is the drug relationship classification results of the drug text as unknown drug relationship.

In the present embodiment, drug text to be sorted is " Some quinolones have been associated with transient elevations in serum creatinine in patients receiving Cyclosporine concomitantly ", first aim medicine name word are quinolones, second target medication name Word is referred to as cyclosporine, carries out drug relationship point by trained drug relationship stratification convolution loop neural network Class, the drug relationship digital vectors label of output are as follows:

P [mechanism, advice, effect, int, false]=[0.02,0.09,0.1,0.67,0.12]

I.e. between two drug targets quinolones, cyclosporine is 2% there are the probability of mechanism, I.e. between two drug targets quinolones, cyclosporine is 9% there are the probability of advice, i.e. two target medicines Between object quinolones, cyclosporine is 10% there are the probability of effect, i.e. two drug targets Between quinolones, cyclosporine is 67% there are the probability of int, i.e. two drug target quinolones, Between cyclosporine is 12% there are the probability of false, wherein being up to 67% there are the probability of int relationship, therefore It will be between two drug targets quinolones, cyclosporine using drug relationship stratification convolution loop neural network Relationship is classified as int positive relationship.
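
Choosing the final class from the output vector amounts to taking the label with the highest probability. The sketch below uses the vector given in this embodiment; the helper name `predict_label` is hypothetical.

```python
LABELS = ["mechanism", "advice", "effect", "int", "false"]
p = [0.02, 0.09, 0.1, 0.67, 0.12]


def predict_label(probabilities, labels):
    """Return the label with the highest probability, and that probability."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return labels[best_index], probabilities[best_index]


label, probability = predict_label(p, LABELS)   # ("int", 0.67)
```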

The drug relationship classification method based on the dual-channel CNN-LSTM network provided by this solution is compared with prior-art drug classification algorithms; the performance comparison is given in Table 1. When evaluating the performance of a drug relationship classification method, the larger the precision, recall and F value, the better the drug relationship classification model performs. Table 1 shows that the drug relationship hierarchical convolutional recurrent neural network proposed by the invention is clearly better than the other methods on all three indicators of precision, recall and F value, which demonstrates that the proposed drug relationship classification method based on the hierarchical bidirectional convolutional recurrent neural network has the best classification performance on the drug relationship classification problem.
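
The three indicators can be computed as sketched below. The source does not say how they were calculated, so two assumptions are made here: the first indicator is taken to be precision, and scores are micro-averaged over the four interaction classes with "false" treated as the negative class. The labels in the example are invented and do not reproduce any result from Table 1.

```python
def micro_prf(true_labels, predicted_labels, negative="false"):
    """Micro-averaged precision, recall and F value over the non-negative classes."""
    tp = fp = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == truth:
            if truth != negative:
                tp += 1          # correctly detected interaction
        else:
            if guess != negative:
                fp += 1          # predicted an interaction class wrongly
            if truth != negative:
                fn += 1          # missed or mislabelled a true interaction
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value


# Invented labels, for illustration only.
truth = ["int", "false", "effect", "advice", "false", "mechanism"]
guess = ["int", "effect", "effect", "false", "false", "advice"]
precision, recall, f_value = micro_prf(truth, guess)
```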

Table 1. Performance comparison between the drug relationship classification method provided by the invention and other drug relationship classification methods

Claims

1. A method for constructing a drug relationship classification model based on a dual-channel CNN-LSTM network, characterized in that the method is carried out according to the following steps:

Step 1, obtain an original drug text set;

Label the drug relationship in each original drug text of the original drug text set to obtain a drug relationship label set;

Step 2, pre-process the original drug text set to obtain a pre-processed drug text set;

The pre-processing includes text normalization, fixing the text length, and text vector mapping;

Step 3, reverse each pre-processed drug text in the pre-processed drug text set to obtain a reverse-order text set;

Use the pre-processed drug text set as the forward-order text set;

Step 4, with the forward-order text set and the reverse-order text set as input and the drug relationship label set as output, train a neural network to obtain the drug relationship classification model;

The neural network comprises a forward-order text feature extraction layer and a reverse-order text feature extraction layer arranged in parallel, followed in sequence by a feature fusion layer and a classification layer;

The forward-order text feature extraction layer and the reverse-order text feature extraction layer each comprise convolution blocks and a long short-term memory (LSTM) neural network block arranged in sequence.

2. The method for constructing a drug relationship classification model based on a dual-channel CNN-LSTM network according to claim 1, characterized in that four convolution blocks are provided.

3. The method for constructing a drug relationship classification model based on a dual-channel CNN-LSTM network according to claim 2, characterized in that each convolution block comprises, arranged in sequence, a batch normalization sublayer, a convolution sublayer, an activation function sublayer, a convolution sublayer, an activation function sublayer and a pooling sublayer.

4. The method for constructing a drug relationship classification model based on a dual-channel CNN-LSTM network according to claim 3, characterized in that the activation function in the activation function sublayer is the ReLU function.

5. The method for constructing a drug relationship classification model based on a dual-channel CNN-LSTM network according to claim 1, characterized in that the feature fusion layer comprises a fully connected layer.

6. The method for constructing a drug relationship classification model based on a dual-channel CNN-LSTM network according to claim 1, characterized in that the classification layer comprises a Softmax function layer.

7. A drug relationship classification method based on a dual-channel CNN-LSTM network, characterized in that a drug text to be classified is processed according to the following steps:

Step A, pre-process the drug text to be classified using the method of step 2 in claim 1 to obtain a pre-processed drug text;

Step B, input the pre-processed drug text into the drug relationship classification model described in any one of claims 1-6 to obtain a classification result.