CN116401369A - Entity identification and classification method for biological product production terms - Google Patents

Entity identification and classification method for biological product production terms Download PDF

Info

Publication number
CN116401369A
CN116401369A (application CN202310665618.9A; granted as CN116401369B)
Authority
CN
China
Prior art keywords
word
production
entity
word vector
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310665618.9A
Other languages
Chinese (zh)
Other versions
CN116401369B (en)
Inventor
杨春
曾茂迪
李俊谚
陈跃辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baimus Chengdu Digital Technology Co ltd
Original Assignee
Baimus Chengdu Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baimus Chengdu Digital Technology Co ltd filed Critical Baimus Chengdu Digital Technology Co ltd
Priority to CN202310665618.9A priority Critical patent/CN116401369B/en
Publication of CN116401369A publication Critical patent/CN116401369A/en
Application granted granted Critical
Publication of CN116401369B publication Critical patent/CN116401369B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/23 - Clustering techniques
    • G06F 18/232 - Non-hierarchical techniques
    • G06F 18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 - Non-hierarchical techniques with a fixed number of clusters, e.g. K-means clustering
    • G06F 18/24 - Classification techniques
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 - Named entity recognition
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442 - Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00 - Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30 - Computing systems specially adapted for manufacturing


Abstract

The invention belongs to the field of entity identification of production terms in the biopharmaceutical production process and provides an entity identification and classification method for biological product production terms, comprising the following steps: word vector training is performed on unlabeled corpus from biopharmaceutical production to obtain a first word vector model; the unlabeled corpus is manually labeled to construct a data set; a word vector + BiLSTM + CRF neural network model is built on the basis of the first word vector model and trained on the constructed data set to obtain a second word vector model; entity recognition is performed on the biological product production term text to be recognized using the second word vector model to obtain a recognition result; finally, the entity word vectors in the data set are clustered into 20-50 clusters by an improved k-means clustering algorithm, and entity classification of the biopharmaceutical production term text is achieved by comparing the cosine similarity between each cluster and the recognized entity word vectors of the data set.

Description

Entity identification and classification method for biological product production terms
Technical Field
The invention relates to the field of entity identification of production terms in the biopharmaceutical production process, in particular to an entity identification and classification method for the production terms of biological products.
Background
With the continuous deepening of intelligent manufacturing, production terms in the biopharmaceutical industry need to be processed by machine learning and recognized automatically by computer. An entity recognition and classification method is an important basis for realizing intelligent production and control, and is also an underlying technology for information processing in intelligent manufacturing.
Disclosure of Invention
The invention aims to provide an entity identification and classification method for biological product production terms, which can accurately realize automatic identification and classification of biological product production terms.
The invention solves the technical problems and adopts the following technical scheme:
an entity identification and classification method for biological product production terms, comprising the steps of:
word vector training is carried out on unlabeled corpus in biopharmaceutical production, and a first word vector model is obtained;
manually labeling the unlabeled corpus in the biopharmaceutical production to construct a data set;
constructing a word vector +BiLSTM +CRF neural network model on the basis of the first word vector model, and training the model on the constructed data set to obtain a second word vector model;
performing entity recognition on the biological product production term text to be recognized by using the second word vector model to obtain a recognition result;
the entity word vectors in the data set are clustered into 20-50 clusters through a modified k-means clustering algorithm, and entity classification of the biopharmaceutical production term text is achieved by comparing cosine similarity between each cluster and the entity word vectors of the identified data set.
As a further explanation, word vector training is carried out on the unlabeled corpus from biopharmaceutical production using the continuous bag-of-words (CBOW) model in Word2vec; the corpus consists of the characters and words of common terms in biopharmaceutical production.

The continuous bag-of-words model predicts a center word from the 2c words before and after it, and its objective can be expressed as:

L = (1/T) · Σ_{t=1}^{T} log p(w_t | w_{t-c}, …, w_{t-1}, w_{t+1}, …, w_{t+c})

where t denotes the current position, T the total number of positions, w_t is the center word in the current production term text, and w_{t-c}, …, w_{t-1}, w_{t+1}, …, w_{t+c} are the 2c words before and after it. The center word w_t is predicted from the known 2c words by the continuous bag-of-words model, so the probability of the center word w_t occurring is conditioned on the 2c words around it.
As a further illustration, the continuous bag-of-words model first one-hot encodes the 2c words before and after the center word w_t into the corresponding input vectors x_{t-c}, …, x_{t+c}. These 2c vectors are fed to the input layer, multiplied by a shared weight matrix and averaged in the projection layer, and passed to the output layer, which finally yields the probability p(w_t | w_{t-c}, …, w_{t+c}) of the center word given the surrounding 2c words. Training by maximum likelihood estimation produces the final word vector v_i for each word w_i; the word vectors trained in this unified way are written {v_1, v_2, …, v_n}, where n is the number of word vectors.
As a further illustration, manually labeling the unlabeled corpus in biopharmaceutical production to construct the data set specifically includes:

preprocessing the original corpus, including deleting irrelevant content and special symbols and removing stop words;

preliminarily determining the entity categories to be recognized according to the actual production lines; the entity categories include preventive biological products, therapeutic biological products, and in vivo/in vitro diagnostic products;

in the labeling process, the entities are labeled with the BIO scheme according to the characteristics of the production term text: B marks the beginning of an entity, I marks the non-beginning part of an entity, and O marks the non-entity part.
As a further illustration, the construction of the word vector +BiLSTM +CRF neural network model specifically comprises:
inputting the word vector obtained by training into a BiLSTM neural network model to obtain global features with context information;
inputting the obtained feature vector with the context information into a CRF, extracting the dependency features among labels, and calculating a loss function;
according to the loss function, the parameters of the entity recognition model are updated by stochastic gradient descent (SGD): a training sample is drawn at random, the gradient of the error on that sample with respect to the parameters is computed, and the parameter values are repeatedly moved in the negative gradient direction; iteration stops when the objective function reaches its minimum.
By way of further illustration, the states of the BiLSTM neural network neurons are calculated by the following formulas:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t ⊙ tanh(c_t)

where σ is the Sigmoid function; x_t is the input word vector at the current moment; h_{t-1} is the hidden state at the previous moment; f_t is the forget gate, which decides what information is forgotten; i_t is the input gate, which decides what information is retained; c̃_t is the candidate state obtained from the current input word vector x_t; c_t is the memory cell, which controls the change of the cell state; c_{t-1} is the cell state at the previous moment; o_t is the output gate value of the memory cell; h_t is the hidden state at the current moment; W_f, W_i, W_c and W_o are the weight matrices of the forget gate, input gate, hidden unit and output gate respectively; and b_f, b_i, b_c and b_o are the corresponding bias terms.
As a further illustration, inputting the obtained feature vector with context information into the CRF, extracting the dependency features between labels, and calculating the loss function specifically includes:

for the production term text, given the word vector sequence X = (x_1, x_2, …, x_n) of the input words and the predicted label sequence y = (y_1, y_2, …, y_n) corresponding to each input word, the prediction score s(X, y) of y is defined as:

s(X, y) = Σ_{i=1}^{n-1} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

where A is the transition matrix, a parameter matrix obtained by the CRF learning the ordering between labels, whose entries represent the probability of each label transitioning to the next label; P is the probability score matrix, transformed from the feature matrix with context information; P_{i,j} is the probability that the i-th word is labeled j; and t is the number of predicted labels, so P is an n × t matrix;

the probability of y is calculated from the defined prediction score with the Softmax function:

p(y | X) = exp(s(X, y)) / Σ_{y' ∈ Y_X} exp(s(X, y'))

the log-likelihood of this probability is:

log p(y | X) = s(X, y) - log Σ_{y' ∈ Y_X} exp(s(X, y'))

where y denotes the actual labeling sequence, Y_X denotes all possible labeling sequences, and s(X, y') denotes the scores of the other paths;

the loss function is defined as:

loss = -log p(y | X) = log Σ_{y' ∈ Y_X} exp(s(X, y')) - s(X, y)

finally, the predicted sequence is decoded with the Viterbi algorithm to obtain the labeling sequence y* with the maximum probability:

y* = argmax_{y' ∈ Y_X} s(X, y')
as a further explanation, the entity recognition is performed on the text of the biological product production term to be recognized by using the second word vector model to obtain a recognition result, which specifically includes:
reading a production term text which needs entity recognition, and inputting the production term text into a trained word vector +BiLSTM +CRF model;
the text data of the production term is converted into word vectors after passing through a continuous word bag model, the word vectors are subjected to feature extraction through a BiLSTM neural network to obtain feature vectors with global information, and finally, the maximum possible labeling sequence of each sentence of language in the text is obtained in a CRF (color filter) by adopting a Viterbi algorithm, namely, the entity recognition result of the production term is produced.
By way of further illustration, the improvement of the k-means clustering algorithm is as follows: after the word vectors are normalized, a cosine similarity distance D(x, y) is defined to replace the original Euclidean distance calculation, thereby improving the K-means algorithm. The principle of the improvement is:

for the defined production term entities, let the word vectors after word vectorization be {w_1, w_2, …, w_n}. Taking any two word vectors x and y and normalizing them (so that |x| = |y| = 1), it can be deduced that:

d(x, y)² = |x - y|² = |x|² + |y|² - 2 x·y = 2(1 - cos(x, y))

where d(x, y) is the Euclidean distance between x and y and cos(x, y) is their cosine similarity. From this distance equivalence, the improved cosine similarity distance D(x, y) is defined as:

D(x, y) = 1 - cos(x, y)

which gives:

d(x, y)² = 2 D(x, y)

according to the criterion that the sum-of-squared-error function must decrease, a locally optimal solution is solved iteratively from the initial word vectors, finding the k partitions that minimize the squared error. The minimized squared error is:

E = Σ_{i=1}^{k} Σ_{x ∈ C_i} D(x, μ_i)

where μ_i = (1/|C_i|) Σ_{x ∈ C_i} x is the mean vector of cluster C_i; E characterizes how tightly the entities in a cluster surround the cluster mean vector, and the smaller its value, the higher the similarity of the entities within the cluster.
By way of further illustration, the entity classification of the biopharmaceutical production term text is accomplished by comparing the cosine similarity between each cluster and the recognized entity word vectors, and specifically includes:

for each of the 20-50 clusters obtained from the data set, extracting the 5-10 production term entities closest to the centroid; calculating the cosine similarity between their word vectors and the word vector of each production term entity to be classified in the test set, and taking the mean of these cosine similarities as the similarity judgment value between the cluster and the entity to be classified; the production term entity to be classified is then assigned to the cluster with the largest cosine similarity, completing the classification task;

the cosine similarity is calculated as follows: let the word vector of an entity in the training set be A and the word vector of the entity to be classified be B; then the cosine similarity of A and B is:

cos(A, B) = (A · B) / (|A| · |B|)

where cos(A, B) ∈ [-1, 1]; the larger the value, the higher the association between A and B, i.e. the closer cos(A, B) is to 1, the more similar A and B are.
The beneficial effects of the invention are as follows: with this entity identification and classification method for biological product production terms, the characters and words commonly used in biopharmaceutical production can be mapped into a vector space by a deep-learning-based method; the vectors are input into a neural network for feature extraction, labeling prediction is then performed in combination with the CRF (conditional random field), a comparatively accurate recognition result is output, and reasonable classification is carried out.
Drawings
FIG. 1 is a flow chart of a method for identifying and classifying entities for use in terms of biological product production in accordance with an embodiment of the present invention;
FIG. 2 is a block diagram of a CBOW model of an embodiment of the invention;
FIG. 3 is a block diagram of a BiLSTM neural network in accordance with an embodiment of the present invention;
FIG. 4 is a block diagram of a CRF of an embodiment of the invention;
FIG. 5 is a block diagram of a word vector +BiLSTM +CRF neural network model, in accordance with an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
The embodiment provides a method for identifying and classifying entities for biological product production terms, the flow chart of which is shown in fig. 1, wherein the method comprises the following steps:
s1, carrying out word vector training on unlabeled corpus in biopharmaceutical production to obtain a first word vector model;
s2, manually labeling unlabeled corpus in biopharmaceutical production to construct a data set;
s3, constructing a word vector +BiLSTM +CRF neural network model on the basis of the first word vector model, and training the model on the constructed data set to obtain a second word vector model;
s4, performing entity recognition on the biological product production term text to be recognized by using the second word vector model to obtain a recognition result;
s5, clustering entity word vectors in the data set into 20-50 clusters through an improved k-means clustering algorithm, and comparing cosine similarity between each cluster and the entity word vectors of the identified data set to realize entity classification of the biopharmaceutical production term text.
In this embodiment, referring to fig. 2, word vector training may be performed on the unlabeled corpus from biopharmaceutical production with the continuous bag-of-words (CBOW) model in Word2vec; the corpus consists of the characters and words of common terms in biopharmaceutical production.

Here, the continuous bag-of-words model predicts a center word from the 2c words before and after it, and can be expressed as:

L = (1/T) · Σ_{t=1}^{T} log p(w_t | w_{t-c}, …, w_{t-1}, w_{t+1}, …, w_{t+c})

where t denotes the current position, T the total number of positions, w_t is the center word in the current production term text, and w_{t-c}, …, w_{t-1}, w_{t+1}, …, w_{t+c} are the 2c words before and after it; the center word w_t is predicted from the known 2c words by the continuous bag-of-words model, so the probability of the center word w_t occurring is conditioned on the 2c words around it.
It should be noted that, in this embodiment, the continuous bag-of-words model first one-hot encodes the 2c words before and after the center word w_t into the corresponding input vectors x_{t-c}, …, x_{t+c}; these 2c vectors are fed to the input layer, multiplied by a shared weight matrix and averaged in the projection layer, and passed to the output layer, finally yielding the probability p(w_t | w_{t-c}, …, w_{t+c}) of the center word given the surrounding 2c words; training by maximum likelihood estimation yields the final word vector v_i for each word w_i. For convenience of presentation, the word vectors trained in this way are written {v_1, v_2, …, v_n}, where n represents the number of word vectors.
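The CBOW training above can be sketched in a few lines of NumPy. This is a minimal illustrative implementation, not the patent's actual training code: the toy corpus, vector dimension, learning rate and epoch count are all assumptions chosen for demonstration.

```python
import numpy as np

def train_cbow(tokens, window=2, dim=16, epochs=300, lr=0.05, seed=0):
    """Minimal CBOW: predict the center word from its surrounding 2c words
    via a shared weight matrix, trained by maximum likelihood (softmax)."""
    rng = np.random.default_rng(seed)
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    W_in = rng.normal(0.0, 0.1, (V, dim))    # shared input weights = word vectors
    W_out = rng.normal(0.0, 0.1, (dim, V))   # projection-to-output weights
    for _ in range(epochs):
        for t, center in enumerate(tokens):
            ctx = [index[tokens[j]]
                   for j in range(max(0, t - window), min(len(tokens), t + window + 1))
                   if j != t]
            h = W_in[ctx].mean(axis=0)                 # projection layer
            scores = h @ W_out
            p = np.exp(scores - scores.max())
            p /= p.sum()                               # softmax over the vocabulary
            p[index[center]] -= 1.0                    # gradient of cross-entropy loss
            g_h = W_out @ p
            W_out -= lr * np.outer(h, p)
            for c in ctx:                              # update each context word vector
                W_in[c] -= lr * g_h / len(ctx)
    return {w: W_in[i] for w, i in index.items()}

# illustrative tokens only; a real corpus would come from production records
vecs = train_cbow("冷冻 干燥 疫苗 原液 除菌 过滤 灌装".split())
```

In practice a library such as gensim's Word2Vec with the CBOW setting would replace this sketch for large corpora.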
In this embodiment, manually labeling the unlabeled corpus in biopharmaceutical production to construct the data set specifically includes:

preprocessing the original corpus, including deleting irrelevant content and special symbols and removing stop words;

preliminarily determining the entity categories to be recognized according to the actual production lines; the entity categories include preventive biological products (such as various vaccines, immunoglobulins, interferons and human coagulation factors), therapeutic biological products (such as antitoxins, human blood proteins, human interferons, human insulin, growth hormone and human epidermal growth factors) and in vivo/in vitro diagnostic products (such as protein derivatives and surface antigen detection reagents);

in the labeling process, the entities are labeled with the BIO scheme according to the characteristics of the production term text: B (Begin) marks the beginning of an entity, I (Inside) marks the non-beginning part of an entity, and O (Outside) marks the non-entity part.
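The BIO scheme can be illustrated with a short sketch; the sentence and entity below are hypothetical examples chosen for demonstration, not items from the patent's data set.

```python
def bio_tags(sentence, entities):
    """Label each character of a production-term sentence with B / I / O."""
    tags = ["O"] * len(sentence)
    for ent in entities:
        start = sentence.find(ent)
        if start >= 0:
            tags[start] = "B"                    # beginning of the entity
            for i in range(start + 1, start + len(ent)):
                tags[i] = "I"                    # non-beginning part of the entity
    return tags

# "重组人胰岛素" (recombinant human insulin) as a therapeutic-product entity
tags = bio_tags("生产重组人胰岛素原液", ["重组人胰岛素"])
```

Running this tags the six entity characters as B I I I I I and everything else as O.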
It should be noted that, referring to fig. 3, fig. 4 and fig. 5, the construction of the word vector+bilstm+crf neural network model may specifically include:
inputting the word vector obtained by training into a BiLSTM (Bi-directional Long Short-Term Memory) neural network model to obtain global features with context information;
inputting the obtained feature vector with the context information into a CRF, extracting the dependency features among labels, and calculating a loss function;
according to the loss function, the parameters of the entity recognition model are updated by stochastic gradient descent (SGD): a training sample is drawn at random, the gradient of the error on that sample with respect to the parameters is computed, and the parameter values are repeatedly moved in the negative gradient direction; iteration stops when the objective function reaches its minimum.
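The SGD update described above can be sketched on a toy one-parameter objective; the data, learning rate and step count are illustrative assumptions, not values from the patent.

```python
import random

# toy samples (x, y) generated from y = 3x; SGD should recover w ≈ 3
data = [(float(x), 3.0 * float(x)) for x in range(1, 6)]

def sgd(w, lr=0.01, steps=2000, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        x, y = rng.choice(data)        # randomly draw one training sample
        grad = 2.0 * (w * x - y) * x   # gradient of the squared error on that sample
        w -= lr * grad                 # step in the negative gradient direction
    return w

w = sgd(0.0)
```

Each step only sees one sample, yet the parameter converges to the minimizer of the full objective.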
LSTM is a special recurrent neural network, and the states of the BiLSTM neural network model neurons are calculated by the following formulas:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t ⊙ tanh(c_t)

where σ is the Sigmoid function; x_t is the input word vector at the current moment; h_{t-1} is the hidden state at the previous moment; f_t is the forget gate, which decides what information is forgotten; i_t is the input gate, which decides what information is retained; c̃_t is the candidate state obtained from the current input word vector x_t; c_t is the memory cell, which controls the change of the cell state; c_{t-1} is the cell state at the previous moment; o_t is the output gate value of the memory cell; h_t is the hidden state at the current moment; W_f, W_i, W_c and W_o are the weight matrices of the forget gate, input gate, hidden unit and output gate respectively; and b_f, b_i, b_c and b_o are the corresponding bias terms.
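The gate equations above correspond to one forward step of a single LSTM cell, sketched below in NumPy. The dimensions and random weights are illustrative; a BiLSTM runs two such chains, one forward and one backward over the sequence, and concatenates their hidden states.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the gate equations; W packs W_f, W_i, W_c, W_o."""
    z = np.concatenate([h_prev, x_t])         # [h_{t-1}, x_t]
    f = sigmoid(W["f"] @ z + b["f"])          # forget gate
    i = sigmoid(W["i"] @ z + b["i"])          # input gate
    c_hat = np.tanh(W["c"] @ z + b["c"])      # candidate state
    c = f * c_prev + i * c_hat                # memory cell update
    o = sigmoid(W["o"] @ z + b["o"])          # output gate
    h = o * np.tanh(c)                        # hidden state at the current moment
    return h, c

rng = np.random.default_rng(0)
dim_x, dim_h = 4, 3                           # illustrative sizes
W = {k: rng.normal(0.0, 0.1, (dim_h, dim_h + dim_x)) for k in "fico"}
b = {k: np.zeros(dim_h) for k in "fico"}
h_t, c_t = lstm_step(rng.normal(size=dim_x), np.zeros(dim_h), np.zeros(dim_h), W, b)
```

Because h_t is an output gate value times tanh(c_t), each component of the hidden state stays strictly inside (-1, 1).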
In the actual application process, inputting the obtained feature vector with context information into the CRF, extracting the dependency features between labels, and calculating the loss function specifically includes:

for the production term text, given the word vector sequence X = (x_1, x_2, …, x_n) of the input words and the predicted label sequence y = (y_1, y_2, …, y_n) corresponding to each input word, the prediction score s(X, y) of y is defined as:

s(X, y) = Σ_{i=1}^{n-1} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

where A is the transition matrix, a parameter matrix obtained by the CRF learning the ordering between labels, whose entries represent the probability of each label transitioning to the next label; P is the probability score matrix, transformed from the feature matrix with context information; P_{i,j} is the probability that the i-th word is labeled j; and t is the number of predicted labels, so P is an n × t matrix;

the probability of y is calculated from the defined prediction score with the Softmax function:

p(y | X) = exp(s(X, y)) / Σ_{y' ∈ Y_X} exp(s(X, y'))

the log-likelihood of this probability is:

log p(y | X) = s(X, y) - log Σ_{y' ∈ Y_X} exp(s(X, y'))

where y denotes the actual labeling sequence, Y_X denotes all possible labeling sequences, and s(X, y') denotes the scores of the other paths;

the loss function is defined as:

loss = -log p(y | X) = log Σ_{y' ∈ Y_X} exp(s(X, y')) - s(X, y)

finally, the predicted sequence is decoded with the Viterbi algorithm to obtain the labeling sequence y* with the maximum probability:

y* = argmax_{y' ∈ Y_X} s(X, y')
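The Viterbi decoding step can be sketched as follows, using an emission matrix P and transition matrix A as in the prediction score above; the toy matrices in the usage line are illustrative, not from the patent.

```python
import numpy as np

def viterbi(P, A):
    """Find the label sequence maximizing sum(P[i, y_i]) + sum(A[y_i, y_{i+1}])."""
    n, t = P.shape                                   # n words, t labels
    score = P[0].copy()                              # best score ending in each label
    back = np.zeros((n, t), dtype=int)               # backpointers
    for i in range(1, n):
        cand = score[:, None] + A + P[i][None, :]    # every previous-label -> label move
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for i in range(n - 1, 0, -1):                    # follow backpointers
        path.append(int(back[i][path[-1]]))
    return path[::-1]

# toy example: 3 words, 2 labels, no transition preference
best = viterbi(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]), np.zeros((2, 2)))
```

With these emissions the highest-scoring path simply follows the per-word maxima, label 0, then 1, then 0.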
In addition, performing entity recognition on the biological product production term text to be recognized using the second word vector model to obtain a recognition result specifically includes:

reading the production term text that requires entity recognition and inputting it into the trained word vector + BiLSTM + CRF model;

the production term text data is converted into word vectors by the continuous bag-of-words model; the word vectors go through feature extraction by the BiLSTM neural network to obtain feature vectors with global information; finally, the most probable labeling sequence of each sentence is obtained in the CRF with the Viterbi algorithm, which is the entity recognition result for the production terms.
In this embodiment, the improvement of the k-means clustering algorithm is specifically as follows: after the word vectors are normalized, a cosine similarity distance D(x, y) is defined to replace the original Euclidean distance calculation, thereby improving the K-means algorithm. The principle of the improvement is:

for the defined production term entities, let the word vectors after word vectorization be {w_1, w_2, …, w_n}. Taking any two word vectors x and y and normalizing them (so that |x| = |y| = 1), it can be deduced that:

d(x, y)² = |x - y|² = |x|² + |y|² - 2 x·y = 2(1 - cos(x, y))

where d(x, y) is the Euclidean distance between x and y and cos(x, y) is their cosine similarity. From this distance equivalence, the improved cosine similarity distance D(x, y) is defined as:

D(x, y) = 1 - cos(x, y)

which gives:

d(x, y)² = 2 D(x, y)

according to the criterion that the sum-of-squared-error function must decrease, a locally optimal solution is solved iteratively from the initial word vectors, finding the k partitions that minimize the squared error. The minimized squared error is:

E = Σ_{i=1}^{k} Σ_{x ∈ C_i} D(x, μ_i)

where μ_i = (1/|C_i|) Σ_{x ∈ C_i} x is the mean vector of cluster C_i; E characterizes how tightly the entities in a cluster surround the cluster mean vector, and the smaller its value, the higher the similarity of the entities within the cluster.
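A minimal sketch of k-means with the cosine distance D(x, y) = 1 - cos(x, y) is given below. The deterministic initialization and the toy vectors are illustrative choices for the sketch, not details from the patent.

```python
import numpy as np

def cosine_kmeans(X, k, iters=50):
    """k-means on L2-normalized vectors using distance D(x, y) = 1 - cos(x, y)."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)    # normalize word vectors
    centers = X[:k].copy()                              # simple deterministic init
    for _ in range(iters):
        sims = X @ centers.T                            # cosine similarity to each center
        labels = sims.argmax(axis=1)                    # smallest D == largest cos
        for j in range(k):
            members = X[labels == j]
            if len(members):                            # keep old center if cluster empties
                m = members.mean(axis=0)
                centers[j] = m / np.linalg.norm(m)
    return labels, centers

# two obvious directions -> two clusters
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels, centers = cosine_kmeans(X, k=2)
```

Because the vectors are normalized, assigning each point to the center with the largest cosine similarity is exactly assigning it to the smallest D(x, μ_i), per the distance equivalence derived above.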
Finally, entity classification of the biopharmaceutical production term text is achieved by comparing the cosine similarity between each cluster and the entity word vectors of the identified data set, which specifically comprises the following steps:
extracting the 5-10 production term entities closest to the centroid from each of the 20-50 clusters obtained from the data set, calculating the cosine similarity between their word vectors and the production term entity to be classified in the test set, taking the average of these cosine similarities as the cosine similarity judgment value between the cluster and the entity to be classified, and assigning the production term entity to be classified to the cluster with the largest cosine similarity, thereby completing the classification task;
the cosine similarity calculation method is as follows:
let the word vector of an entity in the training set be $A$ and the word vector of the production term entity to be classified be $B$; then the cosine similarity of $A$ and $B$ is calculated as:

$$\cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

where $\cos(A, B) \in [-1, 1]$; the larger the value, the higher the association between $A$ and $B$, i.e., the closer $\cos(A, B)$ is to 1, the more similar $A$ and $B$ are.
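The classification rule, averaging cosine similarity against the entities nearest each cluster centroid and choosing the best-scoring cluster, can be sketched as follows (hypothetical cluster names and toy 2-d vectors; the described method would use 5-10 representatives drawn from each of the 20-50 clusters):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(A, B) = A . B / (||A|| * ||B||), a value in [-1, 1]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify_entity(entity_vec, cluster_representatives):
    """Assign an entity to the cluster with the largest average cosine similarity.

    cluster_representatives: dict mapping a cluster id to the list of word vectors
    of the production term entities nearest that cluster's centroid.
    """
    scores = {
        cid: sum(cosine_similarity(entity_vec, r) for r in reps) / len(reps)
        for cid, reps in cluster_representatives.items()
    }
    return max(scores, key=scores.get)

# toy example with two clusters in a 2-d vector space
reps = {
    "vaccine": [np.array([1.0, 0.1]), np.array([0.9, 0.2])],
    "diagnostic": [np.array([0.1, 1.0]), np.array([0.2, 0.9])],
}
print(classify_entity(np.array([0.95, 0.15]), reps))  # → vaccine
```

Averaging over several near-centroid representatives rather than comparing to the centroid alone makes the judgment value less sensitive to any single atypical entity in the cluster.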
Therefore, the CBOW model used by the invention can be trained unsupervised on large-scale unlabeled data to obtain word vectors with strong semantic expression capability as the input of the subsequent model; moreover, the introduced BiLSTM neural network extracts global features carrying the context semantic information of the production term text sequence; meanwhile, on the basis of the global features extracted by the BiLSTM, the dependency relationships between labels are learned through the CRF, which improves the accuracy of entity identification in production term text; finally, by redefining the cosine similarity distance $D(x, y)$ to replace the original Euclidean distance calculation method in the K-means algorithm, the method is suited to entity classification of biopharmaceutical production terms, and the whole process is simple to operate and highly portable.
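As an illustration of the CBOW objective summarized above, predicting a center word from the average of its context vectors, a minimal NumPy trainer might look like this (a didactic sketch only; the corpus, dimensions and learning rate are hypothetical, and a real system would use an optimized Word2vec implementation):

```python
import numpy as np

def train_cbow(sentences, dim=16, window=2, lr=0.05, epochs=50, seed=0):
    """Minimal CBOW: predict each center word from its averaged context vectors."""
    rng = np.random.default_rng(seed)
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    W_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # input word vectors
    W_out = rng.normal(scale=0.1, size=(len(vocab), dim))  # output weights
    for _ in range(epochs):
        for s in sentences:
            for t, center in enumerate(s):
                ctx = [idx[s[j]]
                       for j in range(max(0, t - window), min(len(s), t + window + 1))
                       if j != t]
                if not ctx:
                    continue
                h = W_in[ctx].mean(axis=0)       # projection layer (averaged context)
                scores = W_out @ h
                p = np.exp(scores - scores.max())
                p /= p.sum()                     # softmax over the vocabulary
                p[idx[center]] -= 1.0            # gradient of the cross-entropy loss
                grad_h = W_out.T @ p
                W_out -= lr * np.outer(p, h)
                for c in ctx:                    # distribute the gradient to context words
                    W_in[c] -= lr * grad_h / len(ctx)
    return {w: W_in[idx[w]] for w in vocab}

vectors = train_cbow([["cell", "culture", "medium"], ["cell", "fermentation", "medium"]], dim=8)
```

After training, `W_in` holds the word vectors that would be fed to the downstream BiLSTM + CRF model.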
The above is only a preferred embodiment of the present invention and is not intended to limit it; those skilled in the art may make various modifications and variations to the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. A method for entity identification and classification of biological product production terms, characterized in that it comprises the following steps:
word vector training is carried out on unlabeled corpus in biopharmaceutical production, and a first word vector model is obtained;
manually labeling the unlabeled corpus in the biopharmaceutical production to construct a data set;
constructing a word vector + BiLSTM + CRF neural network model on the basis of the first word vector model, and training the model on the constructed data set to obtain a second word vector model;
performing entity recognition on the biological product production term text to be recognized by using the second word vector model to obtain a recognition result;
clustering the entity word vectors in the data set into 20-50 clusters through a modified k-means clustering algorithm, and achieving entity classification of the biopharmaceutical production term text by comparing the cosine similarity between each cluster and the entity word vectors of the identified data set.
2. The method for entity identification and classification of biological product production terms according to claim 1, wherein word vector training is performed on the unlabeled corpus in biopharmaceutical production using the continuous bag-of-words (CBOW) model in Word2vec, and the corpus selects the characters or words of common terms in biopharmaceutical production;
the continuous bag-of-words model predicts a center word using the information of the 2c words before and after it, and the model is expressed as:

$$L = \sum_{t=1}^{T} \log p\big(w_t \mid context(w_t)\big)$$

where t denotes the current time step, T denotes the total number of time steps, $w_t$ is the center word in the current production term text, and $context(w_t)$ denotes the 2c words before and after the center word; the task at this time is for the continuous bag-of-words model to predict the center word $w_t$ from the known 2c words, and the probability of the center word $w_t$ occurring is related to the 2c words before and after it.
3. The method for entity identification and classification of biological product production terms according to claim 2, wherein the continuous bag-of-words model first encodes, by one-hot encoding, the 2c words before and after the center word $w_t$ into the corresponding word vectors $v(w_{t-c}), \ldots, v(w_{t-1}), v(w_{t+1}), \ldots, v(w_{t+c})$; the 2c word vectors are then fed to the input layer, multiplied by a shared weight matrix in the projection layer, and sent to the output layer, finally yielding the probability $p(w_t \mid context(w_t))$ of the center word given its 2c surrounding words; the final word vector $v(w)$ of each word $w$ is obtained by training with the maximum likelihood estimation method, and the uniformly trained word vectors are $X = \{x_1, x_2, \ldots, x_n\}$, where n denotes the number of word vectors.
4. The method for identifying and classifying entities for biological product production terms according to claim 1, wherein the manually labeling unlabeled corpus in biopharmaceutical production to construct a dataset comprises:
preprocessing the original corpus, including deleting irrelevant content and special symbols and removing stop words;
preliminarily determining the entity categories to be identified according to the differences between actual production lines, wherein the entity categories comprise preventive biological products, therapeutic biological products, and in-vivo and in-vitro diagnostic products;
in the labeling process, according to the characteristics of the text of the production term, the entities are labeled by adopting a BIO labeling method, the beginning part of the entity is represented by B, the non-beginning part of the entity is represented by I, and the non-entity part is represented by O.
5. The method for entity identification and classification of biological product production terms according to claim 1, wherein said constructing a word vector + BiLSTM + CRF neural network model specifically comprises:
inputting the word vector obtained by training into a BiLSTM neural network model to obtain global features with context information;
inputting the obtained feature vector with the context information into a CRF, extracting the dependency features among labels, and calculating a loss function;
updating the parameters of the entity identification model according to the loss function by stochastic gradient descent (SGD), which proceeds as follows: a training sample is randomly drawn, the gradient of the error on this sample with respect to the parameters is calculated, and the parameter values are then updated continuously in the negative gradient direction until the objective function reaches its minimum and the iteration stops.
6. The method of claim 5, wherein the states of the BiLSTM neural network model neurons are calculated by the following formulas:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(c_t)$$

where $\sigma$ is the Sigmoid function; $x_t$ is the input word vector at the current moment; $h_{t-1}$ is the hidden layer state at the previous moment; $f_t$ is the forget gate, which decides the category of information to be forgotten; $i_t$ is the input gate, which determines the category of information to be retained; $\tilde{c}_t$ is the intermediate state obtained from the input word vector $x_t$ at the current moment; $c_t$ is the memory cell, which controls the change of the cell state; $c_{t-1}$ is the state value at the previous moment; $o_t$ is the output value of the memory cell; $h_t$ is the hidden layer state at the current moment; $W_f$, $W_i$, $W_c$ and $W_o$ denote the feedback connection matrices of the forget gate, the input gate, the hidden unit and the output gate, respectively; and $b_f$, $b_i$, $b_c$ and $b_o$ denote the thresholds of the forget gate, the input gate, the hidden layer unit and the output gate, respectively.
7. The method for entity identification and classification of biological product production terms according to claim 5, wherein said inputting the obtained feature vectors with context information into the CRF, extracting the dependency features between labels, and calculating the loss function specifically comprises:
given the word vector sequence $X = (x_1, x_2, \ldots, x_n)$ corresponding to the input words of the production term text and the predicted label sequence $y = (y_1, y_2, \ldots, y_n)$ corresponding to each input word, the prediction score $s(X, y)$ of y is defined as:

$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

where $A$ is the transition matrix, a parameter matrix obtained by the CRF learning the ordering between labels, which represents the probability of each label transitioning to the next label; $P$ is the probability score matrix, transformed from the feature matrix with context information; $P_{i, j}$ is the probability that the i-th word is marked with label j; and t is the number of predicted label types;
the probability of y is calculated from the defined prediction score according to the Softmax function:

$$P(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$

the log-likelihood function of this probability is:

$$\log P(y \mid X) = s(X, y) - \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}$$

where $y$ denotes the actual labeling sequence, $Y_X$ denotes all possible labeling sequences, and $s(X, \tilde{y})$ denotes the scores of the other paths;
the loss function is defined as:

$$loss = \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})} - s(X, y)$$

finally, the predicted sequence is decoded by the Viterbi algorithm to obtain the prediction labeling sequence $y^*$ with the maximum probability, expressed as:

$$y^* = \underset{\tilde{y} \in Y_X}{\arg\max}\; s(X, \tilde{y})$$
8. The method for entity identification and classification of biological product production terms according to claim 1, wherein said performing entity recognition on the biological product production term text to be recognized by using the second word vector model to obtain a recognition result specifically comprises:
reading the production term text that needs entity recognition, and inputting it into the trained word vector + BiLSTM + CRF model;
the text data of the production terms is converted into word vectors by the continuous bag-of-words model; the word vectors undergo feature extraction through the BiLSTM neural network to obtain feature vectors with global information; finally, the most probable labeling sequence of each sentence is obtained in the CRF by the Viterbi algorithm, which is the entity recognition result of the production terms.
9. The method for entity identification and classification of biological product production terms according to claim 1, wherein the k-means clustering algorithm is modified specifically as follows: after normalizing the word vectors, a cosine similarity distance $D(x, y)$ is redefined to replace the original Euclidean distance calculation method, thereby improving the K-means algorithm; the principle of the improvement is as follows:
according to the defined production term entities, the word vectors of the corresponding production term entities after word vectorization are $X = \{x_1, x_2, \ldots, x_n\}$; taking any two word vectors $x$ and $y$ and normalizing them ($\|x\| = \|y\| = 1$), it can be deduced that:

$$d(x, y)^2 = \|x - y\|^2 = 2 - 2\cos(x, y)$$

where $d(x, y)$ is the Euclidean distance between $x$ and $y$, and $\cos(x, y)$ is the cosine similarity of $x$ and $y$; from this distance equivalence, the improved cosine similarity distance $D(x, y)$ is defined as:

$$D(x, y) = 1 - \cos(x, y)$$

this gives:

$$d(x, y)^2 = 2\,D(x, y)$$

according to the criterion that the sum-of-squared-errors criterion function decreases, a local optimal solution is solved iteratively starting from the initial word vectors, so as to find the k partitions that minimize the squared-error function value; the formula for minimizing the squared error is:

$$E = \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i\|^2$$

where $\mu_i$ is the mean vector of cluster $C_i$, and $E$ characterizes how closely the entities within a cluster gather around the cluster mean vector: the smaller the value, the higher the intra-cluster entity similarity.
10. The method for entity identification and classification of biopharmaceutical production terms according to claim 1, wherein the entity classification of the biopharmaceutical production term text is achieved by comparing the cosine similarity between each cluster and the entity word vectors of the identified data set, which specifically comprises:
extracting the 5-10 production term entities closest to the centroid from each of the 20-50 clusters obtained from the data set, calculating the cosine similarity between their word vectors and the production term entity to be classified in the test set, taking the average of these cosine similarities as the cosine similarity judgment value between the cluster and the entity to be classified, and assigning the production term entity to be classified to the cluster with the largest cosine similarity, thereby completing the classification task;
the cosine similarity calculation method is as follows:
let the word vector of an entity in the training set be $A$ and the word vector of the production term entity to be classified be $B$; then the cosine similarity of $A$ and $B$ is calculated as:

$$\cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

where $\cos(A, B) \in [-1, 1]$; the larger the value, the higher the association between $A$ and $B$, i.e., the closer $\cos(A, B)$ is to 1, the more similar $A$ and $B$ are.
CN202310665618.9A 2023-06-07 2023-06-07 Entity identification and classification method for biological product production terms Active CN116401369B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310665618.9A CN116401369B (en) 2023-06-07 2023-06-07 Entity identification and classification method for biological product production terms

Publications (2)

Publication Number Publication Date
CN116401369A true CN116401369A (en) 2023-07-07
CN116401369B CN116401369B (en) 2023-08-11

Family

ID=87018329

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310665618.9A Active CN116401369B (en) 2023-06-07 2023-06-07 Entity identification and classification method for biological product production terms

Country Status (1)

Country Link
CN (1) CN116401369B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117454892A (en) * 2023-12-20 2024-01-26 深圳市智慧城市科技发展集团有限公司 Metadata management method, device, terminal equipment and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298042A (en) * 2019-06-26 2019-10-01 四川长虹电器股份有限公司 Based on Bilstm-crf and knowledge mapping video display entity recognition method
CN113191148A (en) * 2021-04-30 2021-07-30 西安理工大学 Rail transit entity identification method based on semi-supervised learning and clustering
CN114091460A (en) * 2021-11-24 2022-02-25 长沙理工大学 Multitask Chinese entity naming identification method
CN114996287A (en) * 2022-06-20 2022-09-02 上海电器科学研究所(集团)有限公司 Automatic equipment identification and capacity expansion method based on feature library
CN115510864A (en) * 2022-10-14 2022-12-23 昆明理工大学 Chinese crop disease and pest named entity recognition method fused with domain dictionary
CN115859980A (en) * 2022-11-24 2023-03-28 山东鲁软数字科技有限公司 Semi-supervised named entity identification method, system and electronic equipment
CN115859164A (en) * 2022-09-09 2023-03-28 第三维度(河南)软件科技有限公司 Method and system for identifying and classifying building entities based on prompt
CN116127084A (en) * 2022-10-21 2023-05-16 中国农业大学 Knowledge graph-based micro-grid scheduling strategy intelligent retrieval system and method
CN116186266A (en) * 2023-03-06 2023-05-30 欧冶工业品股份有限公司 BERT (binary image analysis) and NER (New image analysis) entity extraction and knowledge graph material classification optimization method and system
CN116187444A (en) * 2023-03-01 2023-05-30 中国人民解放军国防科技大学 K-means++ based professional field sensitive entity knowledge base construction method


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHUANHAI DONG ET AL.: "Character-based LSTM-CRF with Radical-level Features for Chinese Named Entity Recognition", Natural Language Understanding and Intelligent Applications 2016, pages 239-250 *
LIU YUE: "Improvement of the K-means Clustering Algorithm", China Master's Theses Full-text Database (Information Science and Technology), no. 2, pages 138-2336 *
LI HANGYU: "Research on the Cross-domain and Cross-style Characteristics of Adaptive Features in Named Entity Recognition", Software, vol. 35, no. 10, pages 100-106 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117454892A (en) * 2023-12-20 2024-01-26 深圳市智慧城市科技发展集团有限公司 Metadata management method, device, terminal equipment and storage medium
CN117454892B (en) * 2023-12-20 2024-04-02 深圳市智慧城市科技发展集团有限公司 Metadata management method, device, terminal equipment and storage medium

Also Published As

Publication number Publication date
CN116401369B (en) 2023-08-11

Similar Documents

Publication Publication Date Title
Hasani et al. Spatio-temporal facial expression recognition using convolutional neural networks and conditional random fields
CN110969020B (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN113033249A (en) Character recognition method, device, terminal and computer storage medium thereof
CN112614538A (en) Antibacterial peptide prediction method and device based on protein pre-training characterization learning
CN110110324B (en) Biomedical entity linking method based on knowledge representation
CN111476024A (en) Text word segmentation method and device and model training method
CN116401369B (en) Entity identification and classification method for biological product production terms
WO2010062268A1 (en) A method for updating a 2 dimensional linear discriminant analysis (2dlda) classifier engine
Thomas et al. A deep HMM model for multiple keywords spotting in handwritten documents
Kumar et al. Future of machine learning (ML) and deep learning (DL) in healthcare monitoring system
CN111581974A (en) Biomedical entity identification method based on deep learning
CN113705238A (en) Method and model for analyzing aspect level emotion based on BERT and aspect feature positioning model
WO2020108808A1 (en) Method and system for classification of data
Chen et al. DeepGly: A deep learning framework with recurrent and convolutional neural networks to identify protein glycation sites from imbalanced data
Rahman et al. IDMIL: an alignment-free Interpretable Deep Multiple Instance Learning (MIL) for predicting disease from whole-metagenomic data
CN113191150B (en) Multi-feature fusion Chinese medical text named entity identification method
CN118013038A (en) Text increment relation extraction method based on prototype clustering
Missaoui et al. Multi-stream continuous hidden Markov models with application to landmine detection
CN113312907A (en) Remote supervision relation extraction method and device based on hybrid neural network
Wayahdi et al. KNN and XGBoost Algorithms for Lung Cancer Prediction
CN116757195A (en) Implicit emotion recognition method based on prompt learning
CN116386733A (en) Protein function prediction method based on multi-view multi-scale multi-attention mechanism
CN116127097A (en) Structured text relation extraction method, device and equipment
Kim Probabilistic sequence translation-alignment model for time-series classification
CN114036947A (en) Small sample text classification method and system for semi-supervised learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant