CN116401369A - Entity identification and classification method for biological product production terms - Google Patents

Entity identification and classification method for biological product production terms Download PDF

Info

Publication number
CN116401369A
CN116401369A (application CN202310665618.9A; granted as CN116401369B)
Authority
CN
China
Prior art keywords
word
production
entity
word vector
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310665618.9A
Other languages
Chinese (zh)
Other versions
CN116401369B (en)
Inventor
杨春
曾茂迪
李俊谚
陈跃辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baimus Chengdu Digital Technology Co ltd
Original Assignee
Baimus Chengdu Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baimus Chengdu Digital Technology Co ltd filed Critical Baimus Chengdu Digital Technology Co ltd
Priority to CN202310665618.9A priority Critical patent/CN116401369B/en
Publication of CN116401369A publication Critical patent/CN116401369A/en
Application granted granted Critical
Publication of CN116401369B publication Critical patent/CN116401369B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/23 - Clustering techniques
    • G06F 18/232 - Non-hierarchical techniques
    • G06F 18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 - Non-hierarchical techniques with a fixed number of clusters, e.g. K-means clustering
    • G06F 18/24 - Classification techniques
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 - Named entity recognition
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442 - Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00 - Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30 - Computing systems specially adapted for manufacturing


Abstract

The invention belongs to the field of entity identification of production terms in the biopharmaceutical production process and provides an entity identification and classification method for biological product production terms, comprising the following steps: word vector training is performed on unlabeled corpus from biopharmaceutical production to obtain a first word vector model; the unlabeled corpus is manually labeled to construct a data set; a word vector + BiLSTM + CRF neural network model is built on the basis of the first word vector model and trained on the constructed data set to obtain a second word vector model; entity recognition is performed on the biological product production term text to be recognized using the second word vector model to obtain a recognition result; finally, the entity word vectors in the data set are clustered into 20-50 clusters by an improved k-means clustering algorithm, and entity classification of the biopharmaceutical production term text is achieved by comparing the cosine similarity between each cluster and the recognized entity word vectors of the data set.

Description

Entity identification and classification method for biological product production terms
Technical Field
The invention relates to the field of entity identification of production terms in the biopharmaceutical production process, in particular to an entity identification and classification method for the production terms of biological products.
Background
With the continuous deepening of intelligent manufacturing, production terms in the biopharmaceutical industry need to be processed by machine learning and recognized automatically by computer. An entity recognition and classification method is an important basis for realizing intelligent production and control, and is also an underlying technology for information processing in intelligent manufacturing.
Disclosure of Invention
The invention aims to provide an entity identification and classification method for biological product production terms, which can accurately realize automatic identification and classification of biological product production terms.
The invention solves the technical problems and adopts the following technical scheme:
an entity identification and classification method for biological product production terms, comprising the steps of:
word vector training is carried out on unlabeled corpus in biopharmaceutical production, and a first word vector model is obtained;
manually labeling the unlabeled corpus in the biopharmaceutical production to construct a data set;
constructing a word vector +BiLSTM +CRF neural network model on the basis of the first word vector model, and training the model on the constructed data set to obtain a second word vector model;
performing entity recognition on the biological product production term text to be recognized by using the second word vector model to obtain a recognition result;
the entity word vectors in the data set are clustered into 20-50 clusters through a modified k-means clustering algorithm, and entity classification of the biopharmaceutical production term text is achieved by comparing cosine similarity between each cluster and the entity word vectors of the identified data set.
As a further explanation, word vector training is carried out on the unlabeled corpus from biopharmaceutical production using the continuous bag-of-words (CBOW) model in Word2vec; the corpus consists of the characters and words of common terms in biopharmaceutical production.

The continuous bag-of-words model predicts a center word from the 2c words before and after it, and its objective can be expressed as:

L = (1/T) · Σ_{t=1}^{T} log p(w_t | w_{t-c}, …, w_{t-1}, w_{t+1}, …, w_{t+c})

where t denotes the current position, T the total number of positions, w_t is the center word in the current production term text, and w_{t-c}, …, w_{t-1}, w_{t+1}, …, w_{t+c} are the 2c words before and after it. The center word w_t is predicted from the known 2c words by the continuous bag-of-words model, so the probability of the center word w_t occurring is conditioned on the 2c words around it.
As a further illustration, the continuous bag-of-words model first one-hot encodes the 2c words before and after the center word w_t into the corresponding input vectors x_{t-c}, …, x_{t+c}. These 2c vectors are fed to the input layer, multiplied by a shared weight matrix and averaged in the projection layer, and passed to the output layer, which finally yields the probability p(w_t | w_{t-c}, …, w_{t+c}) of the center word given the surrounding 2c words. Training by maximum likelihood estimation produces the final word vector v_i for each word w_i; the word vectors trained in this unified way are written {v_1, v_2, …, v_n}, where n is the number of word vectors.
As a further illustration, manually labeling the unlabeled corpus in biopharmaceutical production to construct the data set specifically includes:

preprocessing the original corpus, including deleting irrelevant content and special symbols and removing stop words;

preliminarily determining the entity categories to be recognized according to the actual production lines; the entity categories include preventive biological products, therapeutic biological products, and in vivo/in vitro diagnostic products;

in the labeling process, the entities are labeled with the BIO scheme according to the characteristics of the production term text: B marks the beginning of an entity, I marks the non-beginning part of an entity, and O marks the non-entity part.
As a further illustration, the construction of the word vector +BiLSTM +CRF neural network model specifically comprises:
inputting the word vector obtained by training into a BiLSTM neural network model to obtain global features with context information;
inputting the obtained feature vector with the context information into a CRF, extracting the dependency features among labels, and calculating a loss function;
according to the loss function, the parameters of the entity recognition model are updated by stochastic gradient descent (SGD): a training sample is drawn at random, the gradient of the error on that sample with respect to the parameters is computed, and the parameter values are repeatedly moved in the negative gradient direction; iteration stops when the objective function reaches its minimum.
By way of further illustration, the states of the BiLSTM neural network neurons are calculated by the following formulas:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t ⊙ tanh(c_t)

where σ is the Sigmoid function; x_t is the input word vector at the current moment; h_{t-1} is the hidden state at the previous moment; f_t is the forget gate, which decides what information is forgotten; i_t is the input gate, which decides what information is retained; c̃_t is the candidate state obtained from the current input word vector x_t; c_t is the memory cell, which controls the change of the cell state; c_{t-1} is the cell state at the previous moment; o_t is the output gate value of the memory cell; h_t is the hidden state at the current moment; W_f, W_i, W_c and W_o are the weight matrices of the forget gate, input gate, hidden unit and output gate respectively; and b_f, b_i, b_c and b_o are the corresponding bias terms.
As a further illustration, inputting the obtained feature vector with context information into the CRF, extracting the dependency features between labels, and calculating the loss function specifically includes:

for the production term text, given the word vector sequence X = (x_1, x_2, …, x_n) of the input words and the predicted label sequence y = (y_1, y_2, …, y_n) corresponding to each input word, the prediction score s(X, y) of y is defined as:

s(X, y) = Σ_{i=1}^{n-1} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

where A is the transition matrix, a parameter matrix obtained by the CRF learning the ordering between labels, whose entries represent the probability of each label transitioning to the next label; P is the probability score matrix, transformed from the feature matrix with context information; P_{i,j} is the probability that the i-th word is labeled j; and t is the number of predicted labels, so P is an n × t matrix;

the probability of y is calculated from the defined prediction score with the Softmax function:

p(y | X) = exp(s(X, y)) / Σ_{y' ∈ Y_X} exp(s(X, y'))

the log-likelihood of this probability is:

log p(y | X) = s(X, y) - log Σ_{y' ∈ Y_X} exp(s(X, y'))

where y denotes the actual labeling sequence, Y_X denotes all possible labeling sequences, and s(X, y') denotes the scores of the other paths;

the loss function is defined as:

loss = -log p(y | X) = log Σ_{y' ∈ Y_X} exp(s(X, y')) - s(X, y)

finally, the predicted sequence is decoded with the Viterbi algorithm to obtain the labeling sequence y* with the maximum probability:

y* = argmax_{y' ∈ Y_X} s(X, y')
as a further explanation, the entity recognition is performed on the text of the biological product production term to be recognized by using the second word vector model to obtain a recognition result, which specifically includes:
reading a production term text which needs entity recognition, and inputting the production term text into a trained word vector +BiLSTM +CRF model;
the text data of the production term is converted into word vectors after passing through a continuous word bag model, the word vectors are subjected to feature extraction through a BiLSTM neural network to obtain feature vectors with global information, and finally, the maximum possible labeling sequence of each sentence of language in the text is obtained in a CRF (color filter) by adopting a Viterbi algorithm, namely, the entity recognition result of the production term is produced.
By way of further illustration, the improvement of the k-means clustering algorithm is as follows: after the word vectors are normalized, a cosine similarity distance D(x, y) is defined to replace the original Euclidean distance calculation, thereby improving the K-means algorithm. The principle of the improvement is:

for the defined production term entities, let the word vectors after word vectorization be {w_1, w_2, …, w_n}. Taking any two word vectors x and y and normalizing them (so that |x| = |y| = 1), it can be deduced that:

d(x, y)² = |x - y|² = |x|² + |y|² - 2 x·y = 2(1 - cos(x, y))

where d(x, y) is the Euclidean distance between x and y and cos(x, y) is their cosine similarity. From this distance equivalence, the improved cosine similarity distance D(x, y) is defined as:

D(x, y) = 1 - cos(x, y)

which gives:

d(x, y)² = 2 D(x, y)

according to the criterion that the sum-of-squared-error function must decrease, a locally optimal solution is solved iteratively from the initial word vectors, finding the k partitions that minimize the squared error. The minimized squared error is:

E = Σ_{i=1}^{k} Σ_{x ∈ C_i} D(x, μ_i)

where μ_i = (1/|C_i|) Σ_{x ∈ C_i} x is the mean vector of cluster C_i; E characterizes how tightly the entities in a cluster surround the cluster mean vector, and the smaller its value, the higher the similarity of the entities within the cluster.
By way of further illustration, the entity classification of the biopharmaceutical production term text is accomplished by comparing the cosine similarity between each cluster and the recognized entity word vectors, and specifically includes:

for each of the 20-50 clusters obtained from the data set, extracting the 5-10 production term entities closest to the centroid; calculating the cosine similarity between their word vectors and the word vector of each production term entity to be classified in the test set, and taking the mean of these cosine similarities as the similarity judgment value between the cluster and the entity to be classified; the production term entity to be classified is then assigned to the cluster with the largest cosine similarity, completing the classification task;

the cosine similarity is calculated as follows: let the word vector of an entity in the training set be A and the word vector of the entity to be classified be B; then the cosine similarity of A and B is:

cos(A, B) = (A · B) / (|A| · |B|)

where cos(A, B) ∈ [-1, 1]; the larger the value, the higher the association between A and B, i.e. the closer cos(A, B) is to 1, the more similar A and B are.
The beneficial effects of the invention are as follows: with this entity identification and classification method for biological product production terms, the characters and words commonly used in biopharmaceutical production can be mapped into a vector space by a deep-learning-based method; the vectors are input into a neural network for feature extraction, labeling prediction is then performed in combination with the CRF (conditional random field), a comparatively accurate recognition result is output, and reasonable classification is carried out.
Drawings
FIG. 1 is a flow chart of a method for identifying and classifying entities for use in terms of biological product production in accordance with an embodiment of the present invention;
FIG. 2 is a block diagram of a CBOW model of an embodiment of the invention;
FIG. 3 is a block diagram of a BiLSTM neural network in accordance with an embodiment of the present invention;
FIG. 4 is a block diagram of a CRF of an embodiment of the invention;
FIG. 5 is a block diagram of a word vector +BiLSTM +CRF neural network model, in accordance with an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
The embodiment provides a method for identifying and classifying entities for biological product production terms, the flow chart of which is shown in fig. 1, wherein the method comprises the following steps:
s1, carrying out word vector training on unlabeled corpus in biopharmaceutical production to obtain a first word vector model;
s2, manually labeling unlabeled corpus in biopharmaceutical production to construct a data set;
s3, constructing a word vector +BiLSTM +CRF neural network model on the basis of the first word vector model, and training the model on the constructed data set to obtain a second word vector model;
s4, performing entity recognition on the biological product production term text to be recognized by using the second word vector model to obtain a recognition result;
s5, clustering entity word vectors in the data set into 20-50 clusters through an improved k-means clustering algorithm, and comparing cosine similarity between each cluster and the entity word vectors of the identified data set to realize entity classification of the biopharmaceutical production term text.
In this embodiment, referring to fig. 2, word vector training may be performed on the unlabeled corpus from biopharmaceutical production with the continuous bag-of-words (CBOW) model in Word2vec; the corpus consists of the characters and words of common terms in biopharmaceutical production.

Here, the continuous bag-of-words model predicts a center word from the 2c words before and after it, and can be expressed as:

L = (1/T) · Σ_{t=1}^{T} log p(w_t | w_{t-c}, …, w_{t-1}, w_{t+1}, …, w_{t+c})

where t denotes the current position, T the total number of positions, w_t is the center word in the current production term text, and w_{t-c}, …, w_{t-1}, w_{t+1}, …, w_{t+c} are the 2c words before and after it; the center word w_t is predicted from the known 2c words by the continuous bag-of-words model, so the probability of the center word w_t occurring is conditioned on the 2c words around it.
It should be noted that, in this embodiment, the continuous bag-of-words model first one-hot encodes the 2c words before and after the center word w_t into the corresponding input vectors x_{t-c}, …, x_{t+c}; these 2c vectors are fed to the input layer, multiplied by a shared weight matrix and averaged in the projection layer, and passed to the output layer, finally yielding the probability p(w_t | w_{t-c}, …, w_{t+c}) of the center word given the surrounding 2c words; training by maximum likelihood estimation yields the final word vector v_i for each word w_i. For convenience of presentation, the word vectors trained in this way are written {v_1, v_2, …, v_n}, where n represents the number of word vectors.
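The CBOW training above can be sketched in a few lines of NumPy. This is a minimal illustrative implementation, not the patent's actual training code: the toy corpus, vector dimension, learning rate and epoch count are all assumptions chosen for demonstration.

```python
import numpy as np

def train_cbow(tokens, window=2, dim=16, epochs=300, lr=0.05, seed=0):
    """Minimal CBOW: predict the center word from its surrounding 2c words
    via a shared weight matrix, trained by maximum likelihood (softmax)."""
    rng = np.random.default_rng(seed)
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    W_in = rng.normal(0.0, 0.1, (V, dim))    # shared input weights = word vectors
    W_out = rng.normal(0.0, 0.1, (dim, V))   # projection-to-output weights
    for _ in range(epochs):
        for t, center in enumerate(tokens):
            ctx = [index[tokens[j]]
                   for j in range(max(0, t - window), min(len(tokens), t + window + 1))
                   if j != t]
            h = W_in[ctx].mean(axis=0)                 # projection layer
            scores = h @ W_out
            p = np.exp(scores - scores.max())
            p /= p.sum()                               # softmax over the vocabulary
            p[index[center]] -= 1.0                    # gradient of cross-entropy loss
            g_h = W_out @ p
            W_out -= lr * np.outer(h, p)
            for c in ctx:                              # update each context word vector
                W_in[c] -= lr * g_h / len(ctx)
    return {w: W_in[i] for w, i in index.items()}

# illustrative tokens only; a real corpus would come from production records
vecs = train_cbow("冷冻 干燥 疫苗 原液 除菌 过滤 灌装".split())
```

In practice a library such as gensim's Word2Vec with the CBOW setting would replace this sketch for large corpora.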
In this embodiment, manually labeling the unlabeled corpus in biopharmaceutical production to construct the data set specifically includes:

preprocessing the original corpus, including deleting irrelevant content and special symbols and removing stop words;

preliminarily determining the entity categories to be recognized according to the actual production lines; the entity categories include preventive biological products (such as various vaccines, immunoglobulins, interferons and human coagulation factors), therapeutic biological products (such as antitoxins, human blood proteins, human interferons, human insulin, growth hormone and human epidermal growth factors) and in vivo/in vitro diagnostic products (such as protein derivatives and surface antigen detection reagents);

in the labeling process, the entities are labeled with the BIO scheme according to the characteristics of the production term text: B (Begin) marks the beginning of an entity, I (Inside) marks the non-beginning part of an entity, and O (Outside) marks the non-entity part.
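The BIO scheme can be illustrated with a short sketch; the sentence and entity below are hypothetical examples chosen for demonstration, not items from the patent's data set.

```python
def bio_tags(sentence, entities):
    """Label each character of a production-term sentence with B / I / O."""
    tags = ["O"] * len(sentence)
    for ent in entities:
        start = sentence.find(ent)
        if start >= 0:
            tags[start] = "B"                    # beginning of the entity
            for i in range(start + 1, start + len(ent)):
                tags[i] = "I"                    # non-beginning part of the entity
    return tags

# "重组人胰岛素" (recombinant human insulin) as a therapeutic-product entity
tags = bio_tags("生产重组人胰岛素原液", ["重组人胰岛素"])
```

Running this tags the six entity characters as B I I I I I and everything else as O.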
It should be noted that, referring to fig. 3, fig. 4 and fig. 5, the construction of the word vector+bilstm+crf neural network model may specifically include:
inputting the word vector obtained by training into a BiLSTM (Bi-directional Long Short-Term Memory) neural network model to obtain global features with context information;
inputting the obtained feature vector with the context information into a CRF, extracting the dependency features among labels, and calculating a loss function;
according to the loss function, the parameters of the entity recognition model are updated by stochastic gradient descent (SGD): a training sample is drawn at random, the gradient of the error on that sample with respect to the parameters is computed, and the parameter values are repeatedly moved in the negative gradient direction; iteration stops when the objective function reaches its minimum.
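The SGD update described above can be sketched on a toy one-parameter objective; the data, learning rate and step count are illustrative assumptions, not values from the patent.

```python
import random

# toy samples (x, y) generated from y = 3x; SGD should recover w ≈ 3
data = [(float(x), 3.0 * float(x)) for x in range(1, 6)]

def sgd(w, lr=0.01, steps=2000, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        x, y = rng.choice(data)        # randomly draw one training sample
        grad = 2.0 * (w * x - y) * x   # gradient of the squared error on that sample
        w -= lr * grad                 # step in the negative gradient direction
    return w

w = sgd(0.0)
```

Each step only sees one sample, yet the parameter converges to the minimizer of the full objective.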
LSTM is a special recurrent neural network, and the states of the BiLSTM neural network model neurons are calculated by the following formulas:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t ⊙ tanh(c_t)

where σ is the Sigmoid function; x_t is the input word vector at the current moment; h_{t-1} is the hidden state at the previous moment; f_t is the forget gate, which decides what information is forgotten; i_t is the input gate, which decides what information is retained; c̃_t is the candidate state obtained from the current input word vector x_t; c_t is the memory cell, which controls the change of the cell state; c_{t-1} is the cell state at the previous moment; o_t is the output gate value of the memory cell; h_t is the hidden state at the current moment; W_f, W_i, W_c and W_o are the weight matrices of the forget gate, input gate, hidden unit and output gate respectively; and b_f, b_i, b_c and b_o are the corresponding bias terms.
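The gate equations above correspond to one forward step of a single LSTM cell, sketched below in NumPy. The dimensions and random weights are illustrative; a BiLSTM runs two such chains, one forward and one backward over the sequence, and concatenates their hidden states.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the gate equations; W packs W_f, W_i, W_c, W_o."""
    z = np.concatenate([h_prev, x_t])         # [h_{t-1}, x_t]
    f = sigmoid(W["f"] @ z + b["f"])          # forget gate
    i = sigmoid(W["i"] @ z + b["i"])          # input gate
    c_hat = np.tanh(W["c"] @ z + b["c"])      # candidate state
    c = f * c_prev + i * c_hat                # memory cell update
    o = sigmoid(W["o"] @ z + b["o"])          # output gate
    h = o * np.tanh(c)                        # hidden state at the current moment
    return h, c

rng = np.random.default_rng(0)
dim_x, dim_h = 4, 3                           # illustrative sizes
W = {k: rng.normal(0.0, 0.1, (dim_h, dim_h + dim_x)) for k in "fico"}
b = {k: np.zeros(dim_h) for k in "fico"}
h_t, c_t = lstm_step(rng.normal(size=dim_x), np.zeros(dim_h), np.zeros(dim_h), W, b)
```

Because h_t is an output gate value times tanh(c_t), each component of the hidden state stays strictly inside (-1, 1).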
In the actual application process, inputting the obtained feature vector with context information into the CRF, extracting the dependency features between labels, and calculating the loss function specifically includes:

for the production term text, given the word vector sequence X = (x_1, x_2, …, x_n) of the input words and the predicted label sequence y = (y_1, y_2, …, y_n) corresponding to each input word, the prediction score s(X, y) of y is defined as:

s(X, y) = Σ_{i=1}^{n-1} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

where A is the transition matrix, a parameter matrix obtained by the CRF learning the ordering between labels, whose entries represent the probability of each label transitioning to the next label; P is the probability score matrix, transformed from the feature matrix with context information; P_{i,j} is the probability that the i-th word is labeled j; and t is the number of predicted labels, so P is an n × t matrix;

the probability of y is calculated from the defined prediction score with the Softmax function:

p(y | X) = exp(s(X, y)) / Σ_{y' ∈ Y_X} exp(s(X, y'))

the log-likelihood of this probability is:

log p(y | X) = s(X, y) - log Σ_{y' ∈ Y_X} exp(s(X, y'))

where y denotes the actual labeling sequence, Y_X denotes all possible labeling sequences, and s(X, y') denotes the scores of the other paths;

the loss function is defined as:

loss = -log p(y | X) = log Σ_{y' ∈ Y_X} exp(s(X, y')) - s(X, y)

finally, the predicted sequence is decoded with the Viterbi algorithm to obtain the labeling sequence y* with the maximum probability:

y* = argmax_{y' ∈ Y_X} s(X, y')
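The Viterbi decoding step can be sketched as follows, using an emission matrix P and transition matrix A as in the prediction score above; the toy matrices in the usage line are illustrative, not from the patent.

```python
import numpy as np

def viterbi(P, A):
    """Find the label sequence maximizing sum(P[i, y_i]) + sum(A[y_i, y_{i+1}])."""
    n, t = P.shape                                   # n words, t labels
    score = P[0].copy()                              # best score ending in each label
    back = np.zeros((n, t), dtype=int)               # backpointers
    for i in range(1, n):
        cand = score[:, None] + A + P[i][None, :]    # every previous-label -> label move
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for i in range(n - 1, 0, -1):                    # follow backpointers
        path.append(int(back[i][path[-1]]))
    return path[::-1]

# toy example: 3 words, 2 labels, no transition preference
best = viterbi(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]), np.zeros((2, 2)))
```

With these emissions the highest-scoring path simply follows the per-word maxima, label 0, then 1, then 0.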
In addition, performing entity recognition on the biological product production term text to be recognized using the second word vector model to obtain a recognition result specifically includes:

reading the production term text that requires entity recognition and inputting it into the trained word vector + BiLSTM + CRF model;

the production term text data is converted into word vectors by the continuous bag-of-words model; the word vectors go through feature extraction by the BiLSTM neural network to obtain feature vectors with global information; finally, the most probable labeling sequence of each sentence is obtained in the CRF with the Viterbi algorithm, which is the entity recognition result for the production terms.
In this embodiment, the improvement of the k-means clustering algorithm is specifically as follows: after the word vectors are normalized, a cosine similarity distance D(x, y) is defined to replace the original Euclidean distance calculation, thereby improving the K-means algorithm. The principle of the improvement is:

for the defined production term entities, let the word vectors after word vectorization be {w_1, w_2, …, w_n}. Taking any two word vectors x and y and normalizing them (so that |x| = |y| = 1), it can be deduced that:

d(x, y)² = |x - y|² = |x|² + |y|² - 2 x·y = 2(1 - cos(x, y))

where d(x, y) is the Euclidean distance between x and y and cos(x, y) is their cosine similarity. From this distance equivalence, the improved cosine similarity distance D(x, y) is defined as:

D(x, y) = 1 - cos(x, y)

which gives:

d(x, y)² = 2 D(x, y)

according to the criterion that the sum-of-squared-error function must decrease, a locally optimal solution is solved iteratively from the initial word vectors, finding the k partitions that minimize the squared error. The minimized squared error is:

E = Σ_{i=1}^{k} Σ_{x ∈ C_i} D(x, μ_i)

where μ_i = (1/|C_i|) Σ_{x ∈ C_i} x is the mean vector of cluster C_i; E characterizes how tightly the entities in a cluster surround the cluster mean vector, and the smaller its value, the higher the similarity of the entities within the cluster.
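A minimal sketch of k-means with the cosine distance D(x, y) = 1 - cos(x, y) is given below. The deterministic initialization and the toy vectors are illustrative choices for the sketch, not details from the patent.

```python
import numpy as np

def cosine_kmeans(X, k, iters=50):
    """k-means on L2-normalized vectors using distance D(x, y) = 1 - cos(x, y)."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)    # normalize word vectors
    centers = X[:k].copy()                              # simple deterministic init
    for _ in range(iters):
        sims = X @ centers.T                            # cosine similarity to each center
        labels = sims.argmax(axis=1)                    # smallest D == largest cos
        for j in range(k):
            members = X[labels == j]
            if len(members):                            # keep old center if cluster empties
                m = members.mean(axis=0)
                centers[j] = m / np.linalg.norm(m)
    return labels, centers

# two obvious directions -> two clusters
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels, centers = cosine_kmeans(X, k=2)
```

Because the vectors are normalized, assigning each point to the center with the largest cosine similarity is exactly assigning it to the smallest D(x, μ_i), per the distance equivalence derived above.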
Finally, entity classification of the biopharmaceutical production term text is achieved by comparing the cosine similarity between each cluster and the entity word vectors of the identified data set, which specifically comprises the following steps:
extracting the 5-10 production term entities closest to the centroid from each of the 20-50 clusters obtained from the data set, calculating the cosine similarity between their word vectors and the production term entity to be classified in the test set, taking the average of these cosine similarities as the cosine similarity judgment value between the cluster and the entity to be classified, and assigning the production term entity to be classified to the cluster with the largest cosine similarity, thereby completing the classification task;
the cosine similarity calculation method is as follows:
let the word vector of an entity in the training set be $A$ and the word vector of the production term entity to be classified be $B$; then the cosine similarity of $A$ and $B$ is calculated as:

$$\cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

where $\cos(A, B) \in [-1, 1]$; the larger the value, the higher the association between $A$ and $B$, i.e., the closer $\cos(A, B)$ is to 1, the more similar $A$ and $B$ are.
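The classification rule, averaging cosine similarity against the entities nearest each cluster centroid and choosing the best-scoring cluster, can be sketched as follows (hypothetical cluster names and toy 2-d vectors; the described method would use 5-10 representatives drawn from each of the 20-50 clusters):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(A, B) = A . B / (||A|| * ||B||), a value in [-1, 1]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify_entity(entity_vec, cluster_representatives):
    """Assign an entity to the cluster with the largest average cosine similarity.

    cluster_representatives: dict mapping a cluster id to the list of word vectors
    of the production term entities nearest that cluster's centroid.
    """
    scores = {
        cid: sum(cosine_similarity(entity_vec, r) for r in reps) / len(reps)
        for cid, reps in cluster_representatives.items()
    }
    return max(scores, key=scores.get)

# toy example with two clusters in a 2-d vector space
reps = {
    "vaccine": [np.array([1.0, 0.1]), np.array([0.9, 0.2])],
    "diagnostic": [np.array([0.1, 1.0]), np.array([0.2, 0.9])],
}
print(classify_entity(np.array([0.95, 0.15]), reps))  # → vaccine
```

Averaging over several near-centroid representatives rather than comparing to the centroid alone makes the judgment value less sensitive to any single atypical entity in the cluster.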
Therefore, the CBOW model used by the invention can be trained unsupervised on large-scale unlabeled data to obtain word vectors with strong semantic expression capability as the input of the subsequent model; moreover, the introduced BiLSTM neural network extracts global features carrying the context semantic information of the production term text sequence; meanwhile, on the basis of the global features extracted by the BiLSTM, the dependency relationships between labels are learned through the CRF, which improves the accuracy of entity identification in production term text; finally, by redefining the cosine similarity distance $D(x, y)$ to replace the original Euclidean distance calculation method in the K-means algorithm, the method is suited to entity classification of biopharmaceutical production terms, and the whole process is simple to operate and highly portable.
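As an illustration of the CBOW objective summarized above, predicting a center word from the average of its context vectors, a minimal NumPy trainer might look like this (a didactic sketch only; the corpus, dimensions and learning rate are hypothetical, and a real system would use an optimized Word2vec implementation):

```python
import numpy as np

def train_cbow(sentences, dim=16, window=2, lr=0.05, epochs=50, seed=0):
    """Minimal CBOW: predict each center word from its averaged context vectors."""
    rng = np.random.default_rng(seed)
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    W_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # input word vectors
    W_out = rng.normal(scale=0.1, size=(len(vocab), dim))  # output weights
    for _ in range(epochs):
        for s in sentences:
            for t, center in enumerate(s):
                ctx = [idx[s[j]]
                       for j in range(max(0, t - window), min(len(s), t + window + 1))
                       if j != t]
                if not ctx:
                    continue
                h = W_in[ctx].mean(axis=0)       # projection layer (averaged context)
                scores = W_out @ h
                p = np.exp(scores - scores.max())
                p /= p.sum()                     # softmax over the vocabulary
                p[idx[center]] -= 1.0            # gradient of the cross-entropy loss
                grad_h = W_out.T @ p
                W_out -= lr * np.outer(p, h)
                for c in ctx:                    # distribute the gradient to context words
                    W_in[c] -= lr * grad_h / len(ctx)
    return {w: W_in[idx[w]] for w in vocab}

vectors = train_cbow([["cell", "culture", "medium"], ["cell", "fermentation", "medium"]], dim=8)
```

After training, `W_in` holds the word vectors that would be fed to the downstream BiLSTM + CRF model.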
The above is only a preferred embodiment of the present invention and is not intended to limit it; those skilled in the art may make various modifications and variations to the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. A method for entity identification and classification of biological product production terms, characterized in that it comprises the following steps:
word vector training is carried out on unlabeled corpus in biopharmaceutical production, and a first word vector model is obtained;
manually labeling the unlabeled corpus in the biopharmaceutical production to construct a data set;
constructing a word vector + BiLSTM + CRF neural network model on the basis of the first word vector model, and training the model on the constructed data set to obtain a second word vector model;
performing entity recognition on the biological product production term text to be recognized by using the second word vector model to obtain a recognition result;
clustering the entity word vectors in the data set into 20-50 clusters through a modified k-means clustering algorithm, and achieving entity classification of the biopharmaceutical production term text by comparing the cosine similarity between each cluster and the entity word vectors of the identified data set.
2. The method for entity identification and classification of biological product production terms according to claim 1, wherein word vector training is performed on the unlabeled corpus in biopharmaceutical production using the continuous bag-of-words (CBOW) model in Word2vec, and the corpus selects the characters or words of common terms in biopharmaceutical production;
the continuous bag-of-words model predicts a center word using the information of the 2c words before and after it, and the model is expressed as:

$$L = \sum_{t=1}^{T} \log p\big(w_t \mid context(w_t)\big)$$

where t denotes the current time step, T denotes the total number of time steps, $w_t$ is the center word in the current production term text, and $context(w_t)$ denotes the 2c words before and after the center word; the task at this time is for the continuous bag-of-words model to predict the center word $w_t$ from the known 2c words, and the probability of the center word $w_t$ occurring is related to the 2c words before and after it.
3. The method for entity identification and classification of biological product production terms according to claim 2, wherein the continuous bag-of-words model first encodes, by one-hot encoding, the 2c words before and after the center word $w_t$ into the corresponding word vectors $v(w_{t-c}), \ldots, v(w_{t-1}), v(w_{t+1}), \ldots, v(w_{t+c})$; the 2c word vectors are then fed to the input layer, multiplied by a shared weight matrix in the projection layer, and sent to the output layer, finally yielding the probability $p(w_t \mid context(w_t))$ of the center word given its 2c surrounding words; the final word vector $v(w)$ of each word $w$ is obtained by training with the maximum likelihood estimation method, and the uniformly trained word vectors are $X = \{x_1, x_2, \ldots, x_n\}$, where n denotes the number of word vectors.
4. The method for identifying and classifying entities for biological product production terms according to claim 1, wherein the manually labeling unlabeled corpus in biopharmaceutical production to construct a dataset comprises:
preprocessing the original corpus, including deleting irrelevant content and special symbols and removing stop words;
preliminarily determining the entity categories to be identified according to the differences between actual production lines, wherein the entity categories comprise preventive biological products, therapeutic biological products, and in-vivo and in-vitro diagnostic products;
in the labeling process, according to the characteristics of the text of the production term, the entities are labeled by adopting a BIO labeling method, the beginning part of the entity is represented by B, the non-beginning part of the entity is represented by I, and the non-entity part is represented by O.
5. The method for entity identification and classification of biological product production terms according to claim 1, wherein said constructing a word vector + BiLSTM + CRF neural network model specifically comprises:
inputting the word vector obtained by training into a BiLSTM neural network model to obtain global features with context information;
inputting the obtained feature vector with the context information into a CRF, extracting the dependency features among labels, and calculating a loss function;
updating the parameters of the entity identification model according to the loss function by stochastic gradient descent (SGD), which proceeds as follows: a training sample is randomly drawn, the gradient of the error on this sample with respect to the parameters is calculated, and the parameter values are then updated continuously in the negative gradient direction until the objective function reaches its minimum and the iteration stops.
6. The method of claim 5, wherein the states of the BiLSTM neural network model neurons are calculated by the following formulas:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(c_t)$$

where $\sigma$ is the Sigmoid function; $x_t$ is the input word vector at the current moment; $h_{t-1}$ is the hidden layer state at the previous moment; $f_t$ is the forget gate, which decides the category of information to be forgotten; $i_t$ is the input gate, which determines the category of information to be retained; $\tilde{c}_t$ is the intermediate state obtained from the input word vector $x_t$ at the current moment; $c_t$ is the memory cell, which controls the change of the cell state; $c_{t-1}$ is the state value at the previous moment; $o_t$ is the output value of the memory cell; $h_t$ is the hidden layer state at the current moment; $W_f$, $W_i$, $W_c$ and $W_o$ denote the feedback connection matrices of the forget gate, the input gate, the hidden unit and the output gate, respectively; and $b_f$, $b_i$, $b_c$ and $b_o$ denote the thresholds of the forget gate, the input gate, the hidden layer unit and the output gate, respectively.
7. The method for entity identification and classification of biological product production terms according to claim 5, wherein said inputting the obtained feature vectors with context information into the CRF, extracting the dependency features between labels, and calculating the loss function specifically comprises:
given the word vector sequence $X = (x_1, x_2, \ldots, x_n)$ corresponding to the input words of the production term text and the predicted label sequence $y = (y_1, y_2, \ldots, y_n)$ corresponding to each input word, the prediction score $s(X, y)$ of y is defined as:

$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

where $A$ is the transition matrix, a parameter matrix obtained by the CRF learning the ordering between labels, which represents the probability of each label transitioning to the next label; $P$ is the probability score matrix, transformed from the feature matrix with context information; $P_{i, j}$ is the probability that the i-th word is marked with label j; and t is the number of predicted label types;
the probability of y is calculated from the defined prediction score according to the Softmax function:

$$P(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$

the log-likelihood function of this probability is:

$$\log P(y \mid X) = s(X, y) - \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}$$

where $y$ denotes the actual labeling sequence, $Y_X$ denotes all possible labeling sequences, and $s(X, \tilde{y})$ denotes the scores of the other paths;
the loss function is defined as:

$$loss = \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})} - s(X, y)$$

finally, the predicted sequence is decoded by the Viterbi algorithm to obtain the prediction labeling sequence $y^*$ with the maximum probability, expressed as:

$$y^* = \underset{\tilde{y} \in Y_X}{\arg\max}\; s(X, \tilde{y})$$
8. The method for entity identification and classification of biological product production terms according to claim 1, wherein said performing entity recognition on the biological product production term text to be recognized by using the second word vector model to obtain a recognition result specifically comprises:
reading the production term text that needs entity recognition, and inputting it into the trained word vector + BiLSTM + CRF model;
the text data of the production terms is converted into word vectors by the continuous bag-of-words model; the word vectors undergo feature extraction through the BiLSTM neural network to obtain feature vectors with global information; finally, the most probable labeling sequence of each sentence is obtained in the CRF by the Viterbi algorithm, which is the entity recognition result of the production terms.
9. The method for entity identification and classification of biological product production terms according to claim 1, wherein the k-means clustering algorithm is modified specifically as follows: after normalizing the word vectors, a cosine similarity distance $D(x, y)$ is redefined to replace the original Euclidean distance calculation method, thereby improving the K-means algorithm; the principle of the improvement is as follows:
according to the defined production term entities, the word vectors of the corresponding production term entities after word vectorization are $X = \{x_1, x_2, \ldots, x_n\}$; taking any two word vectors $x$ and $y$ and normalizing them ($\|x\| = \|y\| = 1$), it can be deduced that:

$$d(x, y)^2 = \|x - y\|^2 = 2 - 2\cos(x, y)$$

where $d(x, y)$ is the Euclidean distance between $x$ and $y$, and $\cos(x, y)$ is the cosine similarity of $x$ and $y$; from this distance equivalence, the improved cosine similarity distance $D(x, y)$ is defined as:

$$D(x, y) = 1 - \cos(x, y)$$

this gives:

$$d(x, y)^2 = 2\,D(x, y)$$

according to the criterion that the sum-of-squared-errors criterion function decreases, a local optimal solution is solved iteratively starting from the initial word vectors, so as to find the k partitions that minimize the squared-error function value; the formula for minimizing the squared error is:

$$E = \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i\|^2$$

where $\mu_i$ is the mean vector of cluster $C_i$, and $E$ characterizes how closely the entities within a cluster gather around the cluster mean vector: the smaller the value, the higher the intra-cluster entity similarity.
10. The method for entity identification and classification of biopharmaceutical production terms according to claim 1, wherein the entity classification of the biopharmaceutical production term text is achieved by comparing the cosine similarity between each cluster and the entity word vectors of the identified data set, which specifically comprises:
extracting the 5-10 production term entities closest to the centroid from each of the 20-50 clusters obtained from the data set, calculating the cosine similarity between their word vectors and the production term entity to be classified in the test set, taking the average of these cosine similarities as the cosine similarity judgment value between the cluster and the entity to be classified, and assigning the production term entity to be classified to the cluster with the largest cosine similarity, thereby completing the classification task;
the cosine similarity calculation method is as follows:
let the word vector of an entity in the training set be $A$ and the word vector of the production term entity to be classified be $B$; then the cosine similarity of $A$ and $B$ is calculated as:

$$\cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

where $\cos(A, B) \in [-1, 1]$; the larger the value, the higher the association between $A$ and $B$, i.e., the closer $\cos(A, B)$ is to 1, the more similar $A$ and $B$ are.
CN202310665618.9A 2023-06-07 2023-06-07 Entity identification and classification method for biological product production terms Active CN116401369B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310665618.9A CN116401369B (en) 2023-06-07 2023-06-07 Entity identification and classification method for biological product production terms

Publications (2)

Publication Number Publication Date
CN116401369A true CN116401369A (en) 2023-07-07
CN116401369B CN116401369B (en) 2023-08-11

Family

ID=87018329

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310665618.9A Active CN116401369B (en) 2023-06-07 2023-06-07 Entity identification and classification method for biological product production terms

Country Status (1)

Country Link
CN (1) CN116401369B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117454892A (en) * 2023-12-20 2024-01-26 深圳市智慧城市科技发展集团有限公司 Metadata management method, device, terminal equipment and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298042A (en) * 2019-06-26 2019-10-01 四川长虹电器股份有限公司 Based on Bilstm-crf and knowledge mapping video display entity recognition method
CN113191148A (en) * 2021-04-30 2021-07-30 西安理工大学 Rail transit entity identification method based on semi-supervised learning and clustering
CN114091460A (en) * 2021-11-24 2022-02-25 长沙理工大学 Multitask Chinese entity naming identification method
CN114996287A (en) * 2022-06-20 2022-09-02 上海电器科学研究所(集团)有限公司 Automatic equipment identification and capacity expansion method based on feature library
CN115510864A (en) * 2022-10-14 2022-12-23 昆明理工大学 Chinese crop disease and pest named entity recognition method fused with domain dictionary
CN115859980A (en) * 2022-11-24 2023-03-28 山东鲁软数字科技有限公司 Semi-supervised named entity identification method, system and electronic equipment
CN115859164A (en) * 2022-09-09 2023-03-28 第三维度(河南)软件科技有限公司 Method and system for identifying and classifying building entities based on prompt
CN116127084A (en) * 2022-10-21 2023-05-16 中国农业大学 Knowledge graph-based micro-grid scheduling strategy intelligent retrieval system and method
CN116186266A (en) * 2023-03-06 2023-05-30 欧冶工业品股份有限公司 BERT (binary image analysis) and NER (New image analysis) entity extraction and knowledge graph material classification optimization method and system
CN116187444A (en) * 2023-03-01 2023-05-30 中国人民解放军国防科技大学 K-means++ based professional field sensitive entity knowledge base construction method


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHUANHAI DONG ET AL.: "Character-based LSTM-CRF with Radical-level Features for Chinese Named Entity Recognition", Natural Language Understanding and Intelligent Applications 2016, pages 239-250 *
LIU YUE: "Improvement of the K-means Clustering Algorithm", China Master's Theses Full-text Database (Information Science and Technology), no. 2, pages 138-2336 *
LI HANGYU: "Research on the Cross-domain and Cross-style Characteristics of Adaptive Features in Named Entity Recognition", Software, vol. 35, no. 10, pages 100-106 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117454892A (en) * 2023-12-20 2024-01-26 深圳市智慧城市科技发展集团有限公司 Metadata management method, device, terminal equipment and storage medium
CN117454892B (en) * 2023-12-20 2024-04-02 深圳市智慧城市科技发展集团有限公司 Metadata management method, device, terminal equipment and storage medium

Also Published As

Publication number Publication date
CN116401369B (en) 2023-08-11

Similar Documents

Publication Publication Date Title
Hasani et al. Spatio-temporal facial expression recognition using convolutional neural networks and conditional random fields
CN110969020B (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN113033249A (en) Character recognition method, device, terminal and computer storage medium thereof
CN112614538A (en) Antibacterial peptide prediction method and device based on protein pre-training characterization learning
CN110110324B (en) Biomedical entity linking method based on knowledge representation
CN111476024A (en) Text word segmentation method and device and model training method
CN116401369B (en) Entity identification and classification method for biological product production terms
WO2010062268A1 (en) A method for updating a 2 dimensional linear discriminant analysis (2dlda) classifier engine
Thomas et al. A deep HMM model for multiple keywords spotting in handwritten documents
Kumar et al. Future of machine learning (ML) and deep learning (DL) in healthcare monitoring system
CN111581974A (en) Biomedical entity identification method based on deep learning
CN113705238A (en) Method and model for analyzing aspect level emotion based on BERT and aspect feature positioning model
WO2020108808A1 (en) Method and system for classification of data
Chen et al. DeepGly: A deep learning framework with recurrent and convolutional neural networks to identify protein glycation sites from imbalanced data
Rahman et al. IDMIL: an alignment-free Interpretable Deep Multiple Instance Learning (MIL) for predicting disease from whole-metagenomic data
CN113191150B (en) Multi-feature fusion Chinese medical text named entity identification method
CN118013038A (en) Text increment relation extraction method based on prototype clustering
Missaoui et al. Multi-stream continuous hidden Markov models with application to landmine detection
CN113312907A (en) Remote supervision relation extraction method and device based on hybrid neural network
Wayahdi et al. KNN and XGBoost Algorithms for Lung Cancer Prediction
CN116757195A (en) Implicit emotion recognition method based on prompt learning
CN116386733A (en) Protein function prediction method based on multi-view multi-scale multi-attention mechanism
CN116127097A (en) Structured text relation extraction method, device and equipment
Kim Probabilistic sequence translation-alignment model for time-series classification
CN114036947A (en) Small sample text classification method and system for semi-supervised learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant