CN111563383A - Chinese named entity identification method based on BERT and semi CRF - Google Patents

Info

Publication number: CN111563383A
Application number: CN202010272320.8A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: crf, named entity, layer, score, entity recognition
Inventors: 蔡毅, 郑煜佳
Original and current assignee: South China University of Technology (SCUT), which filed the application
Legal status: Pending (the listed status is an assumption by Google Patents, not a legal conclusion)

Classifications

    • G06F40/295 Named entity recognition (G06F40/00 Handling natural language data)
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates (G06F40/00 Handling natural language data)
    • G06N3/044 Recurrent networks, e.g. Hopfield networks (G06N3/02 Neural networks)
    • G06N3/045 Combinations of networks (G06N3/02 Neural networks)
    • G06N3/08 Learning methods (G06N3/02 Neural networks)


Abstract

The invention discloses a Chinese named entity recognition method based on BERT and SemiCRF, which constructs a named entity recognition model through the following steps: obtaining a pre-trained BERT model; preprocessing the original corpus data for named entity recognition to construct a training set; inputting the constructed training set data into the pre-trained BERT language model; sequentially feeding the output of the BERT language model into a bidirectional LSTM neural network and a joint CRF-and-SemiCRF module, and performing multiple iterations of training on the bidirectional LSTM network and the joint module; and performing named entity recognition on Chinese text with the complete named entity recognition model obtained after training. The invention solves the problem that traditional word2vec cannot distinguish polysemous words and, by introducing a SemiCRF-based method, combines the segment-level information often ignored by the traditional CRF method with word-level information, thereby improving the effect of Chinese named entity recognition to a certain extent.

Description

Chinese named entity identification method based on BERT and semi CRF
Technical Field
The invention relates to the technical field of named entity recognition, in particular to a Chinese named entity recognition method based on BERT and semi CRF.
Background
Named Entity Recognition (NER) is a task in the field of Natural Language Processing (NLP) that aims at identifying entities in text and classifying them into predefined entity types, such as names of people, places, and organizations. Named entity recognition can not only be used on its own as a tool for information extraction, but also plays an important role in other natural language processing tasks and applications, such as information retrieval, automatic text summarization, question answering, machine translation, and knowledge base construction.
The existing mainstream method for named entity recognition is Bi-LSTM + CRF. The Bi-LSTM (bidirectional long short-term memory network) used is a deep neural network popular in deep learning that can learn contextual feature relationships over long sequences; the CRF (conditional random field) used is a traditional machine learning method that can learn the context of labels in named entity recognition.
The Bi-LSTM + CRF method must learn its word-embedding representations from the named entity recognition data set itself, which has several drawbacks: Bi-LSTM cannot handle polysemy when learning word embeddings; named entity recognition data sets are not large, so the quality of the word embeddings learnable from them is limited; and Bi-LSTM cannot process data in parallel, so the dimensionality of its word embeddings must be kept small, otherwise the time cost of training multiplies. In addition, text has the property that named entities mostly exist as segments composed of several words, and the CRF in the Bi-LSTM + CRF method, working in word-level units, cannot exploit segment-level information.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a Chinese named entity recognition method based on BERT and SemiCRF. The invention addresses both the limited quality of learned word-embedding representations and the otherwise unsolved problem of polysemy, and avoids the problem that a CRF can only use word-level information while ignoring segment-level information.
The purpose of the invention can be realized by the following technical scheme:
A named entity recognition method based on BERT and SemiCRF builds a named entity recognition model, wherein the model comprises a BERT language model, a bidirectional LSTM, and a joint CRF-and-SemiCRF module, and the method comprises the following steps:
acquiring a pre-trained BERT language model;
preprocessing original corpus data of named entity recognition to construct a training set of the named entity recognition;
inputting the obtained training set data of named entity recognition into a pre-trained BERT language model;
sequentially inputting the output of the BERT language model into a bidirectional LSTM neural network and a joint CRF-and-SemiCRF module, and performing multiple iterations of training on the bidirectional LSTM network and the joint module;
and carrying out named entity recognition on the Chinese text by using the complete named entity recognition model obtained after training.
Further, the pre-trained BERT language model is obtained by: downloading Google's open-source BERT code and using it to pre-train a BERT language model on massive unlabeled Chinese text corpora; or directly downloading the Chinese BERT language model chinese_L-12_H-768_A-12 pre-trained and released by Google.
Further, the step of preprocessing the original corpus data of the named entity recognition and constructing the training set of the named entity recognition includes:
carrying out conventional data preprocessing on the named entity identification original corpus;
determining the entity type to be identified according to the actual application requirement, or directly using the universal entity type;
marking the original corpus by adopting an entity marking method of BIOES;
formulating specific labeling rules according to the actual application requirements, and, combining these rules, manually labeling the unlabeled original corpus or converting and correcting an already-labeled corpus.
Further, the step of sequentially inputting the output of the BERT language model into the bidirectional LSTM neural network and the joint CRF-and-SemiCRF module, and performing multiple iterations of training on them, includes:
inputting the sequence output by the BERT language model into the bidirectional LSTM neural network;
the bidirectional LSTM neural network outputs, for each word in the sequence, a probability distribution vector over all entity types, i.e., the word-level CRF features of each word;
inputting the obtained word-level CRF feature sequence into the CRF layer and the SemiCRF layer of the joint CRF-and-SemiCRF module respectively;
the CRF layer calculates its loss function by the bi-LSTM + CRF method;
the SemiCRF layer calculates the SemiCRF features of each segment in the sentence from the CRF features and the ground-truth tags of each word in the sentence, and from them the score of the best path;
the SemiCRF layer calculates the scores of all paths with a forward algorithm using the SemiCRF feature transition matrix;
the loss function of the SemiCRF layer is calculated from the best-path score and the all-paths score;
SGD is used to update the parameters of the entire named entity recognition model with a weighted sum of the CRF-layer loss and the SemiCRF-layer loss.
Further, the step of performing named entity recognition on the Chinese text by using the trained complete named entity recognition model includes:
inputting the sentence needing named entity recognition into the trained complete named entity recognition model;
after the pre-trained BERT language model, the input sequence passes sequentially through the bidirectional LSTM neural network and the joint CRF-and-SemiCRF module; the CRF features of each word of the input sentence are calculated, and from them the CRF feature matrix of the CRF layer and the SemiCRF feature matrix of the SemiCRF layer;
using the Viterbi algorithm, the best path of the input sentence is decoded on the CRF layer and the SemiCRF layer respectively, giving the CRF-layer tag sequence with its CRF-layer score, score_{C-C}, and the SemiCRF-layer tag sequence with its SemiCRF-layer score, score_{S-S};
the score of the CRF-decoded tag sequence on the SemiCRF layer, score_{C-S}, and the score of the SemiCRF-decoded tag sequence on the CRF layer, score_{S-C}, are calculated;
the total score of the CRF-layer tag sequence, score_{C-C} + score_{C-S}, and of the SemiCRF-layer tag sequence, score_{S-S} + score_{S-C}, are then calculated; because the scores come from negative log-likelihood processing, the tag sequence with the smaller total score is taken as the result of named entity recognition.
Compared with the prior art, the invention has the following beneficial effects:
1. The BERT model used by the invention can learn high-quality word-embedding representations from large-scale Chinese text through pre-training and fine-tuning; it is not limited to manually labeled and processed named entity recognition data sets, and it adjusts the current semantics according to the context, thereby solving the problem of polysemy.
2. Compared with a conditional random field (CRF) that uses only word-level information, the semi-Markov conditional random field (SemiCRF) introduced into named entity recognition here is better suited to named entities with obvious segment-level characteristics. To guarantee the recognition effect of the SemiCRF, it is combined with the CRF to a certain extent so that the model can effectively exploit word-level and segment-level features simultaneously.
3. The invention considers both the CRF and SemiCRF methods during training and decoding; in particular, taking the better-scoring result as the final result at decoding time helps ensure the accuracy of named entity recognition.
Drawings
FIG. 1 is a flow chart of a method for Chinese named entity recognition based on BERT and SemicRF in the present invention.
Fig. 2 is a schematic structural diagram of the named entity recognition model for Chinese named entity recognition based on BERT and SemiCRF in the embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Examples
Fig. 1 is a flow chart of the method for Chinese named entity recognition based on BERT and SemiCRF. A named entity recognition model as shown in Fig. 2 is constructed, wherein the model comprises a BERT language model, a bidirectional LSTM, and a joint CRF-and-SemiCRF module, and the method comprises the following steps:
S1, obtaining a pre-trained BERT model;
Specifically, the obtaining modes include: downloading Google's open-source BERT code and obtaining a BERT pre-trained language model with existing pre-training techniques on massive unlabeled Chinese text corpora; or directly downloading the Chinese BERT language model chinese_L-12_H-768_A-12 pre-trained and released by Google.
S2, preprocessing the original corpus data of the named entity recognition, and constructing a training set of the named entity recognition, comprising the following steps:
S21, performing conventional data preprocessing on the original corpus for named entity recognition, including correcting wrongly written characters, normalizing characters, and the like; the original corpus data are labeled named entity data;
S22, determining the entity types to be recognized according to the actual application requirements, or directly using universal entity types such as person name (PERSON), location name (LOCATION), and organization name (ORGANIZATION);
S23, to cope with entities of varying length whose boundaries are hard to distinguish, adopting the BIOES entity labeling method: B marks the beginning of a multi-character entity, I its interior, E its tail, S an entity consisting of a single character, and O a non-entity; for example, the three-character name "Liu Xuande" would be labeled (B-PER, I-PER, E-PER);
S24, formulating specific labeling rules according to the actual application requirements, and, combining the labeling rules of steps S22 and S23, manually labeling the unlabeled original corpus or converting and correcting an already-labeled corpus.
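As an illustrative sketch (not part of the patent text), the BIOES labeling of steps S23 and S24 can be expressed in code; the function name and the (start, end, type) span format are assumptions chosen for the example:

```python
def spans_to_bioes(n_tokens, entities):
    """Convert entity spans into a BIOES tag sequence.

    entities: list of (start, end, type) with inclusive, 0-based
    token indices; positions outside every entity get the tag "O".
    """
    tags = ["O"] * n_tokens
    for start, end, etype in entities:
        if start == end:                      # single-character entity -> S
            tags[start] = f"S-{etype}"
        else:
            tags[start] = f"B-{etype}"        # beginning of a multi-char entity
            for i in range(start + 1, end):   # interior characters -> I
                tags[i] = f"I-{etype}"
            tags[end] = f"E-{etype}"          # last character -> E
    return tags

# A three-character person name spanning tokens 0..2, followed by a non-entity:
print(spans_to_bioes(4, [(0, 2, "PER")]))  # ['B-PER', 'I-PER', 'E-PER', 'O']
```

The inverse conversion (tags back to spans) is the usual decoding step after prediction and follows the same scheme.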
And S3, inputting the training set data of the named entity recognition obtained by preprocessing in the step S2 into the pretrained BERT language model.
Specifically, the training set data is input into a pre-trained BERT language model in sentence units, and the output of the BERT language model is a word embedding vector sequence.
S4, sequentially inputting the output of the BERT language model from step S3 into the bidirectional LSTM neural network and the joint CRF-and-SemiCRF module, and performing multiple iterations of training on the bidirectional LSTM network and the joint module, comprising the steps of:
S41, inputting the sequence output by the BERT language model into the bidirectional LSTM neural network;
S42, the bidirectional LSTM neural network outputs, for each word in the sequence, a probability distribution vector over all entity types, i.e., the word-level CRF features of each word;
S43, inputting the obtained CRF feature sequences into the CRF layer and the SemiCRF layer of the joint CRF-and-SemiCRF module respectively;
S44, calculating the loss function of the CRF layer by the bi-LSTM + CRF method;
For example, suppose the input sentence translates to "Dr. Wilson came to California for research", with ground-truth label sequence (B-PER, I-PER, E-PER, O, O, O, B-LOC, I-LOC, I-LOC, I-LOC, E-LOC, O, O, O, O, O). For a character inside the location entity, the CRF considers not only the score of that position itself being labeled I-LOC, but also the labels of the neighboring characters: if the character were labeled I-PER, the transition from the preceding B-LOC to I-PER is obviously impossible, so a CRF that has learned label context from the data set assigns a very low score to labeling that character I-PER;
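The label-context behavior in the S44 example can be sketched as follows; the hand-set transition function and emission scores below are illustrative assumptions, not the patent's learned parameters (a trained CRF learns both from data):

```python
import math

def transition(prev_tag, tag):
    """Illustrative hand-set transition scores. -inf marks transitions
    the BIOES scheme forbids: after B-LOC only I-LOC or E-LOC may follow."""
    if prev_tag == "B-LOC" and tag not in {"I-LOC", "E-LOC"}:
        return -math.inf
    return 0.0

def path_score(emissions, path):
    """Sum of per-position emission scores plus transition scores along
    a tag path (emissions: one {tag: score} dict per position)."""
    score = emissions[0][path[0]]
    for k in range(1, len(path)):
        score += transition(path[k - 1], path[k]) + emissions[k][path[k]]
    return score

# Two-character fragment: first character clearly B-LOC, second ambiguous.
emissions = [
    {"B-LOC": 3.0, "I-LOC": 0.1, "I-PER": 0.1, "E-LOC": 0.1, "O": 0.1},
    {"B-LOC": 0.1, "I-LOC": 1.0, "I-PER": 2.0, "E-LOC": 0.5, "O": 0.1},
]
print(path_score(emissions, ["B-LOC", "I-LOC"]))  # 4.0
print(path_score(emissions, ["B-LOC", "I-PER"]))  # -inf (forbidden transition)
```

Even though I-PER has the higher emission score at the second position, the forbidden transition makes the whole path score negative infinity, which is exactly how a CRF uses label context to reject it.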
S45, the SemiCRF layer calculates the SemiCRF features of each segment in the sentence from the CRF features and the ground-truth labels of each word in the sentence, and from them the score of the best path;
In the invention, the words in the word-embedding vector sequence correspond to single characters of the text, and the segment level corresponds to words of the text. For the same input sequence as above, the segment-level annotation sequence for the SemiCRF is ((1,3,PER), (4,4,O), (5,5,O), (6,6,O), (7,7,O), (8,12,LOC), (13,13,O), (14,14,O), (15,15,O), (16,16,O)), and the score of the best path can be calculated according to the following formulas:

score(s, w) = Σ_{i=1}^{|s|} ( m_i + b_{l_{i-1}, l_i} )

m_i = Σ_{k=b_i}^{e_i} θ_{y_k}^T w'_k

w'_k = [ w_k ; d_{k-b_i} ]

where s denotes the segment-level tag sequence; w the word-embedding vector representation of the input sequence; l_i the i-th segment-level label; b_i and e_i the positions on the input sequence of the beginning and end of the i-th segment; m_i the score of the i-th segment itself; b_{i,j} the segment-level transition score from category i to category j; y_k the ground-truth tag of the k-th word of the input sequence; θ_{y_k} the weight parameter vector associated with label y_k; and w'_k the feature vector of the k-th word, constructed by concatenating the word embedding w_k with d_{k-b_i}, the embedding vector corresponding to the word's index within its segment;
S46, the SemiCRF layer calculates the scores of all paths with a forward algorithm using the SemiCRF feature transition matrix;
S47, calculating the loss function of the SemiCRF layer from the best-path score and the all-paths score; the scores required by the loss function are obtained through negative log-likelihood processing, so the loss is expressed as Loss = score_{all_path} - score_{best_path}.
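The all-paths score of S46 comes from the SemiCRF forward algorithm, which sums over every segmentation and label assignment; a minimal log-space sketch, where the generic segment-score and transition functions are illustrative assumptions:

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == -math.inf:
        return m
    return m + math.log(sum(math.exp(x - m) for x in xs))

def semicrf_log_partition(n, labels, seg_score, trans, max_len):
    """Forward algorithm over segmentations.
    alpha[j][y] = log-sum of exp-scores of all segmentations of the
    first j tokens whose final segment carries label y."""
    alpha = [{y: -math.inf for y in labels} for _ in range(n + 1)]
    for j in range(1, n + 1):
        for y in labels:
            terms = []
            for d in range(1, min(max_len, j) + 1):  # last segment length
                i = j - d                            # segment covers tokens i..j-1
                s = seg_score(i, j - 1, y)
                if i == 0:
                    terms.append(s)                  # first segment: no transition
                else:
                    for yp in labels:
                        terms.append(alpha[i][yp] + trans(yp, y) + s)
            alpha[j][y] = logsumexp(terms)
    return logsumexp([alpha[n][y] for y in labels])

# Tiny hand-checkable case: 2 tokens, one label "A"; the only two
# segmentations are [(0,0),(1,1)] and [(0,1)].
seg = lambda b, e, y: float(e - b + 1)   # toy segment score
tr = lambda yp, y: -0.5                  # toy transition score
z = semicrf_log_partition(2, ["A"], seg, tr, 2)
expected = math.log(math.exp(1.0 + 1.0 - 0.5) + math.exp(2.0))
print(abs(z - expected) < 1e-9)  # True
```

With both the all-paths log-score and the best-path score in hand, the S47 loss is simply their difference.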
S48, adopting SGD (stochastic gradient descent) to update the parameters of the whole named entity recognition model with a weighted sum of the CRF-layer loss and the SemiCRF-layer loss; the parameters include those of the BERT language model, the LSTM neural network, and the joint CRF-and-SemiCRF module. The weighting coefficients are tuned by a controlled-variable method and may change with different named entity recognition training data.
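The weighted-sum training objective of S48 can be illustrated with a toy example; the quadratic stand-in losses, scalar parameter, and learning rate below are assumptions chosen only to show the mechanics of SGD on a weighted combination of two losses:

```python
def sgd_step(params, grads, lr=0.1):
    """One stochastic gradient descent (SGD) update."""
    return [p - lr * g for p, g in zip(params, grads)]

def combined_loss_and_grad(theta, w_crf=0.5, w_semi=0.5):
    """Toy stand-ins for the CRF-layer and SemiCRF-layer losses:
    two quadratics in a shared scalar parameter, combined by the
    weighted sum described in S48."""
    loss_crf, grad_crf = (theta - 1.0) ** 2, 2.0 * (theta - 1.0)
    loss_semi, grad_semi = (theta + 1.0) ** 2, 2.0 * (theta + 1.0)
    loss = w_crf * loss_crf + w_semi * loss_semi
    grad = w_crf * grad_crf + w_semi * grad_semi
    return loss, grad

theta = 5.0
for _ in range(100):
    _, grad = combined_loss_and_grad(theta)
    (theta,) = sgd_step([theta], [grad])
print(round(theta, 3))  # 0.0: the minimum of the equally weighted sum
```

Changing the two weights moves the optimum between the two individual minima, which is the same trade-off the controlled-variable tuning of the weights explores.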
S5, using the complete named entity recognition model obtained by training in the step S4 to recognize the named entity of the Chinese text, comprising the following steps:
S51, inputting the sentence requiring named entity recognition into the trained complete named entity recognition model;
S52, after the pre-trained BERT language model, the input sequence passes sequentially through the bidirectional LSTM neural network and the joint CRF-and-SemiCRF module; the CRF features of each word of the input sentence are calculated, and from them the CRF feature matrix of the CRF layer and the SemiCRF feature matrix of the SemiCRF layer;
S53, decoding the best path of the input sentence on the CRF layer and the SemiCRF layer respectively with the Viterbi algorithm, obtaining the CRF-layer tag sequence with its CRF-layer score, score_{C-C}, and the SemiCRF-layer tag sequence with its SemiCRF-layer score, score_{S-S};
S54, calculating the score of the tag sequence decoded by the CRF layer in step S53 on the SemiCRF layer, score_{C-S}, and the score of the tag sequence decoded by the SemiCRF layer on the CRF layer, score_{S-C};
S55, calculating the total scores of the two tag sequences obtained in step S53, score_{C-C} + score_{C-S} and score_{S-S} + score_{S-C}; because the scores come from negative log-likelihood processing, the tag sequence with the smaller total score is taken as the result of named entity recognition.
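Steps S53 to S55 reduce to a simple selection between the two decoded candidates; the sequences and score values below are illustrative only:

```python
def select_final_sequence(crf_seq, semicrf_seq,
                          score_cc, score_cs, score_ss, score_sc):
    """Choose between the CRF-decoded and SemiCRF-decoded tag sequences.
    All scores are negative log-likelihoods, so the smaller combined
    score wins (S55):
      CRF candidate total:     score_cc + score_cs
      SemiCRF candidate total: score_ss + score_sc
    """
    if score_cc + score_cs <= score_ss + score_sc:
        return crf_seq
    return semicrf_seq

# Illustrative numbers: the CRF candidate totals 2.1, the SemiCRF
# candidate 2.6, so the CRF-decoded sequence is returned.
print(select_final_sequence(["B-PER", "E-PER"], ["S-PER", "O"],
                            1.2, 0.9, 1.5, 1.1))  # ['B-PER', 'E-PER']
```

Scoring each candidate on *both* layers before comparing is what lets the model pick whichever of the word-level and segment-level views fits the sentence better.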
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (7)

1. A method for recognizing Chinese named entities based on BERT and SemiCRF, characterized in that a named entity recognition model is constructed, the model comprising a BERT language model, a bidirectional LSTM, and a joint CRF-and-SemiCRF module, and the method comprising the following steps:
obtaining a pre-trained BERT model;
preprocessing original corpus data of named entity recognition to construct a training set of the named entity recognition;
inputting the constructed training set data of named entity recognition into a pre-trained BERT language model;
sequentially inputting the output of the BERT language model into a bidirectional LSTM neural network and a joint CRF-and-SemiCRF module, and performing multiple iterations of training on the bidirectional LSTM network and the joint module;
and carrying out named entity recognition on the Chinese text by using the complete named entity recognition model obtained after training.
2. The method of claim 1, wherein the pre-trained BERT model is obtained by: downloading Google's open-source BERT code and using it to pre-train a BERT language model on massive unlabeled Chinese text corpora; or directly downloading the Chinese BERT language model chinese_L-12_H-768_A-12 pre-trained and released by Google.
3. The method according to claim 1, wherein the step of preprocessing the raw corpus data of the named entity recognition to construct the training set of the named entity recognition comprises:
carrying out conventional data preprocessing on the named entity identification original corpus;
determining the entity type to be identified according to the actual application requirement, or directly using the universal entity type;
marking the original corpus by adopting an entity marking method of BIOES;
formulating specific labeling rules according to the actual application requirements, and, combining these rules, manually labeling the unlabeled original corpus or converting and correcting an already-labeled corpus.
4. The method of claim 1, wherein the step of inputting the output of the BERT language model to the bi-directional LSTM neural network and the CRF and SemiCRF joint module sequentially comprises the steps of:
inputting the sequence output by the BERT language model into a bidirectional LSTM neural network;
the bidirectional LSTM neural network outputs probability distribution vectors of all entity types of each word in the sequence, namely CRF characteristics of each word at a word level;
respectively inputting the obtained CRF characteristic sequences into a CRF layer and a semiCRF layer in a CRF and semiCRF combined module;
calculating a loss function of the CRF layer by adopting a bi-LSTM + CRF method in the CRF layer;
the SemiCRF layer calculates the SemiCRF features of each segment in the sentence according to the CRF features and the ground-truth tags of each word in the sentence, and further calculates the score of the best path;
the SemiCRF layer calculates the scores of all paths through a forward algorithm using the SemiCRF feature transition matrix;
calculating a loss function of the SemiCRF layer according to the best-path score and the all-paths score;
the SGD is used to update the parameters of the named entity recognition model with a weighted sum of the loss function of the CRF layer and the loss function of the SemiCRF layer.
5. The method of claim 4, wherein the SemiCRF layer calculates the SemiCRF features of each segment in the sentence according to the CRF features and the ground-truth tags of each word in the sentence, and further calculates the best-path score, the best-path score being calculated by:

score(s, w) = Σ_{i=1}^{|s|} ( m_i + b_{l_{i-1}, l_i} )

m_i = Σ_{k=b_i}^{e_i} θ_{y_k}^T w'_k

w'_k = [ w_k ; d_{k-b_i} ]

where s denotes the segment-level tag sequence; w the word-embedding vector representation of the input sequence; l_i the i-th segment-level label; b_i and e_i the positions on the input sequence of the beginning and end of the i-th segment; m_i the score of the i-th segment itself; b_{i,j} the segment-level transition score from category i to category j; y_k the ground-truth tag of the k-th word of the input sequence; θ_{y_k} the weight parameter vector associated with label y_k; and w'_k the feature vector of the k-th word, constructed by concatenating the word embedding w_k with d_{k-b_i}, the embedding vector corresponding to the word's index within the segment.
6. The method of claim 4, wherein, in the step of calculating the loss function of the SemiCRF layer from the best-path score and all path scores, the scores are processed by negative log-likelihood, so the loss function is finally expressed as Loss = score_{all_path} - score_{best_path}.
7. The method according to claim 1, wherein the step of performing named entity recognition on Chinese text using the trained complete named entity recognition model comprises:
inputting the sentence needing named entity recognition into the trained complete named entity recognition model;
after the pre-trained BERT language model, the input sequence passes sequentially through the bidirectional LSTM neural network and the joint CRF-and-SemiCRF module; the CRF features of each word of the input sentence are calculated, and from them the CRF feature matrix of the CRF layer and the SemiCRF feature matrix of the SemiCRF layer;
decoding the best path of the input sentence on the CRF layer and the SemiCRF layer respectively with the Viterbi algorithm, obtaining the CRF-layer tag sequence with its CRF-layer score, score_{C-C}, and the SemiCRF-layer tag sequence with its SemiCRF-layer score, score_{S-S};
calculating the score of the tag sequence decoded by the CRF layer on the SemiCRF layer, score_{C-S}, and the score of the tag sequence decoded by the SemiCRF layer on the CRF layer, score_{S-C};
calculating the total score of the CRF-layer tag sequence, score_{C-C} + score_{C-S}, and of the SemiCRF-layer tag sequence, score_{S-S} + score_{S-C}; because the scores come from negative log-likelihood processing, the tag sequence with the smallest score is taken as the result of named entity recognition.
CN202010272320.8A 2020-04-09 2020-04-09 Chinese named entity identification method based on BERT and semi CRF Pending CN111563383A (en)


Publications (1)

Publication number: CN111563383A; publication date: 2020-08-21


Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110879831A (en) * 2019-10-12 2020-03-13 杭州师范大学 Chinese medicine sentence word segmentation method based on entity recognition technology

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHI-XIU YE ET AL.: "Hybrid Semi-Markov CRF for Neural Sequence Labeling", Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics *
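The non-patent reference above captures the core idea behind the semi-CRF layer named in the title: score and label whole candidate spans rather than individual tokens. As a rough, hypothetical illustration (not the patent's or the paper's actual code), the sketch below decodes the highest-scoring labeled segmentation with a semi-Markov Viterbi dynamic program; the segment scores, label set, transition table, and `max_len` bound are invented stand-ins for what a BERT encoder would supply in a real system:

```python
# Illustrative semi-Markov CRF Viterbi decoder (a sketch, not the patented method).
# seg_score maps (start, end, label) -> score for tagging tokens[start:end] with label;
# trans maps (prev_label, label) -> transition score between adjacent segments.

NEG = float("-inf")

def semi_crf_decode(n, labels, seg_score, trans, max_len=4):
    """Return (best_score, segments), segments as [(start, end, label), ...]."""
    # best[j][y]: best score of a segmentation of the first j tokens
    # whose last segment carries label y
    best = [{y: NEG for y in labels} for _ in range(n + 1)]
    back = [{y: None for y in labels} for _ in range(n + 1)]
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):  # candidate segment covers tokens[i:j]
            for y in labels:
                emit = seg_score.get((i, j, y), NEG)
                if emit == NEG:
                    continue
                if i == 0:
                    cand, prev = emit, None
                else:
                    # best predecessor label, adding its label-transition score
                    prev, prev_score = max(
                        ((p, best[i][p] + trans.get((p, y), 0.0)) for p in labels),
                        key=lambda t: t[1],
                    )
                    cand = prev_score + emit
                if cand > best[j][y]:
                    best[j][y] = cand
                    back[j][y] = (i, prev)
    # backtrack from the best final label
    y = max(labels, key=lambda lab: best[n][lab])
    score, segments, j = best[n][y], [], n
    while j > 0:
        i, prev = back[j][y]
        segments.append((i, j, y))
        j, y = i, prev
    return score, list(reversed(segments))

# Toy example: 5 tokens ("张三在北京"), with hand-picked span scores
seg_score = {(0, 2, "PER"): 2.0, (2, 3, "O"): 1.0, (3, 5, "LOC"): 2.0}
score, segs = semi_crf_decode(5, ["PER", "O", "LOC"], seg_score, trans={})
# score == 5.0; segs == [(0, 2, "PER"), (2, 3, "O"), (3, 5, "LOC")]
```

Bounding segment length by `max_len` keeps decoding at O(n · max_len · |Y|²), which is the usual trade-off that makes semi-CRF labeling practical for entity spans of limited length.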

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022048210A1 (en) * 2020-09-03 2022-03-10 平安科技(深圳)有限公司 Named entity recognition method and apparatus, and electronic device and readable storage medium
CN111967266A (en) * 2020-09-09 2020-11-20 中国人民解放军国防科技大学 Chinese named entity recognition model and construction method and application thereof
CN111967266B (en) * 2020-09-09 2024-01-26 中国人民解放军国防科技大学 Chinese named entity recognition system, model construction method, application and related equipment
CN112115238A (en) * 2020-10-29 2020-12-22 电子科技大学 Question-answering method and system based on BERT and knowledge base
CN112115238B (en) * 2020-10-29 2022-11-15 电子科技大学 Question-answering method and system based on BERT and knowledge base
CN112347253A (en) * 2020-11-04 2021-02-09 新智数字科技有限公司 Method and device for establishing text information recognition model and terminal equipment
CN112347253B (en) * 2020-11-04 2023-09-08 新奥新智科技有限公司 Text information recognition model building method and device and terminal equipment
CN112699682A (en) * 2020-12-11 2021-04-23 山东大学 Named entity identification method and device based on combinable weak authenticator
CN112733533A (en) * 2020-12-31 2021-04-30 浙大城市学院 Multi-mode named entity recognition method based on BERT model and text-image relation propagation
CN112733533B (en) * 2020-12-31 2023-11-07 浙大城市学院 Multi-modal named entity recognition method based on BERT model and text-image relation propagation
CN112836046A (en) * 2021-01-13 2021-05-25 哈尔滨工程大学 Four-risk one-gold-field policy and regulation text entity identification method
CN112949310A (en) * 2021-03-01 2021-06-11 创新奇智(上海)科技有限公司 Model training method, traditional Chinese medicine name recognition method and device and network model
CN113011141A (en) * 2021-03-17 2021-06-22 平安科技(深圳)有限公司 Buddha note model training method, Buddha note generation method and related equipment
CN113158671B (en) * 2021-03-25 2023-08-11 胡明昊 Open domain information extraction method combined with named entity identification
CN113158671A (en) * 2021-03-25 2021-07-23 胡明昊 Open domain information extraction method combining named entity recognition
CN113127060A (en) * 2021-04-09 2021-07-16 中通服软件科技有限公司 Software function point identification method based on natural language pre-training model (BERT)
CN113344098A (en) * 2021-06-22 2021-09-03 北京三快在线科技有限公司 Model training method and device
CN113468889A (en) * 2021-06-29 2021-10-01 上海犀语科技有限公司 Method and device for extracting model information based on BERT pre-training
CN113779992A (en) * 2021-07-19 2021-12-10 西安理工大学 Method for realizing BcBERT-SW-BilSTM-CRF model based on vocabulary enhancement and pre-training
CN113722476A (en) * 2021-07-30 2021-11-30 的卢技术有限公司 Resume information extraction method and system based on deep learning
CN113673248A (en) * 2021-08-23 2021-11-19 中国人民解放军32801部队 Named entity identification method for testing and identifying small sample text
CN113673248B (en) * 2021-08-23 2022-02-01 中国人民解放军32801部队 Named entity identification method for testing and identifying small sample text
CN113849597A (en) * 2021-08-31 2021-12-28 艾迪恩(山东)科技有限公司 Illegal advertising word detection method based on named entity recognition
CN113761891A (en) * 2021-08-31 2021-12-07 国网冀北电力有限公司 Power grid text data entity identification method, system, equipment and medium
CN113849597B (en) * 2021-08-31 2024-04-30 艾迪恩(山东)科技有限公司 Illegal advertisement word detection method based on named entity recognition
CN114580422A (en) * 2022-03-14 2022-06-03 昆明理工大学 Named entity identification method combining two-stage classification of neighbor analysis
CN115221882A (en) * 2022-07-28 2022-10-21 平安科技(深圳)有限公司 Named entity identification method, device, equipment and medium
CN115221882B (en) * 2022-07-28 2023-06-20 平安科技(深圳)有限公司 Named entity identification method, device, equipment and medium
CN115713083A (en) * 2022-11-23 2023-02-24 重庆邮电大学 Intelligent extraction method for key information of traditional Chinese medicine text
CN115713083B (en) * 2022-11-23 2023-12-15 北京约来健康科技有限公司 Intelligent extraction method for traditional Chinese medicine text key information
CN116204610A (en) * 2023-04-28 2023-06-02 深圳市前海数据服务有限公司 Data mining method and device based on named entity recognition of report capable of being ground

Similar Documents

Publication Publication Date Title
CN111563383A (en) Chinese named entity identification method based on BERT and semi CRF
CN109657239B (en) Chinese named entity recognition method based on attention mechanism and language model learning
CN111985239B (en) Entity identification method, entity identification device, electronic equipment and storage medium
CN109918666B (en) Chinese punctuation mark adding method based on neural network
CN109635279B (en) Chinese named entity recognition method based on neural network
CN109753660B (en) LSTM-based winning bid web page named entity extraction method
CN109871538A (en) A kind of Chinese electronic health record name entity recognition method
CN109543181B (en) Named entity model and system based on combination of active learning and deep learning
CN107203511A (en) A kind of network text name entity recognition method based on neutral net probability disambiguation
CN111709242B (en) Chinese punctuation mark adding method based on named entity recognition
CN110750959A (en) Text information processing method, model training method and related device
CN106980609A (en) A kind of name entity recognition method of the condition random field of word-based vector representation
CN111460824B (en) Unmarked named entity identification method based on anti-migration learning
CN110555084A (en) remote supervision relation classification method based on PCNN and multi-layer attention
CN111444704B (en) Network safety keyword extraction method based on deep neural network
CN111008526A (en) Named entity identification method based on dual-channel neural network
CN110276069A (en) A kind of Chinese braille mistake automatic testing method, system and storage medium
CN110837736B (en) Named entity recognition method of Chinese medical record based on word structure
CN112364623A (en) Bi-LSTM-CRF-based three-in-one word notation Chinese lexical analysis method
CN110991185A (en) Method and device for extracting attributes of entities in article
CN114153971A (en) Error-containing Chinese text error correction, identification and classification equipment
CN115510864A (en) Chinese crop disease and pest named entity recognition method fused with domain dictionary
Du et al. Named entity recognition method with word position
CN110569506A (en) Medical named entity recognition method based on medical dictionary
CN111507103B (en) Self-training neural network word segmentation model using partial label set

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200821