CN113268452A - Entity extraction method, device, equipment and storage medium - Google Patents

Entity extraction method, device, equipment and storage medium

Info

Publication number
CN113268452A
Authority
CN
China
Prior art keywords
entity
extraction model
data set
module
inputting
Prior art date
Legal status
Granted
Application number
CN202110569742.6A
Other languages
Chinese (zh)
Other versions
CN113268452B (en)
Inventor
罗永贵
刘霄晨
肖劲
尹芳
张晓璐
马晶
Current Assignee
Lianren Healthcare Big Data Technology Co Ltd
Original Assignee
Lianren Healthcare Big Data Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Lianren Healthcare Big Data Technology Co Ltd
Priority to CN202110569742.6A
Publication of CN113268452A
Application granted
Publication of CN113268452B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 - File systems; File servers
    • G06F16/11 - File system administration, e.g. details of archiving or snapshots
    • G06F16/116 - Details of conversion of file system types or formats
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Epidemiology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention discloses an entity extraction method, device, equipment and storage medium. The method comprises the following steps: acquiring an unlabeled data set and a labeled data set corresponding to the unlabeled data set, determining new words in the unlabeled data set, and forming a new word data set; converting each piece of unlabeled data in the unlabeled data set into a vector in a preset format, and inputting the vector into an entity extraction model to be trained; enhancing the feature information output by the feature extraction module based on the new word data set, and inputting the enhanced feature information into a prediction module to obtain a predicted entity; and generating a loss function based on the predicted entity and the labeled data set, and iteratively adjusting the parameters of the entity extraction model to obtain a target entity extraction model. With this technical scheme, entity boundary information can be effectively learned from the new word data set during entity extraction, thereby improving the accuracy of entity extraction.

Description

Entity extraction method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of data processing, in particular to a method, a device, equipment and a storage medium for entity extraction.
Background
Electronic medical records contain a large number of specialized medical terms, and because each doctor has individual writing habits, the same medical term is often expressed in different ways. As a result, electronic medical records contain many out-of-vocabulary (OOV) words, which makes entity extraction from them difficult and challenging.
The current common approach is to train a model on single characters or words, using massive amounts of data to improve the model's generalization ability and thereby alleviate the difficulty of recognizing out-of-vocabulary words. However, entity extraction from electronic medical records faces a vocabulary with virtually unlimited permutations and combinations, and the ambiguity of Chinese word segmentation further reduces the accuracy of entity recognition and extraction.
Disclosure of Invention
The embodiment of the invention provides a method, a device, equipment and a storage medium for entity extraction, so as to improve the accuracy of entity extraction.
In a first aspect, an embodiment of the present invention provides a method for training an entity extraction model, including:
acquiring an unlabeled data set and a labeled data set corresponding to the unlabeled data set, and determining new words in the unlabeled data set to form a new word data set;
converting each unmarked data in the unmarked data set into a preset format vector, and inputting the preset format vector into an entity extraction model to be trained, wherein the entity extraction model comprises a feature extraction module and a prediction module;
based on the new word data set, enhancing the feature information output by the feature extraction module, and inputting the enhanced feature information to the prediction module to obtain a prediction entity;
and generating a loss function based on the predicted entity and the labeled data set, and carrying out iterative parameter adjustment on the entity extraction model to obtain a target entity extraction model.
In a second aspect, an embodiment of the present invention further provides an entity extraction method, including:
acquiring data to be processed, and converting the data to be processed into a preset format vector;
and inputting the preset format vector to a pre-trained entity extraction model to obtain a target entity corresponding to the data to be processed, wherein the entity extraction model is obtained by training based on the entity extraction model training method provided by any embodiment of the invention.
In a third aspect, an embodiment of the present invention further provides a training apparatus for an entity extraction model, including:
the new word determining module is used for acquiring an unlabeled data set and a labeled data set corresponding to the unlabeled data set, determining new words in the unlabeled data set and forming a new word data set;
the vector input module is used for converting each unmarked data in the unmarked data set into a preset format vector and inputting the preset format vector into an entity extraction model to be trained, wherein the entity extraction model comprises a feature extraction module and a prediction module;
the information enhancement module is used for enhancing the feature information output by the feature extraction module based on the new word data set, and inputting the enhanced feature information to the prediction module to obtain a prediction entity;
and the model generation module generates a loss function based on the predicted entity and the labeled data set, and performs iterative parameter adjustment on the entity extraction model to obtain a target entity extraction model.
In a fourth aspect, an embodiment of the present invention further provides an entity extraction apparatus, including:
the data conversion module is used for acquiring data to be processed and converting the data to be processed into a preset format vector;
and the entity identification module is used for inputting the preset format vector to a pre-trained entity extraction model and identifying a target entity corresponding to the data to be processed, wherein the entity extraction model is obtained by training based on the entity extraction model training method provided by any embodiment of the invention.
In a fifth aspect, an embodiment of the present invention further provides an electronic device, where the electronic device includes:
one or more processors;
storage means for storing one or more programs;
when executed by the one or more processors, the one or more programs cause the one or more processors to implement a method of training an entity extraction model, and/or an entity extraction method, as provided by any of the embodiments of the invention.
In a sixth aspect, embodiments of the present invention further provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform the method for training an entity extraction model and/or the method for entity extraction described in any of the embodiments of the present invention.
According to the technical scheme of the embodiment of the invention, new words in an unlabeled data set are determined by acquiring the unlabeled data set and a labeled data set corresponding to the unlabeled data set, so as to form a new word data set; converting each unmarked data in the unmarked data set into a preset format vector, and inputting the preset format vector into an entity extraction model to be trained, wherein the entity extraction model comprises a feature extraction module and a prediction module; based on the new word data set, enhancing the feature information output by the feature extraction module, and inputting the enhanced feature information to the prediction module to obtain a prediction entity; and generating a loss function based on the predicted entity and the labeled data set, and carrying out iterative parameter adjustment on the entity extraction model to obtain a target entity extraction model. According to the technical scheme, when the entity is extracted, the entity boundary information can be effectively learned by means of the new word data set, so that the accuracy of entity extraction is improved.
Drawings
Fig. 1 is a flowchart of a training method for an entity extraction model according to an embodiment of the present invention.
Fig. 2 is a flowchart of an entity extraction method according to a second embodiment of the present invention.
Fig. 3 is a schematic diagram of an entity extraction structure based on new words according to a second embodiment of the present invention.
Fig. 4 is a schematic structural diagram of a training apparatus for an entity extraction model according to a third embodiment of the present invention.
Fig. 5 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a method for training an entity extraction model according to an embodiment of the present invention, where this embodiment is applicable to a case where an entity extraction model is trained according to a data set, and the method may be executed by an apparatus for training an entity extraction model according to an embodiment of the present invention, where the apparatus may be implemented by software and/or hardware, and the apparatus may be configured on an electronic computing device, and specifically includes the following steps:
s110, obtaining an unlabeled data set and a labeled data set corresponding to the unlabeled data set, determining a new word in the unlabeled data set, and forming a new word data set.
In this embodiment, the labeled data set is a data set whose samples have been annotated by a user with a labeling tool, and the corresponding unlabeled data set is a data set whose samples have not been annotated. The type and content of the unlabeled data set are not specifically limited here. Optionally, the unlabeled data set may be an unlabeled electronic medical record data set, and the labeled data set may be formed by selecting a small number of samples from the large unlabeled electronic medical record data set and annotating them. New words are identified in the data set by a new word discovery algorithm to form a new word data set, where a new word is a word not registered in the dictionary, i.e. an out-of-dictionary word; new words may include, but are not limited to, abbreviations, proper nouns, derivatives, and compound words. New word discovery algorithms fall mainly into statistics-based methods, rule-based methods, and hybrids of the two.
Optionally, the new word discovery algorithm may be a rule-based method: a rule base, professional lexicon, or pattern base is established by analyzing the word-formation and occurrence characteristics of vocabulary, and new words are discovered by rule matching. The algorithm may also be a statistics-based method; the statistical model used may be an N-gram, where the value of N varies with the recognition samples and the requirements. For example, when N is 2, the statistical model is a bigram, which considers only the grammar and data information obtained from two adjacent words. The new word discovery algorithm may also fuse statistics and rules to achieve more accurate new word discovery.
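The statistics-based branch of new word discovery can be sketched as a frequency-plus-association filter over adjacent character pairs (a bigram model). This is a minimal illustration under stated assumptions, not the patent's concrete algorithm: the function name, the restriction to two-character candidates, and the pointwise-mutual-information thresholds are all illustrative.

```python
import math
from collections import Counter

def discover_new_words(corpus, dictionary, min_count=2, min_pmi=1.0):
    """Find candidate new words: frequent character bigrams with high
    pointwise mutual information (PMI) that are absent from the dictionary."""
    chars = Counter()
    bigrams = Counter()
    for sentence in corpus:
        chars.update(sentence)
        bigrams.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
    total_chars = sum(chars.values())
    total_bigrams = sum(bigrams.values())
    new_words = []
    for bg, count in bigrams.items():
        if count < min_count or bg in dictionary:
            continue  # too rare, or already a registered word
        p_bg = count / total_bigrams
        p_a = chars[bg[0]] / total_chars
        p_b = chars[bg[1]] / total_chars
        # high PMI means the two characters co-occur far more than chance
        if math.log(p_bg / (p_a * p_b)) >= min_pmi:
            new_words.append(bg)
    return new_words
```

A production system would extend candidates beyond bigrams and typically combine PMI with boundary entropy, but the filtering idea is the same.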
And S120, converting each unmarked data in the unmarked data set into a preset format vector, and inputting the preset format vector into an entity extraction model to be trained, wherein the entity extraction model comprises a feature extraction module and a prediction module.
In this embodiment, converting each piece of unlabeled data in the unlabeled data set into a vector in a preset format means producing a vector adapted to the input requirements of the entity extraction model, so that the model to be trained can extract effective features; the preset format is determined according to the input requirements of the entity extraction model. Methods for this conversion may include, but are not limited to, character-to-vector (char2vec) models, word-to-vector (word2vec) models, and other vector conversion models.
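The conversion step can be illustrated as a simple embedding lookup. The tiny vocabulary and embedding table below are hypothetical stand-ins for a pre-trained char2vec model; only the lookup mechanics reflect the text above.

```python
def to_char_vectors(text, char2id, embeddings):
    """Look up a pre-trained embedding row for every character of a record,
    producing the preset-format vector sequence expected by the model.
    Characters missing from the vocabulary map to index 0 (<UNK>)."""
    return [embeddings[char2id.get(ch, 0)] for ch in text]

# Hypothetical tiny vocabulary; a real char2vec model is trained on the corpus.
char2id = {"<UNK>": 0, "头": 1, "痛": 2}
embeddings = [[0.0, 0.0], [0.1, 0.9], [0.8, 0.2]]  # one row per character id
```

A word-vector (word2vec) pipeline works the same way, except the text is first segmented into words and the lookup table is keyed by words.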
The entity extraction model can be trained in advance on a large number of unlabeled data sets and the labeled data sets corresponding to them. The trained entity extraction model comprises a feature extraction module and a prediction module: the feature extraction module learns the contextual relationship between each entity word in the unlabeled data and the other entity words, and the prediction module predicts the entity type, which may include, but is not limited to, a person name, a place name, or an organization. The model parameters are trained by continuously adjusting them so that the deviation between the model output and the labeled data set gradually decreases and stabilizes, thereby generating the entity extraction model.
The model parameters of the entity extraction model may adopt a random initialization principle, or may also adopt a fixed value initialization principle according to experience, which is not specifically limited in this embodiment. By carrying out initialization assignment on the weight and the offset value of each node of the model, the convergence speed and the performance of the model can be improved.
On the basis of the above embodiment, the entity extraction model comprises a first extraction model based on character vectors and/or a second extraction model based on word vectors. Converting each piece of unlabeled data in the unlabeled data set into a preset-format vector and inputting it into the entity extraction model to be trained includes: converting each piece of unlabeled data in the unlabeled data set into character vectors and inputting the character vectors into the first extraction model to be trained; and/or converting each piece of unlabeled data in the unlabeled data set into word vectors and inputting the word vectors into the second extraction model to be trained.
A character vector is the vectorized representation of a single character; methods for converting each piece of unlabeled data into character vectors may include a character embedding model (char2vec) and the like. A word vector is the vectorized representation of a word; methods for converting each piece of unlabeled data into word vectors may include a word embedding model (word2vec) and the like. Both the character-vector-based first extraction model and the word-vector-based second extraction model may be Named Entity Recognition (NER) models. The NER model may include, but is not limited to, deep learning models such as LSTM-CRF, BERT-BiLSTM-CRF, and IDCNN/BiLSTM-CRF, which are not limited in this embodiment.
Exemplarily, the unlabeled data set can be trained with the character embedding model char2vec to obtain character vectors, and the pre-trained character vectors are then input into an LSTM-CRF model; alternatively, the unlabeled data set can be trained with the word embedding model word2vec to obtain word vectors, and the pre-trained word vectors are then input into an LSTM-CRF model.
S130, based on the new word data set, the feature information output by the feature extraction module is enhanced, and the enhanced feature information is input to the prediction module to obtain a prediction entity.
In this embodiment, the feature information output by the feature extraction module is enhanced based on the new word data set: the feature information is further refined, and the latent word information in the new word data set is incorporated as features, making the feature information more complete. This reduces recognition errors caused by ambiguity and improves the accuracy with which the prediction module predicts entities.
Optionally, the feature information includes an emission matrix and a transition probability matrix, and correspondingly, the enhancing processing of the feature information output by the feature extraction module based on the new word data set includes: and determining an enhancement coefficient based on the number of new words in the new word data set, and enhancing the transition probability matrix based on the enhancement coefficient.
The emission matrix and the transition probability matrix are both output by the feature extraction module. The emission matrix describes the score of each entity category at the current position; the transition probability matrix describes the score of moving from the entity category at the current position to the entity category at the next position. From the emission matrix and the transition probability matrix, a path score can be calculated, which can be understood as the probability of the entity type of each word in the current unlabeled data. An enhancement coefficient is determined based on the number of new words in the new word data set, the transition probability matrix is then adjusted with this coefficient, and the latent word information in the new word data set is combined into the feature information, thereby reducing recognition errors caused by ambiguity and improving the accuracy of the prediction module in predicting entities.
It should be emphasized that the transition probability matrix and the emission matrix are output by the feature extraction module and used as inputs to the prediction module; both can be initialized randomly, so that their parameters are updated along with model training.
For example, the path score may be denoted by S, the emission score corresponding to the emission matrix by E, and the transition score corresponding to the transition probability matrix by T. The emission probability may be referred to as the emission score, and the transition probability as the transition score. The path score is then given by the following formula:
S=E+T
The transition score T of the path score is multiplied by the enhancement coefficient (1 + γ·exp(N/10000)), where γ is a hyperparameter in the range 0 < γ < 1 and N is the number of new words in the new word data set.
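The enhancement step can be sketched as follows, assuming the coefficient is applied uniformly to every entry of the transition matrix (the patent text does not spell out the application granularity, so that uniformity is an assumption):

```python
import math

def enhance_transition(transition, n_new_words, gamma=0.5):
    """Multiply every transition score by the enhancement coefficient
    (1 + gamma * exp(N / 10000)) from the formula above, where N is the
    number of new words in the new word data set and 0 < gamma < 1."""
    coeff = 1.0 + gamma * math.exp(n_new_words / 10000.0)
    return [[score * coeff for score in row] for row in transition]
```

With γ = 0.5 and an empty new word data set the coefficient is 1.5, and it grows as more new words are discovered, so richer new-word evidence weights the transition scores more heavily in the path score.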
S140, generating a loss function based on the predicted entity and the labeled data set, and carrying out iterative parameter adjustment on the entity extraction model to obtain a target entity extraction model.
In this embodiment, the loss function may be a log-likelihood loss function. Specifically, the loss function is generated by calculating the emission score and transition score corresponding to the emission matrix and transition probability matrix in the feature information, normalizing them to obtain a maximum likelihood probability, and converting that probability into logarithmic form. Iterative parameter adjustment of the entity extraction model through the loss function reduces and stabilizes the deviation between the predicted entity and the labeled data set, yielding the target entity extraction model. The normalization method is not limited in this embodiment; optionally, it may be the softmax function.
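The log-likelihood loss described above can be illustrated with a brute-force linear-chain CRF sketch: the softmax over path scores, taken in log form, gives the negative log-likelihood of the gold path. This is a generic textbook formulation rather than the patent's implementation, and real systems compute the partition term with the forward algorithm instead of enumerating every path.

```python
import math
from itertools import product

def crf_neg_log_likelihood(emissions, transitions, gold_path):
    """Negative log-likelihood of the gold label path under a linear-chain
    CRF: log of the sum of exp(path score) over all paths, minus the gold
    path's score (a log-softmax over path scores)."""
    n_steps, n_tags = len(emissions), len(emissions[0])

    def path_score(path):
        s = sum(emissions[i][t] for i, t in enumerate(path))
        s += sum(transitions[a][b] for a, b in zip(path, path[1:]))
        return s

    # Brute-force partition function over all tag paths (exponential cost;
    # real implementations use the forward algorithm instead).
    log_z = math.log(sum(math.exp(path_score(p))
                         for p in product(range(n_tags), repeat=n_steps)))
    return log_z - path_score(gold_path)
```

The loss is minimal when the gold path's score dominates all alternatives, which is exactly the training signal used for the iterative parameter adjustment.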
The training process of the model is executed iteratively until the required number of training iterations and training precision are met or a convergence state is reached; the entity extraction model is then determined to be fully trained, and the target entity extraction model is obtained.
Optionally, the prediction entity is a prediction entity output by the first extraction model, or a prediction entity output by the second extraction model, or a prediction entity obtained by fusing a prediction entity output by the first extraction model and a prediction entity output by the second extraction model.
The predicted entity obtained by the entity extraction model may be the predicted entity output by the character-vector-based first extraction model, the predicted entity output by the word-vector-based second extraction model, or a fusion of the two. It should be noted that the predicted entity is determined by ranking candidates according to the entity-type probabilities produced by the entity extraction model. Fusing the predicted entity output by the first extraction model with that output by the second extraction model means fusing their entity-type probabilities. For example, if the entity-type probability from the character-vector-based first extraction model is 0.8 and that from the word-vector-based second extraction model is 0.6, averaging the two gives a fused entity-type probability of 0.7, an equilibrium point between the two predictions that helps ensure the reliability of the predicted entity. By fusing the two models' predictions, the entity extraction model can exploit character-vector information while also incorporating the contextual information carried by word vectors, improving its recognition accuracy.
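The 0.8/0.6 fusion example above amounts to a per-entity-type average of the two models' probabilities. The dictionary representation below is an assumption about how the model outputs are structured, made purely for illustration:

```python
def fuse_predictions(probs_char_model, probs_word_model):
    """Fuse the entity-type probabilities of the character-vector model and
    the word-vector model by simple averaging, as in the 0.8/0.6 -> 0.7
    example; assumes both models score the same set of entity types."""
    return {etype: (probs_char_model[etype] + probs_word_model[etype]) / 2
            for etype in probs_char_model}
```

Other fusion rules (weighted averages, max, learned gating) are possible; averaging is the one the example describes.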
The embodiment of the invention provides a training method of an entity extraction model, which comprises the steps of determining new words in an unlabeled data set by acquiring the unlabeled data set and a labeled data set corresponding to the unlabeled data set to form a new word data set; converting each unmarked data in the unmarked data set into a preset format vector, and inputting the preset format vector into an entity extraction model to be trained, wherein the entity extraction model comprises a feature extraction module and a prediction module; based on the new word data set, enhancing the feature information output by the feature extraction module, and inputting the enhanced feature information to the prediction module to obtain a prediction entity; and generating a loss function based on the predicted entity and the labeled data set, and carrying out iterative parameter adjustment on the entity extraction model to obtain a target entity extraction model. According to the technical scheme, when the entity is extracted, the entity boundary information can be effectively learned by means of the new word data set, so that the accuracy of entity extraction is improved.
Example two
Fig. 2 is a flowchart of an entity extraction method according to a second embodiment of the present invention, where this embodiment is applicable to a case of performing entity extraction by using an entity extraction model, and the method may be executed by an entity extraction apparatus according to the second embodiment of the present invention, where the apparatus may be implemented by software and/or hardware, and the apparatus may be configured on an electronic computing device, and specifically includes the following steps:
s210, obtaining data to be processed, and converting the data to be processed into a preset format vector.
S220, inputting the preset format vector to a pre-trained entity extraction model to obtain a target entity corresponding to the data to be processed.
In this embodiment, the data to be processed is unlabeled data. The unlabeled data is acquired and converted into a preset-format vector, which may include, but is not limited to, character vectors and word vectors; the vector is then input into the pre-trained entity extraction model to obtain the target entity corresponding to the data.
In an optional implementation manner of the embodiment of the present invention, after acquiring the data to be processed, the method further includes: determining new words in the data to be processed; the inputting the preset format vector into a pre-trained entity extraction model to obtain a target entity corresponding to the data to be processed includes: and inputting the preset format vector to a feature extraction module of the entity extraction model to obtain feature information, enhancing the feature information based on the new words, and inputting the enhanced feature information to a prediction module of the entity extraction model to obtain a target entity.
The entity extraction model may use a Long Short-Term Memory network (LSTM) as the feature extraction module, or an LSTM-based variant network, which is not limited in this embodiment. The prediction module may adopt a Conditional Random Field (CRF), and the optimal target entity can be selected from the CRF's predictions by the Viterbi algorithm.
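Viterbi decoding over the CRF's emission and transition scores can be sketched as follows. This is a generic textbook implementation, not the patent's code; it recovers the single highest-scoring tag path rather than summing over all paths as the training loss does.

```python
def viterbi_decode(emissions, transitions):
    """Viterbi decoding: recover the highest-scoring tag path under
    per-position emission scores plus tag-to-tag transition scores."""
    n_tags = len(emissions[0])
    scores = list(emissions[0])   # best score of a path ending in each tag
    back = []                     # back-pointers, one row per later position
    for emit in emissions[1:]:
        prev = scores
        scores, pointers = [], []
        for t in range(n_tags):
            # best previous tag for a path that lands on tag t here
            best_s = max(range(n_tags), key=lambda s: prev[s] + transitions[s][t])
            scores.append(prev[best_s] + transitions[best_s][t] + emit[t])
            pointers.append(best_s)
        back.append(pointers)
    # follow back-pointers from the best final tag to recover the path
    best = max(range(n_tags), key=lambda t: scores[t])
    path = [best]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))
```

Because the enhancement coefficient scales the transition scores before decoding, it directly shifts which path Viterbi selects, which is how the new-word information influences the predicted entity boundaries.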
In an optional implementation of the embodiment of the present invention, the entity extraction model comprises a first extraction model based on character vectors and/or a second extraction model based on word vectors. Inputting the preset-format vector into the pre-trained entity extraction model to obtain the target entity corresponding to the data to be processed includes: inputting the character vectors converted from the data to be processed into the first extraction model to obtain a first entity; and/or inputting the word vectors converted from the data to be processed into the second extraction model to obtain a second entity; and determining the first entity or the second entity as the target entity, or fusing the first entity and the second entity to obtain the target entity.
For example, as shown in fig. 3, the data to be processed may be denoted S1, and new words in S1 are found by a new word discovery algorithm to form a new word set, denoted S3. The data to be processed S1 is converted into character vectors (char embeddings) by the char2vec method; S1 is also segmented into words by dictionary-based Chinese word segmentation, and the words are converted into word vectors (word embeddings) by the word2vec method. The first extraction model may be denoted M1 and the second extraction model M2. The character vectors are input into model M1, the transition probability matrix in M1 is enhanced through S3, and the enhanced transition probability matrix is input into the CRF layer of M1 for Viterbi decoding to obtain the first entity; the word vectors are input into model M2, the transition probability matrix in M2 is enhanced through S3, and the enhanced transition probability matrix is input into the CRF layer of M2 for Viterbi decoding to obtain the second entity. The first entity and the second entity obtained from models M1 and M2 are then fused to obtain the target entity.
The embodiment of the present invention provides an entity extraction method: data to be processed is acquired and converted into a preset format vector, and the preset format vector is input into a pre-trained entity extraction model to obtain the target entity corresponding to the data to be processed. With this technical scheme, entity boundary information can be effectively learned from the new word data set during entity extraction, which improves the accuracy of entity extraction.
Embodiment Three
Fig. 4 is a schematic structural diagram of a training apparatus for an entity extraction model according to a third embodiment of the present invention, where the training apparatus for an entity extraction model provided in this embodiment may be implemented by software and/or hardware, and may be configured in a terminal and/or a server to implement the training method for an entity extraction model in the third embodiment of the present invention. The device may specifically comprise: a new word determination module 310, a vector input module 320, an information enhancement module 330, and a model generation module 340.
The new word determining module 310 is configured to obtain an unlabeled data set and a labeled data set corresponding to the unlabeled data set, determine new words in the unlabeled data set, and form a new word data set; the vector input module 320 is configured to convert each unlabeled data item in the unlabeled data set into a preset format vector and input the preset format vector into the entity extraction model to be trained, where the entity extraction model includes a feature extraction module and a prediction module; the information enhancement module 330 is configured to enhance the feature information output by the feature extraction module based on the new word data set and input the enhanced feature information into the prediction module to obtain a predicted entity; and the model generation module 340 is configured to generate a loss function based on the predicted entity and the labeled data set and iteratively adjust the parameters of the entity extraction model to obtain a target entity extraction model.
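The model generation module's "loss function plus iterative parameter adjustment" can be sketched abstractly as a generic gradient-descent loop. This is a skeleton only: `grad_fn` is a hypothetical stand-in for the gradient of the loss between predicted and labeled entities (e.g. a CRF negative log-likelihood), and the optimizer, learning rate, and step budget are all assumptions the patent does not spell out.

```python
import numpy as np

def iterative_parameter_adjustment(params, grad_fn, data, lr=0.1, steps=100):
    """Generic outer loop: for each (input, label) pair, compute the
    gradient of the loss between predicted and labeled entities and
    adjust the parameters; repeat until the step budget is exhausted."""
    params = np.array(params, dtype=float)
    for _ in range(steps):
        for x, y in data:
            params = params - lr * grad_fn(params, x, y)
    return params
```

As a usage illustration, plugging in the gradient of a squared loss drives a scalar parameter to the value that fits the labeled data, which is the role the patent assigns to the model generation module at a much larger scale.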
On the basis of any optional technical scheme in the embodiment of the present invention, optionally, the feature information includes an emission matrix and a transition probability matrix; the information enhancement module 330 may be configured to:
determining an enhancement coefficient based on the number of new words in the new word data set, and enhancing the transition probability matrix based on the enhancement coefficient.
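As a sketch, the enhancement could look like the following. The patent states only that the coefficient is determined from the number of new words in the new word data set; the logarithmic formula and the `alpha` parameter below are assumptions chosen so the coefficient grows slowly and equals 1 when no new words are found.

```python
import math
import numpy as np

def enhance_transition_matrix(transition, n_new_words, alpha=0.05):
    """Scale the tag-transition score matrix by a coefficient that
    grows with the number of discovered new words. The exact
    functional form is an assumption; the patent leaves it open."""
    coef = 1.0 + alpha * math.log1p(n_new_words)
    return transition * coef
```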
On the basis of any optional technical solution in the embodiment of the present invention, optionally, the entity extraction model includes a first extraction model based on character vectors and/or a second extraction model based on word vectors;
the vector input module 320 may include:
the character vector conversion unit is used for converting each unlabeled data item in the unlabeled data set into character vectors and inputting the character vectors into the first extraction model to be trained; and/or,
the word vector conversion unit is used for converting each unlabeled data item in the unlabeled data set into word vectors and inputting the word vectors into the second extraction model to be trained.
On the basis of any optional technical solution in the embodiment of the present invention, optionally, the prediction entity is a prediction entity output by a first extraction model, or a prediction entity output by a second extraction model, or a prediction entity obtained by fusing a prediction entity output by a first extraction model and a prediction entity output by a second extraction model.
The embodiment further provides an entity extraction apparatus, which may include:
the data conversion module is used for acquiring data to be processed and converting the data to be processed into a preset format vector;
and the entity identification module is used for inputting the preset format vector to a pre-trained entity extraction model to obtain a target entity corresponding to the data to be processed, wherein the entity extraction model is obtained by training based on the entity extraction model training method according to any one of claims 1 to 4.
On the basis of any optional technical solution in the embodiment of the present invention, optionally, after the data to be processed is obtained, the data conversion module may further include:
the new word determining unit is used for determining new words in the data to be processed;
the entity identification module may be to:
inputting the preset format vector into the feature extraction module of the entity extraction model to obtain feature information, enhancing the feature information based on the new words, and inputting the enhanced feature information into the prediction module of the entity extraction model to obtain the target entity.
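The prediction module's decoding over the enhanced feature information can be sketched with standard Viterbi decoding over an emission matrix and a (possibly enhanced) transition matrix, matching the CRF-layer decoding the description refers to. The array shapes and score convention below are illustrative assumptions; only the decode itself is standard.

```python
import numpy as np

def viterbi_decode(emissions, transition):
    """Viterbi decoding: emissions has shape (n_tokens, n_tags),
    transition has shape (n_tags, n_tags); returns the highest-
    scoring tag index sequence."""
    n_tokens, n_tags = emissions.shape
    score = emissions[0].copy()
    backpointers = np.zeros((n_tokens, n_tags), dtype=int)
    for t in range(1, n_tokens):
        # Broadcast: previous score (column) + transition + emission (row).
        candidates = score[:, None] + transition + emissions[t][None, :]
        backpointers[t] = candidates.argmax(axis=0)
        score = candidates.max(axis=0)
    # Trace the best path backwards from the highest final score.
    best = [int(score.argmax())]
    for t in range(n_tokens - 1, 0, -1):
        best.append(int(backpointers[t, best[-1]]))
    return best[::-1]
```

Enhancing the transition matrix before this step biases the decoder toward tag transitions consistent with the discovered new-word boundaries, which is where the boundary information enters the prediction.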
On the basis of any optional technical solution in the embodiment of the present invention, optionally, the entity extraction model includes a first extraction model based on character vectors and/or a second extraction model based on word vectors;
the entity identification module is specifically operable to:
inputting the character vectors converted from the data to be processed into the first extraction model to obtain a first entity; and/or inputting the word vectors converted from the data to be processed into the second extraction model to obtain a second entity;
determining the first entity or the second entity as the target entity, or fusing the first entity and the second entity to obtain the target entity.
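One plausible fusion rule for combining the two models' entity sets is a union that prefers the longer span on overlap. The patent does not specify the fusion strategy, so this rule, and the `(start, end, label)` span representation, are assumptions for illustration.

```python
def fuse_entities(first, second):
    """Union the two sets of (start, end, label) spans; when spans
    overlap, keep the longer one (assumed tie-breaking rule)."""
    candidates = sorted(set(first) | set(second),
                        key=lambda s: s[1] - s[0], reverse=True)
    kept = []
    for start, end, label in candidates:
        # Keep a span only if it does not overlap any already-kept span.
        if all(end <= ks or start >= ke for ks, ke, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)
```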
The training device for the entity extraction model provided by the embodiment of the invention can execute the training method for the entity extraction model provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
Embodiment Four
Fig. 5 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention. Fig. 5 illustrates a block diagram of an exemplary electronic device 12 suitable for implementing embodiments of the present invention. The electronic device 12 shown in fig. 5 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.
As shown in FIG. 5, electronic device 12 is embodied in the form of a general purpose computing device. The components of electronic device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Electronic device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by electronic device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)30 and/or cache memory 32. The electronic device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, and commonly referred to as a "hard drive"). Although not shown in FIG. 5, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. System memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in system memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.
Electronic device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with electronic device 12, and/or with any devices (e.g., network card, modem, etc.) that enable electronic device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, the electronic device 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet) via the network adapter 20. As shown in FIG. 5, the network adapter 20 communicates with the other modules of the electronic device 12 via the bus 18. It should be appreciated that although not shown in FIG. 5, other hardware and/or software modules may be used in conjunction with electronic device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 16 executes programs stored in the system memory 28 to perform various functional applications and data processing, such as a training method for an entity extraction model and/or an entity extraction method provided by the present embodiment.
Embodiment Five
Embodiments of the present invention also provide a storage medium containing computer-executable instructions, which when executed by a computer processor, perform a method for training an entity extraction model and/or a method for entity extraction.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for embodiments of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (11)

1. A method for training an entity extraction model is characterized by comprising the following steps:
acquiring an unlabeled data set and a labeled data set corresponding to the unlabeled data set, and determining new words in the unlabeled data set to form a new word data set;
converting each unmarked data in the unmarked data set into a preset format vector, and inputting the preset format vector into an entity extraction model to be trained, wherein the entity extraction model comprises a feature extraction module and a prediction module;
based on the new word data set, enhancing the feature information output by the feature extraction module, and inputting the enhanced feature information to the prediction module to obtain a prediction entity;
and generating a loss function based on the predicted entity and the labeled data set, and carrying out iterative parameter adjustment on the entity extraction model to obtain a target entity extraction model.
2. The method of claim 1, wherein the feature information comprises an emission matrix and a transition probability matrix;
the enhancing processing of the feature information output by the feature extraction module based on the new word data set comprises:
and determining an enhancement coefficient based on the number of new words in the new word data set, and enhancing the transition probability matrix based on the enhancement coefficient.
3. The method according to any one of claims 1-2, wherein the entity extraction model comprises a first extraction model based on character vectors and/or a second extraction model based on word vectors;
the converting each unmarked data in the unmarked data set into a preset format vector, and inputting the preset format vector into an entity extraction model to be trained includes:
converting each unlabeled data item in the unlabeled data set into character vectors, and inputting the character vectors into a first extraction model to be trained; and/or,
and converting each unlabeled data in the unlabeled data set into a word vector, and inputting the word vector to a second extraction model to be trained.
4. The method according to claim 3, wherein the prediction entity is a prediction entity output by the first extraction model, or a prediction entity output by the second extraction model, or a prediction entity obtained by fusing a prediction entity output by the first extraction model and a prediction entity output by the second extraction model.
5. An entity extraction method, comprising:
acquiring data to be processed, and converting the data to be processed into a preset format vector;
inputting the preset format vector into a pre-trained entity extraction model to obtain a target entity corresponding to the data to be processed, wherein the entity extraction model is obtained by training based on the entity extraction model training method according to any one of claims 1 to 4.
6. The method of claim 5, wherein after acquiring the data to be processed, the method further comprises:
determining new words in the data to be processed;
the inputting the preset format vector into a pre-trained entity extraction model to obtain a target entity corresponding to the data to be processed includes:
and inputting the preset format vector to a feature extraction module of the entity extraction model to obtain feature information, enhancing the feature information based on the new words, and inputting the enhanced feature information to a prediction module of the entity extraction model to obtain a target entity.
7. The method according to claim 5 or 6, wherein the entity extraction model comprises a first extraction model based on character vectors and/or a second extraction model based on word vectors;
the inputting the preset format vector into a pre-trained entity extraction model to obtain a target entity corresponding to the data to be processed includes:
inputting the character vectors converted from the data to be processed into the first extraction model to obtain a first entity; and/or inputting the word vectors converted from the data to be processed into the second extraction model to obtain a second entity;
and determining the first entity or the second entity as a target entity, or fusing the first entity and the second entity to obtain the target entity.
8. An apparatus for training an entity extraction model, comprising:
the new word determining module is used for acquiring an unlabeled data set and a labeled data set corresponding to the unlabeled data set, determining new words in the unlabeled data set and forming a new word data set;
the vector input module is used for converting each unmarked data in the unmarked data set into a preset format vector and inputting the preset format vector into an entity extraction model to be trained, wherein the entity extraction model comprises a feature extraction module and a prediction module;
the information enhancement module is used for enhancing the feature information output by the feature extraction module based on the new word data set, and inputting the enhanced feature information to the prediction module to obtain a prediction entity;
and the model generation module generates a loss function based on the predicted entity and the labeled data set, and performs iterative parameter adjustment on the entity extraction model to obtain a target entity extraction model.
9. An entity extraction apparatus, comprising:
the data conversion module is used for acquiring data to be processed and converting the data to be processed into a preset format vector;
and the entity identification module is used for inputting the preset format vector to a pre-trained entity extraction model and identifying a target entity corresponding to the data to be processed, wherein the entity extraction model is obtained by training based on the entity extraction model training method according to any one of claims 1 to 4.
10. An electronic device, characterized in that the electronic device comprises:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of training an entity extraction model as claimed in any one of claims 1 to 4, and/or the method of entity extraction as claimed in any one of claims 5 to 7.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a method of training an entity extraction model according to any one of claims 1 to 4, and/or a method of entity extraction according to any one of claims 5 to 7.
CN202110569742.6A 2021-05-25 2021-05-25 Entity extraction method, device, equipment and storage medium Active CN113268452B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110569742.6A CN113268452B (en) 2021-05-25 2021-05-25 Entity extraction method, device, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113268452A true CN113268452A (en) 2021-08-17
CN113268452B CN113268452B (en) 2024-02-02

Family

ID=77232623

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110569742.6A Active CN113268452B (en) 2021-05-25 2021-05-25 Entity extraction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113268452B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115438190A (en) * 2022-09-06 2022-12-06 国家电网有限公司 Power distribution network fault decision-making assisting knowledge extraction method and system

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107818080A (en) * 2017-09-22 2018-03-20 新译信息科技(北京)有限公司 Term recognition methods and device
CN109271631A (en) * 2018-09-12 2019-01-25 广州多益网络股份有限公司 Segmenting method, device, equipment and storage medium
CN111090987A (en) * 2019-12-27 2020-05-01 北京百度网讯科技有限公司 Method and apparatus for outputting information
CN111295670A (en) * 2019-04-25 2020-06-16 阿里巴巴集团控股有限公司 Identification of entities in electronic medical records
WO2020133039A1 (en) * 2018-12-27 2020-07-02 深圳市优必选科技有限公司 Entity identification method and apparatus in dialogue corpus, and computer device
CN111639498A (en) * 2020-04-21 2020-09-08 平安国际智慧城市科技股份有限公司 Knowledge extraction method and device, electronic equipment and storage medium
CN111651994A (en) * 2020-06-03 2020-09-11 浙江同花顺智能科技有限公司 Information extraction method and device, electronic equipment and storage medium
WO2020232861A1 (en) * 2019-05-20 2020-11-26 平安科技(深圳)有限公司 Named entity recognition method, electronic device and storage medium
WO2021051871A1 (en) * 2019-09-18 2021-03-25 平安科技(深圳)有限公司 Text extraction method, apparatus, and device, and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YANG Danhao; WU Yuexin; FAN Chunxiao: "A Keyword Extraction Model for Chinese Short Texts Based on an Attention Mechanism", Computer Science, no. 01 *
TIAN Jiayuan; YANG Donghua; WANG Hongzhi: "Research on Medical Named Entity Recognition for Internet Resources", Journal of Frontiers of Computer Science and Technology, no. 06 *
HUANG Sheng; LI Wei; ZHANG Jian: "A Resume Information Entity Extraction Method Based on Deep Learning", Computer Engineering and Design, no. 12 *


Also Published As

Publication number Publication date
CN113268452B (en) 2024-02-02

Similar Documents

Publication Publication Date Title
US7493251B2 (en) Using source-channel models for word segmentation
CN110245348B (en) Intention recognition method and system
CN101707873B (en) Large language models in machine translation
US8744834B2 (en) Optimizing parameters for machine translation
CN110276023B (en) POI transition event discovery method, device, computing equipment and medium
US8380488B1 (en) Identifying a property of a document
JP4974470B2 (en) Representation of deleted interpolation N-gram language model in ARPA standard format
JP7301922B2 (en) Semantic retrieval method, device, electronic device, storage medium and computer program
US20090326916A1 (en) Unsupervised chinese word segmentation for statistical machine translation
CN111739514B (en) Voice recognition method, device, equipment and medium
CN113053367B (en) Speech recognition method, speech recognition model training method and device
CN111079432B (en) Text detection method and device, electronic equipment and storage medium
JP7133002B2 (en) Punctuation prediction method and apparatus
CN112287680B (en) Entity extraction method, device and equipment of inquiry information and storage medium
CN109710951B (en) Auxiliary translation method, device, equipment and storage medium based on translation history
CN111488742B (en) Method and device for translation
CN107111607B (en) System and method for language detection
CN111401078A (en) Running method, device, equipment and medium of neural network text translation model
US11681880B2 (en) Auto transformation of network data models using neural machine translation
CN111326144A (en) Voice data processing method, device, medium and computing equipment
CN112214595A (en) Category determination method, device, equipment and medium
CN113268452B (en) Entity extraction method, device, equipment and storage medium
CN111460224B (en) Comment data quality labeling method, comment data quality labeling device, comment data quality labeling equipment and storage medium
WO2023116572A1 (en) Word or sentence generation method and related device
US20220230633A1 (en) Speech recognition method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant