CN113128234A - Method and system for establishing entity recognition model, electronic equipment and medium - Google Patents

Method and system for establishing entity recognition model, electronic equipment and medium

Info

Publication number
CN113128234A
CN113128234A
Authority
CN
China
Prior art keywords
data set
labeled
word
processed
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110669805.5A
Other languages
Chinese (zh)
Other versions
CN113128234B (en)
Inventor
姚娟娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Mingping Medical Data Technology Co ltd
Original Assignee
Mingpinyun Beijing Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mingpinyun Beijing Data Technology Co Ltd filed Critical Mingpinyun Beijing Data Technology Co Ltd
Priority to CN202110669805.5A priority Critical patent/CN113128234B/en
Publication of CN113128234A publication Critical patent/CN113128234A/en
Application granted granted Critical
Publication of CN113128234B publication Critical patent/CN113128234B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/247Thesauruses; Synonyms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of data processing and provides a method, a system, electronic equipment and a medium for establishing an entity recognition model. The method comprises: acquiring text data of a target field to obtain an entity data set, and dividing the entity data set into a data set to be labeled and a data set to be processed; performing synonym replacement on the data set to be processed to obtain a processed data set; determining a new word data set in the processed data set according to the word sense similarity between the data set to be labeled and the processed data set, and labeling the data set to be labeled and the new word data set to obtain a labeled data set; pre-training the data set to be processed by an information extraction method based on the labeled data set to obtain a pre-training data set; and training an initial entity recognition model with the pre-training data set and outputting a target entity recognition model, thereby solving prior-art problems such as the small scale of high-quality labeled corpora.

Description

Method and system for establishing entity recognition model, electronic equipment and medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method and a system for establishing an entity recognition model, an electronic device, and a medium.
Background
Entity recognition is one of the core fundamental tasks in natural language processing: the extraction of entities of specific types from text. It has important scientific significance and wide application value in downstream natural language processing tasks such as information retrieval, question-answering systems, information extraction, and text mining. Judging from existing research results, named entity recognition in some specialized fields suffers from small high-quality labeled corpora, slow recognition speed, and low recognition accuracy, and its performance lags behind named entity recognition in traditional, general domains.
Disclosure of Invention
The invention provides a method, a system, electronic equipment and a medium for establishing an entity recognition model, which aim to solve prior-art problems such as the small scale of high-quality labeled corpora and low recognition speed.
The method for establishing the entity recognition model comprises the following steps: acquiring text data of a target field to obtain an entity data set, and dividing the entity data set into a data set to be labeled and a data set to be processed;
performing word segmentation processing and part-of-speech tagging on the data set to be processed, and performing synonym replacement on the data set to be processed according to the part-of-speech and the word co-occurrence degree to obtain a processed data set;
determining a new word data set in the processing data set according to the word meaning similarity of the data set to be labeled and the processing data set, and labeling the data set to be labeled and the new word data set to obtain a labeled data set;
pre-training the data set to be processed by adopting an information extraction method based on the labeled data set to obtain a pre-training data set;
and training an initial entity recognition model by adopting the pre-training data set, and outputting a target entity recognition model.
Optionally, the synonym replacement is performed on the data set to be processed according to the part of speech and the word co-occurrence degree to obtain a processed data set, and the method specifically includes:
removing stop words from the data set to be processed after word segmentation processing and part-of-speech tagging to obtain a data set to be classified;
clustering the data sets to be classified according to the semantic similarity to obtain a plurality of classified data sets;
and carrying out synonym replacement on the classification data set according to the part of speech and the word co-occurrence degree to obtain a classification processing set, and combining the classification processing set to generate a processing data set.
Optionally, the synonym replacement is performed on the classification data set according to the part of speech and the word co-occurrence degree to obtain a classification processing set, which specifically includes:
determining position evaluation parameters of the words according to the positions of the words in the classification data set and preset position weights;
obtaining words of the same part of speech in the classified data set to obtain a part of speech data set, obtaining words of the same position evaluation parameter in the part of speech data set, and determining word co-occurrence according to context semantic similarity between the words of the same position evaluation parameter;
and carrying out synonym replacement on the part-of-speech data set according to the word co-occurrence degree to obtain a classification processing set.
Optionally, the determining a new word data set in the processing data set according to the word sense similarity between the data set to be labeled and the processing data set specifically includes:
determining the word sense similarity of the data set to be labeled and the processing data set according to the synonymy relation and the antisense relation;
and if the word meaning similarity of the data set to be labeled and the processing data set is smaller than a similarity threshold value, obtaining a new word data set.
Optionally, the determining the word sense similarity of the data set to be labeled and the processed data set according to the synonymy relationship and the antisense relationship specifically includes:
determining a synonymy evaluation parameter according to the synonymy relation between the data set to be labeled and the processing data set and a preset synonymy relation weight;
determining antisense evaluation parameters according to the antisense relation between the data set to be labeled and the processing data set and preset antisense relation weight;
and determining the word sense similarity of the data set to be labeled and the processed data set according to the synonymy evaluation parameter and the antisense evaluation parameter.
Optionally, the labeling the data set to be labeled and the new word data set to obtain a labeled data set specifically includes:
if the data volume of the data set to be labeled is larger than a preset data volume threshold value, dividing the data set to be labeled into a sub data set to be labeled and a sub data set to be trained;
determining a difference data set according to the word meaning similarity of the sub data set to be labeled and the sub data set to be trained;
performing semantic annotation on the sub data set to be annotated and the difference data set to obtain a semantic annotation set;
pre-training the sub data set to be trained by adopting an information extraction method based on the semantic annotation set to obtain a pre-training sub data set;
and marking the new word data set, and combining the pre-training data set and the marked new word data set to obtain a marked data set.
Optionally, the labeling the data set to be labeled and the new word data set to obtain a labeled data set, further includes:
if the data volume of the data set to be labeled is smaller than a preset data volume threshold value, performing word segmentation processing on the data set to be labeled and the new word data set;
performing part-of-speech tagging on the data set to be tagged and the new word data set after word segmentation processing;
and performing semantic annotation on the data set to be annotated and the new word data set after the part of speech annotation to obtain an annotated data set.
The invention also provides a system for establishing the entity recognition model, which comprises the following steps:
the entity data set acquisition module is used for acquiring text data of a target field to obtain an entity data set and dividing the entity data set into a data set to be labeled and a data set to be processed;
the processing data set acquisition module is used for performing word segmentation processing and part-of-speech tagging on the data set to be processed, and performing synonym replacement on the data set to be processed according to the part-of-speech and the word co-occurrence degree to obtain a processing data set;
a labeling data set obtaining module, configured to determine a new word data set in the processing data set according to the word meaning similarity between the data set to be labeled and the processing data set, label the data set to be labeled and the new word data set, and obtain a labeling data set;
a pre-training data set acquisition module, configured to pre-train the to-be-processed data set by using an information extraction method based on the labeled data set to obtain a pre-training data set;
and the target model establishing module is used for training an initial entity recognition model by adopting the pre-training data set and outputting a target entity recognition model.
The present invention also provides an electronic device comprising: a processor and a memory;
the memory is used for storing computer programs, and the processor is used for executing the computer programs stored by the memory so as to enable the electronic equipment to execute the entity identification model building method.
The present invention also provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the method for building an entity identification model as described above.
The invention has the beneficial effects that: the method for establishing the entity recognition model first obtains an entity data set by acquiring text data of a target field and divides the entity data set into a data set to be labeled and a data set to be processed; secondly, synonym replacement is performed on the data set to be processed, a new word data set is determined according to word sense similarity, and the data set to be labeled and the new word data set are labeled to obtain a labeled data set; then the data set to be processed is pre-trained by an information extraction method to obtain a pre-training data set; and finally, an initial entity recognition model is trained with the pre-training data set and a target entity recognition model is output. Performing synonym replacement on the data set to be processed improves the efficiency of determining the new word data set according to word sense similarity, and thus improves the recognition speed; dividing the entity data set into a data set to be labeled and a data set to be processed and adopting a partial labeling method improves the data processing speed and the recognition speed; pre-training the data set to be processed by an information extraction method improves data quality and therefore recognition accuracy; and pre-training the entity data set and then training the initial entity recognition model with the pre-training data set to output the target entity recognition model achieves large-scale, high-quality labeling of text data in the target field.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed for describing the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart illustrating a method for building an entity recognition model according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a method for acquiring a processed data set according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a method for acquiring a new word data set according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a system for building an entity recognition model according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
First embodiment
Fig. 1 is a flowchart illustrating a method for building an entity recognition model according to an embodiment of the present invention.
As shown in fig. 1, the method for building the entity recognition model includes steps S110 to S150:
s110, acquiring text data of a target field to obtain an entity data set, and dividing the entity data set into a data set to be labeled and a data set to be processed;
s120, performing word segmentation processing and part-of-speech tagging on the data set to be processed, and performing synonym replacement on the data set to be processed according to the part-of-speech and the word co-occurrence degree to obtain a processed data set;
s130, determining a new word data set in the processing data set according to the word meaning similarity of the data set to be labeled and the processing data set, and labeling the data set to be labeled and the new word data set to obtain a labeled data set;
s140, pre-training the data set to be processed by adopting an information extraction method based on the labeled data set to obtain a pre-training data set;
s150, training an initial entity recognition model by adopting the pre-training data set, and outputting a target entity recognition model.
In step S110 of this embodiment, taking text data processing in the medical field as an example, the text data of the target field is derived from publicly available medical text data at home and abroad, such as medical journals, inquiry records, doctors' advice, electronic medical records, and the like; various paper disease diagnosis records can also be converted into medical-field text data by scanning or other means. The entity data set can be divided into a data set to be labeled and a data set to be processed in a suitable proportion according to the condition of the acquired entity data set, and the entity data set needs to be preprocessed before division, the preprocessing including data cleaning, special punctuation mark processing, and the like. Data cleaning mainly re-examines and checks the data, deletes repeated data, and corrects erroneous data to ensure data consistency. Common data cleaning methods include mathematical statistics, regression statistics, etc., and may be selected according to actual application requirements, which is not limited herein.
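The following Python sketch illustrates one possible implementation of the preprocessing and division in step S110; the cleaning rules, the 20% labeling ratio, and the helper names are illustrative assumptions and are not limited herein.

```python
import random
import re

def clean_entity_data(records):
    """Data cleaning: collapse whitespace, drop decorative punctuation, and remove exact duplicates."""
    seen, cleaned = set(), []
    for text in records:
        text = re.sub(r"\s+", " ", text).strip()      # normalize whitespace
        text = re.sub(r"[【】■◆*]", "", text)          # illustrative special-punctuation handling
        if text and text not in seen:                  # delete repeated data
            seen.add(text)
            cleaned.append(text)
    return cleaned

def divide_entity_dataset(records, to_label_ratio=0.2, seed=42):
    """Divide the cleaned entity data set into a data set to be labeled and a data set to be processed."""
    records = list(records)
    random.Random(seed).shuffle(records)
    k = int(len(records) * to_label_ratio)
    return records[:k], records[k:]   # (data set to be labeled, data set to be processed)
```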
In step S120 of this embodiment, word segmentation processing and part-of-speech tagging are performed on the data set to be processed, and then synonym replacement is performed on the data set to be processed according to the part-of-speech and the word co-occurrence degree, so as to obtain a processed data set. Please refer to fig. 2 for a specific implementation method of performing synonym replacement on the to-be-processed data set according to the part of speech and the word co-occurrence degree to obtain the processed data set, where fig. 2 is a schematic flow diagram of the method for acquiring the processed data set according to an embodiment of the present invention.
As shown in fig. 2, performing synonym replacement on the to-be-processed data set according to the part of speech and the word co-occurrence degree to obtain a processed data set may include the following steps S210 to S230:
s210, removing stop words from the data set to be processed after word segmentation processing and part-of-speech tagging to obtain a data set to be classified;
s220, clustering the data sets to be classified according to the semantic similarity to obtain a plurality of classified data sets;
and S230, carrying out synonym replacement on the classification data set according to the part of speech and the word co-occurrence degree to obtain a classification processing set, and combining the classification processing set to generate a processing data set.
In step S220 of this embodiment, before classifying the data sets to be classified, an unsupervised machine learning method may be adopted to perform simple semantic recognition on the data sets to be classified, the semantic similarity of words in the data sets to be classified is obtained according to the recognition results of the unsupervised machine learning method, and the data sets to be classified are clustered according to the semantic similarity to obtain a plurality of classified data sets. Clustering algorithms include, but are not limited to, the K-means clustering algorithm. The data sets to be classified are classified with the K-means clustering algorithm to obtain a plurality of classified data sets. Specifically, the K value is set according to practical experience; after the data sets to be classified are clustered, the words in each resulting classified data set are semantically similar, which facilitates the subsequent synonym replacement in the data set to be processed, increases the data processing speed, and thereby increases the recognition speed of the entity recognition model.
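As a non-limiting sketch of the clustering in step S220, the following code applies the K-means algorithm from scikit-learn to word vectors that are assumed to have been produced by the unsupervised machine learning method; the K value and the library choice are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_into_classified_datasets(words, word_vectors, k=8, seed=0):
    """words: list of tokens; word_vectors: dict mapping each word to its embedding (assumed given).
    Returns a list of classified data sets whose members are semantically similar."""
    X = np.vstack([word_vectors[w] for w in words])
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    clusters = {}
    for word, label in zip(words, labels):
        clusters.setdefault(int(label), []).append(word)
    return list(clusters.values())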
In step S230 of this embodiment, performing synonym replacement on the classified data set according to the part of speech and the word co-occurrence degree to obtain a classification processing set specifically includes: determining position evaluation parameters of words according to the positions of the words in the classified data set and preset position weights; obtaining words of the same part of speech in the classified data set to obtain a part-of-speech data set, obtaining words with the same position evaluation parameter in the part-of-speech data set, and determining the word co-occurrence degree according to the context semantic similarity between the words with the same position evaluation parameter; and performing synonym replacement on the part-of-speech data set according to the word co-occurrence degree to obtain a classification processing set. Specifically, the position evaluation parameter of a word may be determined from the positions of the word in the sentences of the classified data set: for example, when a word is at the head of a sentence the preset position weight may be 5, and when it is at any other position the preset position weight may be 3; all positions of the same word are counted, all the preset position weights are summed, and the sum is then divided by the word frequency to obtain the position evaluation parameter of the word. The co-occurrence degree between different words may be obtained from the recognition result of the unsupervised machine learning method in step S220; the co-occurrence degree between words with the same position evaluation parameter may be taken as the context semantic similarity between those words, and if the co-occurrence degree between two words with the same position evaluation parameter is greater than a certain value (which may be 80%, 90%, etc.), one of the words is substituted for the other to complete the synonym replacement. Adopting synonym replacement increases the speed of obtaining the word sense similarity between the data set to be labeled and the data set to be processed, increases the data processing speed, and thereby increases the recognition speed of the entity recognition model.
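The position evaluation parameter and co-occurrence-based replacement described above can be sketched as follows; the head-of-sentence weight of 5, the other-position weight of 3, and the 0.8 threshold follow the illustrative values in this paragraph, and `context_similarity` stands in for whatever contextual similarity the unsupervised model provides.

```python
from collections import defaultdict

HEAD_WEIGHT, OTHER_WEIGHT = 5, 3        # preset position weights (illustrative)

def position_evaluation_parameter(head_flags):
    """head_flags: one boolean per occurrence of the word, True if it starts a sentence.
    Sum the preset position weights over all occurrences and divide by the word frequency."""
    weights = [HEAD_WEIGHT if at_head else OTHER_WEIGHT for at_head in head_flags]
    return sum(weights) / len(weights)

def replace_synonyms(pos_dataset, head_flags, context_similarity, threshold=0.8):
    """pos_dataset: words sharing one part of speech; head_flags: word -> list of head/non-head flags;
    context_similarity(w1, w2): contextual semantic similarity between two words (assumed helper)."""
    groups = defaultdict(list)
    for w in pos_dataset:                                    # group words by position evaluation parameter
        groups[round(position_evaluation_parameter(head_flags[w]), 4)].append(w)

    replacement = {}
    for group in groups.values():
        for i, w1 in enumerate(group):
            for w2 in group[i + 1:]:
                if context_similarity(w1, w2) > threshold:   # word co-occurrence degree above threshold
                    replacement[w2] = replacement.get(w1, w1)
    return replacement   # mapping applied to build the classification processing set
```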
In step S130 of this embodiment, please refer to fig. 3 for a specific implementation method for determining a new word data set in the processing data set according to the word sense similarity between the data set to be labeled and the processing data set, where fig. 3 is a schematic flow chart of a new word data set obtaining method provided in an embodiment of the present invention.
As shown in fig. 3, determining a new word data set in the processing data set according to the word sense similarity between the data set to be labeled and the processing data set may include the following steps S310 to S320:
s310, determining the word sense similarity of the data set to be labeled and the processed data set according to the synonymy relation and the antisense relation;
and S320, if the word meaning similarity of the data set to be labeled and the processed data set is smaller than a similarity threshold, obtaining a new word data set.
In this embodiment, the purpose of obtaining the new word data set is to facilitate the pre-training of the data set to be processed by using an information extraction method subsequently, and avoid the problem that the identification effect is poor due to unreasonable division in the process of dividing the entity data set into the data set to be labeled and the data set to be processed; therefore, the synonymy relation and the antisense relation are adopted to determine the word sense similarity of the data set to be labeled and the processing data set, so that new words in the processing data set are obtained, and a new word data set is obtained.
In step S310 of this embodiment, determining the word sense similarity between the data set to be labeled and the processed data set according to the synonymy relation and the antisense relation specifically includes: determining a synonymy evaluation parameter according to the synonymy relation between the data set to be labeled and the processed data set and a preset synonymy relation weight; determining an antisense evaluation parameter according to the antisense relation between the data set to be labeled and the processed data set and a preset antisense relation weight; and determining the word sense similarity between the data set to be labeled and the processed data set according to the synonymy evaluation parameter and the antisense evaluation parameter. Specifically, the preset synonymy relation weight may be 60%, 70%, etc., the preset antisense relation weight may be 40%, 30%, etc., and the sum of the two weights is 1. The maximum synonymy similarity between a word in the processed data set and the words in the data set to be labeled is obtained and multiplied by the preset synonymy relation weight to obtain the synonymy evaluation parameter of the word; the maximum antisense similarity between the word and the words in the data set to be labeled is obtained and multiplied by the preset antisense relation weight to obtain the antisense evaluation parameter of the word; and the synonymy evaluation parameter and the antisense evaluation parameter are added to give the word sense similarity. If the word sense similarity of the word is smaller than the similarity threshold, the word is a new word, and the new word data set is thus obtained.
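A sketch of the word sense similarity computation in step S310 under the illustrative 60%/40% weights mentioned above; `synonym_similarity` and `antonym_similarity` (the synonymy and antisense relations in the text) are assumed helpers, e.g. lexicon lookups, that the patent does not specify.

```python
SYN_WEIGHT, ANT_WEIGHT = 0.6, 0.4       # preset synonymy / antisense relation weights, summing to 1

def word_sense_similarity(word, labeled_words, synonym_similarity, antonym_similarity):
    """Combine the maximum synonymy and antisense similarities against the data set to be labeled."""
    max_syn = max(synonym_similarity(word, w) for w in labeled_words)
    max_ant = max(antonym_similarity(word, w) for w in labeled_words)
    return SYN_WEIGHT * max_syn + ANT_WEIGHT * max_ant

def extract_new_words(processed_words, labeled_words, synonym_similarity, antonym_similarity,
                      threshold=0.5):
    """Words of the processed data set whose word sense similarity falls below the threshold are new words."""
    return [w for w in processed_words
            if word_sense_similarity(w, labeled_words, synonym_similarity, antonym_similarity) < threshold]
```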
In step S130 of this embodiment, labeling the data set to be labeled and the new word data set to obtain a labeled data set specifically includes: if the data volume of the data set to be labeled is larger than a preset data volume threshold, dividing the data set to be labeled into a sub data set to be labeled and a sub data set to be trained; determining a difference data set according to the word sense similarity between the sub data set to be labeled and the sub data set to be trained; performing semantic annotation on the sub data set to be labeled and the difference data set to obtain a semantic annotation set; pre-training the sub data set to be trained by an information extraction method based on the semantic annotation set to obtain a pre-training sub data set; and labeling the new word data set, and combining the pre-training data set and the labeled new word data set to obtain the labeled data set. Specifically, the data set to be labeled is divided and processed to obtain the sub data set to be labeled and the difference data set, and only the sub data set to be labeled and the difference data set are then labeled instead of all words in the data set to be labeled, which increases the data processing speed. The information extraction method can adopt Bootstrapping or similar algorithms to realize simple recognition of the sub data set to be trained.
In step S130 of this embodiment, labeling the data set to be labeled and the new word data set to obtain a labeled data set further includes: if the data volume of the data set to be labeled is smaller than a preset data volume threshold value, performing word segmentation processing on the data set to be labeled and the new word data set; performing part-of-speech tagging on the data set to be tagged and the new word data set after word segmentation processing; and performing semantic annotation on the data set to be annotated and the new word data set after the part of speech annotation to obtain an annotated data set.
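For the small-data-volume branch above, word segmentation and part-of-speech tagging of Chinese text can be sketched with the third-party jieba package, which is only one possible tool and is not mandated by the patent.

```python
import jieba.posseg as pseg

def segment_and_tag(texts):
    """Return, for each text, a list of (word, part-of-speech flag) pairs ready for semantic annotation."""
    return [[(word, flag) for word, flag in pseg.cut(text)] for text in texts]

# Example: segment_and_tag(["患者主诉头痛三天"]) returns one list of (word, POS flag) pairs.
```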
In step S140 of this embodiment, the information extraction method may adopt an algorithm such as Bootstrapping to pre-train the data set to be processed, which facilitates achieving a good recognition effect on large-scale text data.
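The Bootstrapping-style pre-training can be sketched as the following iterative loop; `induce_patterns` and `match_patterns` are assumed helpers standing in for the pattern induction and matching details that the patent leaves open.

```python
def bootstrap_pretrain(seed_entities, unlabeled_texts, induce_patterns, match_patterns, max_rounds=5):
    """seed_entities: {entity_string: label} taken from the labeled data set.
    Iteratively induce extraction patterns and harvest new entities from the data set to be processed."""
    entities = dict(seed_entities)
    for _ in range(max_rounds):
        patterns = induce_patterns(entities, unlabeled_texts)   # learn surface patterns from current entities
        candidates = match_patterns(patterns, unlabeled_texts)  # {entity_string: label} candidates
        new_entities = {e: lab for e, lab in candidates.items() if e not in entities}
        if not new_entities:                                    # stop when no new entity is found
            break
        entities.update(new_entities)
    return entities   # used to annotate the data set to be processed into the pre-training data set
```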
In step S150 of this embodiment, the initial entity recognition model may adopt a CRF model, and the CRF model can model hidden states to learn the relationships between labeling contexts. By combining the Bootstrapping algorithm with a CRF model, the labeling of large-scale medical text data is realized, and entities in the medical field are effectively identified.
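A minimal sketch of training the initial CRF entity recognition model on the pre-training data set, using the third-party sklearn-crfsuite package as one possible implementation (the patent does not name a library); the character-level features and BIO tags are illustrative.

```python
import sklearn_crfsuite

def char_features(sentence, i):
    """Simple character-level features for Chinese medical text (illustrative)."""
    return {
        "char": sentence[i],
        "prev_char": sentence[i - 1] if i > 0 else "<BOS>",
        "next_char": sentence[i + 1] if i < len(sentence) - 1 else "<EOS>",
        "is_digit": sentence[i].isdigit(),
    }

def train_crf(sentences, tag_sequences):
    """sentences: raw strings from the pre-training data set;
    tag_sequences: aligned BIO label sequences, e.g. ["B-DISEASE", "I-DISEASE", "O", ...]."""
    X = [[char_features(s, i) for i in range(len(s))] for s in sentences]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
    crf.fit(X, tag_sequences)
    return crf   # the target entity recognition model
```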
Second embodiment
Based on the same inventive concept as the method in the first embodiment, this embodiment correspondingly provides a system for establishing an entity recognition model.
Fig. 4 is a schematic flow chart of the system for establishing an entity recognition model according to the present invention.
As shown in fig. 4, the system 4 comprises: an entity data set acquisition module 41, a processing data set acquisition module 42, a labeling data set obtaining module 43, a pre-training data set acquisition module 44, and a target model establishing module 45.
The entity data set acquisition module is used for acquiring text data of a target field to obtain an entity data set, and dividing the entity data set into a data set to be labeled and a data set to be processed;
the processing data set acquisition module is used for performing word segmentation processing and part-of-speech tagging on the data set to be processed, and performing synonym replacement on the data set to be processed according to the part-of-speech and the word co-occurrence degree to obtain a processing data set;
a labeling data set obtaining module, configured to determine a new word data set in the processing data set according to the word meaning similarity between the data set to be labeled and the processing data set, label the data set to be labeled and the new word data set, and obtain a labeling data set;
a pre-training data set acquisition module, configured to pre-train the to-be-processed data set by using an information extraction method based on the labeled data set to obtain a pre-training data set;
and the target model establishing module is used for training an initial entity recognition model by adopting the pre-training data set and outputting a target entity recognition model.
In some exemplary embodiments, the process data set acquisition module includes:
a to-be-classified data set acquisition unit, used for removing stop words and repeated words from the data set to be processed after word segmentation processing and part-of-speech tagging to obtain a data set to be classified;
a classified data set acquisition unit, used for clustering the data sets to be classified according to the semantic similarity to obtain a plurality of classified data sets;
and the processing data set acquisition unit is used for carrying out synonym replacement on the classification data set according to the part of speech and the word co-occurrence degree to obtain a classification processing set, and combining the classification processing set to generate a processing data set.
In some exemplary embodiments, the process data set acquisition unit includes:
the position parameter acquiring subunit is used for determining position evaluation parameters of the words according to the positions of the words in the classification data set and preset position weights;
the word co-occurrence degree acquisition subunit is used for acquiring words of the same part of speech in the classified data set to obtain a part of speech data set, acquiring words with the same position evaluation parameters in the part of speech data set, and determining word co-occurrence degree according to the context semantic similarity between the words with the same position evaluation parameters;
and the classification processing set acquisition subunit is used for carrying out synonym replacement on the part-of-speech data set according to the word co-occurrence degree to obtain a classification processing set.
In some exemplary embodiments, the annotation data set acquisition module comprises:
the word sense similarity obtaining unit is used for determining the word sense similarity of the data set to be labeled and the processing data set according to the synonymy relation and the antisense relation;
and the new word data set acquisition unit is used for obtaining a new word data set if the word sense similarity between the data set to be labeled and the processed data set is smaller than a similarity threshold.
In some exemplary embodiments, the word sense similarity obtaining unit includes:
the synonymy evaluation parameter obtaining subunit is used for determining a synonymy evaluation parameter according to the synonymy relationship between the data set to be labeled and the processing data set and a preset synonymy relationship weight;
the antisense evaluation parameter acquisition subunit is used for determining antisense evaluation parameters according to the antisense relation between the data set to be labeled and the processing data set and preset antisense relation weight;
and the word sense similarity obtaining subunit is used for determining the word sense similarity of the data set to be labeled and the processed data set according to the synonymy evaluation parameter and the antisense evaluation parameter.
In some exemplary embodiments, the annotation data set acquisition module further comprises:
a data set dividing unit, configured to divide the data set to be labeled into a sub data set to be labeled and a sub data set to be trained if the data volume of the data set to be labeled is larger than a preset data volume threshold;
a difference data set obtaining unit, configured to determine a difference data set according to the word sense similarity of the to-be-labeled sub data set and the to-be-trained sub data set;
a semantic annotation set acquisition unit, configured to perform semantic annotation on the to-be-annotated sub-data set and the difference data set to obtain a semantic annotation set;
a pre-training sub data set obtaining unit, configured to pre-train the to-be-trained sub data set by using an information extraction method based on the semantic annotation set to obtain a pre-training sub data set;
and the labeling data set acquisition first unit is used for labeling the new word data set, and combining the pre-training data set and the labeled new word data set to obtain a labeling data set.
In some exemplary embodiments, the annotation data set acquisition module further comprises:
the word segmentation processing unit is used for performing word segmentation processing on the data set to be labeled and the new word data set if the data volume of the data set to be labeled is smaller than a preset data volume threshold;
a part-of-speech tagging unit, configured to perform part-of-speech tagging on the data set to be tagged and the new word data set after the word segmentation processing;
and the tagged data set acquisition second unit is used for performing semantic tagging on the part-of-speech tagged data set to be tagged and the new word data set to obtain a tagged data set.
The present embodiment also provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements any of the methods in the present embodiments.
The present embodiment also provides an electronic device, including: a processor and a memory;
the memory is used for storing computer programs, and the processor is used for executing the computer programs stored by the memory so as to enable the electronic equipment to execute the method in the embodiment.
The computer-readable storage medium in the present embodiment can be understood by those skilled in the art as follows: all or part of the steps for implementing the above method embodiments may be performed by hardware associated with a computer program. The aforementioned computer program may be stored in a computer readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
The electronic device provided by the embodiment comprises a processor, a memory, a transceiver and a communication interface, wherein the memory and the communication interface are connected with the processor and the transceiver and are used for realizing mutual communication, the memory is used for storing a computer program, the communication interface is used for carrying out communication, and the processor and the transceiver are used for operating the computer program to enable the electronic device to execute the steps of the method.
In this embodiment, the memory may include a Random Access Memory (RAM), and may also include a non-volatile memory, such as at least one disk memory.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In the above-described embodiments, reference in the specification to "the embodiment," "an embodiment," "another embodiment," or "other embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least some embodiments, but not necessarily all embodiments. The various appearances of the phrase "the present embodiment," "one embodiment," or "another embodiment" are not necessarily all referring to the same embodiment.
In the embodiments described above, although the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory structures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments. The embodiments of the invention are intended to embrace all such alternatives, modifications, and variations that fall within the broad scope of the appended claims.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The invention is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The foregoing embodiments are merely illustrative of the principles and effects of the present invention and are not to be construed as limiting the invention. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those skilled in the art without departing from the spirit and technical ideas disclosed by the present invention shall be covered by the claims of the present invention.

Claims (10)

1. A method for building an entity recognition model, the method comprising:
acquiring text data of a target field to obtain an entity data set, and dividing the entity data set into a data set to be labeled and a data set to be processed;
performing word segmentation processing and part-of-speech tagging on the data set to be processed, and performing synonym replacement on the data set to be processed according to the part-of-speech and the word co-occurrence degree to obtain a processed data set;
determining a new word data set in the processing data set according to the word meaning similarity of the data set to be labeled and the processing data set, and labeling the data set to be labeled and the new word data set to obtain a labeled data set;
pre-training the data set to be processed by adopting an information extraction method based on the labeled data set to obtain a pre-training data set;
and training an initial entity recognition model by adopting the pre-training data set, and outputting a target entity recognition model.
2. The method for establishing an entity recognition model according to claim 1, wherein the synonym replacement is performed on the data set to be processed according to the part of speech and the word co-occurrence degree to obtain a processed data set, and specifically comprises:
removing stop words from the data set to be processed after word segmentation processing and part-of-speech tagging to obtain a data set to be classified;
clustering the data sets to be classified according to the semantic similarity to obtain a plurality of classified data sets;
and carrying out synonym replacement on the classification data set according to the part of speech and the word co-occurrence degree to obtain a classification processing set, and combining the classification processing set to generate a processing data set.
3. The method for establishing an entity recognition model according to claim 2, wherein the classifying data set is subjected to synonym replacement according to part of speech and word co-occurrence degree to obtain a classification processing set, and specifically comprises:
determining position evaluation parameters of the words according to the positions of the words in the classification data set and preset position weights;
obtaining words of the same part of speech in the classified data set to obtain a part of speech data set, obtaining words of the same position evaluation parameter in the part of speech data set, and determining word co-occurrence according to context semantic similarity between the words of the same position evaluation parameter;
and carrying out synonym replacement on the part-of-speech data set according to the word co-occurrence degree to obtain a classification processing set.
4. The method for building an entity recognition model according to claim 1, wherein the determining a new word data set in the processed data set according to the word sense similarity between the data set to be labeled and the processed data set specifically comprises:
determining the word sense similarity of the data set to be labeled and the processing data set according to the synonymy relation and the antisense relation;
and if the word meaning similarity of the data set to be labeled and the processing data set is smaller than a similarity threshold value, obtaining a new word data set.
5. The method for building an entity recognition model according to claim 4, wherein the determining the word sense similarity of the data set to be labeled and the processed data set according to the synonymy relationship and the antisense relationship specifically comprises:
determining a synonymy evaluation parameter according to the synonymy relation between the data set to be labeled and the processing data set and a preset synonymy relation weight;
determining antisense evaluation parameters according to the antisense relation between the data set to be labeled and the processing data set and preset antisense relation weight;
and determining the word sense similarity of the data set to be labeled and the processed data set according to the synonymy evaluation parameter and the antisense evaluation parameter.
6. The method for establishing an entity recognition model according to claim 1, wherein the labeling of the data set to be labeled and the new word data set to obtain a labeled data set specifically comprises:
if the data volume of the data set to be labeled is larger than a preset data volume threshold value, dividing the data set to be labeled into a sub data set to be labeled and a sub data set to be trained;
determining a difference data set according to the word meaning similarity of the sub data set to be labeled and the sub data set to be trained;
performing semantic annotation on the sub data set to be annotated and the difference data set to obtain a semantic annotation set;
pre-training the sub data set to be trained by adopting an information extraction method based on the semantic annotation set to obtain a pre-training sub data set;
and marking the new word data set, and combining the pre-training data set and the marked new word data set to obtain a marked data set.
7. The method for building an entity recognition model according to claim 1, wherein the labeling of the data set to be labeled and the new word data set to obtain a labeled data set further comprises:
if the data volume of the data set to be labeled is smaller than a preset data volume threshold value, performing word segmentation processing on the data set to be labeled and the new word data set;
performing part-of-speech tagging on the data set to be tagged and the new word data set after word segmentation processing;
and performing semantic annotation on the data set to be annotated and the new word data set after the part of speech annotation to obtain an annotated data set.
8. A system for building an entity recognition model, the system comprising:
the entity data set acquisition module is used for acquiring text data of a target field to obtain an entity data set and dividing the entity data set into a data set to be labeled and a data set to be processed;
the processing data set acquisition module is used for performing word segmentation processing and part-of-speech tagging on the data set to be processed, and performing synonym replacement on the data set to be processed according to the part-of-speech and the word co-occurrence degree to obtain a processing data set;
a labeling data set obtaining module, configured to determine a new word data set in the processing data set according to the word meaning similarity between the data set to be labeled and the processing data set, label the data set to be labeled and the new word data set, and obtain a labeling data set;
a pre-training data set acquisition module, configured to pre-train the to-be-processed data set by using an information extraction method based on the labeled data set to obtain a pre-training data set;
and the target model establishing module is used for training an initial entity recognition model by adopting the pre-training data set and outputting a target entity recognition model.
9. An electronic device comprising a processor, a memory, and a communication bus;
the communication bus is used for connecting the processor and the memory;
the processor is configured to execute a computer program stored in the memory to implement the method of any one of claims 1-7.
10. A computer-readable storage medium, having stored thereon a computer program for causing a computer to perform the method of any one of claims 1-7.
CN202110669805.5A 2021-06-17 2021-06-17 Method and system for establishing entity recognition model, electronic equipment and medium Active CN113128234B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110669805.5A CN113128234B (en) 2021-06-17 2021-06-17 Method and system for establishing entity recognition model, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110669805.5A CN113128234B (en) 2021-06-17 2021-06-17 Method and system for establishing entity recognition model, electronic equipment and medium

Publications (2)

Publication Number Publication Date
CN113128234A true CN113128234A (en) 2021-07-16
CN113128234B CN113128234B (en) 2021-11-02

Family

ID=76783031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110669805.5A Active CN113128234B (en) 2021-06-17 2021-06-17 Method and system for establishing entity recognition model, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN113128234B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114444498A (en) * 2021-12-20 2022-05-06 奇安信科技集团股份有限公司 Text duplicate checking method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156286A (en) * 2016-06-24 2016-11-23 广东工业大学 Type extraction system and method towards technical literature knowledge entity
CN106997382A (en) * 2017-03-22 2017-08-01 山东大学 Innovation intention label automatic marking method and system based on big data
CN108287822A (en) * 2018-01-23 2018-07-17 北京容联易通信息技术有限公司 A kind of Chinese Similar Problems generation System and method for
CN109522415A (en) * 2018-10-17 2019-03-26 厦门快商通信息技术有限公司 A kind of corpus labeling method and device
US20200250376A1 (en) * 2019-12-13 2020-08-06 Beijing Xiaomi Intelligent Technology Co., Ltd. Keyword extraction method, keyword extraction device and computer-readable storage medium
CN111738004A (en) * 2020-06-16 2020-10-02 中国科学院计算技术研究所 Training method of named entity recognition model and named entity recognition method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156286A (en) * 2016-06-24 2016-11-23 广东工业大学 Type extraction system and method towards technical literature knowledge entity
CN106997382A (en) * 2017-03-22 2017-08-01 山东大学 Innovation intention label automatic marking method and system based on big data
CN108287822A (en) * 2018-01-23 2018-07-17 北京容联易通信息技术有限公司 A kind of Chinese Similar Problems generation System and method for
CN109522415A (en) * 2018-10-17 2019-03-26 厦门快商通信息技术有限公司 A kind of corpus labeling method and device
US20200250376A1 (en) * 2019-12-13 2020-08-06 Beijing Xiaomi Intelligent Technology Co., Ltd. Keyword extraction method, keyword extraction device and computer-readable storage medium
CN111738004A (en) * 2020-06-16 2020-10-02 中国科学院计算技术研究所 Training method of named entity recognition model and named entity recognition method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
冯建周, 等: "基于迁移学习的细粒度实体分类方法的研究", 《自动化学报》 [Feng Jianzhou, et al., "Research on Fine-Grained Entity Classification Method Based on Transfer Learning", Acta Automatica Sinica] *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114444498A (en) * 2021-12-20 2022-05-06 奇安信科技集团股份有限公司 Text duplicate checking method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113128234B (en) 2021-11-02

Similar Documents

Publication Publication Date Title
JP6634515B2 (en) Question clustering processing method and apparatus in automatic question answering system
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN113127605B (en) Method and system for establishing target recognition model, electronic equipment and medium
US8452772B1 (en) Methods, systems, and articles of manufacture for addressing popular topics in a socials sphere
CN111709243A (en) Knowledge extraction method and device based on deep learning
CN105701084A (en) Characteristic extraction method of text classification on the basis of mutual information
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN112329460B (en) Text topic clustering method, device, equipment and storage medium
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN113836938B (en) Text similarity calculation method and device, storage medium and electronic device
WO2023108993A1 (en) Product recommendation method, apparatus and device based on deep clustering algorithm, and medium
CN111241230A (en) Method and system for identifying string mark risk based on text mining
CN112328655B (en) Text label mining method, device, equipment and storage medium
CN112579729B (en) Training method and device for document quality evaluation model, electronic equipment and medium
US11886515B2 (en) Hierarchical clustering on graphs for taxonomy extraction and applications thereof
CN110728135B (en) Text theme indexing method and device, electronic equipment and computer storage medium
CN113128234B (en) Method and system for establishing entity recognition model, electronic equipment and medium
CN114328800A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN108846142A (en) A kind of Text Clustering Method, device, equipment and readable storage medium storing program for executing
CN113743079A (en) Text similarity calculation method and device based on co-occurrence entity interaction graph
CN112818693A (en) Automatic extraction method and system for electronic component model words
CN108288172A (en) Advertisement DSP orientations launch the method and terminal of advertisement
CN116881425A (en) Universal document question-answering implementation method, system, device and storage medium
CN114491076B (en) Data enhancement method, device, equipment and medium based on domain knowledge graph
CN110717029A (en) Information processing method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220714

Address after: 201615 room 1904, G60 Kechuang building, No. 650, Xinzhuan Road, Songjiang District, Shanghai

Patentee after: Shanghai Mingping Medical Data Technology Co.,Ltd.

Address before: 102400 no.86-n3557, Wanxing Road, Changyang, Fangshan District, Beijing

Patentee before: Mingpinyun (Beijing) data Technology Co.,Ltd.

TR01 Transfer of patent right