CN113128234A - Method and system for establishing entity recognition model, electronic equipment and medium - Google Patents

Method and system for establishing entity recognition model, electronic equipment and medium

Info

Publication number
CN113128234A
CN113128234A
Authority
CN
China
Prior art keywords
data set
labeled
word
processed
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110669805.5A
Other languages
Chinese (zh)
Other versions
CN113128234B (en)
Inventor
姚娟娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Mingping Medical Data Technology Co ltd
Original Assignee
Mingpinyun Beijing Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mingpinyun Beijing Data Technology Co Ltd filed Critical Mingpinyun Beijing Data Technology Co Ltd
Priority to CN202110669805.5A priority Critical patent/CN113128234B/en
Publication of CN113128234A publication Critical patent/CN113128234A/en
Application granted granted Critical
Publication of CN113128234B publication Critical patent/CN113128234B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/247Thesauruses; Synonyms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of data processing and provides a method, a system, electronic equipment and a medium for establishing an entity recognition model. The method comprises: acquiring text data of a target field to obtain an entity data set, and dividing the entity data set into a data set to be labeled and a data set to be processed; performing synonym replacement on the data set to be processed to obtain a processed data set; determining a new word data set in the processed data set according to the word sense similarity between the data set to be labeled and the processed data set, and labeling the data set to be labeled and the new word data set to obtain a labeled data set; pre-training the data set to be processed by an information extraction method based on the labeled data set to obtain a pre-training data set; and training an initial entity recognition model with the pre-training data set and outputting a target entity recognition model, thereby solving prior-art problems such as the small scale of high-quality labeled corpora.

Description

Method and system for establishing entity recognition model, electronic equipment and medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method and a system for establishing an entity recognition model, an electronic device, and a medium.
Background
Entity recognition is one of the core fundamental tasks in natural language processing: the extraction of entities of specific types from text. It has important scientific significance and wide application value in downstream natural language processing tasks such as information retrieval, question-answering systems, information extraction, and text mining. Judging from existing research results, named entity recognition in some specialized fields suffers from small high-quality labeled corpora, slow recognition speed, and low recognition accuracy, and its performance lags behind named entity recognition in traditional, general domains.
Disclosure of Invention
The invention provides a method, a system, electronic equipment and a medium for establishing an entity recognition model, which aim to solve prior-art problems such as the small scale of high-quality labeled corpora and low recognition speed.
The method for establishing the entity recognition model comprises the following steps: acquiring text data of a target field to obtain an entity data set, and dividing the entity data set into a data set to be labeled and a data set to be processed;
performing word segmentation processing and part-of-speech tagging on the data set to be processed, and performing synonym replacement on the data set to be processed according to the part-of-speech and the word co-occurrence degree to obtain a processed data set;
determining a new word data set in the processing data set according to the word meaning similarity of the data set to be labeled and the processing data set, and labeling the data set to be labeled and the new word data set to obtain a labeled data set;
pre-training the data set to be processed by adopting an information extraction method based on the labeled data set to obtain a pre-training data set;
and training an initial entity recognition model by adopting the pre-training data set, and outputting a target entity recognition model.
Optionally, the synonym replacement is performed on the data set to be processed according to the part of speech and the word co-occurrence degree to obtain a processed data set, and the method specifically includes:
removing stop words from the data set to be processed after word segmentation processing and part-of-speech tagging to obtain a data set to be classified;
clustering the data sets to be classified according to the semantic similarity to obtain a plurality of classified data sets;
and carrying out synonym replacement on the classification data set according to the part of speech and the word co-occurrence degree to obtain a classification processing set, and combining the classification processing set to generate a processing data set.
Optionally, the synonym replacement is performed on the classification data set according to the part of speech and the word co-occurrence degree to obtain a classification processing set, which specifically includes:
determining position evaluation parameters of the words according to the positions of the words in the classification data set and preset position weights;
obtaining words of the same part of speech in the classified data set to obtain a part of speech data set, obtaining words of the same position evaluation parameter in the part of speech data set, and determining word co-occurrence according to context semantic similarity between the words of the same position evaluation parameter;
and carrying out synonym replacement on the part-of-speech data set according to the word co-occurrence degree to obtain a classification processing set.
Optionally, the determining a new word data set in the processing data set according to the word sense similarity between the data set to be labeled and the processing data set specifically includes:
determining the word sense similarity of the data set to be labeled and the processing data set according to the synonymy relation and the antisense relation;
and if the word meaning similarity of the data set to be labeled and the processing data set is smaller than a similarity threshold value, obtaining a new word data set.
Optionally, the determining the word sense similarity of the data set to be labeled and the processed data set according to the synonymy relationship and the antisense relationship specifically includes:
determining a synonymy evaluation parameter according to the synonymy relation between the data set to be labeled and the processing data set and a preset synonymy relation weight;
determining antisense evaluation parameters according to the antisense relation between the data set to be labeled and the processing data set and preset antisense relation weight;
and determining the word sense similarity of the data set to be labeled and the processed data set according to the synonymy evaluation parameter and the antisense evaluation parameter.
Optionally, the labeling the data set to be labeled and the new word data set to obtain a labeled data set specifically includes:
if the data volume of the data set to be labeled is larger than a preset data volume threshold value, dividing the data set to be labeled into a sub data set to be labeled and a sub data set to be trained;
determining a difference data set according to the word meaning similarity of the sub data set to be labeled and the sub data set to be trained;
performing semantic annotation on the sub data set to be annotated and the difference data set to obtain a semantic annotation set;
pre-training the sub data set to be trained by adopting an information extraction method based on the semantic annotation set to obtain a pre-training sub data set;
and marking the new word data set, and combining the pre-training data set and the marked new word data set to obtain a marked data set.
Optionally, the labeling the data set to be labeled and the new word data set to obtain a labeled data set, further includes:
if the data volume of the data set to be labeled is smaller than a preset data volume threshold value, performing word segmentation processing on the data set to be labeled and the new word data set;
performing part-of-speech tagging on the data set to be tagged and the new word data set after word segmentation processing;
and performing semantic annotation on the data set to be annotated and the new word data set after the part of speech annotation to obtain an annotated data set.
The invention also provides a system for establishing the entity recognition model, which comprises the following steps:
the entity data set acquisition module is used for acquiring text data of a target field to obtain an entity data set and dividing the entity data set into a data set to be labeled and a data set to be processed;
the processing data set acquisition module is used for performing word segmentation processing and part-of-speech tagging on the data set to be processed, and performing synonym replacement on the data set to be processed according to the part-of-speech and the word co-occurrence degree to obtain a processing data set;
a labeling data set obtaining module, configured to determine a new word data set in the processing data set according to the word meaning similarity between the data set to be labeled and the processing data set, label the data set to be labeled and the new word data set, and obtain a labeling data set;
a pre-training data set acquisition module, configured to pre-train the to-be-processed data set by using an information extraction method based on the labeled data set to obtain a pre-training data set;
and the target model establishing module is used for training an initial entity recognition model by adopting the pre-training data set and outputting a target entity recognition model.
The present invention also provides an electronic device comprising: a processor and a memory;
the memory is used for storing computer programs, and the processor is used for executing the computer programs stored by the memory so as to enable the electronic equipment to execute the entity identification model building method.
The present invention also provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the method for building an entity identification model as described above.
The invention has the beneficial effects that: the method for establishing the entity recognition model first obtains an entity data set by acquiring text data of a target field and divides the entity data set into a data set to be labeled and a data set to be processed; secondly, synonym replacement is performed on the data set to be processed, a new word data set is determined according to word sense similarity, and the data set to be labeled and the new word data set are labeled to obtain a labeled data set; then the data set to be processed is pre-trained by an information extraction method to obtain a pre-training data set; and finally, an initial entity recognition model is trained with the pre-training data set and a target entity recognition model is output. Performing synonym replacement on the data set to be processed improves the efficiency of determining the new word data set according to word sense similarity, and thus improves the recognition speed; dividing the entity data set into a data set to be labeled and a data set to be processed and adopting a partial labeling method improves the data processing speed and the recognition speed; pre-training the data set to be processed by an information extraction method improves data quality and therefore recognition accuracy; and pre-training the entity data set and then training the initial entity recognition model with the pre-training data set to output the target entity recognition model achieves large-scale, high-quality labeling of text data in the target field.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed for describing the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart illustrating a method for building an entity recognition model according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a method for acquiring a processed data set according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a method for acquiring a new word data set according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a system for building an entity recognition model according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
First embodiment
Fig. 1 is a flowchart illustrating a method for building an entity recognition model according to an embodiment of the present invention.
As shown in fig. 1, the method for building the entity recognition model includes steps S110 to S150:
s110, acquiring text data of a target field to obtain an entity data set, and dividing the entity data set into a data set to be labeled and a data set to be processed;
s120, performing word segmentation processing and part-of-speech tagging on the data set to be processed, and performing synonym replacement on the data set to be processed according to the part-of-speech and the word co-occurrence degree to obtain a processed data set;
s130, determining a new word data set in the processing data set according to the word meaning similarity of the data set to be labeled and the processing data set, and labeling the data set to be labeled and the new word data set to obtain a labeled data set;
s140, pre-training the data set to be processed by adopting an information extraction method based on the labeled data set to obtain a pre-training data set;
s150, training an initial entity recognition model by adopting the pre-training data set, and outputting a target entity recognition model.
In step S110 of this embodiment, taking text data processing in the medical field as an example, the text data of the target field is derived from publicly available medical text data at home and abroad, such as medical journals, inquiry records, doctors' advice, electronic medical records, and the like; various paper disease diagnosis records can also be converted into medical-field text data by scanning or other means. The entity data set can be divided into a data set to be labeled and a data set to be processed in a suitable proportion according to the condition of the acquired entity data set, and the entity data set needs to be preprocessed before division, the preprocessing including data cleaning, special punctuation mark processing, and the like. Data cleaning mainly re-examines and checks the data, deletes repeated data, and corrects erroneous data to ensure data consistency. Common data cleaning methods include mathematical statistics, regression statistics, etc., and may be selected according to actual application requirements, which is not limited herein.
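The following Python sketch illustrates one possible implementation of the preprocessing and division in step S110; the cleaning rules, the 20% labeling ratio, and the helper names are illustrative assumptions and are not limited herein.

```python
import random
import re

def clean_entity_data(records):
    """Data cleaning: collapse whitespace, drop decorative punctuation, and remove exact duplicates."""
    seen, cleaned = set(), []
    for text in records:
        text = re.sub(r"\s+", " ", text).strip()      # normalize whitespace
        text = re.sub(r"[【】■◆*]", "", text)          # illustrative special-punctuation handling
        if text and text not in seen:                  # delete repeated data
            seen.add(text)
            cleaned.append(text)
    return cleaned

def divide_entity_dataset(records, to_label_ratio=0.2, seed=42):
    """Divide the cleaned entity data set into a data set to be labeled and a data set to be processed."""
    records = list(records)
    random.Random(seed).shuffle(records)
    k = int(len(records) * to_label_ratio)
    return records[:k], records[k:]   # (data set to be labeled, data set to be processed)
```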
In step S120 of this embodiment, word segmentation processing and part-of-speech tagging are performed on the data set to be processed, and then synonym replacement is performed on the data set to be processed according to the part-of-speech and the word co-occurrence degree, so as to obtain a processed data set. Please refer to fig. 2 for a specific implementation method of performing synonym replacement on the to-be-processed data set according to the part of speech and the word co-occurrence degree to obtain the processed data set, where fig. 2 is a schematic flow diagram of the method for acquiring the processed data set according to an embodiment of the present invention.
As shown in fig. 2, performing synonym replacement on the to-be-processed data set according to the part of speech and the word co-occurrence degree to obtain a processed data set may include the following steps S210 to S230:
s210, removing stop words from the data set to be processed after word segmentation processing and part-of-speech tagging to obtain a data set to be classified;
s220, clustering the data sets to be classified according to the semantic similarity to obtain a plurality of classified data sets;
and S230, carrying out synonym replacement on the classification data set according to the part of speech and the word co-occurrence degree to obtain a classification processing set, and combining the classification processing set to generate a processing data set.
In step S220 of this embodiment, before classifying the data sets to be classified, an unsupervised machine learning method may be adopted to perform simple semantic recognition on the data sets to be classified, the semantic similarity of words in the data sets to be classified is obtained according to the recognition results of the unsupervised machine learning method, and the data sets to be classified are clustered according to the semantic similarity to obtain a plurality of classified data sets. Clustering algorithms include, but are not limited to, the K-means clustering algorithm. The data sets to be classified are classified with the K-means clustering algorithm to obtain a plurality of classified data sets. Specifically, the K value is set according to practical experience; after the data sets to be classified are clustered, the words in each resulting classified data set are semantically similar, which facilitates the subsequent synonym replacement in the data set to be processed, increases the data processing speed, and thereby increases the recognition speed of the entity recognition model.
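As a non-limiting sketch of the clustering in step S220, the following code applies the K-means algorithm from scikit-learn to word vectors that are assumed to have been produced by the unsupervised machine learning method; the K value and the library choice are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_into_classified_datasets(words, word_vectors, k=8, seed=0):
    """words: list of tokens; word_vectors: dict mapping each word to its embedding (assumed given).
    Returns a list of classified data sets whose members are semantically similar."""
    X = np.vstack([word_vectors[w] for w in words])
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    clusters = {}
    for word, label in zip(words, labels):
        clusters.setdefault(int(label), []).append(word)
    return list(clusters.values())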
In step S230 of this embodiment, performing synonym replacement on the classified data set according to the part of speech and the word co-occurrence degree to obtain a classification processing set specifically includes: determining position evaluation parameters of words according to the positions of the words in the classified data set and preset position weights; obtaining words of the same part of speech in the classified data set to obtain a part-of-speech data set, obtaining words with the same position evaluation parameter in the part-of-speech data set, and determining the word co-occurrence degree according to the context semantic similarity between the words with the same position evaluation parameter; and performing synonym replacement on the part-of-speech data set according to the word co-occurrence degree to obtain a classification processing set. Specifically, the position evaluation parameter of a word may be determined from the positions of the word in the sentences of the classified data set: for example, when a word is at the head of a sentence the preset position weight may be 5, and when it is at any other position the preset position weight may be 3; all positions of the same word are counted, all the preset position weights are summed, and the sum is then divided by the word frequency to obtain the position evaluation parameter of the word. The co-occurrence degree between different words may be obtained from the recognition result of the unsupervised machine learning method in step S220; the co-occurrence degree between words with the same position evaluation parameter may be taken as the context semantic similarity between those words, and if the co-occurrence degree between two words with the same position evaluation parameter is greater than a certain value (which may be 80%, 90%, etc.), one of the words is substituted for the other to complete the synonym replacement. Adopting synonym replacement increases the speed of obtaining the word sense similarity between the data set to be labeled and the data set to be processed, increases the data processing speed, and thereby increases the recognition speed of the entity recognition model.
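The position evaluation parameter and co-occurrence-based replacement described above can be sketched as follows; the head-of-sentence weight of 5, the other-position weight of 3, and the 0.8 threshold follow the illustrative values in this paragraph, and `context_similarity` stands in for whatever contextual similarity the unsupervised model provides.

```python
from collections import defaultdict

HEAD_WEIGHT, OTHER_WEIGHT = 5, 3        # preset position weights (illustrative)

def position_evaluation_parameter(head_flags):
    """head_flags: one boolean per occurrence of the word, True if it starts a sentence.
    Sum the preset position weights over all occurrences and divide by the word frequency."""
    weights = [HEAD_WEIGHT if at_head else OTHER_WEIGHT for at_head in head_flags]
    return sum(weights) / len(weights)

def replace_synonyms(pos_dataset, head_flags, context_similarity, threshold=0.8):
    """pos_dataset: words sharing one part of speech; head_flags: word -> list of head/non-head flags;
    context_similarity(w1, w2): contextual semantic similarity between two words (assumed helper)."""
    groups = defaultdict(list)
    for w in pos_dataset:                                    # group words by position evaluation parameter
        groups[round(position_evaluation_parameter(head_flags[w]), 4)].append(w)

    replacement = {}
    for group in groups.values():
        for i, w1 in enumerate(group):
            for w2 in group[i + 1:]:
                if context_similarity(w1, w2) > threshold:   # word co-occurrence degree above threshold
                    replacement[w2] = replacement.get(w1, w1)
    return replacement   # mapping applied to build the classification processing set
```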
In step S130 of this embodiment, please refer to fig. 3 for a specific implementation method for determining a new word data set in the processing data set according to the word sense similarity between the data set to be labeled and the processing data set, where fig. 3 is a schematic flow chart of a new word data set obtaining method provided in an embodiment of the present invention.
As shown in fig. 3, determining a new word data set in the processing data set according to the word sense similarity between the data set to be labeled and the processing data set may include the following steps S310 to S320:
s310, determining the word sense similarity of the data set to be labeled and the processed data set according to the synonymy relation and the antisense relation;
and S320, if the word meaning similarity of the data set to be labeled and the processed data set is smaller than a similarity threshold, obtaining a new word data set.
In this embodiment, the purpose of obtaining the new word data set is to facilitate the pre-training of the data set to be processed by using an information extraction method subsequently, and avoid the problem that the identification effect is poor due to unreasonable division in the process of dividing the entity data set into the data set to be labeled and the data set to be processed; therefore, the synonymy relation and the antisense relation are adopted to determine the word sense similarity of the data set to be labeled and the processing data set, so that new words in the processing data set are obtained, and a new word data set is obtained.
In step S310 of this embodiment, determining the word sense similarity between the data set to be labeled and the processed data set according to the synonymy relation and the antisense relation specifically includes: determining a synonymy evaluation parameter according to the synonymy relation between the data set to be labeled and the processed data set and a preset synonymy relation weight; determining an antisense evaluation parameter according to the antisense relation between the data set to be labeled and the processed data set and a preset antisense relation weight; and determining the word sense similarity between the data set to be labeled and the processed data set according to the synonymy evaluation parameter and the antisense evaluation parameter. Specifically, the preset synonymy relation weight may be 60%, 70%, etc., the preset antisense relation weight may be 40%, 30%, etc., and the sum of the two weights is 1. The maximum synonymy similarity between a word in the processed data set and the words in the data set to be labeled is obtained and multiplied by the preset synonymy relation weight to obtain the synonymy evaluation parameter of the word; the maximum antisense similarity between the word and the words in the data set to be labeled is obtained and multiplied by the preset antisense relation weight to obtain the antisense evaluation parameter of the word; and the synonymy evaluation parameter and the antisense evaluation parameter are added to give the word sense similarity. If the word sense similarity of the word is smaller than the similarity threshold, the word is a new word, and the new word data set is thus obtained.
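A sketch of the word sense similarity computation in step S310 under the illustrative 60%/40% weights mentioned above; `synonym_similarity` and `antonym_similarity` (the synonymy and antisense relations in the text) are assumed helpers, e.g. lexicon lookups, that the patent does not specify.

```python
SYN_WEIGHT, ANT_WEIGHT = 0.6, 0.4       # preset synonymy / antisense relation weights, summing to 1

def word_sense_similarity(word, labeled_words, synonym_similarity, antonym_similarity):
    """Combine the maximum synonymy and antisense similarities against the data set to be labeled."""
    max_syn = max(synonym_similarity(word, w) for w in labeled_words)
    max_ant = max(antonym_similarity(word, w) for w in labeled_words)
    return SYN_WEIGHT * max_syn + ANT_WEIGHT * max_ant

def extract_new_words(processed_words, labeled_words, synonym_similarity, antonym_similarity,
                      threshold=0.5):
    """Words of the processed data set whose word sense similarity falls below the threshold are new words."""
    return [w for w in processed_words
            if word_sense_similarity(w, labeled_words, synonym_similarity, antonym_similarity) < threshold]
```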
In step S130 of this embodiment, labeling the data set to be labeled and the new word data set to obtain a labeled data set specifically includes: if the data volume of the data set to be labeled is larger than a preset data volume threshold, dividing the data set to be labeled into a sub data set to be labeled and a sub data set to be trained; determining a difference data set according to the word sense similarity between the sub data set to be labeled and the sub data set to be trained; performing semantic annotation on the sub data set to be labeled and the difference data set to obtain a semantic annotation set; pre-training the sub data set to be trained by an information extraction method based on the semantic annotation set to obtain a pre-training sub data set; and labeling the new word data set, and combining the pre-training data set and the labeled new word data set to obtain the labeled data set. Specifically, the data set to be labeled is divided and processed to obtain the sub data set to be labeled and the difference data set, and only the sub data set to be labeled and the difference data set are then labeled instead of all words in the data set to be labeled, which increases the data processing speed. The information extraction method can adopt Bootstrapping or similar algorithms to realize simple recognition of the sub data set to be trained.
In step S130 of this embodiment, labeling the data set to be labeled and the new word data set to obtain a labeled data set further includes: if the data volume of the data set to be labeled is smaller than a preset data volume threshold value, performing word segmentation processing on the data set to be labeled and the new word data set; performing part-of-speech tagging on the data set to be tagged and the new word data set after word segmentation processing; and performing semantic annotation on the data set to be annotated and the new word data set after the part of speech annotation to obtain an annotated data set.
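For the small-data-volume branch above, word segmentation and part-of-speech tagging of Chinese text can be sketched with the third-party jieba package, which is only one possible tool and is not mandated by the patent.

```python
import jieba.posseg as pseg

def segment_and_tag(texts):
    """Return, for each text, a list of (word, part-of-speech flag) pairs ready for semantic annotation."""
    return [[(word, flag) for word, flag in pseg.cut(text)] for text in texts]

# Example: segment_and_tag(["患者主诉头痛三天"]) returns one list of (word, POS flag) pairs.
```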
In step S140 of this embodiment, the information extraction method may adopt an algorithm such as Bootstrapping to pre-train the data set to be processed, which facilitates achieving a good recognition effect on large-scale text data.
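The Bootstrapping-style pre-training can be sketched as the following iterative loop; `induce_patterns` and `match_patterns` are assumed helpers standing in for the pattern induction and matching details that the patent leaves open.

```python
def bootstrap_pretrain(seed_entities, unlabeled_texts, induce_patterns, match_patterns, max_rounds=5):
    """seed_entities: {entity_string: label} taken from the labeled data set.
    Iteratively induce extraction patterns and harvest new entities from the data set to be processed."""
    entities = dict(seed_entities)
    for _ in range(max_rounds):
        patterns = induce_patterns(entities, unlabeled_texts)   # learn surface patterns from current entities
        candidates = match_patterns(patterns, unlabeled_texts)  # {entity_string: label} candidates
        new_entities = {e: lab for e, lab in candidates.items() if e not in entities}
        if not new_entities:                                    # stop when no new entity is found
            break
        entities.update(new_entities)
    return entities   # used to annotate the data set to be processed into the pre-training data set
```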
In step S150 of this embodiment, the initial entity recognition model may adopt a CRF model, and the CRF model can model hidden states to learn the relationships between labeling contexts. By combining the Bootstrapping algorithm with a CRF model, the labeling of large-scale medical text data is realized, and entities in the medical field are effectively identified.
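A minimal sketch of training the initial CRF entity recognition model on the pre-training data set, using the third-party sklearn-crfsuite package as one possible implementation (the patent does not name a library); the character-level features and BIO tags are illustrative.

```python
import sklearn_crfsuite

def char_features(sentence, i):
    """Simple character-level features for Chinese medical text (illustrative)."""
    return {
        "char": sentence[i],
        "prev_char": sentence[i - 1] if i > 0 else "<BOS>",
        "next_char": sentence[i + 1] if i < len(sentence) - 1 else "<EOS>",
        "is_digit": sentence[i].isdigit(),
    }

def train_crf(sentences, tag_sequences):
    """sentences: raw strings from the pre-training data set;
    tag_sequences: aligned BIO label sequences, e.g. ["B-DISEASE", "I-DISEASE", "O", ...]."""
    X = [[char_features(s, i) for i in range(len(s))] for s in sentences]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
    crf.fit(X, tag_sequences)
    return crf   # the target entity recognition model
```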
Second embodiment
Based on the same inventive concept as the method in the first embodiment, this embodiment correspondingly provides a system for establishing an entity recognition model.
Fig. 4 is a schematic flow chart of the system for establishing an entity recognition model according to the present invention.
As shown in fig. 4, the system 4 comprises: an entity data set acquisition module 41, a processing data set acquisition module 42, a labeling data set obtaining module 43, a pre-training data set acquisition module 44, and a target model establishing module 45.
The entity data set acquisition module is used for acquiring text data of a target field to obtain an entity data set, and dividing the entity data set into a data set to be labeled and a data set to be processed;
the processing data set acquisition module is used for performing word segmentation processing and part-of-speech tagging on the data set to be processed, and performing synonym replacement on the data set to be processed according to the part-of-speech and the word co-occurrence degree to obtain a processing data set;
a labeling data set obtaining module, configured to determine a new word data set in the processing data set according to the word meaning similarity between the data set to be labeled and the processing data set, label the data set to be labeled and the new word data set, and obtain a labeling data set;
a pre-training data set acquisition module, configured to pre-train the to-be-processed data set by using an information extraction method based on the labeled data set to obtain a pre-training data set;
and the target model establishing module is used for training an initial entity recognition model by adopting the pre-training data set and outputting a target entity recognition model.
In some exemplary embodiments, the process data set acquisition module includes:
a to-be-classified data set acquisition unit, used for removing stop words and repeated words from the data set to be processed after word segmentation processing and part-of-speech tagging to obtain a data set to be classified;
a classified data set acquisition unit, used for clustering the data sets to be classified according to the semantic similarity to obtain a plurality of classified data sets;
and the processing data set acquisition unit is used for carrying out synonym replacement on the classification data set according to the part of speech and the word co-occurrence degree to obtain a classification processing set, and combining the classification processing set to generate a processing data set.
In some exemplary embodiments, the process data set acquisition unit includes:
the position parameter acquiring subunit is used for determining position evaluation parameters of the words according to the positions of the words in the classification data set and preset position weights;
the word co-occurrence degree acquisition subunit is used for acquiring words of the same part of speech in the classified data set to obtain a part of speech data set, acquiring words with the same position evaluation parameters in the part of speech data set, and determining word co-occurrence degree according to the context semantic similarity between the words with the same position evaluation parameters;
and the classification processing set acquisition subunit is used for carrying out synonym replacement on the part-of-speech data set according to the word co-occurrence degree to obtain a classification processing set.
In some exemplary embodiments, the annotation data set acquisition module comprises:
the word sense similarity obtaining unit is used for determining the word sense similarity of the data set to be labeled and the processing data set according to the synonymy relation and the antisense relation;
and the new word data set acquisition unit is used for obtaining a new word data set if the word sense similarity between the data set to be labeled and the processed data set is smaller than a similarity threshold.
In some exemplary embodiments, the word sense similarity obtaining unit includes:
the synonymy evaluation parameter obtaining subunit is used for determining a synonymy evaluation parameter according to the synonymy relationship between the data set to be labeled and the processing data set and a preset synonymy relationship weight;
the antisense evaluation parameter acquisition subunit is used for determining antisense evaluation parameters according to the antisense relation between the data set to be labeled and the processing data set and preset antisense relation weight;
and the word sense similarity obtaining subunit is used for determining the word sense similarity of the data set to be labeled and the processed data set according to the synonymy evaluation parameter and the antisense evaluation parameter.
In some exemplary embodiments, the annotation data set acquisition module further comprises:
a data set dividing unit, configured to divide the data set to be labeled into a sub data set to be labeled and a sub data set to be trained if the data volume of the data set to be labeled is larger than a preset data volume threshold;
a difference data set obtaining unit, configured to determine a difference data set according to the word sense similarity of the to-be-labeled sub data set and the to-be-trained sub data set;
a semantic annotation set acquisition unit, configured to perform semantic annotation on the to-be-annotated sub-data set and the difference data set to obtain a semantic annotation set;
a pre-training sub data set obtaining unit, configured to pre-train the to-be-trained sub data set by using an information extraction method based on the semantic annotation set to obtain a pre-training sub data set;
and the labeling data set acquisition first unit is used for labeling the new word data set, and combining the pre-training data set and the labeled new word data set to obtain a labeling data set.
In some exemplary embodiments, the annotation data set acquisition module further comprises:
the word segmentation processing unit is used for performing word segmentation processing on the data set to be labeled and the new word data set if the data volume of the data set to be labeled is smaller than a preset data volume threshold;
a part-of-speech tagging unit, configured to perform part-of-speech tagging on the data set to be tagged and the new word data set after the word segmentation processing;
and the tagged data set acquisition second unit is used for performing semantic tagging on the part-of-speech tagged data set to be tagged and the new word data set to obtain a tagged data set.
The present embodiment also provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements any of the methods in the present embodiments.
The present embodiment also provides an electronic device, including: a processor and a memory;
the memory is used for storing computer programs, and the processor is used for executing the computer programs stored by the memory so as to enable the electronic equipment to execute the method in the embodiment.
The computer-readable storage medium in the present embodiment can be understood by those skilled in the art as follows: all or part of the steps for implementing the above method embodiments may be performed by hardware associated with a computer program. The aforementioned computer program may be stored in a computer readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
The electronic device provided by the embodiment comprises a processor, a memory, a transceiver and a communication interface, wherein the memory and the communication interface are connected with the processor and the transceiver and are used for realizing mutual communication, the memory is used for storing a computer program, the communication interface is used for carrying out communication, and the processor and the transceiver are used for operating the computer program to enable the electronic device to execute the steps of the method.
In this embodiment, the memory may include a Random Access Memory (RAM), and may also include a non-volatile memory, such as at least one disk memory.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In the above-described embodiments, reference in the specification to "the embodiment," "an embodiment," "another embodiment," or "other embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least some embodiments, but not necessarily all embodiments. The various appearances of the phrase "the present embodiment," "one embodiment," or "another embodiment" are not necessarily all referring to the same embodiment.
In the embodiments described above, although the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory structures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments. The embodiments of the invention are intended to embrace all such alternatives, modifications, and variations that fall within the broad scope of the appended claims.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The invention is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The foregoing embodiments are merely illustrative of the principles and effects of the present invention and are not to be construed as limiting the invention. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those skilled in the art without departing from the spirit and technical ideas disclosed by the present invention shall be covered by the claims of the present invention.

Claims (10)

1. A method for building an entity recognition model, the method comprising:
acquiring text data of a target field to obtain an entity data set, and dividing the entity data set into a data set to be labeled and a data set to be processed;
performing word segmentation processing and part-of-speech tagging on the data set to be processed, and performing synonym replacement on the data set to be processed according to the part-of-speech and the word co-occurrence degree to obtain a processed data set;
determining a new word data set in the processing data set according to the word meaning similarity of the data set to be labeled and the processing data set, and labeling the data set to be labeled and the new word data set to obtain a labeled data set;
pre-training the data set to be processed by adopting an information extraction method based on the labeled data set to obtain a pre-training data set;
and training an initial entity recognition model by adopting the pre-training data set, and outputting a target entity recognition model.
2. The method for establishing an entity recognition model according to claim 1, wherein the synonym replacement is performed on the data set to be processed according to the part of speech and the word co-occurrence degree to obtain a processed data set, and specifically comprises:
removing stop words from the data set to be processed after word segmentation processing and part-of-speech tagging to obtain a data set to be classified;
clustering the data sets to be classified according to the semantic similarity to obtain a plurality of classified data sets;
and carrying out synonym replacement on the classification data set according to the part of speech and the word co-occurrence degree to obtain a classification processing set, and combining the classification processing set to generate a processing data set.
3. The method for establishing an entity recognition model according to claim 2, wherein the classifying data set is subjected to synonym replacement according to part of speech and word co-occurrence degree to obtain a classification processing set, and specifically comprises:
determining position evaluation parameters of the words according to the positions of the words in the classification data set and preset position weights;
obtaining words of the same part of speech in the classified data set to obtain a part of speech data set, obtaining words of the same position evaluation parameter in the part of speech data set, and determining word co-occurrence according to context semantic similarity between the words of the same position evaluation parameter;
and carrying out synonym replacement on the part-of-speech data set according to the word co-occurrence degree to obtain a classification processing set.
4. The method for building an entity recognition model according to claim 1, wherein the determining a new word data set in the processed data set according to the word sense similarity between the data set to be labeled and the processed data set specifically comprises:
determining the word sense similarity of the data set to be labeled and the processing data set according to the synonymy relation and the antisense relation;
and if the word meaning similarity of the data set to be labeled and the processing data set is smaller than a similarity threshold value, obtaining a new word data set.
5. The method for building an entity recognition model according to claim 4, wherein the determining the word sense similarity of the data set to be labeled and the processed data set according to the synonymy relationship and the antisense relationship specifically comprises:
determining a synonymy evaluation parameter according to the synonymy relation between the data set to be labeled and the processing data set and a preset synonymy relation weight;
determining antisense evaluation parameters according to the antisense relation between the data set to be labeled and the processing data set and preset antisense relation weight;
and determining the word sense similarity of the data set to be labeled and the processed data set according to the synonymy evaluation parameter and the antisense evaluation parameter.
6. The method for establishing an entity recognition model according to claim 1, wherein the labeling of the data set to be labeled and the new word data set to obtain a labeled data set specifically comprises:
if the data volume of the data set to be labeled is larger than a preset data volume threshold value, dividing the data set to be labeled into a sub data set to be labeled and a sub data set to be trained;
determining a difference data set according to the word meaning similarity of the sub data set to be labeled and the sub data set to be trained;
performing semantic annotation on the sub data set to be annotated and the difference data set to obtain a semantic annotation set;
pre-training the sub data set to be trained by adopting an information extraction method based on the semantic annotation set to obtain a pre-training sub data set;
and marking the new word data set, and combining the pre-training data set and the marked new word data set to obtain a marked data set.
7. The method for building an entity recognition model according to claim 1, wherein the labeling of the data set to be labeled and the new word data set to obtain a labeled data set further comprises:
if the data volume of the data set to be labeled is smaller than a preset data volume threshold value, performing word segmentation processing on the data set to be labeled and the new word data set;
performing part-of-speech tagging on the data set to be tagged and the new word data set after word segmentation processing;
and performing semantic annotation on the data set to be annotated and the new word data set after the part of speech annotation to obtain an annotated data set.
8. A system for building an entity recognition model, the system comprising:
the entity data set acquisition module is used for acquiring text data of a target field to obtain an entity data set and dividing the entity data set into a data set to be labeled and a data set to be processed;
the processing data set acquisition module is used for performing word segmentation processing and part-of-speech tagging on the data set to be processed, and performing synonym replacement on the data set to be processed according to the part-of-speech and the word co-occurrence degree to obtain a processing data set;
a labeling data set obtaining module, configured to determine a new word data set in the processing data set according to the word meaning similarity between the data set to be labeled and the processing data set, label the data set to be labeled and the new word data set, and obtain a labeling data set;
a pre-training data set acquisition module, configured to pre-train the to-be-processed data set by using an information extraction method based on the labeled data set to obtain a pre-training data set;
and the target model establishing module is used for training an initial entity recognition model by adopting the pre-training data set and outputting a target entity recognition model.
9. An electronic device comprising a processor, a memory, and a communication bus;
the communication bus is used for connecting the processor and the memory;
the processor is configured to execute a computer program stored in the memory to implement the method of any one of claims 1-7.
10. A computer-readable storage medium, having stored thereon a computer program for causing a computer to perform the method of any one of claims 1-7.
CN202110669805.5A 2021-06-17 2021-06-17 Method and system for establishing entity recognition model, electronic equipment and medium Active CN113128234B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110669805.5A CN113128234B (en) 2021-06-17 2021-06-17 Method and system for establishing entity recognition model, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110669805.5A CN113128234B (en) 2021-06-17 2021-06-17 Method and system for establishing entity recognition model, electronic equipment and medium

Publications (2)

Publication Number Publication Date
CN113128234A true CN113128234A (en) 2021-07-16
CN113128234B CN113128234B (en) 2021-11-02

Family

ID=76783031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110669805.5A Active CN113128234B (en) 2021-06-17 2021-06-17 Method and system for establishing entity recognition model, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN113128234B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114444498A (en) * 2021-12-20 2022-05-06 奇安信科技集团股份有限公司 Text duplicate checking method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156286A (en) * 2016-06-24 2016-11-23 广东工业大学 Type extraction system and method towards technical literature knowledge entity
CN106997382A (en) * 2017-03-22 2017-08-01 山东大学 Innovation intention label automatic marking method and system based on big data
CN108287822A (en) * 2018-01-23 2018-07-17 北京容联易通信息技术有限公司 A kind of Chinese Similar Problems generation System and method for
CN109522415A (en) * 2018-10-17 2019-03-26 厦门快商通信息技术有限公司 A kind of corpus labeling method and device
US20200250376A1 (en) * 2019-12-13 2020-08-06 Beijing Xiaomi Intelligent Technology Co., Ltd. Keyword extraction method, keyword extraction device and computer-readable storage medium
CN111738004A (en) * 2020-06-16 2020-10-02 中国科学院计算技术研究所 Training method of named entity recognition model and named entity recognition method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156286A (en) * 2016-06-24 2016-11-23 广东工业大学 Type extraction system and method towards technical literature knowledge entity
CN106997382A (en) * 2017-03-22 2017-08-01 山东大学 Innovation intention label automatic marking method and system based on big data
CN108287822A (en) * 2018-01-23 2018-07-17 北京容联易通信息技术有限公司 A kind of Chinese Similar Problems generation System and method for
CN109522415A (en) * 2018-10-17 2019-03-26 厦门快商通信息技术有限公司 A kind of corpus labeling method and device
US20200250376A1 (en) * 2019-12-13 2020-08-06 Beijing Xiaomi Intelligent Technology Co., Ltd. Keyword extraction method, keyword extraction device and computer-readable storage medium
CN111738004A (en) * 2020-06-16 2020-10-02 中国科学院计算技术研究所 Training method of named entity recognition model and named entity recognition method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
冯建周, 等: "基于迁移学习的细粒度实体分类方法的研究", 《自动化学报》 [Feng Jianzhou, et al., "Research on Fine-Grained Entity Classification Method Based on Transfer Learning", Acta Automatica Sinica] *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114444498A (en) * 2021-12-20 2022-05-06 奇安信科技集团股份有限公司 Text duplicate checking method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113128234B (en) 2021-11-02

Similar Documents

Publication Publication Date Title
JP6634515B2 (en) Question clustering processing method and apparatus in automatic question answering system
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN113127605B (en) Method and system for establishing target recognition model, electronic equipment and medium
US8452772B1 (en) Methods, systems, and articles of manufacture for addressing popular topics in a socials sphere
CN111709243A (en) Knowledge extraction method and device based on deep learning
CN105701084A (en) Characteristic extraction method of text classification on the basis of mutual information
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN112329460B (en) Text topic clustering method, device, equipment and storage medium
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN113836938B (en) Text similarity calculation method and device, storage medium and electronic device
WO2023108993A1 (en) Product recommendation method, apparatus and device based on deep clustering algorithm, and medium
CN111241230A (en) Method and system for identifying string mark risk based on text mining
CN112328655B (en) Text label mining method, device, equipment and storage medium
CN112579729B (en) Training method and device for document quality evaluation model, electronic equipment and medium
US11886515B2 (en) Hierarchical clustering on graphs for taxonomy extraction and applications thereof
CN110728135B (en) Text theme indexing method and device, electronic equipment and computer storage medium
CN113128234B (en) Method and system for establishing entity recognition model, electronic equipment and medium
CN114328800A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN108846142A (en) A kind of Text Clustering Method, device, equipment and readable storage medium storing program for executing
CN113743079A (en) Text similarity calculation method and device based on co-occurrence entity interaction graph
CN112818693A (en) Automatic extraction method and system for electronic component model words
CN108288172A (en) Advertisement DSP orientations launch the method and terminal of advertisement
CN116881425A (en) Universal document question-answering implementation method, system, device and storage medium
CN114491076B (en) Data enhancement method, device, equipment and medium based on domain knowledge graph
CN110717029A (en) Information processing method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220714

Address after: 201615 room 1904, G60 Kechuang building, No. 650, Xinzhuan Road, Songjiang District, Shanghai

Patentee after: Shanghai Mingping Medical Data Technology Co.,Ltd.

Address before: 102400 no.86-n3557, Wanxing Road, Changyang, Fangshan District, Beijing

Patentee before: Mingpinyun (Beijing) data Technology Co.,Ltd.

TR01 Transfer of patent right