CN111368547A - Entity identification method, device, equipment and storage medium based on semantic analysis - Google Patents

Entity identification method, device, equipment and storage medium based on semantic analysis

Info

Publication number
CN111368547A
Authority
CN
China
Prior art keywords
word
index
suffix
words
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010156694.3A
Other languages
Chinese (zh)
Inventor
张灿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202010156694.3A priority Critical patent/CN111368547A/en
Publication of CN111368547A publication Critical patent/CN111368547A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31: Indexing; Data structures therefor; Storage structures
    • G06F16/33: Querying
    • G06F16/335: Filtering based on additional data, e.g. user or group profiles
    • G06F16/35: Clustering; Classification
    • G06F16/355: Class or cluster creation or modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of artificial intelligence, and in particular to an entity identification method, device, equipment and storage medium based on semantic analysis. The method comprises the following steps: acquiring a corpus to be recognized, matching the corpus to be recognized in a suffix dictionary, and recording all suffix word indexes to obtain a suffix word index set; segmenting and extracting the corpus to be recognized between two adjacent suffix word indexes in the suffix word index set to obtain a substring text, matching the substring text in a candidate dictionary, and recording all candidate word indexes; and combining each target candidate word with its adjacent suffix word one by one to obtain a plurality of entity words. The invention defines the starting boundary of entity words using a prefix dictionary, a suffix dictionary and a candidate dictionary. The matching process is efficient, the search is simple, the recognition range is wide and portability is strong, and new words and entity words containing special symbols can be identified to the greatest extent.

Description

Entity identification method, device, equipment and storage medium based on semantic analysis
Technical Field
The invention relates to the technical field of voice semantics in the field of artificial intelligence, in particular to an entity identification method, device, equipment and storage medium based on semantic analysis.
Background
Named Entity Recognition (NER) is a fundamental task in natural language processing. Its purpose is to identify entity words with special meaning in a text corpus, such as place names, occupations or organization names. It is an indispensable technical component of system tasks such as intent recognition, knowledge graph construction, and information extraction and retrieval.
Unlike English, Chinese has no spaces to mark word boundaries and no letter case to aid word segmentation. Its word forms are variable and its range of word formation is complex, all of which increase the difficulty of entity recognition. The NER methods in the prior art generally fall into the following categories:
1. Constructing rule templates in advance and matching the text against the templates to find entities: the rules must be written manually, which requires a large amount of linguistic knowledge and vertical-domain background knowledge, and both extensibility and portability are poor;
2. Performing entity recognition in a dictionary-based manner: entity words that may exist in a text are matched in a dictionary; to guarantee recognition accuracy a large-scale dictionary must be constructed, so search efficiency is low and new words outside the dictionary cannot be recognized;
3. Methods based on statistical machine learning and deep learning: entities in the corpus are identified by a model using features that are constructed manually or learned by a neural network; however, a large amount of training data must be labeled in advance, training takes a long time, the model depends heavily on parameter settings, and model interpretability is poor.
Therefore, whether rule templates, dictionaries, or statistical machine learning and deep learning are used, each approach has obvious defects that restrict the application of NER.
Disclosure of Invention
In view of this, it is necessary to provide an entity identification method, an entity identification device, an entity identification apparatus, and a storage medium based on semantic parsing, for solving the problems of high identification difficulty and poor expandability in the existing named entity identification.
An entity identification method based on semantic parsing comprises the following steps:
obtaining a corpus to be recognized, matching the corpus to be recognized in a suffix dictionary, recording suffix word indexes of all target suffix words in the corpus to be recognized if the target suffix words are matched, and sequencing and combining all suffix word indexes to obtain a suffix word index set;
segmenting and extracting the corpus corresponding to two adjacent suffix word indexes in the suffix word index set to obtain a substring text, matching the substring text in a candidate dictionary, recording candidate word indexes of all target candidate words in the corpus to be identified if target candidate words are matched, and storing all the target candidate words in a data structure;
and combining all the target candidate words in the data structure to obtain a plurality of combined target candidate words, and combining the target candidate words and the suffix words adjacent to the rear of the corresponding candidate word index one by one to obtain a plurality of entity words.
In one possible design, the obtaining the corpus to be recognized includes:
establishing connection with a service system, and receiving a push request sent by the service system;
and analyzing the data in the pushing request in a preset analysis mode to obtain the linguistic data to be identified.
In one possible design, the sorting and merging all suffix word indices includes:
searching whether completely overlapped indexing intervals exist in all suffix word indexes, and if so, merging the suffix word indexes into a suffix word index with the maximum interval;
searching whether partially overlapped index intervals exist or not, if so, merging the index intervals into the same suffix word index, wherein the index starting bit of the suffix word index is the minimum value in the partially overlapped index intervals, and the index ending bit of the suffix word index is the maximum value in the partially overlapped index intervals;
and searching whether adjacent index intervals exist or not, if so, merging the index intervals into the same suffix word index, wherein the index starting bit of the suffix word index is the minimum value in the adjacent index intervals, and the index ending bit of the suffix word index is the maximum value in the adjacent index intervals.
In one possible design, the matching the substring text in a candidate dictionary, and if a target candidate word is matched, recording candidate word indexes of all target candidate words in the corpus to be recognized, and storing all target candidate words in a data structure includes:
matching the substring text in a candidate dictionary;
if a target candidate word is matched, storing the target candidate word in a data structure, recording a candidate word index of the target candidate word in the corpus to be recognized, judging whether the total length of all matched target candidate words is equal to the length of an initial sub-character string text, if so, completing matching, if not, removing the target candidate word and all characters behind the target candidate word from the initial sub-character string text to obtain a new sub-character string text, and continuing to execute the step of matching the sub-character string text in a candidate dictionary;
if the target candidate word is not matched, matching the text of the sub character string in a prefix dictionary, if the target prefix word is matched, recording a prefix word index of the target prefix word in the corpus to be recognized, and finishing matching; if the target prefix word is not matched, judging whether the sub character string text has one character, if so, removing the character from the initial sub character string text to obtain a new sub character string text, and continuing to execute the step of matching the sub character string text in the candidate dictionary, otherwise, removing the first character of the initial sub character string text to obtain a new sub character string text, and continuing to execute the step of matching the sub character string text in the candidate dictionary.
In one possible design, the matching the substring text in a candidate dictionary, and if a target candidate word is matched, recording candidate word indexes of all target candidate words in the corpus to be recognized, and storing all target candidate words in a data structure includes:
and matching the sub-character string texts in a prefix dictionary, finishing matching if a target prefix word is matched, and executing the step of matching the sub-character string texts in a candidate dictionary if the target prefix word is not matched.
In one possible design, the merging all the target candidate words in the data structure to obtain multiple merged target candidate words includes:
searching whether partially overlapped index intervals exist in candidate word indexes corresponding to all target candidate words, if so, combining the candidate word indexes into the same candidate word index, wherein the index start bit of the candidate word index is the minimum value in the partially overlapped index intervals, and the index end bit of the candidate word index is the maximum value in the partially overlapped index intervals;
and searching whether adjacent index intervals exist or not, if so, combining the adjacent index intervals into the same candidate word index, wherein the index start bit of the candidate word index is the minimum value in the adjacent index intervals, and the index end bit of the candidate word index is the maximum value in the adjacent index intervals.
In one possible design, after the target candidate word and the suffix word adjacent to the rear of the corresponding candidate word index are combined one by one to obtain a plurality of entity words, the method further includes:
and searching the entity words in preset data to be pushed one by one, judging whether the data to be pushed contains matched target data, if so, extracting the target data, returning the target data to the service system, and if not, returning an error prompt.
An entity recognition device based on semantic parsing, comprising:
the suffix word matching module is used for acquiring the corpus to be recognized, matching the corpus to be recognized in a suffix dictionary, recording suffix word indexes of all target suffix words in the corpus to be recognized if the target suffix words are matched, and sequencing and combining all the suffix word indexes to obtain a suffix word index set;
the matching candidate word module is used for segmenting and extracting the corpus between two adjacent suffix word indexes in the suffix word index set to obtain a substring text, matching the substring text in a candidate dictionary, recording candidate word indexes of all target candidate words in the corpus to be identified if target candidate words are matched, and storing all the target candidate words in a data structure;
and the entity word determining module is used for merging all the target candidate words in the data structure to obtain a plurality of merged target candidate words, and combining the target candidate words and the adjacent suffix words behind the corresponding candidate word indexes one by one to obtain a plurality of entity words.
A computer device comprising a memory and a processor, the memory having stored therein computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the above-described entity recognition method based on semantic parsing.
A storage medium having stored thereon computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the semantic parsing based entity identification method described above.
The entity identification method, device, equipment and storage medium based on semantic analysis comprise the steps of obtaining a corpus to be identified, matching the corpus to be identified in a suffix dictionary, recording suffix word indexes of all target suffix words in the corpus to be identified if target suffix words are matched, and sequencing and combining all the suffix word indexes to obtain a suffix word index set; segmenting and extracting the corpus between two adjacent suffix word indexes in the suffix word index set to obtain a sub-character string text, matching the sub-character string text in a candidate dictionary, recording candidate word indexes of all target candidate words in the corpus to be identified if target candidate words are matched, and storing all the target candidate words in a data structure; and combining all the target candidate words in the data structure to obtain a plurality of combined target candidate words, and combining the target candidate words and the suffix words adjacent to the rear of the corresponding candidate word index one by one to obtain a plurality of entity words. The invention defines the starting boundary of the entity word by adopting a dictionary based on a prefix dictionary, a suffix dictionary and a candidate dictionary, and determines the entity word after respectively matching the main content of the corpus to be recognized in the dictionaries. Due to the adoption of the combined matching mode, the entity recognition can be completed only by maintaining a prefix dictionary, a suffix dictionary and a candidate dictionary which are small in scale. The matching process has high efficiency, small searching difficulty, wide identification range and strong portability, and can identify new words and entity words containing special symbols to the maximum extent.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention.
FIG. 1 is a flow chart of a semantic parsing based entity identification method in one embodiment of the invention;
FIG. 2 is a block diagram of an entity recognition apparatus based on semantic parsing according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Fig. 1 is a flowchart of an entity identification method based on semantic parsing according to an embodiment of the present invention, and as shown in fig. 1, an entity identification method based on semantic parsing includes the following steps:
Step S1, matching suffix words: acquiring a corpus to be recognized, matching the corpus to be recognized in the suffix dictionary, recording the suffix word indexes of all target suffix words in the corpus to be recognized if target suffix words are matched, and sorting and merging all the suffix word indexes to obtain a suffix word index set.
The suffix dictionary in this step is constructed and configured in advance; there may be multiple suffix dictionaries, classified by industry or category. Taking occupation-category entity words as an example, the suffix dictionary includes suffix words such as "teacher", "education" and "worker". This step searches whether the corpus to be recognized contains any suffix word in the suffix dictionary. Methods such as string matching can be used for the search. If a target suffix word is found, its suffix word index must also be recorded. Taking a text sentence as an example, the suffix word indexes {idx_1, idx_2, …, idx_n} corresponding to the target suffix words are recorded, where idx_n is the index interval [idx_n_start, idx_n_end] of the n-th target suffix word, and idx_n_start and idx_n_end are the index start bit and index end bit of the n-th target suffix word in the text sentence.
For example, when the corpus to be recognized is the occupation word "stainless steel smelting worker", it is matched in the suffix dictionary. Because the suffix dictionary contains "worker", the target suffix word "worker" is found in the corpus. Counting characters from index 0 in the original Chinese text (the indexes in this description refer to Chinese characters, not to the English translation), "worker" occupies positions 5 to 6, so the suffix word index recorded for the target suffix word "worker" is [5, 6].
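The suffix lookup described above can be sketched with plain string matching. This is a minimal illustration, not the patented implementation: the function name `find_suffix_indexes` is hypothetical, and the Chinese strings are assumed back-translations of the translated examples ("stainless steel smelting worker", "worker", "teacher", "education").

```python
def find_suffix_indexes(corpus, suffix_dict):
    """Return inclusive, 0-based [start, end] intervals for every
    occurrence of a suffix-dictionary word in the corpus."""
    hits = []
    for word in suffix_dict:
        if not word:
            continue  # ignore empty dictionary entries
        start = corpus.find(word)
        while start != -1:
            hits.append([start, start + len(word) - 1])
            start = corpus.find(word, start + 1)
    return hits


# Assumed Chinese form of "stainless steel smelting worker" (7 characters).
corpus = "不锈钢冶炼工人"
# Assumed Chinese forms of "worker", "teacher", "education".
suffix_dict = {"工人", "教师", "教育"}
print(find_suffix_indexes(corpus, suffix_dict))  # [[5, 6]]
```

The result agrees with the suffix word index [5, 6] given in the example above.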
In one embodiment, the corpus to be identified of step S1 is obtained by:
Establishing a connection with a service system, receiving a push request sent by the service system, parsing the data in the push request in a preset parsing manner, and obtaining the corpus to be recognized.
The service system can be a system with a data push function: a user enters the corpus to be recognized through the push interface of the service system, for example by entering "stainless steel smelting worker" in an occupation field, and the entity recognition system obtains the occupation field through that push interface. The service system can also be a system for intent recognition, knowledge graph construction, information extraction or retrieval, and the like; once a connection with the service system is established, the service system sends its push request to the entity recognition system. Because a push request usually has a preset format, it can be parsed with a method such as a regular expression; for example, the occupation field "stainless steel smelting worker", which is the corpus to be recognized, is extracted directly from the request.
The corpus to be recognized in this embodiment can be directly obtained from the push request of the external service system, and the external service system can communicate only by establishing connection with the entity recognition system in this step, so that the entity recognition system has better expandability.
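As a rough illustration of parsing a push request with a regular expression, the sketch below assumes a purely hypothetical request format of `key=value` pairs joined by `&`, with the corpus carried in a field named `occupation`. The source does not specify the request format, so the field name, the separator and the function name `extract_corpus` are all assumptions.

```python
import re

# Hypothetical format: "user=42&occupation=<corpus to be recognized>"
OCCUPATION_FIELD = re.compile(r"(?:^|&)occupation=([^&]*)")


def extract_corpus(push_request):
    """Return the value of the assumed 'occupation' field, or None."""
    match = OCCUPATION_FIELD.search(push_request)
    return match.group(1) if match else None


print(extract_corpus("user=42&occupation=不锈钢冶炼工人"))  # 不锈钢冶炼工人
```

A real service system would more likely use a structured format with its own parser; the regular expression here only mirrors the parsing manner mentioned in the text.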
Several suffix words may be found in the corpus to be recognized, in which case there are several suffix word indexes. Each suffix word index is an index interval, and two intervals may be adjacent, non-adjacent, or overlapping. When suffix words are matched in the suffix dictionary with the reverse maximum matching method, the search proceeds from the end of the corpus toward the front, so the suffix word indexes are recorded from back to front, for example [7, 11], [1, 5], [2, 3], and they must be sorted for subsequent calculation. Sorting in ascending order of the index start bit gives [1, 5], [2, 3], [7, 11].
In one embodiment, the sorting and merging of all suffix word indexes in step S1 includes:
Step S101, merging completely overlapped indexes: searching all suffix word indexes for completely overlapped index intervals and, if any exist, merging them into the suffix word index with the largest interval.
Completely overlapped suffix word indexes are merged into the index with the largest interval; for example, suffix word indexes [1, 5] and [2, 3] are merged into suffix word index [1, 5].
Step S102, merging partially overlapped indexes: searching for partially overlapped index intervals and, if any exist, merging them into one suffix word index whose index start bit is the minimum value in the partially overlapped index intervals and whose index end bit is the maximum value in the partially overlapped index intervals.
Partially overlapped suffix word indexes are merged into one interval, with the minimum value as the index start bit and the maximum value as the index end bit; for example, suffix word indexes [1, 6] and [4, 11] are merged into suffix word index [1, 11].
Step S103, merging adjacent indexes: searching for adjacent index intervals and, if any exist, merging them into one suffix word index whose index start bit is the minimum value in the adjacent index intervals and whose index end bit is the maximum value in the adjacent index intervals.
Adjacent suffix word indexes are merged into one interval, with the minimum value as the index start bit and the maximum value as the index end bit; for example, suffix word indexes [1, 6] and [7, 11] are merged into suffix word index [1, 11].
Suffix word indexes that fall outside the above three cases are all retained; for example, suffix word indexes [1, 3] and [7, 11] are still saved as [1, 3] and [7, 11].
In this embodiment, merging the suffix word indexes in the set brings related index intervals together, which makes it easier to segment and extract a more complete and accurate substring text from the suffix word indexes afterwards.
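The sorting and the three merge cases (steps S101 to S103) can be handled in a single pass over the sorted intervals. The following is a minimal sketch; the function name `merge_indexes` is hypothetical, and treating all three cases with one comparison is a simplification of the step-by-step description.

```python
def merge_indexes(indexes):
    """Sort inclusive [start, end] intervals by start bit, then merge
    intervals that are fully overlapped, partially overlapped, or adjacent."""
    merged = []
    for start, end in sorted(indexes):
        # "start <= previous end + 1" covers overlap and adjacency alike.
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


print(merge_indexes([[7, 11], [1, 5], [2, 3]]))  # [[1, 5], [7, 11]]
print(merge_indexes([[1, 6], [4, 11]]))          # [[1, 11]]
print(merge_indexes([[1, 6], [7, 11]]))          # [[1, 11]]
print(merge_indexes([[1, 3], [7, 11]]))          # [[1, 3], [7, 11]]
```

The four calls reproduce the examples given for complete overlap, partial overlap, adjacency, and the case where all intervals are retained.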
Step S2, matching candidate words: segmenting and extracting linguistic data to be recognized corresponding to two adjacent suffix word indexes in the suffix word index set to obtain a substring text, matching the substring text in a candidate dictionary, recording candidate word indexes of all target candidate words in the linguistic data to be recognized if the target candidate words are matched, and storing all the target candidate words in a data structure.
The candidate dictionary in this step is constructed and configured in advance; there may be multiple candidate dictionaries, classified by industry or category. Taking occupation-category entity words as an example, the candidate dictionary includes candidate words such as "repair", "mathematics", "Chinese", "English", "stainless steel" and "smelting".
Segmentation and extraction in this step means removing the suffix words from the corpus to be recognized and keeping only the text between two adjacent suffix words, which is recorded as a substring text. When the corpus to be recognized is long, it may contain several suffix words, and segmentation and extraction then yields several substring texts. In this step, each substring text is matched in the candidate dictionary to obtain the target candidate words.
For example, the corpus to be recognized is the occupation word "stainless steel smelting worker", and "worker" is obtained as the suffix word in step S1. In this step, "worker" is removed from the corpus, leaving "stainless steel smelting" as the substring text to be recognized. The substring text is then searched in the candidate dictionary to find any target candidate words it contains.
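A minimal sketch of the segmentation and extraction follows. It assumes the suffix word indexes are already sorted and merged, and it treats the start of the corpus as the boundary before the first suffix word, which the single-suffix example implies but the text does not state explicitly. The function name `split_substrings` and the Chinese strings are assumptions.

```python
def split_substrings(corpus, suffix_indexes):
    """Return (offset, text) pairs for the text lying before each suffix
    word, i.e. between the previous suffix word (or corpus start) and it."""
    pieces = []
    prev_end = -1  # assumption: corpus start acts as the first boundary
    for start, end in suffix_indexes:
        if start > prev_end + 1:
            pieces.append((prev_end + 1, corpus[prev_end + 1:start]))
        prev_end = end
    return pieces


# Assumed Chinese form of "stainless steel smelting worker".
print(split_substrings("不锈钢冶炼工人", [[5, 6]]))  # [(0, '不锈钢冶炼')]
```

The offset is kept alongside each substring text so that candidate word indexes can later be expressed relative to the whole corpus.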
When the substring text is searched and matched in the candidate dictionary, the principle of a maximum matching algorithm is adopted: the substring text is segmented and compared with the candidate dictionary; if it contains a dictionary word, the candidate word index is recorded; otherwise one character is added or removed and the comparison continues until a single character remains; if that character cannot be segmented further and is not in the candidate dictionary, the search ends. This step preferably uses a forward maximum matching algorithm, modified from the existing forward maximum matching method as described below.
In one embodiment, in step S2, matching the substring text in the candidate dictionary, and if the target candidate word is matched, recording candidate word indexes of all target candidate words in the corpus to be recognized, and storing all target candidate words in a data structure, includes:
Step S201, matching: matching the substring text in a candidate dictionary.
According to the principle of a forward maximum matching algorithm, when a substring text is matched with a candidate dictionary, an initial substring text is firstly matched in the candidate dictionary, and the initial substring text is the substring text which is extracted by segmenting the linguistic data to be recognized between two adjacent suffix word indexes.
Step S202, first judgment: if a target candidate word is matched, storing the target candidate word in a data structure, recording the candidate word index of the target candidate word in the corpus to be recognized, and judging whether the total length of all matched target candidate words equals the length of the initial substring text; if so, matching is complete; if not, removing the target candidate word and all characters after it from the initial substring text to obtain a new substring text, and continuing with step S201.
When a target candidate word is matched, whether all matching is finished is judged against the length of the initial substring text. If matching is not finished, the matched target candidate word is removed from the initial substring text before matching continues.
Step S203, second judgment: if no target candidate word is matched, matching the substring text in a prefix dictionary; if a target prefix word is matched, recording the prefix word index of the target prefix word in the corpus to be recognized, and matching is complete; if no target prefix word is matched, judging whether the substring text consists of a single character; if so, removing that character from the initial substring text to obtain a new substring text and continuing with step S201; otherwise, removing the first character of the initial substring text to obtain a new substring text and continuing with step S201.
When no target candidate word is matched, the substring text is matched against the prefix dictionary, which determines whether matching is finished. The prefix dictionary in this step is constructed and configured in advance; there may be multiple prefix dictionaries, classified by industry or category. Taking occupation-category entity words as an example, the prefix dictionary contains prefix words such as "engage", "read" and "I am". For example, if the corpus to be recognized is "I am engaged in stainless steel smelting work", the candidate words "smelting" and "stainless steel" are recognized first; when "engage" is then searched, it is not matched in the candidate dictionary, so it is searched in the prefix dictionary, and because "engage" exists in the prefix dictionary, matching is finished.
For example, the corpus to be recognized is the occupation word "stainless steel smelting worker" and the target suffix word is "worker"; removing the target suffix word leaves the initial substring text "stainless steel smelting". Suppose the candidate dictionary contains only the two words "stainless steel" and "smelting". With the matching manner of this embodiment, the whole text "stainless steel smelting" is searched in the candidate dictionary first; since no target candidate word is matched, the first character is removed and the shortened text is searched, and this is repeated until the first target candidate word "smelting" is matched and its candidate word index [3, 4] is recorded. "Smelting" is then removed from the initial substring text "stainless steel smelting", so the new substring text becomes "stainless steel"; searching it in the candidate dictionary in the same manner yields the second target candidate word "stainless steel" with candidate word index [0, 2]. The combination of "smelting" and "stainless steel" is "stainless steel smelting", whose length equals that of the initial substring text, so the whole search and matching process is complete.
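Steps S201 to S203 can be sketched as a loop over a shrinking window of the initial substring text. This is one reading of the steps, chosen because it reproduces the worked example: a failed match drops the first character of the current substring, a successful match drops the matched word and everything after it, and an unmatched single character is dropped from the end. The function name `match_candidates`, the `offset` parameter and the Chinese strings (assumed forms of "stainless steel", "smelting", "engage", "read", "I am") are illustrative assumptions.

```python
def match_candidates(text, candidate_dict, prefix_dict, offset=0):
    """Return (word, [start, end]) pairs for candidate words found in text.
    Indexes are inclusive and shifted by offset into the full corpus."""
    found = []
    start, end = 0, len(text)  # current substring is text[start:end]
    while end > 0:
        current = text[start:end]
        if current in candidate_dict:          # S202: candidate matched
            found.append((current, [offset + start, offset + end - 1]))
            end, start = start, 0              # drop match and what follows
        elif current in prefix_dict:           # S203: prefix ends matching
            break
        elif len(current) == 1:                # unmatched single character
            end, start = end - 1, 0
        else:                                  # drop the first character
            start += 1
    return found


candidates = {"不锈钢", "冶炼"}
prefixes = {"从事", "就读", "我是"}
print(match_candidates("不锈钢冶炼", candidates, prefixes))
# [('冶炼', [3, 4]), ('不锈钢', [0, 2])]
```

The loop ends once the whole initial substring text has been consumed, which covers the length check in step S202; a fuller version would also return the prefix word index recorded in step S203.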
Alternatively, during matching, the prefix dictionary may be matched first and the candidate dictionary second:
Step S211, matching: match the substring text in the prefix dictionary, and finish matching if a target prefix word is matched; if no target prefix word is matched, match the substring text in the candidate dictionary.
Step S212, first judgment: if a target candidate word is matched, store the target candidate word in a data structure, record the candidate word index of the target candidate word in the corpus to be recognized, and judge whether the total length of all matched target candidate words equals the length of the initial substring text; if so, matching is complete; if not, remove the target candidate word and all characters after it from the initial substring text to obtain a new substring text, and continue with step S211.
Step S213, second judgment: if no target candidate word is matched, judge whether the substring text consists of a single character; if so, remove that character from the initial substring text to obtain a new substring text and continue with step S211; otherwise, remove the first character of the initial substring text to obtain a new substring text and continue with step S211.
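Steps S211 to S213 can be sketched as below, as one possible reading rather than the definitive procedure. The name match_prefix_first and the set-based dictionaries are assumptions for illustration. "Removing the first character" is read here as shortening the string currently being looked up, as in the "stainless steel smelting" walkthrough, and the returned indexes are relative to the initial substring text rather than to the whole corpus.

```python
def match_prefix_first(initial, candidate_dict, prefix_dict):
    """Return a list of (word, start, end) matches with inclusive indexes."""
    matches = []
    remaining = initial   # initial substring text, trimmed from the right as matching proceeds
    sub = remaining       # string currently being looked up (always a tail of `remaining`)
    while sub:
        if sub in prefix_dict:              # S211: prefix dictionary first
            break
        if sub in candidate_dict:           # S212: record the match, cut it off
            start = len(remaining) - len(sub)
            matches.append((sub, start, len(remaining) - 1))
            remaining = remaining[:start]
            sub = remaining
        elif len(sub) == 1:                 # S213: unmatched single character
            remaining = remaining[:-1]
            sub = remaining
        else:                               # S213: drop the leading character
            sub = sub[1:]
    return matches


# Hypothetical dictionaries; the suffix word is assumed to be cut off already.
print(match_prefix_first("engagemath", {"math", "physics"}, {"engage"}))
# [('math', 6, 9)]
```

Checking the prefix dictionary first only changes the outcome when a string appears in both dictionaries; in that case this mode stops instead of recording a candidate word.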
With the two matching modes of this embodiment, more accurate target candidate words can be obtained. For example, for corpora to be recognized such as "physical education", "mathematical education", and "language education", the target suffix word "education" is obtained in step S1, the target candidate word "math", "physical", or "language" is then obtained by the matching in step S2, and the target prefix word "pursuit" is matched last.
Step S3, determining entity words: merge all the target candidate words in the data structure to obtain a plurality of merged target candidate words, and combine each merged target candidate word with the suffix word immediately following its candidate word index to obtain a plurality of entity words.
For example, for the corpus to be recognized "I am engaged in stainless steel smelting work", the target candidate word "smelting" has candidate word index [3, 4] and the target candidate word "stainless steel" has candidate word index [0, 2]. Because the two target candidate words are adjacent, they should be merged; the resulting target candidate word "stainless steel smelting" is more accurate and reasonable. Finally, "stainless steel smelting" is combined with the target suffix word "work" that follows it to obtain the entity word "stainless steel smelting work".
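The combination in step S3 of a merged target candidate word with the suffix word directly behind it might look like the sketch below. The function name build_entity_words and the representation of indexes as inclusive (start, end) tuples are assumptions; the candidate spans are assumed to be merged already, and a candidate with no suffix word immediately after it is simply skipped, which the text does not specify.

```python
def build_entity_words(corpus, candidate_spans, suffix_spans):
    """Join each merged candidate span with the suffix span that starts right after it."""
    suffix_end_by_start = {start: end for start, end in suffix_spans}
    entity_words = []
    for start, end in sorted(candidate_spans):
        suffix_end = suffix_end_by_start.get(end + 1)   # suffix must begin at end + 1
        if suffix_end is not None:
            entity_words.append(corpus[start:suffix_end + 1])
    return entity_words


# "stainlesssteelsmelting" occupies [0, 21] and the suffix "work" occupies [22, 25].
print(build_entity_words("stainlesssteelsmeltingwork", [(0, 21)], [(22, 25)]))
# ['stainlesssteelsmeltingwork']
```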
In one embodiment, in step S3, merging all the target candidate words in the data structure to obtain a plurality of merged target candidate words includes:
Step S301, merging partially overlapping indexes: search the candidate word indexes of all target candidate words for partially overlapping index intervals; if any exist, merge them into a single candidate word index whose index start bit is the minimum value in the partially overlapping index intervals and whose index end bit is the maximum value in those intervals.
The merging in this step is similar to step S102: partially overlapping candidate word indexes are merged into a single interval index, with the minimum value as the index start bit and the maximum value as the index end bit. For example, candidate word indexes [1, 6] and [4, 11] are merged into candidate word index [1, 11].
Step S302, merging adjacent indexes: search for adjacent index intervals; if any exist, merge them into a single candidate word index whose index start bit is the minimum value in the adjacent index intervals and whose index end bit is the maximum value in those intervals.
The merging in this step is similar to step S103: adjacent candidate word indexes are merged into a single interval index, with the minimum value as the index start bit and the maximum value as the index end bit. For example, candidate word indexes [1, 6] and [7, 11] are merged into candidate word index [1, 11].
For candidate word indexes that fall under neither of the two cases above, all index intervals are retained. For example, candidate word indexes [1, 3] and [7, 11] are still saved as [1, 3] and [7, 11].
In this embodiment, merging the candidate word index set in this way brings related index intervals together, which facilitates the subsequent combination of each candidate word index with its adjacent suffix word index and finally yields accurate and complete entity words.
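The two merge rules of steps S301 and S302 can be illustrated with a single pass over the sorted intervals. This is a minimal sketch under the assumption that indexes are inclusive (start, end) tuples; the name merge_spans is hypothetical. As written, it also absorbs an interval fully contained in another, a case the steps above do not address separately.

```python
def merge_spans(spans):
    """Merge inclusive (start, end) intervals that overlap or are directly adjacent."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1] + 1:
            # Overlapping or adjacent: keep the smallest start and the largest end.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


print(merge_spans([(1, 6), (4, 11)]))   # [(1, 11)]  partially overlapping
print(merge_spans([(1, 6), (7, 11)]))   # [(1, 11)]  adjacent
print(merge_spans([(1, 3), (7, 11)]))   # [(1, 3), (7, 11)]  retained as they are
```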
In one embodiment, after step S3, the method further includes:
Searching for the entity words one by one in preset data to be pushed, and judging whether the data to be pushed contains matching target data; if so, extracting the target data and returning it to the service system; if not, returning an error prompt.
The data to be pushed in this embodiment includes a plurality of keywords corresponding to entity words. For example, when the data to be pushed is a project product, the product includes several keywords, such as the people it suits and the professions it suits, and the entity words obtained in this step are matched against the suitable professions. Fuzzy search may be used for matching: as long as part of an entity word is included, the match is regarded as successful, and the matched push data is the target data. If several items of target data match, all of them are pushed. When the target data is pushed, if the corpus to be recognized was acquired from the push interface in step S1, the target data is also displayed through the push interface; if the corpus to be recognized was acquired from the service system in step S1, the target data is returned to the corresponding service system through the established connection.
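A rough sketch of the push lookup described above is given below. The data layout (a list of dicts with "name" and "suitable_professions" keys), the function name find_push_targets, and the shape of the returned result are all assumptions made for illustration, and fuzzy search is approximated by plain substring containment in either direction.

```python
def find_push_targets(entity_words, push_data):
    """Return every push item whose suitable professions fuzzily match an entity word."""
    targets = []
    for item in push_data:
        professions = item["suitable_professions"]
        # Substring containment in either direction stands in for fuzzy search.
        if any(prof in word or word in prof
               for word in entity_words for prof in professions):
            targets.append(item["name"])
    if not targets:
        return {"ok": False, "error": "no matching target data"}
    return {"ok": True, "targets": targets}


# Hypothetical data to be pushed.
push_data = [
    {"name": "Plan A", "suitable_professions": ["smelting", "welding"]},
    {"name": "Plan B", "suitable_professions": ["teacher"]},
]
print(find_push_targets(["stainless steel smelting work"], push_data))
# {'ok': True, 'targets': ['Plan A']}
```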
In one embodiment, after step S3, the obtained entity words may be further checked and corrected:
After step S3 yields a plurality of entity words, two adjacent entity words may be combined into one. For example, if the obtained entity words are "building industry" and "stair plasterer" and their index positions in the corpus to be recognized are adjacent, the two may be further combined into the entity word "building industry stair plasterer". Therefore, after a plurality of entity words are recognized, the candidate word index and the suffix word index of each entity word are combined into an entity word index; the entity word indexes are then searched for adjacent ones, and any two adjacent entity word indexes are merged into one entity word index.
After step S3 yields a plurality of entity words, their content may be corrected according to the industry to which the candidate dictionary and suffix dictionary used to obtain them belong: search the entity word for a preset ambiguous word; if one exists, look up the correction word to be added in a preset ambiguous-word industry mapping table, and add the correction word in brackets after the ambiguous word. For example, the entity word "apple salesman" is recognized from a corpus in the electronics industry; because it was found using the electronics industry dictionaries, "apple" here refers to the brand rather than the fruit, and the entity word can be further corrected to "apple (equipment) salesman".
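The ambiguous-word correction can be illustrated as follows. The mapping table is modeled here as a dict keyed by (ambiguous word, industry), and the function name correct_ambiguous and the table entries are hypothetical; the text does not describe how the table is actually stored.

```python
def correct_ambiguous(entity_word, industry, mapping):
    """Append the industry-specific correction word in brackets after each ambiguous word."""
    for (ambiguous, mapped_industry), correction in mapping.items():
        if mapped_industry == industry and ambiguous in entity_word:
            entity_word = entity_word.replace(ambiguous, f"{ambiguous} ({correction})")
    return entity_word


# Hypothetical ambiguous-word industry mapping table.
mapping = {
    ("apple", "electronics"): "equipment",
    ("apple", "agriculture"): "fruit",
}
print(correct_ambiguous("apple salesman", "electronics", mapping))
# apple (equipment) salesman
```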
After step S3 yields a plurality of entity words, they may be compared with entity words labeled in advance; if they differ, the differences are extracted and added to the corresponding dictionary. Alternatively, the labeled entity words are segmented, an algorithm such as term frequency-inverse document frequency (TF-IDF) is used to count the word frequency of each phrase after segmentation, and prefix words, candidate words, and suffix words are extracted to expand the corresponding dictionaries.
The entity words extracted in this embodiment can be applied to processing insurance application information: a connection is established with the insurance application service system, and the applicant's information is received as the corpus to be recognized. The embodiment then identifies a plurality of entity words, which constitute the applicant's occupational information, so that occupation-related insurance suggestions can be fed back or insurance products pushed.
The entity words extracted in this embodiment can also be applied to knowledge graph construction: a connection is established with a knowledge graph processing system, and the occupation fields it sends are received as the corpus to be recognized. The embodiment identifies a plurality of entity words and feeds them back to the knowledge graph processing system; the entity words received are occupational attributes of an entity and can be linked to that entity's knowledge graph to enrich it.
In the entity recognition method based on semantic parsing, the initial boundaries of entity words in the corpus to be recognized are delimited by the pre-established candidate dictionary, prefix dictionary, and suffix dictionary, and entity recognition can subsequently be improved simply by maintaining these three relatively small dictionaries. The three dictionaries are one tenth the size of those used by traditional recognition methods, so searching and matching the corpus to be recognized is highly efficient and easy, and the method is portable and simple to maintain.
In one embodiment, an entity recognition apparatus based on semantic parsing is provided, as shown in fig. 2, including:
a suffix word matching module 10, configured to obtain a corpus to be recognized, match the corpus to be recognized in a suffix dictionary, record the suffix word indexes of all target suffix words in the corpus to be recognized if target suffix words are matched, and sort and merge all suffix word indexes to obtain a suffix word index set;
a candidate word matching module 20, configured to segment and extract the corpus between two adjacent suffix word indexes in the suffix word index set to obtain a substring text, match the substring text in a candidate dictionary, record the candidate word indexes of all target candidate words in the corpus to be recognized if target candidate words are matched, and store all the target candidate words in a data structure;
and an entity word determining module 30, configured to merge all target candidate words in the data structure to obtain a plurality of merged target candidate words, and combine each merged target candidate word with the suffix word immediately following its candidate word index to obtain a plurality of entity words.
In one embodiment, a computer device is provided, which includes a memory and a processor, the memory stores computer readable instructions, and when the computer readable instructions are executed by the processor, the processor implements the steps in the entity identification method based on semantic parsing of the above embodiments.
In one embodiment, a storage medium storing computer-readable instructions is provided, and the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the steps of the semantic parsing based entity identification method of the above embodiments. The storage medium may be a nonvolatile storage medium.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the associated hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and the like.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments express only some exemplary embodiments of the present invention; their description is relatively specific and detailed, but should not be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. An entity identification method based on semantic parsing is characterized by comprising the following steps:
obtaining a corpus to be recognized, matching the corpus to be recognized in a suffix dictionary, recording suffix word indexes of all target suffix words in the corpus to be recognized if the target suffix words are matched, and sorting and merging all suffix word indexes to obtain a suffix word index set;
segmenting and extracting the corpus corresponding to two adjacent suffix word indexes in the suffix word index set to obtain a substring text, matching the substring text in a candidate dictionary, recording candidate word indexes of all target candidate words in the corpus to be identified if target candidate words are matched, and storing all the target candidate words in a data structure;
and combining all the target candidate words in the data structure to obtain a plurality of combined target candidate words, and combining the target candidate words and the suffix words adjacent to the rear of the corresponding candidate word index one by one to obtain a plurality of entity words.
2. The entity recognition method based on semantic parsing as claimed in claim 1, wherein the obtaining the corpus to be recognized comprises:
establishing connection with a service system, and receiving a push request sent by the service system;
and analyzing the data in the pushing request in a preset analysis mode to obtain the linguistic data to be identified.
3. The entity identification method based on semantic parsing as claimed in claim 1, wherein the sorting and merging all suffix word indexes comprises:
searching whether completely overlapped indexing intervals exist in all suffix word indexes, and if so, merging the suffix word indexes into a suffix word index with the maximum interval;
searching whether partially overlapped index intervals exist or not, if so, merging the index intervals into the same suffix word index, wherein the index starting bit of the suffix word index is the minimum value in the partially overlapped index intervals, and the index ending bit of the suffix word index is the maximum value in the partially overlapped index intervals;
and searching whether adjacent index intervals exist or not, if so, merging the index intervals into the same suffix word index, wherein the index starting bit of the suffix word index is the minimum value in the adjacent index intervals, and the index ending bit of the suffix word index is the maximum value in the adjacent index intervals.
4. The entity recognition method based on semantic parsing according to claim 1, wherein the matching the substring text in a candidate dictionary, and if a target candidate word is matched, recording candidate word indexes of all target candidate words in the corpus to be recognized, and storing all target candidate words in a data structure, comprises:
matching the substring text in a candidate dictionary;
if the target candidate words are matched, storing the target candidate words in a data structure, recording candidate word indexes of the target candidate words in the corpus to be identified, judging whether the total length of all the matched target candidate words is equal to the length of the initial substring text, and if so, completing matching; if not, removing the target candidate word and all characters behind the target candidate word from an initial sub-character string text to obtain a new sub-character string text, and continuing to execute the step of matching the sub-character string text in a candidate dictionary;
if the target candidate word is not matched, matching the text of the sub character string in a prefix dictionary, if the target prefix word is matched, recording a prefix word index of the target prefix word in the corpus to be recognized, and finishing matching; if the target prefix word is not matched, judging whether the sub character string text has one character, if so, removing the character from the initial sub character string text to obtain a new sub character string text, and continuing to execute the step of matching the sub character string text in the candidate dictionary, otherwise, removing the first character of the initial sub character string text to obtain a new sub character string text, and continuing to execute the step of matching the sub character string text in the candidate dictionary.
5. The entity recognition method based on semantic parsing according to claim 1, wherein the matching the substring text in a candidate dictionary, and if a target candidate word is matched, recording candidate word indexes of all target candidate words in the corpus to be recognized, and storing all target candidate words in a data structure, comprises:
matching the sub character string texts in a prefix dictionary, if a target prefix word is matched, finishing the matching, and if the target prefix word is not matched, matching the sub character string texts in a candidate dictionary;
if the target candidate words are matched, storing the target candidate words in a data structure, recording candidate word indexes of the target candidate words in the corpus to be identified, judging whether the total length of all the matched target candidate words is equal to the length of the initial substring text, and if so, completing matching; if not, removing the target candidate word and all characters behind the target candidate word from an initial sub-character string text to obtain a new sub-character string text, and continuing to execute the step of matching the sub-character string text in a prefix dictionary;
if the target candidate word is not matched, judging whether the sub-character string text has one character, if so, removing the character from the initial sub-character string text to obtain a new sub-character string text, and continuing to execute the step of matching the sub-character string text in a prefix dictionary, otherwise, removing the first character of the initial sub-character string text to obtain a new sub-character string text, and continuing to execute the step of matching the sub-character string text in the prefix dictionary.
6. The entity identification method based on semantic parsing according to any one of claims 1 to 5, wherein the merging all the target candidate words in the data structure to obtain multiple merged target candidate words comprises:
searching whether partially overlapped index intervals exist in candidate word indexes corresponding to all target candidate words, if so, combining the candidate word indexes into the same candidate word index, wherein the index start bit of the candidate word index is the minimum value in the partially overlapped index intervals, and the index end bit of the candidate word index is the maximum value in the partially overlapped index intervals;
and searching whether adjacent index intervals exist or not, if so, combining the adjacent index intervals into the same candidate word index, wherein the index start bit of the candidate word index is the minimum value in the adjacent index intervals, and the index end bit of the candidate word index is the maximum value in the adjacent index intervals.
7. The entity identification method based on semantic parsing according to claim 2, wherein, after the step of combining the target candidate words and the suffix words adjacent to the rear of the corresponding candidate word indexes one by one to obtain a plurality of entity words, the method further comprises:
and searching the entity words in preset data to be pushed one by one, judging whether the data to be pushed contains matched target data, if so, extracting the target data, returning the target data to the service system, and if not, returning an error prompt.
8. An entity recognition device based on semantic parsing, comprising:
the suffix word matching module is used for acquiring the corpus to be recognized, matching the corpus to be recognized in a suffix dictionary, recording suffix word indexes of all target suffix words in the corpus to be recognized if the target suffix words are matched, and sorting and merging all the suffix word indexes to obtain a suffix word index set;
the matching candidate word module is used for segmenting and extracting the linguistic data to be recognized corresponding to two adjacent suffix word indexes in the suffix word index set to obtain a substring text, matching the substring text in a candidate dictionary, recording candidate word indexes of all target candidate words in the linguistic data to be recognized if target candidate words are matched, and storing all the target candidate words in a data structure;
and the entity word determining module is used for merging all the target candidate words in the data structure to obtain a plurality of merged target candidate words, and combining the target candidate words and the adjacent suffix words behind the corresponding candidate word indexes one by one to obtain a plurality of entity words.
9. A computer device comprising a memory and a processor, the memory having stored therein computer-readable instructions which, when executed by the processor, cause the processor to carry out the steps of the entity recognition method based on semantic parsing according to any one of claims 1 to 7.
10. A storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the entity recognition method based on semantic parsing according to any one of claims 1 to 7.
CN202010156694.3A 2020-03-09 2020-03-09 Entity identification method, device, equipment and storage medium based on semantic analysis Pending CN111368547A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202010156694.3A CN111368547A (en) | 2020-03-09 | 2020-03-09 | Entity identification method, device, equipment and storage medium based on semantic analysis

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202010156694.3A CN111368547A (en) | 2020-03-09 | 2020-03-09 | Entity identification method, device, equipment and storage medium based on semantic analysis

Publications (1)

Publication Number | Publication Date
CN111368547A (en) | 2020-07-03

Family

ID=71208635

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202010156694.3A Pending CN111368547A (en) | Entity identification method, device, equipment and storage medium based on semantic analysis | 2020-03-09 | 2020-03-09

Country Status (1)

Country | Link
CN (1) | CN111368547A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113569573A (en) * | 2021-06-28 | 2021-10-29 | 浙江工业大学 | Method and system for identifying generalization entity facing financial field
CN114138945A (en) * | 2022-01-19 | 2022-03-04 | 支付宝(杭州)信息技术有限公司 | Entity identification method and device in data analysis
CN114138945B (en) * | 2022-01-19 | 2022-06-14 | 支付宝(杭州)信息技术有限公司 | Entity identification method and device in data analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination