CN113807099B - Entity information identification method, device, electronic equipment and storage medium - Google Patents

Entity information identification method, device, electronic equipment and storage medium

Info

Publication number
CN113807099B
CN113807099B (application CN202111111471.6A)
Authority
CN
China
Prior art keywords
candidate information
word segment
information
target
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111111471.6A
Other languages
Chinese (zh)
Other versions
CN113807099A (en)
Inventor
张惠蒙
黄昉
史亚冰
蒋烨
朱勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111111471.6A
Publication of CN113807099A
Application granted
Publication of CN113807099B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure provides an entity information identification method and apparatus, an electronic device, and a storage medium, relating to the technical field of data processing and in particular to artificial intelligence technologies such as natural language processing and deep learning. The specific implementation scheme is as follows: perform named entity recognition on a text to be recognized to obtain at least one piece of candidate information; perform feature extraction on each piece of candidate information to obtain at least one piece of feature information; perform deep semantic recognition on each piece of feature information to obtain at least one semantic recognition result; and determine, from the at least one piece of candidate information, the entity information corresponding to the text to be recognized according to the at least one semantic recognition result.

Description

Entity information identification method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, in particular to artificial intelligence technologies such as NLP (Natural Language Processing) and deep learning, and more particularly to an entity information identification method and apparatus, an electronic device, and a storage medium.
Background
Named Entity Recognition (NER) is one of the fundamental and important tasks in natural language processing and has a very wide range of applications. An NER system can extract entity information from unstructured input text and, according to business needs, identify further categories of entity information. In industrial application scenarios, the identification of entity information is the basis of tasks such as knowledge-graph construction, text understanding, and dialog intent understanding.
Disclosure of Invention
The disclosure provides an entity information identification method, an entity information identification device, electronic equipment and a storage medium.
According to an aspect of the present disclosure, there is provided an entity information identification method, including: carrying out named entity recognition on the text to be recognized to obtain at least one piece of candidate information; extracting the characteristics of each piece of candidate information to obtain at least one piece of characteristic information; carrying out deep semantic recognition on each piece of characteristic information to obtain at least one semantic recognition result; and determining entity information corresponding to the text to be recognized from the at least one candidate information according to the at least one semantic recognition result.
According to another aspect of the present disclosure, there is provided an entity information identifying apparatus including: the entity recognition module is used for carrying out named entity recognition on the text to be recognized to obtain at least one piece of candidate information; the feature extraction module is used for carrying out feature extraction on each piece of candidate information to obtain at least one piece of feature information; the semantic recognition module is used for carrying out deep semantic recognition on each piece of characteristic information to obtain at least one semantic recognition result; and the determining module is used for determining entity information corresponding to the text to be recognized from the at least one candidate information according to the at least one semantic recognition result.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the entity information identification method as described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the entity information identification method as described above.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the entity information identification method as described above.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 schematically illustrates an exemplary system architecture to which entity information identification methods and apparatus may be applied, according to embodiments of the present disclosure;
FIG. 2 schematically illustrates a flow diagram of a method of entity information identification according to one embodiment of the present disclosure;
FIG. 3 schematically illustrates a schematic diagram of an entity candidate extraction model according to an embodiment of the disclosure;
FIG. 4 schematically illustrates a schematic diagram of an entity discriminant model according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a representation of text feature information in text to be identified according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates a flow chart of an entity information identification method according to another embodiment of the present disclosure;
fig. 7 schematically illustrates a block diagram of an entity information identification apparatus according to an embodiment of the present disclosure; and
FIG. 8 illustrates a schematic block diagram of an example electronic device that may be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of users' personal information all comply with the relevant laws and regulations, necessary security measures are taken, and public order and good customs are not violated.
Schemes for implementing named entity recognition include dictionary-matching-based schemes, rule-matching-based schemes, schemes based on the BERT (Bidirectional Encoder Representations from Transformers, a language representation model) + CRF (Conditional Random Field) model, schemes based on the BERT + MRC (Machine Reading Comprehension) model, and the like.
In a dictionary-matching-based scheme, some entity information can be predefined in a dictionary; forward maximum matching and reverse maximum matching are then performed on the text against that entity information to obtain a candidate information set for entity information recognition, which is screened based on word frequency to output the recognition result.
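The forward and reverse maximum matching mentioned above can be sketched in plain Python. This is an illustrative reconstruction, not the patented implementation; the dictionary contents and window size are made-up assumptions:

```python
def forward_max_match(text, dictionary, max_len=5):
    """Greedy forward maximum matching: at each position, take the
    longest dictionary entry that matches, else a single character."""
    result, i = [], 0
    while i < len(text):
        matched = None
        # Try the longest window first, then shrink it.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                matched = text[i:j]
                break
        if matched is None:
            matched = text[i]  # fall back to a single character
        result.append(matched)
        i += len(matched)
    return result

def backward_max_match(text, dictionary, max_len=5):
    """The same idea, scanning from right to left."""
    result, j = [], len(text)
    while j > 0:
        matched = None
        for i in range(max(0, j - max_len), j):
            if text[i:j] in dictionary:
                matched = text[i:j]
                break
        if matched is None:
            matched = text[j - 1]
        result.append(matched)
        j -= len(matched)
    return result[::-1]
```

Segments on which the two directions disagree are the ambiguous spans that the word-frequency screening described above would then resolve.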
Rule-matching-based schemes mostly adopt rule templates manually constructed by linguistic experts. Features such as statistical information, punctuation marks, keywords, indicator words, direction words, position words, and center words can be selected in the templates, and pattern and string matching serve as the main means of obtaining a candidate information set for entity information recognition and outputting the recognition result.
The scheme based on the BERT + CRF model is a deep-learning-based scheme. Through self-supervised learning on a massive corpus, BERT can learn a good feature representation for each individual word. After the semantic representation of each word is extracted, the word sequence can be labeled by the CRF, thereby extracting the entities in a sentence. Compared with rule- and dictionary-matching-based schemes, the BERT + CRF scheme has better expressive power and generalization ability, and higher universality.
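The CRF labeling step ultimately yields a BIO-style tag sequence from which entity spans are read off. A minimal sketch of that decoding step follows; the tag names (`B-LOC`, `I-PER`, etc.) are illustrative conventions, not taken from the patent:

```python
def decode_bio(tokens, tags):
    """Collect (entity_text, type, start, end) spans from BIO tags.

    tokens: list of word segments; tags: parallel list of labels such
    as "B-LOC", "I-LOC", "O" (label names are illustrative).
    """
    entities, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        if tag == "O" or tag.startswith("B-"):
            if start is not None:
                entities.append(("".join(tokens[start:i]), etype, start, i - 1))
                start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # Tolerate an I- tag without a preceding B- by opening a span.
            start, etype = i, tag[2:]
    return entities
```

For example, tagging the characters of "NJSCJDQ" as `B-LOC I-LOC I-LOC O B-PER I-PER I-PER` decodes into one place-name span and one person-name span.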
The scheme based on the BERT + MRC model is likewise a deep-learning-based scheme, and the BERT + MRC model performs slightly better than the BERT + CRF model. Its strategy is to convert the sequence-labeling task into prediction of the start and end positions of entity information. After ERNIE (Enhanced Representation through Knowledge Integration, a Chinese pre-training model) extracts and understands the semantics of the input text, a sentence-level encoding is obtained, i.e., each word is represented by a 768-dimensional vector. The word vectors are then reduced by a fully connected layer to the dimension of the number of sequence labels, and the start and end positions of entity information are predicted. Finally, the candidate information set is screened by thresholds and rules to obtain the entity information recognition result.
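The threshold-based start/end decoding described above can be sketched as follows. The pairing rule used here (each above-threshold start is matched to the nearest above-threshold end at or after it) is one simple choice among several, and the probability values in the test are made up for illustration:

```python
def decode_spans(start_probs, end_probs, start_th=0.5, end_th=0.5, max_span=10):
    """Turn per-token start/end probabilities into (start, end) spans.

    A token is a candidate start (end) if its probability clears the
    corresponding threshold; each start is paired with the nearest
    eligible end within max_span tokens.
    """
    starts = [i for i, p in enumerate(start_probs) if p >= start_th]
    ends = [i for i, p in enumerate(end_probs) if p >= end_th]
    spans = []
    for s in starts:
        for e in ends:  # ends are in ascending order by construction
            if s <= e <= s + max_span - 1:
                spans.append((s, e))
                break  # nearest end only
    return spans
```

The rigidity the inventors criticize below lives exactly in these thresholds and in the pairing rule, both of which must be re-tuned per domain.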
In conceiving the present disclosure, the inventors found that the above schemes mainly use the character-based vector information pre-trained by ERNIE during model encoding and ignore the important role of lexical (word-level) information in entity information identification. During decoding, these schemes first generate the start-node and end-node positions for each type of entity information, and then screen the candidate set of start/end node pairs with a combination of multiple thresholds and rules. Such a screening strategy is too rigid: the rules need to be modified to some extent for each new task. For example, the average length of entity information differs greatly across fields, and the nesting of entity information within the same text also differs, so the same rules cannot be applied to entity information identification in all fields.
Fig. 1 schematically illustrates an exemplary system architecture to which entity information identification methods and apparatuses may be applied according to embodiments of the present disclosure.
It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios. For example, in another embodiment, an exemplary system architecture to which the entity information identification method and apparatus may be applied may include a terminal device, but the terminal device may implement the entity information identification method and apparatus provided by the embodiments of the present disclosure without interaction with a server.
As shown in fig. 1, a system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links, and the like.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as a knowledge reading class application, a web browser application, a search class application, an instant messaging tool, a mailbox client and/or social platform software, etc. (as examples only).
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (merely by way of example) that supports content browsed by users on the terminal devices 101, 102, 103. The background management server may analyze and process received data such as user requests, and feed the processing results (e.g., web pages, information, or data obtained or generated according to the requests) back to the terminal devices. The server may be a cloud server, also called a cloud computing server or cloud host, a host product in a cloud computing service system that overcomes the drawbacks of traditional physical hosts and VPS (Virtual Private Server) services, namely high management difficulty and weak service scalability. The server may also be a server of a distributed system or a server combined with a blockchain.
It should be noted that, the entity information identifying method provided by the embodiments of the present disclosure may be generally performed by the terminal device 101, 102, or 103. Accordingly, the entity information identifying apparatus provided by the embodiments of the present disclosure may also be provided in the terminal device 101, 102, or 103.
Alternatively, the entity information identification method provided by the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the entity information identifying apparatus provided by the embodiments of the present disclosure may be generally provided in the server 105. The entity information identification method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the entity information identifying apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
For example, when entity information in a text needs to be identified, the terminal devices 101, 102, 103 may perform named entity recognition on the text to be identified to obtain at least one piece of candidate information; then perform feature extraction on each piece of candidate information to obtain at least one piece of feature information; and then perform deep semantic recognition on each piece of feature information to obtain at least one semantic recognition result, according to which the entity information corresponding to the text is determined from the candidate information. Alternatively, a server or server cluster capable of communicating with the terminal devices 101, 102, 103 and/or the server 105 may analyze the text to be identified and determine the corresponding entity information.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Fig. 2 schematically illustrates a flowchart of an entity information identification method according to one embodiment of the present disclosure.
As shown in fig. 2, the method includes operations S210 to S240.
In operation S210, named entity recognition is performed on the text to be recognized, and at least one candidate information is obtained.
In operation S220, feature extraction is performed on each candidate information to obtain at least one feature information.
In operation S230, deep semantic recognition is performed on each feature information to obtain at least one semantic recognition result.
In operation S240, entity information corresponding to the text to be recognized is determined from at least one candidate information according to at least one semantic recognition result.
According to an embodiment of the present disclosure, the text to be recognized may be any text that does or does not contain entity information. Entity information refers to information in the text that has a specific or strongly referential meaning, and may include at least one of person-name information, place-name information, organization-name information, proper-noun information, product-name information, model information, date and time information, price information, and the like.
According to embodiments of the present disclosure, the named entity recognition of the text to be recognized that yields at least one piece of candidate information can be implemented by various entity candidate extraction models. The feature extraction on each piece of candidate information, the deep semantic recognition on each piece of feature information, and the determination of the entity information from the candidate information according to the semantic recognition results can be implemented by various entity discrimination models. Both the entity candidate extraction model and the entity discrimination model may be generated based on ERNIE. ERNIE is a knowledge-enhanced pre-trained semantic representation model: by encoding semantic units such as words and entities, it learns semantic representations of complete concepts. Using an ERNIE model pre-trained on a large-scale corpus as the encoding layer allows the semantic representation of a sentence to be better extracted for downstream tasks.
According to embodiments of the present disclosure, the entity candidate extraction model may include at least one of a model based on ERNIE + MRC, a model based on Flag (word information) + ERNIE + MRC, and the like, and may also include a model based on BERT + MRC. The entity discrimination model may include a model based on ERNIE + Softmax, and the like.
Fig. 3 schematically illustrates a schematic diagram of an entity candidate extraction model according to an embodiment of the disclosure.
As shown in fig. 3, the entity candidate extraction model includes an input layer 310, an encoding layer 320, and a decoding layer 330. The input layer 310 receives the text feature information of the text to be identified that is input into the entity candidate extraction model, including at least one of the word-segment information and other feature information constituting the text. The encoding layer 320 extracts the semantic features of each word segment in the text according to the text feature information and maps each word segment into vector information, so that the decoding layer 330 can conveniently calculate the probability that each word segment belongs to each type of label. The label types may include, for example, the start point and end point of a person name, the start point and end point of a place name, the start point and end point of an organization name, and the like. When a word segment is determined, according to its probability, to belong to a certain label type, the corresponding label can be marked, and the candidate information identified from the text can be determined from the marking result of the decoding layer.
According to embodiments of the present disclosure, the label type to which a word segment belongs may be marked with a "1". For example, from the marking result shown in fig. 3 it can be determined that word segment 1 is a starting point of a place name, word segment 2 is an ending point of a place name, word segment 3 is an ending point of a place name, word segment 5 is a starting point of a person name, word segment 7 is an ending point of a person name, and so on. Three pieces of candidate information can thus be determined: word segments 1 to 2, word segments 1 to 3, and word segments 5 to 7.
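The step from per-category start/end marks to candidate spans can be sketched as a simple pairing of same-category marks; the data layout (a dict of start/end index lists per category) and category names are illustrative assumptions, using the fig. 3 example numbers:

```python
def candidates_from_marks(marks):
    """Derive candidate spans from the decoding layer's 0/1 marking.

    marks: {category: {"start": [indices], "end": [indices]}}.
    Every same-category (start, end) pair with start <= end becomes
    one candidate, so nested/overlapping candidates are all kept.
    """
    cands = []
    for cat, m in marks.items():
        for s in m["start"]:
            for e in m["end"]:
                if s <= e:
                    cands.append((cat, s, e))
    return sorted(cands)
```

With a place-name start at segment 1, place-name ends at segments 2 and 3, and a person-name start/end at segments 5 and 7, this yields exactly the three candidates described above.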
Fig. 4 schematically illustrates a schematic diagram of an entity discrimination model according to an embodiment of the present disclosure.
As shown in fig. 4, the entity discrimination model includes an input layer 410, an encoding layer 420, and a Softmax layer 430. The input layer 410 may receive, in the form of word segments, the candidate information output by the entity candidate extraction model or otherwise input into the entity discrimination model; [CLS] and [SEP] may be used to delimit the text to be recognized, and NER_O_S and NER_O_E may be used to mark the candidate information within it. After the candidate information is input into the entity discrimination model, the encoding layer 420 may first perform feature extraction on it to obtain feature information, and then extract deep semantics through the ERNIE pre-trained language model to obtain a semantic recognition result. The Softmax layer 430, acting as a classifier, may then provide a classification model that judges the semantic recognition result to determine whether the input candidate information is entity information, thereby determining the entity information of the text to be recognized.
According to embodiments of the present disclosure, the text to be identified is, for example, "NJSCJDQ", whose overall semantics may characterize a certain type of attraction in a region. "NJ" and "NJS" may be place-name information, "CJ" may be scenic-spot information, and "DQ" and "Q" may be proper-noun information. Further, "SC" may be job-title information, and "JDQ" may be person-name information.
For example, "NJCJDQ" is input into the entity candidate extraction model, and "NJ", "NJS", "SC", "CJ", "JD", "JDQ", "DQ", "Q" waiting selection information can be obtained. The information waiting selection information of NJ, NJS, SC, CJ, JD, JDQ, DQ and Q is input into an entity judgment model to perform feature extraction and deep semantic recognition, so that judgment results of the entity information of NJ, NJS, SC, CJ, JDQ, DQ and Q can be obtained, and the judgment results of the entity information of JD are not obtained. The entity information in the text to be recognized "NJSCJDQ" may thus be obtained to include "NJ", "NJs", "SC", "CJ", "JDQ", "DQ", "Q".
For example, the "city suburban park" is input into the entity candidate extraction model, and "city", "suburban", "park" waiting selection information can be obtained. The information waiting for selection of the "city", "suburb", "park" is input into the entity discrimination model, and feature extraction and deep semantic recognition are performed, for example, discrimination results of the "city", "suburb", "park" belonging to the entity information and discrimination results of the "suburb" not belonging to the entity information can be obtained. The entity information in the text "suburban park" that can be identified thus may include "city", "suburban", "park".
Through the embodiments of the present disclosure, the results of named entity recognition on the text to be recognized are taken as candidate information, and feature extraction, semantic recognition, and similar steps are added to determine the entity information among them, which can effectively improve the accuracy of the entity information recognition result. In particular, compared with determining entity information by screening candidates against preset rules, this approach can effectively reduce the cases in which information that does not belong to an entity is selected as entity information of the text to be identified.
The method shown in fig. 2 is further described below with reference to specific examples.
According to an embodiment of the present disclosure, performing named entity recognition on the text to be recognized to obtain at least one piece of candidate information includes: generating a head position identifier and a tail position identifier for each character segment and each word segment in the text to be identified; and performing named entity recognition according to the head position identifiers, the tail position identifiers, and the text to be recognized to obtain the at least one piece of candidate information.
According to embodiments of the present disclosure, a character segment may represent each individual character in the text to be recognized, and a word segment may represent each word (which may include a plurality of characters) in the text. The head and tail position identifiers generated for a character segment are both equal to the position of that character in the text. The head position identifier generated for a word segment is the position of its first character in the text, and the tail position identifier is the position of its last character. Each character segment and word segment in the text can be determined from the head and tail position identifiers, thereby determining the text feature information of the text to be recognized. The word segments that the text can form may be determined by word segmentation, or according to a preset vocabulary that may include all preset possible words.
Fig. 5 schematically illustrates a representation of text feature information in a text to be recognized according to an embodiment of the present disclosure.
As shown in fig. 5, the text to be recognized is, for example, "NJSCJDQ", and the word segments determined for it in the vocabulary include, for example, "NJ", "SC", "CJ", "JDQ", and "DQ". The text feature information determined from the text may therefore include the character segments "N", "J", "S", "C", "J", "D", and "Q" and the word segments "NJ", "SC", "CJ", "JDQ", and "DQ". Since "N" occupies the 1st position in "NJSCJDQ", its head and tail position identifiers are both 1. Since the "J" of "JDQ" occupies the 5th position in "NJSCJDQ" and the "Q" of "JDQ" occupies the 7th position, the head position identifier of "JDQ" is 5 and its tail position identifier is 7. The head and tail position identifiers of the other character segments and word segments can be determined in the same manner.
According to an embodiment of the present disclosure, the text to be identified is, for example, "urban suburban park", and the word segments determined for it in the vocabulary include, for example, "urban", "suburban", and "park". The text feature information determined from the text may include the character segments making up the text as well as those word segments. Since "urban" begins at the 1st position of the text, its head position identifier is 1. Since "suburban" occupies the 3rd through 4th positions of the original text, its head position identifier is 3 and its tail position identifier is 4. The head and tail position identifiers of the other character segments and word segments can be determined in the same manner.
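The head/tail position identifiers in the examples above can be computed with a small sketch. Matching vocabulary words by substring search is an illustrative simplification (positions are 1-based, as in the description):

```python
def head_tail_ids(text, vocabulary):
    """Return (segment, head, tail) triples with 1-based positions:
    one per character, plus one per vocabulary word found in the text."""
    feats = [(ch, i + 1, i + 1) for i, ch in enumerate(text)]
    for word in vocabulary:
        start = text.find(word)
        while start != -1:  # record every occurrence of the word
            feats.append((word, start + 1, start + len(word)))
            start = text.find(word, start + 1)
    return feats
```

For "NJSCJDQ" this reproduces the identifiers above: each character gets head = tail = its own position, while "JDQ" gets head 5 and tail 7.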
According to an embodiment of the present disclosure, by generating head position identifiers and tail position identifiers for the character segments and word segments in the text to be recognized, character vector information and word vector information of the text to be recognized can be fused in the encoding stage of named entity recognition. The richer text characteristic information reinforces the boundaries of entity information, so that in particular the recognition of entity information with longer segments can achieve higher recognition accuracy.
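The head and tail position identifiers described above can be sketched as follows. This is a minimal Python illustration under assumed helper names, not the disclosure's actual implementation; a real system would feed these identifiers into the encoding stage of the recognition model.

```python
# Illustrative sketch (assumed names): build 1-indexed head/tail position
# identifiers for every character of the text and for every vocabulary
# word segment matched inside it.

def build_position_ids(text, vocabulary):
    features = []
    # Character segments: head and tail both point at the character itself.
    for i, ch in enumerate(text, start=1):
        features.append((ch, i, i))
    # Word segments: head is the position of the first character, tail the last.
    for word in vocabulary:
        start = text.find(word)
        while start != -1:
            features.append((word, start + 1, start + len(word)))
            start = text.find(word, start + 1)
    return features

features = build_position_ids("NJSCJDQ", ["NJ", "SC", "CJ", "JDQ", "DQ"])
# Contains ("N", 1, 1) for the first character and ("JDQ", 5, 7) for "JDQ".
```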
According to an embodiment of the present disclosure, performing named entity recognition on a text to be recognized to obtain at least one candidate information includes: determining, for each word segment in the text to be identified, a first class probability of the word segment being a start word segment of a predefined class and a second class probability of it being a stop word segment of the predefined class; determining the word segment corresponding to the first class probability as a start word segment in the case that the value of the first class probability is greater than or equal to a first preset threshold; determining the word segment corresponding to the second class probability as a stop word segment in the case that the value of the second class probability is greater than or equal to a second preset threshold; and determining candidate information according to the start word segment and the stop word segment.
According to an embodiment of the present disclosure, the predefined categories may include at least one of a person name category, a place name category, an organization category, a proprietary name category, a product name category, a model category, a date category, a time category, a price category, and the like.
According to embodiments of the present disclosure, the first preset threshold and the second preset threshold may be set to different values for different predefined categories. For example, in the case where the predefined category is the person name category, the first preset threshold may be set to 0.9 and the second preset threshold to 0.5; in the case where the predefined category is the place name category, the first preset threshold may be set to 0.8 and the second preset threshold to 0.7, and so on. The first and second preset thresholds may be used to determine the category to which a word segment belongs. In particular, in the case that one word segment may correspond to a plurality of predefined categories, the class probabilities calculated for that word segment may comprise a plurality of different first or second class probabilities, and the category of the word segment may then be determined according to the first and second preset thresholds.
According to an embodiment of the present disclosure, in the process of performing named entity recognition on the text to be recognized "NJSCJDQ" according to the corresponding character and word segments, the first class probability of "N" as a person name class start word segment is, for example, 0.3, which is smaller than the first preset threshold of 0.9 for the person name class. The first class probability of "N" as a place name class start word segment is, for example, 0.8, which is equal to the first preset threshold of 0.8 for the place name class. The second class probability of "J" as a place name class stop word segment is, for example, 0.8, which is greater than the second preset threshold of 0.7 for the place name class. It may therefore be determined that "N" in "NJSCJDQ" may be a place name class start word segment and "J" may be a place name class stop word segment, and the candidate information obtained for "NJSCJDQ" may include "NJ".
According to an embodiment of the present disclosure, the text to be identified is, for example, "suburban park". In the process of performing named entity recognition on "suburban park" according to the corresponding character and word segments, the first class probability of "suburban" as a person name class start word segment is, for example, 0.1, which is smaller than the first preset threshold of 0.9 for the person name class. The first class probability of "suburban" as a place name class start word segment is, for example, 0.8, which is equal to the first preset threshold of 0.8 for the place name class. The second class probability of "park" as a place name class stop word segment is, for example, 0.8, which is greater than the second preset threshold of 0.7 for the place name class. It may therefore be determined that "suburban" may be a place name class start word segment and "park" a place name class stop word segment, and the candidate information obtained for "suburban park" may include "suburban park".
By introducing the preset thresholds and comparing the class probabilities of a word segment belonging to the various predefined classes, embodiments of the present disclosure can enhance the recognition effect when a word segment is recognized as a start word segment or a stop word segment, and improve the probability that the candidate information can serve as entity information.
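The threshold-based candidate extraction above can be sketched as follows, assuming per-character start and stop probabilities for a single predefined class are already available from the recognition model (the function name and probability values are illustrative assumptions).

```python
# Illustrative sketch (assumed names and values): extract candidate spans
# for ONE predefined class from per-character probabilities.

def extract_candidates(chars, start_probs, end_probs, start_thr, end_thr):
    # Characters whose start probability reaches the first preset threshold
    # become start word segments; likewise for stop word segments.
    starts = [i for i, p in enumerate(start_probs) if p >= start_thr]
    ends = [i for i, p in enumerate(end_probs) if p >= end_thr]
    # Every start paired with a stop at or after it yields a candidate.
    return ["".join(chars[s:e + 1]) for s in starts for e in ends if e >= s]

# Place name class for "NJSCJDQ": only "N" passes the start threshold (0.8)
# and only the first "J" passes the stop threshold (0.7), yielding ["NJ"].
candidates = extract_candidates(
    list("NJSCJDQ"),
    [0.8, 0.1, 0.2, 0.1, 0.1, 0.1, 0.1],   # first class probabilities
    [0.1, 0.8, 0.2, 0.1, 0.1, 0.1, 0.1],   # second class probabilities
    start_thr=0.8, end_thr=0.7)
```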
According to an embodiment of the present disclosure, after named entity recognition is performed on the text to be recognized to obtain at least one candidate information, the at least one candidate information may further be selected according to a preset rule to obtain target candidate information. Feature extraction and semantic recognition are then performed on the target candidate information, and the entity information corresponding to the text to be recognized is determined according to the target candidate information.
Fig. 6 schematically illustrates a flowchart of an entity information identification method according to another embodiment of the present disclosure.
As shown in fig. 6, the method includes operations S610 to S650.
In operation S610, named entity recognition is performed on the text to be recognized to obtain at least one candidate information.
In operation S620, at least one candidate information is selected according to a preset rule, so as to obtain at least one target candidate information.
In operation S630, feature extraction is performed on each target candidate information to obtain at least one target feature information.
In operation S640, deep semantic recognition is performed on each target feature information, so as to obtain at least one target semantic recognition result.
In operation S650, entity information corresponding to the text to be recognized is determined from at least one target candidate information according to at least one target semantic recognition result.
According to an embodiment of the present disclosure, the preset rules may include various custom rules for performing preliminary selection on candidate information to obtain target candidate information. For example, the preset rule may include filtering candidate information that does not conform to the semantics expressed by the semantic information according to the semantic information of the text to be recognized. The preset rule may further include filtering candidate information existing in a list of the preset blacklist according to the preset blacklist, and the like.
Through the embodiment of the disclosure, the candidate information is secondarily selected according to the named entity recognition and the preset rule, and the entity information is determined, so that the accuracy of the entity information recognized by the text to be recognized can be effectively improved.
According to an embodiment of the present disclosure, selecting at least one candidate information according to a preset rule to obtain target candidate information includes, for each candidate information: first, normalizing the first target number of word segments in the candidate information to obtain a normalized value, and determining a first class probability of the start word segment of the candidate information being a start word segment of the predefined class and a second class probability of the stop word segment of the candidate information being a stop word segment of the predefined class; then, calculating a target sum of the first class probability, the second class probability, and the normalized value, and determining the candidate information as target candidate information in the case that the target sum is greater than or equal to a third preset threshold.
According to an embodiment of the present disclosure, the first target number may be used to characterize the number of word segments included in the candidate information. The normalization processing may include at least one of normalization processing according to a preset program, normalization processing according to a first target number of word segments of all candidate information, and the like, and parameters related to semantic features, category features, and the like of the candidate information may be included in the preset program. The third preset threshold is used, for example, to select reliable target candidate information from candidate information. For example, the target and candidate information smaller than the third preset threshold may be filtered, and the target candidate information may be determined according to the remaining candidate information.
According to an embodiment of the present disclosure, the text to be recognized is, for example, "NJSCJDQ", the candidate information includes, for example, "CJ" and "JD", and the third preset threshold is, for example, 2.0. The first target numbers of the word segments of "CJ" and "JD" are normalized by a preset program to obtain, for example, a normalized value of 0.8 for "CJ" and 0.3 for "JD". Further, it may be determined, for example, that the first class probability of "C" of "CJ" as a place name class start word segment is 0.8, and the second class probability of "J" of "CJ" as a place name class stop word segment is 0.8. It may also be determined, for example, that the first class probability of "J" of "JD" as a person name class start word segment is 0.5, and the second class probability of "D" of "JD" as a person name class stop word segment is 0.3. The target sum calculated for "CJ" is thus 2.4 and that for "JD" is 1.1. In combination with the third preset threshold of 2.0, "JD" may be filtered out of the candidate information, and the target candidate information may be determined according to "CJ".
According to an embodiment of the present disclosure, the text to be identified is, for example, "suburban park", the candidate information includes, for example, "city suburb" and "suburban", and the third preset threshold is, for example, 2.1. The first target numbers of the word segments of "city suburb" and "suburban" are normalized by a preset program to obtain, for example, a normalized value of 0.4 for "city suburb" and 0.9 for "suburban". Further, it may be determined, for example, that the first class probability of "city" of "city suburb" as a place name class start word segment is 0.5, and the second class probability of "suburb" of "city suburb" as a place name class stop word segment is 0.4. It may also be determined, for example, that the first class probability of "suburban" as a place name class start word segment is 0.7, and its second class probability as a place name class stop word segment is 0.8. The target sum calculated for "city suburb" is thus 1.3, and that for "suburban" is 2.4. In combination with the third preset threshold of 2.1, "city suburb" may be filtered out of the candidate information, and the target candidate information may be determined according to "suburban".
Through the above embodiment of the present disclosure, in combination with the first target number information of the word segments included in the candidate information, the start word segment and the end word segment, which are respectively used as the category probabilities of the start word segment and the end word segment of the predefined category, an implementation method of a preset rule is provided, so that the candidate information can be selected, and more reliable target candidate information can be obtained.
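The target-sum rule above can be sketched as follows. The dictionary layout and the helper name are assumptions for illustration, and the example values mirror the "CJ"/"JD" example.

```python
# Illustrative sketch of the target-sum rule (assumed dict layout): keep a
# candidate when first class probability + second class probability +
# normalized value reaches the third preset threshold.

def target_sum_filter(candidates, third_threshold):
    kept = []
    for c in candidates:
        target_sum = c["start_prob"] + c["end_prob"] + c["norm"]
        if target_sum >= third_threshold:
            kept.append(c["span"])
    return kept

candidates = [
    {"span": "CJ", "start_prob": 0.8, "end_prob": 0.8, "norm": 0.8},  # 2.4
    {"span": "JD", "start_prob": 0.5, "end_prob": 0.3, "norm": 0.3},  # 1.1
]
# With the third preset threshold of 2.0, only "CJ" survives.
```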
According to an embodiment of the present disclosure, selecting at least one candidate information according to a preset rule, and obtaining target candidate information includes: for each candidate information, determining a second target number of word segments included in the candidate information, and determining the candidate information as target candidate information in the case that the second target number is smaller than or equal to a fourth preset threshold.
According to an embodiment of the present disclosure, in the case where the candidate information is the same candidate information, the second target number is equal to the first target number described above. The fourth preset threshold may be used to limit the length of an entity, for example to filter out candidate information including too many word segments. The value of the fourth preset threshold may be determined according to the semantic information of the text to be recognized, for example according to the number of word segments included in the word segments of the text, or it may be custom-defined.
For example, according to the semantic information of "NJSCJDQ", the value of the fourth preset threshold may be preset to 3. The candidate information identified for "NJSCJDQ" further includes, for example, "NJSC" and "SCJDQ", which include 4 and 5 word segments respectively, so that "NJSC" and "SCJDQ" may be filtered out of the candidate information, and the target candidate information may be determined according to the remaining candidate information.
According to an embodiment of the present disclosure, according to the semantic information of "suburban park", the value of the fourth preset threshold may be preset to 4. The candidate information identified for "suburban park" further includes, for example, candidate information containing 4 and 5 word segments respectively; the candidate containing 5 word segments exceeds the threshold and may be filtered out of the candidate information, and the target candidate information may be determined according to the remaining candidate information.
Through the above embodiment of the present disclosure, in combination with the second target number of word segments included in the candidate information, another implementation method of a preset rule is provided, and the candidate information may be selected to obtain target candidate information meeting the requirement of entity length.
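The length-based rule above can be sketched as follows, treating each character as one word segment for simplicity (a simplifying assumption; the disclosure counts word segments more generally).

```python
# Illustrative sketch of the length rule: drop candidates whose second target
# number of word segments exceeds the fourth preset threshold.

def length_filter(candidates, fourth_threshold):
    return [c for c in candidates if len(c) <= fourth_threshold]

# With the fourth preset threshold of 3, "NJSC" (4) and "SCJDQ" (5) are dropped.
kept = length_filter(["NJ", "CJ", "NJSC", "SCJDQ"], 3)
```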
According to an embodiment of the present disclosure, selecting at least one candidate information according to a preset rule, and obtaining target candidate information includes: and filtering the candidate information comprising the predefined word segment information to obtain new candidate information under the condition that the candidate information comprising the predefined word segment information exists in at least one candidate information. And determining target candidate information according to the new candidate information.
According to an embodiment of the present disclosure, a filtered word segment vocabulary may be preset for filtering the content of candidate information. The vocabulary may include predefined word segment information that cannot occur inside an entity, such as stop words (for example, "of" and "in") and symbols such as "," and "-". In the case that a candidate information includes predefined word segment information from the filtered word segment vocabulary, the corresponding candidate information may be filtered out, and the remaining candidate information is used as new candidate information to determine the target candidate information.
For example, the candidate information identified for "NJS,CJDQ" may include "S,C", which may be filtered out because the candidate information includes the predefined word segment "," from the filtered word segment vocabulary.
According to embodiments of the present disclosure, the candidate information identified for "city, suburban park" may include "city, suburban", which may be filtered out because the candidate information includes the predefined word segment "," from the filtered word segment vocabulary.
Through the above embodiment of the present disclosure, in combination with predefined word segment information, another implementation method of a preset rule is provided, and candidate information may be selected to obtain target candidate information meeting the requirement of entity content.
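The content-based rule above can be sketched as follows; the filter vocabulary below is an illustrative assumption containing only punctuation.

```python
# Illustrative sketch of the content rule (assumed filter vocabulary).
PREDEFINED_SEGMENTS = {",", "-"}

def content_filter(candidates, predefined=PREDEFINED_SEGMENTS):
    # Drop any candidate containing a predefined segment that cannot
    # occur inside an entity.
    return [c for c in candidates
            if not any(seg in c for seg in predefined)]

# "S,C" contains the predefined segment "," and is filtered out.
kept = content_filter(["NJ", "S,C"])
```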
According to an embodiment of the present disclosure, selecting at least one candidate information according to a preset rule to obtain target candidate information includes: calculating, for each candidate information, a target sum of a first class probability of the start word segment of the candidate information being a start word segment of the predefined class, a second class probability of the stop word segment of the candidate information being a stop word segment of the predefined class, and a normalized value corresponding to the number of word segments in the candidate information; sorting the candidate information according to the target sums to obtain a sorting result; and determining a predetermined number of candidate information as target candidate information according to the sorting result.
According to embodiments of the present disclosure, the predetermined number may be used to determine the maximum number of entity information identified for one text to be identified, the number of entity information of the same category identified for one text to be identified, and so on.
For example, the predetermined number may specify that the entity information recognized for the text to be recognized "NJSCJDQ" includes 0 person name class entities and 3 place name class entities. The candidate information includes, for example, the place name class "NJ", the place name class "NJS", the person name class "SC", the place name class "CJ", the person name class "JDQ", the place name class "DQ", and the place name class "Q", and the target sums calculated for the candidate information are, for example, 2.3, 2.5, 1.0, 2.4, 0.7, 2.0, and 1.7 in this order. The sorting result of ranking the candidate information may be "NJS", "CJ", "NJ", "DQ", "Q", "SC", "JDQ", and the target candidate information may be determined to include "NJS", "CJ", and "NJ" according to the predetermined number.
According to an embodiment of the present disclosure, corresponding to the text to be recognized "suburban park", the predetermined number may specify, for example, 2 place name class entities. The candidate information includes, for example, the place name class "city", the place name class "suburban", and the place name class "park", and the target sums calculated for the candidate information are, for example, 2.0, 2.5, and 2.3 in this order. The sorting result of ranking the candidate information may be "suburban", "park", "city", and the target candidate information may be determined to include "suburban" and "park" according to the predetermined number.
Through the above embodiment of the present disclosure, in combination with the predetermined number, another implementation method of a preset rule is provided, which can select candidate information to obtain a predetermined number of target candidate information, reduce the data volume, and simplify subsequent calculation.
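The ranking-based rule above can be sketched as follows; the pairs of span and target sum reuse the place name class values from the "NJSCJDQ" example.

```python
# Illustrative sketch of the ranking rule: sort candidates by target sum in
# descending order and keep the predetermined number.

def select_top_k(candidates, k):
    ranked = sorted(candidates, key=lambda item: item[1], reverse=True)
    return [span for span, _ in ranked[:k]]

place_names = [("NJ", 2.3), ("NJS", 2.5), ("CJ", 2.4),
               ("DQ", 2.0), ("Q", 1.7)]
selected = select_top_k(place_names, 3)  # ["NJS", "CJ", "NJ"]
```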
According to an embodiment of the present disclosure, the entity information identification method may be implemented using a deep learning model based on PaddlePaddle 2.0, where the model may include an entity candidate extraction model based on FLAT + ERNIE + MRC, the preset rules, and an entity discrimination model based on ERNIE + Softmax, and the two models may be trained jointly.
Through the embodiments of the present disclosure, the original single-threshold scheme of the model is converted into a multi-threshold scheme in the entity candidate screening stage, which yields a better recognition effect and can improve the accuracy and the recall rate of candidate information selection in entity information recognition tasks oriented to different fields and different characteristics.
Fig. 7 schematically illustrates a block diagram of an entity information identification apparatus according to an embodiment of the present disclosure.
As shown in fig. 7, the entity information recognition apparatus 700 includes an entity recognition module 710, a feature extraction module 720, a semantic recognition module 730, and a determination module 740.
The entity recognition module 710 is configured to perform named entity recognition on the text to be recognized, so as to obtain at least one candidate information.
The feature extraction module 720 is configured to perform feature extraction on each candidate information to obtain at least one feature information.
The semantic recognition module 730 is configured to perform deep semantic recognition on each feature information to obtain at least one semantic recognition result.
The determining module 740 is configured to determine, according to the at least one semantic recognition result, entity information corresponding to the text to be recognized from the at least one candidate information.
According to an embodiment of the present disclosure, an entity identification module includes a generation unit and an entity identification unit.
And the generating unit is used for generating a head position identifier and a tail position identifier for each word segment and each word segment in the text to be identified.
And the entity identification unit is used for carrying out named entity identification on the text to be identified according to the head position identification, the tail position identification and the text to be identified to obtain at least one candidate information.
According to an embodiment of the present disclosure, the entity identification module includes a first determination unit, a second determination unit, a third determination unit, and a fourth determination unit.
A first determining unit, configured to determine, for each word segment in the text to be identified, a first class probability of the word segment being a start word segment of the predefined class and a second class probability of the word segment being a stop word segment of the predefined class.
And a second determining unit configured to determine, as the start word segment, a word segment corresponding to the first class probability, in a case where the value of the first class probability is greater than or equal to a first preset threshold.
And a third determining unit configured to determine, as the termination word segment, a word segment corresponding to the second class probability if the value of the second class probability is greater than or equal to a second preset threshold.
And a fourth determining unit for determining candidate information according to the start word segment and the end word segment.
According to an embodiment of the present disclosure, the entity information identifying apparatus further includes a selection module.
And the selection module is used for selecting at least one candidate information according to a preset rule to obtain target candidate information.
The feature extraction module is also used for extracting features of the target candidate information.
According to an embodiment of the present disclosure, the selection module comprises a fifth determination unit.
A fifth determining unit configured to, for each candidate information: normalize the first target number of word segments in the candidate information to obtain a normalized value; determine a first class probability of the start word segment of the candidate information being a start word segment of the predefined class and a second class probability of the stop word segment of the candidate information being a stop word segment of the predefined class; calculate a target sum of the first class probability, the second class probability, and the normalized value; and determine the candidate information as target candidate information in the case that the target sum is greater than or equal to a third preset threshold.
According to an embodiment of the present disclosure, the selection module comprises a sixth determination unit.
A sixth determining unit configured to, for each candidate information: a second target number of word segments included in the candidate information is determined. And determining the candidate information as target candidate information in the case that the second target number is less than or equal to a fourth preset threshold value.
According to an embodiment of the disclosure, the selection module comprises a filtering unit and a seventh determination unit.
And the filtering unit is used for filtering the candidate information comprising the predefined word segment information to obtain new candidate information under the condition that the candidate information comprising the predefined word segment information exists in at least one piece of candidate information.
And a seventh determining unit for determining target candidate information according to the new candidate information.
According to an embodiment of the present disclosure, the selection module includes a calculation unit, a sorting unit, and an eighth determination unit.
The calculating unit is configured to calculate a target sum of a first class probability of the start word segment of the candidate information being a start word segment of the predefined class, a second class probability of the stop word segment of the candidate information being a stop word segment of the predefined class, and a normalized value corresponding to the number of word segments in the candidate information.
And the sequencing unit is used for sequencing the candidate information according to the target and to obtain a sequencing result.
An eighth determining unit for determining a predetermined number of candidate information as target candidate information based on the sorting result.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the entity information identification method as described above.
According to an embodiment of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the entity information identification method as described above.
According to an embodiment of the present disclosure, a computer program product comprising a computer program which, when executed by a processor, implements the entity information identification method as described above.
Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed in this patent.
As shown in fig. 8, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, such as the entity information identification method. For example, in some embodiments, the entity information identification method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When a computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the entity information identification method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the entity information identification method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (14)

1. An entity information identification method, comprising:
carrying out named entity recognition on the text to be recognized to obtain at least one piece of candidate information;
performing feature extraction on each piece of candidate information to obtain at least one piece of feature information;
carrying out deep semantic recognition on each piece of feature information to obtain at least one semantic recognition result; and
determining entity information corresponding to the text to be recognized from the at least one candidate information according to the at least one semantic recognition result;
wherein the performing named entity recognition on the text to be recognized to obtain at least one piece of candidate information comprises:
determining a first class probability that each word segment in the text to be recognized is a starting word segment of a predefined class, and a second class probability that each word segment is a termination word segment of the predefined class;
determining a word segment corresponding to the first class probability as a starting word segment under the condition that the value of the first class probability is greater than or equal to a first preset threshold value;
determining a word segment corresponding to the second class probability as a termination word segment under the condition that the value of the second class probability is greater than or equal to a second preset threshold value; and
determining the candidate information according to the starting word segment and the termination word segment.
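The start/termination pairing recited in claim 1 can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation; the function name, threshold defaults, and the rule for pairing starts with ends are assumptions:

```python
def extract_candidates(start_probs, end_probs, start_threshold=0.5, end_threshold=0.5):
    """Pair starting and termination word segments into candidate spans.

    start_probs[i] / end_probs[i] are the first/second class probabilities
    that word segment i starts / terminates an entity of a predefined class.
    """
    # Word segments whose probability reaches the preset threshold.
    starts = [i for i, p in enumerate(start_probs) if p >= start_threshold]
    ends = [j for j, p in enumerate(end_probs) if p >= end_threshold]
    # A pair (i, j) is a candidate only if the termination word segment
    # does not precede the starting word segment.
    return [(i, j) for i in starts for j in ends if j >= i]
```

For example, with start probabilities [0.9, 0.1, 0.2, 0.7], end probabilities [0.1, 0.8, 0.1, 0.9], and both thresholds at 0.5, the candidate spans are (0, 1), (0, 3), and (3, 3).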
2. The method of claim 1, further comprising:
selecting the at least one piece of candidate information according to a preset rule to obtain target candidate information;
wherein the performing feature extraction on each piece of candidate information comprises:
performing feature extraction on the target candidate information.
3. The method of claim 2, wherein the selecting the at least one piece of candidate information according to a preset rule to obtain the target candidate information comprises:
for each piece of the candidate information:
normalizing the first target number of word segments in the candidate information to obtain a normalized value;
determining a first class probability that a starting word segment of the candidate information is a starting word segment of a predefined class, and a second class probability that a termination word segment of the candidate information is a termination word segment of the predefined class;
calculating a target sum of the first class probability, the second class probability and the normalized value; and
determining the candidate information as the target candidate information under the condition that the target sum is greater than or equal to a third preset threshold value.
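A minimal sketch of the claim-3 style target-sum filter. The claim does not fix a normalization scheme, so dividing the word-segment count by an assumed maximum length is an illustrative choice, as are the function name and threshold default:

```python
def passes_target_sum(start_prob, end_prob, num_segments,
                      max_segments=10, threshold=1.5):
    """Keep a candidate if first class probability + second class probability
    + normalized word-segment count reaches the third preset threshold."""
    normalized = num_segments / max_segments  # one possible normalization
    return start_prob + end_prob + normalized >= threshold
```

A short candidate with confident start/end probabilities (e.g. 0.9 and 0.8 over 2 word segments) passes, while a weak one (0.3 and 0.3 over 1 word segment) is filtered out.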
4. The method of claim 2, wherein the selecting the at least one piece of candidate information according to a preset rule to obtain the target candidate information comprises:
for each piece of the candidate information:
determining a second target number of word segments included in the candidate information; and
determining the candidate information as the target candidate information under the condition that the second target number is smaller than or equal to a fourth preset threshold value.
5. The method of claim 2, wherein the selecting the at least one piece of candidate information according to a preset rule to obtain the target candidate information comprises:
filtering out the candidate information comprising predefined word segment information, under the condition that candidate information comprising the predefined word segment information exists in the at least one candidate information, so as to obtain new candidate information; and
determining the target candidate information according to the new candidate information.
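The claim-5 style filtering step can be sketched as below. Representing each candidate as a list of word segments, and reading "predefined word segment information" as a stop-word-like blocklist, are both assumptions for illustration:

```python
def filter_predefined_segments(candidates, predefined_segments):
    """Drop every candidate that contains any predefined word segment
    (e.g. a blocklist of stop words); the survivors form the new candidates."""
    return [c for c in candidates if not set(c) & predefined_segments]
```

For example, with candidates [["New", "York"], ["the", "city"]] and the predefined set {"the"}, only ["New", "York"] survives.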
6. The method of claim 2, wherein the selecting the at least one piece of candidate information according to a preset rule to obtain the target candidate information comprises:
calculating, for each piece of the candidate information, a target sum of a first class probability that the starting word segment of the candidate information is a starting word segment of a predefined class, a second class probability that the termination word segment of the candidate information is a termination word segment of the predefined class, and a normalized value corresponding to the number of word segments in the candidate information;
sorting the candidate information according to the target sums to obtain a sorting result; and
determining a preset number of pieces of candidate information as the target candidate information according to the sorting result.
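A sketch of the claim-6 style ranking and selection. The dictionary representation of a candidate, the normalization by an assumed maximum segment count, and the default preset number are illustrative assumptions, not the patent's code:

```python
def rank_by_target_sum(candidates, top_k=2, max_segments=10):
    """Score each candidate by start probability + end probability +
    normalized length, sort in descending order, and keep the top k."""
    ranked = sorted(
        candidates,
        key=lambda c: c["start_prob"] + c["end_prob"] + len(c["segments"]) / max_segments,
        reverse=True,
    )
    return ranked[:top_k]
```

With three candidates scoring 0.5, 1.9 and 1.4, the preset number top_k=1 keeps only the candidate scoring 1.9.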
7. An entity information identification apparatus, comprising:
the entity recognition module is used for carrying out named entity recognition on the text to be recognized to obtain at least one piece of candidate information;
the feature extraction module is used for performing feature extraction on each piece of candidate information to obtain at least one piece of feature information;
the semantic recognition module is used for carrying out deep semantic recognition on each piece of feature information to obtain at least one semantic recognition result; and
the determining module is used for determining entity information corresponding to the text to be recognized from the at least one candidate information according to the at least one semantic recognition result;
wherein, the entity identification module includes:
a first determining unit configured to determine a first class probability that each word segment in the text to be recognized is a starting word segment of a predefined class, and a second class probability that each word segment is a termination word segment of the predefined class;
a second determining unit, configured to determine, as a starting word segment, a word segment corresponding to the first class probability if the value of the first class probability is greater than or equal to a first preset threshold;
a third determining unit, configured to determine, as a termination word segment, a word segment corresponding to the second class probability if the value of the second class probability is greater than or equal to a second preset threshold; and
a fourth determining unit configured to determine the candidate information according to the starting word segment and the termination word segment.
8. The apparatus of claim 7, further comprising:
the selection module is used for selecting the at least one candidate information according to a preset rule to obtain target candidate information;
wherein the feature extraction module is used for performing feature extraction on the target candidate information.
9. The apparatus of claim 8, wherein the selection module comprises:
a fifth determining unit configured to, for each of the candidate information:
normalizing the first target number of word segments in the candidate information to obtain a normalized value;
determining a first class probability that a starting word segment of the candidate information is a starting word segment of a predefined class, and a second class probability that a termination word segment of the candidate information is a termination word segment of the predefined class;
calculating a target sum of the first class probability, the second class probability and the normalized value; and
determining the candidate information as the target candidate information under the condition that the target sum is greater than or equal to a third preset threshold value.
10. The apparatus of claim 8, wherein the selection module comprises:
a sixth determining unit configured to, for each of the candidate information:
determining a second target number of word segments included in the candidate information; and
determining the candidate information as the target candidate information under the condition that the second target number is smaller than or equal to a fourth preset threshold value.
11. The apparatus of claim 8, wherein the selection module comprises:
the filtering unit is used for filtering out the candidate information comprising the predefined word segment information to obtain new candidate information when candidate information comprising the predefined word segment information exists in the at least one candidate information; and
a seventh determining unit configured to determine the target candidate information according to the new candidate information.
12. The apparatus of claim 8, wherein the selection module comprises:
a calculating unit configured to calculate a target sum of a first class probability that the starting word segment of the candidate information is a starting word segment of a predefined class, a second class probability that the termination word segment of the candidate information is a termination word segment of the predefined class, and a normalized value corresponding to the number of word segments in the candidate information;
a sorting unit configured to sort the candidate information according to the target sums to obtain a sorting result; and
an eighth determining unit configured to determine a preset number of pieces of candidate information as the target candidate information based on the sorting result.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-6.
CN202111111471.6A 2021-09-22 2021-09-22 Entity information identification method, device, electronic equipment and storage medium Active CN113807099B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111111471.6A CN113807099B (en) 2021-09-22 2021-09-22 Entity information identification method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111111471.6A CN113807099B (en) 2021-09-22 2021-09-22 Entity information identification method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113807099A CN113807099A (en) 2021-12-17
CN113807099B true CN113807099B (en) 2024-02-13

Family

ID=78896263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111111471.6A Active CN113807099B (en) 2021-09-22 2021-09-22 Entity information identification method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113807099B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165384A (en) * 2018-08-23 2019-01-08 成都四方伟业软件股份有限公司 A kind of name entity recognition method and device
CN110287479A (en) * 2019-05-20 2019-09-27 平安科技(深圳)有限公司 Name entity recognition method, electronic device and storage medium
US10629186B1 (en) * 2013-03-11 2020-04-21 Amazon Technologies, Inc. Domain and intent name feature identification and processing
CN112215008A (en) * 2020-10-23 2021-01-12 中国平安人寿保险股份有限公司 Entity recognition method and device based on semantic understanding, computer equipment and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080052262A1 (en) * 2006-08-22 2008-02-28 Serhiy Kosinov Method for personalized named entity recognition


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"A Survey on Deep Learning for Named Entity Recognition"; Jing Li et al.; arXiv; full text *
Chinese entity recognition based on the BERT-BiLSTM-CRF model; Xie Teng; Yang Jun'an; Liu Hui; 计算机***应用 (07); full text *
Application of pre-trained language models to named entity recognition in Chinese electronic medical records; Wu Xiaoxue; Zhang Qinghui; 电子质量 (09); full text *

Also Published As

Publication number Publication date
CN113807099A (en) 2021-12-17

Similar Documents

Publication Publication Date Title
CN113553412B (en) Question-answering processing method, question-answering processing device, electronic equipment and storage medium
CN113836925B (en) Training method and device for pre-training language model, electronic equipment and storage medium
CN113053367B (en) Speech recognition method, speech recognition model training method and device
CN114416943A (en) Training method and device for dialogue model, electronic equipment and storage medium
CN115099239B (en) Resource identification method, device, equipment and storage medium
CN114861637B (en) Spelling error correction model generation method and device, and spelling error correction method and device
CN114416976A (en) Text labeling method and device and electronic equipment
CN112989235A (en) Knowledge base-based internal link construction method, device, equipment and storage medium
CN116975400B (en) Data classification and classification method and device, electronic equipment and storage medium
CN115658903B (en) Text classification method, model training method, related device and electronic equipment
CN116662484A (en) Text regularization method, device, equipment and storage medium
CN114880520B (en) Video title generation method, device, electronic equipment and medium
CN113807099B (en) Entity information identification method, device, electronic equipment and storage medium
CN115827867A (en) Text type detection method and device
CN113553833B (en) Text error correction method and device and electronic equipment
CN113204616B (en) Training of text extraction model and text extraction method and device
CN112948573B (en) Text label extraction method, device, equipment and computer storage medium
CN114417862A (en) Text matching method, and training method and device of text matching model
CN112925912A (en) Text processing method, and synonymous text recall method and device
CN112784600A (en) Information sorting method and device, electronic equipment and storage medium
CN115035890B (en) Training method and device of voice recognition model, electronic equipment and storage medium
CN116244432B (en) Pre-training method and device for language model and electronic equipment
CN113032540B (en) Man-machine interaction method, device, equipment and storage medium
CN114417871B (en) Model training and named entity recognition method, device, electronic equipment and medium
CN115374779B (en) Text language identification method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant