CN112307766A - Method, apparatus, electronic device and medium for identifying preset category entities - Google Patents

Method, apparatus, electronic device and medium for identifying preset category entities

Info

Publication number
CN112307766A
Authority
CN
China
Prior art keywords
text
preset
preset category
matched
entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010999268.6A
Other languages
Chinese (zh)
Inventor
杨帅
张亚
文豪
谢佩
徐晓涵
闫盈盈
翟所迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Peking University Third Hospital
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Peking University Third Hospital
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd, and Peking University Third Hospital
Priority to CN202010999268.6A
Publication of CN112307766A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • G06F40/186Templates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present disclosure disclose methods, apparatuses, electronic devices, and media for identifying preset category entities. One embodiment of the method comprises: acquiring a text to be recognized; acquiring a preset category entity recognition template, wherein the preset category entity recognition template comprises at least one text matching structure; parsing the text to be recognized by using the preset category entity recognition template to generate a pattern string to be matched, wherein the pattern string to be matched comprises a text matching structure identification sequence; and generating a recognition result according to the matching between a preset pattern string set matched with the preset category entity and the pattern string to be matched, wherein the recognition result is used to indicate the preset category entity contained in the text to be recognized. This embodiment ensures the accuracy of the recognition method while improving its efficiency.

Description

Method, apparatus, electronic device and medium for identifying preset category entities
Technical Field
Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a method, an apparatus, an electronic device, and a medium for identifying a preset category entity.
Background
In the field of Natural Language Processing (NLP), text segments with certain types of features are generally referred to as entities. The process of marking such text segments in the original text is called entity recognition. Current entity recognition techniques mainly fall into two categories: matching-based entity recognition methods and entity recognition methods based on Deep Learning (DL). Matching-based entity recognition methods mainly obtain recognition results by comparing preset dictionaries containing all entities, or preset collocation patterns among characters/words, with segments of the text to be recognized one by one. Entity recognition methods based on deep learning mainly perform end-to-end entity recognition by combining underlying language models such as BERT (Bidirectional Encoder Representations from Transformers) with models such as RNN (Recurrent Neural Network), CNN (Convolutional Neural Network), and CRF (Conditional Random Field).
However, for entities whose characters/words follow certain matching patterns yet still allow some flexibility (e.g., the dosage section of a drug instruction leaflet), the conventional matching-based entity recognition method requires writing a large number of rule templates and is therefore very inefficient. The end-to-end entity recognition method based on deep learning requires a long model training time, and its accuracy is difficult to bring up to the requirement.
Disclosure of Invention
Embodiments of the present disclosure propose methods, apparatuses, electronic devices, and media for identifying preset category entities.
In a first aspect, an embodiment of the present disclosure provides a method for identifying a preset category entity, where the method includes: acquiring a text to be recognized; acquiring a preset category entity recognition template, wherein the preset category entity recognition template comprises at least one text matching structure; parsing the text to be recognized by using the preset category entity recognition template to generate a pattern string to be matched, wherein the pattern string to be matched comprises a text matching structure identification sequence; and generating a recognition result according to the matching between a preset pattern string set matched with the preset category entity and the pattern string to be matched, wherein the recognition result is used to indicate the preset category entity contained in the text to be recognized.
In a second aspect, an embodiment of the present disclosure provides an apparatus for identifying a preset category entity, the apparatus including: a first acquisition unit configured to acquire a text to be recognized; the second acquisition unit is configured to acquire a preset category entity identification template, wherein the preset category entity identification template comprises at least one text matching structure; the analysis unit is configured to analyze the text to be recognized by using a preset category entity recognition template to generate a pattern string to be matched, wherein the pattern string to be matched comprises a text matching structure identification sequence; the generating unit is configured to generate a recognition result according to matching between a preset pattern string set matched with a preset category entity and a pattern string to be matched, wherein the recognition result is used for indicating the preset category entity contained in the text to be recognized.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method as described in any implementation of the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, which when executed by a processor implements the method as described in any of the implementations of the first aspect.
The method, apparatus, electronic device, and medium for identifying preset category entities provided by the embodiments of the present disclosure greatly reduce the workload of template writing by presetting text matching structures instead of specific words. Moreover, since no end-to-end model training is needed, the time before the method can be put to use is reduced. Therefore, the accuracy of the recognition method can be ensured while its efficiency is improved.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for identifying preset category entities according to the present disclosure;
fig. 3 is a schematic diagram of an application scenario of a method for identifying preset category entities according to an embodiment of the present disclosure;
FIG. 4 is a flow diagram of yet another embodiment of a method for identifying preset category entities in accordance with the present disclosure;
FIG. 5 is a block diagram illustrating one embodiment of an apparatus for identifying entities of preset categories according to the present disclosure;
FIG. 6 is a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 shows an exemplary architecture 100 to which the method for identifying a preset category entity or the apparatus for identifying a preset category entity of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The terminal devices 101, 102, 103 interact with a server 105 via a network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, an entity identification application, and the like.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display screen and supporting human-computer interaction, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above. They may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services) or as a single piece of software or software module. This is not particularly limited herein.
The server 105 may be a server providing various services, such as a background server providing support for entity identification type applications on the terminal devices 101, 102, 103. The background server can analyze and process the received text to be recognized, generate a recognition result according to the preset category entity recognition template, and feed back the generated recognition result to the terminal equipment.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the method for identifying preset category entities provided by the embodiments of the present disclosure is generally performed by the server 105, and accordingly, the apparatus for identifying preset category entities is generally disposed in the server 105. Optionally, the method for identifying preset category entities provided by the embodiments of the present disclosure may also be executed by the terminal devices 101, 102, and 103, provided that their computing capability suffices, and accordingly, the apparatus for identifying preset category entities may also be disposed in the terminal devices 101, 102, and 103.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of a method for identifying preset category entities in accordance with the present disclosure is shown. The method for identifying the preset category entity comprises the following steps:
step 201, acquiring a text to be recognized.
In this embodiment, an execution subject of the method for identifying the preset category entity (such as the server 105 shown in fig. 1) may acquire the text to be recognized through a wired or wireless connection. The text to be recognized may include various texts from which entities are to be recognized. The preset category entity can be preset according to the actual application scenario. As an example, the text to be recognized may include a drug instruction leaflet. Accordingly, the preset category entity may include an entity describing the drug usage and dosage part. It may include, but is not limited to, at least one of the following: a frequency entity, a dosage entity, a course-of-treatment entity, a population entity, an indication entity, and a combination entity.
In this embodiment, as an example, the execution subject may obtain a text to be recognized that is stored locally in advance, or may obtain the text to be recognized from an electronic device (for example, a database server or the terminal devices 101, 102, and 103 shown in fig. 1) communicatively connected with the execution subject.
Step 202, acquiring a preset category entity identification template.
In this embodiment, the execution subject may acquire the preset category entity recognition template in various ways. The preset category entity recognition template may include at least one text matching structure. A text matching structure can be used to represent words of a preset type and their combination modes. By way of example, the text matching structures described above may include, but are not limited to, at least one of: a number + a quantity unit (e.g., "2 tablets", "5 granules"), a number + a time unit (e.g., "1 day", "8 hours"), a meal order relationship (e.g., "before meals", "after meals"), a medication action (e.g., "take with water", "orally"), a range adverb (e.g., "between", "over"), a negation (e.g., "no", "forbidden"), and a conjunction (e.g., "before", "after").
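As an illustration only, such text matching structures could be declared as named regular expressions. The structure names and patterns below are assumptions made for this sketch (English approximations of drug-instruction wording); the embodiment does not prescribe a concrete template format.

```python
import re

# Hypothetical declaration of a few text matching structures (step 202).
# Names and regexes are illustrative assumptions, not the patent's template format.
TEXT_MATCHING_STRUCTURES = {
    "DOSE": re.compile(r"\d+(?:\.\d+)?\s*(?:tablets?|capsules?|granules?|mg|ml)"),   # number + quantity unit
    "TIME": re.compile(r"(?:every\s+)?\d+(?:-\d+)?\s*(?:hours?|days?|weeks?|months?)"),  # number/"every" + time unit
    "FREQ": re.compile(r"(?:once|twice|\d+\s*times)"),                                   # number + "times"
    "MEAL": re.compile(r"(?:before|after)\s+meals?"),                                     # meal order relationship
    "VERB": re.compile(r"(?:take|replace|apply)"),                                        # medication action
    "P":    re.compile(r"[,.;]"),                                                          # segmenting punctuation
}
```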
Step 203, parsing the text to be recognized by using the preset category entity recognition template to generate a pattern string to be matched.
In this embodiment, the execution subject may parse the text to be recognized by using the preset category entity recognition template to generate a pattern string to be matched. The pattern string to be matched may include a text matching structure identification sequence. A text matching structure identification can be used to indicate a particular text matching structure. As an example, the text matching structure identifications may include (TIME), (FREQ), (UKN), and (P). Here, (TIME) may indicate text matching structures such as "number + time unit" and/or "'every' + time unit". (FREQ) may indicate at least one of "number + 'times'", "'every/single' + 'time'", and "'divided into' + number + 'times'". (UKN) may indicate text that does not belong to the preset category entity. (P) may indicate punctuation marks of a preset category (e.g., "," and "." used for segmenting text fragments). As yet another example, the text matching structure identifications may include (UKN, N), where N indicates the number of characters of text that does not belong to the preset category entity.
In this embodiment, the execution subject may match the text to be recognized obtained in step 201 against the at least one text matching structure included in the preset category entity recognition template obtained in step 202. According to the matching result, the execution subject may generate the pattern string to be matched, which comprises a text matching structure identification sequence. As an example, in response to determining a match, the execution subject may generate the text matching structure identification corresponding to the matched text matching structure. In response to determining that there is no match, the execution subject may generate a text matching structure identification characterizing that the text matches none of the text matching structures included in the preset category entity recognition template. Then, the execution subject may arrange the generated text matching structure identifications in the order of the characters in the text to be recognized. Thus, the execution subject may generate the pattern string to be matched.
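A minimal sketch of such a parsing step is given below, reusing the hypothetical TEXT_MATCHING_STRUCTURES table sketched above; it scans the text from left to right, emits the identification of the first structure that matches, and collapses unmatched spans into (UKN, N) markers.

```python
def parse_to_pattern_string(text: str, structures: dict) -> str:
    """Illustrative version of step 203: turn raw text into a
    text-matching-structure identification sequence."""
    ids, pos, unknown = [], 0, 0

    def flush_unknown():
        nonlocal unknown
        if unknown:
            ids.append(f"(UKN,{unknown})")
            unknown = 0

    while pos < len(text):
        for name, regex in structures.items():
            m = regex.match(text, pos)
            if m and m.end() > pos:
                flush_unknown()
                ids.append(f"({name})")
                pos = m.end()
                break
        else:
            unknown += 1        # character matched no structure
            pos += 1
    flush_unknown()
    return "".join(ids)

# e.g. parse_to_pattern_string("take 2 tablets 3 times every 8 hours", TEXT_MATCHING_STRUCTURES)
# would yield "(VERB)(UKN,1)(DOSE)(UKN,1)(FREQ)(UKN,1)(TIME)"
```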
Step 204, generating a recognition result according to the matching between the preset pattern string set matched with the preset category entity and the pattern string to be matched.
In this embodiment, the execution subject may first obtain a preset pattern string set matching the preset category entity. The preset pattern string set matched with the preset category entity may include a preset pattern string corresponding to a text belonging to the preset category entity. Then, according to whether the pattern string to be matched generated in the step 203 matches with the pattern string in the preset pattern string set, the execution subject may generate a recognition result in various ways. The recognition result may be used to indicate the preset category entity included in the text to be recognized obtained in step 201. As an example, in response to determining that a pattern string matching the pattern string to be matched generated in step 203 exists in the preset pattern string set, the executing entity may generate a recognition result indicating that the text to be recognized acquired in step 201 includes the entity of the preset category. Optionally, the recognition result may also be used to indicate a position of the entity of the preset category included in the text to be recognized obtained in the step 201 in the text to be recognized, that is, an entity tagging result. As another example, in response to determining that no pattern string matching the pattern string to be matched generated in step 203 exists in the preset pattern string set, the executing entity may generate a recognition result indicating that the text to be recognized acquired in step 201 does not include the entity in the preset category.
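Read literally, step 204 can be as simple as set membership over pattern strings; the snippet below is one such minimal reading (the richer, vector-similarity variant is described in the optional implementations that follow).

```python
def recognize_by_exact_match(pattern_string: str, preset_pattern_strings: set) -> dict:
    """Simplest possible form of step 204: the text contains a preset category
    entity iff its pattern string appears verbatim in the preset set."""
    return {"contains_entity": pattern_string in preset_pattern_strings}
```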
In some optional implementations of this embodiment, the executing entity may generate the recognition result by:
the method comprises the steps of firstly, obtaining vector representations corresponding to pattern strings in a preset pattern string set matched with preset category entities.
In these implementations, the execution subject may obtain, in various ways, a vector representation corresponding to a pattern string in a preset pattern string set that matches the preset category entity. Each pattern string in the preset pattern string set may correspond to a corresponding vector representation.
Optionally, the executing body may further obtain a vector representation corresponding to a pattern string in a preset pattern string set matched with the preset category entity by:
and S1, acquiring a text fragment set matched with the preset category entity.
In these implementations, the execution subject may acquire the set of text fragments matching the preset category entity in various ways. The text fragment set may include texts containing the preset category entity. As an example, the preset category entity may include a frequency entity. The text fragment set may then include text fragments such as "once a day", "once every 12 hours", "the daily dose is taken in 2-3 divided doses", and "replace with fresh medication every 4-6 hours".
Step S2, parsing the text fragments in the text fragment set by using the preset category entity recognition template to generate a preset pattern string set.
In these implementations, the execution subject may parse the text fragments in the text fragment set obtained in step S1 in a manner consistent with step 203, so as to generate the preset pattern string set. As an example, the preset category entity recognition template may include: [FREQ] number + 'times', 'every/single' + 'time', 'divided into' + number + 'times'; [TIME] number + time unit, 'every' + time unit; [VERB] replace, take medication. Thus, the preset pattern string set generated after the execution subject parses the text fragments may include: (TIME)(FREQ), (TIME)(UKN,2)(FREQ), (TIME)(VERB)(UKN,2). The descriptions of (TIME), (FREQ), and (UKN) above are consistent with the foregoing. (VERB) may indicate a preset medication action, such as "replace" or "take".
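Continuing the earlier sketch, the preset pattern string set could be produced by running the same hypothetical parser over the positive text fragments; the fragments below merely paraphrase the examples given above and are not part of the embodiment.

```python
# Illustrative construction of the preset pattern string set (steps S1-S2).
positive_fragments = [
    "once a day",
    "once every 12 hours",
    "replace every 4 hours",
]
preset_pattern_strings = {
    parse_to_pattern_string(fragment, TEXT_MATCHING_STRUCTURES)
    for fragment in positive_fragments
}
```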
Step S3, inputting the pattern strings in the preset pattern string set into a pre-trained vector generation model to obtain corresponding vector representations.
In these implementations, the execution subject may input the pattern strings in the preset pattern string set generated in step S2 into a pre-trained vector generation model, so as to obtain the corresponding vector representations. The vector generation model may include various Deep Neural Networks (DNNs) for vector generation, such as word2vec, GloVe, ELMo (Embeddings from Language Models), BERT, and the like.
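As one concrete possibility (word2vec is only one of the models the embodiment names), a small word2vec model could be trained on the pattern strings themselves, treating each text matching structure identification as a token and averaging token vectors to represent a whole pattern string. Everything below, including the use of gensim and the averaging choice, is an assumption made for illustration.

```python
from gensim.models import Word2Vec
import numpy as np

# Pattern strings tokenized into text-matching-structure identifications.
corpus = [["(TIME)", "(FREQ)"],
          ["(TIME)", "(UKN,2)", "(FREQ)"],
          ["(TIME)", "(VERB)", "(UKN,2)"]]

w2v = Word2Vec(sentences=corpus, vector_size=32, window=2, min_count=1, epochs=200)

def pattern_string_vector(tokens):
    """One simple choice: average the token embeddings of a pattern string."""
    return np.mean([w2v.wv[t] for t in tokens], axis=0)
```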
Based on this optional implementation, by generating pattern strings from text fragments and then generating vector representations from those pattern strings, the vector representations can be produced by combining a small number of preset category entity recognition templates with deep learning, thereby balancing the accuracy of the generated vector representations with generation efficiency.
Second, generating a vector representation to be matched corresponding to the pattern string to be matched.
In these implementations, the executing entity may generate the to-be-matched vector representation corresponding to the to-be-matched pattern string generated in step 203 by using a method consistent with the first step.
Third, generating a recognition result according to the similarity between the obtained vector representations and the vector representation to be matched.
In these implementations, the executing entity may first determine a similarity between the vector representation obtained in the first step and the vector representation to be matched generated in the second step. Then, the execution subject may generate a recognition result according to the similarity. As an example, in response to determining that the similarity between the vector to be matched and the target vector is greater than a preset threshold, the executing entity may generate a recognition result indicating that the text to be recognized acquired in step 201 includes the preset category entity. The target vector may include a preset vector corresponding to a pattern string corresponding to a text belonging to the preset category entity. Optionally, the recognition result may also be used to indicate a position of the entity of the preset category included in the text to be recognized obtained in the step 201 in the text to be recognized, that is, an entity tagging result. As another example, in response to determining that the similarity between the vector to be matched and all target vectors in the preset pattern string set is not greater than a preset threshold, the executing entity may generate a recognition result indicating that the text to be recognized acquired in step 201 does not include the preset category entity.
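A sketch of the similarity-based decision in the third step, assuming the pattern_string_vector helper above and cosine similarity; the embodiment does not fix a particular similarity measure or threshold value.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def recognize_by_similarity(tokens_to_match, preset_token_lists, threshold=0.8):
    """Illustrative third step: the text is judged to contain the preset
    category entity if its pattern-string vector is sufficiently similar
    to any preset pattern-string vector."""
    v = pattern_string_vector(tokens_to_match)
    sims = [cosine_similarity(v, pattern_string_vector(p)) for p in preset_token_lists]
    return {"contains_entity": bool(sims) and max(sims) > threshold}
```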
Based on the optional implementation manner, the similarity between the vectors can be used as a matching basis, and the matching efficiency is improved.
Optionally, the text fragment set may include positive example text fragments matching the preset category entity. A positive example text fragment is a preset text fragment containing text of the preset category entity. As an example, the preset category entity may include a course-of-treatment entity, and a positive example text fragment may be "take continuously for 3 months". In response to determining that the similarity between the vector representation to be matched and the vector representation corresponding to a positive example text fragment is greater than a preset threshold, the execution subject may generate a recognition result indicating that the text to be recognized contains the preset category entity. Optionally, the recognition result may also indicate the position, within the text to be recognized obtained in step 201, of the preset category entity it contains, that is, an entity labeling result.
Based on this optional implementation, the recognition result can be generated by comparison against positive example text fragments, which enriches the matching modes and improves recognition accuracy.
Optionally, the text fragment set may include negative example text fragments matching the preset category entity. A negative example text fragment is a preset text fragment that is easily misrecognized as containing the preset category entity. As an example, the preset category entity may include a course-of-treatment entity, and a negative example text fragment may be "the dose for infants under 3 months is reduced by half". In response to determining that the similarity between the vector representation to be matched and the vector representation corresponding to a negative example text fragment is greater than a preset threshold, the execution subject may generate a recognition result indicating that the text to be recognized does not contain the preset category entity.
Based on this optional implementation, the recognition result can be generated by comparison against negative example text fragments, which enriches the matching modes and improves recognition accuracy.
In some optional implementations of this embodiment, according to the entity recognition result, the execution subject may further update the preset pattern string set with the pattern string to be matched. As an example, in response to determining that the recognition result indicates that the text to be recognized contains a preset category entity, the execution subject may add the pattern string to be matched corresponding to the recognized preset category entity to the preset pattern string set, so as to update the preset pattern string set.
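This self-learning update amounts to a single insertion into the preset set, as the toy snippet below (reusing the hypothetical names from the earlier sketches) suggests:

```python
result = recognize_by_similarity(tokens_to_match, preset_token_lists)
if result["contains_entity"]:
    # dynamically enrich the preset set with the newly matched pattern string
    preset_token_lists.append(tokens_to_match)
```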
Based on this optional implementation, the preset pattern string set can be enriched with recognized results; by dynamically updating the preset pattern string set, accuracy and a self-learning property are ensured with little manual involvement.
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of a method for identifying preset category entities according to an embodiment of the present disclosure. In the application scenario of fig. 3, the background server 301 obtains, from the database server 302, the text to be recognized "Adults: the recommended initial daily dose is 100-150 mg; for mild patients the daily dose is 75-100 mg, and the daily dose may be taken in 2-3 divided doses" (as shown at 303 in fig. 3). The background server 301 may then obtain the frequency entity recognition template (shown as 304 in fig. 3). The frequency entity recognition template 304 may include: [FREQ] number + "times", "every/single" + "time", "divided into" + number + "times"; [TIME] number + time unit, "every" + time unit; [VERB] replace, take medication. The background server 301 may parse the text to be recognized 303 by using the frequency entity recognition template 304 to generate a pattern string to be matched 305. The meanings of the text matching structure identifications included in the pattern string to be matched 305 are consistent with the description of step 203 in the foregoing embodiment and are not repeated here. Based on the matching of the frequency pattern string set 306 with the pattern string to be matched 305, the background server 301 may generate a recognition result (as shown at 307 in fig. 3). The text matching structure identifications included in the recognition result 307 may be used to indicate the entities belonging to the frequency entity in the text to be recognized 303.
At present, one of the prior-art approaches usually adopts dictionary or template matching, which requires a large number of rule templates to be written and is inefficient. The method provided by the embodiment of the present disclosure greatly reduces the workload of template writing by presetting text matching structures instead of specific words. Moreover, since no end-to-end model training is needed, the time before the method can be put to use is reduced. Therefore, the accuracy of the recognition method can be ensured while its efficiency is improved.
With further reference to fig. 4, a flow 400 of yet another embodiment of a method for identifying preset category entities is shown. The process 400 of the method for identifying entities of preset categories comprises the following steps:
step 401, obtaining a text to be recognized.
Step 402, acquiring a preset category entity identification template.
Step 403, parsing the text to be recognized by using the preset category entity recognition template to generate a pattern string to be matched.
Step 404, acquiring a text segment set matched with the preset category entity.
Step 405, parsing the text fragments in the text fragment set by using the preset category entity recognition template to generate a preset pattern string set.
Step 406, inputting the pattern strings in the preset pattern string set to the pre-trained vector generation model to obtain corresponding vectors.
Step 407, generating a to-be-matched vector corresponding to the to-be-matched pattern string.
And step 408, generating a recognition result according to the similarity between the obtained vector and the vector to be matched.
Steps 401, 402, and 403 are respectively consistent with steps 201, 202, and 203 in the foregoing embodiments, and steps 404 to 408 are consistent with the optional implementations of step 204; the above descriptions of steps 201, 202, 203, their optional implementations, and the optional implementations of step 204 also apply to steps 401, 402, 403, and 404 to 408, and are not repeated here.
Step 409, sending the entity recognition result to an auditing terminal.
In this embodiment, an executing entity (for example, the server 105 shown in fig. 1) of the method for identifying the entity of the preset category may send the identification result generated in step 408 to the auditing terminal. The audit terminal may be a terminal for rechecking the identification result. As an example, the auditing terminal may be a terminal used by a technician. As another example, the audit terminal may also be a terminal for executing the method for identifying the entity of the preset category.
Step 410, receiving the auditing result sent by the auditing terminal.
In this embodiment, the execution subject may receive an audit result sent by the audit terminal. The audit result may be used to indicate whether the preset category entity identified by the identification result is correct.
Step 411, updating the text fragment set corresponding to the preset pattern string set with the text to be recognized according to the auditing result.
In this embodiment, according to the auditing result received in step 410, the execution subject may update, in various ways, the text fragment set corresponding to the preset pattern string set with the text to be recognized. As an example, in response to determining that the auditing result indicates that the recognition result is correct, the execution subject may add the text fragment of the text to be recognized that contains the preset category entity indicated by the recognition result to the text fragment set corresponding to the preset pattern string set.
In some optional implementations of this embodiment, on the basis that the text fragment set includes positive example text fragments matching the preset category entity, in response to determining that the auditing result indicates that the text to be recognized contains an unrecognized preset category entity, the execution subject may add the text fragment in which the unrecognized preset category entity is located to the text fragment set as a positive example text fragment.
In some optional implementations of this embodiment, on the basis that the text fragment set includes negative example text fragments matching the preset category entity, in response to determining that the auditing result indicates that the recognition result contains a misrecognized preset category entity, the execution subject may add the text fragment in which the misrecognized preset category entity is located to the text fragment set as a negative example text fragment.
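One hypothetical way steps 409-411 could feed back into the fragment sets is sketched below; the audit_result fields are invented for this sketch and are not defined by the embodiment.

```python
def apply_audit_feedback(audit_result: dict,
                         positive_fragments: set,
                         negative_fragments: set) -> None:
    """Illustrative update of the positive/negative example fragment sets
    according to the auditing terminal's feedback (steps 410-411)."""
    # a fragment the template pipeline missed becomes a new positive example
    for fragment in audit_result.get("missed_entity_fragments", []):
        positive_fragments.add(fragment)
    # a fragment that was wrongly recognized becomes a new negative example
    for fragment in audit_result.get("misrecognized_fragments", []):
        negative_fragments.add(fragment)
```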
Based on these optional implementations, the positive example text fragment set and/or the negative example text fragment set can be updated based on feedback on the recognition result, which fully embodies the self-learning property and gradually improves the accuracy of the matching method.
In some optional implementations of this embodiment, according to the entity recognition result, the execution subject may further update the preset pattern string set with the pattern string to be matched.
As can be seen from fig. 4, the process 400 of the method for identifying preset category entities in this embodiment embodies the step of auditing the recognition result through an auditing terminal, and the step of updating the text fragment set corresponding to the preset pattern string set with the text to be recognized according to the auditing result. Therefore, the scheme described in this embodiment can correct misrecognized entities through the auditing mechanism, so as to ensure a very high accuracy, which is particularly suitable for high-stakes fields (such as the medical field). Meanwhile, the method can further learn from the feedback on the recognition result and automatically expand the text fragment set, thereby realizing self-learning and greatly reducing manual workload.
With further reference to fig. 5, as an implementation of the methods shown in the above-mentioned figures, the present disclosure provides an embodiment of an apparatus for identifying entities in preset categories, which corresponds to the method embodiment shown in fig. 2 or fig. 4, and which may be applied in various electronic devices.
As shown in fig. 5, the apparatus 500 for identifying a preset category entity provided in this embodiment includes a first obtaining unit 501, a second obtaining unit 502, an analyzing unit 503, and a generating unit 504. The first obtaining unit 501 is configured to obtain a text to be recognized; a second obtaining unit 502 configured to obtain a preset category entity identification template, where the preset category entity identification template includes at least one text matching structure therein; the parsing unit 503 is configured to parse the text to be recognized by using a preset category entity recognition template, and generate a pattern string to be matched, where the pattern string to be matched includes a text matching structure identification sequence; the generating unit 504 is configured to generate a recognition result according to matching between a preset pattern string set matched with a preset category entity and a pattern string to be matched, where the recognition result is used for indicating the preset category entity included in the text to be recognized.
In the present embodiment, in the apparatus 500 for identifying a preset category entity: the specific processing of the first obtaining unit 501, the second obtaining unit 502, the analyzing unit 503 and the generating unit 504 and the technical effects thereof can refer to the related descriptions of step 201, step 202, step 203 and step 204 in the corresponding embodiment of fig. 2, which are not described herein again.
In some optional implementations of the present embodiment, the generating unit 504 may include an obtaining subunit (not shown in the figure), a first generating subunit (not shown in the figure), and a second generating subunit (not shown in the figure). The obtaining subunit may be configured to obtain a vector representation corresponding to a pattern string in a preset pattern string set that matches a preset category entity. The first generating subunit may be configured to generate a vector representation to be matched corresponding to the pattern string to be matched. The second generating subunit may be configured to generate the recognition result according to a similarity between the acquired vector representation and the vector representation to be matched.
In some optional implementations of this embodiment, the obtaining subunit (not shown in the figure) may include an obtaining module (not shown in the figure), a first generating module (not shown in the figure), and a second generating module (not shown in the figure). The obtaining module may be configured to obtain a set of text segments matching a preset category entity. The first generating module may be configured to parse the text segments in the text segment set by using the entity recognition template of the preset category to generate a set of preset pattern strings. The second generation module may be configured to input pattern strings in the preset pattern string set to a vector generation model trained in advance, resulting in corresponding vector representations.
In some optional implementations of this embodiment, the text fragment set may include positive example text fragments matching the preset category entity. The second generating subunit may be further configured to: generate a recognition result indicating that the text to be recognized contains the preset category entity, in response to determining that the similarity between the vector representation to be matched and the vector representation corresponding to a positive example text fragment is greater than a preset threshold.
In some optional implementations of this embodiment, the text fragment set may include negative example text fragments matching the preset category entity. The second generating subunit may be further configured to: generate a recognition result indicating that the text to be recognized does not contain the preset category entity, in response to determining that the similarity between the vector representation to be matched and the vector representation corresponding to a negative example text fragment is greater than a preset threshold.
In some optional implementations of this embodiment, the apparatus 500 for identifying a preset category entity may further include a sending unit (not shown in the figure), a receiving unit (not shown in the figure), and a first updating unit (not shown in the figure). The sending unit may be configured to send the recognition result to an auditing terminal. The receiving unit may be configured to receive the auditing result sent by the auditing terminal. The first updating unit may be configured to update the text fragment set corresponding to the preset pattern string set with the text to be recognized according to the auditing result.
In some optional implementations of this embodiment, the text fragment set may include positive example text fragments matching the preset category entity. The first updating unit may be further configured to: in response to determining that the auditing result indicates that the text to be recognized contains an unrecognized preset category entity, add the text fragment in which the unrecognized preset category entity is located to the text fragment set as a positive example text fragment.
In some optional implementations of this embodiment, the text fragment set may include negative example text fragments matching the preset category entity. The first updating unit may be further configured to: in response to determining that the auditing result indicates that the recognition result contains a misrecognized preset category entity, add the text fragment in which the misrecognized preset category entity is located to the text fragment set as a negative example text fragment.
In some optional implementations of the present embodiment, the apparatus 500 for identifying a preset category entity may include: and a second updating unit (not shown in the figure) configured to update the preset pattern string set with the pattern string to be matched according to the entity identification result.
The apparatus provided by the above embodiment of the present disclosure obtains, through the second obtaining unit 502, a preset category entity recognition template comprising preset text matching structures instead of specific words, thereby greatly reducing the workload of template writing. Moreover, since no end-to-end model training is needed, the time before the apparatus can be put to use is reduced. Therefore, the accuracy of the recognition method can be ensured while its efficiency is improved.
Referring now to FIG. 6, a schematic diagram of an electronic device (e.g., the server of FIG. 1) 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The server shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 6, electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 6 may represent one device or may represent multiple devices as desired.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of embodiments of the present disclosure.
It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (Radio Frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a text to be recognized; acquire a preset category entity recognition template, wherein the preset category entity recognition template comprises at least one text matching structure; parse the text to be recognized by using the preset category entity recognition template to generate a pattern string to be matched, wherein the pattern string to be matched comprises a text matching structure identification sequence; and generate a recognition result according to the matching between a preset pattern string set matched with the preset category entity and the pattern string to be matched, wherein the recognition result is used to indicate the preset category entity contained in the text to be recognized.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor comprises a first acquisition unit, a second acquisition unit, an analysis unit and a generation unit. Where the names of the units do not in some cases constitute a limitation on the units themselves, for example, the first acquiring unit may also be described as a "unit that acquires text to be recognized".
The foregoing description is only a description of the preferred embodiments of the present disclosure and of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.

Claims (12)

1. A method for identifying a preset category entity, comprising:
acquiring a text to be recognized;
acquiring a preset category entity identification template, wherein the preset category entity identification template comprises at least one text matching structure;
analyzing the text to be recognized by using the preset category entity identification template to generate a pattern string to be matched, wherein the pattern string to be matched comprises a text matching structure identification sequence;
and generating a recognition result according to the matching between a preset pattern string set matched with the preset category entity and the pattern string to be matched, wherein the recognition result is used for indicating the preset category entity contained in the text to be recognized.
2. The method according to claim 1, wherein the generating a recognition result according to the matching between the preset pattern string set matched with the preset category entity and the pattern string to be matched comprises:
acquiring vector representations corresponding to pattern strings in a preset pattern string set matched with the preset category entities;
generating a vector representation to be matched corresponding to the pattern string to be matched;
and generating the recognition result according to the similarity between the obtained vector representation and the vector representation to be matched.
3. The method of claim 2, wherein the obtaining of the vector representation corresponding to the pattern string in the preset pattern string set matching the preset category entity comprises:
acquiring a text fragment set matched with the preset category entity;
analyzing the text segments in the text segment set by using the preset category entity identification template to generate the preset pattern string set;
and inputting the pattern strings in the preset pattern string set into a vector generation model trained in advance to obtain corresponding vector representation.
4. The method of claim 3, wherein the set of text segments includes a positive example text segment matching the preset category entity; and
the generating the recognition result according to the similarity between the obtained vector representation and the vector representation to be matched comprises:
and generating a recognition result for indicating that the text to be recognized contains the preset category entity in response to determining that the similarity between the vector representation to be matched and the vector representation corresponding to the regular text segment is greater than a preset threshold value.
5. The method of claim 3, wherein the set of text segments includes negative example text segments that match the preset category entity; and
the generating the recognition result according to the similarity between the obtained vector representation and the vector representation to be matched comprises:
and generating an identification result for indicating that the text to be identified does not contain the preset category entity in response to determining that the similarity between the vector representation to be matched and the vector representation corresponding to the negative example text segment is greater than a preset threshold.
6. The method of claim 3, wherein the method further comprises:
sending the identification result to an auditing terminal;
receiving an auditing result sent by the auditing terminal;
and updating the text segment set corresponding to the preset pattern string set by using the text to be recognized according to the auditing result.
7. The method of claim 6, wherein the text segment set includes a positive example text segment matched with the preset category entity; and
the updating the text segment set corresponding to the preset pattern string set by using the text to be recognized according to the auditing result comprises:
in response to determining that the auditing result indicates that the text to be recognized includes an unrecognized preset category entity, adding the text segment in which the unrecognized preset category entity is located to the text segment set as a positive example text segment.
8. The method of claim 6, wherein the text segment set includes a negative example text segment matched with the preset category entity; and
the updating the text segment set corresponding to the preset pattern string set by using the text to be recognized according to the auditing result comprises:
in response to determining that the auditing result indicates that the recognition result contains a misrecognized preset category entity, adding the text segment in which the misrecognized preset category entity is located to the text segment set as a negative example text segment.
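Claims 6 to 8 describe an audit feedback loop over the text segment set; one way this could look in code, where the structure of the auditing result (lists of missed and misrecognized text segments) is an assumption made for illustration.

def update_from_audit(audit_result, positive_segments, negative_segments):
    """Update the text segment sets corresponding to the preset pattern string set:
    missed entities become positive example text segments (claim 7), and
    misrecognized entities become negative example text segments (claim 8)."""
    positive_segments.extend(audit_result.get("missed_segments", []))
    negative_segments.extend(audit_result.get("misrecognized_segments", []))
    return positive_segments, negative_segments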
9. The method according to any one of claims 1 to 8, wherein the method further comprises:
updating the preset pattern string set by using the pattern string to be matched according to the recognition result.
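The update of claim 9 might look like the following, again only as an illustrative sketch with an assumed encoding of the recognition result and of the preset pattern string set as a Python set.

def update_preset_pattern_strings(preset_pattern_strings, pattern_string_to_match,
                                  recognition_result):
    """Add the pattern string to be matched to the preset pattern string set when
    the recognition result confirms the preset category entity."""
    if recognition_result == "contains_entity":
        preset_pattern_strings.add(pattern_string_to_match)
    return preset_pattern_strings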
10. An apparatus for identifying preset category entities, comprising:
a first acquiring unit, configured to acquire a text to be recognized;
a second acquiring unit, configured to acquire a preset category entity recognition template, wherein the preset category entity recognition template comprises at least one text matching structure;
an analyzing unit, configured to analyze the text to be recognized by using the preset category entity recognition template to generate a pattern string to be matched, wherein the pattern string to be matched comprises a text matching structure identification sequence; and
a generating unit, configured to generate a recognition result according to the matching between a preset pattern string set matched with the preset category entity and the pattern string to be matched, wherein the recognition result is used for indicating the preset category entity contained in the text to be recognized.
11. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-9.
12. A computer-readable medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-9.
CN202010999268.6A 2020-09-22 2020-09-22 Method, apparatus, electronic device and medium for identifying preset category entities Pending CN112307766A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010999268.6A CN112307766A (en) 2020-09-22 2020-09-22 Method, apparatus, electronic device and medium for identifying preset category entities

Publications (1)

Publication Number Publication Date
CN112307766A (en) 2021-02-02

Family

ID=74488930

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010999268.6A Pending CN112307766A (en) 2020-09-22 2020-09-22 Method, apparatus, electronic device and medium for identifying preset category entities

Country Status (1)

Country Link
CN (1) CN112307766A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101481A (en) * 2018-06-25 2018-12-28 北京奇艺世纪科技有限公司 A kind of name entity recognition method, device and electronic equipment
CN109145294A (en) * 2018-08-07 2019-01-04 北京三快在线科技有限公司 Text entities recognition methods and device, electronic equipment, storage medium
CN110598200A (en) * 2018-06-13 2019-12-20 北京百度网讯科技有限公司 Semantic recognition method and device
CN110750991A (en) * 2019-09-18 2020-02-04 平安科技(深圳)有限公司 Entity identification method, device, equipment and computer readable storage medium
CN111090987A (en) * 2019-12-27 2020-05-01 北京百度网讯科技有限公司 Method and apparatus for outputting information
CN111581972A (en) * 2020-03-27 2020-08-25 平安科技(深圳)有限公司 Method, device, equipment and medium for identifying corresponding relation between symptom and part in text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination