CN111339764A - Chinese named entity recognition method and device

Info

Publication number: CN111339764A
Application number: CN201911192335.7A
Authority: CN
Inventors: 王喆锋; 郑毅; 李丹; 徐童; 怀宝兴; 袁晶
Original assignee: University of Science and Technology of China USTC; Huawei Technologies Co Ltd
Current assignee: University of Science and Technology of China USTC; Huawei Technologies Co Ltd
Priority date: 2019-09-18
Filing date: 2019-11-28
Publication date: 2020-06-26

Abstract

The embodiment of the application discloses a method and a device for recognizing a Chinese named entity, which are used for improving the accuracy of the Chinese named entity recognition in the medical field. The method in the embodiment of the application comprises the following steps: the Chinese named entity recognition device acquires a character vector and a radical vector of each character in the text to be recognized, wherein the radical vector is a vector of a radical corresponding to each character; then, the Chinese named entity recognition device splices the character vector and the radical vector and inputs the spliced character vector and the radical vector into a bidirectional long-short term memory network to obtain a first character vector; and finally, inputting the first character vector into a conditional random field model and outputting a first entity word in the text to be recognized.

Description

Chinese named entity recognition method and device

The present application claims priority to Chinese patent application No. 201910883676.2, entitled "A method and apparatus for identifying entity words based on radical characteristics", filed with the Chinese Patent Office on September 18, 2019, which is incorporated herein by reference in its entirety.

Technical Field

The application relates to the field of intelligent medical treatment, in particular to a Chinese named entity recognition method and device.

Background

With the rapid development of internet technology, network information shows an exponential growth situation, and a large number of on-line medical communities and medical information question and answer websites emerge, so that a large amount of medical diagnosis information is presented to people in the form of electronic documents. However, unlike databases, most of these medical data texts are in an unstructured state. In order to fully utilize the information contained in the texts in the medical fields, the named entity recognition technology is used for effectively extracting the useful medical entity words, which becomes the premise and the basis for realizing intelligent medical treatment.

However, current research on named entity recognition for Chinese electronic medical records does not jointly consider the particularities of the Chinese language and of the Chinese medical field. Instead, it migrates models designed for general-purpose datasets to the entity types of the medical field, so its effectiveness is limited.

Disclosure of Invention

The embodiment of the application provides a method and a device for recognizing a Chinese named entity, which are used for improving the accuracy of the Chinese named entity recognition in the medical field.

In a first aspect, an embodiment of the present application provides a Chinese named entity recognition method, which is applied to recognizing entities in Chinese medical records in the medical field, and specifically includes: the Chinese named entity recognition device acquires a character vector and a radical vector of each character in the text to be recognized, wherein the radical vector is a vector of the radical corresponding to each character; then, the Chinese named entity recognition device splices the character vector and the radical vector and inputs the spliced vector into a bidirectional long short-term memory network to obtain a first character vector; and finally, the device inputs the first character vector into a conditional random field model and outputs a first entity word in the text to be recognized.

In this embodiment, the Chinese named entity recognition device encodes the distinctive radical features of named entities in the medical field into the character vectors, which enriches the input features of the named entity recognition model and thus improves its entity extraction capability. Meanwhile, because the radical vectors and the character vectors are spliced before being fed into the bidirectional long short-term memory network, the network can capture not only the relationships between characters, but also the relationships between radicals and characters and between radicals and radicals, which enhances the entity recognition effect.

Optionally, the Chinese named entity recognition device converts each character in the text to be recognized into a corresponding first ID/one-hot code, and inputs the first ID/one-hot code into a lookup matrix of the named entity recognition model to obtain a character vector; at the same time, the Chinese named entity recognition device looks up the radical corresponding to each character according to the Chinese character-radical mapping table, converts the radical into a corresponding second ID/one-hot code, and inputs the second ID/one-hot code into a lookup matrix of the named entity recognition model to obtain a radical vector.

In this embodiment, the Chinese character-radical mapping table is a module newly added to the named entity recognition model. Specifically, the mapping table between Chinese characters and radicals is established by crawling an online Xinhua dictionary or a related medical dictionary. For example, the radical corresponding to "lung" (肺) is "moon" (月), and the two are stored in the mapping table in a one-to-one correspondence.

When the mapping table between Chinese characters and radicals is established, not every Chinese character has a separate radical; a character without one is stored as its own radical. For example, "day" (日) has no separate radical, so "day" is stored as its own radical, and "day" and "day" are stored in the mapping table in a one-to-one correspondence. Likewise, when a character is a special character, the character is stored in the mapping table with itself as a one-to-one Chinese character-radical pair. In this embodiment, a special character is a non-Chinese character, including but not limited to a punctuation mark, an English letter, or another foreign letter.

Optionally, the Chinese named entity recognition device may also pass each character in the text to be recognized through a pre-trained language model to obtain a second word vector for each character. The device then splices the first word vector (that is, the first character vector output by the bidirectional long short-term memory network) and the second word vector to obtain a final word vector, inputs this word vector into the conditional random field model, and finally outputs a second entity word of the text to be recognized. In this embodiment, adding the pre-trained language model further enriches the input features of the text to be recognized and thereby further improves the entity extraction capability. The pre-trained language model may be a Bidirectional Encoder Representations from Transformers (BERT) model, which is not limited herein.

Optionally, the specific algorithm by which the Chinese named entity recognition device inputs the first word vector into the conditional random field model and outputs the first entity word of the text to be recognized may be as follows: the Chinese named entity recognition device inputs the first word vector into the conditional random field model and obtains a labeling score for the first word vector through a scoring function; it then determines the first entity word of the text to be recognized according to the labeling score; wherein the scoring function is:

$$s(X, I) = \sum_{t=1}^{L} \left( f_{[I]_t,\,t} + T_{[I]_{t-1},\,[I]_t} + R_{r(t),\,[I]_t} \right)$$

wherein $s(X, I)$ denotes the labeling score of a sentence $X$ of length $L$ labeled with the tag sequence $I$; $f$ is the score of each label for each character output by the conditional random field model; $f_{[I]_t,\,t}$ denotes the labeling score of the $t$-th character being labeled with the label $[I]_t$; $r(t)$ denotes the index of the radical of the $t$-th character; $T$ is the transition matrix; and $R$ is the radical label matrix.

In this embodiment, a radical label matrix is newly added to the conditional random field model, and the influence of the radical label on the character is calculated, thereby improving the entity extraction capability.

It can be understood that the specific algorithm by which the Chinese named entity recognition device inputs the first word vector and the second word vector into the conditional random field model and outputs the second entity word of the text to be recognized may also be as follows: the Chinese named entity recognition device splices the first word vector and the second word vector, inputs the spliced word vector into the conditional random field model, and obtains a labeling score for the spliced word vector through a scoring function; it then determines the second entity word of the text to be recognized according to the labeling score; wherein the scoring function is:

$$s(X, I) = \sum_{t=1}^{L} \left( f_{[I]_t,\,t} + T_{[I]_{t-1},\,[I]_t} + R_{r(t),\,[I]_t} \right)$$

wherein $s(X, I)$ denotes the labeling score of a sentence $X$ of length $L$ labeled with the tag sequence $I$; $f$ is the score of each label for each character output by the conditional random field model; $f_{[I]_t,\,t}$ denotes the labeling score of the $t$-th character being labeled with the label $[I]_t$; $r(t)$ denotes the index of the radical of the $t$-th character; $T$ is the transition matrix; and $R$ is the radical label matrix.

Optionally, the set of radicals included in the radical label matrix of the conditional random field model may be restricted. In one implementation, the set of radicals in the radical label matrix is determined according to the characters that occur in medical records; for example, radicals that figure prominently in medical records, such as "疒" (sickness), "meat", "gold", "wood", "water", "fire", "earth", "heart", "ear", "mouth", "hand", "car", "grain", "grass", "person", "thin", "qi", "large", "blood", and "dead", can be added to the set of radicals.

The reason is as follows. In the conditional random field model, the size of the transition matrix is (number of labels × number of labels), and the size of the radical label matrix is (number of radicals × number of labels). When the number of radicals reaches the thousands, the radical label matrix becomes very large, so predictive decoding by dynamic programming becomes relatively expensive. Restricting the radical set keeps this matrix small.

Given this restriction on the radical set in the radical label matrix, the conditional random field model may calculate the influence of the radical label on a character as follows: if the radical of the character is contained in the radical label matrix of the conditional random field model, the model calculates a radical label score, a label labeling score, and a transition score for the character; if the radical of the character is not contained in the radical label matrix, the model calculates only the label labeling score and the transition score for the character.
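As an illustration of the rule above, the following minimal Python sketch adds the radical label score only when a character's radical is in the restricted set. The function name `labeling_score`, the dedicated start label used for the first transition, the label names, and all numeric values are assumptions made for the example; they do not come from the application.

```python
def labeling_score(emissions, labels, transition, radical_label, radical_rows, start_label=0):
    """Sum label, transition, and (where available) radical label scores.

    emissions[t][k]     : score of label k for character t
    labels[t]           : label index assigned to character t
    transition[j][k]    : score of moving from label j to label k
    radical_label[r][k] : score of label k for tracked radical r
    radical_rows[t]     : row of the radical of character t, or None when the
                          radical is not in the restricted radical set
    start_label         : assumed dedicated start label for the first transition
    """
    total = 0.0
    previous = start_label
    for t, label in enumerate(labels):
        total += emissions[t][label]              # label labeling score
        total += transition[previous][label]      # transition score
        row = radical_rows[t]
        if row is not None:
            total += radical_label[row][label]    # radical label score
        previous = label
    return total


# Hypothetical label set: index 0 is the assumed start label.
LABELS = ["<start>", "O", "BODY", "SYMPTOM"]
emissions = [[0.0, 0.1, 1.0, 0.2],
             [0.0, 0.3, 0.2, 0.9]]
transition = [[0.0, 0.0, 0.1, 0.0],
              [0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.4],
              [0.0, 0.0, 0.0, 0.0]]
radical_label = [[0.0, -0.5, 0.8, 0.1]]  # one tracked radical only

tags = [2, 3]  # BODY, SYMPTOM
with_radical = labeling_score(emissions, tags, transition, radical_label, [0, None])
without_radical = labeling_score(emissions, tags, transition, radical_label, [None, None])
```

In this toy setup the radical label matrix has one row per tracked radical, so its size is (number of tracked radicals × number of labels) rather than growing with every radical in the dictionary.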

In a second aspect, an embodiment of the present application provides a chinese named entity recognition apparatus, which has a function of implementing the behavior of the chinese named entity recognition apparatus in the first aspect. The functions can be realized by hardware, and the functions can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the above-described functions.

In one possible implementation, the apparatus includes means or modules for performing the steps of the first aspect above. For example, the apparatus includes: the acquiring module is used for acquiring a character vector and a radical vector of each character in the text to be recognized, wherein the radical vector is a vector of a radical corresponding to each character; the processing module is used for splicing the character vector and the radical vector acquired by the acquisition module and inputting the spliced character vector and the radical vector into a bidirectional long-short term memory network to obtain a first character vector; and the calculation module is used for inputting the first character vector obtained by the processing module into the conditional random field model and outputting a first entity word in the text to be recognized.

Optionally, the apparatus further comprises a storage module for storing the program instructions and data necessary for the Chinese named entity recognition device.

In one possible implementation, the apparatus includes: a processor and a transceiver, the processor being configured to support the chinese named entity recognition apparatus to perform the corresponding functions in the method provided by the first aspect. The transceiver is used for instructing the Chinese named entity recognition device to acquire the text to be recognized related in the method. Optionally, the apparatus may further comprise a memory, coupled to the processor, that stores program instructions and data necessary for the Chinese named entity recognition apparatus.

In one possible implementation, when the apparatus is a chip within a Chinese named entity recognition device, the chip includes a processing module and a transceiver module. The processing module may be, for example, a processor, and the processor is configured to obtain a character vector and a radical vector of each character in the text to be recognized, where the radical vector is a vector of the radical corresponding to each character, and to splice the character vector and the radical vector and input the spliced vector into a bidirectional long short-term memory network to obtain a first word vector. The transceiver module may be, for example, an input/output interface, pin, or circuit on the chip, and transmits the character vector and the radical vector, or the first word vector generated by the processor, to other chips or modules coupled to the chip. The processing module can execute computer-executable instructions stored in a storage unit to support the Chinese named entity recognition device in executing the method provided by the first aspect. Optionally, the storage unit may be a storage unit within the chip, such as a register or a cache, or a storage unit located outside the chip, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).

In one possible implementation, the apparatus includes: a processor, baseband circuitry, radio frequency circuitry, and an antenna. The processor is used for realizing control of functions of each circuit part, the baseband circuit is used for generating a character vector, a radical vector and a first word vector, and the character vector, the radical vector and the first word vector are subjected to analog conversion, filtering, amplification, up-conversion and other processing through the radio frequency circuit and then are sent to the output equipment through the antenna. Optionally, the apparatus further comprises a memory storing program instructions and data necessary for the chinese named entity recognition apparatus.

The processor mentioned in any of the above implementations may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the program of the Chinese named entity recognition method in the above aspects.

In a third aspect, an embodiment of the present application provides a Chinese named entity recognition model, where the model includes a search matrix, a bidirectional long short-term memory network, and a conditional random field model; the search matrix is used for obtaining a character vector and a radical vector of each character in the text to be recognized, wherein the radical vector is a vector of the radical corresponding to each character; the bidirectional long short-term memory network is used for obtaining a first character vector of the character from the spliced character vector and radical vector that are input to it; and the conditional random field model is used for outputting a first entity word of the text to be recognized according to the first character vector.

Optionally, the Chinese named entity recognition model includes a Chinese character-radical mapping table. The mapping table may be obtained by crawling an online Xinhua dictionary or a related medical dictionary to obtain one-to-one correspondences between Chinese characters and radicals, and then storing each Chinese character with its radical; the resulting Chinese character-radical mapping table is stored in the Chinese named entity recognition model.

Optionally, the Chinese named entity recognition model further includes a BERT model, configured to obtain a second word vector of each character in the text to be recognized; the conditional random field model then outputs a second entity word of the text to be recognized according to the spliced first word vector and second word vector that are input to it.

Optionally, the conditional random field model further includes a radical label matrix for calculating the radical label scores of the first word vector, or of the spliced first word vector and second word vector.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where computer instructions are stored, and the computer instructions are configured to execute the method described in any possible implementation manner in the first aspect.

In a fifth aspect, embodiments of the present application provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any one of the above aspects.

According to the technical scheme, the embodiment of the application has the following advantages: the Chinese named entity recognition device encodes the clear radical features of the named entities in the medical field into the character vector, enriches the input features of the named entity recognition model, and thus improves the entity extraction capability of the named entities. Meanwhile, the radical vectors and the character vectors are spliced and then put into the bidirectional long-short term memory network, so that not only can the relationship between the characters be captured, but also the relationship between the radicals and the characters and the relationship between the radicals and the radicals can be captured, and the entity recognition effect is enhanced.

Drawings

FIG. 1 is a system architecture diagram of named entity identification;

FIG. 2 is a schematic diagram of an embodiment of a method for identifying a named entity in Chinese according to an embodiment of the present application;

FIG. 3 is a schematic diagram of another embodiment of a method for identifying a named entity in Chinese according to an embodiment of the present application;

FIG. 4 is a schematic diagram of an embodiment of a Chinese named entity recognition apparatus in the embodiment of the present application;

FIG. 5 is a schematic diagram of another embodiment of a Chinese named entity recognition apparatus in an embodiment of the present application;

FIG. 6 is a system architecture diagram of the Chinese named entity recognition model in an embodiment of the present application;

FIG. 7 is a diagram of another system architecture of the Chinese named entity recognition model in an embodiment of the present application.

Detailed Description

The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

With the rapid development of internet technology, network information is growing exponentially, and a large number of online medical communities and medical question-and-answer websites have emerged, so that a large amount of medical diagnosis information is presented to people in the form of electronic documents. However, unlike databases, most of these medical texts are unstructured. To make full use of the information contained in medical texts, named entity recognition technology is used to extract the useful medical entity words, which is the premise and basis for realizing intelligent medical treatment. The basic principle is shown in fig. 1: a text to be recognized is input into a named entity recognition model, and after feature extraction and calculation the model outputs the entity words of the text to be recognized.

Named entities in the medical field typically include the following categories: symptom descriptions, disease names, body part names, treatment protocol names, and diagnostic means names, among others. For example, in the sentence "He had dizziness and fever for two days, accompanied by palpitations, and was diagnosed with left ventricular valve damage by electrocardiogram and chest X-ray examination", the symptom entities that can be extracted are "dizziness", "fever", and "palpitations"; the diagnostic means entities are "electrocardiogram" and "chest X-ray"; and the disease entity is "left ventricular valve damage". Because the set of symptoms is limited and diseases correspond strongly to symptoms and to diagnostic and treatment means, named entities extracted from medical diagnosis records often have very high practical value. After the extracted medical entities are linked and disambiguated, an intelligent medical knowledge graph can be conveniently constructed for intelligent medical diagnosis or used as an important reference for manual diagnosis and treatment.

Meanwhile, Chinese electronic medical record data differs from general-domain data in its domain specificity: most of the characters that make up entity words in this data come from a limited set and have distinctive features. For example, most Chinese characters in disease entity words have the "疒" (sickness) radical, and most Chinese characters in body part entity words have the "moon" (月) radical. Of course, the radicals of the Chinese characters that make up entity words in the medical field are not limited to these two. For example, the five elements in Chinese, "gold, wood, water, fire, earth" (金木水火土), also often serve as characteristic radicals of various entity words in the medical field: corresponding to gold, the common trace elements in the body and some drug names such as "xxx sodium" and "xxx calcium"; corresponding to wood, names such as "xxx", "charles xxx", "xxx suppository", and "vertebra", and some Chinese patent medicines; corresponding to water, various body fluids (blood plasma, tissue fluid, lymph fluid) and characters representing symptoms such as "wet", "slippery", "overflow", "dissolve", and "ulcerate"; corresponding to fire, the ending character of the names of various inflammations and symptom entity characters such as "scald", "focus", and "rot"; corresponding to earth, symptom characters such as "vertical", "uniform", "type", and "plug" and body part modifiers such as "wall" and "cut". In addition, there are other radicals such as "heart", "", "car", "grain", "bean", "qi", and "mouth"; such radicals are also of significant reference value for identifying medical entity words. However, current research on named entity recognition for Chinese electronic medical records does not jointly consider the particularities of the Chinese language and of the Chinese medical field. Instead, it migrates models designed for general-purpose datasets to the entity types of the medical field, so its effectiveness is limited.

In order to solve the problem, the embodiment of the present application provides the following technical solutions: the Chinese named entity recognition device acquires a character vector and a radical vector of each character in the text to be recognized, wherein the radical vector is a vector of a radical corresponding to each character; then, the Chinese named entity recognition device splices the character vector and the radical vector and inputs the spliced character vector and the radical vector into a bidirectional long-short term memory network to obtain a first character vector; and finally, inputting the first character vector into a conditional random field model and outputting a first entity word in the text to be recognized.

Specifically, referring to fig. 2, an embodiment of the Chinese named entity recognition method in an embodiment of the present application includes:

201. The Chinese named entity recognition device obtains a character vector and a radical vector of each character in the text to be recognized, wherein the radical vector is a vector of the radical corresponding to each character.

After the Chinese named entity recognition device obtains the text to be recognized, it segments the text into individual characters; it then inputs each character of the text to be recognized into a named entity recognition model, and looks up the character vector and radical vector corresponding to each character through a search matrix of the named entity recognition model.

The Chinese named entity recognition device may obtain the character vector and the radical vector as follows: the device converts each character in the text to be recognized into a corresponding first ID/one-hot code, and inputs the first ID/one-hot code into a search matrix of the named entity recognition model to obtain a character vector; at the same time, the device looks up the radical corresponding to each character according to the Chinese character-radical mapping table, converts the radical into a corresponding second ID/one-hot code, and inputs the second ID/one-hot code into a search matrix of the named entity recognition model to obtain a radical vector. The conversion into an ID/one-hot code may be performed as follows: for example, if a dictionary contains 10,000 Chinese characters, each Chinese character corresponds to a specific number. If "I" (我) is the first Chinese character, then "I" can be expressed as a vector of length 10,000 whose first element is 1 and whose other elements are 0. Other Chinese characters can be converted into vectors in the same way.
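The conversion described above can be sketched in Python as follows. The sketch shows that multiplying a one-hot code by the search matrix selects the same row as indexing by ID. The three-character vocabulary, the vector dimension of 4, and the randomly initialized matrix are assumptions for illustration; a trained model would learn the matrix values.

```python
import random


def one_hot(index, size):
    """Return a one-hot code of the given size with a 1 at position index."""
    code = [0.0] * size
    code[index] = 1.0
    return code


def lookup(code, matrix):
    """Multiply a one-hot row vector by a (vocabulary size x dimension) matrix."""
    dimension = len(matrix[0])
    return [sum(code[i] * matrix[i][d] for i in range(len(matrix)))
            for d in range(dimension)]


# Hypothetical three-character dictionary; each character has a fixed ID.
vocab = {"我": 0, "肺": 1, "疼": 2}

rng = random.Random(0)
char_matrix = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in vocab]

code = one_hot(vocab["肺"], len(vocab))
vector_by_product = lookup(code, char_matrix)   # one-hot code times matrix
vector_by_index = char_matrix[vocab["肺"]]      # equivalent row lookup by ID
```

The same two functions apply unchanged to radicals, using a separate radical vocabulary and a separate matrix for the second ID/one-hot code.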

Meanwhile, in this embodiment, the Chinese character-radical mapping table is a module newly added to the named entity recognition model. Specifically, the mapping table between Chinese characters and radicals is established by crawling an online Xinhua dictionary or a related medical dictionary. For example, the radical corresponding to "lung" (肺) is "moon" (月), and the two are stored in the mapping table in a one-to-one correspondence. When the mapping table is established, not every Chinese character has a separate radical; a character without one is stored as its own radical. For example, "day" (日) has no separate radical, so "day" is stored as its own radical, and "day" and "day" are stored in the mapping table in a one-to-one correspondence. Likewise, when a character is a special character, the character is stored in the mapping table with itself as a one-to-one Chinese character-radical pair.

In this embodiment, because simplified Chinese often has more than one written form for the same radical, the traditional form of a radical may be used when generating the Chinese character-radical mapping table, in order to reduce redundancy among radicals. For example, the simplified radical of "lung" (肺) is "moon" (月) and the traditional radical is "meat" (肉), so the mapping table may instead store a one-to-one correspondence between "lung" and "meat".
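The mapping rules described above can be sketched in Python as follows. The hand-written sample entries stand in for a crawled dictionary, and the names `CRAWLED_RADICALS`, `TRADITIONAL_OVERRIDES`, and `build_radical_table` are hypothetical; only the fallback behavior and the "lung" example are taken from the description.

```python
# Hand-written sample entries standing in for a crawled dictionary (illustrative only).
CRAWLED_RADICALS = {
    "肺": "月",  # "lung" -> "moon" radical (simplified form)
    "疼": "疒",  # "ache" -> "sickness" radical
    "钙": "钅",  # "calcium" -> "gold" (metal) radical
}

# Hypothetical per-character overrides that store the traditional radical instead.
TRADITIONAL_OVERRIDES = {
    "肺": "肉",  # "lung" -> "meat" radical (traditional form)
}


def build_radical_table(characters, crawled, overrides=None):
    """Map each character to one radical, falling back to the character itself."""
    overrides = overrides or {}
    table = {}
    for ch in characters:
        if ch in overrides:
            table[ch] = overrides[ch]
        else:
            # Characters without a separate radical, and special characters such
            # as punctuation marks or Latin letters, are stored as their own radical.
            table[ch] = crawled.get(ch, ch)
    return table


text = "肺疼钙日,A"
simplified_table = build_radical_table(text, CRAWLED_RADICALS)
traditional_table = build_radical_table(text, CRAWLED_RADICALS, TRADITIONAL_OVERRIDES)
```

Because every character always receives exactly one entry, the later radical lookup never fails, even for punctuation or foreign letters.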

202. The Chinese named entity recognition device splices the character vector and the radical vector and inputs the spliced character vector and the radical vector into a bidirectional long-short term memory network to obtain a first character vector.

The Chinese named entity recognition device splices the character vector and the radical vector, and then inputs the spliced vector into a bidirectional long short-term memory (BiLSTM) network to obtain a first character vector corresponding to each character of the text to be recognized.

For example, if the radical of "lung" is "meat", then "lung" and "meat" are looked up in the search matrix to obtain vectors v1 and v2, respectively, which are spliced together to obtain a new vector x_a; x_a is then input into the BiLSTM network to obtain an output x_b (i.e., the first word vector). "Lung" may also be passed through the BERT model to obtain an output vector x_c (i.e., the second word vector). Finally, x_b and x_c are spliced together to obtain a vector [x_b, x_c].
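The splicing of v1 and v2 into x_a and the BiLSTM step that produces x_b can be sketched in plain Python as follows. This is a minimal, untrained illustration: the vector dimensions, the random weights, the second example character, and the helper names `make_lstm`, `run_lstm`, and `bilstm` are all assumptions, and a practical system would use a trained network from a deep learning library.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_lstm(input_dim, hidden_dim, rng):
    """Create random (untrained) weights for the four LSTM gates."""
    columns = input_dim + hidden_dim
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(columns)]
               for _ in range(4 * hidden_dim)]
    return {"W": weights, "b": [0.0] * (4 * hidden_dim), "n": hidden_dim}


def run_lstm(params, inputs):
    """Run one LSTM direction over the inputs and return every hidden state."""
    n = params["n"]
    h, c = [0.0] * n, [0.0] * n
    states = []
    for x in inputs:
        z = x + h  # list concatenation of the input and the previous hidden state
        pre = [sum(w * v for w, v in zip(row, z)) + b
               for row, b in zip(params["W"], params["b"])]
        i_gate = [sigmoid(p) for p in pre[:n]]
        f_gate = [sigmoid(p) for p in pre[n:2 * n]]
        g_cand = [math.tanh(p) for p in pre[2 * n:3 * n]]
        o_gate = [sigmoid(p) for p in pre[3 * n:]]
        c = [f * c_prev + i * g
             for f, c_prev, i, g in zip(f_gate, c, i_gate, g_cand)]
        h = [o * math.tanh(c_t) for o, c_t in zip(o_gate, c)]
        states.append(h)
    return states


def bilstm(forward_params, backward_params, inputs):
    """Concatenate left-to-right and right-to-left hidden states per character."""
    forward = run_lstm(forward_params, inputs)
    backward = run_lstm(backward_params, inputs[::-1])[::-1]
    return [hf + hb for hf, hb in zip(forward, backward)]


rng = random.Random(0)
char_dim, radical_dim, hidden_dim = 3, 2, 4
sentence = ["肺", "疼"]   # "lung", "ache"
radicals = ["肉", "疒"]   # radicals from the mapping table

v1 = {ch: [rng.uniform(-1.0, 1.0) for _ in range(char_dim)] for ch in sentence}
v2 = {r: [rng.uniform(-1.0, 1.0) for _ in range(radical_dim)] for r in radicals}

# x_a: character vector spliced with radical vector, one per character.
x_a = [v1[ch] + v2[r] for ch, r in zip(sentence, radicals)]
# x_b: the first word vector of each character, output by the BiLSTM.
x_b = bilstm(make_lstm(char_dim + radical_dim, hidden_dim, rng),
             make_lstm(char_dim + radical_dim, hidden_dim, rng),
             x_a)
```

Each x_b entry has twice the hidden dimension because the forward and backward states are concatenated, which is what lets a character's representation depend on the characters and radicals on both sides of it.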

203. The Chinese named entity recognition device inputs the first character vector into a conditional random field model and outputs a first entity word in the text to be recognized.

The Chinese named entity recognition device inputs the first word vector into the conditional random field model, and then outputs the first entity word in the text to be recognized through calculation of the conditional random field model.

The specific algorithm by which the Chinese named entity recognition device inputs the first word vector into the conditional random field model and outputs the first entity word of the text to be recognized may be as follows: the Chinese named entity recognition device inputs the first word vector into the conditional random field model and obtains a labeling score for the first word vector through a scoring function; it then determines the first entity word of the text to be recognized according to the labeling score. The scoring function sums, over the characters of the sentence, the label labeling score of each character, the transition score between adjacent labels, and the radical label score of each character.
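Determining the labels from the labeling score can be sketched as a dynamic-programming (Viterbi) search for the highest-scoring label sequence. The sketch below is one common way to decode a conditional random field, not necessarily the procedure used in the application; the dedicated start label, the label names, and all numeric values are assumptions chosen so that the radical label score changes the result.

```python
def add_radical_scores(emissions, radical_label, radical_rows):
    """Add the radical label score to each character's label scores when available."""
    scores = []
    for row, r in zip(emissions, radical_rows):
        if r is None:
            scores.append(list(row))
        else:
            scores.append([e + bonus for e, bonus in zip(row, radical_label[r])])
    return scores


def viterbi_decode(node_scores, transition, start_label=0):
    """Return (best score, best label path) under node plus transition scores."""
    candidates = [k for k in range(len(transition)) if k != start_label]
    # best[k]: highest score of any path that ends in label k at this position.
    best = {k: transition[start_label][k] + node_scores[0][k] for k in candidates}
    backpointers = []
    for t in range(1, len(node_scores)):
        new_best, pointers = {}, {}
        for k in candidates:
            prev = max(candidates, key=lambda j: best[j] + transition[j][k])
            new_best[k] = best[prev] + transition[prev][k] + node_scores[t][k]
            pointers[k] = prev
        best = new_best
        backpointers.append(pointers)
    last = max(candidates, key=lambda k: best[k])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path


# Hypothetical labels: 0 = assumed start label, 1 = "O", 2 = "BODY".
emissions = [[0.0, 0.6, 0.5],
             [0.0, 0.7, 0.1]]
transition = [[0.0, 0.0, 0.0] for _ in range(3)]
radical_label = [[0.0, 0.0, 0.4]]  # one tracked radical that favors "BODY"

score_plain, path_plain = viterbi_decode(
    add_radical_scores(emissions, radical_label, [None, None]), transition)
score_radical, path_radical = viterbi_decode(
    add_radical_scores(emissions, radical_label, [0, None]), transition)
```

In this toy input the first character is labeled "O" without the radical term and "BODY" with it, which illustrates how the radical label matrix can influence the decoded labels.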

Specifically, referring to fig. 3, another embodiment of the Chinese named entity recognition method in an embodiment of the present application includes:

301. The Chinese named entity recognition device obtains a character vector and a radical vector of each character in the text to be recognized, wherein the radical vector is a vector of the radical corresponding to each character.

302. The Chinese named entity recognition device splices the character vector and the radical vector and inputs the spliced character vector and the radical vector into a bidirectional long-short term memory network to obtain a first character vector.

For example, if the radical of "lung" is "meat", the vectors v1 and v2 for "lung" and "meat" are obtained from the lookup matrix and spliced together into a new vector x_a, which is then input into the BiLSTM network to obtain an output x_b (i.e., the first word vector).
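The step from the spliced vectors to the BiLSTM output can be sketched with a small bidirectional LSTM in plain Python. This is a simplified illustration under stated assumptions: the weights are random rather than trained, bias terms are omitted for brevity, the dimensions are arbitrary, and the names init_lstm, lstm_step, run_lstm and bilstm are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(weights, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def init_lstm(input_dim, hidden_dim, rng):
    """One weight matrix per gate, applied to the concatenation [x, h]."""
    def mat():
        return [[rng.uniform(-0.1, 0.1) for _ in range(input_dim + hidden_dim)]
                for _ in range(hidden_dim)]
    return {gate: mat() for gate in ("i", "f", "o", "g")}


def lstm_step(params, x, h, c):
    """Standard LSTM cell update (bias terms omitted)."""
    xh = x + h
    i = [sigmoid(z) for z in matvec(params["i"], xh)]    # input gate
    f = [sigmoid(z) for z in matvec(params["f"], xh)]    # forget gate
    o = [sigmoid(z) for z in matvec(params["o"], xh)]    # output gate
    g = [math.tanh(z) for z in matvec(params["g"], xh)]  # candidate cell state
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


def run_lstm(params, xs, hidden_dim):
    h, c = [0.0] * hidden_dim, [0.0] * hidden_dim
    outputs = []
    for x in xs:
        h, c = lstm_step(params, x, h, c)
        outputs.append(h)
    return outputs


def bilstm(fwd, bwd, xs, hidden_dim):
    """Concatenate left-to-right and right-to-left hidden states per position."""
    f_out = run_lstm(fwd, xs, hidden_dim)
    b_out = run_lstm(bwd, xs[::-1], hidden_dim)[::-1]
    return [f + b for f, b in zip(f_out, b_out)]


rng = random.Random(0)
INPUT_DIM, HIDDEN = 3, 2  # illustrative sizes
fwd = init_lstm(INPUT_DIM, HIDDEN, rng)
bwd = init_lstm(INPUT_DIM, HIDDEN, rng)
xs = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]  # one x_a per character
x_b = bilstm(fwd, bwd, xs, HIDDEN)  # one first word vector per character
```

Each output vector combines context from both directions, which is why the spliced radical information of neighboring characters can influence the vector of a given character.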

303. The Chinese named entity recognition device inputs each character in the text to be recognized into a pre-training language model to obtain a second word vector.

For example, passing "lung" through the pre-training language model may yield an output vector x_c (i.e., the second word vector).

It is understood that, in this embodiment, the pre-training language model may be a Bidirectional Encoder Representations from Transformers (BERT) model, which is not limited herein. Meanwhile, in this embodiment, as long as the first word vector and the second word vector can both be obtained, the order in which they are obtained is not limited. That is, the order of execution between steps 301 to 302 and step 303 is not limited.

304. The Chinese named entity recognition device splices the first word vector and the second word vector, inputs the spliced vector into the conditional random field model, and outputs a second entity word in the text to be recognized.

The Chinese named entity recognition device splices the first word vector and the second word vector to generate a new word vector. For example, if the radical of "lung" is "meat", the vectors v1 and v2 for "lung" and "meat" are obtained from the lookup matrix and spliced together into a new vector x_a, which is then input into the BiLSTM network to obtain an output x_b (i.e., the first word vector); "lung" is processed by the BERT model to obtain an output vector x_c (i.e., the second word vector); finally, x_b and x_c are spliced together to obtain the vector [x_b, x_c] (i.e., the newly generated word vector). The Chinese named entity recognition device then inputs the newly generated word vector into the conditional random field model for calculation to obtain a second entity word of the text to be recognized.
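Because the order of steps 301 to 302 and step 303 is not limited, the two branches can be computed independently and joined only at the splicing step. The sketch below illustrates this with two placeholder functions; first_word_vectors and second_word_vectors are hypothetical stand-ins that return toy values, not the BiLSTM or BERT outputs themselves.

```python
from concurrent.futures import ThreadPoolExecutor


def first_word_vectors(text):
    """Placeholder for steps 301-302: one toy vector x_b per character."""
    return [[float(ord(ch) % 7), 1.0] for ch in text]


def second_word_vectors(text):
    """Placeholder for step 303: one toy vector x_c per character."""
    return [[float(len(text)), 0.5, float(i)] for i, _ in enumerate(text)]


def splice_branches(text):
    """Run both branches in either order, then build [x_b, x_c] per character."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_b = pool.submit(first_word_vectors, text)
        future_c = pool.submit(second_word_vectors, text)
        x_b, x_c = future_b.result(), future_c.result()
    if len(x_b) != len(x_c):
        raise ValueError("both branches must yield one vector per character")
    return [b + c for b, c in zip(x_b, x_c)]


spliced = splice_branches("肺炎")
```

Each element of spliced is the vector for one character, with the first word vector in front and the second word vector behind it, ready to be scored by the conditional random field model.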

A specific algorithm by which the Chinese named entity recognition device inputs the first word vector and the second word vector into the conditional random field model and outputs the second entity word of the text to be recognized may also be as follows: the Chinese named entity recognition device splices the first word vector and the second word vector, inputs the spliced word vector into the conditional random field model, and obtains a labeling score of the spliced word vector through a scoring function; it then determines the second entity word of the text to be recognized according to the labeling score; wherein the scoring function is:

wherein, the

In this embodiment, the Chinese named entity recognition device encodes the distinctive radical features of named entities in the medical field into the character vector and additionally incorporates the word vector output by the pre-training language model, which enriches the input features of the named entity recognition model and improves its entity extraction capability. Meanwhile, because the radical vectors and the character vectors are spliced before being fed into the bidirectional long-short term memory network, the network can capture not only the relationships between characters but also the relationships between radicals and characters and between radicals, which enhances the entity recognition effect.

The above describes the method for identifying a named entity in chinese in the embodiment of the present application, and the following describes a device for identifying a named entity in chinese.

Specifically, referring to fig. 4, the apparatus 400 for identifying a named entity in chinese in the embodiment of the present application includes: an acquisition module 401, a processing module 402 and a calculation module 403. The device 400 may be the chinese named entity recognition device in the above method embodiments, or may be one or more chips. The apparatus 400 may be used to perform some or all of the functions of the chinese named entity recognition apparatus in the above-described method embodiments.

For example, the acquisition module 401 may be configured to perform step 201 or step 301 in the foregoing method embodiments. For example, the acquisition module 401 obtains a character vector and a radical vector of each character in the text to be recognized, where the radical vector is a vector of a radical corresponding to each character.

The processing module 402 may be configured to perform step 202 or steps 302 and 303 in the foregoing method embodiments. For example, the processing module 402 splices the character vector and the radical vector, and inputs the spliced vector into a bidirectional long-short term memory network to obtain a first word vector.

The calculation module 403 may be configured to perform step 203 or step 304 in the foregoing method embodiments. For example, the calculation module 403 inputs the first word vector into a conditional random field model and outputs a first entity word in the text to be recognized.

Optionally, the apparatus 400 further comprises a storage module coupled to the processing module, such that the processing module can execute the computer executable instructions stored in the storage module to implement the functions of the apparatus for identifying a named entity in chinese in the above-mentioned method embodiments. In an example, the storage module optionally included in the apparatus 400 may be a storage unit inside the chip, such as a register, a cache, and the like, and the storage module may also be a storage unit outside the chip, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a Random Access Memory (RAM), and the like.

It should be understood that the flow executed between the modules of the chinese named entity recognition device in the embodiment corresponding to fig. 4 is similar to the flow executed by the chinese named entity recognition device in the corresponding method embodiment of fig. 2 to fig. 3, and detailed description thereof is omitted here.

FIG. 5 is a schematic diagram of a possible structure of a Chinese named entity recognition apparatus 500 in the above embodiment, where the apparatus 500 may be configured as the Chinese named entity recognition apparatus. The apparatus 500 may comprise: a processor 502, a computer-readable storage medium/memory 503, a transceiver 504, an input device 505, and an output device 506, and a bus 501. Wherein the processor, transceiver, computer readable storage medium, etc. are connected by a bus. The embodiments of the present application do not limit the specific connection medium between the above components.

In one example, the input device 505 inputs the text to be recognized; the processor 502 obtains a character vector and a radical vector of each character in the text to be recognized, wherein the radical vector is a vector of a radical corresponding to each character; splicing the character vector and the radical vector, and inputting the spliced character vector and the radical vector into a bidirectional long-short term memory network to obtain a first character vector; inputting the first character vector into a conditional random field model and outputting a first entity word in the text to be recognized; the output device 506 outputs the first entity word.

In one example, the processor 502 may include baseband circuitry, e.g., may perform data encapsulation, encoding, etc. of the first entity word according to a protocol to generate a data packet.

In yet another example, the processor 502 may run an operating system that controls functions between various devices and appliances. The transceiver 504 may include baseband circuitry and radio frequency circuitry, for example, where data packets generated for the first entity word may be processed by the baseband circuitry and the radio frequency circuitry to be transmitted to other possible display devices.

The apparatus for identifying a named entity in chinese may implement the corresponding steps in any one of the embodiments of fig. 2 to fig. 3, which are not described herein in detail.

It is understood that fig. 5 merely illustrates a simplified design of a chinese named entity recognition device, and that in practical applications, the chinese named entity recognition device may comprise any number of transceivers, processors, memories, etc., and all chinese named entity recognition devices that may implement the present application are within the scope of the present application.

The processor 502 involved in the apparatus 500 may be a general-purpose processor, such as a general-purpose central processing unit (CPU), a network processor (NP), or a microprocessor, or may be an application-specific integrated circuit (ASIC) or one or more integrated circuits for controlling the execution of the program according to the present disclosure. It may also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The controller/processor may also be a combination that implements computing functions, e.g., a combination of one or more microprocessors, or a combination of a DSP and a microprocessor. Processors typically perform logical and arithmetic operations based on program instructions stored within memory.

The bus 501 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 5, but this is not intended to represent only one bus or type of bus.

The computer-readable storage medium/memory 503 referred to above may also hold an operating system and other application programs. In particular, the program may include program code including computer operating instructions. More specifically, the memory may be a read-only memory (ROM), other types of static storage devices that may store static information and instructions, a Random Access Memory (RAM), other types of dynamic storage devices that may store information and instructions, a disk memory, and so forth. The memory 503 may be a combination of the above memory types. And the computer-readable storage medium/memory described above may be in the processor, may be external to the processor, or distributed across multiple entities including the processor or processing circuitry. The computer-readable storage medium/memory described above may be embodied in a computer program product. By way of example, a computer program product may include a computer-readable medium in packaging material.

Alternatively, embodiments of the present application also provide a general-purpose processing system, such as that commonly referred to as a chip, including one or more microprocessors that provide processor functionality; and an external memory providing at least a portion of the storage medium, all connected together with other supporting circuitry through an external bus architecture. The memory-stored instructions, when executed by the processor, cause the processor to perform some or all of the steps of the chinese named entity recognition apparatus in the embodiments described in fig. 2-3, such as steps 201-203 in fig. 2, steps 302-304 in fig. 3, and/or other processes for the techniques described herein.

The steps of a method or algorithm described in connection with the disclosure herein may be embodied in hardware or in software instructions executed by a processor. The software instructions may consist of corresponding software modules that may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. Additionally, the ASIC may reside in a Chinese named entity recognition device. Of course, the processor and the storage medium may also reside as discrete components in the Chinese named entity recognition device.

It is understood that, as shown in fig. 6, the embodiment of the present application further provides a Chinese named entity recognition model 600, where the structure of the model 600 includes a lookup matrix 601, a bidirectional long-short term memory network 602, and a conditional random field model 603; the lookup matrix 601 is used for obtaining a character vector and a radical vector of each character in the text to be recognized, wherein the radical vector is a vector of a radical corresponding to each character; the bidirectional long-short term memory network 602 is configured to obtain a first word vector of the character from the spliced character vector and radical vector that are input to it; the conditional random field model 603 is used to output a first entity word of the text to be recognized according to the first word vector.

Optionally, the Chinese named entity recognition model 600 includes a Chinese character-radical mapping table. The mapping table may be obtained by crawling an online noval dictionary or a related medical dictionary and storing each Chinese character with its radical in one-to-one correspondence; the resulting Chinese character-radical mapping table is then stored in the Chinese named entity recognition model. The Chinese character-radical mapping table may be used by the lookup matrix 601.
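A character-radical mapping table of this kind can be sketched as a plain dictionary. The entries below are a small hand-written sample given only for illustration (they are not crawled data), and the names CHAR_TO_RADICAL, radical_of and radicals_for are hypothetical. The fallback follows the rule stated in claim 3: a character with no radical, or a special character, is mapped to itself.

```python
# Hand-written sample entries; a real table would be built from a dictionary source.
CHAR_TO_RADICAL = {
    "肺": "肉",  # lung  -> meat radical
    "肝": "肉",  # liver -> meat radical
    "病": "疒",  # disease -> sickness radical
    "咳": "口",  # cough -> mouth radical
}


def radical_of(char, table=CHAR_TO_RADICAL):
    """Return the radical of a character, or the character itself if unmapped."""
    return table.get(char, char)


def radicals_for(text, table=CHAR_TO_RADICAL):
    """Map every character of the text to its radical, one-to-one."""
    return [radical_of(ch, table) for ch in text]


radicals = radicals_for("肺病3%")
```

Digits, punctuation and other characters absent from the table pass through unchanged, so every character always has a radical entry to look up.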

Optionally, as shown in fig. 7, the chinese named entity recognition model 600 further includes a pre-training language model 604, configured to obtain a second word vector of each character in the text to be recognized; then, the conditional random field model 603 outputs a second entity word of the text to be recognized according to the input spliced first word vector and the second word vector.

Optionally, the conditional random field model 603 further includes a radical label matrix for calculating radical label scores of the first word vector or the spliced first word vector and the second word vector.
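The exact scoring function is not reproduced in this text, so the following is only a sketch of one plausible additive form, stated here as an assumption: the score of a label sequence sums, for each position t, the network score of the chosen label, the transition score from the previous label taken from the transition matrix T, and a radical label score taken from the radical label matrix R at row r(t), the index of the radical of the t-th character. The function name path_score and all toy values are hypothetical.

```python
def path_score(emissions, labels, radical_ids, T, R):
    """Score one label sequence under the assumed additive form.

    emissions[t][y]: network score of label y at position t
    labels[t]:       label index chosen for the t-th character
    radical_ids[t]:  r(t), the radical index of the t-th character
    T[i][j]:         transition score from label i to label j
    R[r][y]:         radical label score of label y for radical r
    """
    total = 0
    for t, y in enumerate(labels):
        total += emissions[t][y]       # per-character network score
        total += R[radical_ids[t]][y]  # radical label term (assumed additive)
        if t > 0:
            total += T[labels[t - 1]][y]  # label transition term
    return total


# Toy values: labels are indexed O=0, B=1, I=2; radicals are indexed 0 and 1.
emissions = [[0, 2, 0], [0, 0, 2]]
T = [[0, 0, -10], [0, 0, 1], [0, 0, 1]]
R = [[0, 1, 1], [0, 0, 0]]  # radical 0 slightly favors entity labels
score_with_radicals = path_score(emissions, [1, 2], [0, 1], T, R)
score_without = path_score(emissions, [1, 2], [0, 1], T, [[0, 0, 0], [0, 0, 0]])
```

In this toy case the radical label matrix raises the score of the B, I labeling from 5 to 6, which illustrates how a radical that is typical of medical entities could tilt the decision toward an entity label.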

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part thereof that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other various media capable of storing program code.

The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims

1. A Chinese named entity recognition method, comprising the following steps:

acquiring a character vector and a radical vector of each character in a text to be recognized, wherein the radical vector is a vector of a radical corresponding to each character;

splicing the character vector and the radical vector, and inputting the spliced character vector and the radical vector into a bidirectional long-short term memory network to obtain a first word vector;

and inputting the first word vector into a conditional random field model to output a first entity word of the text to be recognized.

2. The method according to claim 1, wherein the obtaining of the character vector and the radical vector of each character in the text to be recognized comprises:

converting characters in the text to be recognized into corresponding first ID/one-hot codes, and obtaining the character vectors by the first ID/one-hot codes through a search matrix of a named entity recognition model;

determining radicals corresponding to characters in the text to be recognized according to a Chinese character-radical mapping table, wherein the Chinese character-radical mapping table is used for indicating one-to-one correspondence between the characters and the radicals;

and converting the radical into a corresponding second ID/one-hot code, and obtaining the radical vector by the second ID/one-hot code through a search matrix of the named entity recognition model.

3. The method of claim 2, wherein when the character has no radical or is a special character, the radical corresponding to the character is the character itself.

4. The method according to any one of claims 1 to 3, further comprising:

inputting the characters into a pre-training language model to obtain a second word vector;

and splicing the first word vector and the second word vector, inputting the conditional random field model and outputting a second entity word of the text to be recognized.

5. The method of any one of claims 1 to 4, wherein said inputting the first word vector into the conditional random field model to output the first entity word of the text to be recognized comprises:

inputting the first word vector into the conditional random field model to obtain a labeling score of the first word vector through a scoring function;

determining the first entity word according to the labeling score;

the scoring function is:

wherein [i]_t denotes the label assigned to the t-th character, r(t) denotes the index of the radical of the t-th character, T is the transition matrix, and R is the radical label matrix.

6. The method of claim 5, wherein the radical setting rule in the radical label matrix comprises at least one of:

the radical set in the radical label matrix is determined according to the characters of the medical record;

the number of radicals in the set of radicals in the radical label matrix is such that the computational size of the transition matrix and the radical label matrix are on the same order of magnitude.

7. A chinese named entity recognition device, comprising:

the acquiring module is used for acquiring a character vector and a radical vector of each character in the text to be recognized, wherein the radical vector is a vector of a radical corresponding to each character;

the processing module is used for splicing the character vector and the radical vector acquired by the acquisition module and inputting the spliced character vector and the radical vector into a bidirectional long-short term memory network to obtain a first character vector;

and the calculation module is used for inputting the first character vector obtained by the processing module into the conditional random field model and outputting a first entity word in the text to be recognized.

8. The apparatus according to claim 7, wherein the obtaining module is specifically configured to convert characters in the text to be recognized into corresponding first ID/one-hot codes, and obtain the character vector from the first ID/one-hot codes through a lookup matrix of a named entity recognition model;

9. The apparatus of claim 8, wherein when the character has no radical or is a special character, the radical corresponding to the character is the character itself.

10. The apparatus according to any one of claims 7 to 9, wherein the obtaining module is further configured to input the character into a pre-training language model to obtain a second word vector;

the calculation module is further configured to splice the first word vector and the second word vector, input the conditional random field model, and output a second entity result.

11. The apparatus according to any one of claims 7 to 10, wherein the calculation module is specifically configured to input the first word vector into the conditional random field model to obtain a labeling score of the first word vector through a scoring function, and to determine the first entity result according to the labeling score;

the scoring function is:

wherein, the

12. The apparatus of claim 11, wherein the radical setting rule in the radical tag matrix comprises at least one of: the radical set in the radical label matrix is determined according to the characters of the medical record;

13. A computer-readable storage medium having stored thereon computer instructions for performing the method of any of claims 1-6.

14. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any of claims 1 to 6.