CN110717331B - Chinese named entity recognition method, device and equipment based on neural network and storage medium - Google Patents
- Publication number
- Publication: CN110717331B; Application: CN201911000998.4A
- Authority
- CN
- China
- Prior art keywords
- word
- character
- sentence
- neural network
- character position
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Machine Translation (AREA)
- Character Discrimination (AREA)
Abstract
The invention relates to the technical field of Chinese language processing and recognition, and discloses a Chinese named entity recognition method, device, equipment and storage medium based on a neural network. The invention provides a novel neural-network deep-learning method for improving the recognition rate of Chinese named entities by jointly exploiting character and word features: before model training, the data to be trained are preprocessed so that the training samples contain character position identification vectors that serve as word boundary information. This ensures that the trained Chinese named entity recognition model achieves a very high recognition rate, and the recognition model can convert an input text into named entity labels. The invention thereby solves the problems of the prior art, which cannot exploit the word information in sentences, leaving the recognition effect deficient and limiting improvement of the recognition rate, and the method is convenient for practical application and popularization. In addition, the Chinese named entity recognition method is easy to implement and has low development and operation costs.
Description
Technical Field
The invention belongs to the technical field of Chinese language processing and recognition, and particularly relates to a Chinese named entity recognition method, device, equipment and storage medium based on a neural network.
Background
Named entity recognition (Named Entity Recognition, NER for short) is a basic task of natural language processing: identifying proper nouns and phrases in natural language text and categorizing them. As more and more researchers propose various model structures in the NER field, using neural networks and deep learning to address the NER problem has become the major trend.
Character-based methods and word-based methods are currently the two mainstream approaches. A word-based method requires a word segmentation tool, but such tools are imperfect: once segmentation is wrong, the prediction of entity boundaries is directly affected and recognition fails. A character-based method, which trains in units of characters, has a larger training scale and a longer training time, yet research has shown that for Chinese named entity recognition the character-based approach outperforms the word-based one. However, a character-based method cannot exploit the information of the words in a sentence (in fact, providing word boundary information can effectively improve the recognition rate), which leaves the recognition effect deficient and limits improvement of the recognition rate.
Disclosure of Invention
The invention aims to solve the problem that existing character-based Chinese named entity recognition methods cannot exploit the information of words in sentences, so that the recognition effect is deficient and improvement of the recognition rate is limited.
The technical scheme adopted by the invention is as follows:
a Chinese named entity recognition method based on a neural network comprises the following steps:
s101, preprocessing data to be trained to obtain character feature identification vectors and character position identification vectors of all sentences, wherein the character feature identification vectors comprise character feature unique ID numbers of all words in corresponding sentences, and the character position identification vectors comprise character position unique ID numbers of all words in corresponding sentences;
s102, taking the character feature identification vector and the character position identification vector of each sentence as training samples, and importing a multi-layer neural network model for training to obtain a Chinese named entity recognition model;
s103, performing Chinese named entity recognition on the target text by using the Chinese named entity recognition model to obtain an entity labeling result.
Preferably, in the step S101, the character feature identification vector of each sentence is obtained according to the following steps:
s1011, carrying out sentence segmentation on the data to be trained to obtain a plurality of sentences;
s1012, performing word segmentation processing on each sentence to separate words from each other;
s1013, counting all words, and distributing a character characteristic unique ID number for each word;
s1014, generating the character feature identification vector according to the unique ID number of the corresponding character feature of each word in the corresponding sentence for each sentence.
Preferably, in the step S101, the character position identification vector of each sentence is obtained according to the following steps:
s1021, sentence segmentation is carried out on the data to be trained to obtain a plurality of sentences;
s1022, performing full-mode word segmentation processing based on word segmentation tools on each sentence to obtain a plurality of words;
s1023, marking the position of each word in the belonged word for each sentence, and then splicing the position marking information into a character position label of the corresponding word according to the sequence of the belonged word in the corresponding sentence;
s1024, counting all character position labels, and distributing a character position unique ID number for each character position label;
s1025, generating the character position identification vector according to the unique ID number of the corresponding character position of each word in the corresponding sentence for each sentence.
Further preferably, in said step S1023, the position of each character in the word it belongs to is marked as follows: the position mark information of the character within the word is formed by splicing a word-initial symbol, a word-medial symbol, a word-final symbol or a non-word symbol with the word length and the position serial number, wherein the position serial number refers to the ordinal position of the character within the word it belongs to.
Preferably, the step S102 includes the following steps:
s201, after the character feature identification vector and the character position identification vector are spliced, the character feature identification vector and the character position identification vector are imported into the multi-layer neural network model for training, and then an identification model containing a hidden layer vector is output;
s202, marking the entity of each character by using a conditional random field, and marking entity information in a sentence sequence;
s203, obtaining a group of optimal data weights through repeated training, and obtaining the Chinese named entity recognition model with highest recognition accuracy.
Preferably, the step S103 includes the following steps:
s301, carrying out word-by-word serialization labeling on the target text by applying the Chinese named entity recognition model, and then converting the character strings into entities to obtain entity labeling results.
Specifically, the multi-layer neural network model is a CNN neural network model, a GRU neural network model, a bidirectional LSTM neural network model, a Transformer neural network model or a BERT neural network model.
The other technical scheme adopted by the invention is as follows:
a Chinese named entity recognition device based on a neural network comprises a data preprocessing module, a model training module and an entity labeling module which are sequentially communicated;
the data preprocessing module is used for preprocessing data to be trained to obtain character feature identification vectors and character position identification vectors of all sentences, wherein the character feature identification vectors comprise character feature unique ID numbers of all words in corresponding sentences, and the character position identification vectors comprise character position unique ID numbers of all words in corresponding sentences;
the model training module is used for taking the character feature identification vector and the character position identification vector of each sentence as training samples, importing the training samples into a multi-layer neural network model for training, and obtaining a Chinese named entity recognition model;
and the entity labeling module is used for carrying out Chinese named entity recognition on the target text by applying the Chinese named entity recognition model to obtain an entity labeling result.
The other technical scheme adopted by the invention is as follows:
the Chinese named entity recognition device based on the neural network comprises a memory and a processor which are connected in communication, wherein the memory is used for storing a computer program, and the processor is used for executing the computer program to realize the steps of the Chinese named entity recognition method based on the neural network.
The other technical scheme adopted by the invention is as follows:
A storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the neural network-based Chinese named entity recognition method described above.
The beneficial effects of the invention are as follows:
(1) The invention provides a novel neural-network deep-learning method for improving the recognition rate of Chinese named entities by jointly exploiting character and word features: before model training, the data to be trained are preprocessed so that the training samples contain character position identification vectors serving as word boundary information. This ensures that the trained Chinese named entity recognition model achieves a very high recognition rate; the recognition model can convert an input text into named entity labels, i.e., the text to be recognized is fed into the trained Chinese named entity recognition model and the model converts it into the corresponding label text. This solves the problems of the prior art, which cannot exploit word information in sentences, leaving the recognition effect deficient and limiting improvement of the recognition rate, and the method is convenient for practical application and popularization;
(2) The Chinese named entity recognition method is easy to implement, has low development and operation costs, can provide a Chinese entity recognition service on a single server, offers high speed and accuracy, and can also be applied to other NLP tasks.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention, and other drawings can be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a schematic flow chart of a method for identifying a Chinese named entity.
FIG. 2 is an exemplary diagram of a full-mode word segmentation result provided by the present invention.
Fig. 3 is a schematic structural diagram of the Chinese named entity recognition device provided by the present invention.
Fig. 4 is a schematic structural diagram of the Chinese named entity recognition equipment provided by the present invention.
Detailed Description
The invention is further described with reference to the drawings and specific examples. It should be noted that the description of these examples is for aiding in understanding the present invention, but is not intended to limit the present invention. Specific structural and functional details disclosed herein are merely representative of example embodiments of the invention. This invention may, however, be embodied in many alternate forms and should not be construed as limited to the embodiments set forth herein.
It should be understood that some of the processes described herein include a plurality of operations that appear in a particular order, but these operations may be performed out of the order in which they appear herein, or in parallel. The sequence numbers of the operations, such as S101 and S102, are merely used to distinguish the operations from one another and do not by themselves represent any order of execution. In addition, these processes may include more or fewer operations, and those operations may likewise be performed in sequence or in parallel.
It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another element. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of example embodiments of the present invention.
It should be understood that the term "and/or" is merely an association relationship describing the associated object, and means that three relationships may exist, for example, a and/or B may mean: the terms "/and" herein describe another associative object relationship, indicating that there may be two relationships, e.g., a/and B, may indicate that: the character "/" herein generally indicates that the associated object is an "or" relationship.
It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. In contrast, when an element is referred to as being "directly connected" or "directly coupled" to another element, there are no intervening elements. Other words used to describe relationships between elements (e.g., "between" versus "directly between", "adjacent" versus "directly adjacent", etc.) should be interpreted in a similar manner.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments of the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates to the contrary. It will be further understood that the terms "comprises," "comprising," "includes" and/or "including," when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, and do not preclude the presence or addition of one or more other features, numbers, steps, operations, elements, components, and/or groups thereof.
It should also be noted that in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may in fact be executed substantially concurrently or the figures may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
In the following description, specific details are provided to provide a thorough understanding of example embodiments. However, it will be understood by those of ordinary skill in the art that the example embodiments may be practiced without these specific details. For example, a system may be shown in block diagrams in order to avoid obscuring the examples with unnecessary detail. In other instances, well-known processes, structures, and techniques may be shown without unnecessary detail in order to avoid obscuring the example embodiments.
Example 1
As shown in Figs. 1 and 2, the neural network-based Chinese named entity recognition method provided in this embodiment may include, but is not limited to, the following steps.
S101, preprocessing data to be trained to obtain character feature identification vectors and character position identification vectors of all sentences, wherein the character feature identification vectors comprise character feature unique ID numbers of all words in corresponding sentences, and the character position identification vectors comprise character position unique ID numbers of all words in corresponding sentences.
In the step S101, the data to be trained may be composed of various document data provided by a user or collected by existing collection software, wherein the document data may consist of, but is not limited to, some or all of the following fields: title, abstract, keywords, body, attachment title, attachment content, author information, and the like.
In the step S101, it is preferable to obtain the character feature identification vectors of the respective sentences according to the following steps S1011 to S1014: S1011, performing sentence segmentation on the data to be trained to obtain a plurality of sentences; S1012, performing word segmentation processing on each sentence to separate the words from each other; S1013, counting all characters and assigning a character feature unique ID number to each character; S1014, for each sentence, generating the character feature identification vector from the character feature unique ID numbers of its characters. In detail, sentence segmentation and word segmentation are both conventional techniques. In step S1013, the assignment may be performed, but is not limited to, as follows: count the total number of distinct characters as m, and then assign the integers 0 to m-1 to the characters one by one in order. Thus, step S1013 yields a character table containing all characters and their character feature unique ID numbers, and in the subsequent step S1014 this table is used to encode each character of each sentence, producing the character-level character feature identification vector of the sentence.
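Steps S1011 to S1014 can be sketched as follows. This is a minimal illustrative implementation, not the patent's own code; the function names are made up, and IDs are assigned in first-seen order as one instance of the "0 to m-1" scheme described above.

```python
# Sketch of steps S1011-S1014: build a character table assigning each distinct
# character a unique feature ID in [0, m-1], then encode each sentence as its
# character feature identification vector. Names are illustrative assumptions.

def build_char_table(sentences):
    """Count all characters and assign each a unique feature ID (0..m-1)."""
    table = {}
    for sentence in sentences:
        for ch in sentence:
            if ch not in table:
                table[ch] = len(table)  # IDs follow first-seen order
    return table

def encode_sentence(sentence, char_table):
    """Compile a sentence into its character feature identification vector."""
    return [char_table[ch] for ch in sentence]

sentences = ["南京市长江大桥", "大桥很长"]
table = build_char_table(sentences)
vectors = [encode_sentence(s, table) for s in sentences]
```

Note how shared characters (e.g. "大", "桥", "长") reuse the same ID across sentences, which is what lets the model learn character features across the corpus.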
In the step S101, the character position identification vector of each sentence is preferably obtained according to the following steps S1021 to S1025: S1021, performing sentence segmentation on the data to be trained to obtain a plurality of sentences; S1022, performing full-mode word segmentation based on a word segmentation tool on each sentence to obtain a plurality of words; S1023, for each sentence, marking the position of each character in the word it belongs to, and then splicing the position mark information into the character position label of the corresponding character according to the order of the words in the sentence; S1024, counting all character position labels and assigning a character position unique ID number to each; S1025, for each sentence, generating the character position identification vector from the character position unique ID numbers of its characters. In detail, sentence segmentation and full-mode word segmentation are conventional techniques; for full-mode segmentation, as shown in the example of Fig. 2, the sentence "南京市长江大桥" (Nanjing City Yangtze River Bridge) yields words such as "南京" (Nanjing), "南京市" (Nanjing City), "市长" (mayor), "长江大桥" (Yangtze River Bridge) and "大桥" (bridge). In step S1023, it may be further preferable to mark the position of each character in the word it belongs to as follows: the position mark information is formed by splicing a word-initial symbol, a word-medial symbol, a word-final symbol or a non-word symbol with the word length and the position serial number, wherein the position serial number refers to the ordinal position of the character within the word; see the character position label examples in Table 1 below:
Table 1: Character position label examples
As shown in Table 1 above, the word-initial symbol B, the word-medial symbol M, the word-final symbol E or the non-word symbol S may be spliced with the word length and the position serial number to form the position mark information of a character within a word. For example, the character "市" (city) belongs to two words, "南京市" (Nanjing City) and "市长" (mayor): it is at the end (E) of the former and the beginning (B) of the latter; the lengths of the two words are 3 and 2, respectively; and its positions within the two words are 3 and 1, i.e., L3 and L1. Hence the position mark information of "市" in "南京市" is "E3L3", its position mark information in "市长" is "B2L1", and the finally spliced character position label is "[E3L3-B2L1]". In step S1024, the assignment may be performed, but is not limited to, as follows: count the total number of character position labels as n, and then assign the integers 0 to n-1 to the labels one by one in order. Thus, step S1024 yields a character position table containing all character position labels and their character position unique ID numbers, and in the subsequent step S1025 this table is used to encode each character of each sentence, producing the character-level character position identification vector of the sentence.
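The label construction above can be sketched as follows. This is an illustrative reading of the patent's scheme, not its code: word spans are supplied directly here, whereas the patent obtains them from a full-mode segmentation tool, and the helper names are assumptions.

```python
# Sketch of step S1023: derive each character's position label from a full-mode
# segmentation. Label format follows the patent's example: symbol (B/M/E/S) +
# word length + "L" + position within the word, with marks joined by "-".

def char_mark(position, length):
    """Position mark of one character inside one word, e.g. 'E3L3'."""
    if length == 1:
        symbol = "S"            # non-word (single-character) symbol
    elif position == 1:
        symbol = "B"            # word-initial symbol
    elif position == length:
        symbol = "E"            # word-final symbol
    else:
        symbol = "M"            # word-medial symbol
    return f"{symbol}{length}L{position}"

def position_labels(sentence, word_spans):
    """word_spans: (start_index, word) pairs from full-mode segmentation."""
    labels = []
    for i in range(len(sentence)):
        marks = [char_mark(i - start + 1, len(word))
                 for start, word in word_spans
                 if start <= i < start + len(word)]
        labels.append("[" + "-".join(marks) + "]")
    return labels

# 南京市长江大桥 with full-mode words 南京, 南京市, 市长, 长江, 长江大桥, 大桥
spans = [(0, "南京"), (0, "南京市"), (2, "市长"),
         (3, "长江"), (3, "长江大桥"), (5, "大桥")]
labels = position_labels("南京市长江大桥", spans)
```

The character "市" (index 2) reproduces the worked example above: it receives the label "[E3L3-B2L1]".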
S102, taking the character feature identification vector and the character position identification vector of each sentence as training samples, and importing the training samples into a multi-layer neural network model for training to obtain a Chinese named entity recognition model.
In the step S102, the multi-layer neural network model is a deep learning model commonly used for training recognition models; by importing a sufficient number of training samples (i.e., the character feature identification vectors and character position identification vectors of the sentences), a Chinese named entity recognition model with high recognition accuracy can be trained, and the model structure and the specific training method can be designed conventionally with reference to existing practice. Specifically, the multi-layer neural network model may be, but is not limited to, a CNN neural network model, a GRU neural network model, a bidirectional LSTM neural network model, a Transformer neural network model or a BERT neural network model, all of which are existing common models.
In the step S102, the Chinese named entity recognition model is preferably obtained using the following steps S201 to S203: S201, splicing the character feature identification vector and the character position identification vector, importing the result into the multi-layer neural network model for training, and outputting a recognition model containing hidden layer vectors; S202, labeling the entity of each character using a conditional random field, marking the entity information in the sentence sequence; S203, obtaining a set of optimal data weights through repeated training, yielding the Chinese named entity recognition model with the highest recognition accuracy. In step S202, although the neural network layers alone can train a very good recognition model, deficiencies remain that may violate label constraints; for example, if the correct labels are "B-LOC I-LOC" but the model outputs "B-LOC I-ORG", the two labels conflict. A CRF (Conditional Random Field, a probabilistic graphical model proposed in 2001 that follows the Markov property, combines features of the maximum entropy model and the hidden Markov model, is an undirected graphical model, and has in recent years achieved very good results in sequence labeling tasks such as word segmentation, part-of-speech tagging and named entity recognition) layer can take the contextual relations of the output labels into account, so that the generated result is the optimal solution for the whole sequence, yielding in step S203 the Chinese named entity recognition model with the highest recognition accuracy.
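The hard constraint that the CRF layer learns to respect can be illustrated with a minimal BIO validity check. This is not a CRF implementation (the patent's CRF also learns soft transition scores); it only encodes the rule that an I-X tag may follow only B-X or I-X of the same entity type, which is exactly what the "B-LOC I-ORG" example violates.

```python
# Illustrative BIO transition constraint (the hard-constraint part of what the
# patent's CRF layer enforces): I-X may only follow B-X or I-X of the same type.

def valid_transition(prev_tag, tag):
    if tag.startswith("I-"):
        entity = tag[2:]
        return prev_tag in (f"B-{entity}", f"I-{entity}")
    return True  # "O" and "B-*" tags may follow anything

def valid_sequence(tags):
    # Prepend an "O" sentinel so a leading I-* tag is also rejected.
    return all(valid_transition(p, t) for p, t in zip(["O"] + tags, tags))
```

A neural layer scoring each character independently can emit an invalid sequence such as ["B-LOC", "I-ORG"]; the CRF decodes the best sequence over the whole sentence, so such transitions are suppressed.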
S103, performing Chinese named entity recognition on the target text by using the Chinese named entity recognition model to obtain an entity labeling result.
In the step S103, the entity labeling result is preferably obtained according to the following step S301: S301, applying the Chinese named entity recognition model to perform character-by-character sequence labeling on the target text, and then converting the resulting label strings into entities to obtain the entity labeling result.
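The conversion from a per-character label sequence back into entities (step S301) can be sketched as follows. The BIO tag scheme and the function name are assumptions for illustration; the patent does not fix a particular tag set.

```python
# Sketch of step S301: collapse per-character BIO tags into (text, type) entity
# spans. A trailing "O" sentinel flushes any entity still open at the end.

def tags_to_entities(text, tags):
    """Collect (entity_text, entity_type) spans from per-character BIO tags."""
    entities, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):
        # Close the open entity when the run of matching I- tags ends.
        if start is not None and not (tag.startswith("I-") and tag[2:] == etype):
            entities.append((text[start:i], etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities

text = "南京市长江大桥"
tags = ["B-LOC", "I-LOC", "I-LOC", "B-LOC", "I-LOC", "I-LOC", "I-LOC"]
entities = tags_to_entities(text, tags)
```

For the sentence above this recovers the two location entities "南京市" and "长江大桥" from the labeled character sequence.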
In summary, the neural network-based Chinese named entity recognition method provided by the embodiment has the following technical effects:
(1) This embodiment provides a novel neural-network deep-learning method for improving the recognition rate of Chinese named entities by jointly exploiting character and word features: before model training, the data to be trained are preprocessed so that the training samples contain character position identification vectors serving as word boundary information. This ensures that the trained Chinese named entity recognition model achieves a very high recognition rate; the recognition model can convert an input text into named entity labels, i.e., the text to be recognized is fed into the trained Chinese named entity recognition model and the model converts it into the corresponding label text. This solves the problems of the prior art, which cannot exploit word information in sentences, leaving the recognition effect deficient and limiting improvement of the recognition rate, and the method is convenient for practical application and popularization;
(2) The Chinese named entity recognition method is easy to implement, has low development and operation costs, can provide a Chinese entity recognition service on a single server, offers high speed and accuracy, and can also be applied to other NLP tasks.
Example 2
As shown in fig. 3, the present embodiment provides a hardware device for implementing the neural network-based method for identifying a chinese named entity, where the hardware device includes a data preprocessing module, a model training module, and an entity labeling module that are sequentially connected in communication; the data preprocessing module is used for preprocessing data to be trained to obtain character feature identification vectors and character position identification vectors of all sentences, wherein the character feature identification vectors comprise character feature unique ID numbers of all words in corresponding sentences, and the character position identification vectors comprise character position unique ID numbers of all words in corresponding sentences; the model training module is used for taking the character feature identification vector and the character position identification vector of each sentence as training samples, importing the training samples into a multi-layer neural network model for training, and obtaining a Chinese named entity recognition model; and the entity labeling module is used for carrying out Chinese named entity recognition on the target text by applying the Chinese named entity recognition model to obtain an entity labeling result.
For the working process, working details, and technical effects of the Chinese named entity recognition device provided in this embodiment, reference may be made to Embodiment 1; they are not described again here.
Embodiment 3
As shown in fig. 4, the present embodiment provides a hardware device for implementing the neural-network-based Chinese named entity recognition method of Embodiment 1, comprising a memory and a processor that are communicatively connected, wherein the memory is configured to store a computer program, and the processor is configured to execute the computer program to implement the steps of the neural-network-based Chinese named entity recognition method of Embodiment 1.
For the working process, working details, and technical effects of the Chinese named entity recognition device provided in this embodiment, reference may be made to Embodiment 1; they are not described again here.
Embodiment 4
The present embodiment provides a storage medium storing a computer program for the neural-network-based Chinese named entity recognition method of Embodiment 1; that is, the storage medium stores a computer program which, when executed by a processor, implements the steps of the neural-network-based Chinese named entity recognition method of Embodiment 1. The computer may be a general-purpose computer, a special-purpose computer, a computer network, another programmable device, or a mobile smart device (such as a smartphone or a tablet such as an iPad).
For the working process, working details, and technical effects of the storage medium provided in this embodiment, reference may be made to Embodiment 1; they are not described again here.
The embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by software plus a necessary general-purpose hardware platform, or, of course, by hardware. Based on this understanding, the foregoing technical solutions may be embodied, in essence or in the part contributing to the prior art, in the form of a software product, which may be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device to perform the methods described in the embodiments or in parts thereof.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Finally, it should be noted that the invention is not limited to the alternative embodiments described above; anyone may derive various other forms of products in light of the present invention. The above detailed description should not be construed as limiting the scope of the invention, which is defined by the claims; the description may be used to interpret the claims.
Claims (8)
1. The Chinese named entity recognition method based on the neural network is characterized by comprising the following steps of:
S101, preprocessing the data to be trained to obtain the character feature identification vector and the character position identification vector of each sentence, wherein the character feature identification vector comprises the character feature unique ID number of each character in the corresponding sentence, and the character position identification vector comprises the character position unique ID number of each character in the corresponding sentence;
in the step S101, the character position identification vector of each sentence is obtained as follows: S1021, performing sentence segmentation processing on the data to be trained to obtain a plurality of sentences; S1022, performing full-mode word segmentation processing based on a word segmentation tool on each sentence to obtain a plurality of words; S1023, for each sentence, marking the position of each character in the word to which it belongs, and then splicing the position mark information into the character position label of the corresponding character according to the order of the words in the corresponding sentence; S1024, counting all character position labels and assigning a character position unique ID number to each character position label; S1025, for each sentence, generating the character position identification vector according to the character position unique ID number corresponding to each character in the corresponding sentence;
in said step S1023, the position of each character in the word to which it belongs is marked as follows: splicing a word-head symbol, a word-middle symbol, a word-tail symbol, or a non-word symbol with the character position serial number to form the position mark information of the character in the word to which it belongs, wherein the character position serial number refers to the sequential number of the character within the word to which it belongs;
in the step S1024, a character position unique ID number is assigned to each character position label as follows: counting the total number of all character position labels as n, and then assigning the integers from 0 to n-1 to the character position labels one by one in their arrangement order;
S102, taking the character feature identification vector and the character position identification vector of each sentence as training samples, and importing them into a multi-layer neural network model for training to obtain the Chinese named entity recognition model;
S103, applying the Chinese named entity recognition model to perform Chinese named entity recognition on the target text to obtain the entity labeling result.
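Steps S1023–S1025 of claim 1 can be sketched as follows. The sketch assumes a precomputed word segmentation per sentence (a real system would obtain it from a segmentation tool in full mode, per S1022), and uses illustrative symbol names: "B" for word-head, "M" for word-middle, "E" for word-tail, and "S" for a single non-word character, each spliced with the character's 1-based position in its word.

```python
# Minimal sketch of steps S1023–S1025, assuming segmentation is already done.
# Symbol names B/M/E/S are illustrative; the claim only requires distinct
# head/middle/tail/non-word symbols spliced with a position serial number.

def char_position_labels(words):
    """S1023: label each character with its symbol + position inside its word."""
    labels = []
    for word in words:
        for i, _ch in enumerate(word, start=1):
            if len(word) == 1:
                sym = "S"          # non-word (single-character) symbol
            elif i == 1:
                sym = "B"          # word-head symbol
            elif i == len(word):
                sym = "E"          # word-tail symbol
            else:
                sym = "M"          # word-middle symbol
            labels.append(f"{sym}{i}")
    return labels

def build_position_ids(all_label_sequences):
    """S1024–S1025: collect the n distinct labels, assign IDs 0..n-1,
    and map every sentence to its character position identification vector."""
    label_set = sorted({lab for seq in all_label_sequences for lab in seq})
    label2id = {lab: i for i, lab in enumerate(label_set)}
    return [[label2id[lab] for lab in seq] for seq in all_label_sequences], label2id

seqs = [char_position_labels(["中国", "人民", "银行"]),
        char_position_labels(["我", "爱", "北京"])]
vectors, label2id = build_position_ids(seqs)
print(vectors)    # ID vectors, one per sentence
print(label2id)   # n distinct labels mapped to 0..n-1
```

Here the ID order comes from sorting the labels; the claim only requires that the n labels receive the integers 0 to n-1 in some fixed arrangement order.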
2. The neural-network-based Chinese named entity recognition method according to claim 1, wherein in the step S101, the character feature identification vector of each sentence is obtained as follows:
s1011 and S1021, performing sentence segmentation processing on the data to be trained to obtain a plurality of sentences;
s1012, performing word segmentation processing on each sentence to separate words from each other;
s1013, counting all words, and distributing a character characteristic unique ID number for each word;
s1014, generating the character feature identification vector according to the unique ID number of the corresponding character feature of each word in the corresponding sentence for each sentence.
3. The neural-network-based Chinese named entity recognition method according to claim 1, wherein the step S102 comprises the following steps:
S201, splicing the character feature identification vector and the character position identification vector, importing the spliced vector into the multi-layer neural network model for training, and outputting a recognition model containing a hidden layer vector;
S202, labeling the entity of each character by using a conditional random field, and labeling the entity information in the sentence sequence;
S203, obtaining a group of optimal data weights through repeated training, so as to obtain the Chinese named entity recognition model with the highest recognition accuracy.
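The splicing in step S201 can be illustrated with a toy NumPy sketch: look up separate embeddings for the character-feature IDs and the character-position IDs, then concatenate them per character before they enter the multi-layer network. All dimensions and ID values below are assumed for illustration; the network and the CRF layer (S202–S203) are omitted.

```python
import numpy as np

# Toy illustration of step S201 under assumed dimensions: per-character
# splicing of a character-feature embedding with a character-position embedding.

rng = np.random.default_rng(0)
char_vocab, pos_vocab = 100, 10      # assumed vocabulary sizes
char_dim, pos_dim = 8, 4             # assumed embedding sizes

char_emb = rng.normal(size=(char_vocab, char_dim))
pos_emb = rng.normal(size=(pos_vocab, pos_dim))

char_ids = np.array([3, 17, 42])     # character-feature IDs for one sentence
pos_ids = np.array([0, 1, 2])        # character-position IDs for the same sentence

# Splice: each character gets a (char_dim + pos_dim)-dimensional input vector.
spliced = np.concatenate([char_emb[char_ids], pos_emb[pos_ids]], axis=1)
print(spliced.shape)                 # one 12-dimensional vector per character
```

Because the word-boundary information rides along in the position half of each spliced vector, the downstream network sees it at every character, which is the mechanism the patent credits for the improved recognition rate.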
4. The neural-network-based Chinese named entity recognition method according to claim 1, wherein the step S103 comprises the following steps:
S301, applying the Chinese named entity recognition model to perform character-by-character serialized labeling on the target text, and then converting the labeled character strings into entities to obtain the entity labeling result.
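The second half of step S301, converting a character-by-character label sequence into entities, can be sketched as follows. A BIO tag scheme is assumed here for illustration; the patent does not fix the tag set, and the example names and entity types are hypothetical.

```python
# Sketch of converting per-character labels (BIO scheme assumed) into
# (entity_text, entity_type) spans, as in the second half of step S301.

def labels_to_entities(chars, labels):
    entities, start, etype = [], None, None
    for i, lab in enumerate(labels + ["O"]):       # sentinel "O" flushes the last span
        if start is not None and not lab.startswith("I-"):
            entities.append(("".join(chars[start:i]), etype))
            start, etype = None, None
        if lab.startswith("B-"):
            start, etype = i, lab[2:]
    return entities

chars = list("王小明在北京")
labels = ["B-PER", "I-PER", "I-PER", "O", "B-LOC", "I-LOC"]
print(labels_to_entities(chars, labels))
```

The sentinel appended at the end ensures an entity that runs to the last character is still emitted.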
5. The neural-network-based Chinese named entity recognition method according to claim 1, wherein the multi-layer neural network model is a CNN neural network model, a GRU neural network model, a bidirectional LSTM neural network model, a Transformer neural network model, or a BERT neural network model.
6. A neural-network-based Chinese named entity recognition device, characterized by comprising a data preprocessing module, a model training module, and an entity labeling module that are sequentially connected in communication;
the data preprocessing module is configured to preprocess the data to be trained to obtain the character feature identification vector and the character position identification vector of each sentence, wherein the character feature identification vector comprises the character feature unique ID number of each character in the corresponding sentence, and the character position identification vector comprises the character position unique ID number of each character in the corresponding sentence;
the data preprocessing module is specifically configured to obtain the character position identification vector of each sentence as follows: S1021, performing sentence segmentation processing on the data to be trained to obtain a plurality of sentences; S1022, performing full-mode word segmentation processing based on a word segmentation tool on each sentence to obtain a plurality of words; S1023, for each sentence, marking the position of each character in the word to which it belongs, and then splicing the position mark information into the character position label of the corresponding character according to the order of the words in the corresponding sentence; S1024, counting all character position labels and assigning a character position unique ID number to each character position label; S1025, for each sentence, generating the character position identification vector according to the character position unique ID number corresponding to each character in the corresponding sentence;
the data preprocessing module is specifically configured to mark the position of each character in the word to which it belongs as follows: splicing a word-head symbol, a word-middle symbol, a word-tail symbol, or a non-word symbol with the character position serial number to form the position mark information of the character in the word to which it belongs, wherein the character position serial number refers to the sequential number of the character within the word to which it belongs;
the data preprocessing module is specifically configured to assign a character position unique ID number to each character position label as follows: counting the total number of all character position labels as n, and then assigning the integers from 0 to n-1 to the character position labels one by one in their arrangement order;
the model training module is configured to take the character feature identification vector and the character position identification vector of each sentence as training samples and import them into a multi-layer neural network model for training to obtain the Chinese named entity recognition model;
and the entity labeling module is configured to apply the Chinese named entity recognition model to perform Chinese named entity recognition on the target text to obtain the entity labeling result.
7. A neural-network-based Chinese named entity recognition device, comprising a memory and a processor that are communicatively connected, wherein the memory is configured to store a computer program, and the processor is configured to execute the computer program to implement the steps of the neural-network-based Chinese named entity recognition method according to any one of claims 1 to 5.
8. A storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the neural-network-based Chinese named entity recognition method according to any one of claims 1 to 5.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201911000998.4A (CN110717331B) | 2019-10-21 | 2019-10-21 | Chinese named entity recognition method, device and equipment based on neural network and storage medium |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN110717331A | 2020-01-21 |
| CN110717331B | 2023-10-24 |
Family

ID=69213945

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201911000998.4A (CN110717331B, active) | Chinese named entity recognition method, device and equipment based on neural network and storage medium | 2019-10-21 | 2019-10-21 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN110717331B (en) |
Families Citing this family (14)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111339775A (en) * | 2020-02-11 | 2020-06-26 | 平安科技(深圳)有限公司 | Named entity identification method, device, terminal equipment and storage medium |
| CN111476031A (en) * | 2020-03-11 | 2020-07-31 | 重庆邮电大学 | Improved Chinese named entity recognition method based on Lattice-LSTM |
| CN111339779A (en) * | 2020-03-20 | 2020-06-26 | 桂林电子科技大学 | Named entity identification method for Vietnamese |
| CN111597804B (en) * | 2020-05-15 | 2023-03-10 | 腾讯科技(深圳)有限公司 | Method and related device for training entity recognition model |
| CN113743116A (en) * | 2020-05-28 | 2021-12-03 | 株式会社理光 | Training method and device for named entity recognition and computer readable storage medium |
| CN111709242B (en) * | 2020-06-01 | 2024-02-02 | 广州多益网络股份有限公司 | Chinese punctuation mark adding method based on named entity recognition |
| CN112101028B (en) * | 2020-08-17 | 2022-08-26 | 淮阴工学院 | Multi-feature bidirectional gating field expert entity extraction method and system |
| CN112257446A (en) * | 2020-10-20 | 2021-01-22 | 平安科技(深圳)有限公司 | Named entity recognition method and device, computer equipment and readable storage medium |
| CN112380854B (en) * | 2020-11-17 | 2024-03-01 | 苏州大学 | Chinese word segmentation method and device, electronic equipment and storage medium |
| CN114548103B (en) * | 2020-11-25 | 2024-03-29 | 马上消费金融股份有限公司 | Named entity recognition model training method and named entity recognition method |
| CN112380866A (en) * | 2020-11-25 | 2021-02-19 | 厦门市美亚柏科信息股份有限公司 | Text topic label generation method, terminal device and storage medium |
| CN112686047B (en) * | 2021-01-21 | 2024-03-29 | 北京云上曲率科技有限公司 | Sensitive text recognition method, device and system based on named entity recognition |
| CN113420557B (en) * | 2021-06-09 | 2024-03-08 | 山东师范大学 | Chinese named entity recognition method, system, equipment and storage medium |
| CN113408507B (en) * | 2021-08-20 | 2021-11-26 | 北京国电通网络技术有限公司 | Named entity identification method and device based on resume file and electronic equipment |
Citations (6)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107908614A (en) * | 2017-10-12 | 2018-04-13 | 北京知道未来信息技术有限公司 | Named entity recognition method based on Bi-LSTM |
| CN108460012A (en) * | 2018-02-01 | 2018-08-28 | 哈尔滨理工大学 | Named entity recognition method based on GRU-CRF |
| CN109446514A (en) * | 2018-09-18 | 2019-03-08 | 平安科技(深圳)有限公司 | Construction method, device and computer equipment of news property identification model |
| CN109635279A (en) * | 2018-11-22 | 2019-04-16 | 桂林电子科技大学 | Chinese named entity recognition method based on neural network |
| CN109933801A (en) * | 2019-03-25 | 2019-06-25 | 北京理工大学 | Bidirectional LSTM named entity recognition method based on predicted position attention |
| CN110083831A (en) * | 2019-04-16 | 2019-08-02 | 武汉大学 | Chinese named entity recognition method based on BERT-BiGRU-CRF |
Family Cites Families (3)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10867597B2 * | 2013-09-02 | 2020-12-15 | Microsoft Technology Licensing, LLC | Assignment of semantic labels to a sequence of words using neural network architectures |
| US10157177B2 * | 2016-10-28 | 2018-12-18 | Kira Inc. | System and method for extracting entities in electronic documents |
| RU2691214C1 * | 2017-12-13 | 2019-06-11 | ABBYY Production LLC | Text recognition using artificial intelligence |
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |