CN110717331B - Chinese named entity recognition method, device and equipment based on neural network and storage medium - Google Patents
- Publication number
- Publication: CN110717331B; Application: CN201911000998.4A
- Authority
- CN
- China
- Prior art keywords
- word
- character
- sentence
- neural network
- character position
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Machine Translation (AREA)
- Character Discrimination (AREA)
Abstract
The invention relates to the technical field of Chinese language processing and recognition, and discloses a Chinese named entity recognition method, device, equipment and storage medium based on a neural network. The invention provides a novel neural-network deep-learning method for improving the recognition rate of Chinese named entities by jointly exploiting character and word features: before model training, the data to be trained are preprocessed so that the training samples contain character position identification vectors that serve as word boundary information. This ensures that the trained Chinese named entity recognition model achieves a very high recognition rate, and the recognition model can convert an input text into named entity labels. The invention thereby solves the problems of the prior art, which cannot exploit the word information in sentences, leaving the recognition effect deficient and limiting improvement of the recognition rate, and the method is convenient for practical application and popularization. In addition, the Chinese named entity recognition method is easy to implement and has low development and operation costs.
Description
Technical Field
The invention belongs to the technical field of Chinese language processing and recognition, and particularly relates to a Chinese named entity recognition method, device, equipment and storage medium based on a neural network.
Background
Named entity recognition (Named Entity Recognition, NER for short) is a basic task of natural language processing: identifying proper nouns and phrases in natural language text and categorizing them. As more and more researchers propose various model structures in the NER field, using neural networks and deep learning to address the NER problem has become the major trend.
Character-based methods and word-based methods are currently the two mainstream approaches. A word-based method requires a word segmentation tool, but such tools are imperfect: once segmentation is wrong, the prediction of entity boundaries is directly affected and recognition fails. A character-based method, which trains in units of characters, has a larger training scale and a longer training time, yet research has shown that for Chinese named entity recognition the character-based approach outperforms the word-based one. However, a character-based method cannot exploit the information of the words in a sentence (in fact, providing word boundary information can effectively improve the recognition rate), which leaves the recognition effect deficient and limits improvement of the recognition rate.
Disclosure of Invention
The invention aims to solve the problem that existing character-based Chinese named entity recognition methods cannot exploit the information of words in sentences, so that the recognition effect is deficient and improvement of the recognition rate is limited.
The technical scheme adopted by the invention is as follows:
a Chinese named entity recognition method based on a neural network comprises the following steps:
s101, preprocessing data to be trained to obtain character feature identification vectors and character position identification vectors of all sentences, wherein the character feature identification vectors comprise character feature unique ID numbers of all words in corresponding sentences, and the character position identification vectors comprise character position unique ID numbers of all words in corresponding sentences;
s102, taking the character feature identification vector and the character position identification vector of each sentence as training samples, and importing a multi-layer neural network model for training to obtain a Chinese named entity recognition model;
s103, performing Chinese named entity recognition on the target text by using the Chinese named entity recognition model to obtain an entity labeling result.
Preferably, in the step S101, the character feature identification vector of each sentence is obtained according to the following steps:
s1011, carrying out sentence segmentation on the data to be trained to obtain a plurality of sentences;
s1012, performing word segmentation processing on each sentence to separate words from each other;
s1013, counting all words, and distributing a character characteristic unique ID number for each word;
s1014, generating the character feature identification vector according to the unique ID number of the corresponding character feature of each word in the corresponding sentence for each sentence.
Preferably, in the step S101, the character position identification vector of each sentence is obtained according to the following steps:
s1021, sentence segmentation is carried out on the data to be trained to obtain a plurality of sentences;
s1022, performing full-mode word segmentation processing based on word segmentation tools on each sentence to obtain a plurality of words;
s1023, marking the position of each word in the belonged word for each sentence, and then splicing the position marking information into a character position label of the corresponding word according to the sequence of the belonged word in the corresponding sentence;
s1024, counting all character position labels, and distributing a character position unique ID number for each character position label;
s1025, generating the character position identification vector according to the unique ID number of the corresponding character position of each word in the corresponding sentence for each sentence.
Further preferably, in said step S1023, the position of each character in the word it belongs to is marked as follows: the position mark information of the character within the word is formed by splicing a word-initial symbol, a word-medial symbol, a word-final symbol or a non-word symbol with the word length and the position serial number, wherein the position serial number refers to the ordinal position of the character within the word it belongs to.
Preferably, the step S102 includes the following steps:
s201, after the character feature identification vector and the character position identification vector are spliced, the character feature identification vector and the character position identification vector are imported into the multi-layer neural network model for training, and then an identification model containing a hidden layer vector is output;
s202, marking the entity of each character by using a conditional random field, and marking entity information in a sentence sequence;
s203, obtaining a group of optimal data weights through repeated training, and obtaining the Chinese named entity recognition model with highest recognition accuracy.
Preferably, the step S103 includes the following steps:
s301, carrying out word-by-word serialization labeling on the target text by applying the Chinese named entity recognition model, and then converting the character strings into entities to obtain entity labeling results.
Specifically, the multi-layer neural network model is a CNN neural network model, a GRU neural network model, a bidirectional LSTM neural network model, a Transformer neural network model or a BERT neural network model.
The other technical scheme adopted by the invention is as follows:
a Chinese named entity recognition device based on a neural network comprises a data preprocessing module, a model training module and an entity labeling module which are sequentially communicated;
the data preprocessing module is used for preprocessing data to be trained to obtain character feature identification vectors and character position identification vectors of all sentences, wherein the character feature identification vectors comprise character feature unique ID numbers of all words in corresponding sentences, and the character position identification vectors comprise character position unique ID numbers of all words in corresponding sentences;
the model training module is used for taking the character feature identification vector and the character position identification vector of each sentence as training samples, importing the training samples into a multi-layer neural network model for training, and obtaining a Chinese named entity recognition model;
and the entity labeling module is used for carrying out Chinese named entity recognition on the target text by applying the Chinese named entity recognition model to obtain an entity labeling result.
The other technical scheme adopted by the invention is as follows:
the Chinese named entity recognition device based on the neural network comprises a memory and a processor which are connected in communication, wherein the memory is used for storing a computer program, and the processor is used for executing the computer program to realize the steps of the Chinese named entity recognition method based on the neural network.
The other technical scheme adopted by the invention is as follows:
A storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the neural network-based Chinese named entity recognition method described above.
The beneficial effects of the invention are as follows:
(1) The invention provides a novel neural-network deep-learning method for improving the recognition rate of Chinese named entities by jointly exploiting character and word features: before model training, the data to be trained are preprocessed so that the training samples contain character position identification vectors serving as word boundary information. This ensures that the trained Chinese named entity recognition model achieves a very high recognition rate; the recognition model can convert an input text into named entity labels, i.e., the text to be recognized is fed into the trained Chinese named entity recognition model and the model converts it into the corresponding label text. This solves the problems of the prior art, which cannot exploit word information in sentences, leaving the recognition effect deficient and limiting improvement of the recognition rate, and the method is convenient for practical application and popularization;
(2) The Chinese named entity recognition method is easy to implement, has low development and operation costs, can provide a Chinese entity recognition service on a single server, offers high speed and accuracy, and can also be applied to other NLP tasks.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention, and other drawings can be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a schematic flow chart of a method for identifying a Chinese named entity.
FIG. 2 is an exemplary diagram of a full-mode word segmentation result provided by the present invention.
Fig. 3 is a schematic structural diagram of the Chinese named entity recognition device provided by the present invention.
Fig. 4 is a schematic structural diagram of the Chinese named entity recognition equipment provided by the present invention.
Detailed Description
The invention is further described with reference to the drawings and specific examples. It should be noted that the description of these examples is for aiding in understanding the present invention, but is not intended to limit the present invention. Specific structural and functional details disclosed herein are merely representative of example embodiments of the invention. This invention may, however, be embodied in many alternate forms and should not be construed as limited to the embodiments set forth herein.
It should be understood that some of the processes described herein include a plurality of operations that appear in a particular order, but these operations may be performed out of the order in which they appear herein, or in parallel. The sequence numbers of the operations, such as S101 and S102, are merely used to distinguish the operations from one another and do not by themselves represent any order of execution. In addition, these processes may include more or fewer operations, and those operations may likewise be performed in sequence or in parallel.
It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another element. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of example embodiments of the present invention.
It should be understood that the term "and/or" is merely an association relationship describing the associated object, and means that three relationships may exist, for example, a and/or B may mean: the terms "/and" herein describe another associative object relationship, indicating that there may be two relationships, e.g., a/and B, may indicate that: the character "/" herein generally indicates that the associated object is an "or" relationship.
It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. In contrast, when an element is referred to as being "directly connected" or "directly coupled" to another element, there are no intervening elements. Other words used to describe relationships between elements (e.g., "between" versus "directly between", "adjacent" versus "directly adjacent", etc.) should be interpreted in a similar manner.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments of the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates to the contrary. It will be further understood that the terms "comprises," "comprising," "includes" and/or "including," when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, and do not preclude the presence or addition of one or more other features, numbers, steps, operations, elements, components, and/or groups thereof.
It should also be noted that in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may in fact be executed substantially concurrently or the figures may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
In the following description, specific details are provided to provide a thorough understanding of example embodiments. However, it will be understood by those of ordinary skill in the art that the example embodiments may be practiced without these specific details. For example, a system may be shown in block diagrams in order to avoid obscuring the examples with unnecessary detail. In other instances, well-known processes, structures, and techniques may be shown without unnecessary detail in order to avoid obscuring the example embodiments.
Example 1
As shown in Figs. 1 and 2, the neural network-based Chinese named entity recognition method provided in this embodiment may include, but is not limited to, the following steps.
S101, preprocessing data to be trained to obtain character feature identification vectors and character position identification vectors of all sentences, wherein the character feature identification vectors comprise character feature unique ID numbers of all words in corresponding sentences, and the character position identification vectors comprise character position unique ID numbers of all words in corresponding sentences.
In the step S101, the data to be trained may be composed of various document data provided by a user or collected by existing collection software, wherein the document data may consist of, but is not limited to, some or all of the following fields: title, abstract, keywords, body, attachment title, attachment content, author information, and the like.
In the step S101, it is preferable to obtain the character feature identification vectors of the respective sentences according to the following steps S1011 to S1014: S1011, performing sentence segmentation on the data to be trained to obtain a plurality of sentences; S1012, performing word segmentation processing on each sentence to separate the words from each other; S1013, counting all characters and assigning a character feature unique ID number to each character; S1014, for each sentence, generating the character feature identification vector from the character feature unique ID numbers of its characters. In detail, sentence segmentation and word segmentation are both conventional techniques. In step S1013, the assignment may be performed, but is not limited to, as follows: count the total number of distinct characters as m, and then assign the integers 0 to m-1 to the characters one by one in order. Thus, step S1013 yields a character table containing all characters and their character feature unique ID numbers, and in the subsequent step S1014 this table is used to encode each character of each sentence, producing the character-level character feature identification vector of the sentence.
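Steps S1011 to S1014 can be sketched as follows. This is a minimal illustrative implementation, not the patent's own code; the function names are made up, and IDs are assigned in first-seen order as one instance of the "0 to m-1" scheme described above.

```python
# Sketch of steps S1011-S1014: build a character table assigning each distinct
# character a unique feature ID in [0, m-1], then encode each sentence as its
# character feature identification vector. Names are illustrative assumptions.

def build_char_table(sentences):
    """Count all characters and assign each a unique feature ID (0..m-1)."""
    table = {}
    for sentence in sentences:
        for ch in sentence:
            if ch not in table:
                table[ch] = len(table)  # IDs follow first-seen order
    return table

def encode_sentence(sentence, char_table):
    """Compile a sentence into its character feature identification vector."""
    return [char_table[ch] for ch in sentence]

sentences = ["南京市长江大桥", "大桥很长"]
table = build_char_table(sentences)
vectors = [encode_sentence(s, table) for s in sentences]
```

Note how shared characters (e.g. "大", "桥", "长") reuse the same ID across sentences, which is what lets the model learn character features across the corpus.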
In the step S101, the character position identification vector of each sentence is preferably obtained according to the following steps S1021 to S1025: S1021, performing sentence segmentation on the data to be trained to obtain a plurality of sentences; S1022, performing full-mode word segmentation based on a word segmentation tool on each sentence to obtain a plurality of words; S1023, for each sentence, marking the position of each character in the word it belongs to, and then splicing the position mark information into the character position label of the corresponding character according to the order of the words in the sentence; S1024, counting all character position labels and assigning a character position unique ID number to each; S1025, for each sentence, generating the character position identification vector from the character position unique ID numbers of its characters. In detail, sentence segmentation and full-mode word segmentation are conventional techniques; for full-mode segmentation, as shown in the example of Fig. 2, the sentence "南京市长江大桥" (Nanjing City Yangtze River Bridge) yields words such as "南京" (Nanjing), "南京市" (Nanjing City), "市长" (mayor), "长江大桥" (Yangtze River Bridge) and "大桥" (bridge). In step S1023, it may be further preferable to mark the position of each character in the word it belongs to as follows: the position mark information is formed by splicing a word-initial symbol, a word-medial symbol, a word-final symbol or a non-word symbol with the word length and the position serial number, wherein the position serial number refers to the ordinal position of the character within the word; see the character position label examples in Table 1 below:
Table 1: Character position label examples
As shown in Table 1 above, the word-initial symbol B, the word-medial symbol M, the word-final symbol E or the non-word symbol S may be spliced with the word length and the position serial number to form the position mark information of a character within a word. For example, the character "市" (city) belongs to two words, "南京市" (Nanjing City) and "市长" (mayor): it is at the end (E) of the former and the beginning (B) of the latter; the lengths of the two words are 3 and 2, respectively; and its positions within the two words are 3 and 1, i.e., L3 and L1. Hence the position mark information of "市" in "南京市" is "E3L3", its position mark information in "市长" is "B2L1", and the finally spliced character position label is "[E3L3-B2L1]". In step S1024, the assignment may be performed, but is not limited to, as follows: count the total number of character position labels as n, and then assign the integers 0 to n-1 to the labels one by one in order. Thus, step S1024 yields a character position table containing all character position labels and their character position unique ID numbers, and in the subsequent step S1025 this table is used to encode each character of each sentence, producing the character-level character position identification vector of the sentence.
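The label construction above can be sketched as follows. This is an illustrative reading of the patent's scheme, not its code: word spans are supplied directly here, whereas the patent obtains them from a full-mode segmentation tool, and the helper names are assumptions.

```python
# Sketch of step S1023: derive each character's position label from a full-mode
# segmentation. Label format follows the patent's example: symbol (B/M/E/S) +
# word length + "L" + position within the word, with marks joined by "-".

def char_mark(position, length):
    """Position mark of one character inside one word, e.g. 'E3L3'."""
    if length == 1:
        symbol = "S"            # non-word (single-character) symbol
    elif position == 1:
        symbol = "B"            # word-initial symbol
    elif position == length:
        symbol = "E"            # word-final symbol
    else:
        symbol = "M"            # word-medial symbol
    return f"{symbol}{length}L{position}"

def position_labels(sentence, word_spans):
    """word_spans: (start_index, word) pairs from full-mode segmentation."""
    labels = []
    for i in range(len(sentence)):
        marks = [char_mark(i - start + 1, len(word))
                 for start, word in word_spans
                 if start <= i < start + len(word)]
        labels.append("[" + "-".join(marks) + "]")
    return labels

# 南京市长江大桥 with full-mode words 南京, 南京市, 市长, 长江, 长江大桥, 大桥
spans = [(0, "南京"), (0, "南京市"), (2, "市长"),
         (3, "长江"), (3, "长江大桥"), (5, "大桥")]
labels = position_labels("南京市长江大桥", spans)
```

The character "市" (index 2) reproduces the worked example above: it receives the label "[E3L3-B2L1]".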
S102, taking the character feature identification vector and the character position identification vector of each sentence as training samples, and importing the training samples into a multi-layer neural network model for training to obtain a Chinese named entity recognition model.
In the step S102, the multi-layer neural network model is a deep learning model commonly used for training recognition models; by importing a sufficient number of training samples (i.e., the character feature identification vectors and character position identification vectors of the sentences), a Chinese named entity recognition model with high recognition accuracy can be trained, and the model structure and the specific training method can be designed conventionally with reference to existing practice. Specifically, the multi-layer neural network model may be, but is not limited to, a CNN neural network model, a GRU neural network model, a bidirectional LSTM neural network model, a Transformer neural network model or a BERT neural network model, all of which are existing common models.
In the step S102, the Chinese named entity recognition model is preferably obtained using the following steps S201 to S203: S201, splicing the character feature identification vector and the character position identification vector, importing the result into the multi-layer neural network model for training, and outputting a recognition model containing hidden layer vectors; S202, labeling the entity of each character using a conditional random field, marking the entity information in the sentence sequence; S203, obtaining a set of optimal data weights through repeated training, yielding the Chinese named entity recognition model with the highest recognition accuracy. In step S202, although the neural network layers alone can train a very good recognition model, deficiencies remain that may violate label constraints; for example, if the correct labels are "B-LOC I-LOC" but the model outputs "B-LOC I-ORG", the two labels conflict. A CRF (Conditional Random Field, a probabilistic graphical model proposed in 2001 that follows the Markov property, combines features of the maximum entropy model and the hidden Markov model, is an undirected graphical model, and has in recent years achieved very good results in sequence labeling tasks such as word segmentation, part-of-speech tagging and named entity recognition) layer can take the contextual relations of the output labels into account, so that the generated result is the optimal solution for the whole sequence, yielding in step S203 the Chinese named entity recognition model with the highest recognition accuracy.
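The hard constraint that the CRF layer learns to respect can be illustrated with a minimal BIO validity check. This is not a CRF implementation (the patent's CRF also learns soft transition scores); it only encodes the rule that an I-X tag may follow only B-X or I-X of the same entity type, which is exactly what the "B-LOC I-ORG" example violates.

```python
# Illustrative BIO transition constraint (the hard-constraint part of what the
# patent's CRF layer enforces): I-X may only follow B-X or I-X of the same type.

def valid_transition(prev_tag, tag):
    if tag.startswith("I-"):
        entity = tag[2:]
        return prev_tag in (f"B-{entity}", f"I-{entity}")
    return True  # "O" and "B-*" tags may follow anything

def valid_sequence(tags):
    # Prepend an "O" sentinel so a leading I-* tag is also rejected.
    return all(valid_transition(p, t) for p, t in zip(["O"] + tags, tags))
```

A neural layer scoring each character independently can emit an invalid sequence such as ["B-LOC", "I-ORG"]; the CRF decodes the best sequence over the whole sentence, so such transitions are suppressed.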
S103, performing Chinese named entity recognition on the target text by using the Chinese named entity recognition model to obtain an entity labeling result.
In the step S103, the entity labeling result is preferably obtained according to the following step S301: S301, applying the Chinese named entity recognition model to perform character-by-character sequence labeling on the target text, and then converting the resulting label strings into entities to obtain the entity labeling result.
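The conversion from a per-character label sequence back into entities (step S301) can be sketched as follows. The BIO tag scheme and the function name are assumptions for illustration; the patent does not fix a particular tag set.

```python
# Sketch of step S301: collapse per-character BIO tags into (text, type) entity
# spans. A trailing "O" sentinel flushes any entity still open at the end.

def tags_to_entities(text, tags):
    """Collect (entity_text, entity_type) spans from per-character BIO tags."""
    entities, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):
        # Close the open entity when the run of matching I- tags ends.
        if start is not None and not (tag.startswith("I-") and tag[2:] == etype):
            entities.append((text[start:i], etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities

text = "南京市长江大桥"
tags = ["B-LOC", "I-LOC", "I-LOC", "B-LOC", "I-LOC", "I-LOC", "I-LOC"]
entities = tags_to_entities(text, tags)
```

For the sentence above this recovers the two location entities "南京市" and "长江大桥" from the labeled character sequence.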
In summary, the neural network-based Chinese named entity recognition method provided by the embodiment has the following technical effects:
(1) This embodiment provides a novel neural-network deep-learning method for improving the recognition rate of Chinese named entities by jointly exploiting character and word features: before model training, the data to be trained are preprocessed so that the training samples contain character position identification vectors serving as word boundary information. This ensures that the trained Chinese named entity recognition model achieves a very high recognition rate; the recognition model can convert an input text into named entity labels, i.e., the text to be recognized is fed into the trained Chinese named entity recognition model and the model converts it into the corresponding label text. This solves the problems of the prior art, which cannot exploit word information in sentences, leaving the recognition effect deficient and limiting improvement of the recognition rate, and the method is convenient for practical application and popularization;
(2) The Chinese named entity recognition method is easy to implement, has low development and operation costs, can provide a Chinese entity recognition service on a single server, offers high speed and accuracy, and can also be applied to other NLP tasks.
Example 2
As shown in fig. 3, the present embodiment provides a hardware device for implementing the neural network-based method for identifying a chinese named entity, where the hardware device includes a data preprocessing module, a model training module, and an entity labeling module that are sequentially connected in communication; the data preprocessing module is used for preprocessing data to be trained to obtain character feature identification vectors and character position identification vectors of all sentences, wherein the character feature identification vectors comprise character feature unique ID numbers of all words in corresponding sentences, and the character position identification vectors comprise character position unique ID numbers of all words in corresponding sentences; the model training module is used for taking the character feature identification vector and the character position identification vector of each sentence as training samples, importing the training samples into a multi-layer neural network model for training, and obtaining a Chinese named entity recognition model; and the entity labeling module is used for carrying out Chinese named entity recognition on the target text by applying the Chinese named entity recognition model to obtain an entity labeling result.
For the working process, working details, and technical effects of the Chinese named entity recognition device provided in this embodiment, reference may be made to Embodiment 1; they are not described again here.
Embodiment 3
As shown in fig. 4, the present embodiment provides a hardware device for implementing the neural-network-based Chinese named entity recognition method of Embodiment 1, comprising a memory and a processor that are communicatively connected, wherein the memory is configured to store a computer program, and the processor is configured to execute the computer program to implement the steps of the neural-network-based Chinese named entity recognition method of Embodiment 1.
For the working process, working details, and technical effects of the Chinese named entity recognition device provided in this embodiment, reference may be made to Embodiment 1; they are not described again here.
Embodiment 4
The present embodiment provides a storage medium storing a computer program for the neural-network-based Chinese named entity recognition method of Embodiment 1; that is, the storage medium stores a computer program which, when executed by a processor, implements the steps of the neural-network-based Chinese named entity recognition method of Embodiment 1. The computer may be a general-purpose computer, a special-purpose computer, a computer network, another programmable device, or a mobile smart device (such as a smartphone or a tablet such as an iPad).
For the working process, working details, and technical effects of the storage medium provided in this embodiment, reference may be made to Embodiment 1; they are not described again here.
The embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by software plus a necessary general-purpose hardware platform, or, of course, by hardware. Based on this understanding, the foregoing technical solutions may be embodied, in essence or in the part contributing to the prior art, in the form of a software product, which may be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device to perform the methods described in the embodiments or in parts thereof.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Finally, it should be noted that the invention is not limited to the alternative embodiments described above; anyone may derive various other forms of products in light of the present invention. The above detailed description should not be construed as limiting the scope of the invention, which is defined by the claims; the description may be used to interpret the claims.
Claims (8)
1. The Chinese named entity recognition method based on the neural network is characterized by comprising the following steps of:
S101, preprocessing the data to be trained to obtain the character feature identification vector and the character position identification vector of each sentence, wherein the character feature identification vector comprises the character feature unique ID number of each character in the corresponding sentence, and the character position identification vector comprises the character position unique ID number of each character in the corresponding sentence;
in the step S101, the character position identification vector of each sentence is obtained as follows: S1021, performing sentence segmentation processing on the data to be trained to obtain a plurality of sentences; S1022, performing full-mode word segmentation processing based on a word segmentation tool on each sentence to obtain a plurality of words; S1023, for each sentence, marking the position of each character in the word to which it belongs, and then splicing the position mark information into the character position label of the corresponding character according to the order of the words in the corresponding sentence; S1024, counting all character position labels and assigning a character position unique ID number to each character position label; S1025, for each sentence, generating the character position identification vector according to the character position unique ID number corresponding to each character in the corresponding sentence;
in said step S1023, the position of each character in the word to which it belongs is marked as follows: splicing a word-head symbol, a word-middle symbol, a word-tail symbol, or a non-word symbol with the character position serial number to form the position mark information of the character in the word to which it belongs, wherein the character position serial number refers to the sequential number of the character within the word to which it belongs;
in the step S1024, a character position unique ID number is assigned to each character position label as follows: counting the total number of all character position labels as n, and then assigning the integers from 0 to n-1 to the character position labels one by one in their arrangement order;
S102, taking the character feature identification vector and the character position identification vector of each sentence as training samples, and importing them into a multi-layer neural network model for training to obtain the Chinese named entity recognition model;
S103, applying the Chinese named entity recognition model to perform Chinese named entity recognition on the target text to obtain the entity labeling result.
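Steps S1023–S1025 of claim 1 can be sketched as follows. The sketch assumes a precomputed word segmentation per sentence (a real system would obtain it from a segmentation tool in full mode, per S1022), and uses illustrative symbol names: "B" for word-head, "M" for word-middle, "E" for word-tail, and "S" for a single non-word character, each spliced with the character's 1-based position in its word.

```python
# Minimal sketch of steps S1023–S1025, assuming segmentation is already done.
# Symbol names B/M/E/S are illustrative; the claim only requires distinct
# head/middle/tail/non-word symbols spliced with a position serial number.

def char_position_labels(words):
    """S1023: label each character with its symbol + position inside its word."""
    labels = []
    for word in words:
        for i, _ch in enumerate(word, start=1):
            if len(word) == 1:
                sym = "S"          # non-word (single-character) symbol
            elif i == 1:
                sym = "B"          # word-head symbol
            elif i == len(word):
                sym = "E"          # word-tail symbol
            else:
                sym = "M"          # word-middle symbol
            labels.append(f"{sym}{i}")
    return labels

def build_position_ids(all_label_sequences):
    """S1024–S1025: collect the n distinct labels, assign IDs 0..n-1,
    and map every sentence to its character position identification vector."""
    label_set = sorted({lab for seq in all_label_sequences for lab in seq})
    label2id = {lab: i for i, lab in enumerate(label_set)}
    return [[label2id[lab] for lab in seq] for seq in all_label_sequences], label2id

seqs = [char_position_labels(["中国", "人民", "银行"]),
        char_position_labels(["我", "爱", "北京"])]
vectors, label2id = build_position_ids(seqs)
print(vectors)    # ID vectors, one per sentence
print(label2id)   # n distinct labels mapped to 0..n-1
```

Here the ID order comes from sorting the labels; the claim only requires that the n labels receive the integers 0 to n-1 in some fixed arrangement order.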
2. The neural-network-based Chinese named entity recognition method according to claim 1, wherein in the step S101, the character feature identification vector of each sentence is obtained as follows:
s1011 and S1021, performing sentence segmentation processing on the data to be trained to obtain a plurality of sentences;
s1012, performing word segmentation processing on each sentence to separate words from each other;
s1013, counting all words, and distributing a character characteristic unique ID number for each word;
s1014, generating the character feature identification vector according to the unique ID number of the corresponding character feature of each word in the corresponding sentence for each sentence.
3. The neural-network-based Chinese named entity recognition method according to claim 1, wherein the step S102 comprises the following steps:
S201, splicing the character feature identification vector and the character position identification vector, importing the spliced vector into the multi-layer neural network model for training, and outputting a recognition model containing a hidden layer vector;
S202, labeling the entity of each character by using a conditional random field, and labeling the entity information in the sentence sequence;
S203, obtaining a group of optimal data weights through repeated training, so as to obtain the Chinese named entity recognition model with the highest recognition accuracy.
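The splicing in step S201 can be illustrated with a toy NumPy sketch: look up separate embeddings for the character-feature IDs and the character-position IDs, then concatenate them per character before they enter the multi-layer network. All dimensions and ID values below are assumed for illustration; the network and the CRF layer (S202–S203) are omitted.

```python
import numpy as np

# Toy illustration of step S201 under assumed dimensions: per-character
# splicing of a character-feature embedding with a character-position embedding.

rng = np.random.default_rng(0)
char_vocab, pos_vocab = 100, 10      # assumed vocabulary sizes
char_dim, pos_dim = 8, 4             # assumed embedding sizes

char_emb = rng.normal(size=(char_vocab, char_dim))
pos_emb = rng.normal(size=(pos_vocab, pos_dim))

char_ids = np.array([3, 17, 42])     # character-feature IDs for one sentence
pos_ids = np.array([0, 1, 2])        # character-position IDs for the same sentence

# Splice: each character gets a (char_dim + pos_dim)-dimensional input vector.
spliced = np.concatenate([char_emb[char_ids], pos_emb[pos_ids]], axis=1)
print(spliced.shape)                 # one 12-dimensional vector per character
```

Because the word-boundary information rides along in the position half of each spliced vector, the downstream network sees it at every character, which is the mechanism the patent credits for the improved recognition rate.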
4. The neural-network-based Chinese named entity recognition method according to claim 1, wherein the step S103 comprises the following steps:
S301, applying the Chinese named entity recognition model to perform character-by-character serialized labeling on the target text, and then converting the labeled character strings into entities to obtain the entity labeling result.
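The second half of step S301, converting a character-by-character label sequence into entities, can be sketched as follows. A BIO tag scheme is assumed here for illustration; the patent does not fix the tag set, and the example names and entity types are hypothetical.

```python
# Sketch of converting per-character labels (BIO scheme assumed) into
# (entity_text, entity_type) spans, as in the second half of step S301.

def labels_to_entities(chars, labels):
    entities, start, etype = [], None, None
    for i, lab in enumerate(labels + ["O"]):       # sentinel "O" flushes the last span
        if start is not None and not lab.startswith("I-"):
            entities.append(("".join(chars[start:i]), etype))
            start, etype = None, None
        if lab.startswith("B-"):
            start, etype = i, lab[2:]
    return entities

chars = list("王小明在北京")
labels = ["B-PER", "I-PER", "I-PER", "O", "B-LOC", "I-LOC"]
print(labels_to_entities(chars, labels))
```

The sentinel appended at the end ensures an entity that runs to the last character is still emitted.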
5. The neural-network-based Chinese named entity recognition method according to claim 1, wherein the multi-layer neural network model is a CNN neural network model, a GRU neural network model, a bidirectional LSTM neural network model, a Transformer neural network model, or a BERT neural network model.
6. A neural-network-based Chinese named entity recognition device, characterized by comprising a data preprocessing module, a model training module, and an entity labeling module that are sequentially connected in communication;
the data preprocessing module is configured to preprocess the data to be trained to obtain the character feature identification vector and the character position identification vector of each sentence, wherein the character feature identification vector comprises the character feature unique ID number of each character in the corresponding sentence, and the character position identification vector comprises the character position unique ID number of each character in the corresponding sentence;
the data preprocessing module is specifically configured to obtain the character position identification vector of each sentence as follows: S1021, performing sentence segmentation processing on the data to be trained to obtain a plurality of sentences; S1022, performing full-mode word segmentation processing based on a word segmentation tool on each sentence to obtain a plurality of words; S1023, for each sentence, marking the position of each character in the word to which it belongs, and then splicing the position mark information into the character position label of the corresponding character according to the order of the words in the corresponding sentence; S1024, counting all character position labels and assigning a character position unique ID number to each character position label; S1025, for each sentence, generating the character position identification vector according to the character position unique ID number corresponding to each character in the corresponding sentence;
the data preprocessing module is specifically configured to mark the position of each character in the word to which it belongs as follows: splicing a word-head symbol, a word-middle symbol, a word-tail symbol, or a non-word symbol with the character position serial number to form the position mark information of the character in the word to which it belongs, wherein the character position serial number refers to the sequential number of the character within the word to which it belongs;
the data preprocessing module is specifically configured to assign a character position unique ID number to each character position label as follows: counting the total number of all character position labels as n, and then assigning the integers from 0 to n-1 to the character position labels one by one in their arrangement order;
the model training module is configured to take the character feature identification vector and the character position identification vector of each sentence as training samples and import them into a multi-layer neural network model for training to obtain the Chinese named entity recognition model;
and the entity labeling module is configured to apply the Chinese named entity recognition model to perform Chinese named entity recognition on the target text to obtain the entity labeling result.
7. A neural-network-based Chinese named entity recognition device, comprising a memory and a processor that are communicatively connected, wherein the memory is configured to store a computer program, and the processor is configured to execute the computer program to implement the steps of the neural-network-based Chinese named entity recognition method according to any one of claims 1 to 5.
8. A storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the neural-network-based Chinese named entity recognition method according to any one of claims 1 to 5.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201911000998.4A (CN110717331B) | 2019-10-21 | 2019-10-21 | Chinese named entity recognition method, device and equipment based on neural network and storage medium |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN110717331A | 2020-01-21 |
| CN110717331B | 2023-10-24 |
Family

ID=69213945

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201911000998.4A (CN110717331B, active) | Chinese named entity recognition method, device and equipment based on neural network and storage medium | 2019-10-21 | 2019-10-21 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN110717331B (en) |
Families Citing this family (14)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111339775A (en) * | 2020-02-11 | 2020-06-26 | 平安科技(深圳)有限公司 | Named entity identification method, device, terminal equipment and storage medium |
| CN111476031A (en) * | 2020-03-11 | 2020-07-31 | 重庆邮电大学 | Improved Chinese named entity recognition method based on Lattice-LSTM |
| CN111339779A (en) * | 2020-03-20 | 2020-06-26 | 桂林电子科技大学 | Named entity identification method for Vietnamese |
| CN111597804B (en) * | 2020-05-15 | 2023-03-10 | 腾讯科技(深圳)有限公司 | Method and related device for training entity recognition model |
| CN113743116A (en) * | 2020-05-28 | 2021-12-03 | 株式会社理光 | Training method and device for named entity recognition and computer readable storage medium |
| CN111709242B (en) * | 2020-06-01 | 2024-02-02 | 广州多益网络股份有限公司 | Chinese punctuation mark adding method based on named entity recognition |
| CN112101028B (en) * | 2020-08-17 | 2022-08-26 | 淮阴工学院 | Multi-feature bidirectional gating field expert entity extraction method and system |
| CN112257446A (en) * | 2020-10-20 | 2021-01-22 | 平安科技(深圳)有限公司 | Named entity recognition method and device, computer equipment and readable storage medium |
| CN112380854B (en) * | 2020-11-17 | 2024-03-01 | 苏州大学 | Chinese word segmentation method and device, electronic equipment and storage medium |
| CN114548103B (en) * | 2020-11-25 | 2024-03-29 | 马上消费金融股份有限公司 | Named entity recognition model training method and named entity recognition method |
| CN112380866A (en) * | 2020-11-25 | 2021-02-19 | 厦门市美亚柏科信息股份有限公司 | Text topic label generation method, terminal device and storage medium |
| CN112686047B (en) * | 2021-01-21 | 2024-03-29 | 北京云上曲率科技有限公司 | Sensitive text recognition method, device and system based on named entity recognition |
| CN113420557B (en) * | 2021-06-09 | 2024-03-08 | 山东师范大学 | Chinese named entity recognition method, system, equipment and storage medium |
| CN113408507B (en) * | 2021-08-20 | 2021-11-26 | 北京国电通网络技术有限公司 | Named entity identification method and device based on resume file and electronic equipment |
Citations (6)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107908614A (en) * | 2017-10-12 | 2018-04-13 | 北京知道未来信息技术有限公司 | Named entity recognition method based on Bi-LSTM |
| CN108460012A (en) * | 2018-02-01 | 2018-08-28 | 哈尔滨理工大学 | Named entity recognition method based on GRU-CRF |
| CN109446514A (en) * | 2018-09-18 | 2019-03-08 | 平安科技(深圳)有限公司 | Construction method, device and computer equipment of news property identification model |
| CN109635279A (en) * | 2018-11-22 | 2019-04-16 | 桂林电子科技大学 | Chinese named entity recognition method based on neural network |
| CN109933801A (en) * | 2019-03-25 | 2019-06-25 | 北京理工大学 | Bidirectional LSTM named entity recognition method based on predicted position attention |
| CN110083831A (en) * | 2019-04-16 | 2019-08-02 | 武汉大学 | Chinese named entity recognition method based on BERT-BiGRU-CRF |
Family Cites Families (3)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10867597B2 * | 2013-09-02 | 2020-12-15 | Microsoft Technology Licensing, LLC | Assignment of semantic labels to a sequence of words using neural network architectures |
| US10157177B2 * | 2016-10-28 | 2018-12-18 | Kira Inc. | System and method for extracting entities in electronic documents |
| RU2691214C1 * | 2017-12-13 | 2019-06-11 | ABBYY Production LLC | Text recognition using artificial intelligence |
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |