CN110717331B - Chinese named entity recognition method, device and equipment based on neural network and storage medium - Google Patents


Info

Publication number
CN110717331B
CN110717331B (Application CN201911000998.4A)
Authority
CN
China
Prior art keywords
word
character
sentence
neural network
character position
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911000998.4A
Other languages
Chinese (zh)
Other versions
CN110717331A (en)
Inventor
黄浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Aiyi Botong Information Technology Co ltd
Original Assignee
Beijing Aiyi Botong Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Aiyi Botong Information Technology Co ltd filed Critical Beijing Aiyi Botong Information Technology Co ltd
Priority to CN201911000998.4A priority Critical patent/CN110717331B/en
Publication of CN110717331A publication Critical patent/CN110717331A/en
Application granted granted Critical
Publication of CN110717331B publication Critical patent/CN110717331B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Character Discrimination (AREA)

Abstract

The invention relates to the technical field of Chinese language processing and recognition, and discloses a neural-network-based Chinese named entity recognition method, device, equipment and storage medium. The invention provides a new way to improve the recognition rate of Chinese named entities with neural-network deep learning by jointly exploiting character and word features: before model training, the data to be trained are preprocessed so that each training sample contains character position identification vectors that serve as word boundary information. This ensures that the trained Chinese named entity recognition model achieves a very high recognition rate and can convert an input text into named entity labels, solving the prior-art problems that word information in sentences cannot be exploited, the recognition effect is deficient and the recognition rate is hard to improve, and making the method convenient for practical application and popularization. In addition, the Chinese named entity recognition method is easy to implement and has low development and operation costs.

Description

Chinese named entity recognition method, device and equipment based on neural network and storage medium
Technical Field
The invention belongs to the technical field of Chinese language processing and recognition, and particularly relates to a Chinese named entity recognition method, device, equipment and storage medium based on a neural network.
Background
Named entity recognition (NER) is a basic task of natural language processing that identifies proper nouns and phrases in text and categorizes them. As more and more researchers propose various model structures in the NER field, using neural networks and deep learning to address the NER problem has become a major trend.
Character-based and word-based methods are currently the two mainstream approaches. The word-based method requires a word segmentation tool, but such tools are imperfect: once a segmentation error occurs, it directly affects the prediction of entity boundaries and leads to recognition errors. The character-based method, which trains in units of characters, has a larger training scale and a longer training time, yet research has shown that it outperforms the word-based method for Chinese named entity recognition. However, the character-based method cannot exploit the word information in a sentence (in fact, providing word boundary information can effectively improve the recognition rate), which leaves the recognition effect deficient and limits improvement of the recognition rate.
Disclosure of Invention
The invention aims to solve the problem that the existing character-based Chinese named entity recognition method cannot exploit the word information in sentences, which leaves the recognition effect deficient and limits improvement of the recognition rate.
The technical scheme adopted by the invention is as follows:
a Chinese named entity recognition method based on a neural network comprises the following steps:
s101, preprocessing data to be trained to obtain character feature identification vectors and character position identification vectors of all sentences, wherein the character feature identification vectors comprise character feature unique ID numbers of all words in corresponding sentences, and the character position identification vectors comprise character position unique ID numbers of all words in corresponding sentences;
s102, taking the character feature identification vector and the character position identification vector of each sentence as training samples, and importing a multi-layer neural network model for training to obtain a Chinese named entity recognition model;
s103, performing Chinese named entity recognition on the target text by using the Chinese named entity recognition model to obtain an entity labeling result.
Preferably, in the step S101, the character feature identification vector of each sentence is obtained according to the following steps:
s1011, carrying out sentence segmentation on the data to be trained to obtain a plurality of sentences;
s1012, performing word segmentation processing on each sentence to separate words from each other;
s1013, counting all words, and distributing a character characteristic unique ID number for each word;
s1014, generating the character feature identification vector according to the unique ID number of the corresponding character feature of each word in the corresponding sentence for each sentence.
Preferably, in the step S101, the character position identification vector of each sentence is obtained according to the following steps:
s1021, sentence segmentation is carried out on the data to be trained to obtain a plurality of sentences;
s1022, performing full-mode word segmentation processing based on word segmentation tools on each sentence to obtain a plurality of words;
s1023, marking the position of each word in the belonged word for each sentence, and then splicing the position marking information into a character position label of the corresponding word according to the sequence of the belonged word in the corresponding sentence;
s1024, counting all character position labels, and distributing a character position unique ID number for each character position label;
s1025, generating the character position identification vector according to the unique ID number of the corresponding character position of each word in the corresponding sentence for each sentence.
Further preferably, in said step S1023, the position of each character within the word it belongs to is marked as follows: the position mark information of a character is formed by concatenating a word-initial, word-middle, word-final or single-character symbol with the word length and the character's position number, where the position number is the ordinal position of the character within the word it belongs to.
Preferably, the step S102 includes the following steps:
s201, after the character feature identification vector and the character position identification vector are spliced, the character feature identification vector and the character position identification vector are imported into the multi-layer neural network model for training, and then an identification model containing a hidden layer vector is output;
s202, marking the entity of each character by using a conditional random field, and marking entity information in a sentence sequence;
s203, obtaining a group of optimal data weights through repeated training, and obtaining the Chinese named entity recognition model with highest recognition accuracy.
Preferably, the step S103 includes the following steps:
s301, carrying out word-by-word serialization labeling on the target text by applying the Chinese named entity recognition model, and then converting the character strings into entities to obtain entity labeling results.
Specifically, the multi-layer neural network model is a CNN neural network model, a GRU neural network model, a bidirectional LSTM neural network model, a Transformer neural network model or a BERT neural network model.
The other technical scheme adopted by the invention is as follows:
a Chinese named entity recognition device based on a neural network comprises a data preprocessing module, a model training module and an entity labeling module which are sequentially communicated;
the data preprocessing module is used for preprocessing data to be trained to obtain a character feature identification vector and a character position identification vector for each sentence, wherein the character feature identification vector comprises the character feature unique ID number of each character in the corresponding sentence, and the character position identification vector comprises the character position unique ID number of each character in the corresponding sentence;
the model training module is used for taking the character feature identification vector and the character position identification vector of each sentence as training samples, importing the training samples into a multi-layer neural network model for training, and obtaining a Chinese named entity recognition model;
and the entity labeling module is used for carrying out Chinese named entity recognition on the target text by applying the Chinese named entity recognition model to obtain an entity labeling result.
The other technical scheme adopted by the invention is as follows:
the Chinese named entity recognition device based on the neural network comprises a memory and a processor which are connected in communication, wherein the memory is used for storing a computer program, and the processor is used for executing the computer program to realize the steps of the Chinese named entity recognition method based on the neural network.
The other technical scheme adopted by the invention is as follows:
a storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a neural network based chinese named entity recognition method as described above.
The beneficial effects of the invention are as follows:
(1) The invention provides a new way to improve the recognition rate of Chinese named entities with neural-network deep learning by jointly exploiting character and word features: before model training, the data to be trained are preprocessed so that each training sample contains character position identification vectors that serve as word boundary information. This ensures that the trained Chinese named entity recognition model achieves a very high recognition rate; the recognition model can convert an input text into named entity labels, i.e., when the text to be recognized is input into the trained Chinese named entity recognition model, the model converts it into the corresponding label text. This solves the prior-art problems that word information in sentences cannot be exploited, the recognition effect is deficient and the recognition rate is hard to improve, and makes the method convenient for practical application and popularization;
(2) The Chinese named entity recognition method is easy to implement, has low development and operation costs, can provide a Chinese entity recognition service on a single server with high speed and accuracy, and can also be applied to other NLP tasks.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of the Chinese named entity recognition method provided by the present invention.
FIG. 2 is an exemplary diagram of a full-mode word segmentation result provided by the present invention.
Fig. 3 is a schematic structural diagram of the Chinese named entity recognition device provided by the present invention.
Fig. 4 is a schematic structural diagram of the Chinese named entity recognition equipment provided by the present invention.
Detailed Description
The invention is further described with reference to the drawings and specific examples. It should be noted that the description of these examples is for aiding in understanding the present invention, but is not intended to limit the present invention. Specific structural and functional details disclosed herein are merely representative of example embodiments of the invention. This invention may, however, be embodied in many alternate forms and should not be construed as limited to the embodiments set forth herein.
It should be understood that some of the processes described herein include a plurality of operations that occur in a particular order, but these operations may be performed in an order other than the one given here, or in parallel. Sequence numbers such as S101 and S102 merely distinguish the operations and do not themselves imply any order of execution. In addition, the processes may include more or fewer operations, which may likewise be performed sequentially or in parallel.
It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another element. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of example embodiments of the present invention.
It should be understood that the term "and/or" merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The term "/and" herein describes another association relationship, indicating that two relationships may exist; for example, A/and B may mean: A exists alone, or A and B exist simultaneously. The character "/" herein generally indicates an "or" relationship between the associated objects.
It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. In contrast, when an element is referred to as being "directly connected" or "directly coupled" to another element, there are no intervening elements. Other words used to describe relationships between elements (e.g., "between" versus "directly between", "adjacent" versus "directly adjacent", etc.) should be interpreted in a similar manner.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments of the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates to the contrary. It will be further understood that the terms "comprises," "comprising," "includes" and/or "including," when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, and do not preclude the presence or addition of one or more other features, numbers, steps, operations, elements, components, and/or groups thereof.
It should also be noted that in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may in fact be executed substantially concurrently or the figures may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
In the following description, specific details are provided to provide a thorough understanding of example embodiments. However, it will be understood by those of ordinary skill in the art that the example embodiments may be practiced without these specific details. For example, a system may be shown in block diagrams in order to avoid obscuring the examples with unnecessary detail. In other instances, well-known processes, structures, and techniques may be shown without unnecessary detail in order to avoid obscuring the example embodiments.
Example 1
As shown in figs. 1-2, the neural-network-based Chinese named entity recognition method provided by this embodiment may, but is not limited to, include the following steps.
S101, preprocessing data to be trained to obtain a character feature identification vector and a character position identification vector for each sentence, wherein the character feature identification vector comprises the character feature unique ID number of each character in the corresponding sentence, and the character position identification vector comprises the character position unique ID number of each character in the corresponding sentence.
In the step S101, the data to be trained may consist of various document data provided by a user or collected by existing collection software, where the document data may consist of, but is not limited to, some or all of the following fields: title, abstract, keywords, body, attachment title, attachment content, author information, and the like.
In the step S101, the character feature identification vectors of the sentences are preferably obtained according to the following steps S1011 to S1014: S1011, segmenting the data to be trained into a plurality of sentences; S1012, splitting each sentence into characters so that the characters are separated from each other; S1013, collecting all characters and assigning each character a character feature unique ID number; S1014, for each sentence, generating the character feature identification vector from the character feature unique ID number of each character in the sentence. In detail, both the sentence segmentation and the character splitting are conventional technical means. In the step S1013, the ID numbers may be assigned, but are not limited to being assigned, as follows: count the total number of distinct characters as m, and then assign each character an integer between 0 and m-1 in order. Thus, step S1013 yields a character table containing all characters and their character feature unique ID numbers, so that in the subsequent step S1014 the character table is used to encode each character of each sentence, producing the character-level feature identification vector of the sentence.
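The ID assignment of steps S1013 to S1014 can be sketched as follows. This is a minimal illustration; the function names (build_char_table, encode_sentence) are illustrative and not part of the invention:

```python
def build_char_table(sentences):
    """Assign each distinct character an integer ID from 0 to m-1,
    in order of first appearance (step S1013)."""
    chars, seen = [], set()
    for sent in sentences:
        for ch in sent:
            if ch not in seen:
                seen.add(ch)
                chars.append(ch)
    return {ch: i for i, ch in enumerate(chars)}

def encode_sentence(sentence, char_table):
    """Map a sentence to its character feature identification vector (step S1014)."""
    return [char_table[ch] for ch in sentence]

sentences = ["南京市长江大桥", "市长到了"]
table = build_char_table(sentences)          # m = 9 distinct characters
vec = encode_sentence("南京市长江大桥", table)  # [0, 1, 2, 3, 4, 5, 6]
```

On these two example sentences the m = 9 distinct characters receive the IDs 0 to 8 in order of first appearance, so the first sentence encodes to [0, 1, 2, 3, 4, 5, 6] and the repeated characters 市 and 长 in the second sentence reuse their earlier IDs.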
In the step S101, the character position identification vector of each sentence is preferably obtained according to the following steps S1021 to S1025: S1021, segmenting the data to be trained into a plurality of sentences; S1022, performing full-mode word segmentation on each sentence with a word segmentation tool to obtain a plurality of words; S1023, for each sentence, marking the position of each character within the word(s) it belongs to, and then concatenating the position mark information into a character position label for that character, following the order in which the words occur in the sentence; S1024, collecting all character position labels and assigning each label a character position unique ID number; S1025, for each sentence, generating the character position identification vector from the character position unique ID number of each character in the sentence. In detail, both the sentence segmentation and the full-mode word segmentation are conventional technical means; for full-mode segmentation, as in the example shown in fig. 2, the sentence "Nanjing City Yangtze River Bridge" (南京市长江大桥) yields overlapping words such as "Nanjing" (南京), "Nanjing City" (南京市), "mayor" (市长, a spurious overlap), "Yangtze River Bridge" (长江大桥) and "bridge" (大桥). In said step S1023, the position of each character within the word it belongs to may further preferably be marked as follows: the position mark information of a character is formed by concatenating a word-initial, word-middle, word-final or single-character symbol with the word length and the character's position number, where the position number is the ordinal position of the character within the word it belongs to; see the character position label examples shown in table 1 below:
TABLE 1 Character position label examples

    Symbol   Meaning
    B        word-initial character
    M        word-middle character
    E        word-final character
    S        single character, not part of a longer word

As shown in table 1 above, the word-initial symbol B, the word-middle symbol M, the word-final symbol E or the single-character symbol S is concatenated with the word length and the position number to form the position mark information of a character within a word. For example, the character "市" (city) appears in the two words "南京市" (Nanjing City) and "市长" (mayor): it is at the end (E) of the first word and the beginning (B) of the second; the word lengths are 3 and 2, respectively; and its positions within the words are 3 and 1, i.e., L3 and L1. Hence its position mark information is "E3L3" in "南京市" and "B2L1" in "市长", and the final concatenated character position label is "[E3L3-B2L1]". In the step S1024, the ID numbers may be assigned, but are not limited to being assigned, as follows: count the total number of distinct character position labels as n, and then assign each label an integer between 0 and n-1 in order. Thus, step S1024 likewise yields a character position table containing all character position labels and their character position unique ID numbers, so that in the subsequent step S1025 each character of each sentence is encoded with the character position table, producing the character-level position identification vector of the sentence.
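The construction of character position labels in steps S1022 to S1023 can be sketched as follows. The full-mode word list is written out by hand here instead of calling a segmentation tool, the tie-breaking order for words starting at the same position follows the list order (an assumption), and the "S1L1" mark for characters outside every segmented word is likewise an illustrative assumption:

```python
def char_mark(word, idx):
    """Mark of the character at 0-based index idx inside word:
    B/M/E/S symbol + word length + 'L' + 1-based position number."""
    n = len(word)
    if n == 1:
        sym = "S"
    elif idx == 0:
        sym = "B"
    elif idx == n - 1:
        sym = "E"
    else:
        sym = "M"
    return f"{sym}{n}L{idx + 1}"

def position_labels(sentence, words):
    """For each character, join the marks from every full-mode word
    covering it, in the order the words are listed (step S1023)."""
    marks = [[] for _ in sentence]
    for word in words:
        start = 0
        while True:
            pos = sentence.find(word, start)
            if pos == -1:
                break
            for i in range(len(word)):
                marks[pos + i].append(char_mark(word, i))
            start = pos + 1
    # Characters covered by no word get the single-character mark (assumption).
    return ["[" + "-".join(m or ["S1L1"]) + "]" for m in marks]

sent = "南京市长江大桥"
full_mode_words = ["南京", "南京市", "市长", "长江", "长江大桥", "大桥"]
labels = position_labels(sent, full_mode_words)
```

For the character 市 at index 2 this reproduces the label "[E3L3-B2L1]" from the example above: E3L3 from 南京市 and B2L1 from 市长.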
S102, taking the character feature identification vector and the character position identification vector of each sentence as training samples, and importing the training samples into a multi-layer neural network model for training to obtain a Chinese named entity recognition model.
In the step S102, the multi-layer neural network model is a deep learning model commonly used for training recognition models; importing a sufficient number of training samples (i.e., the character feature identification vectors and character position identification vectors of the sentences) yields a Chinese named entity recognition model with high recognition accuracy, and the model structure and the specific training method can be designed conventionally with reference to existing practice. Specifically, the multi-layer neural network model may be, but is not limited to, a CNN neural network model, a GRU neural network model, a bidirectional LSTM neural network model, a Transformer neural network model or a BERT neural network model, all of which are existing common models.
In the step S102, the Chinese named entity recognition model is preferably obtained by the following steps S201 to S203: S201, after concatenating the character feature identification vector and the character position identification vector, importing them into the multi-layer neural network model for training, and outputting a recognition model containing hidden layer vectors; S202, labeling the entity of each character with a conditional random field, so that entity information in the sentence sequence is annotated; S203, obtaining an optimal set of weights through repeated training, yielding the Chinese named entity recognition model with the highest recognition accuracy. In the step S202, although the neural network layers alone can train a very good recognition model, a deficiency remains: some labeling constraints may be violated. For example, if the correct labels of a word are "B-LOC I-LOC" but the model outputs "B-LOC I-ORG", the two labels violate a transition constraint. A CRF layer (Conditional Random Field, a mathematical algorithm proposed in 2001; an undirected probabilistic graphical model obeying the Markov property that combines the advantages of maximum entropy models and hidden Markov models, and has in recent years achieved very good results in sequence labeling tasks such as word segmentation, part-of-speech tagging and named entity recognition) takes the context of the output labels into account, so that the generated result is the optimal solution for the whole sequence, and the Chinese named entity recognition model with the highest recognition accuracy is obtained in step S203.
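The effect of the CRF layer in step S202 can be illustrated with a plain Viterbi decode over hand-written transition constraints. The tag set and emission scores below are invented for illustration only; they are not the trained model's weights:

```python
import math

TAGS = ["O", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

def allowed(prev, cur):
    """An I- tag may only continue a B-/I- tag of the same entity type."""
    if cur.startswith("I-"):
        return prev != "O" and prev[2:] == cur[2:]
    return True

def viterbi(emissions):
    """emissions: one {tag: score} dict per character (missing tags score 0).
    Returns the highest-scoring tag path that satisfies the constraints."""
    # An entity cannot start with an I- tag.
    prev = {t: (emissions[0].get(t, 0.0) if not t.startswith("I-") else -math.inf)
            for t in TAGS}
    backptrs = []
    for em in emissions[1:]:
        cur, bp = {}, {}
        for t in TAGS:
            cands = [p for p in TAGS if allowed(p, t)]
            best = max(cands, key=lambda p: prev[p])
            cur[t] = prev[best] + em.get(t, 0.0)
            bp[t] = best
        backptrs.append(bp)
        prev = cur
    tag = max(prev, key=prev.get)
    path = [tag]
    for bp in reversed(backptrs):
        tag = bp[tag]
        path.append(tag)
    return path[::-1]

scores = [{"B-LOC": 2.0, "B-ORG": 1.0}, {"I-ORG": 2.0, "I-LOC": 1.5}]
greedy = [max(em, key=em.get) for em in scores]   # per-character argmax
decoded = viterbi(scores)                          # whole-sequence optimum
```

Per-character argmax yields the inconsistent pair "B-LOC I-ORG" from the example above, while the constrained Viterbi decode yields the valid sequence "B-LOC I-LOC": the whole-sequence optimum, which is exactly what the CRF layer provides.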
S103, performing Chinese named entity recognition on the target text by using the Chinese named entity recognition model to obtain an entity labeling result.
In the step S103, the entity labeling result is preferably obtained according to the following step S301: S301, applying the Chinese named entity recognition model to perform character-by-character sequence labeling on the target text, and then converting the labeled strings into entities to obtain the entity labeling result.
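The conversion from a character-by-character label sequence to entities in step S301 can be sketched as follows, assuming BIO-style labels as in the CRF example of step S202 (the helper name tags_to_entities is illustrative):

```python
def tags_to_entities(text, tags):
    """Convert per-character BIO labels into (entity_text, type, start, end) spans."""
    entities = []
    start = etype = None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel flushes the last span
        ends = tag == "O" or tag.startswith("B-") or i == len(text)
        if ends and start is not None:
            entities.append((text[start:i], etype, start, i))
            start = etype = None
        if i < len(text) and tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities

result = tags_to_entities(
    "南京市长江大桥",
    ["B-LOC", "I-LOC", "I-LOC", "B-LOC", "I-LOC", "I-LOC", "I-LOC"],
)
```

On this input the two labeled spans are recovered as the entities 南京市 (characters 0-3) and 长江大桥 (characters 3-7), both of type LOC.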
In summary, the neural network-based Chinese named entity recognition method provided by the embodiment has the following technical effects:
(1) This embodiment provides a new way to improve the recognition rate of Chinese named entities with neural-network deep learning by jointly exploiting character and word features: before model training, the data to be trained are preprocessed so that each training sample contains character position identification vectors that serve as word boundary information. This ensures that the trained Chinese named entity recognition model achieves a very high recognition rate; the recognition model can convert an input text into named entity labels, i.e., when the text to be recognized is input into the trained Chinese named entity recognition model, the model converts it into the corresponding label text. This solves the prior-art problems that word information in sentences cannot be exploited, the recognition effect is deficient and the recognition rate is hard to improve, and makes the method convenient for practical application and popularization;
(2) The Chinese named entity recognition method is easy to implement, has low development and operation costs, can provide a Chinese entity recognition service on a single server with high speed and accuracy, and can also be applied to other NLP tasks.
Example two
As shown in fig. 3, this embodiment provides a hardware device implementing the neural-network-based Chinese named entity recognition method, comprising a data preprocessing module, a model training module and an entity labeling module connected in communication in sequence. The data preprocessing module is used for preprocessing data to be trained to obtain a character feature identification vector and a character position identification vector for each sentence, wherein the character feature identification vector comprises the character feature unique ID number of each character in the corresponding sentence, and the character position identification vector comprises the character position unique ID number of each character in the corresponding sentence; the model training module is used for taking the character feature identification vectors and character position identification vectors of the sentences as training samples and importing them into a multi-layer neural network model for training to obtain a Chinese named entity recognition model; and the entity labeling module is used for applying the Chinese named entity recognition model to perform Chinese named entity recognition on the target text to obtain an entity labeling result.
The working process, working details and technical effects of the Chinese named entity recognition device provided in this embodiment may refer to the first embodiment, and are not described herein again.
Example III
As shown in fig. 4, this embodiment provides a hardware device implementing the neural-network-based Chinese named entity recognition method of embodiment one, comprising a memory and a processor connected in communication, where the memory is configured to store a computer program and the processor is configured to execute the computer program to implement the steps of the neural-network-based Chinese named entity recognition method of embodiment one.
The working process, working details, and technical effects of the Chinese named entity recognition device provided in this embodiment may refer to the first embodiment, and are not described herein again.
Example IV
The present embodiment provides a storage medium storing a computer program; that is, when the stored computer program is executed by a processor, it implements the steps of the neural-network-based Chinese named entity recognition method of the first embodiment. The executing computer may be a general-purpose computer, a special-purpose computer, a computer network, another programmable device, or a mobile smart device (such as a smart phone, a tablet, or an iPad).
The working process, working details and technical effects of the storage medium provided in this embodiment may refer to the first embodiment, and are not described herein again.
The embodiments described above are illustrative only. Elements described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, and those of ordinary skill in the art can understand and implement the invention without undue effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by software plus a necessary general-purpose hardware platform, or, of course, by hardware. Based on this understanding, the foregoing technical solutions may be embodied, in essence or in the part contributing to the prior art, in the form of a software product stored in a computer-readable storage medium (such as ROM/RAM, a magnetic disk, or an optical disk) that includes several instructions for causing a computer device to perform the method described in the embodiments or in some parts thereof.
The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described therein may still be modified, or some of their technical features may be replaced by equivalents, and such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Finally, it should be noted that the invention is not limited to the alternative embodiments described above; anyone may derive various other forms of products in light of the present disclosure. The above detailed description should not be construed as limiting the scope of the invention, which is defined by the claims, while the description may be used to interpret the claims.

Claims (8)

1. A neural-network-based Chinese named entity recognition method, characterized by comprising the following steps:
S101, preprocessing the data to be trained to obtain a character feature identification vector and a character position identification vector for each sentence, wherein the character feature identification vector comprises the unique character feature ID number of each character in the corresponding sentence, and the character position identification vector comprises the unique character position ID number of each character in the corresponding sentence;
in step S101, the character position identification vector of each sentence is obtained as follows: S1021, performing sentence segmentation on the data to be trained to obtain a plurality of sentences; S1022, performing full-mode word segmentation based on a word segmentation tool on each sentence to obtain a plurality of words; S1023, for each sentence, marking the position of each character in the word to which it belongs, and then splicing the position mark information into the character position label of the corresponding character according to the order in the sentence of the words to which it belongs; S1024, counting all character position labels and assigning a unique character position ID number to each character position label; S1025, for each sentence, generating the character position identification vector from the unique character position ID number corresponding to each character in the sentence;
in step S1023, the position of each character in the word to which it belongs is marked as follows: splicing a word-head symbol, a word-middle symbol, a word-tail symbol, or a non-word symbol to form the position mark information of the character in the word to which it belongs, wherein the character position serial number refers to the sequential number of the character within that word;
in step S1024, a unique character position ID number is assigned to each character position label as follows: counting the total number of character position labels as n, and then assigning the integers 0 through n-1 to the character position labels one by one in their order of arrangement;
S102, taking the character feature identification vector and the character position identification vector of each sentence as training samples and importing them into a multi-layer neural network model for training, so as to obtain a Chinese named entity recognition model;
S103, applying the Chinese named entity recognition model to perform Chinese named entity recognition on a target text, so as to obtain an entity labeling result.
2. The neural-network-based Chinese named entity recognition method according to claim 1, wherein in step S101 the character feature identification vector of each sentence is obtained as follows:
S1011, as in S1021, performing sentence segmentation on the data to be trained to obtain a plurality of sentences;
S1012, performing character segmentation on each sentence so that the characters are separated from one another;
S1013, counting all characters and assigning a unique character feature ID number to each character;
S1014, for each sentence, generating the character feature identification vector from the unique character feature ID number corresponding to each character in the sentence.
3. The neural-network-based Chinese named entity recognition method according to claim 1, wherein step S102 comprises the following steps:
S201, splicing the character feature identification vector and the character position identification vector, importing the result into the multi-layer neural network model for training, and outputting a recognition model containing hidden-layer vectors;
S202, labeling the entity of each character by using a conditional random field, so as to mark the entity information in the sentence sequence;
S203, obtaining a set of optimal data weights through repeated training, so as to obtain the Chinese named entity recognition model with the highest recognition accuracy.
4. The neural-network-based Chinese named entity recognition method according to claim 1, wherein step S103 comprises the following step:
S301, applying the Chinese named entity recognition model to perform character-by-character sequence labeling on the target text, and then converting the labeled character strings into entities to obtain the entity labeling result.
5. The neural-network-based Chinese named entity recognition method according to claim 1, wherein the multi-layer neural network model is a CNN neural network model, a GRU neural network model, a bidirectional LSTM neural network model, a Transformer neural network model, or a BERT neural network model.
6. A neural-network-based Chinese named entity recognition device, characterized by comprising a data preprocessing module, a model training module, and an entity labeling module that are communicatively connected in sequence;
the data preprocessing module is configured to preprocess the data to be trained to obtain a character feature identification vector and a character position identification vector for each sentence, wherein the character feature identification vector comprises the unique character feature ID number of each character in the corresponding sentence, and the character position identification vector comprises the unique character position ID number of each character in the corresponding sentence;
the data preprocessing module is specifically configured to obtain the character position identification vector of each sentence as follows: S1021, performing sentence segmentation on the data to be trained to obtain a plurality of sentences; S1022, performing full-mode word segmentation based on a word segmentation tool on each sentence to obtain a plurality of words; S1023, for each sentence, marking the position of each character in the word to which it belongs, and then splicing the position mark information into the character position label of the corresponding character according to the order in the sentence of the words to which it belongs; S1024, counting all character position labels and assigning a unique character position ID number to each character position label; S1025, for each sentence, generating the character position identification vector from the unique character position ID number corresponding to each character in the sentence;
the data preprocessing module is specifically configured to mark the position of each character in the word to which it belongs as follows: splicing a word-head symbol, a word-middle symbol, a word-tail symbol, or a non-word symbol to form the position mark information of the character in the word to which it belongs, wherein the character position serial number refers to the sequential number of the character within that word;
the data preprocessing module is specifically configured to assign a unique character position ID number to each character position label as follows: counting the total number of character position labels as n, and then assigning the integers 0 through n-1 to the character position labels one by one in their order of arrangement;
the model training module is configured to take the character feature identification vector and the character position identification vector of each sentence as training samples and import them into a multi-layer neural network model for training, so as to obtain a Chinese named entity recognition model;
and the entity labeling module is configured to apply the Chinese named entity recognition model to perform Chinese named entity recognition on a target text, so as to obtain an entity labeling result.
7. A neural-network-based Chinese named entity recognition device, comprising a memory and a processor that are communicatively connected, wherein the memory is configured to store a computer program and the processor is configured to execute the computer program to implement the steps of the neural-network-based Chinese named entity recognition method according to any one of claims 1 to 5.
8. A storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the neural-network-based Chinese named entity recognition method according to any one of claims 1 to 5.
CN201911000998.4A 2019-10-21 2019-10-21 Chinese named entity recognition method, device and equipment based on neural network and storage medium Active CN110717331B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911000998.4A CN110717331B (en) 2019-10-21 2019-10-21 Chinese named entity recognition method, device and equipment based on neural network and storage medium


Publications (2)

Publication Number Publication Date
CN110717331A CN110717331A (en) 2020-01-21
CN110717331B true CN110717331B (en) 2023-10-24

Family

ID=69213945

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911000998.4A Active CN110717331B (en) 2019-10-21 2019-10-21 Chinese named entity recognition method, device and equipment based on neural network and storage medium

Country Status (1)

Country Link
CN (1) CN110717331B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339775A (en) * 2020-02-11 2020-06-26 平安科技(深圳)有限公司 Named entity identification method, device, terminal equipment and storage medium
CN111476031A (en) * 2020-03-11 2020-07-31 重庆邮电大学 Improved Chinese named entity recognition method based on Lattice-LSTM
CN111339779A (en) * 2020-03-20 2020-06-26 桂林电子科技大学 Named entity identification method for Vietnamese
CN111597804B (en) * 2020-05-15 2023-03-10 腾讯科技(深圳)有限公司 Method and related device for training entity recognition model
CN113743116A (en) * 2020-05-28 2021-12-03 株式会社理光 Training method and device for named entity recognition and computer readable storage medium
CN111709242B (en) * 2020-06-01 2024-02-02 广州多益网络股份有限公司 Chinese punctuation mark adding method based on named entity recognition
CN112101028B (en) * 2020-08-17 2022-08-26 淮阴工学院 Multi-feature bidirectional gating field expert entity extraction method and system
CN112257446A (en) * 2020-10-20 2021-01-22 平安科技(深圳)有限公司 Named entity recognition method and device, computer equipment and readable storage medium
CN112380854B (en) * 2020-11-17 2024-03-01 苏州大学 Chinese word segmentation method and device, electronic equipment and storage medium
CN114548103B (en) * 2020-11-25 2024-03-29 马上消费金融股份有限公司 Named entity recognition model training method and named entity recognition method
CN112380866A (en) * 2020-11-25 2021-02-19 厦门市美亚柏科信息股份有限公司 Text topic label generation method, terminal device and storage medium
CN112686047B (en) * 2021-01-21 2024-03-29 北京云上曲率科技有限公司 Sensitive text recognition method, device and system based on named entity recognition
CN113420557B (en) * 2021-06-09 2024-03-08 山东师范大学 Chinese named entity recognition method, system, equipment and storage medium
CN113408507B (en) * 2021-08-20 2021-11-26 北京国电通网络技术有限公司 Named entity identification method and device based on resume file and electronic equipment

Citations (6)

Publication number Priority date Publication date Assignee Title
CN107908614A (en) * 2017-10-12 2018-04-13 北京知道未来信息技术有限公司 A kind of name entity recognition method based on Bi LSTM
CN108460012A (en) * 2018-02-01 2018-08-28 哈尔滨理工大学 A kind of name entity recognition method based on GRU-CRF
CN109446514A (en) * 2018-09-18 2019-03-08 平安科技(深圳)有限公司 Construction method, device and the computer equipment of news property identification model
CN109635279A (en) * 2018-11-22 2019-04-16 桂林电子科技大学 A kind of Chinese name entity recognition method neural network based
CN109933801A (en) * 2019-03-25 2019-06-25 北京理工大学 Two-way LSTM based on predicted position attention names entity recognition method
CN110083831A (en) * 2019-04-16 2019-08-02 武汉大学 A kind of Chinese name entity recognition method based on BERT-BiGRU-CRF

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US10867597B2 (en) * 2013-09-02 2020-12-15 Microsoft Technology Licensing, Llc Assignment of semantic labels to a sequence of words using neural network architectures
US10157177B2 (en) * 2016-10-28 2018-12-18 Kira Inc. System and method for extracting entities in electronic documents
RU2691214C1 (en) * 2017-12-13 2019-06-11 Общество с ограниченной ответственностью "Аби Продакшн" Text recognition using artificial intelligence



Similar Documents

Publication Publication Date Title
CN110717331B (en) Chinese named entity recognition method, device and equipment based on neural network and storage medium
CN111985239B (en) Entity identification method, entity identification device, electronic equipment and storage medium
CN109543181B (en) Named entity model and system based on combination of active learning and deep learning
CN109885824B (en) Hierarchical Chinese named entity recognition method, hierarchical Chinese named entity recognition device and readable storage medium
CN110188362B (en) Text processing method and device
CN110287480B (en) Named entity identification method, device, storage medium and terminal equipment
CN110598203B (en) Method and device for extracting entity information of military design document combined with dictionary
CN109635279B (en) Chinese named entity recognition method based on neural network
CN109359291A (en) A kind of name entity recognition method
CN112329465A (en) Named entity identification method and device and computer readable storage medium
CN110309511B (en) Shared representation-based multitask language analysis system and method
CN112163429B (en) Sentence correlation obtaining method, system and medium combining cyclic network and BERT
CN110222184A (en) A kind of emotion information recognition methods of text and relevant apparatus
CN111522839A (en) Natural language query method based on deep learning
CN111368544B (en) Named entity identification method and device
CN113743119B (en) Chinese named entity recognition module, method and device and electronic equipment
CN108829823A (en) A kind of file classification method
CN112036184A (en) Entity identification method, device, computer device and storage medium based on BilSTM network model and CRF model
CN113743101B (en) Text error correction method, apparatus, electronic device and computer storage medium
CN111008526A (en) Named entity identification method based on dual-channel neural network
CN110852040A (en) Punctuation prediction model training method and text punctuation determination method
CN115374786A (en) Entity and relationship combined extraction method and device, storage medium and terminal
CN113553853B (en) Named entity recognition method and device, computer equipment and storage medium
CN112487813B (en) Named entity recognition method and system, electronic equipment and storage medium
CN111353295A (en) Sequence labeling method and device, storage medium and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant