CN112487812A

CN112487812A - Nested entity identification method and system based on boundary identification

Info

Publication number: CN112487812A
Application number: CN202011134652.6A
Authority: CN
Inventors: 姜华; 田济东; 郦一天; 姜晨昊
Original assignee: Shanghai Minpu Technology Co ltd
Current assignee: Shanghai Minpu Technology Co ltd
Priority date: 2020-10-21
Filing date: 2020-10-21
Publication date: 2021-03-12
Anticipated expiration: 2040-10-21
Also published as: CN112487812B

Abstract

The invention provides a nested entity recognition method and system based on boundary recognition. The method preprocesses input text data and converts the preprocessed text into multi-dimensional vectors; performs feature coding on these vectors to obtain coding vectors carrying context information; extracts boundary-related information from the coding vectors and decodes it to identify the boundaries of entity segments, yielding entity boundary information; masks the coding vectors with the identified entity boundary information to obtain candidate entity fragment vectors, and classifies the features of the candidate entity fragments by entity classification decoding to obtain entity classification information; and combines the entity classification information with the entity boundary information to obtain the nested entities to be extracted. By flattening the nested structure with a two-layer boundary recognition scheme, the method recognizes nested entities while maintaining recognition accuracy and generalization capability.

Description

Nested entity identification method and system based on boundary identification

Technical Field

The invention relates to the technical field of natural language processing, in particular to a nested entity identification method and a nested entity identification system based on boundary identification, which are used for identifying nested entities in natural language.

Background

Named entities are the basic units that carry information in natural language, and entity recognition underlies natural language processing tasks such as information extraction and reading comprehension. Research on accurate entity extraction is therefore of great significance to natural language processing.

Generally, a named entity is a noun with special significance in a text, such as a person name (PER), a location (LOC), a geographic area (GPE), an organization (ORG), or another proper or special noun. Conventional entity recognition can be implemented with a deep-learning sequence labeling model (such as a long short-term memory network combined with a conditional random field): the model assigns a single label to each semantic unit, and entity fragments are obtained by combining the labels. However, named entities can be nested, so a one-to-one relationship between a word and an entity label cannot always be established. Mature sequence labeling models therefore cannot be applied directly to the identification of nested entities.

For the identification of nested entities, there are currently two main types of methods:

One type identifies the nested entities layer by layer according to certain rules. This approach has three serious defects: 1) errors made at different levels accumulate, so recognition quality degrades as the levels deepen; 2) the level definition is ambiguous, so entities within the same layer differ greatly in distribution and are difficult for the model to identify accurately; 3) the same text segment is recognized repeatedly, which adds unnecessary computation and cost. These drawbacks make such methods impractical.

The other type flattens the nested structure with the help of external knowledge and then extracts the entities by sequence labeling. This external knowledge, including regular expressions, annotation rules and the like, is a generalization of the prior knowledge carried by the entities in the text. In practice, however, entity distributions and patterns differ across domains, so different external knowledge must be customized for each data set. Such methods therefore tend to perform well only on a particular data set and do not generalize.

Against this background, the main tension in current nested entity identification is how to balance accuracy and generalization capability. Constructing a method that generalizes while preserving the identification accuracy of nested entities is therefore important for the practical application of nested entity identification.

At present, no description or report of technology similar to the invention has been found, and no similar material has been collected at home or abroad.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides a nested entity identification method and a nested entity identification system based on boundary identification.

The invention is realized by the following technical scheme.

According to one aspect of the invention, a nested entity identification method based on boundary identification is provided, which comprises the following steps:

carrying out data preprocessing on an input text, and converting the preprocessed text data into a multi-dimensional vector;

performing feature coding on the obtained multidimensional vector to obtain a coding vector with context information;

extracting entity boundary information from the coding vector with the context information, then decoding the extracted entity boundary information, identifying the boundary of the obtained entity segment, and obtaining entity boundary information;

masking the coding vector with the context information by adopting the entity boundary information obtained by identification to obtain alternative entity fragment vectors, and classifying the characteristics of the alternative entity fragments by entity classification decoding to obtain entity classification information;

and combining the obtained entity classification information and the entity boundary information to further obtain the nested entity to be extracted.

Preferably, the data preprocessing of the input text includes: text preprocessing and vector embedding; wherein:

the text preprocessing is used for capturing the internal information of the input text, including word segmentation, part-of-speech tagging, grammar parsing and semantic parsing, and obtaining text segments taking words as units and grammar dependency trees and semantic parsing trees corresponding to the text segments;

the vector embedding is to embed vocabulary, characters, parts of speech, semantics and grammar on the basis of text preprocessing; wherein:

vocabulary embedding is vectorized through a pre-trained language model, including: calling a pre-trained Chinese language model, encoding each word through the interface provided by the model to serve as the model input, and finally obtaining a vocabulary vector through BERT computation;

the character embedding is realized by a convolutional neural network learning embedding mode, and the method comprises the following steps: randomly initializing a character embedding table, coding each character, obtaining an initial vector through the embedding table, performing convolution on the vector through a convolution neural network, and obtaining a character-level vector by adopting a maximum pooling method;

part-of-speech embedding is obtained by randomly initializing vectors and training, and comprises the following steps: randomly initializing a part-of-speech embedding table, coding each type of part-of-speech, and obtaining a part-of-speech vector through the embedding table;

semantic and grammar embedding convolves the semantic parse tree and the grammar dependency tree through a graph convolutional network to obtain the corresponding semantic vectors and grammar vectors;

the input text is converted into a multi-dimensional vector through text preprocessing and vector embedding.
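The character-embedding step described above (random embedding table, convolution, max pooling) can be sketched as follows. This is a minimal illustration in plain Python, not the patented implementation: the function name `char_cnn_embed`, the dimensions, the window width and the random weights are all assumptions, and in practice the table and kernels would be trained.

```python
import random

def char_cnn_embed(word, table, kernels, width=2):
    """Embed `word` by convolving its character vectors and max-pooling over positions."""
    chars = [table[c] for c in word]
    dim = len(chars[0])
    # Pad with zero vectors so that even a one-character word yields one window.
    while len(chars) < width:
        chars.append([0.0] * dim)
    pooled = []
    for kernel in kernels:  # each kernel holds width * dim weights
        responses = []
        for start in range(len(chars) - width + 1):
            window = [x for vec in chars[start:start + width] for x in vec]
            responses.append(sum(w * x for w, x in zip(kernel, window)))
        pooled.append(max(responses))  # max pooling over window positions
    return pooled

rng = random.Random(0)
DIM, N_KERNELS, WIDTH = 4, 3, 2  # hypothetical sizes
# Randomly initialised character embedding table (assumed vocabulary).
table = {c: [rng.uniform(-1, 1) for _ in range(DIM)] for c in "上海交通大学"}
kernels = [[rng.uniform(-1, 1) for _ in range(WIDTH * DIM)] for _ in range(N_KERNELS)]
vec = char_cnn_embed("上海", table, kernels, WIDTH)
```

The output has one component per kernel regardless of word length, which is what allows the character-level vector to be combined with the other per-word embeddings.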

Preferably, the performing feature coding on the obtained multidimensional vector to obtain a coded vector with context information includes:

performing linear transformation and nonlinear transformation on the obtained multidimensional vector with a bidirectional long short-term memory (BiLSTM) network, so that the resulting coding vector contains context information, namely the coding vector with the context information.

Preferably, the extracting information related to the entity boundary for the coding vector with the context information, then decoding the extracted information related to the entity boundary, and identifying the boundary where the entity segment is obtained, to obtain the entity boundary information includes:

using a two-level pointer network to identify, in a mesh-like manner, the left boundary groups and right boundary sequences from the coding vector with context information, and then decoding them into the corresponding entity boundaries.

Preferably, the two-level pointer network comprises a group sequence pointer network for identifying a left boundary group and an entity sequence pointer network for identifying a right boundary sequence; wherein:

for the group sequence pointer network, the input is the coding vector e with context information and the left boundary vector o obtained at the previous moment; an attention operation is performed on the coding vector e with the left boundary vector o to obtain the unnormalized positioning probability; for time j, the left-boundary positioning probability is:

where u_{j,i} is the unnormalized left-boundary positioning probability, v and W are trainable parameters, the subscript l denotes the left boundary, and the superscript T is the vector transposition symbol;

the left boundary vector o_j selected at time j is then:

o_j = argmax_i(u_{j,i});

for the entity sequence pointer network, the input is the coding vector, the right boundary vector obtained at the previous moment, and the left boundary vector of the corresponding group; the left boundary vector and the corresponding right boundary vector are spliced, and the attention operation is then performed on the coding vector:

where u_{j,k,i} is the unnormalized right-boundary positioning probability, the subscripts r and k denote the right boundary and the corresponding k-th left boundary, respectively, and the superscript T is the vector transposition symbol;

the right boundary vector finally obtained is o_{j,k} = argmax_i(u_{j,k,i}).
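The attention formulas themselves are not reproduced in this text. One form consistent with the surrounding definitions, assuming the standard additive attention of pointer networks, is sketched below. The concatenated argument, the use of a single weight matrix per network, and the reading of o as the coding vector at the selected position are assumptions rather than the formulas of the original filing.

```latex
% Assumed additive (pointer-network style) attention; [a ; b] denotes concatenation.
\begin{align}
u_{j,i}   &= v_l^{T} \tanh\!\left(W_l\,[\,e_i \,;\, o_{j-1}\,]\right), &
o_j       &= \arg\max_i \, u_{j,i}, \\
u_{j,k,i} &= v_r^{T} \tanh\!\left(W_r\,[\,e_i \,;\, o_j \,;\, o_{j,k-1}\,]\right), &
o_{j,k}   &= \arg\max_i \, u_{j,k,i}.
\end{align}
```

Here e_i is the coding vector at position i, the subscripts l and r mark the left-boundary and right-boundary parameters, and the second line splices the group's left boundary vector with the previous right boundary vector before attending over the coding vectors, as the text describes.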

Preferably, the masking the coding vector with the context information by using the entity boundary information obtained by the identification to obtain the candidate entity fragment vector, and classifying the features of the candidate entity fragment by entity classification decoding to obtain the entity classification information includes:

masking the coding vector with the context information by adopting the entity boundary information obtained by identification to obtain an alternative entity fragment vector, learning the alternative entity fragment vector through a convolutional neural network, and classifying the obtained characteristics to obtain the category of the entity, namely the entity classification information.

Preferably, the method further comprises: optimizing the entity boundary information extraction process and the entity classification information extraction process.

Preferably, the optimizing the entity boundary information extraction process and the entity classification information extraction process includes:

alternately training the entity boundary information extraction process and the entity classification information extraction process with a cross-entropy loss function in a recall-priority manner, thereby optimizing the extraction processes.

Preferably, in the process of optimizing the entity classification information extraction process, a null value class and a negative sample are also added; wherein:

the null value class is used for secondary entity screening, so that the accuracy is improved;

the negative examples are used to ensure that the representation of the null value class can be learned;

the negative examples are generated by an entity boundary information extraction process.
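The alternating training with a cross-entropy loss described above can be sketched as follows. This is a schematic under stated assumptions: the source does not say whether alternation happens per step or per epoch (per step is assumed here), the recall-priority strategy is not modelled, and the names `cross_entropy` and `alternating_schedule` are illustrative.

```python
import math

def cross_entropy(probs, gold_index):
    """Negative log-likelihood of the gold class under a probability vector."""
    return -math.log(probs[gold_index])

def alternating_schedule(num_steps):
    """Yield which decoder to update at each step, alternating between the two."""
    for step in range(num_steps):
        yield "boundary" if step % 2 == 0 else "classification"

schedule = list(alternating_schedule(4))
# Toy example: the model gives probability 0.7 to the gold class (index 1).
loss = cross_entropy([0.1, 0.7, 0.2], gold_index=1)
```

In a real training loop each yielded name would select which decoder's loss is back-propagated at that step, while the shared feature coding parameters receive gradients from both.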

According to another aspect of the present invention, there is provided a nested entity recognition system based on boundary recognition, including:

a data preprocessing module: carrying out data preprocessing on an input text, and converting the preprocessed text data into a multi-dimensional vector;

the characteristic coding module is used for carrying out characteristic coding on the multidimensional vector obtained by the data preprocessing module to obtain a coding vector with context information;

the boundary identification decoding module is used for extracting and obtaining entity boundary information by taking the coding vector with the context information obtained by the characteristic coding module as input, then decoding the extracted entity boundary information, identifying and obtaining the boundary of an entity segment, and outputting and obtaining the entity boundary information;

the entity classification decoding module is used for taking the entity boundary information obtained by the boundary identification decoding module and the coding vector with the context information obtained by the characteristic coding module as input, masking the coding vector with the context information by adopting the entity boundary information to obtain an alternative entity fragment vector, classifying the characteristics of the alternative entity fragment through entity classification decoding, and outputting to obtain entity classification information;

and the entity prediction module is used for combining the entity classification information obtained by the entity classification decoding module and the entity boundary information obtained by the boundary identification decoding module so as to obtain the nested entity to be extracted.

Preferably, the system further comprises:

and the model training module is used for respectively optimizing the boundary recognition decoding module and the entity classification decoding module.

According to a third aspect of the present invention, there is provided a terminal comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor when executing the program being operable to perform any of the methods described above.

According to a fourth aspect of the invention, a computer-readable storage medium has stored thereon a computer program which, when executed by a processor, is operable to perform the method of any of the above.

Due to the adoption of the technical scheme, compared with the prior art, the invention has at least one of the following beneficial effects:

1. the nested entity identification method and system based on boundary identification provided by the invention flatten the nested structure through boundaries (obtaining a coding vector with context information and then obtaining entity boundary information), which avoids the negative effects of accumulated errors and entity distribution differences caused by layering the nested structure, and gives good identification capability for nested entities at different levels;

2. the nested entity identification method and system based on boundary identification provided by the invention need no regular expressions or other annotation rules to flatten entities, can be used on different data in different fields, and have strong generalization capability;

3. the nested entity identification method and system based on boundary identification provided by the invention bring additional gains, such as avoiding repeated operations on the text and improving identification efficiency.

Drawings

Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:

FIG. 1 is a flowchart of a nested entity identification method based on boundary identification in a preferred embodiment of the present invention;

FIG. 2 is a diagram illustrating a boundary identification decoding process according to a preferred embodiment of the present invention;

FIG. 3 is a diagram illustrating an entity classification decoding process according to a preferred embodiment of the present invention.

Detailed Description

The following examples illustrate the invention in detail: the embodiment is implemented on the premise of the technical scheme of the invention, and a detailed implementation mode and a specific operation process are given. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention.

The existing layer-by-layer identification of nested entities introduces a large amount of accumulated error and computational cost, so its effectiveness is difficult to guarantee and it is impractical. Solving nested entity identification effectively therefore requires first flattening the nested structure. The biggest challenge of this approach, however, is that flattening text data often depends on the structure of the data itself, so a great deal of prior knowledge is needed and generalization is difficult.

An embodiment of the invention provides a nested entity identification method based on boundary identification.

The nested entity identification method based on boundary identification provided by the embodiment comprises the following steps:

step 1, performing data preprocessing on an input text, and converting the preprocessed text data into a multi-dimensional vector;

step 2, performing feature coding on the obtained multidimensional vector to obtain a coded vector with context information;

step 3, extracting entity boundary information from the coded vector with the context information, then decoding the extracted entity boundary information, identifying the boundary of the obtained entity segment, and obtaining entity boundary information;

step 4, masking the coding vector with the context information by adopting the entity boundary information obtained by identification to obtain an alternative entity fragment vector, and classifying the characteristics of the alternative entity fragment by entity classification decoding to obtain entity classification information;

and 5, combining the obtained entity classification information and the entity boundary information to further obtain the nested entity to be extracted.

In this embodiment, extraction proceeds as follows: the left boundaries are extracted first to obtain a list; for each left boundary in that list, the corresponding right boundaries are then extracted, giving one list of right boundaries per left boundary; each left boundary is paired with each right boundary in its list to form a boundary pair, which represents one candidate nested entity; finally, each boundary pair is combined with its entity class into a triple that represents a nested entity.
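The combination of boundary lists and entity classes into triples can be sketched in a few lines. This is an illustrative sketch only: `build_entities`, the example positions and the lookup-table classifier are assumptions standing in for the trained decoders.

```python
def build_entities(left_boundaries, right_boundary_lists, classify):
    """Pair each left boundary with its right boundaries and attach a class label."""
    triples = []
    for left, rights in zip(left_boundaries, right_boundary_lists):
        for right in rights:
            label = classify(left, right)  # None means "not an entity"
            if label is not None:
                triples.append((left, right, label))
    return triples

# Hypothetical decoder output: two left boundaries, one right-boundary list each.
lefts = [1, 3]
rights = [[2, 6], [4]]
# Stand-in classifier: a lookup table instead of a trained model.
labels = {(1, 2): "GPE", (1, 6): "ORG"}
entities = build_entities(lefts, rights, lambda l, r: labels.get((l, r)))
```

The pair (3, 4) is dropped because the stand-in classifier returns no label for it, which mirrors how a candidate boundary pair can be rejected at the classification stage.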

As a preferred embodiment, in step 1, the data preprocessing is performed on the input text, and includes: text preprocessing and vector embedding; wherein:

text preprocessing, namely capturing the internal information of an input text, including word segmentation, part-of-speech tagging, grammar parsing and semantic parsing, and obtaining a text segment taking a word as a unit and a grammar dependency tree and a semantic parsing tree corresponding to the text segment;

vector embedding, namely embedding vocabulary, characters, parts of speech, semantics and grammar on the basis of text preprocessing; wherein:

vocabulary embedding is vectorized through a pre-trained language model, specifically: calling a pre-trained Chinese language model, encoding each word through the interface provided by the model to serve as the model input, and finally obtaining a vocabulary vector through BERT computation;

character embedding is realized in a convolutional neural network learning embedding mode, specifically, a character embedding table is initialized randomly, each character is coded, an initial vector is obtained through the embedding table, the vector is convolved through a convolutional neural network, and a character level vector is obtained by adopting a maximum pooling method;

the part-of-speech embedding is obtained by randomly initializing vectors and training, specifically, randomly initializing a part-of-speech embedding table, encoding each type of part-of-speech, and obtaining part-of-speech vectors through the embedding table;

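As a preferred embodiment, step 2 comprises:

performing linear transformation and nonlinear transformation on the obtained multidimensional vector with a bidirectional long short-term memory (BiLSTM) network, so that the resulting coding vector contains context information, namely the coding vector with the context information.

As a preferred embodiment, step 3 comprises:

identifying, with a two-level pointer network and in a mesh-like manner, the left boundary groups and right boundary sequences from the coding vector with context information, and then decoding them into the corresponding entity boundaries.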

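As a preferred embodiment, the two-level pointer network comprises a group sequence pointer network for identifying the left boundary group and an entity sequence pointer network for identifying the right boundary sequence; wherein:

for the group sequence pointer network, the input is the coding vector e with context information and the left boundary vector o obtained at the previous moment, and an attention operation on e with o gives the unnormalized left-boundary positioning probability u_{j,i} at time j, where v and W are trainable parameters, the subscript l denotes the left boundary, and the superscript T is the vector transposition symbol; the left boundary vector selected at time j is:

o_j = argmax_i(u_{j,i});

for the entity sequence pointer network, the input is the coding vector, the right boundary vector obtained at the previous moment, and the left boundary vector of the corresponding group; the left boundary vector and the corresponding right boundary vector are spliced, and the attention operation on the coding vector gives the unnormalized right-boundary positioning probability u_{j,k,i}, where the subscripts r and k denote the right boundary and the corresponding k-th left boundary, respectively, and the superscript T is the vector transposition symbol;

the right boundary vector finally obtained is o_{j,k} = argmax_i(u_{j,k,i}).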

As a preferred embodiment, step 4 comprises:

masking the coding vector with the context information using the identified entity boundary information to obtain a candidate entity fragment vector, learning the features of the candidate entity fragment (the feature is a vector, namely the candidate entity fragment vector) through a convolutional neural network, and classifying the obtained features to obtain the entity category, namely the entity classification information (entity identification comprises two tasks: first determining which part of the text is an entity, and then determining the category of the entity, such as person, place or organization).

As a preferred embodiment, the method provided in this embodiment further includes the following steps:

and optimizing the entity boundary information extraction process and the entity classification information extraction process.

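As a preferred embodiment, the optimization method comprises the following steps:

alternately training the entity boundary information extraction process and the entity classification information extraction process with a cross-entropy loss function in a recall-priority manner, thereby optimizing the extraction processes.

As a preferred embodiment, in the process of optimizing the entity classification information extraction process, a null value class and negative samples are also added; wherein:

the null value class is used for secondary entity screening, which improves accuracy;

the negative samples are used to ensure that the representation of the null value class can be learned;

the negative samples are generated by the entity boundary information extraction process.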

The nested entity identification method based on boundary identification provided by the embodiment mainly comprises the following steps: data preprocessing, text feature encoding, boundary recognition decoding, entity classification decoding, process optimization (training), and entity prediction.

In some embodiments of the invention:

Text preprocessing methods such as word segmentation, part-of-speech tagging, grammar parsing and semantic parsing comprehensively capture the internal information of the input text; a pre-trained language model provides a semantically rich distributed representation; together these yield a multi-dimensional vector.

The distributed representation is encoded with a bidirectional long short-term memory network so that the encoded representation contains context information. This encoded representation serves as the input to the subsequent boundary identification decoding and entity classification decoding.

A boundary identification decoding model is constructed with a two-level pointer network, which identifies the left boundary groups and right boundary sequences in a mesh-like manner and then decodes them into the corresponding entity boundaries.

The vectors obtained from feature coding are masked with the candidate boundaries produced by entity boundary decoding, and the candidate entities are classified through a convolutional neural network; this process is called entity classification decoding.

The boundary identification decoding process and the entity classification decoding process are trained alternately in a recall-priority manner to optimize the processes.

The process parameters (model parameters) obtained from training are reused, the boundary identification decoding process (boundary identification decoding model) and the entity classification decoding process (entity classification decoding model) are connected in cascade, and the nested entities in the text under test are extracted.

In the nested entity identification method based on boundary identification provided by this embodiment, entities are first grouped by their left boundary, and each group is then characterized by the right boundaries of its entities. This yields a boundary-based, two-layer, partially flattened structure, which accomplishes the flattening of the nested structure.

Data preprocessing: this comprises two steps, text processing and vector embedding, and vectorizes the text data. Basic natural language processing methods are first applied to segment the text into words, tag it, parse its grammar and so on; the resulting features are then combined through different embedding methods to obtain a distributed text vector.

Feature coding: building on the distributed text vector, feature coding further encodes the text and captures its context information through a recurrent network, yielding a coding vector that contains context information. The coding vector is the input to both decoding processes.

Boundary identification decoding: this is the core of the whole method. On the one hand, boundary identification decoding must capture the positioning information of entities from the coding vector; on the other hand, it must flatten the nested structure according to a certain strategy. Finally, the boundary identification decoding process decodes the candidate boundaries of the entities.

Entity classification decoding: a classifier is constructed to classify the candidate boundaries produced by boundary identification, determining whether each candidate is a real entity and, if so, its entity class.

Process optimization (training): the parameters of the feature coding, boundary identification decoding and entity classification decoding processes are optimized with the Adam optimizer used in deep learning. During optimization, a recall-priority approach is adopted to reduce the error that accumulates when the processes are cascaded.

Entity prediction: the feature coding, boundary identification decoding and entity classification decoding processes are cascaded directly, and the trained process parameters are loaded to extract the nested entities from the text under test.

The technical solution provided by the present embodiment is further described in detail below with reference to the accompanying drawings.

Fig. 1 is a schematic workflow diagram of a nested entity identification method based on boundary identification. The method mainly comprises six processes which are respectively as follows: data preprocessing, feature coding, boundary identification decoding, entity classification decoding, process training and entity prediction.

The data preprocessing comprises two sub-processes: text preprocessing and vector embedding. First, text preprocessing comprises word segmentation, part-of-speech tagging, grammar parsing and semantic parsing. Through these steps, text preprocessing outputs a text fragment segmented into words. Taking "a cameraman has funeral when a us tank attacks a basistein hotel" as an example, text preprocessing outputs the part-of-speech-tagged fragment "a m/cameraman n/in p/us ns/tank n/attack v/a m/basistein/hotel n/time n/funeral v". In addition, text preprocessing outputs the grammar dependency tree and the semantic parse tree corresponding to the text. Second, vector embedding covers the embedding of words, characters, parts of speech, semantics and grammar. Words are vectorized through a pre-trained language model; character-level information is learned and embedded by a convolutional neural network; part-of-speech embeddings are obtained by randomly initializing vectors and training them; and semantic and grammar embeddings are obtained by convolving the semantic parse tree and the grammar dependency tree with a graph convolutional network, yielding the corresponding semantic vectors and grammar vectors.

Feature encoding further encodes the preprocessed text vectors and provides shared context information for both decoding processes.

As shown on the left side of fig. 2, the feature encoding process uses a bidirectional recurrent neural network (specifically, a bidirectional long short-term memory model) to encode the text word by word, obtaining the coding vector e.

The right side of fig. 2 shows a schematic diagram of the boundary identification decoding process. As the figure shows, boundary identification decoding uses two pointer networks to compute, in a mesh-like manner, the group sequence based on left boundaries and the intra-group entity sequence based on right boundaries. First, the group sequence pointer network takes as input the coding vector e and the left boundary vector o obtained at the previous moment, and obtains the unnormalized positioning probability by performing an attention operation on e with o. Thus, for time j, the left-boundary positioning probability u_{j,i} can be computed by formula (1), in which v and W are trainable parameters.

The left boundary selected at time j is then o_j = argmax_i(u_{j,i}) (2). Similarly, the entity right boundary sequence is computed from the coding vector, the right boundary vector obtained at the previous moment, and the left boundary vector of the group to which it belongs. Compared with formula (1), the entity sequence pointer network performs the attention operation on the coding vector after splicing the left boundary vector with the corresponding right boundary vector, which gives u_{j,k,i} by formula (3).

The right boundary finally obtained is o_{j,k} = argmax_i(u_{j,k,i}) (4).
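The two pointer steps can be illustrated with a small numerical sketch. It assumes the additive attention form commonly used in pointer networks (score = v · tanh(W [e_i ; query])), which the text does not confirm; the function names, the toy encodings and all weights are hypothetical, and a real decoder would also feed the query through a recurrent state.

```python
import math

def attention_scores(encodings, query, W, v):
    """Unnormalised pointer scores u_i = v . tanh(W [e_i ; query]) for every position i."""
    scores = []
    for e in encodings:
        x = e + query  # list concatenation: [e_i ; query]
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    return scores

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

# Toy 2-dimensional encodings for three positions (hypothetical numbers).
encodings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
start = [0.0, 0.0]                       # initial "previous boundary" vector
W_l = [[1.0, 0.0, 0.0, 0.0]]             # one hidden unit reading the first coordinate of e_i
v_l = [1.0]

u_left = attention_scores(encodings, start, W_l, v_l)
left = argmax(u_left)                    # position chosen as left boundary
o_left = encodings[left]

# Right boundary: the query splices the left-boundary vector with the previous right-boundary vector.
W_r = [[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]]   # reads the second coordinate of e_i
v_r = [1.0]
u_right = attention_scores(encodings, o_left + start, W_r, v_r)
right = argmax(u_right)                  # position chosen as right boundary
```

With these toy weights the left pointer selects position 0 and the right pointer selects position 1 (0-based), giving one boundary pair for the group.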

Fig. 3 shows a schematic diagram of the entity classification decoding process. The inputs are the text information x and the candidate boundaries y produced by the boundary identification decoding process. The text information is converted into the coding vector e through data preprocessing and feature coding; e is masked with the boundary y to obtain the vectors of the segment associated with the candidate entity; a convolutional neural network then learns the segment features, which are classified to obtain the category of the entity.
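The masking step can be sketched as follows. Positions are 0-based and inclusive here, and element-wise max pooling stands in for the convolutional feature learner described in the text; both choices, along with the names `mask_span` and `max_pool`, are assumptions made for illustration.

```python
def mask_span(encodings, left, right):
    """Zero out every position outside the inclusive span [left, right]."""
    return [vec if left <= i <= right else [0.0] * len(vec)
            for i, vec in enumerate(encodings)]

def max_pool(vectors):
    """Element-wise maximum over positions, giving one fixed-size feature vector."""
    return [max(column) for column in zip(*vectors)]

# Hypothetical coding vectors for a four-token text.
encodings = [[0.2, 0.9], [0.7, 0.1], [0.4, 0.3], [0.8, 0.6]]
masked = mask_span(encodings, 1, 2)   # candidate entity covers positions 1..2
features = max_pool(masked)           # fixed-size input for a classifier
```

Because masked positions are set to zero, pooling over them is only harmless when the retained activations are non-negative; a trained model would typically handle this with a learned convolution or by pooling over the span alone.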

Process training provides a scheme for training the entire process. Loss functions are defined separately for the boundary identification part and the entity classification part; the method uses the cross-entropy loss function for learning, and optimization is performed by stochastic gradient descent. Because boundary identification and entity classification are serial (the output of process 1 is the input of process 2), the whole nested entity identification process based on boundary identification is trained jointly by alternately training the two processes. In addition, to ensure training accuracy, the entity classification decoding process requires two additional operations in the training phase: 1) a null class is added to the classification model for secondary entity screening, which improves accuracy; 2) 10% negative examples are added artificially, which can be generated by the boundary identification decoding process, so that the representation of the null class can be learned.
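The negative-example step can be sketched as below. The description does not say what the 10% is measured against; this sketch assumes it is 10% of the number of labeled entities, and the label name `NULL` and the function name are hypothetical.

```python
import random

NULL_CLASS = "NULL"  # hypothetical name for the null class

def build_classifier_samples(gold, candidates, ratio=0.10, seed=0):
    """Combine labeled entities with a small share of null-class negatives.

    gold:       dict mapping (left, right) spans to entity labels
    candidates: spans proposed by the boundary identification decoder
    """
    rng = random.Random(seed)
    # A candidate span that matches no labeled entity is a negative example.
    negatives = [span for span in candidates if span not in gold]
    n_neg = min(len(negatives), round(ratio * len(gold)))
    samples = [(span, label) for span, label in gold.items()]
    samples += [(span, NULL_CLASS) for span in rng.sample(negatives, n_neg)]
    return samples

# Ten labeled spans plus three spurious candidates (toy data).
gold = {(i, i + 1): "LOC" for i in range(10)}
candidates = list(gold) + [(20, 25), (30, 31), (40, 44)]
samples = build_classifier_samples(gold, candidates)
```

Drawing the negatives from the decoder's own wrong candidates means the classifier sees the kind of spurious spans it must reject at prediction time.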

Entity prediction identifies the nested entities in unlabeled text and outputs the identified nested entity segments together with their classification information.

The method provided by the above embodiment of the present invention is further described in detail below with reference to a specific application example.

Taking "上海交通大学" (Shanghai Jiao Tong University) as an example, the fragment contains two entities: the geographic entity "上海" (Shanghai) and the organizational entity "上海交通大学". During identification, position 1 ("上") is first identified as the left boundary; position 2 ("海") and position 6 ("学") are then identified as right boundaries, which finally yields the two entities. This flattening relies only on the inherent property of the nested structure and requires no prior knowledge extracted from the data, so the method is applicable to different data sets in different fields, which ensures its generalization capability.
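The flattening in this example can be sketched in a few lines. The sketch assumes the six-character string 上海交通大学 and 1-based inclusive positions, matching the positions 1, 2 and 6 named above; the function names are hypothetical.

```python
from collections import defaultdict

def group_by_left(spans):
    """Flatten nested spans into (left boundary, sorted right boundaries) groups."""
    groups = defaultdict(list)
    for left, right in spans:
        groups[left].append(right)
    return [(left, sorted(rights)) for left, rights in sorted(groups.items())]

def decode_groups(text, groups):
    """Recover entity strings from the two-layer structure (1-based, inclusive)."""
    return [text[left - 1:right] for left, rights in groups for right in rights]

text = "上海交通大学"
groups = group_by_left([(1, 6), (1, 2)])   # the two nested entities as spans
entities = decode_groups(text, groups)
```

Both entities share left boundary 1, so they fall into a single group whose right boundary sequence is 2 followed by 6.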

Based on the above analysis, the nested entity identification method based on boundary identification flattens the nested structure through boundary identification and thereby solves the following technical problems:

1) flattening the nested structure by means of entity boundaries;

2) encoding the text data;

3) constructing a decoder based on boundary identification;

4) training the boundary identification decoding model and the entity classification decoding model.

The results are evaluated with the F1 score; compared with the prior art, the result is improved by 1.3 percentage points.
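For reference, an F1 computation over entity spans can be sketched as follows. The matching criterion is not stated in the description; this sketch assumes an entity counts as correct only when its left boundary, right boundary and category all match exactly.

```python
def span_f1(gold, predicted):
    """F1 over sets of (left, right, label) triples with exact matching."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# One of two predictions matches one of two labeled entities (toy data).
gold = [(1, 2, "LOC"), (1, 6, "ORG")]
predicted = [(1, 2, "LOC"), (3, 6, "ORG")]
score = span_f1(gold, predicted)
```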

Another embodiment of the present invention provides a nested entity recognition system based on boundary recognition, including:

As a preferred embodiment, the system provided in this embodiment further includes:

A third embodiment of the present invention provides a terminal comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor being operable to execute the method according to any one of the above embodiments of the present invention when executing the program.

Optionally, the memory is used for storing a program. The memory may include a volatile memory, for example a random access memory (RAM) such as a static random access memory (SRAM) or a double data rate synchronous dynamic random access memory (DDR SDRAM); the memory may also include a non-volatile memory, such as a flash memory. The memory is used to store computer programs (e.g., applications or functional modules that implement the above method), computer instructions, and the like, which may be stored in a partitioned manner in one or more memories. The computer programs, computer instructions, data, and the like may be invoked by the processor.

The processor is configured to execute the computer program stored in the memory to implement the steps of the method according to the above embodiments. For details, reference may be made to the description of the preceding method embodiments.

The processor and the memory may be separate structures or may be integrated into a single structure. When the processor and the memory are separate structures, they may be coupled by a bus.

A fourth embodiment of the invention provides a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the method of any of the above-mentioned embodiments of the invention.

The method and system for identifying nested entities based on boundary identification provided in the embodiments of the present invention perform two-layer extraction pattern matching based on entity boundaries: entities are grouped according to the left boundary, and the entity sequence in each group is matched according to the right boundary. The text is encoded by a recurrent neural network. Taking the left boundary generated in the previous step as input, the group sequence pointer network iteratively generates the entity group sequence. Taking the left boundary of each group combined with the right boundary generated in the previous step as input, the entity sequence pointer network iteratively generates the entity sequence within the group. The two-layer structure is decoded to obtain candidate entities, and the entities are classified by a convolutional neural network. Without introducing external knowledge, the method and system effectively flatten the nested structure in the nested information through this simple two-layer structure, improve the ability to resolve deeply nested structures while keeping the extraction of shallow information accurate, and ensure the accuracy of extracting deeply nested information. By flattening the nested structure and applying the two-layer boundary identification method, nested entity identification is achieved with generalization capability while identification accuracy is ensured.

It should be noted that, the steps in the method provided by the present invention may be implemented by using corresponding modules, devices, units, and the like in the system, and those skilled in the art may implement the composition of the system by referring to the technical solution of the method, that is, the embodiment in the method may be understood as a preferred example for constructing the system, and will not be described herein again.

The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention.

Claims

1. A nested entity identification method based on boundary identification is characterized by comprising the following steps:

2. The boundary recognition-based nested entity recognition method of claim 1, wherein the data preprocessing of the input text comprises: text preprocessing and vector embedding; wherein:

3. The boundary identification-based nested entity identification method of claim 1, wherein the feature coding the obtained multidimensional vector to obtain a coded vector with context information comprises:

4. The method for identifying nested entities based on boundary identification according to claim 1, wherein the extracting entity boundary related information from the coded vector with context information, then decoding the extracted entity boundary related information, and identifying the boundary of the obtained entity segment to obtain entity boundary information comprises:

5. The boundary identification-based nested entity identification method of claim 4, wherein the two-level pointer network comprises a group sequence pointer network for identifying a left boundary group and an entity sequence pointer network for identifying a right boundary sequence; wherein:

wherein $u_{j,i}$ is the unnormalized positioning probability of the left boundary, v and W are trainable parameters, the subscript l denotes the left boundary, and the superscript T is the vector transposition symbol;

$o_j = \arg\max_i u_{j,i}$;

wherein $u_{j,k,i}$ is the unnormalized positioning probability of the right boundary, the subscripts r and k denote the right boundary and the corresponding k-th left boundary respectively, and the superscript T is the vector transposition symbol;

the right boundary vector finally obtained is $o_{j,k} = \arg\max_i u_{j,k,i}$.

6. The method of claim 1, wherein the masking a coding vector with context information by using entity boundary information obtained by identification to obtain a candidate entity fragment vector, and classifying features of the candidate entity fragment by entity classification decoding to obtain entity classification information comprises:

7. The nested entity identification method based on boundary identification according to any one of claims 1 to 6, characterized by further comprising: optimizing the entity boundary information extraction process and the entity classification information extraction process.

8. The method for identifying nested entities based on boundary identification according to claim 7, wherein the optimizing the entity boundary information extraction process and the entity classification information extraction process comprises:

alternately training an entity boundary information extraction process and an entity classification information extraction process by adopting a cross entropy loss function in a recall rate priority mode to realize the optimization of the extraction process; wherein:

in the process of optimizing the entity classification information extraction process, adding a null value class and a negative sample; wherein:

9. A nested entity recognition system based on boundary recognition, comprising:

10. The boundary recognition-based nested entity recognition system of claim 9, further comprising:

11. A terminal comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, is operative to perform the method of any of claims 1-8.

12. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, is adapted to carry out the method of any one of claims 1 to 8.