CN113221576B

CN113221576B - Named entity identification method based on sequence-to-sequence architecture

Info

Publication number: CN113221576B
Application number: CN202110608812.4A
Authority: CN
Inventors: 邱锡鹏; 颜航
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2021-06-01
Filing date: 2021-06-01
Publication date: 2023-01-13
Anticipated expiration: 2041-06-01
Also published as: CN113221576A

Abstract

The invention relates to the technical field of recognition, and provides a named entity recognition method based on a sequence-to-sequence architecture.

Description

Named entity identification method based on sequence-to-sequence architecture

Technical Field

The invention relates to the technical field of identification, in particular to a named entity identification method based on a sequence-to-sequence architecture.

Background

The named entity recognition task is a task of capturing a specific type of text segment from given text, such as extracting characters, places, symptoms and the like in the text. For example, for a sentence, "zhang san will be in a job in 2021," two tuples (zhang san, person), (2021, time) need to be extracted, the first element of the tuple represents the content in the sentence, and the second element of the tuple represents what type of named entity the content is.

Named entity recognition is one of basic technologies of information extraction technology, and is widely applied to a question and answer system, a dialogue system, a translation system, and the like in natural language processing. In the most common named entity task, there is no intersection between different entities, and the same entity must be a contiguous piece of text. However, in some specific application scenarios, there may be a nested relationship between entities, for example, the phrase "souvenir hall" includes at least the following entities: (person, lugnu), (commemorative hall, venue), there is a nested relationship between the two entities. Furthermore, named entity recognition in the medical field may also be the case where there are non-continuous entities, for example, in entity recognition where patient symptoms are extracted, both symptoms (muscle pain, symptoms) and (muscle soreness, symptoms) need to be extracted from "patient muscle pain and soreness", where "muscle soreness" is not a continuous text segment in the original sentence.

At present, common named entity recognition is generally solved by a sequence labeling mode, but for nested named entity recognition and discontinuous named entity recognition, a complicated specification needs to be designed by adopting the sequence labeling mode. Moreover, the method for identifying the named entities through sequence marking is very limited, different types of named entity identification must be processed by adopting different model structures, and the application range is narrow.

Disclosure of Invention

The present invention is made to solve the above problems, and an object of the present invention is to provide a named entity recognition method based on a sequence-to-sequence architecture.

The invention provides a named entity identification method based on a sequence-to-sequence architecture, which has the characteristics that the method comprises the following steps: s1, constructing a named entity recognition model; s2, training the named entity recognition model through a preset sample, wherein an entity sequence of the preset sample is obtained according to a preset sequencing rule; s3, inputting the text to be detected into a named entity recognition model to obtain a recognition result sequence; and S4, decoding the recognition result sequence output by the named entity recognition model to obtain a plurality of named entities and text labels corresponding to the named entities, wherein the named entity recognition model comprises an encoder and a decoder, the output of the decoder is named entity positions and the text labels, in the training process, the decoder outputs the named entity positions and the output labels as sample labels according to a preset sample, the corresponding named entities are obtained from the preset sample according to the named entity positions to serve as sample entities, the decoder is trained according to the sample entities and the sample labels, and the named entity sequence is composed of the named entity positions and the text labels output by the named entity recognition model according to the text to be tested.

The named entity identification method based on the sequence-to-sequence architecture provided by the invention can also have the following characteristics: the input of the coder is a text to be recognized, and the output of the coder is a high-dimensional vector of words.

In the named entity recognition method based on the sequence-to-sequence architecture provided by the invention, the method can also have the following characteristics: wherein the input of the decoder is the output of the encoder and the output of the decoder is the named entity sequence.

In the named entity recognition method based on the sequence-to-sequence architecture provided by the invention, the method can also have the following characteristics: in the named entity sequence, the named entity position is used for indicating the position of the named entity in the text to be identified, and the text label is the category corresponding to the named entity.

In the named entity recognition method based on the sequence-to-sequence architecture provided by the invention, the method can also have the following characteristics: wherein the predetermined ordering rule is: and sequencing the named entities according to the starting positions of the named entities in sequence, and sequencing the named entities with the same starting positions according to the entity lengths corresponding to the named entities.

In the named entity recognition method based on the sequence-to-sequence architecture provided by the invention, the method can also have the following characteristics: the named entity position is a pointer pointing to the sequence number of the character in the text.

Action and Effect of the invention

According to the named entity recognition method based on the sequence-to-sequence architecture, the named entity recognition model of the component comprises an encoder and a decoder, the output of the decoder is named entity position and text labels, after the named entity recognition model is trained through a preset sample, a text to be detected is input into the named entity recognition model to obtain a recognition result sequence, and the recognition result sequence output by the named entity recognition model is decoded to obtain a plurality of named entities and text labels corresponding to the named entities.

In addition, in the training process, the decoder outputs the named entity position and the output label as a sample label according to the preset sample, acquires the corresponding named entity from the preset sample as a sample entity according to the named entity position, and trains the decoder according to the sample entity and the sample label, so that the training effect is prevented from being influenced by inputting the named entity position which does not contain semantic information.

Drawings

FIG. 1 is a flow diagram of a named entity recognition method based on a sequence-to-sequence architecture in an embodiment of the present invention;

FIG. 2 is a diagram of a named entity recognition model in an embodiment of the invention.

Detailed Description

In order to make the technical means, creation features, achievement purposes and effects of the invention easy to understand, the following embodiments specifically describe the named entity recognition method based on sequence-to-sequence architecture in combination with the drawings.

< example >

This embodiment details the named entity recognition method based on the sequence-to-sequence architecture.

Fig. 1 is a flowchart of a named entity identification method based on a sequence-to-sequence architecture in this embodiment.

As shown in fig. 1, the named entity identification method based on sequence-to-sequence architecture includes the following steps:

and S1, constructing a named entity recognition model.

The named entity recognition model comprises an encoder and a decoder, wherein the input of the encoder is a text to be recognized, and the output of the encoder is a high-dimensional vector of words. The input of the decoder is the output of the encoder and the output of the decoder is the named entity sequence.

And S2, training the named entity recognition model through a preset sample, wherein an entity sequence of the preset sample is obtained according to a preset sequencing rule.

In the training process, the decoder outputs the position of the named entity and the output label as a sample label according to a preset sample, acquires the corresponding named entity from the preset sample as a sample entity according to the position of the named entity, trains the decoder according to the sample entity and the sample label,

and S3, inputting the text to be detected into the named entity recognition model to obtain a recognition result sequence.

And S4, decoding the recognition result sequence output by the named entity recognition model to obtain a plurality of named entities and text labels corresponding to the named entities.

In the named entity sequence, the named entity position is used for indicating the position of the named entity in the text to be identified, and the text label is the category corresponding to the named entity. The named entity location is a pointer to the sequence number of the character in the text.

The predetermined ordering rule is: and sequencing the named entities according to the starting positions of the named entities in sequence, and sequencing the named entities with the same starting positions according to the entity lengths corresponding to the named entities.

FIG. 2 is a diagram of a named entity recognition model in this embodiment.

As shown in fig. 2, the named entity recognition model includes an encoder and a decoder.

The conversion mode after the text to be tested is input into the encoder is as follows:

and when the named entities in the text to be tested are conventional named entities, sequentially arranging the named entities according to the appearance sequence of the named entities in the text. For example, for text [ x ] to be tested ₁ ,x ₂ ,x ₃ ,x ₄ ,x ₅ ,x ₆ ,x ₇ ]Suppose [ x ] therein ₁ ,x ₂ ],[x ₅ ,x ₆ ]Are respectively entity class e ₁ And e ₂ Then the named entity sequence in the text to be tested is represented as [1,2, e ] ₁ ,5,6,e ₂ ]In this embodiment, the entity is expressed by using the named entity position in the text to be tested instead of using the text segment, so as to avoid ambiguity caused by the occurrence of the same text segment in the text. For example, the entity sequence corresponding to "Zhang Sansheng in Hunan" is [1, people, 4, place]。

When the named entities in the text to be tested are nested named entities, the conversion mode of the named entity sequence is that the named entities which start first are ranked in front, and the named entities which start at the same position are ranked in front with shorter length. For example if the sentence [ x ] ₁ ,x ₂ ,x ₃ ,x ₄ ,x ₅ ,x ₆ ,x ₇ ]In [ x ] ₁ ,x ₂ ]，[x ₁ ,x ₂ ,x ₃ ]And [ x ] ₅ ,x ₆ ]As entity class e ₁ ,e ₂ And e ₃ Then the corresponding entity sequence is [1,2, e ] ₁ ,1,2,3,e ₂ ,5,6,e ₃ ]And expressing the named entity by using the named entity position in the text to be tested.

When the named entities in the text to be tested are discontinuous named entities, the conversion rule of the named entity sequence is that the entity sequence started first is arranged in front, the entities started at the same position are arranged according to the entity length, and the shorter entity is arranged in front. For example if the sentence [ x ] ₁ ,x ₂ ,x ₃ ,x ₄ ,x ₅ ,x ₆ ,x ₇ ]In [ x ] ₁ ,x ₃ ]，[x ₁ ,x ₂ ,x ₃ ,x ₅ ]And [ x ] ₅ ,x ₆ ]As entity class e ₁ ,e ₂ And e ₃ Then the corresponding entity sequence is [1,3, e ] ₁ ,1,2,3,5,e ₂ ,5,6,e ₃ ]And expressing the named entity by using the named entity position in the text to be tested. Wherein x is ₁ -x _n For the identified entity, n > 1

The calculation process of the encoder is as follows:

H ^e ＝Encoder([x ₁ ，...，x _n ])，

wherein H ^e Is a latent vector for each word after encoding.

The calculation process of the decoder is as follows:

E ^e ＝TokenEmbed(X)，

C ^d ＝TokenEmbed(C)，

wherein the content of the first and second substances,

is the content that has been generated by the encoder, alpha is a hyper-parameter quantity, C is a collection of entity classes,

is a dot product, P _t Is the distribution of the output words at the current moment, X is the input word,

is the decoder hidden state at time t, E ^e Is an input word embedding vector, C ^d Embedded vectors for classes.

In this embodiment, at the time of decoding, P is passed _t The output is the pointer number or the category number, and when they are input at the next moment of the decoder, they need to be converted into corresponding words. Since the target generation sequence may generate the word sequence number of the input text, but the sequence number itself does not contain semantic information, the word sequence number cannot be directly transmitted as an input to the decoder in the process of performing autoregressive generation, and needs to be restored to a specific word in the input through mapping.

Effects and effects of the embodiments

According to the named entity recognition method based on the sequence-to-sequence architecture, the named entity recognition model of the component comprises the encoder and the decoder, the output of the decoder is the named entity position and the text label, after the named entity recognition model is trained through the preset samples, the text to be detected is input into the named entity recognition model to obtain the recognition result, and then the recognition result is sequenced according to the preset entity sequencing rule to obtain the named entity sequence.

In addition, in the training process, the decoder outputs the position of the named entity and the output label as a sample label according to the preset sample, acquires the corresponding named entity from the preset sample as a sample entity according to the position of the named entity, and trains the decoder according to the sample entity and the sample label, so that the influence on the training effect caused by the input of the position of the named entity without semantic information is avoided.

The above embodiments are preferred examples of the present invention, and are not intended to limit the scope of the present invention.

Claims

1. A named entity identification method based on sequence-to-sequence architecture is characterized by comprising the following steps:

s1, constructing a named entity recognition model;

s2, training the named entity recognition model through a preset sample, wherein an entity sequence of the preset sample is obtained according to a preset sequencing rule;

s3, inputting the text to be detected into the named entity recognition model to obtain a recognition result sequence;

s4, decoding the recognition result sequence output by the named entity recognition model to obtain a plurality of named entities and text labels corresponding to the named entities,

wherein the named entity recognition model comprises an encoder and a decoder,

the output of the decoder is a named entity location and a text label,

in the training process, the decoder outputs a named entity position and an output label as a sample label according to the preset sample, acquires a corresponding named entity from the preset sample as a sample entity according to the named entity position, and trains the decoder according to the sample entity and the sample label,

the named entity sequence consists of the named entity position and the text label which are output by the named entity recognition model according to the text to be tested,

the named entity position is a pointer pointing to a sequence number of a character in the text to be tested, in the named entity sequence, the named entity position is used for indicating the position of the named entity in the text to be tested, the text label is a category corresponding to the named entity,

the predetermined ordering rule is:

sequencing the named entities according to the starting positions in sequence according to the positions of the named entities, sequencing the named entities with the same starting positions according to the entity lengths corresponding to the named entities,

when the named entities are conventional named entities, the named entities are sequentially arranged according to the appearance sequence of the named entities in the text;

when the named entities are nested named entities, the conversion mode of the named entity sequence is that the named entities which start first are ranked earlier, and the named entities which start at the same position are ranked earlier with shorter length, wherein for the text [ x ] to be tested ₁ ,x ₂ ,x ₃ ,x ₄ ,x ₅ ,x ₆ ,x ₇ ]Wherein [ x ] ₁ ,x ₂ ]、[x ₁ ,x ₂ ,x ₃ ]And [ x ] ₅ ,x ₆ ]Respectively entity class e ₁ 、e ₂ And e ₃ Then the named entity sequence in the text to be tested is represented as [1,2, e ] ₁ ,1,2,3,e ₂ ,5,6,e ₃ ]；

When the named entities are discontinuous named entities, the conversion rule of the named entity sequence is that the named entities which start first are ranked in front, the named entities which start at the same position are ranked according to the entity length, the shorter named entities are ranked in front, wherein for the text [ x ] to be tested ₁ ,x ₂ ,x ₃ ,x ₄ ,x ₅ ,x ₆ ,x ₇ ]Wherein [ x ] ₁ ,x ₃ ]、[x ₁ ,x ₂ ,x ₃ ,x ₅ ]And [ x ] ₅ ,x ₆ ]Respectively entity class e ₁ 、e ₂ And e ₃ Then the named entity sequence in the text to be tested is represented as [1,3, e ] ₁ ,1,2,3,5,e ₂ ,5,6,e ₃ ]，

The calculation process of the encoder is as follows:

H ^e ＝Encoder([x ₁ ,...,x _n ])

in the formula, H ^e Is a latent vector for each word after encoding,

the calculation process of the decoder is as follows:

E ^e ＝TokenEmbed(X)，

c ^d ＝TokenEmbed(C)，

in the formula (I), the compound is shown in the specification,

is the content that has been generated by the encoder, alpha is the hyper-parameter quantity, C is the set of entity classes,

is a dot product, P _t Is the distribution of the output words at the current time.

2. The named entity recognition method based on sequence-to-sequence architecture as claimed in claim 1, wherein:

the input of the encoder is a text to be detected, and the output of the encoder is a high-dimensional vector of words.

3. The named entity recognition method based on sequence-to-sequence architecture as claimed in claim 1, wherein:

wherein the input of the decoder is the output of the encoder and the output of the decoder is the named entity sequence.