CN114462391A - Nested entity identification method and system based on contrastive learning - Google Patents

Nested entity identification method and system based on contrastive learning

Info

Publication number
CN114462391A
CN114462391A (Application No. CN202210247571.XA)
Authority
CN
China
Prior art keywords
entity
statement
data table
sentence
representation
Prior art date
2022-03-14
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210247571.XA
Other languages
Chinese (zh)
Other versions
CN114462391B (en)
Inventor
胡碧峰
王艳飞
胡茂海
尹光荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Workway Shenzhen Information Technology Co ltd
Original Assignee
Workway Shenzhen Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
2022-03-14
Publication date
2022-05-10
Application filed by Workway Shenzhen Information Technology Co ltd filed Critical Workway Shenzhen Information Technology Co ltd
Priority to CN202210247571.XA priority Critical patent/CN114462391B/en
Publication of CN114462391A publication Critical patent/CN114462391A/en
Application granted granted Critical
Publication of CN114462391B publication Critical patent/CN114462391B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method and a system for classifying nested entities based on contrastive learning. A target nested entity classification model for classifying nested entities is obtained in two stages: in the first stage, the representations of entities are learned by contrastive learning; in the second stage, a fragment (span classification) method is adopted.

Description

Nested entity identification method and system based on contrastive learning
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a nested entity identification method and system based on contrastive learning.
Background
Current nested entity recognition technology mainly uses two methods. The first is sequence labeling, which decodes a sentence multiple times so as to identify the entities nested within it. The second is the fragment method, which converts entity recognition into the classification of fragments: all fragments of a sentence are enumerated and classified, and the nested entities are thereby identified.
Compared with sequence labeling, the fragment method has a lower miss rate and is therefore widely adopted. However, the number of negative samples that must be considered during training is very large: a sentence of n characters generates n(n+1)/2 fragments (for example, a 50-character sentence yields 1275 candidates). This causes sample imbalance, slows the convergence of the model, and reduces training efficiency; especially when sequences are long, the on-line latency requirements on the model cannot be met.
Disclosure of Invention
In view of the foregoing technical problems, embodiments of the present invention provide a method and a system for classifying nested entities based on contrastive learning, so as to solve at least one of the above technical problems.
The technical scheme adopted by the invention is as follows:
An embodiment of the invention provides a nested entity classification method based on contrastive learning, which comprises the following steps:
S1, acquiring input sentence data tables; wherein row j of sentence data table i comprises $(X^{ij}, L^{ij})$, with $X^{ij}=(X^{ij}_1, X^{ij}_2, \dots, X^{ij}_{n_{ij}})$, where $X^{ij}_k$ is the k-th character in the j-th sentence of sentence data table i, k ranges from 1 to $n_{ij}$, and $n_{ij}$ is the number of characters in the j-th sentence of sentence data table i; $L^{ij}=\{(E^{ij}_1, T^{ij}_1), (E^{ij}_2, T^{ij}_2), \dots, (E^{ij}_{m_{ij}}, T^{ij}_{m_{ij}})\}$, where $E^{ij}_r$ is the r-th entity in the j-th sentence of sentence data table i and $T^{ij}_r$ is the actual entity type of $E^{ij}_r$, r ranges from 1 to $m_{ij}$, and $m_{ij}$ is the number of entities in the j-th sentence of sentence data table i; i ranges from 1 to N, j ranges from 1 to $P_i$, and $P_i$ is the number of sentences in sentence data table i; N is the number of sentence data tables;
s2, for the sentence j in the sentence data table i, the following operations are executed:
S201, encoding sentence j twice with a pre-trained language model to obtain a first representation vector $h1^{ij}=(h1^{ij}_1, h1^{ij}_2, \dots, h1^{ij}_{n_{ij}})$ and a second representation vector $h2^{ij}=(h2^{ij}_1, h2^{ij}_2, \dots, h2^{ij}_{n_{ij}})$, respectively, wherein $h1^{ij}_k$ and $h2^{ij}_k$ are the representations obtained from the first and the second encoding of $X^{ij}_k$, respectively;
S202, obtaining the contrastive loss $\mathrm{Loss1}^{ij}_r$ (equation given as an image in the original); wherein $e1^{ij}_r$ and $e2^{ij}_r$ are respectively the first entity representation and the second entity representation of the r-th entity in the entity-representation vectors corresponding to $h1^{ij}$ and $h2^{ij}$, and $e2^{is}_t$ is the second entity representation of the t-th entity in the s-th sentence ($s \neq j$) of sentence data table i; $\tau$ is a temperature hyper-parameter; $\mathrm{sim}(e1^{ij}_r, e2^{ij}_r)$ denotes the cosine similarity between $e1^{ij}_r$ and $e2^{ij}_r$, and $\mathrm{sim}(e1^{ij}_r, e2^{is}_t)$ denotes the cosine similarity between $e1^{ij}_r$ and $e2^{is}_t$;
S203, obtaining the contrastive loss $\mathrm{Loss2}^{ij}_r$ (equation given as an image in the original); wherein $B1'_i$ is an entity representation of any entity, outside sentence j in sentence data table i, of the same type as the entity corresponding to the r-th entity representation, and $e1^{ip}_q$ is the first entity representation of an entity q, in the p-th sentence ($p \neq j$) of sentence data table i, whose type differs from that of the entity corresponding to the r-th entity representation;
S204, optimizing $\tau$ and the dropout in the pre-trained language model so that $\mathrm{Loss1}^{ij}_r$ and $\mathrm{Loss2}^{ij}_r$ are minimized;
S205, setting j = j + 1; if $j \leq P_i$, execute S2; otherwise, execute S3;
S3, enumerating the segments of each sentence, and randomly extracting a set number of non-entity segments as negative samples, to obtain a training set comprising N training samples;
S4, inputting the training set into the optimized pre-trained language model, and classifying the entity types in each sentence to obtain classification prediction results;
S5, optimizing the optimized pre-trained language model based on the classification prediction results and the actual entity types in each sentence to obtain a target nested entity classification model;
and S6, classifying the input sentences by using the target nested entity classification model.
The invention also discloses a nested entity classification system based on contrastive learning, comprising a server and a database in communication connection, wherein the server comprises a processor and a memory storing a computer program, and the database stores N sentence data tables, wherein row j of sentence data table i comprises $(X^{ij}, L^{ij})$, with $X^{ij}=(X^{ij}_1, X^{ij}_2, \dots, X^{ij}_{n_{ij}})$, where $X^{ij}_k$ is the k-th character in the j-th sentence of sentence data table i, k ranges from 1 to $n_{ij}$, and $n_{ij}$ is the number of characters in the j-th sentence of sentence data table i; $L^{ij}=\{(E^{ij}_1, T^{ij}_1), (E^{ij}_2, T^{ij}_2), \dots, (E^{ij}_{m_{ij}}, T^{ij}_{m_{ij}})\}$, where $E^{ij}_r$ is the r-th entity in the j-th sentence of sentence data table i and $T^{ij}_r$ is the actual entity type of $E^{ij}_r$, r ranges from 1 to $m_{ij}$, and $m_{ij}$ is the number of entities in the j-th sentence of sentence data table i; i ranges from 1 to N, j ranges from 1 to $P_i$, and $P_i$ is the number of sentences in sentence data table i;
the processor is configured to execute a computer program to implement the steps of:
s10, for the sentence j in the sentence data table i, the following operations are executed:
S101, encoding sentence j twice with the pre-trained language model BERT to obtain a first representation vector $h1^{ij}=(h1^{ij}_1, h1^{ij}_2, \dots, h1^{ij}_{n_{ij}})$ and a second representation vector $h2^{ij}=(h2^{ij}_1, h2^{ij}_2, \dots, h2^{ij}_{n_{ij}})$, respectively, wherein $h1^{ij}_k$ and $h2^{ij}_k$ are the representations obtained from the first and the second encoding of $X^{ij}_k$, respectively;
S102, obtaining the contrastive loss $\mathrm{Loss1}^{ij}_r$ (equation given as an image in the original); wherein $e1^{ij}_r$ and $e2^{ij}_r$ are respectively the first entity representation and the second entity representation of the r-th entity in the entity-representation vectors corresponding to $h1^{ij}$ and $h2^{ij}$, and $e2^{is}_t$ is the second entity representation of the t-th entity in the s-th sentence ($s \neq j$) of sentence data table i; $\tau$ is a temperature hyper-parameter; $\mathrm{sim}(e1^{ij}_r, e2^{ij}_r)$ denotes the cosine similarity between $e1^{ij}_r$ and $e2^{ij}_r$, and $\mathrm{sim}(e1^{ij}_r, e2^{is}_t)$ denotes the cosine similarity between $e1^{ij}_r$ and $e2^{is}_t$;
S103, obtaining the contrastive loss $\mathrm{Loss2}^{ij}_r$ (equation given as an image in the original); wherein $B1'_i$ is an entity representation of any entity, outside sentence j in sentence data table i, of the same type as the entity corresponding to the r-th entity representation, and $e1^{ip}_q$ is the first entity representation of an entity q, in the p-th sentence ($p \neq j$) of sentence data table i, whose type differs from that of the entity corresponding to the r-th entity representation;
S104, optimizing $\tau$ and the dropout in the pre-trained language model so that $\mathrm{Loss1}^{ij}_r$ and $\mathrm{Loss2}^{ij}_r$ are minimized;
S105, setting j = j + 1; if $j \leq P_i$, execute S10; otherwise, execute S20;
S20, enumerating the segments of each sentence, and randomly extracting a set number of non-entity segments as negative samples, to obtain a training set comprising N training samples;
S30, inputting the training set into the optimized pre-trained language model, and classifying the entity types in each sentence to obtain classification prediction results;
and S40, optimizing the optimized pre-trained language model based on the classification prediction results and the actual entity types in each sentence to obtain a target nested entity classification model.
The embodiments of the invention have at least the following technical effects: the target nested entity classification model for nested entity classification is obtained in two stages, wherein in the first stage the representations of entities are learned by a contrastive learning method, and in the second stage a fragment method is adopted.
Detailed Description
In order to make the technical problems, technical solutions and advantages to be solved by the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below.
An embodiment of the present invention provides a method for classifying nested entities based on contrastive learning, which includes the following steps:
S1, acquiring input sentence data tables; wherein row j of sentence data table i comprises $(X^{ij}, L^{ij})$, with $X^{ij}=(X^{ij}_1, X^{ij}_2, \dots, X^{ij}_{n_{ij}})$, where $X^{ij}_k$ is the k-th character in the j-th sentence of sentence data table i, k ranges from 1 to $n_{ij}$, and $n_{ij}$ is the number of characters in the j-th sentence of sentence data table i; $L^{ij}=\{(E^{ij}_1, T^{ij}_1), (E^{ij}_2, T^{ij}_2), \dots, (E^{ij}_{m_{ij}}, T^{ij}_{m_{ij}})\}$, where $E^{ij}_r$ is the r-th entity in the j-th sentence of sentence data table i and $T^{ij}_r$ is the actual entity type of $E^{ij}_r$, r ranges from 1 to $m_{ij}$, and $m_{ij}$ is the number of entities in the j-th sentence of sentence data table i; i ranges from 1 to N, j ranges from 1 to $P_i$, and $P_i$ is the number of sentences in sentence data table i; and N is the number of sentence data tables.
In one illustrative embodiment, sentence data table i includes two sentences:
statement 1: "Beijing university school's Long Chui Yuanbei", wherein the entity type of "Beijing university" nesting entity, "Beijing" is address type, the entity type of "university" is organization type, and the entity type of "Chui Yuanbei" is object type.
Statement 2: "dune tung graduate at hong kong chinese university", wherein the entity type of "dune tung" is a character type, the entity type of "hong kong" is an address type, and the entity type of "chinese university" is an institution type.
In an embodiment of the present invention, the number of sentences in each sentence data table may be the same, i.e., $P_1 = P_2 = \dots = P_N$.
In another embodiment of the present invention, the number of sentences in the first N-1 sentence data tables may be the same, i.e., $P_1 = P_2 = \dots = P_{N-1} = P$, and the number of sentences in the last sentence data table may be $M - (N-1) \times P$, where M is the total number of sentences; for example, with M = 1000 sentences, N = 4 tables, and P = 300, the last table contains 100 sentences.
S2, for the sentence j in the sentence data table i, the following operations are executed:
S201, encoding sentence j twice with a pre-trained language model to obtain a first representation vector $h1^{ij}=(h1^{ij}_1, h1^{ij}_2, \dots, h1^{ij}_{n_{ij}})$ and a second representation vector $h2^{ij}=(h2^{ij}_1, h2^{ij}_2, \dots, h2^{ij}_{n_{ij}})$, respectively, wherein $h1^{ij}_k$ and $h2^{ij}_k$ are the representations obtained from the first and the second encoding of $X^{ij}_k$, respectively.
In an exemplary embodiment of the invention, the pre-trained language model may be a BERT model. Owing to the random dropout mask inside BERT, the results of encoding the same sentence multiple times differ; this characteristic is exploited to encode each sample twice, generating the positive pairs required for contrastive learning.
In another exemplary embodiment of the invention, the pre-trained language model is a RoBERTa model.
Those skilled in the art will appreciate that encoding sentences with a pre-trained language model may be implemented using the prior art.
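To illustrate the two-pass encoding, the following is a minimal sketch assuming the HuggingFace transformers API; the model name and usage details are illustrative assumptions, not prescribed by the patent:

from transformers import BertTokenizer, BertModel

# Encode one sentence twice with dropout active: the randomness makes the
# two passes yield slightly different representations of the same input,
# which serve as the positive pair for contrastive learning.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
model = BertModel.from_pretrained("bert-base-chinese")
model.train()  # keep dropout enabled so the two passes differ

sentence = "北京大学校长蔡元培"  # sentence 1 from the example above
inputs = tokenizer(sentence, return_tensors="pt")

h1 = model(**inputs).last_hidden_state  # first representation vector h1
h2 = model(**inputs).last_hidden_state  # second representation vector h2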
S202, obtaining
Figure BDA0003545646830000061
Wherein the content of the first and second substances,
Figure BDA0003545646830000062
are each h1ijAnd h2ijThe first entity representation and the second entity representation of the r-th entity in the corresponding entity representation vector, for example, the field "beijing" composed of the first character and the second character in statement 1 is the first entity of statement 1, then,
Figure BDA0003545646830000063
second entity characterization for the t-th entity in the s-th statement in statement data table i except for statement j, e.g., "dunghong", "hong kong", and "hong kong chinese university" in statement 2; tau is a temperature over-parameter;
Figure BDA0003545646830000064
to represent
Figure BDA0003545646830000065
And
Figure BDA0003545646830000066
the cosine similarity between the two signals is determined,
Figure BDA0003545646830000067
to represent
Figure BDA0003545646830000068
And
Figure BDA0003545646830000069
In the embodiment of the present invention, the loss function Loss1 is used to make the corresponding entity representations in the two encoding results similar: for example, the two entity representations of "Cai Yuanpei" obtained from the two encodings of sentence 1 should be as similar as possible, while being far from the entity representations of the other entities in the current sentence data table.
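The formula for Loss1 survives only as an image in this text; assuming the standard InfoNCE form that the surrounding definitions point to (the two encodings of the same entity form the positive pair, and the second encodings of the other entities in the same sentence data table serve as negatives), a plausible reconstruction is:

$$\mathrm{Loss1}^{ij}_r = -\log\frac{\exp\!\big(\mathrm{sim}(e1^{ij}_r,\,e2^{ij}_r)/\tau\big)}{\exp\!\big(\mathrm{sim}(e1^{ij}_r,\,e2^{ij}_r)/\tau\big)+\sum_{s\neq j}\sum_{t}\exp\!\big(\mathrm{sim}(e1^{ij}_r,\,e2^{is}_t)/\tau\big)}$$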
S203, obtaining
Figure BDA00035456468300000610
Wherein, B1iThe entity representation of any entity with the same type as the entity corresponding to the r-th entity representation except the statement j in the statement data table i,
Figure BDA00035456468300000611
the first entity representation is the entity q which is different in entity type corresponding to the r-th entity representation in the p-th statement except the statement j in the statement data table i.
In the embodiment of the present invention, the loss function Loss2 is used to make entity representations of the same type similar and those of different types far apart within the current sentence data table: for example, the person-type entity "Cai Yuanpei" should be similar to "Qiu Chengtong" and far from entities of other types in the current sentence data table.
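Loss2 likewise appears only as an image; assuming the supervised-contrastive form suggested by the definitions of $B1'_i$ (a same-type positive representation) and $e1^{ip}_q$ (different-type negatives), a plausible reconstruction is:

$$\mathrm{Loss2}^{ij}_r = -\log\frac{\exp\!\big(\mathrm{sim}(e1^{ij}_r,\,B1'_i)/\tau\big)}{\exp\!\big(\mathrm{sim}(e1^{ij}_r,\,B1'_i)/\tau\big)+\sum_{p\neq j}\sum_{q}\exp\!\big(\mathrm{sim}(e1^{ij}_r,\,e1^{ip}_q)/\tau\big)}$$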
S204, optimizing the tau and the dropout in the pre-training language model to ensure that Loss1k ijAnd Loss2k ijAnd minimum.
Those skilled in the art will appreciate that optimizing $\tau$ and the dropout in the pre-trained language model so as to minimize $\mathrm{Loss1}^{ij}_r$ and $\mathrm{Loss2}^{ij}_r$ may be implemented using the prior art.
τ and dropout after the first stage optimization can be obtained through S204.
S205, setting j ═ j + 1; if j is less than or equal to Pi, executing S2; otherwise, S3 is executed.
And S3, enumerating the segments of each sentence, and randomly extracting a set number of non-entity segments as negative samples, to obtain a training set comprising N training samples.
In the embodiment of the present invention, the set number may be chosen based on actual needs. Specifically, each training sample is a sentence together with its positive samples and negative samples, the positive samples being the entities of the sentence. For example, the enumerated segments of sentence 1, "Peking University president Cai Yuanpei", may include: "Beijing", "Peking University", "president", "Cai Yuanpei", "Peking University president", and so on. The positive samples are: "Beijing", "Peking University", "Cai Yuanpei"; negative samples such as "Jingda" can be randomly drawn from the remaining segments.
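A minimal sketch of this enumeration and negative sampling; the sentence, the entity span indices, and the sample count are illustrative assumptions:

import random

def build_training_sample(sentence, entity_spans, num_negatives):
    # entity_spans: set of (start, end) inclusive character indices of the
    # labeled entities (the positive samples). A sentence of n characters
    # yields n * (n + 1) / 2 candidate segments.
    n = len(sentence)
    all_spans = [(s, e) for s in range(n) for e in range(s, n)]
    negatives = [sp for sp in all_spans if sp not in entity_spans]
    sampled = random.sample(negatives, min(num_negatives, len(negatives)))
    return {"sentence": sentence,
            "positives": sorted(entity_spans),
            "negatives": sampled}

# Sentence 1: "Beijing" = (0, 1), "Peking University" = (0, 3),
# "Cai Yuanpei" = (6, 8); 9 characters give 45 candidate segments.
sample = build_training_sample("北京大学校长蔡元培", {(0, 1), (0, 3), (6, 8)}, 10)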
And S4, inputting the training set into the optimized pre-trained language model, and classifying the type of each segment in each sentence to obtain a classification prediction result.
The classification prediction result may include a classification result, i.e., a type, for each segment in the training set.
And S5, optimizing the optimized pre-trained language model based on the classification prediction result and the actual entity types in each sentence to obtain a target nested entity classification model.
In an embodiment of the present invention, the optimized pre-trained language model may be optimized based on the F1 score. In the embodiment of the invention, the positive samples in the training set are labeled, i.e., the entity type of each entity is known, so the segments of a sentence that correspond to no entity type are also known, and the classification accuracy can be obtained by comparing the predicted types with the actual types. Those skilled in the art will appreciate that determining classification accuracy based on the F1 score may be implemented using the prior art.
When the classification accuracy is greater than or equal to a set threshold, the current classification model is considered accurate and can therefore be used as the target nested entity classification model; if the classification accuracy is smaller than the set threshold, $\tau$ and the dropout continue to be adjusted until the classification accuracy is greater than or equal to the set threshold.
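A minimal sketch of the accuracy check; exact-match span-level F1 is assumed here, since the patent does not fix the evaluation granularity:

def span_f1(predicted, gold):
    # predicted / gold: sets of (start, end, entity_type) triples.
    if not predicted or not gold:
        return 0.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. span_f1({(0, 1, "ADDR"), (0, 3, "ORG")},
#              {(0, 1, "ADDR"), (0, 3, "ORG"), (6, 8, "PER")}) returns 0.8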
Because S1 and S2 allow the characteristics of the samples to be learned in advance, the proportion of negative samples can be reduced, the convergence of the model is accelerated, the model results are more stable, and the entity boundary regions are scored more distinctly.
And S6, classifying the input sentences by using the target nested entity classification model.
In practical application, the obtained target nested entity classification model can be used for directly classifying the input sentences.
The invention further provides a nested entity classification system based on contrastive learning, comprising a server and a database in communication connection, wherein the server comprises a processor and a memory storing a computer program, and the database stores N sentence data tables, wherein row j of sentence data table i comprises $(X^{ij}, L^{ij})$, with $X^{ij}=(X^{ij}_1, X^{ij}_2, \dots, X^{ij}_{n_{ij}})$, where $X^{ij}_k$ is the k-th character in the j-th sentence of sentence data table i, k ranges from 1 to $n_{ij}$, and $n_{ij}$ is the number of characters in the j-th sentence of sentence data table i; $L^{ij}=\{(E^{ij}_1, T^{ij}_1), (E^{ij}_2, T^{ij}_2), \dots, (E^{ij}_{m_{ij}}, T^{ij}_{m_{ij}})\}$, where $E^{ij}_r$ is the r-th entity in the j-th sentence of sentence data table i and $T^{ij}_r$ is the actual entity type of $E^{ij}_r$, r ranges from 1 to $m_{ij}$, and $m_{ij}$ is the number of entities in the j-th sentence of sentence data table i; i ranges from 1 to N, j ranges from 1 to $P_i$, and $P_i$ is the number of sentences in sentence data table i;
the processor is configured to execute the computer program to perform the steps of:
s10, for the sentence j in the sentence data table i, the following operations are executed:
S101, encoding sentence j twice with the pre-trained language model BERT to obtain a first representation vector $h1^{ij}=(h1^{ij}_1, h1^{ij}_2, \dots, h1^{ij}_{n_{ij}})$ and a second representation vector $h2^{ij}=(h2^{ij}_1, h2^{ij}_2, \dots, h2^{ij}_{n_{ij}})$, respectively, wherein $h1^{ij}_k$ and $h2^{ij}_k$ are the representations obtained from the first and the second encoding of $X^{ij}_k$, respectively;
S102, obtaining the contrastive loss $\mathrm{Loss1}^{ij}_r$ (equation given as an image in the original); wherein $e1^{ij}_r$ and $e2^{ij}_r$ are respectively the first entity representation and the second entity representation of the r-th entity in the entity-representation vectors corresponding to $h1^{ij}$ and $h2^{ij}$, and $e2^{is}_t$ is the second entity representation of the t-th entity in the s-th sentence ($s \neq j$) of sentence data table i; $\tau$ is a temperature hyper-parameter; $\mathrm{sim}(e1^{ij}_r, e2^{ij}_r)$ denotes the cosine similarity between $e1^{ij}_r$ and $e2^{ij}_r$, and $\mathrm{sim}(e1^{ij}_r, e2^{is}_t)$ denotes the cosine similarity between $e1^{ij}_r$ and $e2^{is}_t$;
S103, obtaining the contrastive loss $\mathrm{Loss2}^{ij}_r$ (equation given as an image in the original); wherein $B1'_i$ is an entity representation of any entity, outside sentence j in sentence data table i, of the same type as the entity corresponding to the r-th entity representation, and $e1^{ip}_q$ is the first entity representation of an entity q, in the p-th sentence ($p \neq j$) of sentence data table i, whose type differs from that of the entity corresponding to the r-th entity representation;
S104, optimizing $\tau$ and the dropout in the pre-trained language model so that $\mathrm{Loss1}^{ij}_r$ and $\mathrm{Loss2}^{ij}_r$ are minimized;
S105, setting j = j + 1; if $j \leq P_i$, execute S10; otherwise, execute S20;
S20, enumerating the segments of each sentence, and randomly extracting a set number of non-entity segments as negative samples, to obtain a training set comprising N training samples;
S30, inputting the training set into the optimized pre-trained language model, and classifying the entity types in each sentence to obtain classification prediction results;
and S40, optimizing the optimized pre-trained language model based on the classification prediction results and the actual entity types in each sentence to obtain a target nested entity classification model.
Further, in S40, the optimized pre-trained language model is optimized based on the F1 score.
Further, the pre-trained language model is a BERT model.
Further, the pre-trained language model is a RoBERTa model.
Further, $P_1 = P_2 = \dots = P_N$.
The specific implementation of this embodiment can be seen in the foregoing embodiments.
Although some specific embodiments of the present invention have been described in detail by way of illustration, it should be understood by those skilled in the art that the above illustration is only for the purpose of illustration and is not intended to limit the scope of the invention. It will also be appreciated by those skilled in the art that various modifications may be made to the embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims (10)

1. A nested entity classification method based on contrastive learning, characterized by comprising the following steps:
S1, acquiring input sentence data tables; wherein row j of sentence data table i comprises $(X^{ij}, L^{ij})$, with $X^{ij}=(X^{ij}_1, X^{ij}_2, \dots, X^{ij}_{n_{ij}})$, where $X^{ij}_k$ is the k-th character in the j-th sentence of sentence data table i, k ranges from 1 to $n_{ij}$, and $n_{ij}$ is the number of characters in the j-th sentence of sentence data table i; $L^{ij}=\{(E^{ij}_1, T^{ij}_1), (E^{ij}_2, T^{ij}_2), \dots, (E^{ij}_{m_{ij}}, T^{ij}_{m_{ij}})\}$, where $E^{ij}_r$ is the r-th entity in the j-th sentence of sentence data table i and $T^{ij}_r$ is the actual entity type of $E^{ij}_r$, r ranges from 1 to $m_{ij}$, and $m_{ij}$ is the number of entities in the j-th sentence of sentence data table i; i ranges from 1 to N, j ranges from 1 to $P_i$, and $P_i$ is the number of sentences in sentence data table i; N is the number of sentence data tables;
s2, for the sentence j in the sentence data table i, the following operations are executed:
S201, encoding sentence j twice with a pre-trained language model to obtain a first representation vector $h1^{ij}=(h1^{ij}_1, h1^{ij}_2, \dots, h1^{ij}_{n_{ij}})$ and a second representation vector $h2^{ij}=(h2^{ij}_1, h2^{ij}_2, \dots, h2^{ij}_{n_{ij}})$, respectively, wherein $h1^{ij}_k$ and $h2^{ij}_k$ are the representations obtained from the first and the second encoding of $X^{ij}_k$, respectively;
S202, obtaining the contrastive loss $\mathrm{Loss1}^{ij}_r$ (equation given as an image in the original); wherein $e1^{ij}_r$ and $e2^{ij}_r$ are respectively the first entity representation and the second entity representation of the r-th entity in the entity-representation vectors corresponding to $h1^{ij}$ and $h2^{ij}$, and $e2^{is}_t$ is the second entity representation of the t-th entity in the s-th sentence ($s \neq j$) of sentence data table i; $\tau$ is a temperature hyper-parameter; $\mathrm{sim}(e1^{ij}_r, e2^{ij}_r)$ denotes the cosine similarity between $e1^{ij}_r$ and $e2^{ij}_r$, and $\mathrm{sim}(e1^{ij}_r, e2^{is}_t)$ denotes the cosine similarity between $e1^{ij}_r$ and $e2^{is}_t$;
S203, obtaining the contrastive loss $\mathrm{Loss2}^{ij}_r$ (equation given as an image in the original); wherein $B1'_i$ is an entity representation of any entity, outside sentence j in sentence data table i, of the same type as the entity corresponding to the r-th entity representation, and $e1^{ip}_q$ is the first entity representation of an entity q, in the p-th sentence ($p \neq j$) of sentence data table i, whose type differs from that of the entity corresponding to the r-th entity representation;
S204, optimizing $\tau$ and the dropout in the pre-trained language model so that $\mathrm{Loss1}^{ij}_r$ and $\mathrm{Loss2}^{ij}_r$ are minimized;
S205, setting j = j + 1; if $j \leq P_i$, execute S2; otherwise, execute S3;
S3, enumerating the segments of each sentence, and randomly extracting a set number of non-entity segments as negative samples, to obtain a training set comprising N training samples;
S4, inputting the training set into the optimized pre-trained language model, and classifying the entity types in each sentence to obtain classification prediction results;
S5, optimizing the optimized pre-trained language model based on the classification prediction results and the actual entity types in each sentence to obtain a target nested entity classification model;
and S6, classifying the input sentences by using the target nested entity classification model.
2. The method of claim 1, wherein in S5, the optimized pre-trained language model is optimized based on the F1 score.
3. The method of claim 1, wherein the pre-trained language model is a BERT model.
4. The method of claim 1, wherein the pre-trained language model is a RoBERTa model.
5. The method of claim 1, wherein $P_1 = P_2 = \dots = P_N$.
6. A nested entity classification system based on contrastive learning, characterized by comprising a server and a database in communication connection, wherein the server comprises a processor and a memory storing a computer program, and the database stores N sentence data tables, wherein row j of sentence data table i comprises $(X^{ij}, L^{ij})$, with $X^{ij}=(X^{ij}_1, X^{ij}_2, \dots, X^{ij}_{n_{ij}})$, where $X^{ij}_k$ is the k-th character in the j-th sentence of sentence data table i, k ranges from 1 to $n_{ij}$, and $n_{ij}$ is the number of characters in the j-th sentence of sentence data table i; $L^{ij}=\{(E^{ij}_1, T^{ij}_1), (E^{ij}_2, T^{ij}_2), \dots, (E^{ij}_{m_{ij}}, T^{ij}_{m_{ij}})\}$, where $E^{ij}_r$ is the r-th entity in the j-th sentence of sentence data table i and $T^{ij}_r$ is the actual entity type of $E^{ij}_r$, r ranges from 1 to $m_{ij}$, and $m_{ij}$ is the number of entities in the j-th sentence of sentence data table i; i ranges from 1 to N, j ranges from 1 to $P_i$, and $P_i$ is the number of sentences in sentence data table i;
the processor is configured to execute a computer program to implement the steps of:
s10, for the sentence j in the sentence data table i, the following operations are executed:
S101, encoding sentence j twice with the pre-trained language model BERT to obtain a first representation vector $h1^{ij}=(h1^{ij}_1, h1^{ij}_2, \dots, h1^{ij}_{n_{ij}})$ and a second representation vector $h2^{ij}=(h2^{ij}_1, h2^{ij}_2, \dots, h2^{ij}_{n_{ij}})$, respectively, wherein $h1^{ij}_k$ and $h2^{ij}_k$ are the representations obtained from the first and the second encoding of $X^{ij}_k$, respectively;
S102, obtaining the contrastive loss $\mathrm{Loss1}^{ij}_r$ (equation given as an image in the original); wherein $e1^{ij}_r$ and $e2^{ij}_r$ are respectively the first entity representation and the second entity representation of the r-th entity in the entity-representation vectors corresponding to $h1^{ij}$ and $h2^{ij}$, and $e2^{is}_t$ is the second entity representation of the t-th entity in the s-th sentence ($s \neq j$) of sentence data table i; $\tau$ is a temperature hyper-parameter; $\mathrm{sim}(e1^{ij}_r, e2^{ij}_r)$ denotes the cosine similarity between $e1^{ij}_r$ and $e2^{ij}_r$, and $\mathrm{sim}(e1^{ij}_r, e2^{is}_t)$ denotes the cosine similarity between $e1^{ij}_r$ and $e2^{is}_t$;
S103, obtaining the contrastive loss $\mathrm{Loss2}^{ij}_r$ (equation given as an image in the original); wherein $B1'_i$ is an entity representation of any entity, outside sentence j in sentence data table i, of the same type as the entity corresponding to the r-th entity representation, and $e1^{ip}_q$ is the first entity representation of an entity q, in the p-th sentence ($p \neq j$) of sentence data table i, whose type differs from that of the entity corresponding to the r-th entity representation;
S104, optimizing $\tau$ and the dropout in the pre-trained language model so that $\mathrm{Loss1}^{ij}_r$ and $\mathrm{Loss2}^{ij}_r$ are minimized;
S105, setting j = j + 1; if $j \leq P_i$, execute S10; otherwise, execute S20;
S20, enumerating the segments of each sentence, and randomly extracting a set number of non-entity segments as negative samples, to obtain a training set comprising N training samples;
S30, inputting the training set into the optimized pre-trained language model, and classifying the entity types in each sentence to obtain classification prediction results;
and S40, optimizing the optimized pre-trained language model based on the classification prediction results and the actual entity types in each sentence to obtain a target nested entity classification model.
7. The system of claim 6, wherein in S40, the optimized pre-trained language model is optimized based on the F1 score.
8. The system of claim 6, wherein the pre-trained language model is a BERT model.
9. The system of claim 6, wherein the pre-trained language model is a RoBERTa model.
10. The system of claim 6, wherein $P_1 = P_2 = \dots = P_N$.
CN202210247571.XA 2022-03-14 2022-03-14 Nested entity identification method and system based on contrastive learning Active CN114462391B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210247571.XA CN114462391B (en) 2022-03-14 2022-03-14 Nested entity identification method and system based on contrastive learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210247571.XA CN114462391B (en) 2022-03-14 2022-03-14 Nested entity identification method and system based on contrastive learning

Publications (2)

Publication Number Publication Date
CN114462391A true CN114462391A (en) 2022-05-10
CN114462391B CN114462391B (en) 2024-05-14

Family

ID=81417788

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210247571.XA Active CN114462391B (en) Nested entity identification method and system based on contrastive learning

Country Status (1)

Country Link
CN (1) CN114462391B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753545A (en) * 2020-06-19 2020-10-09 科大讯飞(苏州)科技有限公司 Nested entity recognition method and device, electronic equipment and storage medium
CN112347785A (en) * 2020-11-18 2021-02-09 湖南国发控股有限公司 Nested entity recognition system based on multitask learning
CN112487812A (en) * 2020-10-21 2021-03-12 上海旻浦科技有限公司 Nested entity identification method and system based on boundary identification
CN113869051A (en) * 2021-09-22 2021-12-31 西安理工大学 Named entity identification method based on deep learning
CN113886571A (en) * 2020-07-01 2022-01-04 北京三星通信技术研究有限公司 Entity identification method, entity identification device, electronic equipment and computer readable storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753545A (en) * 2020-06-19 2020-10-09 科大讯飞(苏州)科技有限公司 Nested entity recognition method and device, electronic equipment and storage medium
CN113886571A (en) * 2020-07-01 2022-01-04 北京三星通信技术研究有限公司 Entity identification method, entity identification device, electronic equipment and computer readable storage medium
WO2022005188A1 (en) * 2020-07-01 2022-01-06 Samsung Electronics Co., Ltd. Entity recognition method, apparatus, electronic device and computer readable storage medium
CN112487812A (en) * 2020-10-21 2021-03-12 上海旻浦科技有限公司 Nested entity identification method and system based on boundary identification
CN112347785A (en) * 2020-11-18 2021-02-09 湖南国发控股有限公司 Nested entity recognition system based on multitask learning
CN113869051A (en) * 2021-09-22 2021-12-31 西安理工大学 Named entity identification method based on deep learning

Also Published As

Publication number Publication date
CN114462391B (en) 2024-05-14

Similar Documents

Publication Publication Date Title
CN108121700B (en) Keyword extraction method and device and electronic equipment
CN111553164A (en) Training method and device for named entity recognition model and computer equipment
CN112487812B (en) Nested entity identification method and system based on boundary identification
US11526663B2 (en) Methods, apparatuses, devices, and computer-readable storage media for determining category of entity
CN108228758A (en) A kind of file classification method and device
CN111599340A (en) Polyphone pronunciation prediction method and device and computer readable storage medium
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN112100377B (en) Text classification method, apparatus, computer device and storage medium
CN109948160B (en) Short text classification method and device
CN113836925B (en) Training method and device for pre-training language model, electronic equipment and storage medium
CN112613293B (en) Digest generation method, digest generation device, electronic equipment and storage medium
CN111414746A (en) Matching statement determination method, device, equipment and storage medium
CN114036950B (en) Medical text named entity recognition method and system
CN112541332A (en) Form information extraction method and device, electronic equipment and storage medium
CN115048505A (en) Corpus screening method and device, electronic equipment and computer readable medium
CN110929532B (en) Data processing method, device, equipment and storage medium
CN115730590A (en) Intention recognition method and related equipment
CN110969005B (en) Method and device for determining similarity between entity corpora
CN114742016A (en) Chapter-level event extraction method and device based on multi-granularity entity differential composition
CN117033961A (en) Multi-mode image-text classification method for context awareness
CN115080748B (en) Weak supervision text classification method and device based on learning with noise label
CN116720498A (en) Training method and device for text similarity detection model and related medium thereof
CN114462391A (en) Nested entity identification method and system based on contrastive learning
CN115713082A (en) Named entity identification method, device, equipment and storage medium
CN113688633A (en) Outline determination method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant