CN114462391A - Nested entity identification method and system based on contrastive learning - Google Patents

Nested entity identification method and system based on contrastive learning

Info

Publication number
CN114462391A
CN114462391A (Application No. CN202210247571.XA)
Authority
CN
China
Prior art keywords
entity
statement
data table
sentence
representation
Prior art date
2022-03-14
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210247571.XA
Other languages
Chinese (zh)
Other versions
CN114462391B (en)
Inventor
胡碧峰
王艳飞
胡茂海
尹光荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Workway Shenzhen Information Technology Co ltd
Original Assignee
Workway Shenzhen Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
2022-03-14
Publication date
2022-05-10
Application filed by Workway Shenzhen Information Technology Co ltd filed Critical Workway Shenzhen Information Technology Co ltd
Priority to CN202210247571.XA priority Critical patent/CN114462391B/en
Publication of CN114462391A publication Critical patent/CN114462391A/en
Application granted granted Critical
Publication of CN114462391B publication Critical patent/CN114462391B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method and a system for classifying nested entities based on contrastive learning. A target nested entity classification model for classifying nested entities is obtained in two stages: in the first stage, the representations of entities are learned by contrastive learning; in the second stage, a fragment (span classification) method is adopted.

Description

Nested entity identification method and system based on contrastive learning
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a nested entity identification method and system based on contrastive learning.
Background
Current nested entity recognition technology mainly uses two methods. The first is sequence labeling, which decodes a sentence multiple times so as to identify the entities nested within it. The second is the fragment method, which converts entity recognition into the classification of fragments: all fragments of a sentence are enumerated and classified, and the nested entities are thereby identified.
Compared with sequence labeling, the fragment method has a lower miss rate and is therefore widely adopted. However, the number of negative samples that must be considered during training is very large: a sentence of n characters generates n(n+1)/2 fragments (for example, a 50-character sentence yields 1275 candidates). This causes sample imbalance, slows the convergence of the model, and reduces training efficiency; especially when sequences are long, the on-line latency requirements on the model cannot be met.
Disclosure of Invention
In view of the foregoing technical problems, embodiments of the present invention provide a method and a system for classifying nested entities based on contrastive learning, so as to solve at least one of the above technical problems.
The technical scheme adopted by the invention is as follows:
An embodiment of the invention provides a nested entity classification method based on contrastive learning, which comprises the following steps:
S1, acquiring input sentence data tables; wherein row j of sentence data table i comprises $(X^{ij}, L^{ij})$, with $X^{ij}=(X^{ij}_1, X^{ij}_2, \dots, X^{ij}_{n_{ij}})$, where $X^{ij}_k$ is the k-th character in the j-th sentence of sentence data table i, k ranges from 1 to $n_{ij}$, and $n_{ij}$ is the number of characters in the j-th sentence of sentence data table i; $L^{ij}=\{(E^{ij}_1, T^{ij}_1), (E^{ij}_2, T^{ij}_2), \dots, (E^{ij}_{m_{ij}}, T^{ij}_{m_{ij}})\}$, where $E^{ij}_r$ is the r-th entity in the j-th sentence of sentence data table i and $T^{ij}_r$ is the actual entity type of $E^{ij}_r$, r ranges from 1 to $m_{ij}$, and $m_{ij}$ is the number of entities in the j-th sentence of sentence data table i; i ranges from 1 to N, j ranges from 1 to $P_i$, and $P_i$ is the number of sentences in sentence data table i; N is the number of sentence data tables;
s2, for the sentence j in the sentence data table i, the following operations are executed:
S201, encoding sentence j twice with a pre-trained language model to obtain a first representation vector $h1^{ij}=(h1^{ij}_1, h1^{ij}_2, \dots, h1^{ij}_{n_{ij}})$ and a second representation vector $h2^{ij}=(h2^{ij}_1, h2^{ij}_2, \dots, h2^{ij}_{n_{ij}})$, respectively, wherein $h1^{ij}_k$ and $h2^{ij}_k$ are the representations obtained from the first and the second encoding of $X^{ij}_k$, respectively;
S202, obtaining the contrastive loss $\mathrm{Loss1}^{ij}_r$ (equation given as an image in the original); wherein $e1^{ij}_r$ and $e2^{ij}_r$ are respectively the first entity representation and the second entity representation of the r-th entity in the entity-representation vectors corresponding to $h1^{ij}$ and $h2^{ij}$, and $e2^{is}_t$ is the second entity representation of the t-th entity in the s-th sentence ($s \neq j$) of sentence data table i; $\tau$ is a temperature hyper-parameter; $\mathrm{sim}(e1^{ij}_r, e2^{ij}_r)$ denotes the cosine similarity between $e1^{ij}_r$ and $e2^{ij}_r$, and $\mathrm{sim}(e1^{ij}_r, e2^{is}_t)$ denotes the cosine similarity between $e1^{ij}_r$ and $e2^{is}_t$;
S203, obtaining the contrastive loss $\mathrm{Loss2}^{ij}_r$ (equation given as an image in the original); wherein $B1'_i$ is an entity representation of any entity, outside sentence j in sentence data table i, of the same type as the entity corresponding to the r-th entity representation, and $e1^{ip}_q$ is the first entity representation of an entity q, in the p-th sentence ($p \neq j$) of sentence data table i, whose type differs from that of the entity corresponding to the r-th entity representation;
S204, optimizing $\tau$ and the dropout in the pre-trained language model so that $\mathrm{Loss1}^{ij}_r$ and $\mathrm{Loss2}^{ij}_r$ are minimized;
S205, setting j = j + 1; if $j \leq P_i$, execute S2; otherwise, execute S3;
S3, enumerating the segments of each sentence, and randomly extracting a set number of non-entity segments as negative samples, to obtain a training set comprising N training samples;
S4, inputting the training set into the optimized pre-trained language model, and classifying the entity types in each sentence to obtain classification prediction results;
S5, optimizing the optimized pre-trained language model based on the classification prediction results and the actual entity types in each sentence to obtain a target nested entity classification model;
and S6, classifying the input sentences by using the target nested entity classification model.
The invention also discloses a nested entity classification system based on contrastive learning, comprising a server and a database in communication connection, wherein the server comprises a processor and a memory storing a computer program, and the database stores N sentence data tables, wherein row j of sentence data table i comprises $(X^{ij}, L^{ij})$, with $X^{ij}=(X^{ij}_1, X^{ij}_2, \dots, X^{ij}_{n_{ij}})$, where $X^{ij}_k$ is the k-th character in the j-th sentence of sentence data table i, k ranges from 1 to $n_{ij}$, and $n_{ij}$ is the number of characters in the j-th sentence of sentence data table i; $L^{ij}=\{(E^{ij}_1, T^{ij}_1), (E^{ij}_2, T^{ij}_2), \dots, (E^{ij}_{m_{ij}}, T^{ij}_{m_{ij}})\}$, where $E^{ij}_r$ is the r-th entity in the j-th sentence of sentence data table i and $T^{ij}_r$ is the actual entity type of $E^{ij}_r$, r ranges from 1 to $m_{ij}$, and $m_{ij}$ is the number of entities in the j-th sentence of sentence data table i; i ranges from 1 to N, j ranges from 1 to $P_i$, and $P_i$ is the number of sentences in sentence data table i;
the processor is configured to execute a computer program to implement the steps of:
s10, for the sentence j in the sentence data table i, the following operations are executed:
S101, encoding sentence j twice with the pre-trained language model BERT to obtain a first representation vector $h1^{ij}=(h1^{ij}_1, h1^{ij}_2, \dots, h1^{ij}_{n_{ij}})$ and a second representation vector $h2^{ij}=(h2^{ij}_1, h2^{ij}_2, \dots, h2^{ij}_{n_{ij}})$, respectively, wherein $h1^{ij}_k$ and $h2^{ij}_k$ are the representations obtained from the first and the second encoding of $X^{ij}_k$, respectively;
S102, obtaining the contrastive loss $\mathrm{Loss1}^{ij}_r$ (equation given as an image in the original); wherein $e1^{ij}_r$ and $e2^{ij}_r$ are respectively the first entity representation and the second entity representation of the r-th entity in the entity-representation vectors corresponding to $h1^{ij}$ and $h2^{ij}$, and $e2^{is}_t$ is the second entity representation of the t-th entity in the s-th sentence ($s \neq j$) of sentence data table i; $\tau$ is a temperature hyper-parameter; $\mathrm{sim}(e1^{ij}_r, e2^{ij}_r)$ denotes the cosine similarity between $e1^{ij}_r$ and $e2^{ij}_r$, and $\mathrm{sim}(e1^{ij}_r, e2^{is}_t)$ denotes the cosine similarity between $e1^{ij}_r$ and $e2^{is}_t$;
S103, obtaining the contrastive loss $\mathrm{Loss2}^{ij}_r$ (equation given as an image in the original); wherein $B1'_i$ is an entity representation of any entity, outside sentence j in sentence data table i, of the same type as the entity corresponding to the r-th entity representation, and $e1^{ip}_q$ is the first entity representation of an entity q, in the p-th sentence ($p \neq j$) of sentence data table i, whose type differs from that of the entity corresponding to the r-th entity representation;
S104, optimizing $\tau$ and the dropout in the pre-trained language model so that $\mathrm{Loss1}^{ij}_r$ and $\mathrm{Loss2}^{ij}_r$ are minimized;
S105, setting j = j + 1; if $j \leq P_i$, execute S10; otherwise, execute S20;
S20, enumerating the segments of each sentence, and randomly extracting a set number of non-entity segments as negative samples, to obtain a training set comprising N training samples;
S30, inputting the training set into the optimized pre-trained language model, and classifying the entity types in each sentence to obtain classification prediction results;
and S40, optimizing the optimized pre-trained language model based on the classification prediction results and the actual entity types in each sentence to obtain a target nested entity classification model.
The embodiments of the invention have at least the following technical effects: the target nested entity classification model for nested entity classification is obtained in two stages, wherein in the first stage the representations of entities are learned by a contrastive learning method, and in the second stage a fragment method is adopted.
Detailed Description
In order to make the technical problems, technical solutions and advantages to be solved by the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below.
An embodiment of the present invention provides a method for classifying nested entities based on contrastive learning, which includes the following steps:
S1, acquiring input sentence data tables; wherein row j of sentence data table i comprises $(X^{ij}, L^{ij})$, with $X^{ij}=(X^{ij}_1, X^{ij}_2, \dots, X^{ij}_{n_{ij}})$, where $X^{ij}_k$ is the k-th character in the j-th sentence of sentence data table i, k ranges from 1 to $n_{ij}$, and $n_{ij}$ is the number of characters in the j-th sentence of sentence data table i; $L^{ij}=\{(E^{ij}_1, T^{ij}_1), (E^{ij}_2, T^{ij}_2), \dots, (E^{ij}_{m_{ij}}, T^{ij}_{m_{ij}})\}$, where $E^{ij}_r$ is the r-th entity in the j-th sentence of sentence data table i and $T^{ij}_r$ is the actual entity type of $E^{ij}_r$, r ranges from 1 to $m_{ij}$, and $m_{ij}$ is the number of entities in the j-th sentence of sentence data table i; i ranges from 1 to N, j ranges from 1 to $P_i$, and $P_i$ is the number of sentences in sentence data table i; and N is the number of sentence data tables.
In one illustrative embodiment, sentence data table i includes two sentences:
statement 1: "Beijing university school's Long Chui Yuanbei", wherein the entity type of "Beijing university" nesting entity, "Beijing" is address type, the entity type of "university" is organization type, and the entity type of "Chui Yuanbei" is object type.
Statement 2: "dune tung graduate at hong kong chinese university", wherein the entity type of "dune tung" is a character type, the entity type of "hong kong" is an address type, and the entity type of "chinese university" is an institution type.
In an embodiment of the present invention, the number of sentences in each sentence data table may be the same, i.e., $P_1 = P_2 = \dots = P_N$.
In another embodiment of the present invention, the number of sentences in the first N-1 sentence data tables may be the same, i.e., $P_1 = P_2 = \dots = P_{N-1} = P$, and the number of sentences in the last sentence data table may be $M - (N-1) \times P$, where M is the total number of sentences; for example, with M = 1000 sentences, N = 4 tables, and P = 300, the last table contains 100 sentences.
S2, for the sentence j in the sentence data table i, the following operations are executed:
S201, encoding sentence j twice with a pre-trained language model to obtain a first representation vector $h1^{ij}=(h1^{ij}_1, h1^{ij}_2, \dots, h1^{ij}_{n_{ij}})$ and a second representation vector $h2^{ij}=(h2^{ij}_1, h2^{ij}_2, \dots, h2^{ij}_{n_{ij}})$, respectively, wherein $h1^{ij}_k$ and $h2^{ij}_k$ are the representations obtained from the first and the second encoding of $X^{ij}_k$, respectively.
In an exemplary embodiment of the invention, the pre-trained language model may be a BERT model. Owing to the random dropout mask inside BERT, the results of encoding the same sentence multiple times differ; this characteristic is exploited to encode each sample twice, generating the positive pairs required for contrastive learning.
In another exemplary embodiment of the invention, the pre-trained language model is a RoBERTa model.
Those skilled in the art will appreciate that encoding sentences with a pre-trained language model may be implemented using the prior art.
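To illustrate the two-pass encoding, the following is a minimal sketch assuming the HuggingFace transformers API; the model name and usage details are illustrative assumptions, not prescribed by the patent:

from transformers import BertTokenizer, BertModel

# Encode one sentence twice with dropout active: the randomness makes the
# two passes yield slightly different representations of the same input,
# which serve as the positive pair for contrastive learning.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
model = BertModel.from_pretrained("bert-base-chinese")
model.train()  # keep dropout enabled so the two passes differ

sentence = "北京大学校长蔡元培"  # sentence 1 from the example above
inputs = tokenizer(sentence, return_tensors="pt")

h1 = model(**inputs).last_hidden_state  # first representation vector h1
h2 = model(**inputs).last_hidden_state  # second representation vector h2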
S202, obtaining
Figure BDA0003545646830000061
Wherein the content of the first and second substances,
Figure BDA0003545646830000062
are each h1ijAnd h2ijThe first entity representation and the second entity representation of the r-th entity in the corresponding entity representation vector, for example, the field "beijing" composed of the first character and the second character in statement 1 is the first entity of statement 1, then,
Figure BDA0003545646830000063
second entity characterization for the t-th entity in the s-th statement in statement data table i except for statement j, e.g., "dunghong", "hong kong", and "hong kong chinese university" in statement 2; tau is a temperature over-parameter;
Figure BDA0003545646830000064
to represent
Figure BDA0003545646830000065
And
Figure BDA0003545646830000066
the cosine similarity between the two signals is determined,
Figure BDA0003545646830000067
to represent
Figure BDA0003545646830000068
And
Figure BDA0003545646830000069
In the embodiment of the present invention, the loss function Loss1 is used to make the corresponding entity representations in the two encoding results similar: for example, the two entity representations of "Cai Yuanpei" obtained from the two encodings of sentence 1 should be as similar as possible, while being far from the entity representations of the other entities in the current sentence data table.
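The formula for Loss1 survives only as an image in this text; assuming the standard InfoNCE form that the surrounding definitions point to (the two encodings of the same entity form the positive pair, and the second encodings of the other entities in the same sentence data table serve as negatives), a plausible reconstruction is:

$$\mathrm{Loss1}^{ij}_r = -\log\frac{\exp\!\big(\mathrm{sim}(e1^{ij}_r,\,e2^{ij}_r)/\tau\big)}{\exp\!\big(\mathrm{sim}(e1^{ij}_r,\,e2^{ij}_r)/\tau\big)+\sum_{s\neq j}\sum_{t}\exp\!\big(\mathrm{sim}(e1^{ij}_r,\,e2^{is}_t)/\tau\big)}$$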
S203, obtaining
Figure BDA00035456468300000610
Wherein, B1iThe entity representation of any entity with the same type as the entity corresponding to the r-th entity representation except the statement j in the statement data table i,
Figure BDA00035456468300000611
the first entity representation is the entity q which is different in entity type corresponding to the r-th entity representation in the p-th statement except the statement j in the statement data table i.
In the embodiment of the present invention, the loss function Loss2 is used to make entity representations of the same type similar and those of different types far apart within the current sentence data table: for example, the person-type entity "Cai Yuanpei" should be similar to "Qiu Chengtong" and far from entities of other types in the current sentence data table.
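Loss2 likewise appears only as an image; assuming the supervised-contrastive form suggested by the definitions of $B1'_i$ (a same-type positive representation) and $e1^{ip}_q$ (different-type negatives), a plausible reconstruction is:

$$\mathrm{Loss2}^{ij}_r = -\log\frac{\exp\!\big(\mathrm{sim}(e1^{ij}_r,\,B1'_i)/\tau\big)}{\exp\!\big(\mathrm{sim}(e1^{ij}_r,\,B1'_i)/\tau\big)+\sum_{p\neq j}\sum_{q}\exp\!\big(\mathrm{sim}(e1^{ij}_r,\,e1^{ip}_q)/\tau\big)}$$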
S204, optimizing the tau and the dropout in the pre-training language model to ensure that Loss1k ijAnd Loss2k ijAnd minimum.
Those skilled in the art will appreciate that optimizing $\tau$ and the dropout in the pre-trained language model so as to minimize $\mathrm{Loss1}^{ij}_r$ and $\mathrm{Loss2}^{ij}_r$ may be implemented using the prior art.
τ and dropout after the first stage optimization can be obtained through S204.
S205, setting j ═ j + 1; if j is less than or equal to Pi, executing S2; otherwise, S3 is executed.
And S3, enumerating the segments of each sentence, and randomly extracting a set number of non-entity segments as negative samples, to obtain a training set comprising N training samples.
In the embodiment of the present invention, the set number may be chosen based on actual needs. Specifically, each training sample is a sentence together with its positive samples and negative samples, the positive samples being the entities of the sentence. For example, the enumerated segments of sentence 1, "Peking University president Cai Yuanpei", may include: "Beijing", "Peking University", "president", "Cai Yuanpei", "Peking University president", and so on. The positive samples are: "Beijing", "Peking University", "Cai Yuanpei"; negative samples such as "Jingda" can be randomly drawn from the remaining segments.
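A minimal sketch of this enumeration and negative sampling; the sentence, the entity span indices, and the sample count are illustrative assumptions:

import random

def build_training_sample(sentence, entity_spans, num_negatives):
    # entity_spans: set of (start, end) inclusive character indices of the
    # labeled entities (the positive samples). A sentence of n characters
    # yields n * (n + 1) / 2 candidate segments.
    n = len(sentence)
    all_spans = [(s, e) for s in range(n) for e in range(s, n)]
    negatives = [sp for sp in all_spans if sp not in entity_spans]
    sampled = random.sample(negatives, min(num_negatives, len(negatives)))
    return {"sentence": sentence,
            "positives": sorted(entity_spans),
            "negatives": sampled}

# Sentence 1: "Beijing" = (0, 1), "Peking University" = (0, 3),
# "Cai Yuanpei" = (6, 8); 9 characters give 45 candidate segments.
sample = build_training_sample("北京大学校长蔡元培", {(0, 1), (0, 3), (6, 8)}, 10)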
And S4, inputting the training set into the optimized pre-trained language model, and classifying the type of each segment in each sentence to obtain a classification prediction result.
The classification prediction result may include a classification result, i.e., a type, for each segment in the training set.
And S5, optimizing the optimized pre-trained language model based on the classification prediction result and the actual entity types in each sentence to obtain a target nested entity classification model.
In an embodiment of the present invention, the optimized pre-trained language model may be optimized based on the F1 score. In the embodiment of the invention, the positive samples in the training set are labeled, i.e., the entity type of each entity is known, so the segments of a sentence that correspond to no entity type are also known, and the classification accuracy can be obtained by comparing the predicted types with the actual types. Those skilled in the art will appreciate that determining classification accuracy based on the F1 score may be implemented using the prior art.
When the classification accuracy is greater than or equal to a set threshold, the current classification model is considered accurate and can therefore be used as the target nested entity classification model; if the classification accuracy is smaller than the set threshold, $\tau$ and the dropout continue to be adjusted until the classification accuracy is greater than or equal to the set threshold.
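A minimal sketch of the accuracy check; exact-match span-level F1 is assumed here, since the patent does not fix the evaluation granularity:

def span_f1(predicted, gold):
    # predicted / gold: sets of (start, end, entity_type) triples.
    if not predicted or not gold:
        return 0.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. span_f1({(0, 1, "ADDR"), (0, 3, "ORG")},
#              {(0, 1, "ADDR"), (0, 3, "ORG"), (6, 8, "PER")}) returns 0.8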
Because S1 and S2 allow the characteristics of the samples to be learned in advance, the proportion of negative samples can be reduced, the convergence of the model is accelerated, the model results are more stable, and the entity boundary regions are scored more distinctly.
And S6, classifying the input sentences by using the target nested entity classification model.
In practical application, the obtained target nested entity classification model can be used for directly classifying the input sentences.
The invention further provides a nested entity classification system based on contrastive learning, comprising a server and a database in communication connection, wherein the server comprises a processor and a memory storing a computer program, and the database stores N sentence data tables, wherein row j of sentence data table i comprises $(X^{ij}, L^{ij})$, with $X^{ij}=(X^{ij}_1, X^{ij}_2, \dots, X^{ij}_{n_{ij}})$, where $X^{ij}_k$ is the k-th character in the j-th sentence of sentence data table i, k ranges from 1 to $n_{ij}$, and $n_{ij}$ is the number of characters in the j-th sentence of sentence data table i; $L^{ij}=\{(E^{ij}_1, T^{ij}_1), (E^{ij}_2, T^{ij}_2), \dots, (E^{ij}_{m_{ij}}, T^{ij}_{m_{ij}})\}$, where $E^{ij}_r$ is the r-th entity in the j-th sentence of sentence data table i and $T^{ij}_r$ is the actual entity type of $E^{ij}_r$, r ranges from 1 to $m_{ij}$, and $m_{ij}$ is the number of entities in the j-th sentence of sentence data table i; i ranges from 1 to N, j ranges from 1 to $P_i$, and $P_i$ is the number of sentences in sentence data table i;
the processor is configured to execute the computer program to perform the steps of:
s10, for the sentence j in the sentence data table i, the following operations are executed:
S101, encoding sentence j twice with the pre-trained language model BERT to obtain a first representation vector $h1^{ij}=(h1^{ij}_1, h1^{ij}_2, \dots, h1^{ij}_{n_{ij}})$ and a second representation vector $h2^{ij}=(h2^{ij}_1, h2^{ij}_2, \dots, h2^{ij}_{n_{ij}})$, respectively, wherein $h1^{ij}_k$ and $h2^{ij}_k$ are the representations obtained from the first and the second encoding of $X^{ij}_k$, respectively;
S102, obtaining the contrastive loss $\mathrm{Loss1}^{ij}_r$ (equation given as an image in the original); wherein $e1^{ij}_r$ and $e2^{ij}_r$ are respectively the first entity representation and the second entity representation of the r-th entity in the entity-representation vectors corresponding to $h1^{ij}$ and $h2^{ij}$, and $e2^{is}_t$ is the second entity representation of the t-th entity in the s-th sentence ($s \neq j$) of sentence data table i; $\tau$ is a temperature hyper-parameter; $\mathrm{sim}(e1^{ij}_r, e2^{ij}_r)$ denotes the cosine similarity between $e1^{ij}_r$ and $e2^{ij}_r$, and $\mathrm{sim}(e1^{ij}_r, e2^{is}_t)$ denotes the cosine similarity between $e1^{ij}_r$ and $e2^{is}_t$;
S103, obtaining the contrastive loss $\mathrm{Loss2}^{ij}_r$ (equation given as an image in the original); wherein $B1'_i$ is an entity representation of any entity, outside sentence j in sentence data table i, of the same type as the entity corresponding to the r-th entity representation, and $e1^{ip}_q$ is the first entity representation of an entity q, in the p-th sentence ($p \neq j$) of sentence data table i, whose type differs from that of the entity corresponding to the r-th entity representation;
S104, optimizing $\tau$ and the dropout in the pre-trained language model so that $\mathrm{Loss1}^{ij}_r$ and $\mathrm{Loss2}^{ij}_r$ are minimized;
S105, setting j = j + 1; if $j \leq P_i$, execute S10; otherwise, execute S20;
S20, enumerating the segments of each sentence, and randomly extracting a set number of non-entity segments as negative samples, to obtain a training set comprising N training samples;
S30, inputting the training set into the optimized pre-trained language model, and classifying the entity types in each sentence to obtain classification prediction results;
and S40, optimizing the optimized pre-trained language model based on the classification prediction results and the actual entity types in each sentence to obtain a target nested entity classification model.
Further, in S40, the optimized pre-trained language model is optimized based on the F1 score.
Further, the pre-trained language model is a BERT model.
Further, the pre-trained language model is a RoBERTa model.
Further, $P_1 = P_2 = \dots = P_N$.
The specific implementation of this embodiment can be seen in the foregoing embodiments.
Although some specific embodiments of the present invention have been described in detail by way of illustration, it should be understood by those skilled in the art that the above illustration is only for the purpose of illustration and is not intended to limit the scope of the invention. It will also be appreciated by those skilled in the art that various modifications may be made to the embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims (10)

1. A nested entity classification method based on contrastive learning, characterized by comprising the following steps:
S1, acquiring input sentence data tables; wherein row j of sentence data table i comprises $(X^{ij}, L^{ij})$, with $X^{ij}=(X^{ij}_1, X^{ij}_2, \dots, X^{ij}_{n_{ij}})$, where $X^{ij}_k$ is the k-th character in the j-th sentence of sentence data table i, k ranges from 1 to $n_{ij}$, and $n_{ij}$ is the number of characters in the j-th sentence of sentence data table i; $L^{ij}=\{(E^{ij}_1, T^{ij}_1), (E^{ij}_2, T^{ij}_2), \dots, (E^{ij}_{m_{ij}}, T^{ij}_{m_{ij}})\}$, where $E^{ij}_r$ is the r-th entity in the j-th sentence of sentence data table i and $T^{ij}_r$ is the actual entity type of $E^{ij}_r$, r ranges from 1 to $m_{ij}$, and $m_{ij}$ is the number of entities in the j-th sentence of sentence data table i; i ranges from 1 to N, j ranges from 1 to $P_i$, and $P_i$ is the number of sentences in sentence data table i; N is the number of sentence data tables;
s2, for the sentence j in the sentence data table i, the following operations are executed:
S201, encoding sentence j twice with a pre-trained language model to obtain a first representation vector $h1^{ij}=(h1^{ij}_1, h1^{ij}_2, \dots, h1^{ij}_{n_{ij}})$ and a second representation vector $h2^{ij}=(h2^{ij}_1, h2^{ij}_2, \dots, h2^{ij}_{n_{ij}})$, respectively, wherein $h1^{ij}_k$ and $h2^{ij}_k$ are the representations obtained from the first and the second encoding of $X^{ij}_k$, respectively;
S202, obtaining the contrastive loss $\mathrm{Loss1}^{ij}_r$ (equation given as an image in the original); wherein $e1^{ij}_r$ and $e2^{ij}_r$ are respectively the first entity representation and the second entity representation of the r-th entity in the entity-representation vectors corresponding to $h1^{ij}$ and $h2^{ij}$, and $e2^{is}_t$ is the second entity representation of the t-th entity in the s-th sentence ($s \neq j$) of sentence data table i; $\tau$ is a temperature hyper-parameter; $\mathrm{sim}(e1^{ij}_r, e2^{ij}_r)$ denotes the cosine similarity between $e1^{ij}_r$ and $e2^{ij}_r$, and $\mathrm{sim}(e1^{ij}_r, e2^{is}_t)$ denotes the cosine similarity between $e1^{ij}_r$ and $e2^{is}_t$;
S203, obtaining the contrastive loss $\mathrm{Loss2}^{ij}_r$ (equation given as an image in the original); wherein $B1'_i$ is an entity representation of any entity, outside sentence j in sentence data table i, of the same type as the entity corresponding to the r-th entity representation, and $e1^{ip}_q$ is the first entity representation of an entity q, in the p-th sentence ($p \neq j$) of sentence data table i, whose type differs from that of the entity corresponding to the r-th entity representation;
S204, optimizing $\tau$ and the dropout in the pre-trained language model so that $\mathrm{Loss1}^{ij}_r$ and $\mathrm{Loss2}^{ij}_r$ are minimized;
S205, setting j = j + 1; if $j \leq P_i$, execute S2; otherwise, execute S3;
S3, enumerating the segments of each sentence, and randomly extracting a set number of non-entity segments as negative samples, to obtain a training set comprising N training samples;
S4, inputting the training set into the optimized pre-trained language model, and classifying the entity types in each sentence to obtain classification prediction results;
S5, optimizing the optimized pre-trained language model based on the classification prediction results and the actual entity types in each sentence to obtain a target nested entity classification model;
and S6, classifying the input sentences by using the target nested entity classification model.
2. The method of claim 1, wherein in S5, the optimized pre-trained language model is optimized based on the F1 score.
3. The method of claim 1, wherein the pre-trained language model is a BERT model.
4. The method of claim 1, wherein the pre-trained language model is a RoBERTa model.
5. The method of claim 1, wherein $P_1 = P_2 = \dots = P_N$.
6. A nested entity classification system based on contrastive learning, characterized by comprising a server and a database in communication connection, wherein the server comprises a processor and a memory storing a computer program, and the database stores N sentence data tables, wherein row j of sentence data table i comprises $(X^{ij}, L^{ij})$, with $X^{ij}=(X^{ij}_1, X^{ij}_2, \dots, X^{ij}_{n_{ij}})$, where $X^{ij}_k$ is the k-th character in the j-th sentence of sentence data table i, k ranges from 1 to $n_{ij}$, and $n_{ij}$ is the number of characters in the j-th sentence of sentence data table i; $L^{ij}=\{(E^{ij}_1, T^{ij}_1), (E^{ij}_2, T^{ij}_2), \dots, (E^{ij}_{m_{ij}}, T^{ij}_{m_{ij}})\}$, where $E^{ij}_r$ is the r-th entity in the j-th sentence of sentence data table i and $T^{ij}_r$ is the actual entity type of $E^{ij}_r$, r ranges from 1 to $m_{ij}$, and $m_{ij}$ is the number of entities in the j-th sentence of sentence data table i; i ranges from 1 to N, j ranges from 1 to $P_i$, and $P_i$ is the number of sentences in sentence data table i;
the processor is configured to execute a computer program to implement the steps of:
s10, for the sentence j in the sentence data table i, the following operations are executed:
S101, encoding sentence j twice with the pre-trained language model BERT to obtain a first representation vector $h1^{ij}=(h1^{ij}_1, h1^{ij}_2, \dots, h1^{ij}_{n_{ij}})$ and a second representation vector $h2^{ij}=(h2^{ij}_1, h2^{ij}_2, \dots, h2^{ij}_{n_{ij}})$, respectively, wherein $h1^{ij}_k$ and $h2^{ij}_k$ are the representations obtained from the first and the second encoding of $X^{ij}_k$, respectively;
S102, obtaining the contrastive loss $\mathrm{Loss1}^{ij}_r$ (equation given as an image in the original); wherein $e1^{ij}_r$ and $e2^{ij}_r$ are respectively the first entity representation and the second entity representation of the r-th entity in the entity-representation vectors corresponding to $h1^{ij}$ and $h2^{ij}$, and $e2^{is}_t$ is the second entity representation of the t-th entity in the s-th sentence ($s \neq j$) of sentence data table i; $\tau$ is a temperature hyper-parameter; $\mathrm{sim}(e1^{ij}_r, e2^{ij}_r)$ denotes the cosine similarity between $e1^{ij}_r$ and $e2^{ij}_r$, and $\mathrm{sim}(e1^{ij}_r, e2^{is}_t)$ denotes the cosine similarity between $e1^{ij}_r$ and $e2^{is}_t$;
S103, obtaining the contrastive loss $\mathrm{Loss2}^{ij}_r$ (equation given as an image in the original); wherein $B1'_i$ is an entity representation of any entity, outside sentence j in sentence data table i, of the same type as the entity corresponding to the r-th entity representation, and $e1^{ip}_q$ is the first entity representation of an entity q, in the p-th sentence ($p \neq j$) of sentence data table i, whose type differs from that of the entity corresponding to the r-th entity representation;
S104, optimizing $\tau$ and the dropout in the pre-trained language model so that $\mathrm{Loss1}^{ij}_r$ and $\mathrm{Loss2}^{ij}_r$ are minimized;
S105, setting j = j + 1; if $j \leq P_i$, execute S10; otherwise, execute S20;
S20, enumerating the segments of each sentence, and randomly extracting a set number of non-entity segments as negative samples, to obtain a training set comprising N training samples;
S30, inputting the training set into the optimized pre-trained language model, and classifying the entity types in each sentence to obtain classification prediction results;
and S40, optimizing the optimized pre-trained language model based on the classification prediction results and the actual entity types in each sentence to obtain a target nested entity classification model.
7. The system of claim 6, wherein in S40, the optimized pre-trained language model is optimized based on the F1 score.
8. The system of claim 6, wherein the pre-trained language model is a BERT model.
9. The system of claim 6, wherein the pre-trained language model is a RoBERTa model.
10. The system of claim 6, wherein $P_1 = P_2 = \dots = P_N$.
CN202210247571.XA 2022-03-14 2022-03-14 Nested entity identification method and system based on contrastive learning Active CN114462391B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210247571.XA CN114462391B (en) 2022-03-14 2022-03-14 Nested entity identification method and system based on contrastive learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210247571.XA CN114462391B (en) 2022-03-14 2022-03-14 Nested entity identification method and system based on contrastive learning

Publications (2)

Publication Number Publication Date
CN114462391A true CN114462391A (en) 2022-05-10
CN114462391B CN114462391B (en) 2024-05-14

Family

ID=81417788

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210247571.XA Active CN114462391B (en) Nested entity identification method and system based on contrastive learning

Country Status (1)

Country Link
CN (1) CN114462391B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753545A (en) * 2020-06-19 2020-10-09 科大讯飞(苏州)科技有限公司 Nested entity recognition method and device, electronic equipment and storage medium
CN112347785A (en) * 2020-11-18 2021-02-09 湖南国发控股有限公司 Nested entity recognition system based on multitask learning
CN112487812A (en) * 2020-10-21 2021-03-12 上海旻浦科技有限公司 Nested entity identification method and system based on boundary identification
CN113869051A (en) * 2021-09-22 2021-12-31 西安理工大学 Named entity identification method based on deep learning
CN113886571A (en) * 2020-07-01 2022-01-04 北京三星通信技术研究有限公司 Entity identification method, entity identification device, electronic equipment and computer readable storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753545A (en) * 2020-06-19 2020-10-09 科大讯飞(苏州)科技有限公司 Nested entity recognition method and device, electronic equipment and storage medium
CN113886571A (en) * 2020-07-01 2022-01-04 北京三星通信技术研究有限公司 Entity identification method, entity identification device, electronic equipment and computer readable storage medium
WO2022005188A1 (en) * 2020-07-01 2022-01-06 Samsung Electronics Co., Ltd. Entity recognition method, apparatus, electronic device and computer readable storage medium
CN112487812A (en) * 2020-10-21 2021-03-12 上海旻浦科技有限公司 Nested entity identification method and system based on boundary identification
CN112347785A (en) * 2020-11-18 2021-02-09 湖南国发控股有限公司 Nested entity recognition system based on multitask learning
CN113869051A (en) * 2021-09-22 2021-12-31 西安理工大学 Named entity identification method based on deep learning

Also Published As

Publication number Publication date
CN114462391B (en) 2024-05-14

Similar Documents

Publication Publication Date Title
CN108121700B (en) Keyword extraction method and device and electronic equipment
CN111553164A (en) Training method and device for named entity recognition model and computer equipment
CN112487812B (en) Nested entity identification method and system based on boundary identification
US11526663B2 (en) Methods, apparatuses, devices, and computer-readable storage media for determining category of entity
CN108228758A (en) A kind of file classification method and device
CN111599340A (en) Polyphone pronunciation prediction method and device and computer readable storage medium
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN112100377B (en) Text classification method, apparatus, computer device and storage medium
CN109948160B (en) Short text classification method and device
CN113836925B (en) Training method and device for pre-training language model, electronic equipment and storage medium
CN112613293B (en) Digest generation method, digest generation device, electronic equipment and storage medium
CN111414746A (en) Matching statement determination method, device, equipment and storage medium
CN114036950B (en) Medical text named entity recognition method and system
CN112541332A (en) Form information extraction method and device, electronic equipment and storage medium
CN115048505A (en) Corpus screening method and device, electronic equipment and computer readable medium
CN110929532B (en) Data processing method, device, equipment and storage medium
CN115730590A (en) Intention recognition method and related equipment
CN110969005B (en) Method and device for determining similarity between entity corpora
CN114742016A (en) Chapter-level event extraction method and device based on multi-granularity entity differential composition
CN117033961A (en) Multi-mode image-text classification method for context awareness
CN115080748B (en) Weak supervision text classification method and device based on learning with noise label
CN116720498A (en) Training method and device for text similarity detection model and related medium thereof
CN114462391A (en) Nested entity identification method and system based on contrastive learning
CN115713082A (en) Named entity identification method, device, equipment and storage medium
CN113688633A (en) Outline determination method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant