CN113468887A - Student information relation extraction method and system based on boundary and segment classification - Google Patents
- Publication number
- CN113468887A (application number CN202110685661.2A)
- Authority
- CN
- China
- Prior art keywords
- entity
- text
- word
- module
- boundary
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
Abstract
The invention provides a scholar information relation extraction method based on boundary and segment classification, comprising the following steps: step S1: acquiring the personal information and text content of different teachers; step S2: replacing entity words in the text with similar entity words to augment the training data; step S3: embedding the text with a pre-training model and extracting semantic features; step S4: recognizing subject word boundaries and classifying entity segments; step S5: recognizing object word boundaries with their corresponding relations and classifying entity segments; step S6: building user profiles from the recognition and classification results. By applying the idea of probabilistic graphs combined with a half-pointer, half-tagging scheme, the method handles the cases in relation extraction where one subject word corresponds to several object words and where the same entity pair holds different relations. Using boundaries to strengthen entity segment classification reduces the impact of tail-pointer prediction errors and improves the accuracy of entity relation extraction.
Description
Technical Field
The invention relates to the technical field of machine learning and natural language processing, in particular to a student information relation extraction method and system based on boundary and segment classification.
Background
Scholar information such as names, email addresses, titles, personal homepages, education history, and work history must be extracted from text. Most scholar information comes from personal homepages or introductory web pages (Baidu encyclopedia entries, school teacher directories), so the sources are few, the data are noisy, and the information is highly redundant. Moreover, the html files obtained by preprocessing scholar information pages differ grammatically from natural text in the traditional sense, which makes hand-written extraction rules difficult to apply, while manually extracting scholar information from web pages is labor-intensive and inefficient. Extracting information from web-page text with the entity relation extraction techniques of natural language processing is therefore very important.
Entity relation extraction is an important branch of the information extraction field and comprises two subtasks: entity recognition and relation extraction, that is, recognizing entities in natural text, extracting the relation between each entity pair, and finally forming relation triples <s, p, o>, where s represents a subject word (subject), p represents a predicate, that is, a relation (predicate), and o represents an object word (object). An entity refers to concepts such as time, place, person, and organization in the text; a relation refers to a semantic relationship between entities.
At present, entity relation extraction mainly adopts neural network models in two realization modes: 1. the pipeline form; 2. the joint extraction form. In the pipeline method, relation extraction between entities is carried out directly after entity recognition finishes. Although pipeline learning is easy to implement and the two extraction models are highly flexible (the entity recognition model and the relation extraction model can use independent data sets), errors of the entity recognition model degrade the relation extraction model, producing error propagation; the entity recognition model also yields many redundant entities, increasing the difficulty and complexity of the subsequent relation extraction task; and pipeline learning ignores the connection between the two tasks. Researchers therefore fuse named entity recognition and relation extraction into one task for joint learning, which mitigates error propagation to a certain degree, merges the two tasks into one model, increases the efficiency of model learning and prediction, and improves model robustness. In recent years, the Transformer model has greatly improved the precision of natural language processing tasks in different fields, and fine-tuning a Transformer pre-trained model reduces the amount of data needed to train the model.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a student information relation extraction method and system based on boundary and segment classification.
The invention provides a scholar information relation extraction method based on boundary and segment classification, comprising the following steps:
step S1: acquiring the personal information and text content of different teachers;
step S2: replacing entity words in the text with similar entity words to augment the training data;
step S3: embedding the text with a pre-training model and extracting semantic features;
step S4: recognizing subject word boundaries and classifying entity segments;
step S5: recognizing object word boundaries with their corresponding relations and classifying entity segments;
step S6: building user profiles from the recognition and classification results.
Preferably, the step S1 includes the steps of:
step S101: acquiring text contents of personal information of different teachers from the Internet according to a teacher list;
step S102: extracting text content in the html file obtained in the step S101, deleting html tags, and obtaining a whole plain text file of the personal information of the teacher;
step S103: dividing the text into sentences according to Chinese full stops and a maximum sentence-length threshold;
step S104: and constructing an entity relation extraction data set by the labeled text.
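As a rough illustration of step S103, splitting on Chinese full stops with a maximum-length threshold might be sketched as follows (the threshold value and helper name are assumptions, not from the patent):

```python
def split_sentences(text: str, max_len: int = 128) -> list:
    """Split text on Chinese full stops (U+3002), then hard-wrap any
    piece that still exceeds max_len (hypothetical threshold)."""
    pieces = [p + "\u3002" for p in text.split("\u3002") if p]
    sentences = []
    for p in pieces:
        # Hard-wrap overlong pieces at the length threshold.
        while len(p) > max_len:
            sentences.append(p[:max_len])
            p = p[max_len:]
        sentences.append(p)
    return sentences
```

In practice the threshold would be chosen to fit the pre-training model's input limit.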
Preferably, the step S2 includes the steps of:
step S201: constructing an entity library with the entity recognition model BERT-CRF, with entity categories including time, organization, name, college, degree, job, and title;
step S202: performing entity recognition on the sentence obtained in the step S103 through an entity recognition model BERT-CRF;
step S203: and performing similar entity replacement on the entity of each sentence obtained in the step S202 to amplify the data set, wherein the replaced entity is from the entity library constructed in the step S201.
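A minimal sketch of the augmentation in steps S201–S203 (the entity-library contents and the seeded generator are illustrative assumptions; the patent populates the library with a BERT-CRF recognizer):

```python
import random

# Hypothetical entity library keyed by category (step S201 builds
# this with BERT-CRF; here it is hand-filled for illustration).
ENTITY_LIB = {
    "degree": ["bachelor", "master", "doctor"],
    "time": ["2012.10", "2002"],
}

def augment(sentence: str, entities: list, rng: random.Random) -> str:
    """Replace each recognized (surface, category) entity with a random
    same-category entity from the library (step S203)."""
    for surface, category in entities:
        candidates = [e for e in ENTITY_LIB.get(category, []) if e != surface]
        if candidates:
            sentence = sentence.replace(surface, rng.choice(candidates))
    return sentence
```

Each augmented copy of a sentence keeps its triple labels, only with the entity surfaces swapped.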
Preferably, the step S3 includes the steps of:
step S301: embedding the text by using a BERT pre-training model and extracting semantic features;
step S302: text matrix T ═ T obtained by segmenting text1,t2,···,tnN represents the character length after sentence segmentation; t is tnRepresenting an nth text participle;
step S303: the input embedding vector for each character comes from the sum of the values X ═ X { X } of the character embedding, the position embedding, and the character type embedding1,x2,···,xnThe method is used as the input of a BERT pre-training model to obtain a coding vector H ═ H containing semantic information1,h2,···,hnIn which xnRepresents the nth sum value; h isnRepresenting the nth code vector value.
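Step S303's input construction can be sketched with plain NumPy (the vocabulary size, dimensions, and lookup tables are illustrative stand-ins; in the patent these embeddings belong to the BERT pre-training model):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, dim, n_types = 100, 16, 8, 2

# Hypothetical embedding tables standing in for BERT's.
char_emb = rng.normal(size=(vocab_size, dim))
pos_emb = rng.normal(size=(max_len, dim))
type_emb = rng.normal(size=(n_types, dim))

def embed(char_ids, type_ids):
    """x_i = character embedding + position embedding + type embedding."""
    n = len(char_ids)
    return (char_emb[char_ids]
            + pos_emb[np.arange(n)]
            + type_emb[type_ids])

X = embed([5, 7, 9], [0, 0, 0])  # X = {x1, ..., xn}, fed to BERT
```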
Preferably, the step S4 includes the steps of:
step S401: the encoding vectors obtained in step S303 are used as Linear layer input, and a sigmoid activation function gives the probability that each token is a head or tail pointer, with the formulas:

p_i^start_s = σ(W_start · h_i + b_start)
p_i^end_s = σ(W_end · h_i + b_end)

where σ is the sigmoid function; p_i^start_s and p_i^end_s are the probabilities that the i-th character is a head pointer and a tail pointer of the subject word; start and end denote the head and tail pointers, and the subscript s denotes the subject word; W_start and W_end are the trainable weight matrices for predicting the head and tail pointer probabilities; b_start and b_end are the corresponding biases; h_i is the encoding vector of the i-th character obtained in step S303; if a probability exceeds a set threshold, the corresponding token tag is marked 1, otherwise 0;
step S402: traverse all tokens whose head pointer is marked 1 (position y_start^s in the sequence), find the first tail pointer after each head pointer (position y_end^s), and take the corresponding sub-sequence of H, denoted H_sub, where H is the encoding vector from step S303 and the subscript sub denotes the subject word sequence;
step S403: the subject word sequence H_sub obtained in step S402 is fed into a bidirectional LSTM to obtain the subject word encoding vector s_sub, with the formula:

s_sub = LSTM(H_sub)

where LSTM denotes a bidirectional long short-term memory recurrent neural network;
step S404: the subject word encoding vector s_sub obtained in step S403 is used as Linear layer input, and a softmax activation function gives the probability that the subject word sequence is each entity type, with the formula:

P_s = softmax(W^T · s_sub + b)

where P_s is the probability distribution over entity classes for the subject word sequence, s denotes the subject word, the matrix W ∈ R^(d×k), d is the subject word encoding dimension, k is the number of entity classes, and b is the bias; if a class probability exceeds a set threshold, that entity class is marked 1, otherwise 0.
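A NumPy sketch of step S401's head/tail pointer scoring (the weights and the 0.5 threshold are assumed for illustration; the patent leaves the threshold value open):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pointer_probs(H, W_start, b_start, W_end, b_end, thresh=0.5):
    """p_i^start = sigmoid(W_start . h_i + b_start), likewise for end;
    returns 0/1 pointer tags per token (step S401's thresholding)."""
    p_start = sigmoid(H @ W_start + b_start)
    p_end = sigmoid(H @ W_end + b_end)
    return (p_start > thresh).astype(int), (p_end > thresh).astype(int)
```

Here `H` has shape (sequence length, encoding dimension) and each weight vector has shape (encoding dimension,).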
Preferably, the step S5 includes:
step S501: the encoding vector s_sub of each subject word predicted as an entity in step S404 is added to the vector H and used as Linear layer input, and a sigmoid activation function gives the probability that each token is a head or tail pointer of the corresponding object word, with the formulas:

p_i^start_o^r = σ(W_start^r · (h_i + s_sub) + b_start^r)
p_i^end_o^r = σ(W_end^r · (h_i + s_sub) + b_end^r)

where r denotes a subject-object relation; p_i^start_o^r and p_i^end_o^r are the probabilities that the i-th character in the sequence is a head or tail pointer of an object word; W_start^r and W_end^r are trainable weight matrices for predicting the head and tail pointers of relation r; b_start^r and b_end^r are the biases for predicting the head and tail pointers of relation r;
step S502: for each relation r, traverse all tokens whose head pointer is marked 1 (position y_start^o_r in the sequence), find the first tail pointer after each head pointer (position y_end^o_r), and predict the object word sequence of relation r for the subject word, denoted H_obj^r;
step S503: the object word sequence obtained in step S502 is fed into the bidirectional LSTM to obtain the object word encoding vector, with the formula:

s_obj^r = LSTM(H_obj^r)

where the LSTM shares parameters with the LSTM in step S403; s_obj^r is the object word encoding vector for relation r of the subject word, and obj denotes the object word;
step S504: the object word encoding vector s_obj^r obtained in step S503 is used as Linear layer input, and a softmax activation function gives the probability that the object word sequence is each entity type, with the formula:

P_o = softmax(W^T · s_obj^r + b)

where P_o is the probability distribution over entity classes for the object word sequence, the matrix W ∈ R^(d×k), d is the encoding dimension, k is the number of entity classes, and the parameters are shared with step S404; if a class probability exceeds a set threshold, that entity class is marked 1, otherwise 0.
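Step S501's conditioning on the subject word amounts to adding the subject encoding to every token vector before the relation-specific pointer layer. A sketch for a single relation r (shapes and weights are illustrative assumptions):

```python
import numpy as np

def object_pointer_probs(H, s_sub, W_start_r, b_start_r, W_end_r, b_end_r):
    """For one relation r: score each token (h_i + s_sub) with the
    relation's own pointer weights, as in step S501."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    cond = H + s_sub  # broadcast the subject encoding onto every token
    return (sigmoid(cond @ W_start_r + b_start_r),
            sigmoid(cond @ W_end_r + b_end_r))
```

In the full model this is repeated for every relation r, each with its own W_start^r, W_end^r pair.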
The invention also provides a scholar information relation extraction system based on boundary and segment classification, comprising the following modules:
module M1: acquiring the personal information and text content of different teachers;
module M2: replacing entity words in the text with similar entity words to augment the training data;
module M3: embedding the text with a pre-training model and extracting semantic features;
module M4: recognizing subject word boundaries and classifying entity segments;
module M5: recognizing object word boundaries with their corresponding relations and classifying entity segments;
module M6: building user profiles from the recognition and classification results.
Preferably, the module M1 includes the following modules:
a module M101: acquiring text contents of personal information of different teachers from the Internet according to a teacher list;
the module M102: extracting text contents in the html file obtained in the module M101, deleting html tags, and obtaining a whole plain text file of the personal information of the teacher;
the module M103: dividing the text into sentences according to Chinese full stops and a maximum sentence-length threshold;
the module M104: and constructing an entity relation extraction data set by the labeled text.
Preferably, the module M2 includes the following modules:
module M201: constructing an entity library with the entity recognition model BERT-CRF, with entity categories including time, organization, name, college, degree, job, and title;
the module M202: carrying out entity recognition on the sentence obtained in the module M103 through an entity recognition model BERT-CRF;
module M203: and performing similar entity replacement on the entity of each sentence obtained by the module M202 to amplify the data set, wherein the replaced entity is from the entity library constructed by the module M201.
Preferably, the module M3 includes the following modules:
module M301: embedding the text by using a BERT pre-training model and extracting semantic features;
the module M302: the text is segmented into a text matrix T = {t1, t2, ..., tn}, where n is the character length of the segmented sentence and tn is the n-th text token;
module M303: the input embedding vector of each character is the sum X = {x1, x2, ..., xn} of its character embedding, position embedding, and character type embedding; X is fed into the BERT pre-training model to obtain the semantic encoding vectors H = {h1, h2, ..., hn}, where xn is the n-th summed embedding vector and hn is the n-th encoding vector.
Compared with the prior art, the invention has the following beneficial effects:
1. The method uses the trained entity recognition model BERT-CRF to construct an entity library for data augmentation of the relation extraction model, improving data-set diversity and reducing the cost of manual data-set construction;
2. the invention applies the probabilistic-graph idea to convert the joint probability distribution p(s, p, o) into the conditional distribution p(p, o | s) and, with a half-pointer, half-tagging scheme, handles the cases in relation extraction where one subject word corresponds to several object words, a subject-object pair holds several relations, and one object word corresponds to several subject words;
3. the invention adopts entity segment classification to address overlong entity predictions caused by missing tail pointers during pointer-network prediction.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flow chart of the relationship extraction model prediction of the present invention;
FIG. 2 is a flow chart of data set amplification according to the present invention;
FIG. 3 is a diagram illustrating a segment classification relation extraction model based on boundary enhancement according to the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but do not limit it in any way. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention; all of these fall within the scope of the present invention.
Referring to fig. 1, the invention provides a scholar information relation extraction method based on boundary and segment classification, comprising the following steps:
Step S1: acquire the text content of different teachers' personal information from Baidu encyclopedia pages and teacher directories. Step S1 includes crawling the scholar information, preprocessing the text, and constructing a basic data set. Step S101: acquire the text content of different teachers' personal information from the Internet according to a teacher list. Step S102: extract the text content of the html file obtained in step S101 (deleting the html tags) to obtain a plain-text file of the teacher's personal information. Step S103: divide the text into sentences according to Chinese full stops and a maximum sentence-length threshold. Step S104: manually label the text to construct an entity relation extraction data set. The data format is as follows:
{"text": "2012.12 obtained the agronomic Master academic degree at the college of the institute of Polymer chemistry, university of Mediterranean China.",
 "spo_list": [
   ["agricultural Master degree", 26, "Master degree-time", "2012.12", 0],
   ...
 ]}
The "text" field is the text obtained from the web page, and the "spo_list" field records the <s, p, o> triples in the text. Each inner list is one triple: the first element is the subject word, the second is the position of the subject word's first character in the text, the third is the relation between the subject word and the object word, the fourth is the object word, and the last is the position of the object word's first character in the text.
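Reading triples out of a record in this format might look like the following (the field names follow the sample above; the helper name is an assumption):

```python
record = {
    "text": "2012.12 obtained the agricultural Master degree ...",
    "spo_list": [
        ["agricultural Master degree", 26, "Master degree-time", "2012.12", 0],
    ],
}

def to_triples(rec):
    """Convert each 5-element spo_list entry to an <s, p, o> triple
    with the character offsets of subject and object."""
    triples = []
    for s, s_pos, p, o, o_pos in rec["spo_list"]:
        triples.append({"subject": s, "predicate": p, "object": o,
                        "subject_offset": s_pos, "object_offset": o_pos})
    return triples
```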
Step S2: referring to fig. 2, the entity words in the text are replaced with similar entity words to augment the training data. Step S2 includes constructing an entity library, augmenting the data with the entity library, and splitting the training and test sets. Step S201: construct an entity library with the entity recognition model BERT-CRF (entity categories include time, organization, name, college, degree, job, and title). The format is as follows:
{"time": ["2012-10", "2002", "2012.10", ...],
 "degree": ["bachelor", "Master", "doctor", ...],
 ...}
The library comprises the fields time, organization, name, college, degree, job, and title; each field records the entities of that category.
Step S202: performing entity recognition on the sentence obtained in the step S103 through an entity recognition model BERT-CRF (the entity type is the same as the entity type in the entity library in the step S201); step S203: and performing similar entity replacement on the entity of each sentence obtained in the step S202 to amplify the data set, wherein the replaced entity is from the entity library constructed in the step S201.
Step S3: referring to fig. 3, word embedding is performed on the text and semantic features are extracted with a BERT pre-training model. Step S3 includes loading a pre-trained BERT model to embed the text and obtain word vectors. Step S301: the text is segmented into a text matrix T = {t1, t2, ..., tn}, where n is the character length of the segmented sentence; the input embedding vector of each character is the sum X = {x1, x2, ..., xn} of its character embedding, position embedding, and character type embedding, which is fed into the BERT pre-training model to obtain the semantic encoding vectors H = {h1, h2, ..., hn}.
Step S4: recognize subject word boundaries and classify entity segments. Step S4 includes predicting the subject head and tail pointers and classifying the resulting entity sequences.
Step S401: the encoding vectors H obtained in step S3 are used as Linear layer input, and a sigmoid activation function gives the probability that each token is a head or tail pointer:

p_i^start_s = σ(W_start · h_i + b_start)
p_i^end_s = σ(W_end · h_i + b_end)

where σ is the sigmoid function; p_i^start_s and p_i^end_s are the probabilities that the i-th character is a head and a tail pointer of the subject word; start and end denote the head and tail pointers, and s denotes the subject word; W_start and W_end are trainable weight matrices for predicting the head and tail pointer probabilities; b_start and b_end are the corresponding biases; h_i is the encoding vector of the i-th character. If a probability exceeds a set threshold, the corresponding token tag is marked 1, otherwise 0.
Step S402: traverse all tokens whose head pointer is marked 1 (position y_start^s in the sequence), find the first tail pointer after each head pointer (position y_end^s), and take the corresponding sub-sequence of H, denoted H_sub. The head and tail pointer prediction adopts a binary cross-entropy loss:

L_sub = -(1/n) Σ_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]

where L_sub is the subject word boundary loss, y_i indicates whether the i-th character is labeled a head or tail pointer, and p_i is the head or tail pointer probability predicted by the model for the i-th character;
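The span pairing rule of step S402 (each head pointer takes the first tail pointer at the same position or later) can be sketched as:

```python
def pair_spans(head_tags, tail_tags):
    """Pair each head pointer (tag 1) with the first tail pointer at
    the same position or later; returns (start, end) index pairs."""
    spans = []
    for i, h in enumerate(head_tags):
        if h != 1:
            continue
        for j in range(i, len(tail_tags)):
            if tail_tags[j] == 1:
                spans.append((i, j))
                break
    return spans
```

A head pointer with no following tail pointer yields no span, which is exactly the failure case that the segment classification of step S404 is meant to catch.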
step S403: the subject word sequence obtained in step S402 is fed into a bidirectional LSTM (long short-term memory recurrent neural network) to obtain the subject word encoding vector s_sub, with the formula:
ssub=LSTM(Hsub)
wherein LSTM represents a bidirectional long-short time cyclic neural network;
step S404: the subject word encoding vector s_sub obtained in step S403 is used as Linear layer input, and a softmax activation function gives the probability that the subject word sequence is each entity type, with the formula:

P_s = softmax(W^T · s_sub + b)

where P_s is the probability distribution over entity classes for the subject word sequence, s denotes the subject word, the matrix W ∈ R^(d×k), d is the subject word encoding dimension, k is the number of entity classes, and b is the bias. If a class probability exceeds a set threshold, that entity class is marked 1, otherwise 0. The subject word entity classification adopts a cross-entropy loss, with the formula:

L_sub^cls = -(1/n) Σ_{i=1}^n Σ_j y_ij log P_ij

where class denotes the true entity type (y_ij = 1 when the true class of segment i is j, else 0), L_sub^cls is the subject word segment classification loss, n is the number of subject word segments, i indexes the i-th entity segment, and j indexes the subject word type label.
Step S5: identify object word boundaries with their corresponding relations and classify the entity segments. Step S5 includes predicting the object word boundaries for each relation and predicting the entity type of the object words.
Step S501: the encoding vector s_sub of each subject word predicted as an entity in step S404 is added to the vector H and used as Linear layer input, and a sigmoid activation function gives the probability that each token is a head or tail pointer of an object word under relation r, with the formulas:

p_i^start_o^r = σ(W_start^r · (h_i + s_sub) + b_start^r)
p_i^end_o^r = σ(W_end^r · (h_i + s_sub) + b_end^r)

where r denotes a subject-object relation; p_i^start_o^r and p_i^end_o^r are the probabilities that the i-th character in the sequence is a head or tail pointer of an object word; W_start^r and W_end^r are trainable weight matrices for predicting the head and tail pointers of relation r; b_start^r and b_end^r are the corresponding biases.
Step S502: for each relation r, traverse all tokens whose head pointer is marked 1 (position y_start^o_r in the sequence), find the first tail pointer after each head pointer (position y_end^o_r), and predict the object word sequence of relation r for the subject word, denoted H_obj^r. The object word head and tail pointer prediction adopts the same binary cross-entropy loss as the pointer network:

L_obj = -(1/n) Σ_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]

where L_obj is the object word boundary loss, y_i indicates whether the i-th character is labeled a head or tail pointer, and p_i is the head or tail pointer probability predicted by the model for the i-th character;
step S503: the object word sequence obtained in step S502 is fed into the bidirectional LSTM to obtain the object word encoding vector, with the formula:

s_obj^r = LSTM(H_obj^r)

where the LSTM shares parameters with the LSTM in step S403; s_obj^r is the object word encoding vector for relation r of the subject word, and obj denotes the object word.
Step S504: the object word encoding vector s_obj^r obtained in step S503 is used as Linear layer input, and a softmax activation function gives the probability that the object word sequence is each entity type, with the formula:

P_o = softmax(W^T · s_obj^r + b)

where P_o is the probability distribution over entity classes for the object word sequence, the matrix W ∈ R^(d×k), d is the encoding dimension, k is the number of entity classes, and the parameters are shared with step S404. If a class probability exceeds a set threshold, that entity class is marked 1, otherwise 0. The object word entity classification adopts a cross-entropy loss, with the formula:

L_obj^cls = -(1/n) Σ_{i=1}^n Σ_j y_ij log P_ij

where class denotes the true entity type, L_obj^cls is the object word segment classification loss, n is the number of object word segments, i indexes the i-th entity segment, and j indexes the object word type label.
Step S505: the loss function of the whole model is the sum of the subject word boundary loss, the object word boundary loss, and the entity classification losses:

L = L_sub + L_obj + L_sub^cls + L_obj^cls

During training, the adaptive learning-rate algorithm Adam is used for parameter optimization.
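Step S505's combined objective, with binary cross-entropy on the boundary pointers and cross-entropy on the segment classes, can be sketched numerically (all inputs are toy values; the unweighted sum is an assumption consistent with the description above):

```python
import numpy as np

def bce(p, y):
    """Binary cross-entropy over pointer probabilities."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def ce(P, labels):
    """Cross-entropy over softmax class distributions."""
    return float(-np.mean(np.log(P[np.arange(len(labels)), labels])))

def total_loss(sub_p, sub_y, obj_p, obj_y, seg_P, seg_labels):
    # L = subject boundary loss + object boundary loss + segment class loss
    return bce(sub_p, sub_y) + bce(obj_p, obj_y) + ce(seg_P, seg_labels)
```

In a real implementation these terms would be computed per batch and backpropagated jointly through the shared encoder.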
Firstly, augmenting the data set with the entity recognition model BERT-CRF greatly increases the diversity of the data set, reduces labor cost, and improves efficiency. Secondly, the invention decomposes the joint probability of the triple into conditional probabilities following the probabilistic-graph idea and, with a half-pointer, half-tagging scheme, handles the cases in relation extraction where one subject word corresponds to several object words, one object word corresponds to several subject words, and a subject-object pair holds several relations. Finally, entity segment classification mitigates overlong entity predictions caused by wrongly predicted tail pointers in the pointer network. Moreover, boundary recognition attends to the context of an entity while entity classification attends to the entity itself, so learning the two jointly improves prediction accuracy. Building scholar profiles from the extracted entities and relations helps users follow scholar information more efficiently, and the accurate, concise information supports analysis of scholars' education and career migration graphs. The approach also transfers readily to information extraction in other fields.
Step S6: build user profiles from the recognition and classification results.
The invention also provides a scholar information relation extraction system based on boundary and segment classification, comprising the following modules: module M1: acquiring the text content of different teachers' personal information from Baidu encyclopedia pages and teacher directories; module M101: acquiring the text content of different teachers' personal information from the Internet according to a teacher list; module M102: extracting the text content of the html file obtained by module M101 and deleting the html tags to obtain a plain-text file of the teacher's personal information; module M103: dividing the text into sentences according to Chinese full stops and a maximum sentence-length threshold; module M104: constructing an entity relation extraction data set from the labeled text.
Module M2: replacing entity words in the text with similar entity words to augment the training data; module M201: constructing an entity library with the entity recognition model BERT-CRF, with entity categories including time, organization, name, college, degree, job, and title; module M202: performing entity recognition on the sentences obtained by module M103 with the entity recognition model BERT-CRF; module M203: performing similar-entity replacement on the entities of each sentence obtained by module M202 to augment the data set, the replacement entities coming from the entity library constructed by module M201.
Module M3: embedding the text with a BERT pre-training model and extracting semantic features; Module M301: segmenting the text into a text matrix T = {t1, t2, …, tn}, where n is the character length after sentence segmentation;
the module M302: the input embedding vector of each character is the sum X = {x1, x2, …, xn} of its character embedding, position embedding, and character-type embedding; X is used as the input of the BERT pre-training model to obtain the encoding vectors H = {h1, h2, …, hn} containing semantic information.
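A minimal sketch of module M302's input construction, with tiny random tables standing in for BERT's learned character, position, and type embedding matrices (all names and sizes here are illustrative):

```python
import numpy as np

def bert_input_embeddings(token_ids, type_ids, tok_emb, pos_emb, typ_emb):
    """Module M302: each character's input vector x_i is the element-wise
    sum of its character (token) embedding, position embedding, and
    character-type embedding.  The tables are plain arrays standing in
    for BERT's learned parameters."""
    X = np.stack([
        tok_emb[token_ids[i]] + pos_emb[i] + typ_emb[type_ids[i]]
        for i in range(len(token_ids))
    ])
    return X  # shape (n, d); fed to the BERT encoder to obtain H

# Tiny illustrative tables: vocab 5, max length 4, 2 type ids, d = 3.
rng = np.random.default_rng(0)
tok_emb = rng.normal(size=(5, 3))
pos_emb = rng.normal(size=(4, 3))
typ_emb = rng.normal(size=(2, 3))
X = bert_input_embeddings([1, 4, 2], [0, 0, 0], tok_emb, pos_emb, typ_emb)
```

In the real model the summed vectors X pass through the BERT encoder stack to produce the contextual encodings H used by modules M4 and M5.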
Module M4: recognizing subject-word boundaries and classifying entity segments;
module M5: recognizing object-word boundaries and the corresponding relation boundaries and classifying entity segments.
Module M6: constructing a user portrait according to the recognition and classification results.
The method constructs an entity library with the trained entity-recognition model BERT-CRF and uses it for data augmentation of the relation-extraction model, which improves data-set diversity and reduces the cost of manually constructing a data set; the invention uses the probabilistic-graph idea to convert the joint probability distribution p(s, p, o) into the conditional probability distribution p(p, o | s) and, with a half-pointer, half-label scheme, handles the cases in relation extraction where one subject word corresponds to several object words, a subject word and object word correspond to several relations, or one object word corresponds to several subject words; the invention adopts entity-segment classification to remedy over-long predicted entities caused by missed tail pointers in the pointer network.
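Written out, the decomposition referred to above is the standard chain-rule factorization (notation as in the text: s subject, p relation, o object):

```latex
% Chain-rule factorization used by the cascade decoder:
% first tag subject spans, then tag (relation, object) pairs
% conditioned on each predicted subject.
p(s, p, o) = p(s)\,\cdot\, p(p, o \mid s)
```

At inference time this corresponds to step S4 (predict subjects) followed by step S5 (predict relations and objects conditioned on each predicted subject).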
Those skilled in the art will appreciate that, in addition to being implemented as pure computer-readable program code, the system and its devices, modules and units provided by the invention can be implemented entirely by logically programming the method steps in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. The system and its devices, modules and units can therefore be regarded as a hardware component, and the devices, modules and units it contains for realizing the various functions can likewise be regarded as structures within that hardware component; means, modules and units for performing the various functions may also be regarded as structures in both software modules and hardware components for performing the method.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.
Claims (10)
1. A scholar-information relation extraction method based on boundary and segment classification, characterized by comprising the following steps:
step S1: acquiring personal information and text contents of different teachers;
step S2: carrying out similar entity word replacement and amplification training data on entity words in the text;
step S3: embedding the text by using a pre-training model and extracting semantic features;
step S4: recognizing subject-word boundaries and classifying entity segments;
step S5: recognizing object-word boundaries and the corresponding relation boundaries and classifying entity segments;
step S6: constructing a user portrait according to the recognition and classification results.
2. The method for extracting scholars' information relationship based on boundary and segment classification as claimed in claim 1, wherein said step S1 includes the following steps:
step S101: acquiring text contents of personal information of different teachers from the Internet according to a teacher list;
step S102: extracting the text content of the html files obtained in step S101 and deleting the html tags to obtain a plain-text file of the teacher's personal information;
step S103: splitting the text into sentences according to the Chinese full stop and a maximum sentence-length threshold;
step S104: constructing an entity-relation extraction data set from the labeled text.
3. The method for extracting scholars' information relationship based on boundary and segment classification as claimed in claim 1, wherein said step S2 includes the following steps:
step S201: constructing an entity library through the entity-recognition model BERT-CRF, the entity categories comprising time, institution, person name, college or department, academic degree, position, and title;
step S202: performing entity recognition on the sentences obtained in step S103 through the entity-recognition model BERT-CRF;
step S203: performing similar-entity replacement on the entities of each sentence obtained in step S202 to augment the data set, the replacement entities coming from the entity library constructed in step S201.
4. The method for extracting scholars' information relationship based on boundary and segment classification as claimed in claim 1, wherein said step S3 includes the following steps:
step S301: embedding the text by using a BERT pre-training model and extracting semantic features;
step S302: segmenting the text into a text matrix T = {t1, t2, …, tn}, where n is the character length after sentence segmentation and tn denotes the nth text token;
step S303: the input embedding vector of each character is the sum X = {x1, x2, …, xn} of its character embedding, position embedding, and character-type embedding; X is used as the input of the BERT pre-training model to obtain the encoding vectors H = {h1, h2, …, hn} containing semantic information, where xn denotes the nth summed vector and hn denotes the nth encoding vector.
5. The method for extracting scholars' information relationship based on boundary and segment classification as claimed in claim 1, wherein said step S4 includes the following steps:
step S401: the encoding vectors obtained in step S302 are fed to a Linear layer, and a sigmoid activation function gives the probability that each token is a head or tail pointer, with the formulas:
p_i^(start_s) = σ(W_start · h_i + b_start)
p_i^(end_s) = σ(W_end · h_i + b_end)
where σ denotes the sigmoid function; p_i^(start_s) and p_i^(end_s) denote the probability that the ith character is a head pointer or a tail pointer of a subject word; start and end denote the head pointer and the tail pointer, and the subscript s denotes the subject word; W_start and W_end are the trainable weight matrices for predicting the head-pointer and tail-pointer probabilities; b_start and b_end are the biases for predicting the head pointer and the tail pointer; h_i denotes the encoding vector of the ith character obtained in step S303. If a probability exceeds a set threshold, the corresponding token is tagged 1, otherwise 0;
step S402: traverse all tokens whose head-pointer tag is 1; let the position of such a token in the sequence be i. Find the first tail pointer at or after the head pointer and let its position be j. The subsequence {h_i, …, h_j} of the encoding vectors H from step S303 is denoted H_sub, where the subscript sub denotes the subject-word sequence;
step S403: the subject-word sequence obtained in step S402 is fed to a bidirectional LSTM to obtain the encoding vector s_sub of the subject-word sequence, with the formula:
s_sub = LSTM(H_sub)
wherein LSTM denotes a bidirectional long short-term memory recurrent neural network;
step S404: the subject-word encoding vector s_sub obtained in step S403 is fed to a Linear layer, and a softmax activation function gives the probability that the subject-word sequence is an entity, with the formula:
P_s = softmax(W^T s_sub + b)
wherein P_s is the probability that the subject-word sequence is an entity, s denotes the subject word, the matrix W ∈ R^(d×k), d is the subject-word encoding dimension, k is the number of entity categories, and b denotes the bias; if the probability exceeds a set threshold, the corresponding entity category is tagged 1, otherwise 0.
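Steps S401–S404 taken together are: sigmoid pointer tagging over the character encodings, pairing each head pointer with the first tail pointer at or after it, and softmax classification of the resulting span encoding. A minimal NumPy sketch under toy shapes (the helper names are illustrative, and a plain span vector stands in for the BiLSTM encoding of step S403):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pointer_tags(H, W_start, b_start, W_end, b_end, threshold=0.5):
    """Step S401: a linear layer plus sigmoid over each character
    encoding h_i gives head/tail-pointer probabilities; positions whose
    probability exceeds the threshold are tagged 1."""
    heads = (sigmoid(H @ W_start + b_start) > threshold).astype(int)
    tails = (sigmoid(H @ W_end + b_end) > threshold).astype(int)
    return heads, tails

def decode_spans(heads, tails):
    """Step S402: pair every head pointer with the first tail pointer
    at or after it, yielding candidate subject spans (inclusive)."""
    spans = []
    for i, h in enumerate(heads):
        if h == 1:
            for j in range(i, len(tails)):
                if tails[j] == 1:
                    spans.append((i, j))
                    break
    return spans

def classify_span(s_sub, W, b):
    """Step S404: softmax over the span encoding s_sub;
    W has shape (d, k) for k entity categories."""
    logits = s_sub @ W + b
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()
```

Because a head pointer is always matched to the *first* following tail pointer, a spuriously late tail prediction cannot stretch an entity indefinitely, and the segment classifier then filters spans that are not valid entities.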
6. The method for extracting scholars' information relationship based on boundary and segment classification as claimed in claim 1, wherein the step S5 includes:
step S501: the encoding vector s_sub of each subject word whose predicted entity tag is 1 in step S404 is added to the encoding vectors H, and the sum is fed to a Linear layer; a sigmoid activation function gives the probability that each token is a head or tail pointer of a corresponding object word, with the formulas:
p_i^(start_o, r) = σ(W_start^r · (h_i + s_sub) + b_start^r)
p_i^(end_o, r) = σ(W_end^r · (h_i + s_sub) + b_end^r)
wherein r denotes the subject-object relation; p_i^(start_o, r) and p_i^(end_o, r) denote the probability that the ith character in the sequence is a head or tail pointer of an object word; W_start^r and W_end^r are the trainable weight matrices for predicting the head and tail pointers of relation r; b_start^r and b_end^r denote the biases of the head and tail pointers of relation r;
step S502: for each relation r, traverse all tokens whose head-pointer tag is 1; let the position of such a token in the sequence be i. Find the first tail pointer at or after the head pointer and let its position be j. The predicted object-word sequence of the subject word under relation r is {h_i, …, h_j}, denoted H_obj^r;
Step S503: the object-word sequence obtained in step S502 is fed to a bidirectional LSTM to obtain the encoding vector of the object word, with the formula:
s_obj^r = LSTM(H_obj^r)
where the LSTM shares parameters with the LSTM in step S403; s_obj^r is the object-word encoding vector of the subject word under relation r, and obj denotes the object word;
step S504: the object-word encoding vector s_obj^r obtained in step S503 is fed to a Linear layer, and a softmax activation function gives the probability that the object-word sequence is an entity, with the formula:
P_o = softmax(W^T s_obj^r + b)
wherein P_o is the probability that the object-word sequence is an entity, the matrix W ∈ R^(d×k), d is the encoding dimension, k is the number of entity categories, and the parameters are shared with step S404; if the probability exceeds a set threshold, the corresponding entity category is tagged 1, otherwise 0.
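Step S501's conditioning on the predicted subject, with one pointer classifier per relation, can be sketched as follows; the relation names, weight shapes, and function name are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def object_pointer_tags(H, s_sub, rel_weights, threshold=0.5):
    """Step S501: the subject encoding s_sub is added to every character
    encoding h_i, then one sigmoid head/tail classifier per relation r
    tags candidate object pointers.  rel_weights maps each relation name
    to its (W_start, b_start, W_end, b_end) tuple."""
    fused = H + s_sub  # broadcast the subject vector over the sequence
    tags = {}
    for r, (Ws, bs, We, be) in rel_weights.items():
        heads = (sigmoid(fused @ Ws + bs) > threshold).astype(int)
        tails = (sigmoid(fused @ We + be) > threshold).astype(int)
        tags[r] = (heads, tails)
    return tags
```

Running one classifier per relation is what lets a single subject word yield object words under several relations at once, matching the half-pointer, half-label scheme described in the text.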
7. A system for extracting scholars' information relation based on boundary and segment classification is characterized in that the system comprises the following modules:
module M1: acquiring personal information and text contents of different teachers;
module M2: carrying out similar entity word replacement and amplification training data on entity words in the text;
module M3: embedding the text by using a pre-training model and extracting semantic features;
module M4: recognizing subject-word boundaries and classifying entity segments;
module M5: recognizing object-word boundaries and the corresponding relation boundaries and classifying entity segments;
module M6: constructing a user portrait according to the recognition and classification results.
8. The system for extracting scholars' information relationship based on boundary and segment classification as claimed in claim 7, wherein the module M1 includes the following modules:
a module M101: acquiring text contents of personal information of different teachers from the Internet according to a teacher list;
the module M102: extracting the text content of the html files obtained by module M101 and deleting the html tags to obtain a plain-text file of the teacher's personal information;
the module M103: splitting the text into sentences according to the Chinese full stop and a maximum sentence-length threshold;
the module M104: constructing an entity-relation extraction data set from the labeled text.
9. The system for extracting scholars' information relationship based on boundary and segment classification as claimed in claim 7, wherein the module M2 includes the following modules:
module M201: constructing an entity library through the entity-recognition model BERT-CRF, the entity categories comprising time, institution, person name, college or department, academic degree, position, and title;
the module M202: performing entity recognition on the sentences obtained by module M103 through the entity-recognition model BERT-CRF;
module M203: performing similar-entity replacement on the entities of each sentence obtained by module M202 to augment the data set, the replacement entities coming from the entity library constructed by module M201.
10. The system for extracting scholars' information relationship based on boundary and segment classification as claimed in claim 7, wherein the module M3 includes the following modules:
module M301: embedding the text by using a BERT pre-training model and extracting semantic features;
the module M302: segmenting the text into a text matrix T = {t1, t2, …, tn}, where n is the character length after sentence segmentation and tn denotes the nth text token;
module M303: the input embedding vector of each character is the sum X = {x1, x2, …, xn} of its character embedding, position embedding, and character-type embedding; X is used as the input of the BERT pre-training model to obtain the encoding vectors H = {h1, h2, …, hn} containing semantic information, where xn denotes the nth summed vector and hn denotes the nth encoding vector.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110685661.2A CN113468887A (en) | 2021-06-21 | 2021-06-21 | Student information relation extraction method and system based on boundary and segment classification |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110685661.2A CN113468887A (en) | 2021-06-21 | 2021-06-21 | Student information relation extraction method and system based on boundary and segment classification |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113468887A true CN113468887A (en) | 2021-10-01 |
Family
ID=77868803
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110685661.2A Pending CN113468887A (en) | 2021-06-21 | 2021-06-21 | Student information relation extraction method and system based on boundary and segment classification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113468887A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114239566A (en) * | 2021-12-14 | 2022-03-25 | 公安部第三研究所 | Method, device and processor for realizing two-step Chinese event accurate detection based on information enhancement and computer readable storage medium thereof |
CN114783559A (en) * | 2022-06-23 | 2022-07-22 | 浙江太美医疗科技股份有限公司 | Medical image report information extraction method and device, electronic equipment and storage medium |
CN115510866A (en) * | 2022-11-16 | 2022-12-23 | 国网江苏省电力有限公司营销服务中心 | Knowledge extraction method and system oriented to entity relationship cooperation in electric power field |
CN116227483A (en) * | 2023-02-10 | 2023-06-06 | 南京南瑞信息通信科技有限公司 | Word boundary-based Chinese entity extraction method, device and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109657135A (en) * | 2018-11-13 | 2019-04-19 | South China University of Technology | Neural-network-based scholar user portrait information extraction method and model |
CN111444721A (en) * | 2020-05-27 | 2020-07-24 | 南京大学 | Chinese text key information extraction method based on pre-training language model |
CN112215004A (en) * | 2020-09-04 | 2021-01-12 | 中国电子科技集团公司第二十八研究所 | Application method in extraction of text entities of military equipment based on transfer learning |
CN112487807A (en) * | 2020-12-09 | 2021-03-12 | 重庆邮电大学 | Text relation extraction method based on expansion gate convolution neural network |
-
2021
- 2021-06-21 CN CN202110685661.2A patent/CN113468887A/en active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109657135A (en) * | 2018-11-13 | 2019-04-19 | South China University of Technology | Neural-network-based scholar user portrait information extraction method and model |
CN111444721A (en) * | 2020-05-27 | 2020-07-24 | 南京大学 | Chinese text key information extraction method based on pre-training language model |
CN112215004A (en) * | 2020-09-04 | 2021-01-12 | 中国电子科技集团公司第二十八研究所 | Application method in extraction of text entities of military equipment based on transfer learning |
CN112487807A (en) * | 2020-12-09 | 2021-03-12 | 重庆邮电大学 | Text relation extraction method based on expansion gate convolution neural network |
Non-Patent Citations (1)
Title |
---|
ZHANG Qiuying et al.: "Scholar homepage information extraction based on BERT-BiLSTM-CRF", Application Research of Computers * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114239566A (en) * | 2021-12-14 | 2022-03-25 | 公安部第三研究所 | Method, device and processor for realizing two-step Chinese event accurate detection based on information enhancement and computer readable storage medium thereof |
CN114239566B (en) * | 2021-12-14 | 2024-04-23 | 公安部第三研究所 | Method, device, processor and computer readable storage medium for realizing accurate detection of two-step Chinese event based on information enhancement |
CN114783559A (en) * | 2022-06-23 | 2022-07-22 | 浙江太美医疗科技股份有限公司 | Medical image report information extraction method and device, electronic equipment and storage medium |
CN114783559B (en) * | 2022-06-23 | 2022-09-30 | 浙江太美医疗科技股份有限公司 | Medical image report information extraction method and device, electronic equipment and storage medium |
CN115510866A (en) * | 2022-11-16 | 2022-12-23 | 国网江苏省电力有限公司营销服务中心 | Knowledge extraction method and system oriented to entity relationship cooperation in electric power field |
CN116227483A (en) * | 2023-02-10 | 2023-06-06 | 南京南瑞信息通信科技有限公司 | Word boundary-based Chinese entity extraction method, device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108182295B (en) | Enterprise knowledge graph attribute extraction method and system | |
CN110019839B (en) | Medical knowledge graph construction method and system based on neural network and remote supervision | |
CN110427623B (en) | Semi-structured document knowledge extraction method and device, electronic equipment and storage medium | |
CN113177124B (en) | Method and system for constructing knowledge graph in vertical field | |
CN113468887A (en) | Student information relation extraction method and system based on boundary and segment classification | |
CN111767732B (en) | Document content understanding method and system based on graph attention model | |
CN106778878B (en) | Character relation classification method and device | |
CN114580424B (en) | Labeling method and device for named entity identification of legal document | |
CN113051356A (en) | Open relationship extraction method and device, electronic equipment and storage medium | |
CN116070602B (en) | PDF document intelligent labeling and extracting method | |
CN113051914A (en) | Enterprise hidden label extraction method and device based on multi-feature dynamic portrait | |
CN111143574A (en) | Query and visualization system construction method based on minority culture knowledge graph | |
CN115952791A (en) | Chapter-level event extraction method, device and equipment based on machine reading understanding and storage medium | |
CN111710428A (en) | Biomedical text representation method for modeling global and local context interaction | |
CN112699685A (en) | Named entity recognition method based on label-guided word fusion | |
CN114756681A (en) | Evaluation text fine-grained suggestion mining method based on multi-attention fusion | |
CN113836306B (en) | Composition automatic evaluation method, device and storage medium based on chapter component identification | |
Tarride et al. | A comparative study of information extraction strategies using an attention-based neural network | |
CN113312918B (en) | Word segmentation and capsule network law named entity identification method fusing radical vectors | |
CN112417155B (en) | Court trial query generation method, device and medium based on pointer-generation Seq2Seq model | |
CN111563374B (en) | Personnel social relationship extraction method based on judicial official documents | |
CN112148879A (en) | Computer readable storage medium for automatically labeling code with data structure | |
CN115659989A (en) | Web table abnormal data discovery method based on text semantic mapping relation | |
CN114661900A (en) | Text annotation recommendation method, device, equipment and storage medium | |
CN115344668A (en) | Multi-field and multi-disciplinary science and technology policy resource retrieval method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20211001 |