CN113468887A - Student information relation extraction method and system based on boundary and segment classification - Google Patents

Info

Publication number
CN113468887A
Authority
CN
China
Prior art keywords
entity
text
word
module
boundary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110685661.2A
Other languages
Chinese (zh)
Inventor
曹安蕲
唐果
傅洛伊
王新兵
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202110685661.2A
Publication of CN113468887A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a scholar information relation extraction method based on boundary and segment classification, comprising the following steps: step S1: acquire the personal information and text content of different teachers; step S2: replace entity words in the text with similar entity words to augment the training data; step S3: embed the text with a pre-training model and extract semantic features; step S4: recognize subject word boundaries and classify entity fragments; step S5: recognize object word boundaries and the corresponding relation boundaries and classify entity fragments; step S6: build user portraits from the recognition and classification results. By combining the idea of probabilistic graphs with a half-pointer, half-tagging scheme, the method handles the cases in relation extraction where one subject word corresponds to several object words and where the same entity pair holds different relations. Using boundaries to enhance entity fragment classification reduces the impact of tail-pointer prediction errors and improves the accuracy of entity relation extraction.

Description

Student information relation extraction method and system based on boundary and segment classification
Technical Field
The invention relates to the technical field of machine learning and natural language processing, in particular to a student information relation extraction method and system based on boundary and segment classification.
Background
Information such as a scholar's name, e-mail address, title, personal homepage, education history and work history must be extracted from text. Most scholar information comes from personal homepages or introductory web pages (Baidu Baike entries, school faculty directories), so the sources are few, the data are noisy, and much of the information is redundant. Furthermore, the text obtained by preprocessing the html files of these scholar-information pages differs grammatically from natural text in the traditional sense, which makes hand-written extraction rules hard to apply; extracting the information manually, on the other hand, is labor-intensive and inefficient. It is therefore important to extract information from web-page text with the entity relation extraction techniques of natural language processing.
Entity relation extraction is an important branch of information extraction and comprises two subtasks: entity recognition and relation extraction, i.e., identifying entities in natural text, extracting the relation between each entity pair, and finally forming relation triples <s, p, o>, where s is the subject word (subject), p is the predicate, i.e., the relation (predicate), and o is the object word (object). Entities are concepts in the text such as times, places, people and organizations; relations are the semantic relations between entities.
At present, entity relation extraction mainly uses neural network models, realized in two ways: 1. the pipeline form; 2. the joint extraction form. In the pipeline method, relation extraction between entities is carried out directly after entity recognition finishes. Pipeline learning is easy to implement and the two models are flexible (the entity recognition model and the relation extraction model can even use independent data sets), but errors of the entity recognition model degrade the relation extraction model, producing error propagation; the entity recognition model also yields many redundant entities, which increases the difficulty and complexity of the subsequent relation extraction task; and pipeline learning ignores the connection between the two tasks. Researchers have therefore fused named entity recognition and relation extraction into one task for joint learning. Joint learning relieves error propagation to a certain degree, and fusing the two tasks into one model increases the efficiency of learning and prediction and improves the robustness of the model. In recent years, the Transformer model has greatly improved the precision of natural language processing tasks in different fields, and fine-tuning a Transformer pre-training model reduces the amount of data needed to train a model.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a student information relation extraction method and system based on boundary and segment classification.
The invention provides a scholar information relation extraction method based on boundary and segment classification, which comprises the following steps:
step S1: acquiring personal information and text contents of different teachers;
step S2: replacing entity words in the text with similar entity words to augment the training data;
step S3: embedding the text by using a pre-training model and extracting semantic features;
step S4: recognizing subject word boundaries and classifying entity fragments;
step S5: recognizing object word boundaries and corresponding relation boundaries and classifying entity fragments;
step S6: constructing user portraits according to the recognition and classification results.
Preferably, the step S1 includes the steps of:
step S101: acquiring text contents of personal information of different teachers from the Internet according to a teacher list;
step S102: extracting text content in the html file obtained in the step S101, deleting html tags, and obtaining a whole plain text file of the personal information of the teacher;
step S103: dividing the text into sentences according to the Chinese full stop and a maximum sentence-length threshold;
step S104: and constructing an entity relation extraction data set by the labeled text.
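The sentence splitting of step S103 (split at the Chinese full stop, then enforce a maximum-length threshold) can be sketched as follows; the function name and the default threshold are illustrative, not taken from the patent.

```python
def split_sentences(text, max_len=128):
    """Split text at Chinese full stops (U+3002); further chop any
    piece longer than max_len so no sentence exceeds the threshold."""
    pieces = [s for s in text.split("\u3002") if s]
    sentences = []
    for p in pieces:
        # hard-wrap overly long pieces at max_len characters
        while len(p) > max_len:
            sentences.append(p[:max_len])
            p = p[max_len:]
        sentences.append(p)
    return sentences
```

A piece that already fits the threshold passes through unchanged; only over-long fragments (e.g. running text without punctuation) are hard-wrapped.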
Preferably, the step S2 includes the steps of:
step S201: constructing an entity library through the entity recognition model BERT-CRF, wherein the entity categories comprise time, organization, name, department, degree, job and title;
step S202: performing entity recognition on the sentence obtained in the step S103 through an entity recognition model BERT-CRF;
step S203: and performing similar entity replacement on the entity of each sentence obtained in the step S202 to amplify the data set, wherein the replaced entity is from the entity library constructed in the step S201.
Preferably, the step S3 includes the steps of:
step S301: embedding the text by using a BERT pre-training model and extracting semantic features;
step S302: segmenting the text yields a text matrix T = {t_1, t_2, ..., t_n}, where n is the number of characters after sentence segmentation and t_n is the n-th text token;
step S303: the input embedding vector of each character is the sum X = {x_1, x_2, ..., x_n} of its character embedding, position embedding and character-type embedding; X is used as the input of the BERT pre-training model to obtain an encoding vector H = {h_1, h_2, ..., h_n} containing semantic information, where x_n is the n-th summed value and h_n is the n-th encoding vector.
Preferably, the step S4 includes the steps of:
step S401: the encoding vectors obtained in step S302 are fed to a Linear layer, and a sigmoid activation function gives the probability that each token is a head or tail pointer:

p_i^{s,start} = σ(W_start · h_i + b_start)
p_i^{s,end} = σ(W_end · h_i + b_end)

where σ is the sigmoid function; p_i^{s,start} and p_i^{s,end} are the probabilities that the i-th character is a head pointer and a tail pointer of a subject word; start and end denote the head and tail pointer, and the subscript s denotes the subject word; W_start and W_end are the trainable weight matrices for predicting the head-pointer and tail-pointer probabilities, and b_start and b_end are the corresponding biases; h_i is the encoding vector of the i-th character obtained in step S303. If a probability exceeds a given threshold, the corresponding token is tagged 1, otherwise 0;
step S402: traverse all tokens whose head pointer is tagged 1; let the position of such a token in the sequence be i_start^s. Find the first tail pointer at or after the head pointer; let its position in the sequence be i_end^s. The resulting subsequence {h_{i_start^s}, ..., h_{i_end^s}} of the encoding vector H from step S303 is denoted H_sub, where the subscript sub denotes the subject word sequence;
step S403: the subject word sequence obtained in step S402 is fed into a bidirectional LSTM to obtain the encoding vector s_sub of the subject word sequence:

s_sub = LSTM(H_sub)

where LSTM denotes a bidirectional long short-term memory recurrent neural network;
step S404: the subject word encoding vector s_sub obtained in step S403 is used as Linear-layer input, and a softmax activation function gives the probability that the subject word sequence is an entity of each class:

P_s = softmax(W^T s_sub + b)

where P_s is the probability that the subject word sequence is an entity, s denotes the subject word, the matrix W ∈ R^{d×k}, d is the subject word encoding dimension, k is the number of entity classes, and b is the bias; if a probability exceeds a given threshold, the corresponding entity class is tagged 1, otherwise 0.
Preferably, the step S5 includes:
step S501: the encoding vector s_sub of each subject word predicted as an entity in step S404 is added to the vector H and fed to a Linear layer, and a sigmoid activation function gives the probability that each token is a head or tail pointer of the object word under the corresponding relation:

p_i^{o,start,r} = σ(W_start^r · (h_i + s_sub) + b_start^r)
p_i^{o,end,r} = σ(W_end^r · (h_i + s_sub) + b_end^r)

where r denotes a subject word-object word relation; p_i^{o,start,r} and p_i^{o,end,r} are the probabilities that the i-th character in the sequence is a head or tail pointer of an object word; W_start^r and W_end^r are the trainable weight matrices for predicting the head and tail pointers of relation r, and b_start^r and b_end^r are the corresponding biases;
step S502: for each relation r, traverse all tokens whose head pointer is tagged 1; let the position of such a token in the sequence be i_start^{o,r}. Find the first tail pointer at or after the head pointer; let its position in the sequence be i_end^{o,r}. This predicts the object word sequence {h_{i_start^{o,r}}, ..., h_{i_end^{o,r}}} of the subject word under relation r, denoted H_obj^r;
step S503: the object word sequence obtained in step S502 is fed into a bidirectional LSTM to obtain the encoding vector of the object word:

s_obj^r = LSTM(H_obj^r)

where the LSTM shares its parameters with the LSTM in step S403, s_obj^r is the encoding vector of the object word of the subject word under relation r, and obj denotes the object word;
step S504: the object word encoding vector s_obj^r obtained in step S503 is used as Linear-layer input, and a softmax activation function gives the probability that the object word sequence is an entity of each class:

P_o = softmax(W^T s_obj^r + b)

where P_o is the probability that the object word sequence is an entity, the matrix W ∈ R^{d×k}, d is the encoding dimension, k is the number of entity classes, and the parameters are shared with step S404; if a probability exceeds a given threshold, the corresponding entity class is tagged 1, otherwise 0.
The invention also provides a scholar information relation extraction system based on boundary and segment classification, which comprises the following modules:
module M1: acquiring personal information and text contents of different teachers;
module M2: replacing entity words in the text with similar entity words to augment the training data;
module M3: embedding the text by using a pre-training model and extracting semantic features;
module M4: recognizing subject word boundaries and classifying entity fragments;
module M5: recognizing object word boundaries and corresponding relation boundaries and classifying entity fragments;
module M6: constructing user portraits according to the recognition and classification results.
Preferably, the module M1 includes the following modules:
a module M101: acquiring text contents of personal information of different teachers from the Internet according to a teacher list;
the module M102: extracting text contents in the html file obtained in the module M101, deleting html tags, and obtaining a whole plain text file of the personal information of the teacher;
the module M103: dividing the text into sentences according to the Chinese full stop and a maximum sentence-length threshold;
the module M104: and constructing an entity relation extraction data set by the labeled text.
Preferably, the module M2 includes the following modules:
module M201: constructing an entity library through the entity recognition model BERT-CRF, wherein the entity categories comprise time, organization, name, department, degree, job and title;
the module M202: carrying out entity recognition on the sentence obtained in the module M103 through an entity recognition model BERT-CRF;
module M203: and performing similar entity replacement on the entity of each sentence obtained by the module M202 to amplify the data set, wherein the replaced entity is from the entity library constructed by the module M201.
Preferably, the module M3 includes the following modules:
module M301: embedding the text by using a BERT pre-training model and extracting semantic features;
the module M302: segmenting the text yields a text matrix T = {t_1, t_2, ..., t_n}, where n is the number of characters after sentence segmentation and t_n is the n-th text token;
module M303: the input embedding vector of each character is the sum X = {x_1, x_2, ..., x_n} of its character embedding, position embedding and character-type embedding; X is used as the input of the BERT pre-training model to obtain an encoding vector H = {h_1, h_2, ..., h_n} containing semantic information, where x_n is the n-th summed value and h_n is the n-th encoding vector.
Compared with the prior art, the invention has the following beneficial effects:
1. the method uses the trained entity recognition model BERT-CRF to construct the entity library and is used for enhancing the data of the relation extraction model, thereby improving the diversity of the data set and reducing the cost of manually constructing the data set;
2. the invention uses the probability map thought to convert the joint probability distribution p (s, p, o) into the conditional probability distribution p (p, o | s) and the half pointer-half labeling mode to solve the problems that one subject word corresponds to a plurality of object words, the subject word and the object word correspond to a plurality of relations and one object word corresponds to a plurality of subject words in the relation extraction;
3. the invention adopts entity segment classification to solve the problem of overlong entity prediction caused by missing tail pointers in the pointer network prediction process.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flow chart of the relationship extraction model prediction of the present invention;
FIG. 2 is a flow chart of data set amplification according to the present invention;
FIG. 3 is a diagram illustrating a segment classification relation extraction model based on boundary enhancement according to the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that it would be obvious to those skilled in the art that various changes and modifications can be made without departing from the spirit of the invention. All falling within the scope of the present invention.
Referring to fig. 1, the invention provides a scholars information relation extraction method based on boundary and segment classification, comprising the following steps:
step S1: acquiring text contents of personal information of different teachers from Baidu encyclopedia and teacher directories; step S1 includes: crawls the learner information, preprocesses the text, and constructs a basic data set. Step S101: acquiring text contents of personal information of different teachers from the Internet according to a teacher list; step S102: extracting text contents (deleting html tags) in the html file obtained in the step S101 to obtain a whole plain text file of the teacher personal information; step S103: dividing the text into sentences according to the Chinese sentence numbers and the longest length threshold of the sentences; step S104: and manually marking the text to construct an entity relation extraction data set. The data format is as follows:
{"text": "2012.12 obtained the agronomic Master academic degree at the college of the institute of Polymer chemistry, university of Mediterranean China.",
 "spo_list": [
   ["agricultural Master degree", 26, "Master degree-time", "2012.12", 0],
   …]}
The "text" field is the text obtained from the web page, and the "spo_list" field records the <s, p, o> triples in the text: each list is one triple, whose first element is the subject word, second element the position of the subject word's first character in the text, third element the relation between the subject word and the object word, fourth element the object word, and last element the position of the object word's first character in the text.
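One record in this format can be loaded and checked with a few lines of Python; the literal values mirror the example record, and the five-element list layout follows the field description above (the abbreviated "text" string is illustrative).

```python
import json

# one hypothetical record in the data format of step S104
record = json.loads("""
{"text": "2012.12 obtained the agronomic Master academic degree ...",
 "spo_list": [["agricultural Master degree", 26, "Master degree-time", "2012.12", 0]]}
""")

for subj, subj_pos, relation, obj, obj_pos in record["spo_list"]:
    # subj/obj are the subject and object words; *_pos are the offsets
    # of their first characters in the "text" field
    triple = (subj, relation, obj)  # the <s, p, o> triple
```

The character offsets let training code locate each subject and object word even when the same surface string occurs more than once in the sentence.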
Step S2: referring to fig. 2, the entity words in the text are replaced with similar entity words to augment the training data. Step S2 includes: constructing an entity library, augmenting the data with the entity library, and dividing the training set and the test set. Step S201: construct an entity library with the entity recognition model BERT-CRF (the entity categories comprise time, organization, name, department, degree, job and title); the format is as follows:
{"time": ["2012-10", "2002", "2012.10", …],
 "degree": ["Bachelor", "Master", "Doctor", …],
 …}

The library comprises the fields time, organization, name, department, degree, job and title; each field records the entities of that category.
Step S202: perform entity recognition on the sentences obtained in step S103 with the entity recognition model BERT-CRF (the entity categories are the same as those of the entity library in step S201); step S203: perform similar-entity replacement on the entities of each sentence obtained in step S202 to augment the data set, the replacement entities coming from the entity library constructed in step S201.
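A minimal sketch of the similar-entity replacement of step S203, assuming a recognizer has already tagged each entity with its category; the library layout follows step S201, but the function name, the sampled values and the fixed seed are illustrative, not from the patent.

```python
import random

# hypothetical entity library in the format of step S201
ENTITY_LIB = {
    "time": ["2012-10", "2002", "2012.10"],
    "degree": ["Bachelor", "Master", "Doctor"],
}

def augment(sentence, entities, lib, rng=random.Random(0)):
    """Replace each recognized entity with a same-category entity
    sampled from the library, producing one augmented sentence."""
    out = sentence
    for surface, category in entities:
        candidates = [e for e in lib.get(category, []) if e != surface]
        if candidates:
            out = out.replace(surface, rng.choice(candidates), 1)
    return out
```

Because the replacement entity always comes from the same category, the gold span labels and relation labels of the augmented sentence stay valid (only the offsets need recomputing).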
Step S3: referring to fig. 3, word embedding is performed on the text and semantic features are extracted with the BERT pre-training model. Step S3 includes: loading a pre-trained BERT model to embed the text and obtain word vectors. Step S301: segmenting the text yields a text matrix T = {t_1, t_2, ..., t_n}, where n is the number of characters after sentence segmentation; the input embedding vector of each character is the sum X = {x_1, x_2, ..., x_n} of its character embedding, position embedding and character-type embedding, which is used as the input of the BERT pre-training model to obtain an encoding vector H = {h_1, h_2, ..., h_n} containing semantic information.
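The input construction of step S301, summing character, position and character-type embeddings before feeding BERT, can be sketched with toy lookup tables; the random tables below stand in for BERT's trained embedding tables, and all sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, max_len, n_types, d = 100, 16, 2, 8  # toy sizes, not BERT's
char_table = rng.normal(size=(vocab, d))
pos_table  = rng.normal(size=(max_len, d))
type_table = rng.normal(size=(n_types, d))

def input_embeddings(char_ids, type_ids):
    """x_i = character emb + position emb + character-type emb."""
    n = len(char_ids)
    X = (char_table[char_ids]
         + pos_table[np.arange(n)]
         + type_table[type_ids])
    return X  # shape (n, d); BERT would map this X to H = {h_1..h_n}
```

In the real model the pre-trained Transformer layers then turn X into the contextual encoding H used by all later steps.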
Step S4: recognize subject word boundaries and classify entity fragments. Step S4 includes: predicting the head and tail pointers of the subject word and classifying the obtained entity sequences.
Step S401: the encoding vectors obtained in step S302 are fed to a Linear layer, and a sigmoid activation function gives the probability that each token is a head or tail pointer:

p_i^{s,start} = σ(W_start · h_i + b_start)
p_i^{s,end} = σ(W_end · h_i + b_end)

where σ is the sigmoid function; p_i^{s,start} and p_i^{s,end} are the probabilities that the i-th character is a head or tail pointer of a subject word; start and end denote the head and tail pointer, and s denotes the subject word; W_start and W_end are the trainable weight matrices for predicting the head- and tail-pointer probabilities, and b_start and b_end are the corresponding biases; h_i is the encoding vector of the i-th character obtained in step S303. If a probability exceeds a given threshold, the corresponding token is tagged 1, otherwise 0.
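The head/tail pointer scoring of step S401 is, in essence, two independent per-character sigmoid classifiers over the encodings h_i; a NumPy sketch under toy dimensions, where the random vectors stand in for the trained W_start, W_end:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d = 8                               # toy encoding dimension
rng = np.random.default_rng(1)
W_start, b_start = rng.normal(size=d), 0.0
W_end,   b_end   = rng.normal(size=d), 0.0

def pointer_probs(H):
    """Per-character probabilities that each position is a
    head pointer / tail pointer of a subject word (step S401)."""
    p_start = sigmoid(H @ W_start + b_start)
    p_end   = sigmoid(H @ W_end   + b_end)
    return p_start, p_end
```

Thresholding these probabilities (tag 1 above the threshold, else 0) yields the binary head/tail sequences consumed by step S402.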
Step S402: traverse all tokens whose head pointer is tagged 1 (let the position of such a token in the sequence be i_start^s), and find the first tail pointer at or after the head pointer (let its position be i_end^s), obtaining the sequence {h_{i_start^s}, ..., h_{i_end^s}} (denoted H_sub). The head-tail pointer prediction uses the binary cross-entropy loss of a pointer network:

L_sub = -(1/n) Σ_{i=1}^{n} Σ_{t ∈ {start, end}} [ y_i^t log p_i^{s,t} + (1 - y_i^t) log(1 - p_i^{s,t}) ]

where L_sub is the subject word boundary loss, y_i^t indicates whether the i-th character is labeled as a head or tail pointer, and p_i^{s,t} is the probability predicted by the model that the i-th character is a head or tail pointer;
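The span decoding of step S402, pairing each head tagged 1 with the first tail tagged 1 at or after it, is a small loop; a sketch with an illustrative function name:

```python
def decode_spans(head_tags, tail_tags):
    """For every head pointer, take the first tail pointer at or
    after it; returns inclusive (start, end) character spans."""
    spans = []
    for i, h in enumerate(head_tags):
        if h != 1:
            continue
        for j in range(i, len(tail_tags)):
            if tail_tags[j] == 1:
                spans.append((i, j))
                break
    return spans
```

A head with no matching tail produces no span, which is exactly the failure mode the entity fragment classification of step S404 is meant to catch.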
Step S403: the subject word sequence obtained in step S402 is fed into a bidirectional LSTM (long short-term memory recurrent neural network) to obtain the encoding vector s_sub of the subject word:

s_sub = LSTM(H_sub)

where LSTM denotes a bidirectional long short-term memory recurrent neural network;
Step S404: the subject word encoding vector s_sub obtained in step S403 is used as Linear-layer input, and a softmax activation function gives the probability that the subject word sequence is an entity of each class:

P_s = softmax(W^T s_sub + b)

where P_s is the probability that the subject word sequence is an entity, s denotes the subject word, the matrix W ∈ R^{d×k}, d is the subject word encoding dimension, k is the number of entity classes, and b is the bias. If a probability exceeds a given threshold, the corresponding entity class is tagged 1, otherwise 0. The entity classification of the subject word uses a cross-entropy loss:

L_sub_cls = -(1/n) Σ_{i=1}^{n} Σ_{j=1}^{k} y_{ij} log P_{ij}

where y_{ij} indicates whether the true entity class of the i-th fragment is j, L_sub_cls is the loss of subject word fragment classification, n is the number of subject word fragments, i indexes the entity fragments, and j indexes the subject word class labels.
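The fragment classification of steps S403-S404 pools the span encoding and applies a softmax over the k entity classes; in this sketch mean pooling stands in for the bidirectional LSTM (a simplifying assumption, not the patent's architecture), and the weights are random stand-ins for the trained W and b.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d, k = 8, 7  # toy encoding dim; 7 entity classes as in step S201
rng = np.random.default_rng(2)
W, b = rng.normal(size=(d, k)), np.zeros(k)

def classify_span(H_sub):
    """Pool the span encoding (mean pooling stands in for the
    bidirectional LSTM of step S403), then classify it into one
    of k entity classes: P_s = softmax(W^T s_sub + b)."""
    s_sub = H_sub.mean(axis=0)      # stand-in for LSTM(H_sub)
    return softmax(W.T @ s_sub + b)
```

Because the classifier sees the whole span rather than single boundary positions, an over-long span caused by a mispredicted tail pointer tends to score low for every entity class and can be rejected by the threshold.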
Step S5: identify the object word boundaries and the corresponding relation boundaries and classify the entity fragments. Step S5 includes: predicting the object word boundaries of each relation and predicting the entity class of the object word.
Step S501: the encoding vector s_sub of each subject word predicted as an entity in step S404 is added to the vector H and fed to a Linear layer, and a sigmoid activation function gives the probability that each token is a head or tail pointer of the object word under the corresponding relation:

p_i^{o,start,r} = σ(W_start^r · (h_i + s_sub) + b_start^r)
p_i^{o,end,r} = σ(W_end^r · (h_i + s_sub) + b_end^r)

where r denotes a subject word-object word relation; p_i^{o,start,r} and p_i^{o,end,r} are the probabilities that the i-th character in the sequence is a head or tail pointer of an object word; W_start^r and W_end^r are the trainable weight matrices for predicting the head and tail pointers of relation r, and b_start^r and b_end^r are the corresponding biases.
Step S502: for each relation r, traverse all tokens whose head pointer is tagged 1 (let the position of such a token in the sequence be i_start^{o,r}), and find the first tail pointer at or after it (position i_end^{o,r}), predicting the object word sequence {h_{i_start^{o,r}}, ..., h_{i_end^{o,r}}} of the subject word under relation r (denoted H_obj^r). The head-tail prediction of the object word uses the binary cross-entropy loss of a pointer network:

L_obj = -(1/n) Σ_{i=1}^{n} Σ_{t ∈ {start, end}} [ y_i^t log p_i^{o,t,r} + (1 - y_i^t) log(1 - p_i^{o,t,r}) ]

where L_obj is the object word boundary loss, y_i^t indicates whether the i-th character is labeled as a head or tail pointer, and p_i^{o,t,r} is the probability predicted by the model;
Step S503: the object word sequence obtained in step S502 is fed into the bidirectional LSTM (long short-term memory recurrent neural network) to obtain the encoding vector of the object word:

s_obj^r = LSTM(H_obj^r)

where the LSTM shares its parameters with the LSTM in step S403, s_obj^r is the encoding vector of the object word of the subject word under relation r, and obj denotes the object word.
Step S504: the object word encoding vector s_obj^r obtained in step S503 is used as Linear-layer input, and a softmax activation function gives the probability that the object word sequence is an entity of each class:

P_o = softmax(W^T s_obj^r + b)

where P_o is the probability that the object word sequence is an entity, the matrix W ∈ R^{d×k}, d is the encoding dimension, k is the number of entity classes, and the parameters are shared with step S404. If a probability exceeds a given threshold, the corresponding entity class is tagged 1, otherwise 0. The entity classification of the object word uses a cross-entropy loss:

L_obj_cls = -(1/n) Σ_{i=1}^{n} Σ_{j=1}^{k} y_{ij} log P_{ij}

where y_{ij} indicates whether the true entity class of the i-th fragment is j, L_obj_cls is the loss of object word fragment classification, n is the number of object word fragments, i indexes the entity fragments, and j indexes the class labels.
Step S505: the loss function of the whole model is the sum of the subject word boundary loss, the object word boundary loss and the two entity classification losses:

L = L_sub + L_obj + L_sub_cls + L_obj_cls
During training, the adaptive learning algorithm Adam is used for parameter optimization.
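The joint objective of step S505 sums the two boundary (binary cross-entropy) losses and the two fragment-classification (cross-entropy) losses; a NumPy sketch under the assumption that each component is averaged over its own predictions (the function names are illustrative):

```python
import numpy as np

def bce(p, y):
    """Binary cross-entropy for the pointer boundaries (S402/S502)."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def ce(P, classes):
    """Cross-entropy for entity fragment classification (S404/S504);
    P is (n, k) class probabilities, classes the true class indices."""
    rows = np.clip(P[np.arange(len(classes)), classes], 1e-7, None)
    return -np.mean(np.log(rows))

def total_loss(sub_ptr, obj_ptr, sub_cls, obj_cls):
    """L = L_sub + L_obj + L_sub_cls + L_obj_cls, as in step S505."""
    return (bce(*sub_ptr) + bce(*obj_ptr)
            + ce(*sub_cls) + ce(*obj_cls))
```

An optimizer such as Adam would then minimize this single scalar, so the boundary and classification heads are trained jointly.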
Firstly, augmenting the data set with the entity recognition model BERT-CRF greatly increases the diversity of the data set, reduces labor cost and improves efficiency. Secondly, the invention decomposes the joint probability of the triple into conditional probabilities with the probabilistic-graph idea, and uses the half-pointer, half-tagging scheme to handle the cases in relation extraction where one subject word corresponds to several object words, one object word corresponds to several subject words, and a subject word and object word correspond to several relations. Finally, entity fragment classification alleviates the over-long entity predictions caused by mispredicted tail pointers in the pointer network. Meanwhile, boundary recognition focuses on the context of an entity while entity classification focuses on the entity itself, and learning the two jointly improves the prediction accuracy of the model. The scholar portraits constructed through entity and relation extraction help users follow scholar information more efficiently, and the accurate, concise information also allows the education or work migration graph of scholars to be analyzed. The work can also be transferred conveniently to information extraction in other fields.
Step S6: and performing user portrayal according to the recognition and classification results.
The invention also provides a scholar information relation extraction system based on boundary and segment classification, which comprises the following modules:

Module M1: acquiring personal information and text content of different teachers from Baidu Encyclopedia and classroom directories;

Module M101: acquiring the text content of different teachers' personal information from the Internet according to a teacher list;

Module M102: extracting the text content of the html files obtained by module M101 and deleting the html tags to obtain plain text files of the teachers' personal information;

Module M103: dividing the text into sentences according to Chinese full stops and a maximum sentence length threshold;

Module M104: constructing an entity relation extraction data set from the labeled text.
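Module M103's sentence division can be sketched in a few lines; the 128-character threshold below is an illustrative assumption, not a value given in the patent:

```python
import re

def split_sentences(text, max_len=128):
    """Split Chinese text on the full-width period '。' (keeping the
    delimiter with its sentence), then further chunk any sentence longer
    than max_len characters."""
    pieces = [s for s in re.split(r'(?<=。)', text) if s]
    out = []
    for s in pieces:
        while len(s) > max_len:
            out.append(s[:max_len])
            s = s[max_len:]
        out.append(s)
    return out
```

The zero-width lookbehind `(?<=。)` splits immediately after each full stop, so no characters of the original text are lost.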
Module M2: performing similar entity word replacement on entity words in the text to augment the training data;

Module M201: constructing an entity library through the entity recognition model BERT-CRF, where entity categories include time, institution, person name, school/college, academic degree, position, and title;

Module M202: performing entity recognition on the sentences obtained by module M103 through the entity recognition model BERT-CRF;

Module M203: performing similar entity replacement on the entities of each sentence obtained by module M202 to augment the data set, where the replacement entities come from the entity library constructed by module M201.
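A minimal sketch of module M203's augmentation, assuming an entity library keyed by category (the category names and entries below are illustrative) and entity spans already produced by a recognizer such as BERT-CRF:

```python
import random

# Hypothetical entity library keyed by category, as module M201 would
# build with a BERT-CRF recognizer; entries here are illustrative only.
ENTITY_LIB = {
    "ORG": ["上海交通大学", "清华大学", "北京大学"],
    "TITLE": ["教授", "副教授", "讲师"],
}

def augment(sentence, spans, rng=None):
    """Replace each recognized entity span (start, end, category) with a
    same-category entity drawn from the library, yielding a new sentence."""
    rng = rng or random.Random(0)
    out, prev = [], 0
    for start, end, cat in sorted(spans):
        out.append(sentence[prev:start])      # text before the entity
        out.append(rng.choice(ENTITY_LIB[cat]))  # same-category substitute
        prev = end
    out.append(sentence[prev:])               # trailing text
    return "".join(out)
```

Because only same-category substitutions are made, the boundary and type labels of the original annotation remain valid for the augmented sentence.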
Module M3: embedding the text by using a BERT pre-training model and extracting semantic features;

Module M301: the text matrix T = {t_1, t_2, ..., t_n} is obtained by segmenting the text, where n represents the character length after sentence segmentation;

Module M302: the input embedding vector of each character is the sum X = {x_1, x_2, ..., x_n} of the character embedding, the position embedding, and the character type embedding, which is used as the input of the BERT pre-training model to obtain the encoding vectors H = {h_1, h_2, ..., h_n} containing semantic information.
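The input representation of module M302 — character embedding plus position embedding plus character type embedding — can be sketched with random tables standing in for the pretrained BERT parameters (all dimensions below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, n_types, dim = 100, 32, 2, 16

# Illustrative embedding tables; in the patent these come from pretrained BERT.
tok_emb = rng.normal(size=(vocab_size, dim))
pos_emb = rng.normal(size=(max_len, dim))
type_emb = rng.normal(size=(n_types, dim))

def input_embeddings(token_ids, type_ids):
    """x_i = character embedding + position embedding + character type
    embedding — the X = {x_1, ..., x_n} fed to the BERT encoder."""
    n = len(token_ids)
    return tok_emb[token_ids] + pos_emb[np.arange(n)] + type_emb[type_ids]
```

The encoder then maps X to the contextual vectors H = {h_1, ..., h_n} used by the pointer and classification heads.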
Module M4: recognizing subject word boundaries and classifying entity fragments;
module M5: and identifying object word boundaries and corresponding relation boundaries and classifying entity fragments.
Module M6: and performing user portrayal according to the recognition and classification results.
The method uses the trained entity recognition model BERT-CRF to construct an entity library for data augmentation of the relation extraction model, improving data set diversity and reducing the cost of manual data set construction; the invention uses the probability graph idea to convert the joint probability distribution p(s, p, o) into the conditional probability distribution p(p, o | s), and uses the half pointer-half label scheme to solve the problems in relation extraction where one subject word corresponds to multiple object words, a subject word and object word pair corresponds to multiple relations, and one object word corresponds to multiple subject words; the invention adopts entity segment classification to solve the problem of overlong predicted entities caused by missed tail pointers in pointer network prediction.
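The half pointer-half label decoding summarized above — pair each above-threshold head pointer with the first above-threshold tail pointer at or after it — can be sketched as:

```python
def decode_spans(start_p, end_p, threshold=0.5):
    """Half-pointer decoding: every position whose head-pointer probability
    exceeds the threshold starts a span; it is paired with the first
    tail-pointer position at or after it."""
    starts = [i for i, p in enumerate(start_p) if p > threshold]
    ends = [i for i, p in enumerate(end_p) if p > threshold]
    spans = []
    for s in starts:
        tail = [e for e in ends if e >= s]
        if tail:
            spans.append((s, tail[0]))
    return spans
```

Because each relation r has its own pointer sequences, running this decoder once per relation lets one subject word yield multiple object words under multiple relations; the subsequent entity segment classification then filters spans whose tail pointer was decoded too far to the right.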
Those skilled in the art will appreciate that, in addition to implementing the system and its various devices, modules, units provided by the present invention as pure computer readable program code, the system and its various devices, modules, units provided by the present invention can be fully implemented by logically programming method steps in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system and various devices, modules and units thereof provided by the invention can be regarded as a hardware component, and the devices, modules and units included in the system for realizing various functions can also be regarded as structures in the hardware component; means, modules, units for performing the various functions may also be regarded as structures within both software modules and hardware components for performing the method.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. A scholar information relation extraction method based on boundary and segment classification, characterized by comprising the following steps:
step S1: acquiring personal information and text contents of different teachers;
step S2: performing similar entity word replacement on entity words in the text to augment the training data;
step S3: embedding the text by using a pre-training model and extracting semantic features;
step S4: recognizing subject word boundaries and classifying entity fragments;
step S5: recognizing object word boundaries and corresponding relation boundaries and classifying entity fragments;
step S6: and performing user portrayal according to the recognition and classification results.
2. The method for extracting scholars' information relationship based on boundary and segment classification as claimed in claim 1, wherein said step S1 includes the following steps:
step S101: acquiring text contents of personal information of different teachers from the Internet according to a teacher list;
step S102: extracting text content in the html file obtained in the step S101, deleting html tags, and obtaining a whole plain text file of the personal information of the teacher;
step S103: dividing the text into sentences according to Chinese full stops and a maximum sentence length threshold;
step S104: and constructing an entity relation extraction data set by the labeled text.
3. The method for extracting scholars' information relationship based on boundary and segment classification as claimed in claim 1, wherein said step S2 includes the following steps:
step S201: constructing an entity library through the entity recognition model BERT-CRF, wherein the entity categories comprise time, institution, person name, school/college, academic degree, position, and title;
step S202: performing entity recognition on the sentence obtained in the step S103 through an entity recognition model BERT-CRF;
step S203: and performing similar entity replacement on the entity of each sentence obtained in the step S202 to amplify the data set, wherein the replaced entity is from the entity library constructed in the step S201.
4. The method for extracting scholars' information relationship based on boundary and segment classification as claimed in claim 1, wherein said step S3 includes the following steps:
step S301: embedding the text by using a BERT pre-training model and extracting semantic features;
step S302: the text matrix T = {t_1, t_2, ..., t_n} is obtained by segmenting the text, where n represents the character length after sentence segmentation and t_n denotes the n-th text token;

step S303: the input embedding vector of each character is the sum X = {x_1, x_2, ..., x_n} of the character embedding, the position embedding, and the character type embedding, which is used as the input of the BERT pre-training model to obtain the encoding vectors H = {h_1, h_2, ..., h_n} containing semantic information, where x_n denotes the n-th summed value and h_n denotes the n-th encoding vector.
5. The method for extracting scholars' information relationship based on boundary and segment classification as claimed in claim 1, wherein said step S4 includes the following steps:
step S401: the encoding vectors obtained in step S303 are used as Linear layer input, and a sigmoid activation function is used to obtain the probability that each token is a head or tail pointer, with the following formulas:

P_i^{start_s} = σ(W_start h_i + b_start)

P_i^{end_s} = σ(W_end h_i + b_end)

where σ represents the sigmoid function; P_i^{start_s} and P_i^{end_s} respectively represent the probability that the i-th character is a head pointer of the subject word and the probability that it is a tail pointer of the subject word; start and end respectively denote the head pointer and the tail pointer, and the subscript s denotes the subject word; W_start and W_end are the trainable weight matrices for predicting the head pointer probability and the tail pointer probability, respectively; b_start and b_end are the biases for predicting the head pointer and the tail pointer, respectively; h_i denotes the encoding vector of the i-th character obtained in step S303; if a probability exceeds a certain threshold, the corresponding token is labeled 1, otherwise 0;
step S402: traverse all tokens whose head pointer is labeled 1; denote the position of such a head pointer in the sequence by pos_{start_s}; find the first tail pointer located after the head pointer, and denote its position in the sequence by pos_{end_s}; the resulting subsequence {h_{pos_start_s}, ..., h_{pos_end_s}} is denoted H_sub, where h is the encoding vector from step S303 and the subscript sub denotes the subject word sequence;
step S403: the subject word sequence H_sub obtained in step S402 is fed into a bidirectional LSTM to obtain the encoding vector s_sub of the subject word sequence, with the following formula:

s_sub = LSTM(H_sub)

where LSTM represents a bidirectional long short-term memory recurrent neural network;
step S404: the subject word encoding vector s_sub obtained in step S403 is used as Linear layer input, and a softmax activation function is used to obtain the probability that the subject word sequence is an entity, with the following formula:

P_s = softmax(W^T s_sub + b)

where P_s is the probability that the subject word sequence is an entity, s denotes the subject word, the matrix W ∈ R^{d×k}, d is the subject word encoding dimension, k is the number of entity categories, and b denotes the bias; if the probability exceeds a certain threshold, the corresponding entity class is labeled 1, otherwise 0.
6. The method for extracting scholars' information relationship based on boundary and segment classification as claimed in claim 1, wherein the step S5 includes:
step S501: the encoding vector s_sub of each subject word predicted to have entity class 1 in step S404 is added to the encoding vectors H and used as Linear layer input, and a sigmoid activation function is used to obtain the probability that each token is a head or tail pointer of a corresponding object word, with the following formulas:

P_i^{start_o, r} = σ(W_start^r (h_i + s_sub) + b_start^r)

P_i^{end_o, r} = σ(W_end^r (h_i + s_sub) + b_end^r)

where r denotes the relation between the subject word and the object word; P_i^{start_o, r} and P_i^{end_o, r} represent the probabilities that the i-th character in the sequence is the head pointer and the tail pointer of an object word; W_start^r and W_end^r are the trainable weight matrices predicting the head and tail pointers of relation r; b_start^r and b_end^r are the biases predicting the head and tail pointers of relation r;
step S502: for each relation r, traverse all tokens whose head pointer is labeled 1; denote the position of such a head pointer in the sequence by pos_{start_o}^r; find the first tail pointer located after the head pointer, and denote its position in the sequence by pos_{end_o}^r; the object word sequence {h_{pos_start_o^r}, ..., h_{pos_end_o^r}} predicted for the subject word under relation r is denoted H_obj^r;
Step S503: the object word sequence H_obj^r obtained in step S502 is fed into the bidirectional LSTM to obtain the encoding vector s_obj^r of the object word, with the following formula:

s_obj^r = LSTM(H_obj^r)

where the LSTM shares parameters with the LSTM in step S403; s_obj^r is the encoding vector of the object word corresponding to the subject word under relation r, and obj denotes the object word;
step S504: the object word encoding vector s_obj^r obtained in step S503 is used as Linear layer input, and a softmax activation function is used to obtain the probability that the object word sequence is an entity, with the following formula:

P_o^r = softmax(W^T s_obj^r + b)

where P_o^r is the probability that the object word sequence is an entity, the matrix W ∈ R^{d×k}, d is the encoding dimension, k is the number of entity categories, and the parameters are shared with step S404; if the probability exceeds a certain threshold, the corresponding entity class is labeled 1, otherwise 0.
7. A system for extracting scholars' information relation based on boundary and segment classification is characterized in that the system comprises the following modules:
module M1: acquiring personal information and text contents of different teachers;
module M2: performing similar entity word replacement on entity words in the text to augment the training data;
module M3: embedding the text by using a pre-training model and extracting semantic features;
module M4: recognizing subject word boundaries and classifying entity fragments;
module M5: recognizing object word boundaries and corresponding relation boundaries and classifying entity fragments;
module M6: and performing user portrayal according to the recognition and classification results.
8. The system for extracting scholars' information relationship based on boundary and segment classification as claimed in claim 7, wherein the module M1 includes the following modules:
a module M101: acquiring text contents of personal information of different teachers from the Internet according to a teacher list;
the module M102: extracting text contents in the html file obtained in the module M101, deleting html tags, and obtaining a whole plain text file of the personal information of the teacher;
the module M103: dividing the text into sentences according to Chinese full stops and a maximum sentence length threshold;
the module M104: and constructing an entity relation extraction data set by the labeled text.
9. The system for extracting scholars' information relationship based on boundary and segment classification as claimed in claim 7, wherein the module M2 includes the following modules:
module M201: constructing an entity library through the entity recognition model BERT-CRF, wherein the entity categories comprise time, institution, person name, school/college, academic degree, position, and title;
the module M202: carrying out entity recognition on the sentence obtained in the module M103 through an entity recognition model BERT-CRF;
module M203: and performing similar entity replacement on the entity of each sentence obtained by the module M202 to amplify the data set, wherein the replaced entity is from the entity library constructed by the module M201.
10. The system for extracting scholars' information relationship based on boundary and segment classification as claimed in claim 7, wherein the module M3 includes the following modules:
module M301: embedding the text by using a BERT pre-training model and extracting semantic features;
the module M302: the text matrix T = {t_1, t_2, ..., t_n} is obtained by segmenting the text, where n represents the character length after sentence segmentation and t_n denotes the n-th text token;

module M303: the input embedding vector of each character is the sum X = {x_1, x_2, ..., x_n} of the character embedding, the position embedding, and the character type embedding, which is used as the input of the BERT pre-training model to obtain the encoding vectors H = {h_1, h_2, ..., h_n} containing semantic information, where x_n denotes the n-th summed value and h_n denotes the n-th encoding vector.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110685661.2A CN113468887A (en) 2021-06-21 2021-06-21 Student information relation extraction method and system based on boundary and segment classification


Publications (1)

Publication Number Publication Date
CN113468887A true CN113468887A (en) 2021-10-01

Family

ID=77868803

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110685661.2A Pending CN113468887A (en) 2021-06-21 2021-06-21 Student information relation extraction method and system based on boundary and segment classification

Country Status (1)

Country Link
CN (1) CN113468887A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657135A (en) * 2018-11-13 2019-04-19 华南理工大学 A kind of scholar user neural network based draws a portrait information extraction method and model
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model
CN112215004A (en) * 2020-09-04 2021-01-12 中国电子科技集团公司第二十八研究所 Application method in extraction of text entities of military equipment based on transfer learning
CN112487807A (en) * 2020-12-09 2021-03-12 重庆邮电大学 Text relation extraction method based on expansion gate convolution neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张秋颖等: "基于BERT-BiLSTM-CRF的学者主页信息抽取", 《计算机应用研究》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114239566A (en) * 2021-12-14 2022-03-25 公安部第三研究所 Method, device and processor for realizing two-step Chinese event accurate detection based on information enhancement and computer readable storage medium thereof
CN114239566B (en) * 2021-12-14 2024-04-23 公安部第三研究所 Method, device, processor and computer readable storage medium for realizing accurate detection of two-step Chinese event based on information enhancement
CN114783559A (en) * 2022-06-23 2022-07-22 浙江太美医疗科技股份有限公司 Medical image report information extraction method and device, electronic equipment and storage medium
CN114783559B (en) * 2022-06-23 2022-09-30 浙江太美医疗科技股份有限公司 Medical image report information extraction method and device, electronic equipment and storage medium
CN115510866A (en) * 2022-11-16 2022-12-23 国网江苏省电力有限公司营销服务中心 Knowledge extraction method and system oriented to entity relationship cooperation in electric power field
CN116227483A (en) * 2023-02-10 2023-06-06 南京南瑞信息通信科技有限公司 Word boundary-based Chinese entity extraction method, device and storage medium

Similar Documents

Publication Publication Date Title
CN108182295B (en) Enterprise knowledge graph attribute extraction method and system
CN110019839B (en) Medical knowledge graph construction method and system based on neural network and remote supervision
CN110427623B (en) Semi-structured document knowledge extraction method and device, electronic equipment and storage medium
CN113177124B (en) Method and system for constructing knowledge graph in vertical field
CN113468887A (en) Student information relation extraction method and system based on boundary and segment classification
CN111767732B (en) Document content understanding method and system based on graph attention model
CN106778878B (en) Character relation classification method and device
CN114580424B (en) Labeling method and device for named entity identification of legal document
CN113051356A (en) Open relationship extraction method and device, electronic equipment and storage medium
CN116070602B (en) PDF document intelligent labeling and extracting method
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN111143574A (en) Query and visualization system construction method based on minority culture knowledge graph
CN115952791A (en) Chapter-level event extraction method, device and equipment based on machine reading understanding and storage medium
CN111710428A (en) Biomedical text representation method for modeling global and local context interaction
CN112699685A (en) Named entity recognition method based on label-guided word fusion
CN114756681A (en) Evaluation text fine-grained suggestion mining method based on multi-attention fusion
CN113836306B (en) Composition automatic evaluation method, device and storage medium based on chapter component identification
Tarride et al. A comparative study of information extraction strategies using an attention-based neural network
CN113312918B (en) Word segmentation and capsule network law named entity identification method fusing radical vectors
CN112417155B (en) Court trial query generation method, device and medium based on pointer-generation Seq2Seq model
CN111563374B (en) Personnel social relationship extraction method based on judicial official documents
CN112148879A (en) Computer readable storage medium for automatically labeling code with data structure
CN115659989A (en) Web table abnormal data discovery method based on text semantic mapping relation
CN114661900A (en) Text annotation recommendation method, device, equipment and storage medium
CN115344668A (en) Multi-field and multi-disciplinary science and technology policy resource retrieval method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211001