CN113468887A - Student information relation extraction method and system based on boundary and segment classification - Google Patents

Info

Publication number
CN113468887A
Authority
CN
China
Prior art keywords
entity
text
word
module
boundary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110685661.2A
Other languages
Chinese (zh)
Inventor
曹安蕲
唐果
傅洛伊
王新兵
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202110685661.2A
Publication of CN113468887A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a scholar information relation extraction method based on boundary and segment classification, comprising the following steps: step S1: acquire the personal information and text content of different teachers; step S2: replace entity words in the text with similar entity words to augment the training data; step S3: embed the text with a pre-training model and extract semantic features; step S4: recognize subject word boundaries and classify entity fragments; step S5: recognize object word boundaries and the corresponding relation boundaries and classify entity fragments; step S6: build user portraits from the recognition and classification results. By combining the idea of probabilistic graphs with a half-pointer, half-tagging scheme, the method handles the cases in relation extraction where one subject word corresponds to several object words and where the same entity pair holds different relations. Using boundaries to enhance entity fragment classification reduces the impact of tail-pointer prediction errors and improves the accuracy of entity relation extraction.

Description

Student information relation extraction method and system based on boundary and segment classification
Technical Field
The invention relates to the technical field of machine learning and natural language processing, in particular to a student information relation extraction method and system based on boundary and segment classification.
Background
Information such as a scholar's name, e-mail address, title, personal homepage, education history and work history must be extracted from text. Most scholar information comes from personal homepages or introductory web pages (Baidu Baike entries, school faculty directories), so the sources are few, the data are noisy, and much of the information is redundant. Furthermore, the text obtained by preprocessing the html files of these scholar-information pages differs grammatically from natural text in the traditional sense, which makes hand-written extraction rules hard to apply; extracting the information manually, on the other hand, is labor-intensive and inefficient. It is therefore important to extract information from web-page text with the entity relation extraction techniques of natural language processing.
Entity relation extraction is an important branch of information extraction and comprises two subtasks: entity recognition and relation extraction, i.e., identifying entities in natural text, extracting the relation between each entity pair, and finally forming relation triples <s, p, o>, where s is the subject word (subject), p is the predicate, i.e., the relation (predicate), and o is the object word (object). Entities are concepts in the text such as times, places, people and organizations; relations are the semantic relations between entities.
At present, entity relation extraction mainly uses neural network models, realized in two ways: 1. the pipeline form; 2. the joint extraction form. In the pipeline method, relation extraction between entities is carried out directly after entity recognition finishes. Pipeline learning is easy to implement and the two models are flexible (the entity recognition model and the relation extraction model can even use independent data sets), but errors of the entity recognition model degrade the relation extraction model, producing error propagation; the entity recognition model also yields many redundant entities, which increases the difficulty and complexity of the subsequent relation extraction task; and pipeline learning ignores the connection between the two tasks. Researchers have therefore fused named entity recognition and relation extraction into one task for joint learning. Joint learning relieves error propagation to a certain degree, and fusing the two tasks into one model increases the efficiency of learning and prediction and improves the robustness of the model. In recent years, the Transformer model has greatly improved the precision of natural language processing tasks in different fields, and fine-tuning a Transformer pre-training model reduces the amount of data needed to train a model.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a student information relation extraction method and system based on boundary and segment classification.
The invention provides a scholar information relation extraction method based on boundary and segment classification, which comprises the following steps:
step S1: acquiring personal information and text contents of different teachers;
step S2: replacing entity words in the text with similar entity words to augment the training data;
step S3: embedding the text by using a pre-training model and extracting semantic features;
step S4: recognizing subject word boundaries and classifying entity fragments;
step S5: recognizing object word boundaries and corresponding relation boundaries and classifying entity fragments;
step S6: constructing user portraits according to the recognition and classification results.
Preferably, the step S1 includes the steps of:
step S101: acquiring text contents of personal information of different teachers from the Internet according to a teacher list;
step S102: extracting text content in the html file obtained in the step S101, deleting html tags, and obtaining a whole plain text file of the personal information of the teacher;
step S103: dividing the text into sentences according to the Chinese full stop and a maximum sentence-length threshold;
step S104: and constructing an entity relation extraction data set by the labeled text.
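The sentence splitting of step S103 (split at the Chinese full stop, then enforce a maximum-length threshold) can be sketched as follows; the function name and the default threshold are illustrative, not taken from the patent.

```python
def split_sentences(text, max_len=128):
    """Split text at Chinese full stops (U+3002); further chop any
    piece longer than max_len so no sentence exceeds the threshold."""
    pieces = [s for s in text.split("\u3002") if s]
    sentences = []
    for p in pieces:
        # hard-wrap overly long pieces at max_len characters
        while len(p) > max_len:
            sentences.append(p[:max_len])
            p = p[max_len:]
        sentences.append(p)
    return sentences
```

A piece that already fits the threshold passes through unchanged; only over-long fragments (e.g. running text without punctuation) are hard-wrapped.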
Preferably, the step S2 includes the steps of:
step S201: constructing an entity library through the entity recognition model BERT-CRF, wherein the entity categories comprise time, organization, name, department, degree, job and title;
step S202: performing entity recognition on the sentence obtained in the step S103 through an entity recognition model BERT-CRF;
step S203: and performing similar entity replacement on the entity of each sentence obtained in the step S202 to amplify the data set, wherein the replaced entity is from the entity library constructed in the step S201.
Preferably, the step S3 includes the steps of:
step S301: embedding the text by using a BERT pre-training model and extracting semantic features;
step S302: segmenting the text yields a text matrix T = {t_1, t_2, ..., t_n}, where n is the number of characters after sentence segmentation and t_n is the n-th text token;
step S303: the input embedding vector of each character is the sum X = {x_1, x_2, ..., x_n} of its character embedding, position embedding and character-type embedding; X is used as the input of the BERT pre-training model to obtain an encoding vector H = {h_1, h_2, ..., h_n} containing semantic information, where x_n is the n-th summed value and h_n is the n-th encoding vector.
Preferably, the step S4 includes the steps of:
step S401: the encoding vectors obtained in step S302 are fed to a Linear layer, and a sigmoid activation function gives the probability that each token is a head or tail pointer:

p_i^{s,start} = σ(W_start · h_i + b_start)
p_i^{s,end} = σ(W_end · h_i + b_end)

where σ is the sigmoid function; p_i^{s,start} and p_i^{s,end} are the probabilities that the i-th character is a head pointer and a tail pointer of a subject word; start and end denote the head and tail pointer, and the subscript s denotes the subject word; W_start and W_end are the trainable weight matrices for predicting the head-pointer and tail-pointer probabilities, and b_start and b_end are the corresponding biases; h_i is the encoding vector of the i-th character obtained in step S303. If a probability exceeds a given threshold, the corresponding token is tagged 1, otherwise 0;
step S402: traverse all tokens whose head pointer is tagged 1; let the position of such a token in the sequence be i_start^s. Find the first tail pointer at or after the head pointer; let its position in the sequence be i_end^s. The resulting subsequence {h_{i_start^s}, ..., h_{i_end^s}} of the encoding vector H from step S303 is denoted H_sub, where the subscript sub denotes the subject word sequence;
step S403: the subject word sequence obtained in step S402 is fed into a bidirectional LSTM to obtain the encoding vector s_sub of the subject word sequence:

s_sub = LSTM(H_sub)

where LSTM denotes a bidirectional long short-term memory recurrent neural network;
step S404: the subject word encoding vector s_sub obtained in step S403 is used as Linear-layer input, and a softmax activation function gives the probability that the subject word sequence is an entity of each class:

P_s = softmax(W^T s_sub + b)

where P_s is the probability that the subject word sequence is an entity, s denotes the subject word, the matrix W ∈ R^{d×k}, d is the subject word encoding dimension, k is the number of entity classes, and b is the bias; if a probability exceeds a given threshold, the corresponding entity class is tagged 1, otherwise 0.
Preferably, the step S5 includes:
step S501: the encoding vector s_sub of each subject word predicted as an entity in step S404 is added to the vector H and fed to a Linear layer, and a sigmoid activation function gives the probability that each token is a head or tail pointer of the object word under the corresponding relation:

p_i^{o,start,r} = σ(W_start^r · (h_i + s_sub) + b_start^r)
p_i^{o,end,r} = σ(W_end^r · (h_i + s_sub) + b_end^r)

where r denotes a subject word-object word relation; p_i^{o,start,r} and p_i^{o,end,r} are the probabilities that the i-th character in the sequence is a head or tail pointer of an object word; W_start^r and W_end^r are the trainable weight matrices for predicting the head and tail pointers of relation r, and b_start^r and b_end^r are the corresponding biases;
step S502: for each relation r, traverse all tokens whose head pointer is tagged 1; let the position of such a token in the sequence be i_start^{o,r}. Find the first tail pointer at or after the head pointer; let its position in the sequence be i_end^{o,r}. This predicts the object word sequence {h_{i_start^{o,r}}, ..., h_{i_end^{o,r}}} of the subject word under relation r, denoted H_obj^r;
step S503: the object word sequence obtained in step S502 is fed into a bidirectional LSTM to obtain the encoding vector of the object word:

s_obj^r = LSTM(H_obj^r)

where the LSTM shares its parameters with the LSTM in step S403, s_obj^r is the encoding vector of the object word of the subject word under relation r, and obj denotes the object word;
step S504: the object word encoding vector s_obj^r obtained in step S503 is used as Linear-layer input, and a softmax activation function gives the probability that the object word sequence is an entity of each class:

P_o = softmax(W^T s_obj^r + b)

where P_o is the probability that the object word sequence is an entity, the matrix W ∈ R^{d×k}, d is the encoding dimension, k is the number of entity classes, and the parameters are shared with step S404; if a probability exceeds a given threshold, the corresponding entity class is tagged 1, otherwise 0.
The invention also provides a scholar information relation extraction system based on boundary and segment classification, which comprises the following modules:
module M1: acquiring personal information and text contents of different teachers;
module M2: replacing entity words in the text with similar entity words to augment the training data;
module M3: embedding the text by using a pre-training model and extracting semantic features;
module M4: recognizing subject word boundaries and classifying entity fragments;
module M5: recognizing object word boundaries and corresponding relation boundaries and classifying entity fragments;
module M6: constructing user portraits according to the recognition and classification results.
Preferably, the module M1 includes the following modules:
a module M101: acquiring text contents of personal information of different teachers from the Internet according to a teacher list;
the module M102: extracting text contents in the html file obtained in the module M101, deleting html tags, and obtaining a whole plain text file of the personal information of the teacher;
the module M103: dividing the text into sentences according to the Chinese full stop and a maximum sentence-length threshold;
the module M104: and constructing an entity relation extraction data set by the labeled text.
Preferably, the module M2 includes the following modules:
module M201: constructing an entity library through the entity recognition model BERT-CRF, wherein the entity categories comprise time, organization, name, department, degree, job and title;
the module M202: carrying out entity recognition on the sentence obtained in the module M103 through an entity recognition model BERT-CRF;
module M203: and performing similar entity replacement on the entity of each sentence obtained by the module M202 to amplify the data set, wherein the replaced entity is from the entity library constructed by the module M201.
Preferably, the module M3 includes the following modules:
module M301: embedding the text by using a BERT pre-training model and extracting semantic features;
the module M302: segmenting the text yields a text matrix T = {t_1, t_2, ..., t_n}, where n is the number of characters after sentence segmentation and t_n is the n-th text token;
module M303: the input embedding vector of each character is the sum X = {x_1, x_2, ..., x_n} of its character embedding, position embedding and character-type embedding; X is used as the input of the BERT pre-training model to obtain an encoding vector H = {h_1, h_2, ..., h_n} containing semantic information, where x_n is the n-th summed value and h_n is the n-th encoding vector.
Compared with the prior art, the invention has the following beneficial effects:
1. the method uses the trained entity recognition model BERT-CRF to construct the entity library and is used for enhancing the data of the relation extraction model, thereby improving the diversity of the data set and reducing the cost of manually constructing the data set;
2. the invention uses the probability map thought to convert the joint probability distribution p (s, p, o) into the conditional probability distribution p (p, o | s) and the half pointer-half labeling mode to solve the problems that one subject word corresponds to a plurality of object words, the subject word and the object word correspond to a plurality of relations and one object word corresponds to a plurality of subject words in the relation extraction;
3. the invention adopts entity segment classification to solve the problem of overlong entity prediction caused by missing tail pointers in the pointer network prediction process.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flow chart of the relationship extraction model prediction of the present invention;
FIG. 2 is a flow chart of data set amplification according to the present invention;
FIG. 3 is a diagram illustrating a segment classification relation extraction model based on boundary enhancement according to the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that it would be obvious to those skilled in the art that various changes and modifications can be made without departing from the spirit of the invention. All falling within the scope of the present invention.
Referring to fig. 1, the invention provides a scholars information relation extraction method based on boundary and segment classification, comprising the following steps:
step S1: acquiring text contents of personal information of different teachers from Baidu encyclopedia and teacher directories; step S1 includes: crawls the learner information, preprocesses the text, and constructs a basic data set. Step S101: acquiring text contents of personal information of different teachers from the Internet according to a teacher list; step S102: extracting text contents (deleting html tags) in the html file obtained in the step S101 to obtain a whole plain text file of the teacher personal information; step S103: dividing the text into sentences according to the Chinese sentence numbers and the longest length threshold of the sentences; step S104: and manually marking the text to construct an entity relation extraction data set. The data format is as follows:
{"text": "2012.12 obtained the agronomic Master academic degree at the college of the institute of Polymer chemistry, university of Mediterranean China.",
 "spo_list": [
   ["agricultural Master degree", 26, "Master degree-time", "2012.12", 0],
   …]}
The "text" field is the text obtained from the web page, and the "spo_list" field records the <s, p, o> triples in the text: each list is one triple, whose first element is the subject word, second element the position of the subject word's first character in the text, third element the relation between the subject word and the object word, fourth element the object word, and last element the position of the object word's first character in the text.
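One record in this format can be loaded and checked with a few lines of Python; the literal values mirror the example record, and the five-element list layout follows the field description above (the abbreviated "text" string is illustrative).

```python
import json

# one hypothetical record in the data format of step S104
record = json.loads("""
{"text": "2012.12 obtained the agronomic Master academic degree ...",
 "spo_list": [["agricultural Master degree", 26, "Master degree-time", "2012.12", 0]]}
""")

for subj, subj_pos, relation, obj, obj_pos in record["spo_list"]:
    # subj/obj are the subject and object words; *_pos are the offsets
    # of their first characters in the "text" field
    triple = (subj, relation, obj)  # the <s, p, o> triple
```

The character offsets let training code locate each subject and object word even when the same surface string occurs more than once in the sentence.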
Step S2: referring to fig. 2, the entity words in the text are replaced with similar entity words to augment the training data. Step S2 includes: constructing an entity library, augmenting the data with the entity library, and dividing the training set and the test set. Step S201: construct an entity library with the entity recognition model BERT-CRF (the entity categories comprise time, organization, name, department, degree, job and title); the format is as follows:
{"time": ["2012-10", "2002", "2012.10", …],
 "degree": ["Bachelor", "Master", "Doctor", …],
 …}

The library comprises the fields time, organization, name, department, degree, job and title; each field records the entities of that category.
Step S202: perform entity recognition on the sentences obtained in step S103 with the entity recognition model BERT-CRF (the entity categories are the same as those of the entity library in step S201); step S203: perform similar-entity replacement on the entities of each sentence obtained in step S202 to augment the data set, the replacement entities coming from the entity library constructed in step S201.
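A minimal sketch of the similar-entity replacement of step S203, assuming a recognizer has already tagged each entity with its category; the library layout follows step S201, but the function name, the sampled values and the fixed seed are illustrative, not from the patent.

```python
import random

# hypothetical entity library in the format of step S201
ENTITY_LIB = {
    "time": ["2012-10", "2002", "2012.10"],
    "degree": ["Bachelor", "Master", "Doctor"],
}

def augment(sentence, entities, lib, rng=random.Random(0)):
    """Replace each recognized entity with a same-category entity
    sampled from the library, producing one augmented sentence."""
    out = sentence
    for surface, category in entities:
        candidates = [e for e in lib.get(category, []) if e != surface]
        if candidates:
            out = out.replace(surface, rng.choice(candidates), 1)
    return out
```

Because the replacement entity always comes from the same category, the gold span labels and relation labels of the augmented sentence stay valid (only the offsets need recomputing).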
Step S3: referring to fig. 3, word embedding is performed on the text and semantic features are extracted with the BERT pre-training model. Step S3 includes: loading a pre-trained BERT model to embed the text and obtain word vectors. Step S301: segmenting the text yields a text matrix T = {t_1, t_2, ..., t_n}, where n is the number of characters after sentence segmentation; the input embedding vector of each character is the sum X = {x_1, x_2, ..., x_n} of its character embedding, position embedding and character-type embedding, which is used as the input of the BERT pre-training model to obtain an encoding vector H = {h_1, h_2, ..., h_n} containing semantic information.
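The input construction of step S301, summing character, position and character-type embeddings before feeding BERT, can be sketched with toy lookup tables; the random tables below stand in for BERT's trained embedding tables, and all sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, max_len, n_types, d = 100, 16, 2, 8  # toy sizes, not BERT's
char_table = rng.normal(size=(vocab, d))
pos_table  = rng.normal(size=(max_len, d))
type_table = rng.normal(size=(n_types, d))

def input_embeddings(char_ids, type_ids):
    """x_i = character emb + position emb + character-type emb."""
    n = len(char_ids)
    X = (char_table[char_ids]
         + pos_table[np.arange(n)]
         + type_table[type_ids])
    return X  # shape (n, d); BERT would map this X to H = {h_1..h_n}
```

In the real model the pre-trained Transformer layers then turn X into the contextual encoding H used by all later steps.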
Step S4: recognize subject word boundaries and classify entity fragments. Step S4 includes: predicting the head and tail pointers of the subject word and classifying the obtained entity sequences.
Step S401: the encoding vectors obtained in step S302 are fed to a Linear layer, and a sigmoid activation function gives the probability that each token is a head or tail pointer:

p_i^{s,start} = σ(W_start · h_i + b_start)
p_i^{s,end} = σ(W_end · h_i + b_end)

where σ is the sigmoid function; p_i^{s,start} and p_i^{s,end} are the probabilities that the i-th character is a head or tail pointer of a subject word; start and end denote the head and tail pointer, and s denotes the subject word; W_start and W_end are the trainable weight matrices for predicting the head- and tail-pointer probabilities, and b_start and b_end are the corresponding biases; h_i is the encoding vector of the i-th character obtained in step S303. If a probability exceeds a given threshold, the corresponding token is tagged 1, otherwise 0.
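The head/tail pointer scoring of step S401 is, in essence, two independent per-character sigmoid classifiers over the encodings h_i; a NumPy sketch under toy dimensions, where the random vectors stand in for the trained W_start, W_end:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d = 8                               # toy encoding dimension
rng = np.random.default_rng(1)
W_start, b_start = rng.normal(size=d), 0.0
W_end,   b_end   = rng.normal(size=d), 0.0

def pointer_probs(H):
    """Per-character probabilities that each position is a
    head pointer / tail pointer of a subject word (step S401)."""
    p_start = sigmoid(H @ W_start + b_start)
    p_end   = sigmoid(H @ W_end   + b_end)
    return p_start, p_end
```

Thresholding these probabilities (tag 1 above the threshold, else 0) yields the binary head/tail sequences consumed by step S402.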
Step S402: traverse all tokens whose head pointer is tagged 1 (let the position of such a token in the sequence be i_start^s), and find the first tail pointer at or after the head pointer (let its position be i_end^s), obtaining the sequence {h_{i_start^s}, ..., h_{i_end^s}} (denoted H_sub). The head-tail pointer prediction uses the binary cross-entropy loss of a pointer network:

L_sub = -(1/n) Σ_{i=1}^{n} Σ_{t ∈ {start, end}} [ y_i^t log p_i^{s,t} + (1 - y_i^t) log(1 - p_i^{s,t}) ]

where L_sub is the subject word boundary loss, y_i^t indicates whether the i-th character is labeled as a head or tail pointer, and p_i^{s,t} is the probability predicted by the model that the i-th character is a head or tail pointer;
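The span decoding of step S402, pairing each head tagged 1 with the first tail tagged 1 at or after it, is a small loop; a sketch with an illustrative function name:

```python
def decode_spans(head_tags, tail_tags):
    """For every head pointer, take the first tail pointer at or
    after it; returns inclusive (start, end) character spans."""
    spans = []
    for i, h in enumerate(head_tags):
        if h != 1:
            continue
        for j in range(i, len(tail_tags)):
            if tail_tags[j] == 1:
                spans.append((i, j))
                break
    return spans
```

A head with no matching tail produces no span, which is exactly the failure mode the entity fragment classification of step S404 is meant to catch.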
Step S403: the subject word sequence obtained in step S402 is fed into a bidirectional LSTM (long short-term memory recurrent neural network) to obtain the encoding vector s_sub of the subject word:

s_sub = LSTM(H_sub)

where LSTM denotes a bidirectional long short-term memory recurrent neural network;
Step S404: the subject word encoding vector s_sub obtained in step S403 is used as Linear-layer input, and a softmax activation function gives the probability that the subject word sequence is an entity of each class:

P_s = softmax(W^T s_sub + b)

where P_s is the probability that the subject word sequence is an entity, s denotes the subject word, the matrix W ∈ R^{d×k}, d is the subject word encoding dimension, k is the number of entity classes, and b is the bias. If a probability exceeds a given threshold, the corresponding entity class is tagged 1, otherwise 0. The entity classification of the subject word uses a cross-entropy loss:

L_sub_cls = -(1/n) Σ_{i=1}^{n} Σ_{j=1}^{k} y_{ij} log P_{ij}

where y_{ij} indicates whether the true entity class of the i-th fragment is j, L_sub_cls is the loss of subject word fragment classification, n is the number of subject word fragments, i indexes the entity fragments, and j indexes the subject word class labels.
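The fragment classification of steps S403-S404 pools the span encoding and applies a softmax over the k entity classes; in this sketch mean pooling stands in for the bidirectional LSTM (a simplifying assumption, not the patent's architecture), and the weights are random stand-ins for the trained W and b.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d, k = 8, 7  # toy encoding dim; 7 entity classes as in step S201
rng = np.random.default_rng(2)
W, b = rng.normal(size=(d, k)), np.zeros(k)

def classify_span(H_sub):
    """Pool the span encoding (mean pooling stands in for the
    bidirectional LSTM of step S403), then classify it into one
    of k entity classes: P_s = softmax(W^T s_sub + b)."""
    s_sub = H_sub.mean(axis=0)      # stand-in for LSTM(H_sub)
    return softmax(W.T @ s_sub + b)
```

Because the classifier sees the whole span rather than single boundary positions, an over-long span caused by a mispredicted tail pointer tends to score low for every entity class and can be rejected by the threshold.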
Step S5: identify the object word boundaries and the corresponding relation boundaries and classify the entity fragments. Step S5 includes: predicting the object word boundaries of each relation and predicting the entity class of the object word.
Step S501: the encoding vector s_sub of each subject word predicted as an entity in step S404 is added to the vector H and fed to a Linear layer, and a sigmoid activation function gives the probability that each token is a head or tail pointer of the object word under the corresponding relation:

p_i^{o,start,r} = σ(W_start^r · (h_i + s_sub) + b_start^r)
p_i^{o,end,r} = σ(W_end^r · (h_i + s_sub) + b_end^r)

where r denotes a subject word-object word relation; p_i^{o,start,r} and p_i^{o,end,r} are the probabilities that the i-th character in the sequence is a head or tail pointer of an object word; W_start^r and W_end^r are the trainable weight matrices for predicting the head and tail pointers of relation r, and b_start^r and b_end^r are the corresponding biases.
Step S502: for each relation r, traverse all tokens whose head pointer is tagged 1 (let the position of such a token in the sequence be i_start^{o,r}), and find the first tail pointer at or after it (position i_end^{o,r}), predicting the object word sequence {h_{i_start^{o,r}}, ..., h_{i_end^{o,r}}} of the subject word under relation r (denoted H_obj^r). The head-tail prediction of the object word uses the binary cross-entropy loss of a pointer network:

L_obj = -(1/n) Σ_{i=1}^{n} Σ_{t ∈ {start, end}} [ y_i^t log p_i^{o,t,r} + (1 - y_i^t) log(1 - p_i^{o,t,r}) ]

where L_obj is the object word boundary loss, y_i^t indicates whether the i-th character is labeled as a head or tail pointer, and p_i^{o,t,r} is the probability predicted by the model;
Step S503: the object word sequence obtained in step S502 is fed into the bidirectional LSTM (long short-term memory recurrent neural network) to obtain the encoding vector of the object word:

s_obj^r = LSTM(H_obj^r)

where the LSTM shares its parameters with the LSTM in step S403, s_obj^r is the encoding vector of the object word of the subject word under relation r, and obj denotes the object word.
Step S504: the object word encoding vector s_obj^r obtained in step S503 is used as Linear-layer input, and a softmax activation function gives the probability that the object word sequence is an entity of each class:

P_o = softmax(W^T s_obj^r + b)

where P_o is the probability that the object word sequence is an entity, the matrix W ∈ R^{d×k}, d is the encoding dimension, k is the number of entity classes, and the parameters are shared with step S404. If a probability exceeds a given threshold, the corresponding entity class is tagged 1, otherwise 0. The entity classification of the object word uses a cross-entropy loss:

L_obj_cls = -(1/n) Σ_{i=1}^{n} Σ_{j=1}^{k} y_{ij} log P_{ij}

where y_{ij} indicates whether the true entity class of the i-th fragment is j, L_obj_cls is the loss of object word fragment classification, n is the number of object word fragments, i indexes the entity fragments, and j indexes the class labels.
Step S505: the loss function of the whole model is the sum of the subject word boundary loss, the object word boundary loss and the two entity classification losses:

L = L_sub + L_obj + L_sub_cls + L_obj_cls
During training, the adaptive learning algorithm Adam is used for parameter optimization.
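The joint objective of step S505 sums the two boundary (binary cross-entropy) losses and the two fragment-classification (cross-entropy) losses; a NumPy sketch under the assumption that each component is averaged over its own predictions (the function names are illustrative):

```python
import numpy as np

def bce(p, y):
    """Binary cross-entropy for the pointer boundaries (S402/S502)."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def ce(P, classes):
    """Cross-entropy for entity fragment classification (S404/S504);
    P is (n, k) class probabilities, classes the true class indices."""
    rows = np.clip(P[np.arange(len(classes)), classes], 1e-7, None)
    return -np.mean(np.log(rows))

def total_loss(sub_ptr, obj_ptr, sub_cls, obj_cls):
    """L = L_sub + L_obj + L_sub_cls + L_obj_cls, as in step S505."""
    return (bce(*sub_ptr) + bce(*obj_ptr)
            + ce(*sub_cls) + ce(*obj_cls))
```

An optimizer such as Adam would then minimize this single scalar, so the boundary and classification heads are trained jointly.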
Firstly, augmenting the data set with the entity recognition model BERT-CRF greatly increases the diversity of the data set, reduces labor cost and improves efficiency. Secondly, the invention decomposes the joint probability of the triple into conditional probabilities with the probabilistic-graph idea, and uses the half-pointer, half-tagging scheme to handle the cases in relation extraction where one subject word corresponds to several object words, one object word corresponds to several subject words, and a subject word and object word correspond to several relations. Finally, entity fragment classification alleviates the over-long entity predictions caused by mispredicted tail pointers in the pointer network. Meanwhile, boundary recognition focuses on the context of an entity while entity classification focuses on the entity itself, and learning the two jointly improves the prediction accuracy of the model. The scholar portraits constructed through entity and relation extraction help users follow scholar information more efficiently, and the accurate, concise information also allows the education or work migration graph of scholars to be analyzed. The work can also be transferred conveniently to information extraction in other fields.
Step S6: and performing user portrayal according to the recognition and classification results.
The invention also provides a scholar information relation extraction system based on boundary and segment classification, which comprises the following modules:

Module M1: acquiring personal information and text content of different teachers from Baidu Encyclopedia and classroom directories;

Module M101: acquiring the text content of different teachers' personal information from the Internet according to a teacher list;

Module M102: extracting the text content of the html files obtained by module M101 and deleting the html tags to obtain plain text files of the teachers' personal information;

Module M103: dividing the text into sentences according to Chinese full stops and a maximum sentence length threshold;

Module M104: constructing an entity relation extraction data set from the labeled text.
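Module M103's sentence division can be sketched in a few lines; the 128-character threshold below is an illustrative assumption, not a value given in the patent:

```python
import re

def split_sentences(text, max_len=128):
    """Split Chinese text on the full-width period '。' (keeping the
    delimiter with its sentence), then further chunk any sentence longer
    than max_len characters."""
    pieces = [s for s in re.split(r'(?<=。)', text) if s]
    out = []
    for s in pieces:
        while len(s) > max_len:
            out.append(s[:max_len])
            s = s[max_len:]
        out.append(s)
    return out
```

The zero-width lookbehind `(?<=。)` splits immediately after each full stop, so no characters of the original text are lost.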
Module M2: performing similar entity word replacement on entity words in the text to augment the training data;

Module M201: constructing an entity library through the entity recognition model BERT-CRF, where entity categories include time, institution, person name, school/college, academic degree, position, and title;

Module M202: performing entity recognition on the sentences obtained by module M103 through the entity recognition model BERT-CRF;

Module M203: performing similar entity replacement on the entities of each sentence obtained by module M202 to augment the data set, where the replacement entities come from the entity library constructed by module M201.
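A minimal sketch of module M203's augmentation, assuming an entity library keyed by category (the category names and entries below are illustrative) and entity spans already produced by a recognizer such as BERT-CRF:

```python
import random

# Hypothetical entity library keyed by category, as module M201 would
# build with a BERT-CRF recognizer; entries here are illustrative only.
ENTITY_LIB = {
    "ORG": ["上海交通大学", "清华大学", "北京大学"],
    "TITLE": ["教授", "副教授", "讲师"],
}

def augment(sentence, spans, rng=None):
    """Replace each recognized entity span (start, end, category) with a
    same-category entity drawn from the library, yielding a new sentence."""
    rng = rng or random.Random(0)
    out, prev = [], 0
    for start, end, cat in sorted(spans):
        out.append(sentence[prev:start])      # text before the entity
        out.append(rng.choice(ENTITY_LIB[cat]))  # same-category substitute
        prev = end
    out.append(sentence[prev:])               # trailing text
    return "".join(out)
```

Because only same-category substitutions are made, the boundary and type labels of the original annotation remain valid for the augmented sentence.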
Module M3: embedding the text by using a BERT pre-training model and extracting semantic features;

Module M301: the text matrix T = {t_1, t_2, ..., t_n} is obtained by segmenting the text, where n represents the character length after sentence segmentation;

Module M302: the input embedding vector of each character is the sum X = {x_1, x_2, ..., x_n} of the character embedding, the position embedding, and the character type embedding, which is used as the input of the BERT pre-training model to obtain the encoding vectors H = {h_1, h_2, ..., h_n} containing semantic information.
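The input representation of module M302 — character embedding plus position embedding plus character type embedding — can be sketched with random tables standing in for the pretrained BERT parameters (all dimensions below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, n_types, dim = 100, 32, 2, 16

# Illustrative embedding tables; in the patent these come from pretrained BERT.
tok_emb = rng.normal(size=(vocab_size, dim))
pos_emb = rng.normal(size=(max_len, dim))
type_emb = rng.normal(size=(n_types, dim))

def input_embeddings(token_ids, type_ids):
    """x_i = character embedding + position embedding + character type
    embedding — the X = {x_1, ..., x_n} fed to the BERT encoder."""
    n = len(token_ids)
    return tok_emb[token_ids] + pos_emb[np.arange(n)] + type_emb[type_ids]
```

The encoder then maps X to the contextual vectors H = {h_1, ..., h_n} used by the pointer and classification heads.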
Module M4: recognizing subject word boundaries and classifying entity fragments;
module M5: and identifying object word boundaries and corresponding relation boundaries and classifying entity fragments.
Module M6: and performing user portrayal according to the recognition and classification results.
The method uses the trained entity recognition model BERT-CRF to construct an entity library for data augmentation of the relation extraction model, improving data set diversity and reducing the cost of manual data set construction; the invention uses the probability graph idea to convert the joint probability distribution p(s, p, o) into the conditional probability distribution p(p, o | s), and uses the half pointer-half label scheme to solve the problems in relation extraction where one subject word corresponds to multiple object words, a subject word and object word pair corresponds to multiple relations, and one object word corresponds to multiple subject words; the invention adopts entity segment classification to solve the problem of overlong predicted entities caused by missed tail pointers in pointer network prediction.
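The half pointer-half label decoding summarized above — pair each above-threshold head pointer with the first above-threshold tail pointer at or after it — can be sketched as:

```python
def decode_spans(start_p, end_p, threshold=0.5):
    """Half-pointer decoding: every position whose head-pointer probability
    exceeds the threshold starts a span; it is paired with the first
    tail-pointer position at or after it."""
    starts = [i for i, p in enumerate(start_p) if p > threshold]
    ends = [i for i, p in enumerate(end_p) if p > threshold]
    spans = []
    for s in starts:
        tail = [e for e in ends if e >= s]
        if tail:
            spans.append((s, tail[0]))
    return spans
```

Because each relation r has its own pointer sequences, running this decoder once per relation lets one subject word yield multiple object words under multiple relations; the subsequent entity segment classification then filters spans whose tail pointer was decoded too far to the right.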
Those skilled in the art will appreciate that, in addition to implementing the system and its various devices, modules, units provided by the present invention as pure computer readable program code, the system and its various devices, modules, units provided by the present invention can be fully implemented by logically programming method steps in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system and various devices, modules and units thereof provided by the invention can be regarded as a hardware component, and the devices, modules and units included in the system for realizing various functions can also be regarded as structures in the hardware component; means, modules, units for performing the various functions may also be regarded as structures within both software modules and hardware components for performing the method.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. A scholar information relation extraction method based on boundary and segment classification, characterized by comprising the following steps:
step S1: acquiring personal information and text contents of different teachers;
step S2: performing similar entity word replacement on entity words in the text to augment the training data;
step S3: embedding the text by using a pre-training model and extracting semantic features;
step S4: recognizing subject word boundaries and classifying entity fragments;
step S5: recognizing object word boundaries and corresponding relation boundaries and classifying entity fragments;
step S6: and performing user portrayal according to the recognition and classification results.
2. The method for extracting scholars' information relationship based on boundary and segment classification as claimed in claim 1, wherein said step S1 includes the following steps:
step S101: acquiring text contents of personal information of different teachers from the Internet according to a teacher list;
step S102: extracting text content in the html file obtained in the step S101, deleting html tags, and obtaining a whole plain text file of the personal information of the teacher;
step S103: dividing the text into sentences according to Chinese full stops and a maximum sentence length threshold;
step S104: and constructing an entity relation extraction data set by the labeled text.
3. The method for extracting scholars' information relationship based on boundary and segment classification as claimed in claim 1, wherein said step S2 includes the following steps:
step S201: constructing an entity library through the entity recognition model BERT-CRF, wherein the entity categories comprise time, institution, person name, school/college, academic degree, position, and title;
step S202: performing entity recognition on the sentence obtained in the step S103 through an entity recognition model BERT-CRF;
step S203: and performing similar entity replacement on the entity of each sentence obtained in the step S202 to amplify the data set, wherein the replaced entity is from the entity library constructed in the step S201.
4. The method for extracting scholars' information relationship based on boundary and segment classification as claimed in claim 1, wherein said step S3 includes the following steps:
step S301: embedding the text by using a BERT pre-training model and extracting semantic features;
step S302: the text matrix T = {t_1, t_2, ..., t_n} is obtained by segmenting the text, where n represents the character length after sentence segmentation and t_n denotes the n-th text token;

step S303: the input embedding vector of each character is the sum X = {x_1, x_2, ..., x_n} of the character embedding, the position embedding, and the character type embedding, which is used as the input of the BERT pre-training model to obtain the encoding vectors H = {h_1, h_2, ..., h_n} containing semantic information, where x_n denotes the n-th summed value and h_n denotes the n-th encoding vector.
5. The method for extracting scholars' information relationship based on boundary and segment classification as claimed in claim 1, wherein said step S4 includes the following steps:
step S401: the encoding vectors obtained in step S303 are used as Linear layer input, and a sigmoid activation function is used to obtain the probability that each token is a head or tail pointer, with the following formulas:

P_i^{start_s} = σ(W_start h_i + b_start)

P_i^{end_s} = σ(W_end h_i + b_end)

where σ represents the sigmoid function; P_i^{start_s} and P_i^{end_s} respectively represent the probability that the i-th character is a head pointer of the subject word and the probability that it is a tail pointer of the subject word; start and end respectively denote the head pointer and the tail pointer, and the subscript s denotes the subject word; W_start and W_end are the trainable weight matrices for predicting the head pointer probability and the tail pointer probability, respectively; b_start and b_end are the biases for predicting the head pointer and the tail pointer, respectively; h_i denotes the encoding vector of the i-th character obtained in step S303; if a probability exceeds a certain threshold, the corresponding token is labeled 1, otherwise 0;
step S402: traverse all tokens whose head pointer is labeled 1; denote the position of such a head pointer in the sequence by pos_{start_s}; find the first tail pointer located after the head pointer, and denote its position in the sequence by pos_{end_s}; the resulting subsequence {h_{pos_start_s}, ..., h_{pos_end_s}} is denoted H_sub, where h is the encoding vector from step S303 and the subscript sub denotes the subject word sequence;
step S403: the subject word sequence H_sub obtained in step S402 is fed into a bidirectional LSTM to obtain the encoding vector s_sub of the subject word sequence, with the following formula:

s_sub = LSTM(H_sub)

where LSTM represents a bidirectional long short-term memory recurrent neural network;
step S404: the subject word encoding vector s_sub obtained in step S403 is used as Linear layer input, and a softmax activation function is used to obtain the probability that the subject word sequence is an entity, with the following formula:

P_s = softmax(W^T s_sub + b)

where P_s is the probability that the subject word sequence is an entity, s denotes the subject word, the matrix W ∈ R^{d×k}, d is the subject word encoding dimension, k is the number of entity categories, and b denotes the bias; if the probability exceeds a certain threshold, the corresponding entity class is labeled 1, otherwise 0.
6. The method for extracting scholars' information relationship based on boundary and segment classification as claimed in claim 1, wherein the step S5 includes:
step S501: the encoding vector s_sub of each subject word predicted to have entity class 1 in step S404 is added to the encoding vectors H and used as Linear layer input, and a sigmoid activation function is used to obtain the probability that each token is a head or tail pointer of a corresponding object word, with the following formulas:

P_i^{start_o, r} = σ(W_start^r (h_i + s_sub) + b_start^r)

P_i^{end_o, r} = σ(W_end^r (h_i + s_sub) + b_end^r)

where r denotes the relation between the subject word and the object word; P_i^{start_o, r} and P_i^{end_o, r} represent the probabilities that the i-th character in the sequence is the head pointer and the tail pointer of an object word; W_start^r and W_end^r are the trainable weight matrices predicting the head and tail pointers of relation r; b_start^r and b_end^r are the biases predicting the head and tail pointers of relation r;
step S502: for each relation r, traverse all tokens whose head pointer is labeled 1; denote the position of such a head pointer in the sequence by pos_{start_o}^r; find the first tail pointer located after the head pointer, and denote its position in the sequence by pos_{end_o}^r; the object word sequence {h_{pos_start_o^r}, ..., h_{pos_end_o^r}} predicted for the subject word under relation r is denoted H_obj^r;
Step S503: the object word sequence H_obj^r obtained in step S502 is fed into the bidirectional LSTM to obtain the encoding vector s_obj^r of the object word, with the following formula:

s_obj^r = LSTM(H_obj^r)

where the LSTM shares parameters with the LSTM in step S403; s_obj^r is the encoding vector of the object word corresponding to the subject word under relation r, and obj denotes the object word;
step S504: the object word encoding vector s_obj^r obtained in step S503 is used as Linear layer input, and a softmax activation function is used to obtain the probability that the object word sequence is an entity, with the following formula:

P_o^r = softmax(W^T s_obj^r + b)

where P_o^r is the probability that the object word sequence is an entity, the matrix W ∈ R^{d×k}, d is the encoding dimension, k is the number of entity categories, and the parameters are shared with step S404; if the probability exceeds a certain threshold, the corresponding entity class is labeled 1, otherwise 0.
7. A system for extracting scholars' information relation based on boundary and segment classification is characterized in that the system comprises the following modules:
module M1: acquiring personal information and text contents of different teachers;
module M2: performing similar entity word replacement on entity words in the text to augment the training data;
module M3: embedding the text by using a pre-training model and extracting semantic features;
module M4: recognizing subject word boundaries and classifying entity fragments;
module M5: recognizing object word boundaries and corresponding relation boundaries and classifying entity fragments;
module M6: and performing user portrayal according to the recognition and classification results.
8. The system for extracting scholars' information relationship based on boundary and segment classification as claimed in claim 7, wherein the module M1 includes the following modules:
a module M101: acquiring text contents of personal information of different teachers from the Internet according to a teacher list;
the module M102: extracting text contents in the html file obtained in the module M101, deleting html tags, and obtaining a whole plain text file of the personal information of the teacher;
the module M103: dividing the text into sentences according to Chinese full stops and a maximum sentence length threshold;
the module M104: and constructing an entity relation extraction data set by the labeled text.
9. The system for extracting scholars' information relationship based on boundary and segment classification as claimed in claim 7, wherein the module M2 includes the following modules:
module M201: constructing an entity library through the entity recognition model BERT-CRF, wherein the entity categories comprise time, institution, person name, school/college, academic degree, position, and title;
the module M202: carrying out entity recognition on the sentence obtained in the module M103 through an entity recognition model BERT-CRF;
module M203: and performing similar entity replacement on the entity of each sentence obtained by the module M202 to amplify the data set, wherein the replaced entity is from the entity library constructed by the module M201.
10. The system for extracting scholars' information relationship based on boundary and segment classification as claimed in claim 7, wherein the module M3 includes the following modules:
module M301: embedding the text by using a BERT pre-training model and extracting semantic features;
the module M302: the text matrix T = {t_1, t_2, ..., t_n} is obtained by segmenting the text, where n represents the character length after sentence segmentation and t_n denotes the n-th text token;

module M303: the input embedding vector of each character is the sum X = {x_1, x_2, ..., x_n} of the character embedding, the position embedding, and the character type embedding, which is used as the input of the BERT pre-training model to obtain the encoding vectors H = {h_1, h_2, ..., h_n} containing semantic information, where x_n denotes the n-th summed value and h_n denotes the n-th encoding vector.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110685661.2A CN113468887A (en) 2021-06-21 2021-06-21 Student information relation extraction method and system based on boundary and segment classification


Publications (1)

Publication Number Publication Date
CN113468887A true CN113468887A (en) 2021-10-01

Family

ID=77868803

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110685661.2A Pending CN113468887A (en) 2021-06-21 2021-06-21 Student information relation extraction method and system based on boundary and segment classification

Country Status (1)

Country Link
CN (1) CN113468887A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657135A (en) * 2018-11-13 2019-04-19 华南理工大学 A kind of scholar user neural network based draws a portrait information extraction method and model
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model
CN112215004A (en) * 2020-09-04 2021-01-12 中国电子科技集团公司第二十八研究所 Application method in extraction of text entities of military equipment based on transfer learning
CN112487807A (en) * 2020-12-09 2021-03-12 重庆邮电大学 Text relation extraction method based on expansion gate convolution neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张秋颖等: "基于BERT-BiLSTM-CRF的学者主页信息抽取", 《计算机应用研究》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114239566A (en) * 2021-12-14 2022-03-25 公安部第三研究所 Method, device and processor for realizing two-step Chinese event accurate detection based on information enhancement and computer readable storage medium thereof
CN114239566B (en) * 2021-12-14 2024-04-23 公安部第三研究所 Method, device, processor and computer readable storage medium for realizing accurate detection of two-step Chinese event based on information enhancement
CN114783559A (en) * 2022-06-23 2022-07-22 浙江太美医疗科技股份有限公司 Medical image report information extraction method and device, electronic equipment and storage medium
CN114783559B (en) * 2022-06-23 2022-09-30 浙江太美医疗科技股份有限公司 Medical image report information extraction method and device, electronic equipment and storage medium
CN115510866A (en) * 2022-11-16 2022-12-23 国网江苏省电力有限公司营销服务中心 Knowledge extraction method and system oriented to entity relationship cooperation in electric power field
CN116227483A (en) * 2023-02-10 2023-06-06 南京南瑞信息通信科技有限公司 Word boundary-based Chinese entity extraction method, device and storage medium

Similar Documents

Publication Publication Date Title
CN108182295B (en) Enterprise knowledge graph attribute extraction method and system
CN110019839B (en) Medical knowledge graph construction method and system based on neural network and remote supervision
CN110427623B (en) Semi-structured document knowledge extraction method and device, electronic equipment and storage medium
CN113177124B (en) Method and system for constructing knowledge graph in vertical field
CN113468887A (en) Student information relation extraction method and system based on boundary and segment classification
CN111767732B (en) Document content understanding method and system based on graph attention model
CN106778878B (en) Character relation classification method and device
CN114580424B (en) Labeling method and device for named entity identification of legal document
CN113051356A (en) Open relationship extraction method and device, electronic equipment and storage medium
CN116070602B (en) PDF document intelligent labeling and extracting method
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN111143574A (en) Query and visualization system construction method based on minority culture knowledge graph
CN115952791A (en) Chapter-level event extraction method, device and equipment based on machine reading understanding and storage medium
CN111710428A (en) Biomedical text representation method for modeling global and local context interaction
CN112699685A (en) Named entity recognition method based on label-guided word fusion
CN114756681A (en) Evaluation text fine-grained suggestion mining method based on multi-attention fusion
CN113836306B (en) Composition automatic evaluation method, device and storage medium based on chapter component identification
Tarride et al. A comparative study of information extraction strategies using an attention-based neural network
CN113312918B (en) Word segmentation and capsule network law named entity identification method fusing radical vectors
CN112417155B (en) Court trial query generation method, device and medium based on pointer-generation Seq2Seq model
CN111563374B (en) Personnel social relationship extraction method based on judicial official documents
CN112148879A (en) Computer readable storage medium for automatically labeling code with data structure
CN115659989A (en) Web table abnormal data discovery method based on text semantic mapping relation
CN114661900A (en) Text annotation recommendation method, device, equipment and storage medium
CN115344668A (en) Multi-field and multi-disciplinary science and technology policy resource retrieval method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211001