CN112101009A - Knowledge-graph-based method for judging the similarity of character relationship frameworks with Dream of Red Mansions - Google Patents


Info

Publication number
CN112101009A
Authority
CN
China
Prior art keywords
layer
cnn
bilstm
character
dream
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011008324.1A
Other languages
Chinese (zh)
Other versions
CN112101009B (en)
Inventor
郑丽敏
吕庆
Current Assignee
China Agricultural University
Original Assignee
China Agricultural University
Priority date
Filing date
Publication date
Application filed by China Agricultural University
Priority to CN202011008324.1A
Publication of CN112101009A
Application granted; publication of CN112101009B
Legal status: Active

Classifications

    • G06F 40/242: Handling natural language data; lexical tools; dictionaries
    • G06F 16/288: Information retrieval of structured data; relational databases; entity relationship models
    • G06F 16/367: Creation of semantic tools for unstructured textual data; ontology
    • G06F 18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/22: Pattern recognition; matching criteria, e.g. proximity measures
    • G06F 40/216: Natural language analysis; parsing using statistical methods
    • G06F 40/295: Natural language analysis; recognition of textual entities; named entity recognition
    • G06N 3/045: Neural networks; combinations of networks
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Animal Behavior & Ethology (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a knowledge-graph-based method for judging the similarity of a novel's character relationship framework with that of Dream of Red Mansions, comprising the following steps: collecting and processing data; adding an attention mechanism to BERT to obtain WBERT; constructing a named entity recognition model of WBERT + BiLSTM + CNN + attention mechanism + CRF; constructing a relation extraction model of WBERT + dynamic IDCNN + FC; training the named entity recognition model and the relation extraction model to obtain optimal models and using them to extract Dream of Red Mansions character-relationship triples; numbering each entity according to its occurrence frequency and in/out degree, dividing the entities into levels, and continually updating the entity numbers according to the character relationships; adding Weights (the importance of a relationship) to the triples to form quadruples; storing the quadruples in Neo4j and running an alignment algorithm to fuse entities; extracting the quadruples of a novel to be compared and measuring their similarity against the Dream of Red Mansions quadruples. Compared with comparing two novels sentence by sentence to obtain their similarity, the method obtains similarity by comparing the character relationship frameworks.

Description

Knowledge-graph-based method for judging the similarity of character relationship frameworks with Dream of Red Mansions
Technical Field
The invention relates to a method for judging the similarity between the character relationship framework of a novel and that of Dream of Red Mansions, and in particular to a knowledge-graph-based method for judging the similarity of other novels' character relationship frameworks with that of Dream of Red Mansions.
Background
In recent years, online plagiarism incidents have been frequent. They seriously dampen the drive and enthusiasm for literary innovation, crowd out values such as integrity and originality, and harm a nation's cultural creativity.
As one of China's Four Great Classical Novels, Dream of Red Mansions is a literary masterpiece and has become the target of many plagiarists. Plagiarists modify the character relationship framework of Dream of Red Mansions; some even change only a single name. By comparing the character relationship framework of such a novel with that of Dream of Red Mansions and analysing their degree of similarity, the similarity of the character relationship structures can be judged.
With the rapid development of machine learning, NLP techniques are applied in more and more fields. To detect a novel that copies the framework of Dream of Red Mansions, a knowledge graph of the Dream of Red Mansions character relationships must first be constructed, and the input text is then compared with that character relationship framework for similarity. In this process, data must be collected, a character-relationship dictionary built, and data labelled to obtain training data for a named entity recognition model and a relation extraction model; the key lies in building the named entity recognition and relation extraction models, after which the knowledge graph is constructed and the novel to be compared is measured against the Dream of Red Mansions character framework.
However, extracting names and relations from a new text has low accuracy in the absence of a large amount of training data; the accuracy of named entity recognition and relation extraction models in this specific domain can be further improved; and traditional text-similarity comparison compares the two novels sentence by sentence, which cannot yield the similarity of their character relationship frameworks.
Disclosure of Invention
The invention aims to construct a knowledge graph from the character relationship framework of Dream of Red Mansions, provides a new entity-relation extraction method that can extract the entity-relation framework of an untrained novel, and then judges the degree of plagiarism with a self-defined similarity comparison method, comprising the following steps:
1. Gathering data
Collect the characters, relationships and main places of Dream of Red Mansions, sort and integrate data from multiple sources, and check for omissions to obtain relatively comprehensive data. Collect common surnames and the first character of names that occur frequently in the novel; if the first character of a frequent name is not among the common surnames, add it.
2. Data processing
(1) Construct a character dictionary for the sorted Dream of Red Mansions characters, places and newly added surnames, specifically: build the dictionary with character + PER label and place + LOC label entries; add B-PER labels to the surnames among the new family names and add them to the dictionary; write Python code that matches the dictionary and converts the full Dream of Red Mansions txt file into a txt file in standard BIO form; split the BIO file into a training set and a test set at a 7:3 ratio, and use k-fold splitting to take different parts of the training set as validation sets;
(2) add an "unknown" relation to the collected Dream of Red Mansions relations and construct a relation dictionary in number + relation form; label sentences with a number + sentence method according to the relation dictionary, with person names in the sentences represented by masks and wildcards; split the labelled data set into a training set and a test set at an 8:2 ratio, and use k-fold splitting to take different parts of the training set as validation sets;
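The dictionary-matching conversion to BIO form in step (1) can be sketched as follows (a minimal illustration, not the patent's actual code; the greedy longest-match strategy, the 6-character cap and the sample dictionary entries are assumptions):

```python
# Minimal sketch of converting raw text to BIO labels by longest-match
# lookup in a character/location dictionary, per step 2(1).
def bio_tag(text, dictionary):
    """dictionary maps entity string -> type ('PER' or 'LOC')."""
    tags = ["O"] * len(text)
    i = 0
    while i < len(text):
        match = None
        # try the longest dictionary entry starting at position i (cap: 6 chars)
        for length in range(min(len(text) - i, 6), 0, -1):
            cand = text[i:i + length]
            if cand in dictionary:
                match = (cand, dictionary[cand])
                break
        if match:
            name, etype = match
            tags[i] = f"B-{etype}"                # entity beginning
            for j in range(i + 1, i + len(name)):
                tags[j] = f"I-{etype}"            # entity continuation
            i += len(name)
        else:
            i += 1                                # non-entity character -> O
    return list(zip(text, tags))

pairs = bio_tag("贾宝玉来到大观园", {"贾宝玉": "PER", "大观园": "LOC"})
```

Each output pair is one character with its BIO tag, matching the one-token-per-line BIO txt format the patent describes.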
3. Building models
(1) Constructing the WBERT model: experiments show that each layer of BERT understands the text differently, so the BERT model is fine-tuned as follows;
1) give the representation produced by each of BERT's 12 Transformer layers a weight, initialised as a_i = Dense_{unit=1}(represent_i), where a_i denotes the initial weight of the i-th layer, Dense denotes a fully connected layer, represent_i is the output of the i-th layer, and unit = 1 means the vector is finally reduced to one dimension, yielding the 12 initialisation weights a_1 to a_12;
2) determine the weight values by training, and compare the 12 weights a_1 to a_12 to obtain the largest weight value, denoted a_0;
3) pass each weighted output a_i(represent_i) (i ≠ 0, where a_i is the weight of the i-th layer and represent_i its output) through one max-pooling layer with a 3 × 3 × 768 kernel;
4) concatenate a_0(represent_0) (where a_0 is the largest of a_1 to a_12 and represent_0 the corresponding output) with the pooled vector;
5) reduce the concatenated vector from step 4) to 512 dimensions through one fully connected layer: output = Dense_{unit=512}(concat), where output is the final output, Dense is a fully connected layer, and unit = 512 means the vector is finally reduced to 512 dimensions;
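The five WBERT steps above can be sketched numerically (a simplified NumPy illustration with assumed shapes; the scalar layer weight here is computed from a mean-pooled layer output, and the 3 × 3 × 768 pooling kernel is approximated by an element-wise max across the remaining weighted layers):

```python
import numpy as np

# Illustrative sketch of the WBERT fusion in step 3(1): each of the 12
# Transformer layer outputs gets a scalar weight from a 1-unit dense layer;
# the highest-weighted layer's output is concatenated with a max-pooling of
# the remaining weighted outputs, then projected to 512 dimensions.
rng = np.random.default_rng(0)
seq_len, hidden = 8, 768
layers = [rng.standard_normal((seq_len, hidden)) for _ in range(12)]

w_dense = rng.standard_normal(hidden)                 # 1-unit dense: hidden -> scalar
weights = np.array([l.mean(axis=0) @ w_dense for l in layers])  # a_1 .. a_12

best = int(np.argmax(weights))                        # layer with largest weight (a_0)
rest = np.stack([weights[i] * layers[i] for i in range(12) if i != best])
pooled = rest.max(axis=0)                             # simplified max-pooling

concat = np.concatenate([weights[best] * layers[best], pooled], axis=1)  # (seq, 1536)
proj = rng.standard_normal((concat.shape[1], 512))    # final fully connected layer
output = concat @ proj                                # (seq, 512) representation
```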
(2) constructing the named entity recognition model:
1) the input part is WBERT (the model obtained by fine-tuning BERT in step 3(1)), whose encoding of the input sequence is concatenated with the output of the named entity recognition model; that output is converted through an argmax function into a fixed-dimension sequence of the same length as the input sequence;
2) process the BIO text (the training set from step 2(1)) with WBERT to obtain word-vector encodings;
3) feed the word vectors from step 2) into a CNN and a BiLSTM in parallel, the CNN extracting local features and the BiLSTM extracting global features; since some texts are better represented by local features and others by global features, the features extracted by the CNN and BiLSTM are each given a weight, initialised as a_{CNN/BiLSTM} = Dense_{unit=1}(represent_{CNN/BiLSTM}), where a_{CNN/BiLSTM} denotes the initial weight of the CNN/BiLSTM branch, Dense denotes a fully connected layer, represent_{CNN/BiLSTM} is the output of the CNN/BiLSTM layer, and unit = 1 means the vector is finally reduced to one dimension;
4) determine the weight values by training, and max-pool a_{CNN}(represent_{CNN}) and a_{BiLSTM}(represent_{BiLSTM}) separately with a pooling layer whose kernel size is 3 × 3 × 512 (where a_{CNN/BiLSTM} is the weight after training and represent_{CNN/BiLSTM} the output of the corresponding layer);
5) concatenate the pooling-layer outputs from step 4);
6) a CRF layer adds constraints to the final predicted labels to ensure they are legal: the highest-scoring (most probable) sequence is not obtained by taking the label with the maximum probability at every position independently; the transition probabilities must also be considered so that the output obeys the labelling rules (e.g. B-PER cannot be followed by I-LOC, where B-PER marks the beginning of a person name, which must be followed by the rest of that name, and I-LOC marks the inside of a place name). For example, position-wise argmax might output the sequence (I-L, I-P, O, I-L, I-P), where I-L marks the inside of a place name, I-P the inside of a person name, and O an irrelevant character; but because the probability of the transition O -> I-P in the transition matrix is very small or even negative, such a sequence does not obtain the highest overall score (probability), i.e. it is not the desired sequence. To enforce this, a CRF layer is added after the concatenation layer;
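The role of the CRF constraints in step 6) can be illustrated with a toy transition matrix and Viterbi decoder (labels, emission scores and the forbidden transitions are invented for the example; a real CRF layer learns its transition matrix during training):

```python
import numpy as np

# Toy illustration of the CRF constraint: a transition matrix forbids illegal
# label bigrams (e.g. B-PER followed by I-LOC), so Viterbi decoding never
# emits them even when the per-position scores alone would prefer them.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
L = len(labels)
trans = np.zeros((L, L))
trans[labels.index("B-PER"), labels.index("I-LOC")] = -1e9  # B-PER -/-> I-LOC
trans[labels.index("O"), labels.index("I-PER")] = -1e9      # I-* cannot follow O
trans[labels.index("O"), labels.index("I-LOC")] = -1e9

def viterbi(emissions, trans):
    T, L = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        total = score[:, None] + trans + emissions[t][None, :]
        back[t] = total.argmax(axis=0)      # best previous label per current label
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [labels[i] for i in reversed(path)]

# emissions at position 1 prefer the illegal I-LOC; the CRF overrides it
em = np.array([[0.0, 5.0, 0.0, 0.0, 0.0],   # position 0: B-PER
               [0.0, 0.0, 3.0, 0.0, 4.0]])  # position 1: I-LOC scores highest alone
best_path = viterbi(em, trans)
```

Position-wise argmax would yield (B-PER, I-LOC); with the transition constraints the decoder returns the legal sequence (B-PER, I-PER).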
(3) constructing the relation extraction model:
1) concatenate the WBERT encoding of the input sequence with the output of the named entity recognition model: process the training data set (obtained in step 2(2)) with WBERT to obtain a feature sequence, then concatenate it with the output of the named entity recognition model (the model of step 3(2)), which is converted through an argmax function into a fixed-dimension sequence of the same length as the input sequence;
2) extract features with a dynamic IDCNN layer, which treats the dilation coefficient of the IDCNN as a variable and obtains its optimal value through training; on the basis of a CNN, the IDCNN widens the receptive field of feature extraction through dilated convolution, and unlike a CNN layer, which concatenates extracted features and then pools them, the IDCNN needs no pooling operation, reducing feature loss; however, the dilation coefficient of the IDCNN has different effects at different values on different texts, so the initial dilation value is set to i = 1 (at i = 1 it is equivalent to CNN feature extraction), a loop i = i + 1 is run, and the optimal i is found through training;
3) an FC layer concatenates the local features;
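The dynamic-dilation idea in step 2) can be sketched as follows (an illustrative 1-D dilated convolution, with an assumed variance-based selection score standing in for the patent's training-driven search):

```python
import numpy as np

# Sketch of the "dynamic IDCNN" idea: a 1-D dilated convolution whose
# dilation i is a searchable hyperparameter; i = 1 reduces to an ordinary
# CNN, larger i widens the receptive field without any pooling.
def dilated_conv1d(x, kernel, dilation):
    """x: (T, d) sequence; kernel: (k, d); returns a (T,) feature map."""
    T, k = x.shape[0], kernel.shape[0]
    span = (k - 1) * dilation                    # receptive-field width - 1
    out = np.zeros(T)
    for t in range(T - span):
        taps = x[t : t + span + 1 : dilation]    # k taps, `dilation` apart
        out[t] = float((taps * kernel).sum())
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal((20, 16))
kernel = rng.standard_normal((3, 16))

# caricature of the search loop i = i + 1: try dilations 1..4 and keep the
# best under an illustrative score (the patent selects i via training)
scores = {i: dilated_conv1d(x, kernel, i).var() for i in range(1, 5)}
best_i = max(scores, key=scores.get)
```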
4. Constructing the Dream of Red Mansions knowledge graph
(1) Take the full-text txt of Dream of Red Mansions as input to the named entity recognition model and the relation extraction model to extract characters, places and relations, obtaining character-relationship triples;
(2) number the entities according to their occurrence frequency and in/out degree, divide them into levels 1 to 5, and continually update the entity numbers according to the character relationships; on the basis of the triples, add Weights (the importance of a character relationship, determined by the levels of the two entities in the triple) to form quadruples:
1) define character importance by frequency and in/out degree, and number the characters from 1 to n by importance (n is determined by the number of extracted entities);
2) assign each character an importance parameter; the characters numbered 1 to n receive parameters n down to 1 respectively;
3) divide the characters numbered 1 to n into levels 1 to 5 in the ratio 1:2:3:4:5;
4) increase each character's importance parameter according to the character-relationship triples: being related to a character of level 1 to 5 adds 5 down to 1 to the original parameter; for example, if entity 1 has importance parameter n and a triple shows that entity 1 is related to entity 2, which is a level-1 character, then 5 is added to entity 1's parameter n;
5) re-sort the characters by importance parameter;
6) repeat steps 2) to 5) until the character numbering no longer changes;
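Steps 1) to 6) above can be sketched in pure Python (toy entities and triples; the 1:2:3:4:5 level split and the +5 to +1 bonuses follow the text, while the iteration cap is an added safeguard, since the text only requires looping until the numbering is stable):

```python
# Pure-Python sketch of the iterative entity-ranking loop of step 4(2).
def rank_entities(entities, triples, max_iter=50):
    order = list(entities)                       # initial order = importance order
    n = len(order)
    # cumulative level boundaries for a 1:2:3:4:5 split of n entities
    bounds = [sum(range(1, k + 1)) * n // 15 for k in range(1, 6)]
    for _ in range(max_iter):
        level = {}
        for rank, e in enumerate(order):
            level[e] = next(k + 1 for k, b in enumerate(bounds) if rank < b)
        score = {e: n - r for r, e in enumerate(order)}   # base parameters n .. 1
        for head, _rel, tail in triples:
            score[head] += 6 - level[tail]       # partner at level k adds 6 - k
            score[tail] += 6 - level[head]
        new_order = sorted(order, key=lambda e: -score[e])
        if new_order == order:
            break                                # numbering unchanged: converged
        order = new_order
    return order

ents = [f"e{i}" for i in range(15)]
trips = [("e14", "friend", "e0"), ("e13", "cousin", "e1")]
ranking = rank_entities(ents, trips)
```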
(3) storing the obtained quadruple into an NEO4J graph database, compiling an alignment algorithm, setting a threshold value to be 70%, and performing entity fusion if the similarity is greater than 70% (the similarity threshold value can be set according to the requirement);
5. Similarity discrimination
(1) For the novel to be compared, extract the characters, their related places, and the relations with the trained named entity recognition model and relation extraction model (obtained by feeding data to and training the models constructed in steps 3(2) and 3(3) respectively);
(2) number the entities according to their occurrence frequency and in/out degree, divide them into levels 1 to 5, and continually update the entity numbers according to the character relationships; add Weights to the triples to form quadruples (the entity numbers are obtained by the same procedure as in step 4(2); Weights are the importance of a character relationship, determined by the levels of the two entities in the triple);
(3) find the quadruples in which each entity occurs, and from the entity levels and the Weights of all those quadruples compare the framework-relation similarity, obtaining a similarity percentage between 0% and 100%.
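One way to realise the 0%-100% comparison is a weight-normalised overlap of quadruples (an assumed formula for illustration; the patent leaves the exact computation unspecified, and the sample relations are invented):

```python
# Illustrative comparison of two relationship frames stored as quadruples
# (head, relation, tail, weight): matched quadruples contribute their weight,
# and the score is normalised to a 0-100% similarity percentage.
def frame_similarity(quads_a, quads_b):
    index_b = {(h, r, t): w for h, r, t, w in quads_b}
    total = sum(w for _, _, _, w in quads_a) or 1
    matched = sum(w for h, r, t, w in quads_a if (h, r, t) in index_b)
    return round(100.0 * matched / total, 1)

reference = [("A", "mother", "B", 9), ("A", "servant", "C", 4), ("B", "cousin", "D", 6)]
suspect   = [("A", "mother", "B", 9), ("B", "cousin", "D", 6)]
sim = frame_similarity(reference, suspect)   # 15 of 19 weighted relations match
```

Weighting by relationship importance makes a copied core framework (high-weight relations between high-level characters) count far more than coincidental minor relations.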
It is worth noting that the method can be used to detect plagiarism of Dream of Red Mansions and, with slight modification, can be applied to detecting plagiarism of the frameworks of other novels.
Drawings
The invention may be better understood by reference to the following detailed description taken in conjunction with the accompanying drawings, which are incorporated in and form a part of this specification.
FIG. 1 is a flow chart of the knowledge-graph-based method for judging the similarity of character relationship frameworks with Dream of Red Mansions;
FIG. 2 is a diagram of the WBERT model;
FIG. 3 is a diagram of the named entity recognition model;
FIG. 4 is a diagram of the relation extraction model.
Detailed Description
Embodiments of the present invention will be further described with reference to the accompanying drawings.
FIG. 1 is the flow chart of the knowledge-graph-based method for judging the similarity of character relationship frameworks with Dream of Red Mansions, which is explained as follows:
The knowledge-graph-based method for judging the similarity of character relationship frameworks with Dream of Red Mansions mainly comprises the following five parts: gathering data, processing data, building models, constructing the Dream of Red Mansions knowledge graph, and discriminating similarity;
1. Gathering data
(1) Collect the characters, relationships and main places of Dream of Red Mansions, sort and integrate data from multiple sources, and check for omissions to obtain relatively comprehensive data that can express the Dream of Red Mansions character relationship framework;
(2) collect common surnames, and collect the first character of names that occur frequently in the novel; for example, a name whose first character meaning "solitary" is not found among the common surnames yet occurs frequently in the novel, so that single character alone is taken and added to the common surnames.
2. Data processing
(1) Add an "unknown" relation to the collected Dream of Red Mansions relations and construct a relation dictionary in number + relation form; label sentences with a number + sentence method according to the relation dictionary, with person names in the sentences represented by masks and wildcards; split the labelled data set into a training set and a test set at an 8:2 ratio, and use k-fold splitting to take different parts of the training set as validation sets;
(2) construct a character dictionary for the sorted Dream of Red Mansions characters, places and newly added surnames, specifically: build the dictionary with character + PER label and place + LOC label entries; add B-PER labels to the surnames among the new family names and add them to the dictionary; write Python code that matches the dictionary and converts the full Dream of Red Mansions txt file into a txt file in standard BIO form; split the BIO file into a training set and a test set at a 7:3 ratio, and use k-fold splitting to take different parts of the training set as validation sets;
3. Building models
(1) Constructing the WBERT model: experiments show that each layer of BERT understands the text differently, so the BERT model is fine-tuned as follows;
1) give the representation produced by each of BERT's 12 Transformer layers a weight, initialised as a_i = Dense_{unit=1}(represent_i), where a_i denotes the initial weight of the i-th layer, Dense denotes a fully connected layer, represent_i is the output of the i-th layer, and unit = 1 means the vector is finally reduced to one dimension, yielding the 12 initialisation weights a_1 to a_12;
2) determine the weight values by training, and compare the 12 weights a_1 to a_12 to obtain the largest weight value, denoted a_0;
3) pass each weighted output a_i(represent_i) (i ≠ 0, where a_i is the weight of the i-th layer and represent_i its output) through one max-pooling layer with a 3 × 3 × 768 kernel;
4) concatenate a_0(represent_0) (where a_0 is the largest of a_1 to a_12 and represent_0 the corresponding output) with the pooled vector;
5) reduce the concatenated vector from step 4) to 512 dimensions through one fully connected layer: output = Dense_{unit=512}(concat), where output is the final output, Dense is a fully connected layer, and unit = 512 means the vector is finally reduced to 512 dimensions;
(2) constructing the named entity recognition model:
1) the input part is WBERT, whose encoding of the input sequence is concatenated with the output of the named entity recognition model; that output is converted through an argmax function into a fixed-dimension sequence of the same length as the input sequence;
2) process the BIO text (the training set from step 2(2) in the description of FIG. 1) with WBERT to obtain word-vector encodings (WBERT is obtained by fine-tuning BERT in step 3(1) in the description of FIG. 1);
3) feed the word vectors from step 2) into a CNN and a BiLSTM in parallel, the CNN extracting local features and the BiLSTM extracting global features; since some texts are better represented by local features and others by global features, the features extracted by the CNN and BiLSTM are each given a weight, initialised as a_{CNN/BiLSTM} = Dense_{unit=1}(represent_{CNN/BiLSTM}), where a_{CNN/BiLSTM} denotes the initial weight of the CNN/BiLSTM branch, Dense denotes a fully connected layer, represent_{CNN/BiLSTM} is the output of the CNN/BiLSTM layer, and unit = 1 means the vector is finally reduced to one dimension;
4) determine the weight values by training, and max-pool a_{CNN}(represent_{CNN}) and a_{BiLSTM}(represent_{BiLSTM}) separately with a pooling layer whose kernel size is 3 × 3 × 512 (where a_{CNN/BiLSTM} is the weight after training and represent_{CNN/BiLSTM} the output of the corresponding layer);
5) concatenate the pooling-layer outputs from step 4);
6) a CRF layer adds constraints to the final predicted labels to ensure they are legal: the highest-scoring (most probable) sequence is not obtained by taking the label with the maximum probability at every position independently; the transition probabilities must also be considered so that the output obeys the labelling rules (e.g. B-PER cannot be followed by I-LOC, where B-PER marks the beginning of a person name, which must be followed by the rest of that name, and I-LOC marks the inside of a place name). For example, position-wise argmax might output the sequence (I-L, I-P, O, I-L, I-P), where I-L marks the inside of a place name, I-P the inside of a person name, and O an irrelevant character; but because the probability of the transition O -> I-P in the transition matrix is very small or even negative, such a sequence does not obtain the highest overall score (probability), i.e. it is not the desired sequence. To enforce this, a CRF layer is added after the concatenation layer;
(3) constructing the relation extraction model
1) Concatenate the WBERT encoding of the input sequence with the output of the named entity recognition model: process the training data set (obtained in step 2(1) in the description of FIG. 1; WBERT is obtained by fine-tuning BERT in step 3(1) in the description of FIG. 1) with WBERT to obtain a feature sequence, then concatenate it with the output of the named entity recognition model, which is converted through an argmax function into a fixed-dimension sequence of the same length as the input sequence;
2) extract features with a dynamic IDCNN layer, which treats the dilation coefficient of the IDCNN as a variable and obtains its optimal value through training; on the basis of a CNN, the IDCNN widens the receptive field of feature extraction through dilated convolution, and unlike a CNN layer, which concatenates extracted features and then pools them, the IDCNN needs no pooling operation, reducing feature loss; however, the dilation coefficient of the IDCNN has different effects at different values on different texts, so the initial dilation value is set to i = 1 (at i = 1 it is equivalent to CNN feature extraction), a loop i = i + 1 is run, and the optimal i is found through training;
3) an FC layer concatenates the local features.
4. Construction of dream of Red mansions knowledge map
(1) Taking the full text of the novel (the dream of Red mansions txt file) as the input of the named entity recognition model and the relation extraction model to extract people, places and relations, obtaining person-relationship triples (the named entity recognition model and the relation extraction model are (2) and (3), respectively, in part 3 of the description of FIG. 1);
(2) Numbering the entities according to their occurrence frequency and degree (the number of relations they participate in), dividing the entities into levels 1 to 5, and continuously updating the entity numbers according to the character relationships; on the basis of the triples, Weights are added to form quadruples (a Weight is the importance of a person-to-person relationship and is determined by the levels of the two entities in the triple):
1) defining the importance of the characters according to their frequency and degree, and numbering the characters from 1 to n in order of importance (n is determined by the number of extracted entities);
2) assigning each character an importance parameter: the characters numbered 1 to n receive importance parameters n down to 1, respectively;
3) dividing the characters numbered 1 to n into levels 1 to 5 in the ratio 1:2:3:4:5;
4) increasing the importance parameters of the characters according to the character-relationship triples: a character related to a character of level 1 to 5 has 5 down to 1, respectively, added to its original parameter; for example, if the importance parameter of entity 1 is n and a triple involving entity 1 shows that it is related to entity 2, where entity 2 is a level-1 character, then the importance parameter of entity 1 becomes n + 5;
5) re-sorting the characters by their importance parameters;
6) repeating 2) to 5) until the character numbers no longer change.
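The iterative numbering procedure in steps 1) to 6) can be sketched as follows; this is a hypothetical minimal implementation, since the patent only fixes the n-to-1 importance parameters, the 1:2:3:4:5 level split, and the +5 to +1 level bonuses. The entity names, degree handling, and tie-breaking below are illustrative assumptions.

```python
def assign_levels(ranked):
    """Split a ranked character list into levels 1-5 in a 1:2:3:4:5 ratio."""
    n = len(ranked)
    bounds = [n * c // 15 for c in (1, 3, 6, 10, 15)]  # cumulative 1:3:6:10:15
    levels, start = {}, 0
    for level, end in enumerate(bounds, start=1):
        for name in ranked[start:end]:
            levels[name] = level
        start = end
    return levels

def rank_characters(frequency, triples, max_iters=100):
    """frequency: {name: occurrence count}; triples: [(head, relation, tail)]."""
    ranked = sorted(frequency, key=frequency.get, reverse=True)
    n = len(ranked)
    for _ in range(max_iters):
        # Step 2): serial numbers 1..n map to importance parameters n..1.
        importance = {name: n - i for i, name in enumerate(ranked)}
        levels = assign_levels(ranked)                      # step 3)
        for head, _, tail in triples:                       # step 4)
            importance[head] += 6 - levels[tail]            # level 1 adds 5, level 5 adds 1
            importance[tail] += 6 - levels[head]
        new_ranked = sorted(importance, key=importance.get, reverse=True)  # step 5)
        if new_ranked == ranked:                            # step 6): numbers unchanged
            return ranked, levels
        ranked = new_ranked
    return ranked, assign_levels(ranked)
```

Each pass re-derives the parameters from the current ordering, so the loop stops exactly when re-sorting leaves the serial numbers unchanged, matching step 6).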
(3) Storing the obtained quadruples in a NEO4J graph database, writing an alignment algorithm, and setting a threshold of 70%: entity fusion is performed when the similarity exceeds 70% (the similarity threshold can be set as required).
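A minimal in-memory sketch of the 70% alignment step; the character-level string similarity used here (difflib's SequenceMatcher) is an illustrative stand-in for the patent's unspecified alignment algorithm, the names are hypothetical variant spellings, and persisting the fused quadruples to NEO4J (e.g. one Cypher MERGE per edge) is omitted.

```python
from difflib import SequenceMatcher

def align_entities(names, threshold=0.7):
    """Map each entity name to a canonical name; a name whose similarity to an
    existing canonical name exceeds the threshold is fused into it."""
    canonical, merged = {}, []
    for name in names:
        for rep in merged:
            if SequenceMatcher(None, name, rep).ratio() > threshold:
                canonical[name] = rep
                break
        else:
            merged.append(name)      # no close match: new canonical entity
            canonical[name] = name
    return canonical

quads = [("Jia Baoyu", "cousin", "Lin Daiyu", 8),
         ("Jia Bao-yu", "master", "Xiren", 6)]   # "Jia Bao-yu": variant spelling
names = [h for h, _, _, _ in quads] + [t for _, _, t, _ in quads]
mapping = align_entities(names)
fused = [(mapping[h], r, mapping[t], w) for h, r, t, w in quads]
```

After fusion, both quadruples refer to the same canonical head entity, which is what makes the downstream frame comparison meaningful.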
5. Similarity discrimination
(1) Using the trained named entity recognition model and relation extraction model to extract the characters, the places related to the characters, and the relations from the novels to be compared (the trained models are obtained by feeding data into, and training, (2) and (3) in part 3 of the description of FIG. 1);
(2) numbering the entities according to their occurrence frequency and degree, dividing them into levels 1 to 5, and continuously updating the entity numbers according to the character relationships; adding Weights on the basis of the triples to form quadruples (the entity numbers are obtained in the same way as in process (2) of part 4 of the description of FIG. 1; a Weight is the importance of a person-to-person relationship and is determined by the levels of the two entities in the triple);
(3) finding the quadruples in which each entity appears, and comparing the entity levels and Weights of all such quadruples against the framework's relationships to obtain a similarity percentage between 0% and 100%.
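Since the exact scoring formula is not given, one plausible reading of step (3) can be sketched: reduce every quadruple to a (head level, relation, tail level) frame edge and report the weighted overlap of the two edge sets as a 0-100% similarity. The reduction and the min/max overlap rule are assumptions for illustration.

```python
def frame_similarity(quads_a, quads_b, levels_a, levels_b):
    """quads: [(head, relation, tail, weight)]; levels: {entity: 1..5}."""
    def edge_weights(quads, levels):
        counts = {}
        for head, rel, tail, w in quads:
            key = (levels[head], rel, levels[tail])   # frame edge pattern
            counts[key] = counts.get(key, 0) + w
        return counts
    ea = edge_weights(quads_a, levels_a)
    eb = edge_weights(quads_b, levels_b)
    shared = sum(min(ea[k], eb[k]) for k in ea.keys() & eb.keys())
    total = max(sum(ea.values()), sum(eb.values()))
    return 100.0 * shared / total if total else 0.0
```

Comparing by level patterns rather than literal names is what lets a different novel with the same relationship framework score highly.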
FIG. 2 is a WBERT model diagram illustrating the improvement to the native BERT model:
constructing the WBERT model: experiments show that each layer of BERT understands the text differently, so the BERT model is fine-tuned;
1. each of the representations generated by BERT's 12 transformer layers is given a weight, initialized as a_i = Dense_unit=1(represent_i), where a_i denotes the initial weight of the i-th layer, Dense denotes a fully connected layer, represent_i denotes the output of the i-th layer, and unit = 1 means the vector is finally reduced to one dimension, yielding the 12 initialization weights a_1 to a_12;
2. the weight values are determined through training, and the 12 initialization weights a_1 to a_12 are compared to obtain the maximum weight value, denoted a_0;
3. each a_i(represent_i) with i ≠ 0 (a_i denotes the weight of the i-th layer, represent_i the output of the i-th layer) is max-pooled by one pooling layer with a 3 × 3 × 768 kernel;
4. a_0(represent_0) (a_0 denotes the maximum weight value among a_1 to a_12, represent_0 the corresponding output) is spliced with the pooled vectors;
5. the spliced vector is then reduced to 512 dimensions through one fully connected layer: output = Dense_unit=512(spliced vector from step 4), where output denotes the final output, Dense the fully connected layer, and unit = 512 means the vector is finally reduced to 512 dimensions.
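The five WBERT steps can be illustrated with a shape-level numpy sketch. Random tensors stand in for trained Dense layers and real transformer outputs, and the 3 × 3 × 768 pooling is simplified to an element-wise max over the weighted layers, since the patent does not fix the exact tensor layout.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden = 8, 768
# Stand-ins for the 12 transformer-layer representations represent_1..12.
layers = [rng.standard_normal((seq_len, hidden)) for _ in range(12)]

# Step 1: a_i = Dense_unit=1(represent_i) -> one scalar weight per layer.
w_dense = rng.standard_normal(hidden)
a = np.array([float((layer @ w_dense).mean()) for layer in layers])

# Step 2: the layer with the maximum weight plays the role of a_0.
best = int(np.argmax(a))

# Step 3: max-pool the remaining weighted representations.
pooled = np.max(np.stack([a[i] * layers[i] for i in range(12) if i != best]),
                axis=0)

# Step 4: splice the best weighted layer with the pooled vector.
concat = np.concatenate([a[best] * layers[best], pooled], axis=-1)

# Step 5: output = Dense_unit=512(concat), here a random projection.
proj = rng.standard_normal((concat.shape[-1], 512))
output = concat @ proj
```

The splice doubles the feature width to 1536 before the final layer brings it down to the fixed 512-dimensional output.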
FIG. 3 is a diagram of a named entity recognition model illustrating the structure of the named entity recognition model:
1. the input part is WBERT (the model of FIG. 2); the input sequence encoding is spliced with the output of the named entity recognition model, which is converted by an argmax function into a fixed-dimension sequence of the same length as the input sequence;
2. the BIO text (the training set of (2) in part 2 of the description of FIG. 1) is processed by WBERT to obtain word-vector encodings;
3. the word vectors obtained in step 2 are fed into a CNN and a BILSTM in parallel, the CNN extracting local features and the BILSTM extracting global features; since some features are better represented locally and others globally, the features extracted by the CNN and the BILSTM are each given a weight, initialized as a_CNN/BILSTM = Dense_unit=1(represent_CNN/BILSTM), where a_CNN/BILSTM denotes the initial weight of the CNN/BILSTM branch, Dense denotes a fully connected layer, represent_CNN/BILSTM denotes the output of the CNN/BILSTM layer, and unit = 1 means the vector is finally reduced to one dimension;
4. the weight values are determined through training, and a_CNN(represent_CNN) and a_BILSTM(represent_BILSTM) are max-pooled separately with a pooling layer whose kernel size is 3 × 3 × 512 (a_CNN/BILSTM denotes the weight after CNN/BILSTM training, and represent_CNN/BILSTM the output of the CNN/BILSTM layer);
5. the outputs of the pooling layers from step 4 are spliced;
6. the CRF layer adds constraints to the final predicted labels to ensure that they are legal: when scoring a predicted sequence, the label with the maximum output probability is not simply taken at every position; instead, the transition probabilities are added in so that the output obeys the labelling rules (B-Per cannot be followed by I-Loc, where B-Per denotes the beginning of a person name, which must be followed by the continuation of that name, and I-Loc denotes the continuation of a place name); for example, a greedy decode might output the sequence (I-L, I-P, O, I-L, I-P), where I-L denotes the continuation of a place name, I-P the continuation of a person name, and O an irrelevant character, but because the probability of O -> I-P in the transition probability matrix is very small or even negative, such a sequence does not receive the highest overall score (probability) and is therefore not the desired sequence; to enforce this, a CRF layer is added after the splicing layer.
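The benefit of decoding with transition scores can be shown with a toy Viterbi decode; the tag set, scores, and forbidden transitions below are made-up illustrations, not trained CRF parameters.

```python
import numpy as np

tags = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
NEG = -1e9  # effectively forbids a transition
trans = np.zeros((5, 5))
trans[tags.index("B-PER"), tags.index("I-LOC")] = NEG  # B-PER cannot precede I-LOC
trans[tags.index("O"), tags.index("I-PER")] = NEG      # O cannot precede I-PER
trans[tags.index("O"), tags.index("I-LOC")] = NEG

def viterbi(emissions, trans):
    """Best tag path under emission + transition scores (Viterbi decode)."""
    n, k = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        total = score[:, None] + trans + emissions[t][None, :]
        back[t] = np.argmax(total, axis=0)
        score = np.max(total, axis=0)
    path = [int(np.argmax(score))]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [tags[i] for i in reversed(path)]

# Greedy per-position decoding would pick the illegal pair (B-PER, I-LOC);
# the transition-aware decode falls back to the legal (B-PER, I-PER).
emissions = np.array([[0.0, 5.0, 0.0, 0.0, 0.0],
                      [0.0, 0.0, 4.0, 0.0, 5.0]])
```

Here the position-2 emission prefers I-LOC, but the huge negative transition score for B-PER -> I-LOC makes the legal I-PER path win overall, which is exactly the constraint the CRF layer enforces.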
FIG. 4 is a relational extraction model illustrating the structure of the relational extraction model:
1. splicing the input sequence encoding with the output of the named entity recognition model, using WBERT (the model of FIG. 2):
(1) the training data set (obtained in (2) of part 2 of the description of FIG. 1) is processed by WBERT to obtain a feature sequence;
(2) it is spliced with the output of the named entity recognition model, which is converted by an argmax function into a fixed-dimension sequence of the same length as the input sequence;
2. extracting features with a dynamic IDCNN: the dynamic IDCNN layer treats the dilation (expansion) coefficient of the IDCNN as a variable and finds its optimal value through training; building on CNN, the IDCNN layer widens the receptive field of feature extraction through dilated convolution and, unlike a CNN layer that pools after feature extraction and splicing, needs no pooling operation, which reduces the loss of features; however, different values of the IDCNN dilation coefficient perform differently on different texts:
(1) the initial dilation value of the IDCNN is set to i = 1 (with i equal to 1 the feature extraction is equivalent to that of a CNN);
(2) a loop is set: i = i + 1;
(3) the optimal value of i is found through training.
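The dilation loop in steps (1) to (3) can be illustrated with a toy 1-D dilated convolution; this is a sketch of the general IDCNN idea, not the patent's trained network, and "finding the optimal i through training" would in practice mean comparing validation scores across the loop.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """Valid 1-D convolution whose taps are spaced `dilation` apart."""
    k = len(kernel)
    span = (k - 1) * dilation + 1  # receptive field of one filter application
    out = [sum(kernel[j] * x[t + j * dilation] for j in range(k))
           for t in range(len(x) - span + 1)]
    return np.array(out), span

x = np.arange(10, dtype=float)
kernel = np.array([1.0, 1.0, 1.0])
# The i = i + 1 loop: i = 1 behaves like a plain CNN; larger i widens the
# receptive field (span) without any pooling step.
results = {i: dilated_conv1d(x, kernel, i) for i in (1, 2, 3)}
```

With a 3-tap kernel the receptive field grows as 2i + 1, so widening coverage costs no pooling, which is the feature-loss argument made above.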
3. The FC layer splices the local features;
the embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A method for judging the similarity of a character relationship framework of a dream of Red mansions based on a knowledge graph, characterized by comprising: collecting and completing the data, customizing the structures of the common-surname dictionary and the character dictionary, and labeling the data set, as follows:
(1) collecting the people and the relation of the dream of Red mansions and the places related to the people, sorting and integrating the data of multiple sources, and checking for missing and filling leaks to obtain relatively comprehensive data;
(2) collecting common surnames: the first character of each character name that appears with high frequency in the novel is collected; if that character is not already among the common surnames, it is added, e.g. a frequently appearing character name whose first character is absent from the common surnames has that single character added to them;
(3) constructing a character dictionary from the sorted people, places and new surnames in the dream of Red mansions, specifically: building the dictionary with entries of the form character + PER label and location + LOC label for the characters and the locations; adding the surnames among the new family names, with B-PER labels, to the character dictionary; writing python code to match against the character dictionary and convert the txt file of the whole dream of Red mansions into a txt file in standard BIO form; dividing the BIO-form txt file into a training set and a test set in the ratio 7:3, and using k-fold splitting to take different parts of the training set as the validation set;
(4) adding an "unknown" relation to the collected dream of Red mansions relations, and constructing a relation dictionary in the form number + relation; labeling sentences in the form number + sentence according to the relation dictionary, with the person names in the sentences represented by masks and wildcards; dividing the labeled data set files into a training set and a test set in the ratio 8:2, and using k-fold splitting to take different parts of the training set as the validation set.
2. A method for judging similarity of a character relationship framework of a dream of Red mansions based on a knowledge graph, characterized in that the native BERT model is improved to obtain WBERT:
(1) experiments show that each layer of BERT understands the text differently, so the BERT model is fine-tuned;
(2) each of the representations generated by BERT's 12 transformer layers is given a weight, initialized as a_i = Dense_unit=1(represent_i), where a_i denotes the initial weight of the i-th layer, Dense denotes a fully connected layer, represent_i denotes the output of the i-th layer, and unit = 1 means the vector is finally reduced to one dimension, yielding the 12 initialization weights a_1 to a_12;
(3) the weight values are determined through training, and the 12 initialization weights a_1 to a_12 are compared to obtain the maximum weight value, denoted a_0;
(4) each a_i(represent_i) with i ≠ 0 (a_i denotes the weight of the i-th layer, represent_i the output of the i-th layer) is max-pooled by one pooling layer with a 3 × 3 × 768 kernel;
(5) a_0(represent_0) (a_0 denotes the maximum weight value among a_1 to a_12, represent_0 the corresponding output) is spliced with the pooled vectors;
(6) the spliced vector obtained in (5) is reduced to 512 dimensions through a fully connected layer: output = Dense_unit=512(spliced vector from (5)), where output denotes the final output, Dense the fully connected layer, and unit = 512 means the vector is finally reduced to 512 dimensions.
3. A method for judging similarity of a character relationship framework of a dream of Red mansions based on a knowledge graph is characterized by comprising the following steps: the named entity recognition model consists of four parts: WBERT, BILSTM + CNN, ATTENTION mechanism, CRF layer:
(1) processing the BIO text (the training set of (3) in claim 1) with WBERT to obtain word-vector encodings (WBERT is obtained by fine-tuning BERT as in claim 2);
(2) inputting the word vectors obtained in (1) into a CNN and a BILSTM in parallel, the CNN extracting local features and the BILSTM extracting global features; since some features are better represented locally and others globally, the features extracted by the CNN and the BILSTM are each given a weight, initialized as a_CNN/BILSTM = Dense_unit=1(represent_CNN/BILSTM), where a_CNN/BILSTM denotes the initial weight of the CNN/BILSTM branch, Dense denotes a fully connected layer, represent_CNN/BILSTM denotes the output of the CNN/BILSTM layer, and unit = 1 means the vector is finally reduced to one dimension;
(3) determining the weight values through training, and max-pooling a_CNN(represent_CNN) and a_BILSTM(represent_BILSTM) separately with a pooling layer whose kernel size is 3 × 3 × 512 (a_CNN/BILSTM denotes the weight after CNN/BILSTM training, and represent_CNN/BILSTM the output of the CNN/BILSTM layer);
(4) splicing the output of the pooling layer obtained in the step (3);
(5) the CRF layer adds constraints to the final predicted labels to ensure that they are legal: when scoring a predicted sequence, the label with the maximum output probability is not simply taken at every position; instead, the transition probabilities are added in so that the output obeys the labelling rules (B-Per cannot be followed by I-Loc, where B-Per denotes the beginning of a person name, which must be followed by the continuation of that name, and I-Loc denotes the continuation of a place name); for example, a greedy decode might output the sequence (I-L, I-P, O, I-L, I-P), where I-L denotes the continuation of a place name, I-P the continuation of a person name, and O an irrelevant character, but because the probability of O -> I-P in the transition probability matrix is very small or even negative, such a sequence does not receive the highest overall score (probability) and is therefore not the desired sequence; to enforce this, a CRF layer is added after the splicing layer.
4. A method for judging similarity of a character relationship framework of a dream of Red mansions based on a knowledge graph is characterized by comprising the following steps: the relation extraction model consists of three parts: input layer, dynamic IDCNN layer, FC:
(1) the input part is WBERT (obtained by fine-tuning BERT as in claim 2); the input sequence encoding is spliced with the output of the named entity recognition module (the model constructed in claim 3):
1) the training data set (obtained in (4) of claim 1) is processed by WBERT to obtain a feature sequence;
2) it is spliced with the output of the named entity recognition model (constructed in claim 3), which is converted by an argmax function into a fixed-dimension sequence of the same length as the input sequence;
(2) the dynamic IDCNN layer treats the dilation (expansion) coefficient of the IDCNN as a variable and finds its optimal value through training; building on CNN, the IDCNN layer widens the receptive field of feature extraction through dilated convolution and, unlike a CNN layer that pools after feature extraction and splicing, needs no pooling operation, which reduces the loss of features; however, different values of the IDCNN dilation coefficient perform differently on different texts:
1) the initial dilation value of the IDCNN is set to i = 1 (with i equal to 1 the feature extraction is equivalent to that of a CNN);
2) a loop is set: i = i + 1;
3) finding an optimal i value through training;
(3) the FC layer splices the local features.
5. A method for judging similarity of a character relationship framework of a dream of Red mansions based on a knowledge graph is characterized by comprising the following steps: constructing a dream of Red mansions knowledge graph:
(1) taking the full text of the dream of Red mansions as the input of the named entity recognition model (constructed in claim 3) and the relation extraction model (constructed in claim 4) to extract people, places and relations, obtaining person-relationship triples;
(2) numbering the entities according to their occurrence frequency and degree, dividing the entities into levels 1 to 5, and continuously updating the entity numbers according to the character relationships; on the basis of the triples, Weights are added to form quadruples (a Weight is the importance of a person-to-person relationship and is determined by the levels of the two entities in the triple):
1) defining the importance of the characters according to their frequency and degree, and numbering the characters from 1 to n in order of importance (n is determined by the number of extracted entities);
2) assigning each character an importance parameter: the characters numbered 1 to n receive importance parameters n down to 1, respectively;
3) dividing the characters numbered 1 to n into levels 1 to 5 in the ratio 1:2:3:4:5;
4) increasing the importance parameters of the characters according to the character-relationship triples: a character related to a character of level 1 to 5 has 5 down to 1, respectively, added to its original parameter; for example, if the importance parameter of entity 1 is n and a triple involving entity 1 shows that it is related to entity 2, where entity 2 is a level-1 character, then the importance parameter of entity 1 becomes n + 5;
5) re-sorting the characters by their importance parameters;
6) repeating steps 2) to 5) until the character numbers no longer change;
(3) storing the obtained quadruples in a NEO4J graph database, writing an alignment algorithm, and setting a threshold of 70%: entity fusion is performed when the similarity exceeds 70% (the similarity threshold can be set as required).
6. A method for judging similarity of a character relationship framework of a dream of Red mansions based on a knowledge graph is characterized by comprising the following steps: the method for comparing the similarity with the frame of the dream of red mansions comprises the following steps:
(1) extracting the characters, the places related to the characters, and the relations from the novel to be compared, using the trained named entity recognition model and relation extraction model (obtained by feeding data into, and training, the models constructed in claims 3 and 4, respectively);
(2) numbering the entities according to their occurrence frequency and degree, dividing them into levels 1 to 5, and continuously updating the entity numbers according to the character relationships; adding Weights on the basis of the triples to form quadruples (the entity numbers are obtained in the same way as in process (2) of claim 5; a Weight is the importance of a person-to-person relationship and is determined by the levels of the two entities in the triple);
(3) finding the quadruples in which each entity appears, and comparing the entity levels and Weights of all such quadruples against the framework's relationships to obtain a similarity percentage between 0% and 100%.
CN202011008324.1A 2020-09-23 2020-09-23 Method for judging similarity of red-building dream character relationship frames based on knowledge graph Active CN112101009B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011008324.1A CN112101009B (en) 2020-09-23 2020-09-23 Method for judging similarity of red-building dream character relationship frames based on knowledge graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011008324.1A CN112101009B (en) 2020-09-23 2020-09-23 Method for judging similarity of red-building dream character relationship frames based on knowledge graph

Publications (2)

Publication Number Publication Date
CN112101009A true CN112101009A (en) 2020-12-18
CN112101009B CN112101009B (en) 2024-03-26

Family

ID=73755934

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011008324.1A Active CN112101009B (en) 2020-09-23 2020-09-23 Method for judging similarity of red-building dream character relationship frames based on knowledge graph

Country Status (1)

Country Link
CN (1) CN112101009B (en)


Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875051A (en) * 2018-06-28 2018-11-23 中译语通科技股份有限公司 Knowledge mapping method for auto constructing and system towards magnanimity non-structured text
WO2019024704A1 (en) * 2017-08-03 2019-02-07 阿里巴巴集团控股有限公司 Entity annotation method, intention recognition method and corresponding devices, and computer storage medium
CN110222199A (en) * 2019-06-20 2019-09-10 青岛大学 A kind of character relation map construction method based on ontology and a variety of Artificial neural network ensembles
CN110287334A (en) * 2019-06-13 2019-09-27 淮阴工学院 A kind of school's domain knowledge map construction method based on Entity recognition and attribute extraction model
CN110298042A (en) * 2019-06-26 2019-10-01 四川长虹电器股份有限公司 Based on Bilstm-crf and knowledge mapping video display entity recognition method
CN110334210A (en) * 2019-05-30 2019-10-15 哈尔滨理工大学 A kind of Chinese sentiment analysis method merged based on BERT with LSTM, CNN
CN110377903A (en) * 2019-06-24 2019-10-25 浙江大学 A kind of Sentence-level entity and relationship combine abstracting method
CN110516256A (en) * 2019-08-30 2019-11-29 的卢技术有限公司 A kind of Chinese name entity extraction method and its system
CN110781683A (en) * 2019-11-04 2020-02-11 河海大学 Entity relation joint extraction method
US20200073933A1 (en) * 2018-08-29 2020-03-05 National University Of Defense Technology Multi-triplet extraction method based on entity-relation joint extraction model
CN110969020A (en) * 2019-11-21 2020-04-07 中国人民解放军国防科技大学 CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN111125367A (en) * 2019-12-26 2020-05-08 华南理工大学 Multi-character relation extraction method based on multi-level attention mechanism
CN111339318A (en) * 2020-02-29 2020-06-26 西安理工大学 University computer basic knowledge graph construction method based on deep learning


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HUANG Peixin; ZHAO Xiang; FANG Yang; ZHU Huiming; XIAO Weidong: "End-to-End Joint Extraction of Knowledge Triples Incorporating Adversarial Training", Journal of Computer Research and Development (计算机研究与发展), no. 12, 15 December 2019 (2019-12-15) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113221569A (en) * 2021-05-27 2021-08-06 中国人民解放军军事科学院国防工程研究院工程防护研究所 Method for extracting text information of damage test
CN113220871A (en) * 2021-05-31 2021-08-06 北京语言大学 Literature character relation identification method based on deep learning
CN113220871B (en) * 2021-05-31 2023-10-20 山东外国语职业技术大学 Literature character relation recognition method based on deep learning
CN113204970A (en) * 2021-06-07 2021-08-03 吉林大学 BERT-BilSTM-CRF named entity detection model and device
CN113535979A (en) * 2021-07-14 2021-10-22 中国地质大学(北京) Method and system for constructing knowledge graph in mineral field
CN113836943A (en) * 2021-11-25 2021-12-24 中国电子科技集团公司第二十八研究所 Relation extraction method and device based on semantic level
CN113836943B (en) * 2021-11-25 2022-03-04 中国电子科技集团公司第二十八研究所 Relation extraction method and device based on semantic level
CN114610819A (en) * 2022-03-17 2022-06-10 中科世通亨奇(北京)科技有限公司 Establishment method of character attribute relation extraction database in long text, entity extraction method, device and database

Also Published As

Publication number Publication date
CN112101009B (en) 2024-03-26

Similar Documents

Publication Publication Date Title
CN112101009A (en) Knowledge graph-based method for judging similarity of people relationship frame of dream of Red mansions
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
WO2021139424A1 (en) Text content quality evaluation method, apparatus and device, and storage medium
CN110196906B (en) Deep learning text similarity detection method oriented to financial industry
CN112417153B (en) Text classification method, apparatus, terminal device and readable storage medium
CN111966812A (en) Automatic question answering method based on dynamic word vector and storage medium
CN114429132A (en) Named entity identification method and device based on mixed lattice self-attention network
CN112100212A (en) Case scenario extraction method based on machine learning and rule matching
CN113065349A (en) Named entity recognition method based on conditional random field
CN112131453A (en) Method, device and storage medium for detecting network bad short text based on BERT
CN114925702A (en) Text similarity recognition method and device, electronic equipment and storage medium
CN115017879A (en) Text comparison method, computer device and computer storage medium
CN114510946A (en) Chinese named entity recognition method and system based on deep neural network
CN111723572B (en) Chinese short text correlation measurement method based on CNN convolutional layer and BilSTM
CN112632978A (en) End-to-end-based substation multi-event relation extraction method
CN116680407A (en) Knowledge graph construction method and device
CN113342982B (en) Enterprise industry classification method integrating Roberta and external knowledge base
CN113434698B (en) Relation extraction model establishing method based on full-hierarchy attention and application thereof
CN115840815A (en) Automatic abstract generation method based on pointer key information
CN115358227A (en) Open domain relation joint extraction method and system based on phrase enhancement
CN114610882A (en) Abnormal equipment code detection method and system based on electric power short text classification
CN113282746B (en) Method for generating variant comment countermeasure text of network media platform
CN114611489A (en) Text logic condition extraction AI model construction method, extraction method and system
CN111431863B (en) Host intrusion detection method based on relational network
CN114943229B (en) Multi-level feature fusion-based software defect named entity identification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant