CN115481247A - Author name disambiguation method based on contrastive learning and heterogeneous graph attention network - Google Patents

Author name disambiguation method based on contrastive learning and heterogeneous graph attention network

Info

Publication number
CN115481247A
CN115481247A (application CN202211151607.0A)
Authority
CN
China
Prior art keywords
paper
author
learning
disambiguation
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211151607.0A
Other languages
Chinese (zh)
Inventor
宫继兵
房小涵
彭吉全
赵祎
赵金烨
王成龙
黄朝园
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yanshan University
Original Assignee
Yanshan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yanshan University
Priority to CN202211151607.0A
Publication of CN115481247A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/353: Clustering; Classification into predefined classes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an author name disambiguation method based on contrastive learning and a heterogeneous graph attention network, belonging to the technical field of entity disambiguation in knowledge graph construction. The method comprises: using MongoDB to access information such as paper titles, authors and organizations, cleaning the data with a python text-processing library to remove noise and obtain standardized text suitable for the subsequent steps; performing representation learning on the papers with contrastive learning to obtain a unified embedding for each paper; clustering the papers under a purity-first principle to mitigate the over-merging problem and obtain paper clusters; aligning the resulting paper clusters with a heterogeneous graph attention network; and providing an over-splitting detection and over-splitting alignment algorithm to guarantee disambiguation quality. The invention better solves disambiguation of same-name authors and alleviates the paper over-merging and over-splitting problems to a certain extent.

Description

Author name disambiguation method based on contrastive learning and heterogeneous graph attention network
Technical Field
The invention relates to the technical field of entity disambiguation in knowledge graph construction, and in particular to an author name disambiguation method based on contrastive learning and a heterogeneous graph attention network.
Background
From today's big data to the recently popular metaverse, how to disambiguate same-name entities during knowledge informatization is an important and challenging problem. It arises widely in academic database construction, information retrieval, automatic question answering, recommendation systems and other fields, and has significant research value. Author name disambiguation is particularly important in academic database construction, and in recent years a large number of scholars have joined related research. Disambiguation in academic database construction mainly concerns same-name authors: a large number of papers in current systems are wrongly assigned, and ambiguity in the English names of Chinese scholars is especially severe. Many of these are historical errors produced while an author name disambiguation system runs, and they grow as the number of papers in the system increases.
In surveying academic database construction, historical errors can be divided into two sub-scenarios: paper over-merging and paper over-splitting. Over-merging means that an expert's profile in the database contains papers by other same-name experts; over-splitting means that one expert's papers are split across multiple clusters. Both phenomena arise widely while an AND (author name disambiguation) algorithm runs, and if they are not taken seriously and resolved, these errors seriously affect the stable execution of subsequent algorithms, which is a major challenge in current AND research.
Disclosure of Invention
The invention provides an author name disambiguation method based on contrastive learning and a heterogeneous graph attention network. By preliminarily clustering papers with techniques such as heterogeneous graph neural networks, clustering and contrastive learning, it converts the disambiguation problem into an alignment problem, better solves same-name author disambiguation, and alleviates the paper over-merging and over-splitting problems to a certain extent.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
An author name disambiguation method based on contrastive learning and a heterogeneous graph attention network comprises the following steps:
S1, data preprocessing: using MongoDB to access paper title, author and organization information, cleaning the data with a python text-processing library to remove noise and obtain standardized text suitable for the subsequent steps;
S2, paper representation learning: performing representation learning on the papers with contrastive learning to obtain a unified embedding for each paper;
S3, preliminary paper clustering: clustering the papers under a purity-first principle to mitigate the over-merging problem and obtain paper clusters;
S4, paper cluster alignment: aligning the paper clusters obtained in the previous step with a heterogeneous graph attention network;
S5, obtaining the paper disambiguation result: providing an over-splitting detection and over-splitting alignment algorithm to guarantee paper disambiguation quality.
The technical scheme of the invention is further improved as follows: S2 specifically comprises the following steps:
S21, obtaining the preliminary paper representation with the pre-trained language model BERT, described as:

v_i^a = BERT(p_i^a)

where p_i^a is the i-th paper of author a, and v_i^a is the characterization vector corresponding to paper p_i^a;
S22, constructing positive example pairs (x_i, x_i^+) and negative example pairs (x_i, x_i^-), and combining the positive and negative examples;
S23, introducing the training objective function h = f(BERT(x)); the training loss l_i is described as:

l_i = -log [ exp(sim(h_i, h_i^+)/τ) / Σ_{j=1}^{N} exp(sim(h_i, h_j)/τ) ]

where N is the mini-batch size, τ is the temperature hyperparameter, and sim(h_1, h_2) is the cosine similarity h_1ᵀh_2 / (‖h_1‖·‖h_2‖);
S24, after training, finally obtaining the representation vector v_i of each paper.
The technical scheme of the invention is further improved as follows: s3, specifically comprising:
S31, treating the clustering process as an intermediate stage of disambiguation, dividing the papers into more clusters according to rules, thereby reducing occurrences of different authors in the same cluster;
S32, clustering with LightGBM and a hierarchical clustering model, fitting a new decision tree with the negative gradient of the loss function as an approximation of the residual of the current decision tree;
S33, proposing the index Recall_over-merge to describe the over-merging phenomenon of the clustering result, defined as:

Recall_over-merge = TP / (TP + FN)

where TP is the number of cases in which two papers of the same author are in the same cluster, FN is the number of cases in which two papers of the same author fall in two different clusters, M is the number of ideal clusters, and N is the number of actual clusters; a higher Recall_over-merge means a lower degree of over-splitting.
The technical scheme of the invention is further improved as follows: s4, specifically comprising:
s41, generating candidate pairs for author entities with the same name;
s42, constructing a heteromorphic graph for each author entity, and if the mechanism and the co-author names between the candidate pairs are the same or the papers are similar, connecting the entities with each other to obtain a heteromorphic graph G (V, E);
and S43, determining author matching by using the heteromorphic image attention network.
The technical scheme of the invention is further improved as follows: in S43, the method specifically includes:
s431, obtaining semantic embedding of each thesis entity through the representation learning model of S2, and training the heterogeneous graph constructed in S42 through a LINE model to obtain structure embedding of each entity;
s432, combining the two kinds of embedding together as an input feature f, and finding out the importance among different author entities e through self-attribute, wherein the process is described as follows:
t ij =self-attention(Wf i ,Wf j )
Figure BDA0003856679240000041
wherein W is a shared weight matrix for each
Figure BDA0003856679240000042
Figure BDA0003856679240000043
Refer to e i All neighbor nodes of (1).
The technical scheme of the invention is further improved as follows: s5, specifically comprising:
s51, generating a non-repeated pair < name: cid1 and name: cid2> according to the rule of permutation and combination to construct a heteromorphic graph;
s52, detecting whether a group of pair belongs to an author or not by using a pre-trained HGAT;
s53, aligning the paper clusters by giving an alignment rule;
s54, the process needs to be carried out for multiple times, the times are defined as loops, and the finally obtained cluster _ pubs is the final disambiguation result.
The technical scheme of the invention is further improved as follows: in S53, specifically, the method includes:
s531, calculating adjacent edge nodes of each node, and connecting a group of edges with highest similarity scores of aligned two nodes;
s532, after all the nodes are judged, the dfs is used for realizing the connected subgraph algorithm, the alignment rule is obtained, and the nodes are combined.
Due to the adoption of the technical scheme, the invention makes the following technical progress:
1. The invention fine-tunes the BERT-based paper representation with contrastive learning, so the learned paper representation is better suited to the author name disambiguation task.
2. The method computes similarities among papers from the representations obtained in the previous step and preliminarily clusters the papers into fine-grained clusters, converting the disambiguation problem into an alignment problem; the textual semantic information of the papers is fully used for clustering, producing high-purity fine-grained paper clusters.
3. To obtain the final disambiguation result, the fine-grained paper clusters must be aligned. A heterogeneous graph is constructed from the attributes of each paper cluster, the representation of each cluster is learned with a heterogeneous graph neural network, and the pairwise similarities between clusters are computed; the most similar pairs of clusters are aligned. This process takes the structural information of the papers into account, yielding the final paper disambiguation result.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts;
FIG. 1 is an algorithm flow chart of the author name disambiguation method based on contrastive learning and a heterogeneous graph attention network provided by the present invention;
FIG. 2 is a diagram of the algorithm model framework of the author name disambiguation method based on contrastive learning and a heterogeneous graph attention network provided by the present invention.
Detailed Description
It should be noted that the terms "comprises" and "comprising," and any variations thereof, in the description and claims of this invention and the above-described drawings are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements explicitly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiment of the application provides an author name disambiguation method based on contrastive learning and a heterogeneous graph attention network. It addresses the prior-art problem that paper over-merging and paper over-splitting, which arise widely while an AND algorithm runs, seriously affect the stable execution of subsequent algorithms; it explicitly considers the two error scenarios that may occur during AND, and proposes an AND algorithm for this problem together with how to apply it in a big data scenario.
Part of the technical term interpretation:
Author name disambiguation: Author Name Disambiguation (AND), i.e., correctly matching papers of the same author in an academic database and disambiguating same-name authors.
The invention is further described in detail below with reference to the drawings and examples:
As shown in fig. 1 and 2, an author name disambiguation method based on contrastive learning and a heterogeneous graph attention network includes the following steps:
s1, preprocessing data;
using MongoDB to access information such as paper titles, authors and organizations, cleaning the data with a python text-processing library to remove noise, obtaining standardized text suitable for the subsequent steps;
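A minimal sketch of the cleaning in S1 (the MongoDB access itself is elided; the field names, the `clean_text` helper and the exact normalization rules are assumptions, since the text only names "a python character processing library"):

```python
import re
import unicodedata

def clean_text(raw: str) -> str:
    """Normalize one paper-title / author / organization string.

    Assumed rules: unicode NFKC normalization, lowercasing,
    punctuation removal, whitespace collapsing.
    """
    text = unicodedata.normalize("NFKC", raw).lower()
    text = re.sub(r"[^\w\s]", " ", text)      # strip punctuation noise
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

# A record as it might come out of MongoDB (illustrative values).
record = {"title": "  Deep   Learning, for NLP!! ",
          "org": "Yanshan Univ.\u00a0(Hebei)"}
cleaned = {k: clean_text(v) for k, v in record.items()}
print(cleaned["title"])  # deep learning for nlp
```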
s2, performing paper characterization learning;
performing representation learning on the papers with contrastive learning to obtain a unified embedding for each paper;
s3, performing primary clustering on the thesis;
clustering the papers under the purity-first principle, mitigating the paper over-merging problem;
s4, aligning the paper clusters;
aligning the paper clusters obtained in the previous step with the heterogeneous graph attention network;
s5, obtaining a paper disambiguation result;
providing an over-splitting detection and over-splitting alignment algorithm to guarantee paper disambiguation quality;
The specific implementation process is as follows:
S1, addressing the data noise in the dataset and the factors that may affect disambiguation quality: the dataset is first preprocessed; the data are cleaned and analyzed, including cleaning abnormal data and analyzing samples from different feature angles; feature engineering is then performed on the data, and the processed data provide the input for subsequent model training;
S2, in paper representation learning, a preliminary paper representation is first obtained with a pre-trained language model, positive and negative example pairs are then constructed following contrastive learning, and the characterization vector of each paper is obtained after training; the method specifically comprises the following steps:
S21, first obtaining a preliminary characterization of the paper through the pre-trained language model BERT, which can be described as:

v_i^a = BERT(p_i^a)

where p_i^a is the i-th paper of author a, and v_i^a is the characterization vector corresponding to paper p_i^a;
S22, using the contrastive learning method SimCSE to pull papers with high similarity together and push papers with low similarity apart, constructing positive and negative example pairs and combining them; the method specifically comprises:
S221, positive example construction: for a given paper x_i^a under author name a, two BERT encoders are used to obtain h_i^a and (h_i^a)^+; the vectors generated by the two BERT passes are not identical, but their semantics are, thereby forming a positive example pair (x_i^a, x_i^{a+}). In addition, to bring papers of the same author closer in the resulting vector space, a different paper x_j^a of the same author is also regarded as a positive sample, constituting a positive example pair (x_i^a, x_j^a);
S222, negative example construction: to push papers of different same-name authors farther apart, a paper x_k^b of a different same-name author b is treated as a negative example, obtaining a negative example pair (x_i^a, x_k^b);
S23, combining p_pos and p_neg into triples (x_i, x_i^+, x_i^-), where x_i is the anchor, x_i^+ is the positive example, and x_i^- is the negative example. To train the implicit relationship between them, a training objective function h = f(BERT(x)) is introduced after the BERT encoder, where f is a linear layer. The training loss l_i is shown in the following equation:

l_i = -log [ exp(sim(h_i, h_i^+)/τ) / Σ_{j=1}^{N} exp(sim(h_i, h_j)/τ) ]

where N is the mini-batch size, τ is the temperature hyperparameter, and sim(h_1, h_2) is the cosine similarity h_1ᵀh_2 / (‖h_1‖·‖h_2‖). After training, the representation vector v_i of each paper is obtained.
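The training loss l_i of S23 can be sketched in plain Python as follows (an InfoNCE-style objective in the spirit of SimCSE; the BERT encoding that would produce the vectors h is elided, and the temperature value is an illustrative assumption):

```python
import math

def cosine(a, b):
    """sim(h1, h2): cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def info_nce_loss(anchor, candidates, pos_index, tau=0.05):
    """l_i = -log( exp(sim(h_i, h_i+)/tau) / sum_j exp(sim(h_i, h_j)/tau) ).

    `candidates` holds the positive (row pos_index) plus the in-batch
    negatives; a low loss means the anchor sits far closer to its
    positive than to any negative.
    """
    logits = [cosine(anchor, c) / tau for c in candidates]
    m = max(logits)  # subtract the max for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[pos_index] - m - math.log(denom))

anchor = [1.0, 0.0]
cands = [[1.0, 0.0], [0.0, 1.0]]  # positive first, then one negative
print(info_nce_loss(anchor, cands, 0) < info_nce_loss(anchor, cands, 1))  # True
```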
S3, in the preliminary clustering of paper clusters, clustering is performed with a clustering model under the purity-first principle, generating a suitable number of clusters as far as possible, and the clustering is then adjusted reasonably according to the over-merging index; the method specifically comprises:
S31, to deal with the paper over-merging problem, the clustering process is treated as an intermediate stage of disambiguation. During clustering, the papers are divided into as many clusters as certain rules allow, which effectively reduces cases of papers by different authors appearing in the same cluster.
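The purity-first idea of S31 can be illustrated with a deliberately strict merge rule: two papers join one cluster only when their embeddings are almost identical, so clusters stay pure at the cost of over-splitting. This is a single-linkage union-find sketch, not the LightGBM-plus-hierarchical-clustering model actually used; the threshold value is an assumption:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def purity_first_clusters(vectors, threshold=0.95):
    """Union papers whose pairwise similarity clears a strict threshold."""
    parent = list(range(len(vectors)))

    def find(i):  # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    n = len(vectors)
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# Two near-duplicate embeddings merge; the orthogonal one stays alone.
print(purity_first_clusters([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]))  # [[0, 1], [2]]
```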
S32, proposing the index Recall_over-merge to describe the over-merging phenomenon of the clustering result, defined as:

Recall_over-merge = TP / (TP + FN)

where TP is the number of cases in which two papers of the same author are in the same cluster, FN is the number of cases in which two papers of the same author fall in two different clusters, M is the number of ideal clusters, and N is the number of actual clusters; a higher Recall_over-merge means a lower degree of over-splitting.
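Counting TP and FN over all same-author paper pairs, the index can be sketched as follows (the return value for the degenerate case with no same-author pairs is an assumption):

```python
from itertools import combinations

def over_merge_recall(author_of, cluster_of):
    """Recall_over-merge = TP / (TP + FN).

    author_of[i]: true author id of paper i;
    cluster_of[i]: predicted cluster id of paper i.
    TP: same-author pair kept in one cluster;
    FN: same-author pair split across clusters.
    A higher value means less over-splitting.
    """
    tp = fn = 0
    for i, j in combinations(range(len(author_of)), 2):
        if author_of[i] == author_of[j]:
            if cluster_of[i] == cluster_of[j]:
                tp += 1
            else:
                fn += 1
    return tp / (tp + fn) if tp + fn else 1.0

print(over_merge_recall([0, 0, 1], [5, 5, 7]))  # 1.0 (no same-author pair split)
print(over_merge_recall([0, 0, 1], [5, 6, 7]))  # 0.0 (the only pair is split)
```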
S4, in the paper cluster alignment process, author entities are first connected to obtain a heterogeneous graph, and author matching is then determined with the heterogeneous graph attention network; the method specifically comprises:
S41, generating candidate pairs for same-name author entities (clusters);
S42, constructing a heterogeneous graph for each author entity: if the organizations or co-author names of a candidate pair are the same, or their papers are similar, the entities are connected with each other, obtaining a heterogeneous graph G(V, E);
S43, determining author matching with the heterogeneous graph attention network;
S431, obtaining the semantic embedding of each paper entity through the representation learning model of S2, and training the heterogeneous graph constructed in S42 with the LINE model to obtain the structure embedding of each entity;
S432, combining the two embeddings together as the input feature f; the importance t_ij of node e_i to node e_j is obtained with self-attention among the different author entities e, with the formula:

t_ij = self-attention(Wf_i, Wf_j)

where W is a shared weight matrix, for each j ∈ N_i, and N_i denotes all neighbor nodes of e_i; the normalized attention coefficient is:

α_ij = exp(t_ij) / Σ_{k∈N_i} exp(t_ik)
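Given the raw self-attention scores t_ij, the normalized coefficient over the neighbor set N_i is a softmax; a small sketch (computing t_ij from Wf_i and Wf_j is elided, and the dict-based graph encoding is an assumption):

```python
import math

def attention_coefficients(t, neighbors, i):
    """alpha_ij = exp(t_ij) / sum_{k in N_i} exp(t_ik).

    t: dict mapping (i, j) to the raw self-attention score t_ij;
    neighbors[i]: the neighbor node ids N_i of entity e_i.
    """
    denom = sum(math.exp(t[(i, k)]) for k in neighbors[i])
    return {j: math.exp(t[(i, j)]) / denom for j in neighbors[i]}

t = {(0, 1): 1.0, (0, 2): 1.0}  # equal raw scores -> equal attention
alpha = attention_coefficients(t, {0: [1, 2]}, 0)
print(alpha[1], alpha[2])  # 0.5 0.5
```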
S5, finally, the paper clusters are aligned over multiple rounds of the alignment rule to obtain the final disambiguation result; the method comprises the following steps:
S51, generating non-repeating pairs <name:cid1, name:cid2> according to permutation and combination rules, and constructing the heterogeneous graph;
S52, detecting with a pre-trained HGAT whether a candidate pair belongs to the same author;
S53, aligning the paper clusters according to a given alignment rule;
S531, for each node, examining its adjacent edges and connecting the pair of nodes joined by the edge with the highest similarity score for alignment;
S532, after all nodes are judged, finding connected subgraphs with DFS to obtain the alignment rule, and merging;
S54, repeating the process several times (the number of repetitions is defined as loops); the final cluster_pubs is the final disambiguation result.
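The DFS connected-subgraph merge of S531 and S532 can be sketched as follows (clusters are indexed by integer id here, which is an assumption; each resulting connected component becomes one entry of cluster_pubs):

```python
def merge_by_alignment(n_clusters, aligned_edges):
    """Group paper-cluster nodes into connected subgraphs via DFS.

    aligned_edges: pairs of cluster ids judged (e.g. by the HGAT) to
    belong to the same author. Each connected component is merged
    into one final author profile.
    """
    adj = {i: [] for i in range(n_clusters)}
    for a, b in aligned_edges:
        adj[a].append(b)
        adj[b].append(a)

    seen, groups = set(), []
    for start in range(n_clusters):
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:  # iterative DFS
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            comp.append(node)
            stack.extend(adj[node])
        groups.append(sorted(comp))
    return groups

# Clusters 0-1 and 1-2 are aligned, so 0, 1, 2 merge; 3 stays alone.
print(merge_by_alignment(4, [(0, 1), (1, 2)]))  # [[0, 1, 2], [3]]
```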
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and these modifications or substitutions do not depart from the spirit of the corresponding technical solutions of the embodiments of the present invention.

Claims (7)

1. An author name disambiguation method based on contrastive learning and a heterogeneous graph attention network, characterized in that the method comprises the following steps:
S1, data preprocessing: using MongoDB to access paper title, author and organization information, cleaning the data with a python text-processing library to remove noise and obtain standardized text suitable for the subsequent steps;
S2, paper representation learning: performing representation learning on the papers with contrastive learning to obtain a unified embedding for each paper;
S3, preliminary paper clustering: clustering the papers under a purity-first principle to mitigate the over-merging problem and obtain paper clusters;
S4, paper cluster alignment: aligning the paper clusters obtained in the previous step with a heterogeneous graph attention network;
S5, obtaining the paper disambiguation result: providing an over-splitting detection and over-splitting alignment algorithm to guarantee paper disambiguation quality.
2. The author name disambiguation method based on contrastive learning and a heterogeneous graph attention network as recited in claim 1, wherein S2 specifically comprises:
S21, obtaining a paper representation with the pre-trained language model BERT, described as:

v_i^a = BERT(p_i^a)

where p_i^a is the i-th paper of author a, and v_i^a is the characterization vector corresponding to paper p_i^a;
S22, constructing positive example pairs (x_i, x_i^+) and negative example pairs (x_i, x_i^-), and combining the positive and negative examples;
S23, introducing the training objective function h = f(BERT(x)); the training loss l_i is described as:

l_i = -log [ exp(sim(h_i, h_i^+)/τ) / Σ_{j=1}^{N} exp(sim(h_i, h_j)/τ) ]

where N is the mini-batch size, τ is the temperature hyperparameter, and sim(h_1, h_2) is the cosine similarity h_1ᵀh_2 / (‖h_1‖·‖h_2‖);
S24, after training, finally obtaining the representation vector v_i of each paper.
3. The author name disambiguation method based on contrastive learning and a heterogeneous graph attention network as recited in claim 1, wherein S3 specifically comprises:
S31, treating the clustering process as an intermediate stage of disambiguation, dividing the papers into more clusters according to rules, thereby reducing occurrences of different authors in the same cluster;
S32, clustering with LightGBM and a hierarchical clustering model, fitting a new decision tree with the negative gradient of the loss function as an approximation of the residual of the current decision tree;
S33, proposing the index Recall_over-merge to describe the over-merging phenomenon of the clustering result, defined as:

Recall_over-merge = TP / (TP + FN)

where TP is the number of cases in which two papers of the same author are in the same cluster, FN is the number of cases in which two papers of the same author fall in two different clusters, M is the number of ideal clusters, and N is the number of actual clusters; a higher Recall_over-merge means a lower degree of over-splitting.
4. The author name disambiguation method based on contrastive learning and a heterogeneous graph attention network as recited in claim 1, wherein S4 specifically comprises:
S41, generating candidate pairs for same-name author entities;
S42, constructing a heterogeneous graph for each author entity: if the organizations or co-author names of a candidate pair are the same, or their papers are similar, the entities are connected with each other, obtaining a heterogeneous graph G(V, E);
S43, determining author matching with the heterogeneous graph attention network.
5. The author name disambiguation method based on contrastive learning and a heterogeneous graph attention network as recited in claim 4, wherein S43 specifically comprises:
S431, obtaining the semantic embedding of each paper entity through the representation learning model of S2, and training the heterogeneous graph constructed in S42 with the LINE model to obtain the structure embedding of each entity;
S432, combining the two embeddings together as the input feature f, and finding the importance among different author entities e through self-attention, described as:

t_ij = self-attention(Wf_i, Wf_j)

α_ij = exp(t_ij) / Σ_{k∈N_i} exp(t_ik)

where W is a shared weight matrix, for each j ∈ N_i, and N_i denotes all neighbor nodes of e_i.
6. The author name disambiguation method based on contrastive learning and a heterogeneous graph attention network as recited in claim 1, wherein S5 specifically comprises:
S51, generating non-repeating pairs <name:cid1, name:cid2> according to permutation and combination rules, and constructing the heterogeneous graph;
S52, detecting with a pre-trained HGAT whether a candidate pair belongs to the same author;
S53, aligning the paper clusters according to a given alignment rule;
S54, repeating the process several times (the number of repetitions is defined as loops); the finally obtained cluster_pubs is the final disambiguation result.
7. The author name disambiguation method based on comparative learning and heterogeneous graph attention networks of claim 6, further comprising: in S53, specifically, the method includes:
S531, for each node, computing its adjacent edge nodes and connecting the pair of aligned nodes whose edge has the highest similarity score;
S532, after all nodes have been judged, running a DFS-based connected-subgraph algorithm to obtain the alignment rule and merge the nodes.
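The connected-subgraph merge of S532 can be sketched with an iterative depth-first search; function and variable names here are illustrative assumptions, and the edges passed in are assumed to be the highest-score alignment edges retained in S531:

```python
def merge_by_components(nodes, edges):
    """Find connected subgraphs via DFS and merge each component's
    nodes into one cluster (a sketch of S531-S532)."""
    adj = {n: [] for n in nodes}
    for u, v in edges:                 # S531: one best edge per aligned pair
        adj[u].append(v)
        adj[v].append(u)
    seen, components = set(), []
    for start in nodes:                # S532: DFS connected-subgraph search
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            comp.append(n)
            stack.extend(adj[n])       # visit all reachable neighbors
        components.append(sorted(comp))
    return components

# nodes 1-2-3 are chained by alignment edges, 4-5 form a second component
print(merge_by_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)]))
# [[1, 2, 3], [4, 5]]
```

An explicit stack avoids Python's recursion limit on long alignment chains; each component then corresponds to one merged paper cluster.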
CN202211151607.0A 2022-09-21 2022-09-21 Author name disambiguation method based on comparative learning and heterogeneous graph attention network Pending CN115481247A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211151607.0A CN115481247A (en) 2022-09-21 2022-09-21 Author name disambiguation method based on comparative learning and heterogeneous graph attention network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211151607.0A CN115481247A (en) 2022-09-21 2022-09-21 Author name disambiguation method based on comparative learning and heterogeneous graph attention network

Publications (1)

Publication Number Publication Date
CN115481247A true CN115481247A (en) 2022-12-16

Family

ID=84424288

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211151607.0A Pending CN115481247A (en) 2022-09-21 2022-09-21 Author name disambiguation method based on comparative learning and heterogeneous graph attention network

Country Status (1)

Country Link
CN (1) CN115481247A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117556058A (en) * 2024-01-11 2024-02-13 安徽大学 Knowledge graph enhanced network embedded author name disambiguation method and device
CN117556058B (en) * 2024-01-11 2024-05-24 安徽大学 Knowledge graph enhanced network embedded author name disambiguation method and device

Similar Documents

Publication Publication Date Title
Saxena et al. Sequence-to-sequence knowledge graph completion and question answering
Cai et al. Generative adversarial network based heterogeneous bibliographic network representation for personalized citation recommendation
Dong et al. From data fusion to knowledge fusion
He et al. Learning entity representation for entity disambiguation
US20240013055A1 (en) Adversarial pretraining of machine learning models
US20210303783A1 (en) Multi-layer graph-based categorization
Hühn et al. FR3: A fuzzy rule learner for inducing reliable classifiers
Zhang et al. Autoblock: A hands-off blocking framework for entity matching
Liu et al. DAGOBAH: an end-to-end context-free tabular data semantic annotation system
CN109299263B (en) Text classification method and electronic equipment
US9009029B1 (en) Semantic hashing in entity resolution
CN109635157A (en) Model generating method, video searching method, device, terminal and storage medium
Gao et al. Building a large-scale, accurate and fresh knowledge graph
Dong et al. Data-anonymous encoding for text-to-SQL generation
CN115481247A (en) Author name disambiguation method based on comparative learning and heterogeneous graph attention network
CN114676346A (en) News event processing method and device, computer equipment and storage medium
CN113361270B (en) Short text optimization topic model method for service data clustering
Zimmermann et al. Incremental active opinion learning over a stream of opinionated documents
CN115391548A (en) Retrieval knowledge graph library generation method based on combination of scene graph and concept network
CN113535967B (en) Chinese universal concept map error correction device
CN113128224B (en) Chinese error correction method, device, equipment and readable storage medium
Bhowmick et al. Globally Aware Contextual Embeddings for Named Entity Recognition in Social Media Streams
Song et al. Metric sentiment learning for label representation
Im et al. Multilayer CARU model for text summarization
Xie et al. Author name disambiguation via heterogeneous network embedding from structural and semantic perspectives

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination