WO2023155508A1 - Paper correlation analysis method based on a graph convolutional neural network and a knowledge base

Paper correlation analysis method based on a graph convolutional neural network and a knowledge base

Info

Publication number
WO2023155508A1
WO2023155508A1 PCT/CN2022/131993 CN2022131993W WO2023155508A1 WO 2023155508 A1 WO2023155508 A1 WO 2023155508A1 CN 2022131993 W CN2022131993 W CN 2022131993W WO 2023155508 A1 WO2023155508 A1 WO 2023155508A1
Authority
WO
WIPO (PCT)
Prior art keywords
paper
papers
knowledge base
neural network
graph
Prior art date
Application number
PCT/CN2022/131993
Other languages
English (en)
French (fr)
Inventor
吴岳辛
范春晓
邹俊伟
王艺潼
刘峻辰
Original Assignee
北京邮电大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京邮电大学 (Beijing University of Posts and Telecommunications)
Publication of WO2023155508A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • The present application relates to the field of computer technology, and in particular to a paper correlation analysis method based on a graph convolutional neural network and a knowledge base.
  • Paper category classification divides the literature by research field and research task and adds category labels to the paper entities in the paper-collection knowledge base. The categories are predetermined, and no two categories intersect.
  • Paper community discovery divides the literature into "communities" and adds community labels to the paper entities in the paper-collection knowledge base.
  • A "community" is a labeled set of papers whose members are closely connected to one another and only sparsely connected to papers outside the set. Unlike paper category classification, community discovery has no predetermined labels, and communities may overlap.
  • The present invention proposes a new paper correlation analysis method: extract key information from the paper collection, build a knowledge base, use an improved Inception-GCN model built on the graph convolutional neural network to classify the papers, use the NOCO model to perform paper community discovery, and thereby complete the correlation analysis of the papers in the collection.
  • A knowledge base is a collection of knowledge that is stored, organized, managed and used in a computer to describe concepts in the physical world and their interrelationships. Knowledge is expressed as "entity-relationship-entity" or "entity-attribute-attribute value" triples, and the knowledge base is the collection of such triples.
  • Because entities are connected to one another through relationships, the knowledge base is a complex networked knowledge structure that can describe, store and manage intricate knowledge systems more faithfully and so meet the needs of subsequent analysis.
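The triple representation described above can be illustrated with a minimal sketch. The class name, the relation names (`cites`, `written_by`, `contains_term`) and the identifiers are illustrative assumptions, not taken from the patent.

```python
from collections import defaultdict

# Relation names are illustrative assumptions, not identifiers from the patent.
CITES, WRITTEN_BY, CONTAINS_TERM = "cites", "written_by", "contains_term"


class PaperKnowledgeBase:
    """Stores entity-relationship-entity and entity-attribute-value triples."""

    def __init__(self):
        self.relations = set()               # (head, relation, tail)
        self.attributes = defaultdict(dict)  # entity -> {attribute: value}

    def add_relation(self, head, relation, tail):
        self.relations.add((head, relation, tail))

    def set_attribute(self, entity, name, value):
        self.attributes[entity][name] = value

    def neighbors(self, entity, relation):
        """Return the sorted tails reachable from `entity` via `relation`."""
        return sorted(t for h, r, t in self.relations if h == entity and r == relation)


kb = PaperKnowledgeBase()
kb.add_relation("paper:A", CITES, "paper:B")
kb.add_relation("paper:A", WRITTEN_BY, "author:X")
kb.add_relation("paper:A", CONTAINS_TERM, "term:graph convolution")
kb.add_relation("paper:B", CONTAINS_TERM, "term:graph convolution")
kb.set_attribute("paper:A", "category", "machine learning")
```

Category and community labels produced by later steps would be stored through `set_attribute` in the same way.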
  • The graph convolutional neural network (Graph Convolution Network, GCN) is a scalable method for semi-supervised learning on graph data, based on a variant of the convolutional neural network. It is a deep-learning graph embedding method that does not rely on random walks. Graph data differs from traditional sequences and images: it is non-Euclidean data of infinite dimension. A graph has an unordered set of nodes of variable size, and each node has a different number of neighbors. This complexity poses great challenges for existing deep learning methods.
  • The graph convolutional neural network extends the convolution operation from traditional data to graph data. Like a convolutional neural network, it is essentially a feature extractor, and it is the basis of many complex graph neural network models. The features extracted by a GCN can be used for downstream tasks on graph data such as node classification, graph classification and link prediction.
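For reference, the layer-wise propagation rule commonly used for GCNs (as introduced by Kipf and Welling) can be written as follows. It is given here as background, on the assumption that the GCN meant in this document follows that convention; the symbols \tilde{A}, \tilde{D} and \sigma are not defined in the text above.

```latex
% H^{(l)}: node representations at layer l, with H^{(0)} = X
% \tilde{A} = A + I: adjacency matrix with self-loops
% \tilde{D}: diagonal degree matrix of \tilde{A}
% W^{(l)}: trainable weight matrix of layer l; \sigma: activation function
H^{(l+1)} = \sigma\left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)} \right),
\qquad \tilde{A} = A + I, \qquad \tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}
```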
  • One existing technology related to the technical solution of the present invention, shown in FIG. 1, is an academic big data analysis method based on the citation relationship between papers [Tan Zhaowei, Liu Changfeng, Zhou Jinguang, et al. Academic big data analysis method based on the citation relationship between papers: CN105808729B[P]. 2019.].
  • That invention provides an academic big data analysis method based on the citation relationship between papers.
  • Its implementation comprises the following three steps: (1) after performing correlation analysis and processing on the local paper data set, construct a paper citation network in the database; (2) construct an analysis algorithm according to the citation relationships in the paper citation network, and use it to obtain the importance and interrelationships of the nodes in the network and the importance of each paper relative to the central paper;
  • (3) convert the one-to-one citation relationships between papers into a mapping set in the citing direction and a mapping set in the cited direction, obtain the development path between specified papers in the paper citation network through an extraction algorithm, and calculate the importance of the path from the paper importance obtained in (2).
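The conversion in step (3) of that prior art, from one-to-one citation pairs to two direction-specific mapping sets, can be sketched as below. The function name and the sample paper identifiers are hypothetical; the cited patent's own data structures are not described here.

```python
from collections import defaultdict


def build_citation_maps(citations):
    """Turn (citing, cited) pairs into two direction-specific mapping sets."""
    citing_map = defaultdict(set)  # paper -> papers it cites
    cited_map = defaultdict(set)   # paper -> papers that cite it
    for citing, cited in citations:
        citing_map[citing].add(cited)
        cited_map[cited].add(citing)
    return dict(citing_map), dict(cited_map)


pairs = [("P1", "P2"), ("P1", "P3"), ("P2", "P3")]
citing_map, cited_map = build_citation_maps(pairs)
```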
  • The citation relationship between papers does play a vital role in paper correlation analysis, but considering citations alone is far from enough.
  • Paper collections are very large, and many papers with very similar research fields or research tasks have no direct or indirect citation relationship. Considering only citation relationships therefore loses correlation information.
  • The technical solution of the present invention takes into account not only the citation relationship but also the author relationship between papers, the relationship of shared technical terms, the category attribute of a paper, the community attribute of a paper, and so on. It can thus retain paper information along multiple dimensions and analyze correlation.
  • Another existing technology related to the technical solution of the present invention, shown in FIG. 2, is a method for constructing a paper classification model based on a gated graph attention network [Wang Meihong, Qiu Linling, Li Han, et al. Method and system for constructing a paper classification model based on a gated graph attention network:.].
  • The paper classification model proposed by that invention comprises several layers connected in sequence. Each layer includes a graph neural network structure and a classifier. In the first layer the graph neural network structure is connected directly to the classifier; in the t-th layer they are connected through a gate structure, where t is an integer greater than 1. The feature matrix of each paper sample i in the sample data set is determined and input into the classification model, and the model is trained with the type of each paper sample as its label.
  • That paper classification model introduces a gating mechanism on top of the graph attention network to aggregate information from distant nodes, which can improve classification accuracy to a certain extent.
  • However, because the number of trainable parameters is very large, the model places high demands on the data set, is difficult to train, and is prone to overfitting.
  • the purpose of the present invention is to provide a new paper correlation analysis method based on graph convolutional neural network and knowledge base.
  • A paper correlation analysis method based on a graph convolutional neural network and a knowledge base, comprising the following steps:
  • Step 1) extract key information from the paper collection and construct a paper-collection knowledge base;
  • Step 2) paper category classification: divide the documents of the paper collection according to their content and the directions they involve, and use an improved Inception-GCN model, built on the graph convolutional neural network, to classify the papers on the constructed paper-collection knowledge base, specifically:
  • Step 2.1) use external knowledge to label the categories of part of the paper collection;
  • Step 2.2) use the semi-supervised classification algorithm of the improved Inception-GCN model, built on the graph convolutional neural network, to classify the unlabeled papers;
  • Step 2.3) complete the paper category classification on the constructed paper-collection knowledge base, and add the obtained category attributes to the paper entities of the knowledge base;
  • Step 3) use the NOCO model, which is based on the graph convolutional neural network, to perform community discovery on the paper collection, and add the obtained community attributes to the paper entities of the paper-collection knowledge base.
  • The paper-collection knowledge base includes three nonlinear relationships: the citation relationship between papers, the authorship relationship between papers and authors, and the containment relationship between papers and technical terms.
  • The technical terms are obtained through partial manual annotation combined with a named entity recognition method.
  • The named entity recognition method is one of the SpaCy, NLTK or Stanford NER named entity recognition methods.
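How manually annotated terms turn into "paper contains term" relationships can be sketched with a plain dictionary matcher. This is a deliberately simplified stand-in: it is not SpaCy, NLTK or Stanford NER, and a trained recognizer would also find terms that were never annotated. The function name, relation name and sample texts are assumptions made for illustration.

```python
import re


def link_terms(papers, seed_terms):
    """Return (paper_id, 'contains_term', term) triples for seed terms found in each text."""
    triples = []
    for paper_id, text in papers.items():
        lowered = text.lower()
        for term in seed_terms:
            # Whole-phrase, case-insensitive match against the manually annotated term.
            if re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered):
                triples.append((paper_id, "contains_term", term))
    return sorted(triples)


papers = {
    "P1": "We propose a graph convolutional network for node classification.",
    "P2": "Community detection on citation graphs.",
}
seed_terms = ["graph convolutional network", "community detection", "node classification"]
triples = link_terms(papers, seed_terms)
```

Two papers that never cite each other become connected in the knowledge base as soon as they share a term entity, which is the correlation signal the term relationship is meant to add.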
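  • The semi-supervised classification algorithm of the improved Inception-GCN model performs multiple convolutions with different receptive fields in parallel and concatenates the depth slices of the different filters into the same layer, thereby merging the results.
  • The specific steps are as follows: let h_R(X, A) denote the graph convolutional network with receptive field R formed by simple series connection, where
  • the activation function of the first layer is ReLU
  • the activation function of the second layer is softmax
  • X is the initial feature matrix of the nodes on the graph
  • A is the adjacency matrix
  • W(l) is the weight matrix unique to each layer, namely the matrix to be trained, and (l) indicates which layer the matrix belongs to. After merging, the Inception-GCN consists of parallel branches (R = 1, 2, 3) that receive the same input, and the concatenation of the branch outputs is the overall output.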
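The parallel-branch idea can be sketched in plain Python. The exact formulas are not reproduced in this text, so the sketch makes two labeled assumptions: a branch with receptive field R propagates the input R times through the normalized adjacency matrix before projecting it, and only a single ReLU layer is shown (the softmax output layer is omitted). All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 with self-loops added."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def branch(a_hat, x, weight, hops):
    """Assumed branch with receptive field `hops`: propagate `hops` times, then project."""
    h = x
    for _ in range(hops):
        h = matmul(a_hat, h)
    return relu(matmul(h, weight))


def inception_gcn_layer(a_hat, x, weights):
    """Run branches R = 1..len(weights) on the same input and concatenate per node."""
    outputs = [branch(a_hat, x, w, hops) for hops, w in enumerate(weights, start=1)]
    return [sum((out[i] for out in outputs), []) for i in range(len(x))]


adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]            # 3-node path graph
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]           # 2 features per node
identity = [[1.0, 0.0], [0.0, 1.0]]
weights = [identity, identity, identity]           # one weight matrix per branch
hidden = inception_gcn_layer(normalize_adjacency(adj), x, weights)
```

Each node ends up with the concatenation of its 1-hop, 2-hop and 3-hop views, so the layer widens the receptive field without stacking more layers in series.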
  • Step 3) further comprises converting the task of partitioning the papers in the paper collection into a community discovery task on the graph knowledge base, with the following specific steps:
  • Step 3.1) use the Bernoulli-Poisson model to model the graph structure, using the community affiliation vector of each node as the parameter of a probability distribution that generates the values of the node adjacency matrix;
  • Step 3.2) use the graph convolutional neural network model to model the community affiliation vectors of the nodes together with the adjacency matrix and attribute vectors of the nodes on the graph, generating a community affiliation matrix;
  • Step 3.3) according to the community affiliation matrix, output the affiliation vector of each node and add the community attribute to the paper entity.
  • The probability distribution is generated as follows: given the community affiliations F, each entry A_uv of the adjacency matrix is sampled independently as A_uv ~ Bernoulli(1 - exp(-F_u F_v^T)).
  • After balancing the parameter weights, that is, weighting the terms according to whether the corresponding nodes on the graph are connected, the loss function is obtained from the log-likelihood of the parameter F under this distribution.
  • F(l) is the row vector representing the community affiliation of node l, that is, the l-th row of the matrix F.
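The formulas for this step are not reproduced in the text. Assuming they follow the Bernoulli-Poisson formulation in the Shchur and Günnemann paper that the description cites, the log-likelihood and the balanced loss would read as follows; E denotes the set of edges, and P_E and P_N denote uniform distributions over edges and non-edges, symbols introduced here for illustration.

```latex
% Generative model: A_{uv} \sim \mathrm{Bernoulli}\left(1 - \exp(-F_u F_v^{T})\right)
% Log-likelihood of F given the observed graph:
\log p(A \mid F) = \sum_{(u,v) \in E} \log\left(1 - \exp(-F_u F_v^{T})\right)
                 - \sum_{(u,v) \notin E} F_u F_v^{T}

% Balanced loss: edge and non-edge terms are averaged separately,
% so the much larger number of non-edges does not dominate.
\mathcal{L}(F) = - \mathbb{E}_{(u,v) \sim P_E}\left[ \log\left(1 - \exp(-F_u F_v^{T})\right) \right]
               + \mathbb{E}_{(u,v) \sim P_N}\left[ F_u F_v^{T} \right]
```

The non-edge term follows from -log(exp(-F_u F_v^T)) = F_u F_v^T.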
  • The graph convolutional neural network model is a two-layer graph convolutional neural network.
  • Each layer uses ReLU as the activation function to reduce the amount of computation. X denotes the input, the adjacency matrix used is that of the graph with self-loops added, W(l) is the weight matrix unique to each layer, which is the matrix to be trained, and (l) indicates which layer the matrix belongs to. The final affiliation matrix F is obtained by finding suitable neural network parameters θ.
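Under the same assumption that the model follows the cited Shchur and Günnemann formulation, the two-layer network and its training objective can be written as below. The symbol \hat{A} for the normalized adjacency matrix with self-loops and the layer indices on W are assumptions about notation.

```latex
% Two-layer GCN producing the non-negative community affiliation matrix F
F := \mathrm{GCN}_{\theta}(A, X)
   = \mathrm{ReLU}\left( \hat{A}\, \mathrm{ReLU}\left( \hat{A} X W^{(1)} \right) W^{(2)} \right),
\qquad \hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}, \quad \tilde{A} = A + I

% Parameters are found by minimizing the balanced Bernoulli-Poisson loss
\theta^{\star} = \arg\min_{\theta} \; \mathcal{L}\left( \mathrm{GCN}_{\theta}(A, X) \right),
\qquad \theta = \{ W^{(1)}, W^{(2)} \}
```

The final ReLU keeps every entry of F non-negative, which the Bernoulli-Poisson model requires so that 1 - exp(-F_u F_v^T) is a valid probability.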
  • Existing technology generally uses only the citation relationship or the author relationship between papers. Many papers with the same research field and similar research problems are not connected, directly or indirectly, by either relationship, so existing technology generally loses much of the correlation information between papers.
  • When constructing the paper knowledge base, the technical solution of the present invention not only uses the citation and author relationships described above, but also adds a "technical term" entity and adds the nonlinear relationship between "paper" entities and "technical term" entities to the knowledge base.
  • The technical terms are obtained through partial manual annotation combined with a named entity recognition method, which can capture most of the field-related and technology-related key content of a paper. This information cannot be obtained from the citation relationship or the author relationship.
  • The present invention proposes a new graph node classification model: the Inception-GCN model.
  • Combining the Inception method, originally used for CNN models, with the GCN model enables the new model to address the overfitting and over-smoothing problems effectively while strengthening its feature learning ability.
  • Experiments show that this model achieves better results than existing techniques when used for paper node classification.
  • FIG. 1 is a schematic diagram of the overall framework of prior art solution 1;
  • FIG. 2 is a schematic diagram of the overall framework of prior art solution 2;
  • FIG. 3 is a schematic diagram of the Inception network structure.
  • The new paper correlation analysis method of the present invention, based on a graph convolutional neural network and a knowledge base, comprises the following steps:
  • Step 1) extract key information from the paper collection and construct a paper-collection knowledge base;
  • After analyzing the relationships between papers, the present invention selects the following three nonlinear relationships to construct the paper-collection knowledge base: the citation relationship between papers, the authorship relationship between papers and authors, and the containment relationship between papers and technical terms.
  • The technical terms are not contained in the paper data set; they are obtained through partial manual annotation combined with a named entity recognition method.
  • After comparing common named entity recognition methods such as SpaCy, NLTK and Stanford NER, this embodiment preferably selects SpaCy as the named entity recognition method for technical terms.
  • Table 1 shows the entities and entity attributes of the paper-collection knowledge base finally constructed, and Table 2 shows the relationships between entities.
  • Step 2) paper-collection category classification: divide the documents in the paper collection according to their content and the directions they involve. The categories are predetermined, and no two categories intersect.
  • The core idea is to transform the category classification problem of the paper collection into a node classification problem in the knowledge base.
  • Step 2.1) use external knowledge to label the categories of part of the paper collection;
  • The large-scale SciKG knowledge graph from AMiner contains many concepts in the computer field and the relationships between them; documents that it shares with the paper collection are looked up in it and given category labels.
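The partial labeling in step 2.1 amounts to intersecting the paper collection with an external labeled source. A minimal sketch follows, assuming papers are matched by normalized title and that the external source has already been reduced to a title-to-category mapping. Neither assumption comes from the patent, and the sketch does not use any real SciKG or AMiner interface.

```python
def label_from_external(papers, external_labels):
    """Split papers into labeled (title found in the external source) and unlabeled sets."""

    def normalize(title):
        # Case-insensitive, whitespace-insensitive title key.
        return " ".join(title.lower().split())

    lookup = {normalize(t): c for t, c in external_labels.items()}
    labeled, unlabeled = {}, []
    for paper_id, title in papers.items():
        category = lookup.get(normalize(title))
        if category is None:
            unlabeled.append(paper_id)
        else:
            labeled[paper_id] = category
    return labeled, unlabeled


papers = {
    "P1": "Semi-Supervised Classification with  Graph Convolutional Networks",
    "P2": "An Unlisted Workshop Paper",
}
external_labels = {
    "semi-supervised classification with graph convolutional networks": "machine learning",
}
labeled, unlabeled = label_from_external(papers, external_labels)
```

The labeled subset provides the supervision for the semi-supervised classifier in step 2.2, which then predicts categories for the papers in `unlabeled`.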
  • Step 2.2) use the semi-supervised classification algorithm of the improved Inception-GCN model, built on the graph convolutional neural network, to classify the unlabeled papers;
  • To strengthen the learning ability of a conventional GCN, one can deepen the network or increase the features of each layer. Both enlarge the GCN's receptive field, but they also increase training difficulty and the risk of overfitting. The present invention introduces the Inception network structure into the GCN to solve these problems.
  • The Inception network structure module is shown in FIG. 3. It performs multiple convolutions with different receptive fields in parallel and concatenates the depth slices of the different filters into the same layer to merge the results.
  • the activation function of the first layer is ReLU
  • the activation function of the second layer is softmax
  • X is the initial feature matrix of the nodes on the graph
  • A is the adjacency matrix
  • W(l) is the weight matrix unique to each layer, that is, the matrix to be trained, and (l) indicates which layer the matrix belongs to.
  • Step 2.3 complete the classification of paper categories on the constructed paper collection knowledge base, and add the obtained category attributes to the paper entity of the knowledge base.
  • Step 3 use the NOCO model based on the graph convolutional neural network to complete the community discovery of the paper collection, and add the obtained community attributes to the paper entity of the knowledge base.
  • The present invention uses the NOCO model proposed by Shchur et al. [Shchur, Oleksandr, Günnemann, Stephan. Overlapping Community Detection with Graph Neural Networks [C]. The First International Workshop on Deep Learning on Graphs: Methods and Applications (DLG'19), 2019.] to complete the paper community discovery task in the knowledge base created here. On several data sets with ground-truth community annotations, the model has been shown to recover the original communities well in the unsupervised setting.
  • The NOCO model consists of two parts: a Bernoulli-Poisson model and a graph convolutional neural network model.
  • The Bernoulli-Poisson model is used to model the graph structure.
  • Each value of the node adjacency matrix is treated as the outcome of a probability distribution whose parameters are the community affiliation vectors of the nodes.
  • The graph convolutional neural network model takes the adjacency matrix and the attribute vectors of the nodes on the graph and produces the community affiliation vector of each node.
  • Step 3.1) use the Bernoulli-Poisson model to model the graph structure, using the community affiliation vector of each node as the parameter of a probability distribution that generates the values of the node adjacency matrix.
  • After balancing the parameter weights, that is, weighting by the number of non-edges and the number of edges on the graph respectively, the loss function used is obtained.
  • F(l) is the row vector representing the community affiliation of node l, that is, the l-th row of the matrix F.
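A small numeric sketch may help make the balanced loss concrete. It assumes the Bernoulli-Poisson edge probability 1 - exp(-F_u F_v^T) and averages the edge and non-edge terms separately; the function names, the `eps` guard against log(0) and the toy affiliation matrices are illustrative assumptions.

```python
import math


def edge_probability(f_u, f_v):
    """P(A_uv = 1) = 1 - exp(-F_u . F_v) under the Bernoulli-Poisson model."""
    return 1.0 - math.exp(-sum(a * b for a, b in zip(f_u, f_v)))


def balanced_loss(f, edges, non_edges, eps=1e-10):
    """Mean negative log-likelihood over edges plus mean over non-edges."""
    edge_term = -sum(
        math.log(edge_probability(f[u], f[v]) + eps) for u, v in edges
    ) / len(edges)
    # For a non-edge, -log(exp(-F_u . F_v)) simplifies to the dot product itself.
    non_edge_term = sum(
        sum(a * b for a, b in zip(f[u], f[v])) for u, v in non_edges
    ) / len(non_edges)
    return edge_term + non_edge_term


edges = [(0, 1)]
non_edges = [(0, 2), (1, 2)]
# Nodes 0 and 1 share community 0; node 2 sits alone in community 1.
f_good = [[2.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
# Nodes 0 and 2 share a community although they are not connected.
f_bad = [[2.0, 0.0], [0.0, 2.0], [2.0, 0.0]]
good_loss = balanced_loss(f_good, edges, non_edges)
bad_loss = balanced_loss(f_bad, edges, non_edges)
```

Affiliations that place connected nodes in a common community and unconnected nodes in different ones give the lower loss, which is what training the network on this objective rewards.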
  • Step 3.2) use the graph convolutional neural network model to model the community affiliation vectors of the nodes together with the adjacency matrix and attribute vectors of the nodes on the graph, generating a community affiliation matrix.
  • The model is a two-layer graph convolutional neural network.
  • Each layer uses ReLU as the activation function to reduce the amount of computation.
  • X denotes the input, the adjacency matrix used is that of the graph with self-loops added, W(l) is the weight matrix unique to each layer, which is the matrix to be trained, and (l) indicates which layer the matrix belongs to. The final affiliation matrix F is obtained by finding suitable neural network parameters θ.
  • Step 3.3) the model outputs an affiliation vector for each node, and the community attribute is added to the paper entity.
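Turning affiliation vectors into community attributes can be sketched as a simple thresholding step. The threshold value of 0.5 and the function name are assumptions made for illustration; the text does not state how affiliation strengths are converted into labels.

```python
def assign_communities(membership, threshold=0.5):
    """Map each node to every community whose affiliation strength exceeds the threshold."""
    return {
        node: [c for c, strength in enumerate(row) if strength > threshold]
        for node, row in membership.items()
    }


# Rows of the affiliation matrix F, keyed by paper identifier (toy values).
membership = {
    "P1": [1.2, 0.0, 0.7],
    "P2": [0.1, 0.9, 0.0],
    "P3": [0.0, 0.2, 0.3],
}
communities = assign_communities(membership)
```

Because each paper keeps every community above the threshold, a paper can belong to several communities or to none, which matches the overlapping nature of community discovery described earlier.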
  • The embodiments in this specification are described in a progressive manner; for parts that are the same or similar, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others.
  • In particular, device or system embodiments are described relatively briefly because they are basically similar to the method embodiments; for the relevant parts, refer to the description of the method embodiments.
  • The device and system embodiments described above are only illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Animal Behavior & Ethology (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

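A new paper correlation analysis method: extract key information from a paper collection, construct a paper-collection knowledge base, use an improved Inception-GCN model built on the graph convolutional neural network to complete paper category classification, use the NOCO model to complete paper community discovery, and thereby complete the correlation analysis of the papers in the paper-collection knowledge base. Also provided is a new graph node classification model, the Inception-GCN model. Combining the Inception method, originally used for CNN models, with the GCN model enables the new model to address the overfitting and over-smoothing problems effectively while strengthening its feature learning ability. Experiments show that using this model for paper node classification achieves better results than the prior art.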

Description

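Paper correlation analysis method based on a graph convolutional neural network and a knowledge base
Technical Field
The present application relates to the field of computer technology, and in particular to a paper correlation analysis method based on a graph convolutional neural network and a knowledge base.
Background
In the 21st century, the continuous emergence of academic research results reflects the progress of the times and the development of science and technology, but storing, analyzing and managing such a large body of results is also very laborious. In recent years the number of papers in every field has surged and their innovations are diverse, while the demand for consulting and tallying papers related to specific fields and tasks has grown ever stronger. This poses considerable challenges for paper analysis technology.
In paper correlation analysis, the two most important subtasks are paper category classification and paper community discovery. Paper category classification divides the literature by research field and research task and adds category labels to the paper entities in the paper-collection knowledge base. The categories are predetermined, and no two categories intersect. Paper community discovery divides the literature into "communities" and adds community labels to the paper entities in the paper-collection knowledge base. A "community" is a labeled set of papers whose members are closely connected to one another and only sparsely connected to papers outside the set. Unlike paper category classification, community discovery has no predetermined labels, and communities may overlap.
Taking these two subtasks as its starting point, the present invention proposes a new paper correlation analysis method: extract key information from the paper collection, build a knowledge base, use an improved Inception-GCN model built on the graph convolutional neural network to complete paper category classification, use the NOCO model to complete paper community discovery, and thereby complete the correlation analysis of the papers in the collection.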
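Related key technologies
Knowledge base:
A knowledge base is a collection of knowledge that is stored, organized, managed and used in a computer to describe concepts in the physical world and their interrelationships. Knowledge is expressed as "entity-relationship-entity" or "entity-attribute-attribute value" triples, and the knowledge base is the collection of such triples. Because entities are connected to one another through relationships, the knowledge base is a complex networked knowledge structure that can describe, store and manage intricate knowledge systems more faithfully and so meet the needs of subsequent analysis.
Graph convolutional neural network:
The graph convolutional neural network (Graph Convolution Network) is a scalable method for semi-supervised learning on graph data, based on a variant of the convolutional neural network, and is a deep-learning graph embedding method that does not rely on random walks. Graph data differs from traditional sequences and images: it is non-Euclidean data of infinite dimension. A graph has an unordered set of nodes of variable size, and each node has a different number of neighbors. This complexity poses great challenges for existing deep learning methods. The graph convolutional neural network extends the convolution operation from traditional data to graph data; like a convolutional neural network, it is essentially a feature extractor. It is the basis of many complex graph neural network models, and the features extracted by a GCN can be used for downstream tasks on graph data such as node classification, graph classification and link prediction.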
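Prior art 1 related to the technical solution of the present invention
One existing technology related to the technical solution of the present invention, shown in FIG. 1, is an academic big data analysis method based on the citation relationship between papers [Tan Zhaowei, Liu Changfeng, Zhou Jinguang, et al. Academic big data analysis method based on the citation relationship between papers: CN105808729B[P]. 2019.]. That invention provides an academic big data analysis method based on the citation relationship between papers, whose implementation comprises the following three steps: (1) after performing correlation analysis and processing on the local paper data set, construct a paper citation network in the database; (2) construct an analysis algorithm according to the citation relationships in the paper citation network, and use it to obtain the importance and interrelationships of the nodes in the network and the importance of each paper relative to the central paper; (3) convert the one-to-one citation relationships between papers into a mapping set in the citing direction and a mapping set in the cited direction, obtain the development path between specified papers in the paper citation network through an extraction algorithm, and calculate the importance of the path from the paper importance obtained in (2).
Drawbacks of prior art 1
The citation relationship between papers does play a vital role in paper correlation analysis, but considering citations alone is far from enough. Paper collections are very large, and many papers with very similar research fields or research tasks have no direct or indirect citation relationship. Considering only citation relationships therefore loses correlation information. The technical solution of the present invention takes into account not only the citation relationship but also the author relationship between papers, the relationship of shared technical terms, the category attribute of a paper, the community attribute of a paper, and so on. It can thus retain paper information along multiple dimensions and analyze correlation.
Prior art 2 related to the present invention
Technical solution of prior art 2
Another existing technology related to the technical solution of the present invention, shown in FIG. 2, is a method for constructing a paper classification model based on a gated graph attention network [Wang Meihong, Qiu Linling, Li Han, et al. Method and system for constructing a paper classification model based on a gated graph attention network:.]. The paper classification model proposed by that invention comprises several layers connected in sequence. Each layer includes a graph neural network structure and a classifier. In the first layer the graph neural network structure is connected directly to the classifier; in the t-th layer they are connected through a gate structure, where t is an integer greater than 1. The feature matrix of each paper sample i in the sample data set is determined and input into the classification model, and the model is trained with the type of each paper sample as its label.
Drawbacks of prior art 2
The paper classification model proposed by that invention introduces a gating mechanism on top of the graph attention network to aggregate information from distant nodes, which can improve classification accuracy to a certain extent. However, because the number of trainable parameters is very large, the model places high demands on the data set, is difficult to train, and is prone to overfitting.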
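Summary of the Invention
The object of the present invention is to provide a new paper correlation analysis method based on a graph convolutional neural network and a knowledge base. First, key information is extracted from the paper collection and a paper knowledge base is constructed. The category classification problem of the paper collection is then transformed into a node classification problem in the knowledge base: an improved Inception-GCN model, built on the graph convolutional neural network, completes the paper category classification on the constructed knowledge base, and the obtained category attributes are added to the paper entities of the knowledge base. Finally, the NOCO model, which is based on the graph convolutional neural network, performs community discovery on the paper collection, and the obtained community attributes are added to the paper entities of the knowledge base.
To achieve this object, the technical solution provided by the present invention is:
A paper correlation analysis method based on a graph convolutional neural network and a knowledge base, comprising the following steps:
Step 1) extract key information from the paper collection and construct a paper-collection knowledge base;
Step 2) paper category classification: divide the documents of the paper collection according to their content and the directions they involve, use an improved Inception-GCN model, built on the graph convolutional neural network, to complete the paper category classification on the constructed paper-collection knowledge base, and add the obtained category attributes to the paper entities of the paper-collection knowledge base, specifically:
Step 2.1) use external knowledge to label the categories of part of the paper collection;
Step 2.2) use the semi-supervised classification algorithm of the improved Inception-GCN model, built on the graph convolutional neural network, to classify the unlabeled papers;
Step 2.3) complete the paper category classification on the constructed paper-collection knowledge base, and add the obtained category attributes to the paper entities of the paper-collection knowledge base;
Step 3) use the NOCO model, which is based on the graph convolutional neural network, to perform community discovery on the paper collection, and add the obtained community attributes to the paper entities of the paper-collection knowledge base.
Preferably, in step 1), the paper-collection knowledge base contains three nonlinear relationships: the citation relationship between papers, the authorship relationship between papers and authors, and the containment relationship between papers and technical terms.
Preferably, the technical terms are obtained through partial manual annotation combined with a named entity recognition method.
Preferably, the named entity recognition method is one of the SpaCy, NLTK or Stanford NER named entity recognition methods.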
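Preferably, the semi-supervised classification algorithm of the improved Inception-GCN model performs multiple convolutions with different receptive fields in parallel and concatenates the depth slices of the different filters into the same layer, thereby merging the results. The specific steps are:
Let h_R(X, A) denote the graph convolutional network with receptive field R formed by simple series connection:
Figure PCTCN2022131993-appb-000001
where the activation function of the first layer is ReLU, the activation function of the second layer is softmax, X is the initial feature matrix of the nodes on the graph, A is the adjacency matrix, W(l) is the weight matrix unique to each layer, namely the matrix to be trained, and (l) indicates which layer the matrix belongs to;
After merging, the Inception-GCN is:
Figure PCTCN2022131993-appb-000002
where ∪_{R=1,2,3} h_R(..., A) denotes R parallel branches receiving the same input, and the concatenation of the branch outputs is the overall output.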
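Preferably, step 3) further comprises converting the task of partitioning the papers in the paper collection into a community discovery task on the graph knowledge base, with the following specific steps:
Step 3.1) use the Bernoulli-Poisson model to model the graph structure, using the community affiliation vector of each node as the parameter of a probability distribution that generates the values of the node adjacency matrix;
Step 3.2) use the graph convolutional neural network model to model the community affiliation vectors of the nodes together with the adjacency matrix and attribute vectors of the nodes on the graph, generating a community affiliation matrix;
Step 3.3) according to the community affiliation matrix, output the affiliation vector of each node and add the community attribute to the paper entity.
Preferably, in step 3.1), the probability distribution is generated as follows:
given the affiliations
Figure PCTCN2022131993-appb-000003
each entry A_uv of the adjacency matrix is sampled independently and identically according to A_uv ~ Bernoulli(1 - exp(-F_u F_v^T)), and the log-likelihood function of the parameter F under this distribution is:
Figure PCTCN2022131993-appb-000004
After balancing the parameter weights, that is, weighting according to whether the nodes on the graph are connected, the loss function used is:
Figure PCTCN2022131993-appb-000005
where F(l) is the row vector representing the community affiliation of node l, that is, the l-th row of the matrix F.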
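Preferably, in step 3.2),
the graph convolutional neural network model is a two-layer graph convolutional neural network, with the formula:
Figure PCTCN2022131993-appb-000006
where each layer uses ReLU as the activation function to reduce the amount of computation, X denotes the input,
Figure PCTCN2022131993-appb-000007
denotes the adjacency matrix of the graph with self-loops, W(l) is the weight matrix unique to each layer, which is the matrix to be trained, and (l) indicates which layer the matrix belongs to;
The final affiliation matrix F is obtained by finding suitable neural network parameters θ:
Figure PCTCN2022131993-appb-000008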
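The beneficial effects of the present invention are:
The present invention effectively solves the following technical problems of the prior art:
(1) Extraction of the nonlinear relationships between papers is limited to a single kind.
Existing technology generally uses only the citation relationship or the author relationship between papers. Many papers with the same research field and similar research problems are not connected, directly or indirectly, by either relationship, so existing technology generally loses much of the correlation information between papers. When constructing the paper knowledge base, the technical solution of the present invention not only uses the citation and author relationships described above, but also adds a "technical term" entity and adds the nonlinear relationship between "paper" entities and "technical term" entities to the knowledge base. The technical terms are obtained through partial manual annotation combined with a named entity recognition method, which can capture most of the field-related and technology-related key content of a paper. This information cannot be obtained from the citation relationship or the author relationship.
(2) Existing technology generally performs correlation analysis on a paper collection only through the paper classification task. Paper classification divides papers into preset categories, so the category attribute information it can add to papers is limited. While implementing paper category classification, the present invention also implements the paper community discovery task. Community discovery can analyze the paper collection without supervision, and the resulting community attributes cover a wider scope and richer content than category attributes. The two tasks complement each other and yield a more comprehensive correlation analysis of the paper collection.
(3) The present invention proposes a new graph node classification model: the Inception-GCN model. Combining the Inception method, originally used for CNN models, with the GCN model enables the new model to address the overfitting and over-smoothing problems effectively while strengthening its feature learning ability. Experiments show that using this model for paper node classification achieves better results than the prior art.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the overall framework of prior art solution 1;
FIG. 2 is a schematic diagram of the overall framework of prior art solution 2;
FIG. 3 is a schematic diagram of the Inception network structure.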
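Detailed Description
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present invention and are not to be construed as limiting it.
The new paper correlation analysis method of the present invention, based on a graph convolutional neural network and a knowledge base, comprises the following steps:
Step 1) extract key information from the paper collection and construct a paper-collection knowledge base;
After analyzing the relationships between papers, the present invention selects the following three nonlinear relationships to construct the paper-collection knowledge base: the citation relationship between papers, the authorship relationship between papers and authors, and the containment relationship between papers and technical terms. The technical terms are not contained in the paper data set; they are obtained through partial manual annotation combined with a named entity recognition method. After comparing common named entity recognition methods such as SpaCy, NLTK and Stanford NER, this embodiment preferably selects SpaCy as the named entity recognition method for technical terms.
The entities and entity attributes of the paper-collection knowledge base finally constructed are shown in Table 1, and the relationships between entities are shown in Table 2.
Figure PCTCN2022131993-appb-000009
Table 1: Entities and attributes
Paper | Author | Technical term
Citing/cited relationship | Authorship relationship | Containment relationship
Table 2: Relationships between entities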
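Step 2) paper-collection category classification: divide the documents of the paper collection according to their content and the directions they involve. The categories are predetermined, and no two categories intersect. The core idea is to transform the category classification problem of the paper collection into a node classification problem in the knowledge base. The detailed steps are as follows:
Step 2.1) use external knowledge to label the categories of part of the paper collection;
The large-scale SciKG knowledge graph from AMiner contains many concepts in the computer field and the relationships between them; documents that it shares with the paper collection are looked up in it and given category labels.
Step 2.2) use the semi-supervised classification algorithm of the improved Inception-GCN model, built on the graph convolutional neural network, to classify the unlabeled papers;
The formula of the conventional graph convolutional neural network (GCN) is:
Figure PCTCN2022131993-appb-000010
Problem: to strengthen the learning ability of a GCN, one can deepen the network or increase the features of each layer. Both enlarge the GCN's receptive field, but they also increase training difficulty and the risk of overfitting.
The present invention introduces the Inception network structure into the GCN to solve these problems. The Inception network structure module, shown in FIG. 3, performs multiple convolutions with different receptive fields in parallel and concatenates the depth slices of the different filters into the same layer, thereby merging the results.
Let h_R(X, A) denote the graph convolutional network with receptive field R formed by simple series connection:
Figure PCTCN2022131993-appb-000011
where the activation function of the first layer is ReLU, the activation function of the second layer is softmax, X is the initial feature matrix of the nodes on the graph, A is the adjacency matrix, W(l) is the weight matrix unique to each layer, which is the matrix to be trained, and (l) indicates which layer the matrix belongs to.
After merging, the Inception-GCN is:
Figure PCTCN2022131993-appb-000012
where ∪_{R=1,2,3} h_R(..., A) denotes R parallel branches receiving the same input, and the concatenation of the branch outputs is the overall output.
Step 2.3) complete the paper category classification on the constructed paper-collection knowledge base, and add the obtained category attributes to the paper entities of the knowledge base.
Step 3) use the NOCO model, which is based on the graph convolutional neural network, to perform community discovery on the paper collection, and add the obtained community attributes to the paper entities of the knowledge base.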
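The present invention uses the NOCO model proposed by Shchur et al. [Shchur, Oleksandr, Günnemann, Stephan. Overlapping Community Detection with Graph Neural Networks [C]. The First International Workshop on Deep Learning on Graphs: Methods and Applications (DLG'19), 2019.] to complete the paper community discovery task in the knowledge base created here. On several data sets with ground-truth community annotations, the model has been shown to recover the original communities well in the unsupervised setting.
The NOCO model consists of two parts: a Bernoulli-Poisson model and a graph convolutional neural network model. The Bernoulli-Poisson model is used to model the graph structure: each value of the node adjacency matrix is treated as the outcome of a probability distribution whose parameters are the community affiliation vectors of the nodes. The graph convolutional neural network model takes the adjacency matrix and the attribute vectors of the nodes on the graph and produces the community affiliation vector of each node.
The specific steps by which the NOCO model completes community discovery and adds community attributes to the paper entities are as follows:
Step 3.1) use the Bernoulli-Poisson model to model the graph structure, using the community affiliation vector of each node as the parameter of a probability distribution that generates the values of the node adjacency matrix.
The distribution is generated as follows: given the affiliations
Figure PCTCN2022131993-appb-000013
each entry A_uv of the adjacency matrix is sampled independently and identically according to A_uv ~ Bernoulli(1 - exp(-F_u F_v^T)), and the log-likelihood function of the parameter F under this distribution is
Figure PCTCN2022131993-appb-000014
After balancing the parameter weights, that is, weighting by the number of non-edges and the number of edges on the graph respectively, the loss function used is obtained.
Figure PCTCN2022131993-appb-000015
where F(l) is the row vector representing the community affiliation of node l, that is, the l-th row of the matrix F.
Step 3.2) use the graph convolutional neural network model to model the community affiliation vectors of the nodes together with the adjacency matrix and attribute vectors of the nodes on the graph, generating a community affiliation matrix.
The model is a two-layer graph convolutional neural network, with the formula:
Figure PCTCN2022131993-appb-000016
where each layer uses ReLU as the activation function to reduce the amount of computation. X denotes the input,
Figure PCTCN2022131993-appb-000017
denotes the adjacency matrix of the graph with self-loops, W(l) is the weight matrix unique to each layer, which is the matrix to be trained, and (l) indicates which layer the matrix belongs to.
The final affiliation matrix F is obtained by finding suitable neural network parameters θ:
Figure PCTCN2022131993-appb-000018
Step 3.3) the model outputs an affiliation vector for each node, and the community attribute is added to the paper entity.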
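As the above embodiments show, the main technical contributions of the present invention are:
(1) A new method for constructing a paper-collection knowledge base: it considers multiple nonlinear relationships, includes rich entities such as "technical terms", and adds attributes such as paper category and community.
(2) An improvement to the graph convolutional neural network: the Inception-GCN model is proposed for the paper category classification task.
(3) Paper correlation analysis is carried out through the two subtasks of paper category classification and paper community discovery, which improves the analysis results.
Those of ordinary skill in the art will understand that the drawings are only schematic diagrams of one embodiment, and the modules or flows in the drawings are not necessarily required for implementing the present invention.
From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software together with a necessary general-purpose hardware platform. On this understanding, the essence of the technical solution of the present invention, or the part that contributes over the prior art, can be embodied as a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments, or in certain parts of the embodiments, of the present invention.
The embodiments in this specification are described in a progressive manner; for parts that are the same or similar, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, device or system embodiments are described relatively briefly because they are basically similar to the method embodiments; for the relevant parts, refer to the description of the method embodiments. The device and system embodiments described above are only illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
The above is only a preferred specific embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of protection of the claims.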

Claims (8)

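  1. A paper correlation analysis method based on a graph convolutional neural network and a knowledge base, characterized by comprising the following steps:
    Step 1) extract key information from the paper collection and construct a paper-collection knowledge base;
    Step 2) paper category classification: divide the documents of the paper collection according to their content and the directions they involve, use an improved Inception-GCN model, built on the graph convolutional neural network, to complete the paper category classification on the constructed paper-collection knowledge base, and add the obtained category attributes to the paper entities of the paper-collection knowledge base, specifically:
    Step 2.1) use external knowledge to label the categories of part of the paper collection;
    Step 2.2) use the semi-supervised classification algorithm of the improved Inception-GCN model, built on the graph convolutional neural network, to classify the unlabeled papers;
    Step 2.3) complete the paper category classification on the constructed paper-collection knowledge base, and add the obtained category attributes to the paper entities of the paper-collection knowledge base;
    Step 3) use the NOCO model, which is based on the graph convolutional neural network, to perform community discovery on the paper collection, and add the obtained community attributes to the paper entities of the paper-collection knowledge base.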
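  2. The paper correlation analysis method based on a graph convolutional neural network and a knowledge base according to claim 1, characterized in that, in step 1), the paper-collection knowledge base contains three nonlinear relationships: the citation relationship between papers, the authorship relationship between papers and authors, and the containment relationship between papers and technical terms.
  3. The paper correlation analysis method based on a graph convolutional neural network and a knowledge base according to claim 2, characterized in that the technical terms are obtained through partial manual annotation combined with a named entity recognition method.
  4. The paper correlation analysis method based on a graph convolutional neural network and a knowledge base according to claim 3, characterized in that the named entity recognition method is one of the SpaCy, NLTK or Stanford NER named entity recognition methods.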
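  5. The paper correlation analysis method based on a graph convolutional neural network and a knowledge base according to claim 1, characterized in that, in step 2.2), the semi-supervised classification algorithm of the improved Inception-GCN model performs multiple convolutions with different receptive fields in parallel and concatenates the depth slices of the different filters into the same layer, thereby merging the results, the specific steps being:
    Let h_R(X, A) denote the graph convolutional network with receptive field R formed by simple series connection:
    Figure PCTCN2022131993-appb-100001
    where the activation function of the first layer is ReLU, the activation function of the second layer is softmax, X is the initial feature matrix of the nodes on the graph, A is the adjacency matrix, W(l) is the weight matrix unique to each layer, namely the matrix to be trained, and (l) indicates which layer the matrix belongs to;
    After merging, the Inception-GCN is:
    Figure PCTCN2022131993-appb-100002
    where ∪_{R=1,2,3} h_R(..., A) denotes R parallel branches receiving the same input, and the concatenation of the branch outputs is the overall output.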
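  6. The paper correlation analysis method based on a graph convolutional neural network and a knowledge base according to claim 1, characterized in that step 3) further comprises converting the task of partitioning the papers in the paper collection into a community discovery task on the graph knowledge base, with the following specific steps:
    Step 3.1) use the Bernoulli-Poisson model to model the graph structure, using the community affiliation vector of each node as the parameter of a probability distribution that generates the values of the node adjacency matrix;
    Step 3.2) use the graph convolutional neural network model to model the community affiliation vectors of the nodes together with the adjacency matrix and attribute vectors of the nodes on the graph, generating a community affiliation matrix;
    Step 3.3) according to the community affiliation matrix, output the affiliation vector of each node and add the community attribute to the paper entity.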
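  7. The paper correlation analysis method based on a graph convolutional neural network and a knowledge base according to claim 6, characterized in that, in step 3.1), the probability distribution is generated as follows:
    given the affiliations
    Figure PCTCN2022131993-appb-100003
    each entry A_uv of the adjacency matrix is sampled independently and identically according to A_uv ~ Bernoulli(1 - exp(-F_u F_v^T)), and the log-likelihood function of the parameter F under this distribution is:
    Figure PCTCN2022131993-appb-100004
    After balancing the parameter weights, that is, weighting according to whether the nodes on the graph are connected, the loss function used is:
    Figure PCTCN2022131993-appb-100005
    where F(l) is the row vector representing the community affiliation of node l, that is, the l-th row of the matrix F.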
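  8. The paper correlation analysis method based on a graph convolutional neural network and a knowledge base according to claim 6, characterized in that, in step 3.2),
    the graph convolutional neural network model is a two-layer graph convolutional neural network given by:
    Figure PCTCN2022131993-appb-100006
    where every layer uses ReLU as its activation function to reduce the amount of computation; $X$ denotes the input;
    Figure PCTCN2022131993-appb-100007
    denotes the adjacency matrix of the graph with self-loops; $W^{(l)}$ is the weight matrix specific to each layer, i.e. the matrix to be trained; and $(l)$ indicates which layer the matrix belongs to;
    the final membership matrix $F$ is obtained by finding suitable neural network parameters $\theta$:
    Figure PCTCN2022131993-appb-100008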
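A forward pass of the two-layer network in claim 8, together with a loss it could be trained against, can be sketched as follows. The formula images are not reproduced here, so several points are assumptions: the self-loop adjacency is symmetrically normalised by degree, the loss averages connected and unconnected pairs separately, and the names `membership_gcn` and `balanced_loss` and all weight values are hypothetical. Training, that is the search for the parameters θ, is omitted; in practice it would be done with a gradient-based optimiser.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def self_loop_adjacency(adj):
    """A + I with symmetric degree normalisation.
    ASSUMPTION: the claim only says the matrix includes self-loops."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def membership_gcn(x, adj, w1, w2):
    """Two graph-convolution layers, each followed by ReLU, so every entry of F is >= 0."""
    a_hat = self_loop_adjacency(adj)
    hidden = relu(matmul(matmul(a_hat, x), w1))
    return relu(matmul(matmul(a_hat, hidden), w2))


def mean(values):
    return sum(values) / len(values) if values else 0.0


def balanced_loss(f, adj, eps=1e-8):
    """Bernoulli-Poisson negative log-likelihood with connected and unconnected pairs
    averaged separately (ASSUMED balancing scheme)."""
    edge_terms, non_edge_terms = [], []
    n = len(adj)
    for u in range(n):
        for v in range(u + 1, n):
            score = sum(a * b for a, b in zip(f[u], f[v]))  # F_u F_v^T
            if adj[u][v]:
                edge_terms.append(-math.log(1.0 - math.exp(-score) + eps))
            else:
                non_edge_terms.append(score)
    return mean(edge_terms) + mean(non_edge_terms)


# Two disconnected pairs of papers: (0, 1) and (2, 3).
adj = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
x = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
w1 = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]   # illustrative, hand-picked weights
w2 = [[2.0, 0.0], [0.0, 2.0]]
F = membership_gcn(x, adj, w1, w2)
loss = balanced_loss(F, adj)
```

With these hand-picked weights each connected pair ends up with a large membership in its own community, so the loss is small; a membership matrix that splits the connected papers across communities scores much worse under the same function.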
PCT/CN2022/131993 2022-02-18 2022-11-15 Paper correlation analysis method based on graph convolutional neural network and knowledge base WO2023155508A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210150878.8 2022-02-18
CN202210150878.8A CN114741519A (zh) 2022-02-18 2022-02-18 Paper correlation analysis method based on graph convolutional neural network and knowledge base

Publications (1)

Publication Number Publication Date
WO2023155508A1 (zh) 2023-08-24

Family

ID=82275780

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/131993 WO2023155508A1 (zh) 2022-02-18 2022-11-15 Paper correlation analysis method based on graph convolutional neural network and knowledge base

Country Status (2)

Country Link
CN (1) CN114741519A (zh)
WO (1) WO2023155508A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116992098A (zh) * 2023-09-26 2023-11-03 杭州康晟健康管理咨询有限公司 Citation network data processing method and system
CN117828513B (zh) * 2024-03-04 2024-06-04 北京邮电大学 Method and device for checking topic-irrelevant citations in papers

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
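CN114741519A (zh) * 2022-02-18 2022-07-12 北京邮电大学 Paper correlation analysis method based on graph convolutional neural network and knowledge base
CN117807237B (zh) * 2024-02-28 2024-05-03 苏州元脑智能科技有限公司 Paper classification method, apparatus, device and medium based on multivariate data fusion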

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
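CN112632296A (zh) * 2020-12-31 2021-04-09 上海交通大学 Interpretable paper recommendation method, system and terminal based on knowledge graph
CN112749757A (zh) * 2021-01-21 2021-05-04 厦门大学 Method and system for constructing a paper classification model based on gated graph attention network
CN112836050A (zh) * 2021-02-04 2021-05-25 山东大学 Citation network node classification method and system for relation uncertainty
CN113312480A (zh) * 2021-05-19 2021-08-27 北京邮电大学 Hierarchical multi-label classification method and device for scientific papers based on graph convolutional network
CN114741519A (zh) * 2022-02-18 2022-07-12 北京邮电大学 Paper correlation analysis method based on graph convolutional neural network and knowledge base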

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
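CN116992098A (zh) * 2023-09-26 2023-11-03 杭州康晟健康管理咨询有限公司 Citation network data processing method and system
CN116992098B (zh) * 2023-09-26 2024-02-13 杭州康晟健康管理咨询有限公司 Citation network data processing method and system
CN117828513B (zh) * 2024-03-04 2024-06-04 北京邮电大学 Method and device for checking topic-irrelevant citations in papers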

Also Published As

Publication number Publication date
CN114741519A (zh) 2022-07-12

Similar Documents

Publication Publication Date Title
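WO2023155508A1 (zh) Paper correlation analysis method based on graph convolutional neural network and knowledge base
CN110674407B (zh) Hybrid recommendation method based on graph convolutional neural network
CN110633366A (zh) Short text classification method, device and storage medium
Zhang et al. Cross-domain recommendation with semantic correlation in tagging systems
CN114565053B (zh) Deep heterogeneous graph embedding model based on feature fusion
CN113962293B (zh) Name disambiguation method and system based on LightGBM classification and representation learning
CN113468291B (zh) Automatic patent classification method based on patent network representation learning
CN114564573A (zh) Academic collaboration prediction method based on heterogeneous graph neural network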
Nauata et al. Structured label inference for visual understanding
Li et al. Multi-view clustering via adversarial view embedding and adaptive view fusion
Aziguli et al. A robust text classifier based on denoising deep neural network in the analysis of big data
Jurek-Loughrey et al. Semi-supervised and unsupervised approaches to record pairs classification in multi-source data linkage
Chuang et al. TPR: Text-aware preference ranking for recommender systems
Souravlas et al. Probabilistic community detection in social networks
Hu et al. EGC: A novel event-oriented graph clustering framework for social media text
Li et al. Semi-supervised variational user identity linkage via noise-aware self-learning
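CN113159976B (zh) Method for identifying important users in a microblog network
CN114722304A (zh) Topic-based community search method on heterogeneous information networks
CN114637846A (zh) Video data processing method, apparatus, computer device and storage medium
CN113779248A (zh) Data classification model training method, data processing method and storage medium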
To et al. An adaptive machine learning framework with user interaction for ontology matching
Yu et al. Workflow recommendation based on graph embedding
Rastogi et al. Unsupervised Classification of Mixed Data Type of Attributes Using Genetic Algorithm (Numeric, Categorical, Ordinal, Binary, Ratio-Scaled)
Ostrowski Predictive semantic social media analysis
Simon et al. MFRRI: Research on multi-feature joint recommendation algorithm based on graph neural network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22926812

Country of ref document: EP

Kind code of ref document: A1