CN116467463A - Multi-mode knowledge graph representation learning system and product based on sub-graph learning - Google Patents


Publication number
CN116467463A
Authority
CN
China
Prior art keywords
graph
entity
mode
representation
modal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310416736.6A
Other languages
Chinese (zh)
Inventor
王平辉
梁润颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202310416736.6A priority Critical patent/CN116467463A/en
Publication of CN116467463A publication Critical patent/CN116467463A/en
Pending legal-status Critical Current


Classifications

    • G06F 16/367 — Information retrieval of unstructured textual data; creation of semantic tools; ontologies
    • G06F 18/22 — Pattern recognition; analysing; matching criteria, e.g. proximity measures
    • G06F 18/25 — Pattern recognition; analysing; fusion techniques
    • G06N 3/04 — Neural networks; architecture, e.g. interconnection topology
    • G06N 3/08 — Neural networks; learning methods
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a multi-modal knowledge graph representation learning system, and product, based on sub-graph learning, relating to the technical field of representation learning and comprising the following components: a multi-modal subgraph construction subsystem extracts the multi-modal structural information of the head entity in a target triplet to obtain a multi-modal subgraph, where the multi-modal subgraph comprises a visual scene graph and the self-centering graph of the head entity; a neighborhood feature aggregation subsystem fuses multiple channel feature components of the entity features to obtain a multi-modal embedded representation of the entity; and a link prediction subsystem predicts the missing entity and relation in the target triplet based on the multi-modal embedded representation of the entity. The invention uses a novel mechanism for extracting structural information from pictures: an efficient graph alignment mechanism introduces the structured information contained in pictures, realizes effective aggregation of the target entity's neighborhood topology, lays the foundation for the subsequent efficient fusion of multi-modal information, and can efficiently learn multi-modal knowledge graph embedded representations.

Description

Multi-modal knowledge graph representation learning system and product based on sub-graph learning
Technical Field
The invention relates to the technical field of representation learning, and in particular to a multi-modal knowledge graph representation learning system, and product, based on sub-graph learning.
Background
Representation learning has become a valuable method for learning from relational data, but how to encode increasingly rich multi-modal information has become a significant challenge. Since most real-world knowledge can be presented in the form of a graph, transforming this raw knowledge into low-dimensional vectors that preserve the intrinsic properties of the graph, through graph representation learning techniques, enables us to discover deeper relationships within complex knowledge. Accurately and efficiently representing the multi-modal knowledge graphs that exist in the real world is therefore an important problem.
Knowledge graph representation learning is now widely applied in fields such as information retrieval, recommendation systems, and knowledge-based question answering. However, representing multi-modal knowledge graphs still exceeds the capabilities of most existing methods. In practical applications, the rich information in pictures can often compensate for the deficiencies of single-modality knowledge. Yet even when only part of the picture information needs to be learned into the embedded representation, existing knowledge graph representation learning models still fail to extract the more beneficial information from pictures, and therefore cannot effectively handle multi-modal knowledge graph application scenarios.
Disclosure of Invention
The invention provides a multi-modal knowledge graph representation learning system, and product, based on sub-graph learning, to solve the problem that existing knowledge graph representation learning models cannot effectively extract beneficial information from picture information and therefore cannot be applied effectively in multi-modal knowledge graph scenarios.
In a first aspect, an embodiment of the present invention provides a multi-modal knowledge graph representation learning system based on sub-graph learning, including:
a multi-modal subgraph construction subsystem, which extracts the multi-modal structural information of the head entity in a target triplet to obtain a multi-modal subgraph, where the multi-modal subgraph comprises a visual scene graph and the self-centering graph of the head entity;
a neighborhood feature aggregation subsystem, which fuses multiple channel feature components of the entity features to obtain a multi-modal embedded representation of the entity;
and a link prediction subsystem, which predicts the missing entity and relation in the target triplet based on the multi-modal embedded representation of the entity.
Based on the first aspect, extracting the multi-modal structural information of the head entity in the target triplet to obtain a multi-modal subgraph includes:
inputting the head entity's picture, extracting visual objects with Faster R-CNN, and constructing a visual scene graph;
inputting the target triplet (h, r, t), searching the M-hop neighbor nodes centered on the head entity h, and constructing the self-centering graph of the head entity.
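As an illustrative sketch (not the patent's implementation), the two construction steps above can be mocked in plain Python, assuming object detections and candidate relations have already been produced by a detector such as Faster R-CNN; all function names and data layouts here are hypothetical:

```python
from collections import deque

def build_visual_scene_graph(detections, relations):
    """detections: list of (label, confidence); relations: list of (i, j, predicate).
    Detected objects become nodes, candidate relations become edges."""
    return {"nodes": list(enumerate(detections)), "edges": list(relations)}

def build_ego_graph(kg_edges, head, max_hops):
    """BFS the knowledge-graph triplets (h, r, t) up to max_hops around `head`,
    returning the node set and the edges among those nodes (the ego graph)."""
    adj = {}
    for h, r, t in kg_edges:
        adj.setdefault(h, []).append(t)
        adj.setdefault(t, []).append(h)  # treat edges as undirected for hop search
    seen, frontier = {head}, deque([(head, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nb in adj.get(node, []):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, depth + 1))
    edges = [(h, r, t) for h, r, t in kg_edges if h in seen and t in seen]
    return seen, edges
```

In the full system the search would additionally stop once the node count reaches the number of detected visual objects, as described in the detailed description.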
Based on the first aspect, the neighborhood feature aggregation subsystem includes:
a feature decoupling module, which maps the multi-modal subgraph onto different modality decoding channels based on a feature decoupling layer to obtain an initial embedded space vector representation of the head entity, projects the visual encoding information into the representation space of the structured information, and performs linear fusion to obtain a preliminary multi-modal entity feature embedding;
a neighborhood feature learning module, which aligns the multi-modal subgraph using a graph alignment mechanism to obtain a graph alignment weight matrix guided by visual information;
and a multi-modal information fusion module, which fuses the multi-modal entity feature embeddings based on the graph alignment weight matrix to obtain the multi-modal embedded representation of the entity.
Based on the first aspect, the initial embedded space vector representation includes a visual information feature representation and a structured information feature representation;
mapping the multi-modal subgraph onto different modality decoding channels based on the feature decoupling layer to obtain the initial embedded space vector representation of the head entity includes:
encoding the visual information in the visual scene graph with a first encoder to obtain the visual information feature representation;
and encoding the structured information in the self-centering graph of the head entity with a second encoder to obtain the structured information feature representation.
Based on the first aspect, aligning the multi-modal subgraph using a graph alignment mechanism to obtain a graph alignment weight matrix guided by visual information includes:
obtaining the node embedded representations of any two graphs in the multi-modal subgraph, and computing the similarity value between every pair of nodes to obtain a similarity matrix;
using greedy soft alignment to compute the alignment scores between the visual scene graph and the self-centering graph in the multi-modal subgraph;
and constructing the graph alignment weight matrix based on the alignment scores.
Based on the first aspect, predicting the missing entity and relation in the target triplet based on the multi-modal embedded representation of the entity includes:
constructing a score function for the link prediction task;
adding a network parameter regularization term and a multi-modal embedding regularization term to the score function to construct an overall score function;
and predicting the missing entity and relation in the target triplet based on the overall loss function, and optimizing the model.
Based on the first aspect, the overall score function is defined as follows:
φ(h, r, t) = (1 − λ₂)·φ_s(h, r, t) + λ₂·φ_m(h, r, t)
where φ(h, r, t) is the weighted sum of the per-modality embedded scores, with the subscripts m and s denoting the multi-modal modality and the single-graph structured-information modality respectively; φ_m(h, r, t) is the score of the triplet computed from the multi-modal embedded representation; β is a hyperparameter used to balance the gains from diversity and relevance; the set of C negative-sample triplets is obtained by randomly replacing the head entity or the tail entity in the input target triplet with another entity, or the relation with another relation; (h, r, t) is the target triplet and (h′, r′, t′) a prediction triplet.
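The score combination and the negative-sampling set can be sketched as follows; this is a hedged illustration in which `phi_s` and `phi_m` stand for the per-modality triplet scores, whose concrete forms depend on the chosen encoders:

```python
import random

def overall_score(phi_s, phi_m, lam2):
    """Weighted sum of the structured-modality score and the multi-modal score."""
    return (1 - lam2) * phi_s + lam2 * phi_m

def negative_samples(triplet, entities, c, rng=random):
    """Build C corrupted triplets by randomly replacing the head or tail entity."""
    h, r, t = triplet
    out = []
    for _ in range(c):
        e = rng.choice([x for x in entities if x not in (h, t)])
        out.append((e, r, t) if rng.random() < 0.5 else (h, r, e))
    return out
```

A full loss would score each corrupted triplet against the target triplet and add the β-weighted regularization terms described above.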
In a second aspect, an embodiment of the present invention provides a multi-modal knowledge graph representation learning method based on sub-graph learning, used in the multi-modal knowledge graph representation learning system based on sub-graph learning according to any one of the first aspect, the method comprising:
selecting a target triplet head entity picture from the head entity picture set and one untraversed target triplet from the multi-modal knowledge graph triplet set, inputting them into the multi-modal subgraph construction subsystem, and outputting the visual scene graph and the head entity's self-centering graph;
inputting the visual scene graph and the head entity's self-centering graph into the neighborhood feature aggregation subsystem, and outputting the multi-modal embedded representation of the entity;
inputting the multi-modal embedded representation of the target triplet into the link prediction subsystem, and outputting the predicted entity or relation;
repeating the learning process until every target triplet in the multi-modal knowledge graph triplet set has been traversed.
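The four method steps amount to a single traversal loop over the triplet set; a hypothetical driver, with all three subsystems passed in as stubbed callables, might look like:

```python
def learn_representations(triplet_set, picture_set, build_subgraphs, aggregate, predict):
    """Traverse every target triplet once, running the three subsystems in order:
    subgraph construction -> neighborhood feature aggregation -> link prediction."""
    predictions = {}
    for triplet in triplet_set:
        head = triplet[0]
        scene_graph, ego_graph = build_subgraphs(picture_set[head], triplet)
        embedding = aggregate(scene_graph, ego_graph)
        predictions[triplet] = predict(embedding)
    return predictions
```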
In a third aspect, an embodiment of the present invention provides an electronic device, including:
a memory for storing one or more programs;
a processor;
when the one or more programs are executed by the processor, the multi-modal knowledge graph representation learning system based on sub-graph learning according to any one of the above first aspects is implemented.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the multi-modal knowledge graph representation learning system based on sub-graph learning of any one of the above first aspects.
The invention provides at least the following advantages:
(1) The invention uses a novel mechanism for extracting structural information from pictures: an efficient graph alignment mechanism introduces the structured information contained in pictures, realizes effective aggregation of the target entity's neighborhood topology, lays the foundation for the subsequent efficient fusion of multi-modal information, and can efficiently learn multi-modal knowledge graph embedded representations;
(2) The invention uses a feature decoupling layer to decompose and map the multi-modal features of an entity into multiple embedding spaces, and projects the visual modality features into the graph structure feature space. Feature aggregation over all self-centering graphs of the entity is then realized through the aligned multi-modal-aware attention weights, while the multi-modal features of the target entity are updated accordingly. This improves the stability and representational power of the embedded representation, fully accounts for the complex associations implied at the feature level, and further improves the representational power of the embedding;
(3) The method first condenses the most important neighborhood topology for each target entity node, and propagates and aggregates neighborhood information on that basis. This improves the method's flexibility in scenarios where only a few entities have multi-modal information yet embedded representations must be learned, so the method can be applied flexibly wherever multi-modal knowledge graph embedded representations are required.
Drawings
FIG. 1 is a diagram showing a multi-modal knowledge graph representation learning system based on sub-graph learning in an embodiment of the invention;
FIG. 2 is a schematic flow chart of a multi-modal sub-graph construction subsystem according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a multi-modal knowledge graph representation learning system based on sub-graph learning in an embodiment of the invention;
FIG. 4 is a block diagram of a neighborhood feature aggregation sub-system according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart of a neighborhood feature aggregation subsystem according to an embodiment of the present invention;
FIG. 6 is a schematic flow chart of a link prediction subsystem according to an embodiment of the present invention;
FIG. 7 is a flowchart of a multi-modal knowledge graph representation learning method based on sub-graph learning in an embodiment of the invention;
fig. 8 is a schematic block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In practical applications, the rich information in pictures can compensate for the deficiencies of single-modality knowledge. However, even when only part of the picture information needs to be learned into the embedded representation, existing knowledge graph representation learning models still fail to extract the more beneficial information from pictures, and therefore cannot effectively handle multi-modal knowledge graph application scenarios. To overcome these defects, the invention provides a multi-modal knowledge graph representation learning system based on sub-graph learning. On the one hand, the system uses a graph alignment mechanism to efficiently align the visual scene graph with the entity's self-centering graph and to introduce the structural information of pictures into representation learning, effectively solving the difficulty of fusing and learning visual modality information in multi-modal knowledge graphs. On the other hand, for multi-modal knowledge graph application scenarios, the method does not require every entity to have a picture, so it adapts well to the diverse knowledge graphs found in the real world.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
In a first aspect, an embodiment of the present invention provides a multi-modal knowledge graph representation learning system based on sub-graph learning. Referring to fig. 1, fig. 1 shows the multi-modal knowledge graph representation learning system based on sub-graph learning in an embodiment of the invention, including:
a multi-modal subgraph construction subsystem 110, which extracts the multi-modal structural information of the head entity in a target triplet to obtain a multi-modal subgraph, where the multi-modal subgraph comprises a visual scene graph and the self-centering graph of the head entity;
a neighborhood feature aggregation subsystem 120, which fuses multiple channel feature components of the entity features to obtain a multi-modal embedded representation of the entity;
and a link prediction subsystem 130, which predicts the missing entity and relation in the target triplet based on the multi-modal embedded representation of the entity.
Specifically, in the multi-modal subgraph construction subsystem 110, for the target triplet, the M-hop neighbor nodes are searched with the head entity as the ego entity, and the resulting nodes are selected to construct the corresponding self-centering graph; for the picture corresponding to the head entity, object detection is performed with Faster R-CNN, and the relations among the detected objects with the highest confidence are selected to construct the scene graph. In the neighborhood feature aggregation subsystem 120, to reveal the complex interactions of the multi-modal factors, a decoupling layer first decomposes and maps each node's features into multiple feature modalities, and the embedded features of the visual modality are then mapped into the graph-structured feature vector space; further, a graph alignment mechanism aligns the visual scene graph with the head entity's self-centering graph to obtain the graph alignment weight matrix; feature aggregation is then realized through the channel-aware attention normalized by the entity's alignment weights over all self-centering graphs, yielding the multi-modal embedded representation of the entity. In the link prediction subsystem 130, the multi-modal entities and relations in the target triplet are represented, and the missing entities or relations are ranked by a score function to obtain the final prediction result. In addition, on top of the score function constructed for each specific modality embedding, a network parameter regularization term and a regularization term applied to the modality-aware embeddings are added, forcing the different modality embedded representations to retain a degree of independence.
In the multi-modal subgraph construction subsystem, extracting the multi-modal structural information of the head entity in the target triplet to obtain the multi-modal subgraph includes:
inputting the head entity's picture, extracting visual objects with Faster R-CNN, and constructing a visual scene graph;
inputting the target triplet (h, r, t), searching the M-hop neighbor nodes centered on the head entity, and constructing the self-centering graph of the head entity.
The input to the multi-modal subgraph construction subsystem is the multi-modal knowledge graph triplet set and the entity picture set, but each traversal selects only one target-triplet head-entity picture from the entity picture set and one target triplet from the multi-modal knowledge graph triplet set for learning. In a triplet, h denotes the head entity, r the relation, and t the tail entity; each triplet expresses the relation from head entity h to tail entity t, and the vector representations of h, r, and t are continually adjusted to make h + r as close as possible to t, i.e. h + r ≈ t. Illustratively, the input target triplet is: head entity Stephen Curry, relation won, tail entity NBA championship; during learning, the vector representations of the head entity (Stephen Curry), the relation (won), and the tail entity (NBA championship) are continually adjusted so that the obtained prediction is, as far as possible, that Stephen Curry won the NBA championship.
Referring to fig. 2, fig. 2 is a schematic flow chart of the multi-modal subgraph construction subsystem according to an embodiment of the invention. Specifically, for the input head entity picture, object detection is performed with Faster R-CNN to obtain the set V_I of the m visual objects with the highest confidence, and the set E_I of relations that may exist among the m visual objects is derived from their positions in the picture and their semantic information; taking the visual object set V_I as the node set and the relation set E_I as the edge set, the visual scene graph G_I is constructed. In a knowledge graph, the self-centering graph of the head entity represents the head entity's local information well, so the self-centering graph is chosen as the structured-information subgraph of the system. For the input target triplet (h, r, t), with the head entity h as the ego entity, the M-hop neighbor entities around h are searched and recorded into sets n_1, ..., n_M according to their hop distance from the center entity; the search stops once n = n_1 + ... + n_M is greater than or equal to the number of visual objects m. The set of these n nodes is denoted V_S and their set of relations in the knowledge graph E_S; the graph constructed from these nodes and edges is called the self-centering graph of head entity h, denoted G_S.
Referring to fig. 3, fig. 3 is a schematic diagram of the multi-modal knowledge graph representation learning system based on sub-graph learning according to an embodiment of the invention. The picture of Stephen Curry in fig. 3 is input into the multi-modal subgraph construction subsystem and detected with Faster R-CNN; the 5 visual objects with the highest confidence (e.g. man, trophy, basketball, logo, clothing) are derived from their positions and semantic information in the picture, together with the set of relations that may exist among these visual objects (e.g. man holds trophy, basketball in front of man, logo on man, man wearing clothing); taking the visual object set as the node set and the relation set as the edge set, the visual scene graph is constructed. With Stephen Curry in the target triplet (e.g. head entity Stephen Curry, relation won, tail entity NBA championship) as the ego entity, the M-hop neighbor entities around him are searched and recorded into a set according to their hop distance from the ego entity (e.g. Stephen Curry's team is the Golden State Warriors, the Golden State Warriors won the NBA championship, Stephen Curry is a basketball player, etc.); the node-wise search stops once the number of nodes reaches the number of visual objects. Combining this node set with their relation set in the multi-modal knowledge graph, the graph constructed from these nodes and edges is the self-centering graph of the head entity (Stephen Curry).
The visual scene graph and the head entity's self-centering graph are then aligned with the graph alignment mechanism and input into the link prediction subsystem, yielding the relation prediction result that Stephen Curry won the NBA championship.
Referring to fig. 4, fig. 4 is a schematic diagram of a neighborhood feature aggregation subsystem according to an embodiment of the present invention, where the neighborhood feature aggregation subsystem includes:
a feature decoupling module 210, which maps the multi-modal subgraph onto different modality decoding channels based on the feature decoupling layer to obtain an initial embedded space vector representation of the head entity, projects the visual encoding information into the representation space of the structured information, and performs linear fusion to obtain a preliminary multi-modal entity feature embedding;
a neighborhood feature learning module 220, which aligns the multi-modal subgraph using a graph alignment mechanism to obtain a graph alignment weight matrix guided by visual information;
and a multi-modal information fusion module 230, which fuses the multi-modal entity feature embeddings based on the graph alignment weight matrix to obtain the multi-modal embedded representation of the entity.
The neighborhood feature aggregation subsystem updates the embedded features of the knowledge: it decouples the entity's multi-modal features into multiple embedded feature components, propagates and aggregates neighborhood information using the weights produced by aligning the multi-modal subgraph with the graph alignment mechanism, and performs fusion weighting and feature updates on the multiple channel feature components of the entity features to obtain the multi-modal embedded feature representation, improving the entity's embedded representational power for the subsequent link prediction.
Specifically, in the feature decoupling module 210, the initial embedded space vector representation includes a visual information feature representation and a structured information feature representation;
mapping the multi-modal subgraph onto different modality decoding channels based on the feature decoupling layer to obtain the initial embedded space vector representation of the head entity includes:
encoding the visual information in the visual scene graph with a first encoder to obtain the visual information feature representation;
and encoding the structured information in the self-centering graph of the head entity with a second encoder to obtain the structured information feature representation.
In this embodiment, the first encoder is ViT and the second encoder is ComplEx.
Referring to fig. 5, fig. 5 is a schematic flow chart of the neighborhood feature aggregation subsystem according to an embodiment of the present invention. Specifically, in the feature decoupling module, the complex interactions within the entity's multi-modal features are decoupled; a feature decoupling layer maps the multi-modal subgraph onto different modality decoding channels, giving the entity's initial embedded space vector representation. The visual information is encoded by the first encoder ViT, and the resulting feature is denoted e_i; ViT represents information more accurately than traditional CNN methods because it is a large model pre-trained on data from many domains and then fine-tuned on the image dataset of the specific domain, so it represents the specific information of a picture more accurately. The structured information is encoded by the second encoder ComplEx to give the initialized definition e_s; because this semantic-matching-based method uses the inner product as the mathematical form of its final score function, it distinguishes the information of 1-N and N-N relation triplets more markedly than translation-based models, and it models the structural information in the knowledge graph better than text-image pre-training approaches. From these initialized definitions, the visual encoding is projected into the representation space of the structured information, and linear fusion yields the preliminary multi-modal entity feature embedding e_is = (1 − λ₁)·e_s + λ₁·W_l·e_i, where W_l is a linear projection matrix and λ₁ is a linear fusion weighting factor that adjusts the information contribution between the two modalities.
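The linear fusion step e_is = (1 − λ₁)·e_s + λ₁·W_l·e_i is a one-liner in numpy; shapes are assumed here, with `e_s` in the structured space, `e_i` in the visual space, and `w_l` the projection between them:

```python
import numpy as np

def fuse_modalities(e_s, e_i, w_l, lam1):
    """Preliminary multi-modal embedding e_is = (1 - λ1)·e_s + λ1·(W_l @ e_i):
    project the visual encoding into the structured space, then blend linearly."""
    return (1 - lam1) * e_s + lam1 * (w_l @ e_i)
```

Setting λ₁ = 0 keeps only the structured embedding, which matches the system's ability to handle entities that have no picture.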
In the neighborhood feature learning module 320, the aligning the multi-modal subgraphs based on the graph alignment mechanism obtains a graph pair Ji Quan matrix guided by visual information, including:
Obtaining node embedded characterizations of any two graphs in the multi-modal subgraph, and calculating similarity values between every two nodes to obtain a similarity matrix;
soft alignment of a greedy algorithm is adopted to calculate alignment scores of the visual scene graph and the self-centering graph in the multi-modal subgraph;
based on the alignment scores, constructing a graph alignment weight matrix.
In the neighborhood feature learning module, a graph alignment mechanism is adopted to learn neighborhood features. The visual scene graph G_I and the head-entity self-centering graph G_S of the multi-modal subgraph obtained by the multi-modal sub-graph construction subsystem are aligned to obtain a visual-information-guided graph alignment weight matrix, mainly comprising the following steps:
acquiring node embedded characterizations of the two graphs;
soft-aligning the visual scene graph G_I and the head-entity self-centering graph G_S with a greedy algorithm.
The node embedded characterizations of the two graphs are acquired as follows. For each node v ∈ V = V_I ∪ V_S, the in-degrees and out-degrees of its neighborhood are first aggregated as d_v = Σ_{k=1}^{K} δ^{k−1} d_v^{(k)}, where d_v^{(k)} denotes the degree of the k-hop neighbors of v, K denotes the diameter of the graph, and δ ∈ (0,1] is an importance factor weighting the measure of node degree at each hop. Then the similarity value between every two nodes is calculated as s(a,b) = exp(−γ_S·‖d_a − d_b‖²) to obtain a similarity matrix, where γ_S is a scalar parameter controlling the effect of the structured information, and d_a and d_b are the aggregated in- and out-degrees of node a and node b. By selecting p ≤ m+n landmark nodes in the visual scene graph G_I and the head-entity self-centering graph G_S and computing their similarity with all m+n nodes, a matrix C is obtained, and the similarity values between the landmark nodes are extracted from C to obtain a matrix W. The full similarity matrix S is then approximately decomposed as S ≈ YZ^T, yielding the similarity-embedded representation matrix of the nodes Y = CUΣ^{1/2}, where UΣV^T is the full-rank singular value decomposition of the generalized inverse of matrix W. The decomposition yields the embedded representations Y_S and Y_I of the nodes of the two graphs.
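Under the assumption that the decomposition follows the landmark-based (Nyström-style) scheme reconstructed above, the pairwise similarity and the embedded representation Y = CUΣ^{1/2} can be sketched as follows; the degree-aggregation step is assumed already done, so the input is a matrix of per-node degree features:

```python
import numpy as np

def node_similarity(D, gamma_s=1.0):
    """Pairwise similarity s(a, b) = exp(-gamma_s * ||d_a - d_b||^2)
    over aggregated degree features D (one row per node)."""
    diff = D[:, None, :] - D[None, :, :]
    return np.exp(-gamma_s * (diff ** 2).sum(axis=-1))

def landmark_embedding(S, landmarks):
    """Low-rank node embedding from p landmark columns of S:
    C = S[:, landmarks], W = landmark-landmark block of C,
    pinv(W) = U Sigma V^T (full-rank SVD), Y = C U Sigma^{1/2},
    so that Y @ Y.T approximates S without forming its full factorization."""
    C = S[:, landmarks]
    W = C[landmarks, :]
    U, sig, _ = np.linalg.svd(np.linalg.pinv(W))
    return C @ U @ np.diag(np.sqrt(sig))
```

When every node is chosen as a landmark the approximation Y Y^T recovers S exactly (for invertible S); choosing p ≪ m+n landmarks trades accuracy for speed.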
The greedy soft alignment of the visual scene graph G_I and the head-entity self-centering graph G_S is performed as follows. Considering the time complexity, a soft alignment method is adopted to align the similarity between nodes: it is not necessary to match each node in G_I against every node in G_S and then select the matching node pair with the highest similarity score; only the a most likely matching nodes are used in the calculation. Therefore Y_S is stored in a K-D tree, and for each node in G_I the a ≤ n most similar nodes of G_S are quickly selected through a nearest-neighbor algorithm to calculate the corresponding similarity scores. The alignment scores of G_I and G_S are then calculated as a_is = exp(−‖Y_i − Y_s‖), finally composing the graph alignment weight matrix A, where i ∈ V_I, s ∈ V_S, and a_is represents the similarity score between the embedding vectors of the corresponding two nodes i and s.
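A minimal sketch of the greedy soft alignment follows. A brute-force distance computation stands in for the K-D tree query, and the score a_is = exp(−‖Y_i − Y_s‖) is an assumption consistent with the exponential similarity used above:

```python
import numpy as np

def soft_align(Y_I, Y_S, a=3):
    """Greedy soft alignment: for each node i of G_I keep only the a
    most similar nodes s of G_S, with weight a_is = exp(-||Y_i - Y_s||).
    A K-D tree query can replace the brute-force distance matrix below."""
    d = np.linalg.norm(Y_I[:, None, :] - Y_S[None, :, :], axis=-1)
    A = np.zeros_like(d)
    k = min(a, Y_S.shape[0])
    idx = np.argsort(d, axis=1)[:, :k]          # a nearest neighbours per row
    rows = np.arange(Y_I.shape[0])[:, None]
    A[rows, idx] = np.exp(-d[rows, idx])        # sparse alignment weights
    return A
```

Each row of the resulting matrix has at most a nonzero entries, which keeps the subsequent fusion step linear in a rather than in |V_S|.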
In the multi-modal information fusion module 330, multi-modal information fusion is performed. After the visual-information-guided graph alignment weight matrix is obtained, the neighborhood multi-modal feature information needs to be fused. Based on the obtained alignment weights, each entity in the visual scene graph G_I corresponds to a head entity in the self-centering graph G_S, so the multi-modal entity feature embeddings of the neighborhood can be deeply fused, and the resulting multi-modal embedding of the entity is expressed as e_m = Σ N(a_is)·e_is, where N is a normalization function indicating that a_is is normalized, and e_is is the preliminary multi-modal entity feature embedding.
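The normalized weighted fusion e_m = Σ N(a_is)·e_is can be sketched as a row-normalized matrix product, assuming one preliminary embedding per G_S node:

```python
import numpy as np

def fuse_neighborhood(A, E):
    """Multi-modal embedding e_m = sum_s N(a_is) * e_is: each row of the
    alignment weight matrix A is normalized to sum to 1 and used to
    average the preliminary multi-modal embeddings E (row per G_S node)."""
    N = A / A.sum(axis=1, keepdims=True).clip(min=1e-12)
    return N @ E
```

Because each weight row is normalized, every fused embedding is a convex combination of neighborhood embeddings, so its scale stays comparable to the inputs.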
Illustratively, the visual scene graph and self-centering graph of the head entity Stephen Curry obtained by the multi-modal sub-graph construction subsystem are input to the neighborhood feature aggregation subsystem. Through the feature decoupling module, the visual information in the visual scene graph is encoded by ViT to obtain a feature representation, the structured information in the self-centering graph is encoded by ComplEx to obtain a feature representation, and the two are linearly fused to obtain a multi-modal entity embedded representation; neighborhood features are then learned based on the graph alignment mechanism, and a visual-information-guided graph alignment weight matrix is obtained for the visual scene graph and the self-centering graph; finally, the multi-modal entity feature embeddings are fused based on the graph alignment weight matrix to obtain the multi-modal embedded representation of the entity.
In the link prediction subsystem 130, predicting missing entities and relationships in a target triplet based on the multimodal embedded representation of the entities, includes:
constructing a scoring function of the link prediction task;
adding a network parameter regular term and a multi-modal embedded regular term into the scoring function to construct an overall scoring function;
and predicting missing entities and relations in the target triples based on the overall scoring function, and optimizing the model.
Wherein the overall scoring function is defined by the following components:
L = (1−λ₂)φ_s + λ₂φ_m is the weighted summation of the multi-modal embedded scores, where the subscripts m and s represent the multi-modal modality and the single-graph structured-information modality respectively; φ(h,r,t) is the score calculated for a triplet based on its multi-modal embedded representation; β is a hyper-parameter realizing gain adjustment for the balance between diversity and relevance; the negative-sampling triplet set of the triple set C is obtained by randomly replacing the head entity or tail entity in an input target triplet with another entity, or randomly replacing the relation with another relation; (h, r, t) is the target triplet and (h', r', t') is a prediction triplet.
The link prediction subsystem is used for realizing prediction of missing entities or relations based on the multi-mode embedded representation of the entities output in the neighborhood feature aggregation subsystem. Referring to fig. 6, fig. 6 is a flow chart of a link prediction subsystem according to an embodiment of the invention. Specifically, the processing procedure of the link prediction subsystem after obtaining the multi-modal embedded representation of the target triplet is as follows:
The scoring function is defined first. Since the task the system focuses on is predicting the missing entity and relationship of target triples, the objective centers on the link prediction task. The multi-modal knowledge graph representation learning system based on sub-graph learning provided by the embodiment of the invention defines the scoring function of the link prediction task as follows: the multi-modal embedded regular term added to the scoring function is L = (1−λ₂)φ_s + λ₂φ_m, the weighted summation of the multi-modal embedded scores, where subscripts m and s represent the multi-modal modality and the single-graph structured-information modality respectively, and φ(h,r,t) represents the score calculated for a triplet based on the embedded representation obtained in the previous step; the network regular term β added to the scoring function is a hyper-parameter used to realize gain adjustment for balancing diversity and relevance; and the negative-sampling triplet set of the triple set C is obtained by randomly replacing the head entity or tail entity in an input target triplet with another entity, or randomly replacing the relation with another relation.
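A sketch of the weighted triplet scoring follows, assuming a ComplEx-style inner-product score Re(⟨h, r, conj(t)⟩) for both φ_s and φ_m; the regular terms and negative-sampling loop of the overall function are omitted:

```python
import numpy as np

def complex_score(h, r, t):
    """ComplEx semantic-matching score Re(<h, r, conj(t)>) for complex
    embedding vectors h, r, t of equal length."""
    return float(np.real(np.sum(h * r * np.conj(t))))

def multimodal_score(h_s, r, t_s, h_m, t_m, lam2=0.3):
    """Weighted sum L = (1 - lam2)*phi_s + lam2*phi_m of the structural
    score phi_s and the multi-modal score phi_m for one triplet."""
    phi_s = complex_score(h_s, r, t_s)
    phi_m = complex_score(h_m, r, t_m)
    return (1.0 - lam2) * phi_s + lam2 * phi_m
```

Ranking candidate triples by this score, higher scores indicate more plausible completions of the missing entity or relation.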
Illustratively, each target triplet in the multi-modal knowledge graph triplet set (e.g., Stephen Curry's team the Golden State Warriors winning the NBA championship, Stephen Curry being a basketball player, etc.) is traversed, a score is obtained for each target triplet, and sorting based on the scores of the target triples yields the final prediction that Stephen Curry wins the NBA championship.
In the implementation process, the invention provides a multi-modal knowledge graph representation learning system based on sub-graph learning, wherein the multi-modal subgraph comprises a visual scene graph and a head-entity self-centering graph; the neighborhood feature aggregation subsystem fuses a plurality of channel feature components of the entity features to obtain a multi-modal embedded representation of the entity; and the link prediction subsystem predicts the missing entities and relationships in the target triplet based on the multi-modal embedded representation of the entities. The system uses a novel picture structure information extraction mechanism and adopts an efficient graph alignment mechanism to introduce the structured information in pictures, realizing effective aggregation of target-entity neighborhood topology information, laying a foundation for the efficient fusion of subsequent multi-modal information, and enabling efficient learning of multi-modal knowledge graph embedded representations. A feature decoupling layer is adopted to decompose and map the multi-modal features of an entity to a plurality of embedding spaces, the visual modality features are projected into the graph-structure feature space, feature aggregation over the entities in each self-centering graph is realized through the aligned multi-modal perception attention weights, and the multi-modal features of target entities are updated correspondingly, which improves the stability and characterization capacity of the embedded representation; the complex association relationships implied at the feature level are fully considered, further improving the characterization capacity of the embedded representation. By concentrating and condensing the most important neighborhood topology for each target entity node and performing neighborhood information propagation and aggregation on that basis, the flexibility of applying the method in scenes where only a small number of entities have multi-modal information and need to learn embedded representations is improved, so the method can be flexibly applied wherever multi-modal knowledge graph embedded representation is required.
Based on the same inventive concept as the first aspect, the invention further provides a multi-modal knowledge graph representation learning method based on sub-graph learning, wherein the multi-modal knowledge graph representation learning method based on sub-graph learning is used for the multi-modal knowledge graph representation learning system based on sub-graph learning of the first aspect. Referring to fig. 7, fig. 7 is a flowchart of a multi-modal knowledge graph representation learning method based on sub-graph learning in an embodiment of the invention, including:
selecting a head-entity picture from the head-entity picture set of a target triplet and a non-traversed target triplet from the multi-modal knowledge graph triplet set, inputting them into the multi-modal sub-graph construction subsystem, and outputting a visual scene graph and a head-entity self-centering graph;
inputting the visual scene graph and the head entity self-centering graph into a neighborhood feature aggregation subsystem, and outputting a multi-mode embedded representation of an entity;
inputting the multi-modal embedded representation of the entity into the link prediction subsystem, and outputting a prediction result of the entity or the relation;
repeating the learning process until each target triplet in the multi-modal knowledge-graph triplet set is traversed.
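The traversal described in these steps can be sketched as the following loop; build_subgraphs, aggregate and predict are hypothetical callables standing in for the three subsystems of the system described above:

```python
def learn_multimodal_kg(triples, pictures, build_subgraphs, aggregate, predict):
    """Traversal loop over the target triples: each triple and a
    head-entity picture are turned into a visual scene graph and a
    self-centering graph, aggregated into a multi-modal embedding,
    and scored by the link prediction subsystem."""
    results = {}
    for (h, r, t) in triples:
        scene_g, ego_g = build_subgraphs(h, pictures[h], (h, r, t))
        e_m = aggregate(scene_g, ego_g)          # multi-modal embedding
        results[(h, r, t)] = predict(e_m)        # predicted entity/relation
    return results
```

The loop terminates once every target triplet has been visited, matching the "repeat until traversed" condition of the method.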
Illustratively, inputting the above multi-modal knowledge graph triplet set and the picture sets of the head entities into the multi-modal sub-graph construction subsystem generates a scene graph for each picture and a self-centering graph for each head entity. Assuming that Stephen Curry is a head entity with 40 pictures, the 40 pictures are taken as the picture set of the head entity; a target-triplet head-entity picture (shown in fig. 3) is selected from the picture set together with a non-traversed target triplet from the multi-modal knowledge graph triplet set (as shown in fig. 3, for example, the head entity is Stephen Curry, the relation is "wins", and the tail entity is "NBA championship"), and they are input into the multi-modal sub-graph construction subsystem to construct the visual scene graph and self-centering graph of Stephen Curry; the constructed visual scene graph and head-entity self-centering graph are input into the neighborhood feature aggregation subsystem, which aligns them with the graph alignment mechanism to obtain the multi-modal embedded representation of the entity; the multi-modal embedded representation of the entity is then input to the link prediction subsystem, and the scores calculated by the scoring function are ranked to yield the final prediction result (e.g., Stephen Curry wins the NBA championship). The process is repeated until every target triplet in the multi-modal knowledge graph triplet set has been traversed.
In the implementation process, the invention provides a multi-modal knowledge graph representation learning method based on sub-graph learning, which comprises: selecting a head-entity picture from the head-entity picture set of a target triplet and a non-traversed target triplet from the multi-modal knowledge graph triplet set, inputting them into the multi-modal sub-graph construction subsystem, and outputting a visual scene graph and a head-entity self-centering graph; inputting the visual scene graph and the head-entity self-centering graph into the neighborhood feature aggregation subsystem, and outputting the multi-modal embedded representation of the entity; inputting the multi-modal embedded representation of the entity into the link prediction subsystem, and outputting the prediction result of the entity or the relation; and repeating the learning process until each target triplet in the multi-modal knowledge graph triplet set is traversed. The method uses a novel picture structure information extraction mechanism and adopts an efficient graph alignment mechanism to introduce the structured information in pictures, realizing effective aggregation of target-entity neighborhood topology information, laying a foundation for the efficient fusion of subsequent multi-modal information, and enabling efficient learning of multi-modal knowledge graph embedded representations. A feature decoupling layer is adopted to decompose and map the multi-modal features of an entity to a plurality of embedding spaces, the visual modality features are projected into the graph-structure feature space, feature aggregation over the entities in each self-centering graph is realized through the aligned multi-modal perception attention weights, and the multi-modal features of target entities are updated correspondingly, which improves the stability and characterization capacity of the embedded representation; the complex association relationships implied at the feature level are fully considered, further improving the characterization capacity of the embedded representation. By concentrating and condensing the most important neighborhood topology for each target entity node and performing neighborhood information propagation and aggregation on that basis, the flexibility of applying the method in scenes where only a small number of entities have multi-modal information and need to learn embedded representations is improved, so the method can be flexibly applied wherever multi-modal knowledge graph embedded representation is required.
Referring to fig. 8, fig. 8 is a schematic block diagram of an electronic device according to an embodiment of the present invention. The electronic device comprises a memory 101, a processor 102 and a communication interface 103, wherein the memory 101, the processor 102 and the communication interface 103 are electrically connected with each other directly or indirectly to realize data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The memory 101 may be used to store software programs and modules, such as program instructions/modules corresponding to the multi-modal knowledge graph representation learning system based on sub-graph learning provided by embodiments of the present invention, and the processor 102 executes the software programs and modules stored in the memory 101 to perform various functional applications and data processing. The communication interface 103 may be used for communication of signaling or data with other node devices.
The Memory 101 may be, but is not limited to, a random access Memory (Random Access Memory, RAM), a Read Only Memory (ROM), a programmable Read Only Memory (Programmable Read-Only Memory, PROM), an erasable Read Only Memory (Erasable Programmable Read-Only Memory, EPROM), an electrically erasable Read Only Memory (Electric Erasable Programmable Read-Only Memory, EEPROM), etc.
The processor 102 may be an integrated circuit chip with signal processing capabilities. The processor 102 may be a general purpose processor including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; but also digital signal processors (Digital Signal Processing, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
It will be appreciated that the configuration shown in fig. 8 is merely illustrative, and that the electronic device may also include more or fewer components than shown in fig. 8, or have a different configuration than shown in fig. 8. The components shown in fig. 8 may be implemented in hardware, software, or a combination thereof.
In the embodiments provided in the present invention, it should be understood that the disclosed system and method may be implemented in other manners as well. The apparatus embodiments described above are merely illustrative, for example, of the flowcharts and block diagrams in the figures that illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present invention may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims (10)

1. A multi-modal knowledge graph representation learning system based on sub-graph learning, comprising:
the multi-mode sub-image construction subsystem extracts multi-mode structural information of a head entity in the target triplet to obtain a multi-mode sub-image, wherein the multi-mode sub-image comprises a visual scene image and a self-center image of the head entity;
the neighborhood feature aggregation subsystem fuses a plurality of channel feature components of the entity features to obtain a multi-mode embedded representation of the entity;
and the link prediction subsystem predicts the missing entity and relationship in the target triplet based on the multi-mode embedded representation of the entity.
2. The learning system of claim 1, wherein the extracting the multi-modal structure information of the head entity in the target triplet to obtain the multi-modal subgraph comprises:
inputting a head entity picture, extracting a visual object by using a fast R-CNN, and constructing a visual scene graph;
and inputting a target triplet (h, r, t), searching M-hop neighbor nodes with the head entity h as the center, and constructing a self-centering graph of the head entity.
3. The learning system of claim 1, wherein the neighborhood feature aggregation subsystem comprises:
the feature decoupling module is used for mapping the multi-mode subgraphs to different mode decoding channels based on the feature decoupling layer, obtaining initial embedded space vector representation of the head entity, projecting the visual coding information to the representation space of the structural information, and carrying out linear fusion to obtain preliminary multi-mode entity feature embedding;
the neighborhood feature learning module aligns the multi-modal subgraphs by adopting a graph alignment mechanism to obtain a visual-information-guided graph alignment weight matrix;
and the multi-modal information fusion module fuses the multi-modal entity feature embeddings based on the graph alignment weight matrix to obtain the multi-modal embedded representation of the entity.
4. A learning system according to claim 3, wherein the initial embedded spatial vector representation comprises a visual information feature representation and a structured information feature representation;
the feature-based decoupling layer maps the multi-modal subgraphs to different modal decoding channels to obtain an initial embedded spatial vector representation of the header entity, comprising:
coding visual information in the visual scene graph by adopting a first coder to obtain the visual information characteristic representation;
and adopting a second encoder to encode the structured information in the self-centering graph of the head entity to obtain the structured information feature representation.
5. The learning system of claim 3, wherein the aligning the multi-modal subgraphs using a graph alignment mechanism to obtain a visual-information-guided graph alignment weight matrix comprises:
obtaining node embedded characterizations of any two graphs in the multi-modal subgraph, and calculating similarity values between every two nodes to obtain a similarity matrix;
soft alignment of a greedy algorithm is adopted to calculate alignment scores of the visual scene graph and the self-centering graph in the multi-modal subgraph;
based on the alignment scores, a graph alignment weight matrix is constructed.
6. The learning system of claim 1 wherein predicting missing entities and relationships in a target triplet based on the multimodal embedded representation of the entities comprises:
constructing a scoring function of the link prediction task;
adding a network parameter regular term and a multi-modal embedded regular term into the scoring function to construct an overall scoring function;
and predicting the missing entity and relationship in the target triplet based on the overall scoring function, and optimizing the model.
7. The learning system of claim 6, wherein the overall scoring function is defined by the following components:
wherein L = (1−λ₂)φ_s + λ₂φ_m is the weighted summation of the multi-modal embedded scores, subscripts m and s representing the multi-modal modality and the single-graph structured-information modality respectively; φ(h,r,t) represents the score calculated for the target triplet based on the multi-modal embedded representation; β is a hyper-parameter realizing gain adjustment for the balance between diversity and relevance; the negative-sampling triplet set of the triple set C is obtained by randomly replacing the head entity or tail entity in the input target triplet with another entity, or randomly replacing the relation with another relation; (h, r, t) is the target triplet, and (h', r', t') is the prediction triplet.
8. A multi-modal knowledge graph representation learning method based on sub-graph learning, characterized in that the method is used in the multi-modal knowledge graph representation learning system based on sub-graph learning as claimed in any one of the preceding claims 1-7, comprising:
Selecting a head-entity picture from the head-entity picture set of a target triplet and a non-traversed target triplet from the multi-modal knowledge graph triplet set, inputting them into a multi-modal sub-graph construction subsystem, and outputting a visual scene graph and a head-entity self-centering graph;
inputting the visual scene graph and the head entity self-centering graph into a neighborhood feature aggregation subsystem, and outputting a multi-mode embedded representation of an entity;
inputting the multi-modal embedded representation of the entity into a link prediction subsystem, and outputting a prediction result of the entity or the relation;
repeating the learning process until each target triplet in the multi-modal knowledge-graph triplet set is traversed.
9. An electronic device, comprising:
a memory for storing one or more programs;
a processor;
implementing a multi-modal knowledge graph representation learning system based on sub-graph learning as claimed in any one of claims 1-7 when the one or more programs are executed by the processor.
10. A computer readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, implements a multi-modal knowledge graph representation learning system based on sub-graph learning as claimed in any one of claims 1-7.
CN202310416736.6A 2023-04-18 2023-04-18 Multi-mode knowledge graph representation learning system and product based on sub-graph learning Pending CN116467463A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310416736.6A CN116467463A (en) 2023-04-18 2023-04-18 Multi-mode knowledge graph representation learning system and product based on sub-graph learning


Publications (1)

Publication Number Publication Date
CN116467463A true CN116467463A (en) 2023-07-21

Family

ID=87173070

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310416736.6A Pending CN116467463A (en) 2023-04-18 2023-04-18 Multi-mode knowledge graph representation learning system and product based on sub-graph learning

Country Status (1)

Country Link
CN (1) CN116467463A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117149839A (en) * 2023-09-14 2023-12-01 中国科学院软件研究所 Cross-ecological software detection method and device for open source software supply chain
CN117149839B (en) * 2023-09-14 2024-04-16 中国科学院软件研究所 Cross-ecological software detection method and device for open source software supply chain


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination