WO2021179640A1 - Graph model-based short video recommendation method, intelligent terminal and storage medium - Google Patents

Graph model-based short video recommendation method, intelligent terminal and storage medium Download PDF

Info

Publication number
WO2021179640A1
Authority
WO
WIPO (PCT)
Prior art keywords
short video
user
information
vertex
aggregation
Prior art date
Application number
PCT/CN2020/125527
Other languages
French (fr)
Chinese (zh)
Inventor
王娜 (Wang Na)
刘兑 (Liu Dui)
Original Assignee
深圳大学 (Shenzhen University)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳大学 (Shenzhen University)
Publication of WO2021179640A1 publication Critical patent/WO2021179640A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70: Information retrieval of video data
    • G06F16/73: Querying
    • G06F16/735: Filtering based on additional data, e.g. user or group profiles
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9536: Search customisation based on social or collaborative filtering
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present invention relates to the technical field of information processing, in particular to a short video recommendation method based on a graph model, an intelligent terminal and a storage medium.
  • personalized recommendations serve as a bridge between service providers and users: they allow companies to effectively mine and use the useful information in massive amounts of data, discover users' interest preferences, improve the user experience and increase user stickiness, and thus increase revenue; for users, they make it possible to quickly find targets of interest in the platform's massive information database.
  • Personalized recommendations have become a core component of many online content sharing services, such as picture, blog, and music recommendations. For example, the recently emerging short video sharing platforms Kuaishou and Douyin have made short video recommendation methods attract increasing attention.
  • Different from single-modal media content such as images and music, short videos contain rich multimedia information: the video cover picture, the video background music, and the text description of the video, which constitute content in multiple modalities (visual, auditory, and textual). Integrating this multi-modal information with the historical interaction behavior between users and short videos helps capture user preferences more deeply.
  • the ideas behind collaborative filtering methods can be roughly divided into two types, both of which use the historical "user-video" interaction behavior to construct a "user-video" interaction matrix: recommend to the target user the items that similar users like (user-based collaborative filtering), or recommend items similar to the target user's favorite items (item-based collaborative filtering).
  • models based on collaborative filtering can make full use of the user's explicit feedback (likes, follows, comments, etc.) and implicit feedback (browsing records, dwell time, etc.) to predict the interaction between users and items, but they are easily restricted by the sparseness of the data, so the recommendation results have certain limitations.
  • the graph-based convolutional network method is used for recommendation.
  • a "user-video" bipartite graph is constructed based on the user's interactive behavior on items.
  • the attribute information of the target node's neighborhood set is aggregated in the bipartite graph to form the node's own high-order representation; information is transferred between nodes in this way, and finally the representation vectors of the user nodes and the video nodes are learned. By calculating the similarity between the user vector and the video vector, the probability of the user interacting with the short video is predicted.
  • methods based on graph convolutional networks convert the non-Euclidean behavior data of user interaction sequences into a bipartite graph structure, and use node neighborhood aggregation to transfer the attribute information of short videos between the nodes of the graph.
  • however, the currently proposed methods based on graph convolutional networks generally combine the multi-modal attribute information of short video nodes as a whole for computation and transfer, and lack consideration of the semantic gap between different modalities, i.e., the difference in the information they contain; as a result, the representation learning of users and short videos is not fine-grained enough.
  • Both the collaborative filtering method and the graph convolutional network method use the historical interaction behavior between users and videos (items), but in different forms: the former uses it to construct a "user-video" interaction matrix; the latter transforms it into a "user-video" bipartite graph.
  • the interaction matrix constructed by collaborative filtering can only use interaction behavior information (for example, it can only know that "User A clicked on video 1") and cannot use video attribute information (such as the video's visual, textual, auditory and other multi-modal information); the graph convolutional network is equivalent to an improvement of collaborative filtering in that it can use the attribute information of the video to learn the representation vectors of users and videos, but the multi-modal information of the video is generally input to the model as a whole for learning, without modeling each modality separately.
  • the present invention provides a short video recommendation method based on a graph model, an intelligent terminal and a storage medium.
  • a short video recommendation method based on a graph model wherein the short video recommendation method based on a graph model includes:
  • the aggregation layer outputs the high-order representation vector of the target vertex itself by aggregating the neighborhood information of the target vertex;
  • the integration layer integrates target node information with neighborhood information
  • the fusion layer fuses multiple modal information of the target vertex
  • the output layer calculates the degree of similarity between the user vector and the short video vector, predicts the probability that the user will interact with the short video, and recommends the short video for the user.
  • the interactive behavior is defined as a user watching a short video in full or performing a thumbs-up operation on the watched short video.
  • the constructing a bipartite graph of the corresponding relationship between the user and the short video according to the user's interactive behavior on the short video further includes:
  • the short video recommendation method based on the graph model, wherein the short video includes visual modal information, text modal information, and auditory modal information;
  • the visual modal information is represented by a 128-dimensional vector output from a video cover picture through a convolutional neural network
  • the text modal information is represented by a 128-dimensional vector outputted by word segmentation and natural language processing model vectorization of the video title text;
  • the auditory modal information is represented by a 128-dimensional vector after the background music and speech sounds of characters are truncated and passed through a convolutional neural network.
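  • As a rough sketch of this three-modality setup, each modality of a short video maps to a fixed 128-dimensional vector. The `encode` function below is a deterministic placeholder, not the CNN/NLP encoders the method actually uses; all names and inputs are illustrative assumptions:

```python
import numpy as np

D = 128  # the fixed per-modality embedding dimension from the description

def encode(raw_bytes, salt):
    # Placeholder encoder: stands in for the real models (a CNN for the
    # cover image and for the truncated audio, word segmentation plus an
    # NLP model for the title text); here it just yields a deterministic
    # 128-dimensional vector.
    rng = np.random.default_rng(salt + sum(raw_bytes))
    return rng.standard_normal(D)

visual = encode(b"cover.png", 0)    # visual modality
text = encode(b"video title", 1)    # text modality
audio = encode(b"bgm.wav", 2)       # auditory modality
```

Keeping all three modalities at the same dimension is what lets the later aggregation and fusion layers treat them uniformly.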
  • the aggregation layer is used to aggregate the neighborhood information of the target vertex to obtain a vector representing the target neighborhood; each aggregation operation is composed of neighborhood aggregation and non-linear processing.
  • the neighborhood aggregation is: performing an aggregation operation on the neighborhood of the target vertex through an aggregation function
  • the non-linear processing is: obtaining the first-order and second-order neighborhood information of the target vertex by the neighborhood aggregation operation, splicing the original information of the target vertex with its neighborhood information, and inputting the result into a single-layer neural network to obtain the high-order features of the target vertex.
  • the construction mode of the aggregation function includes: average aggregation, maximum pooling aggregation, and attention mechanism aggregation.
  • the integration layer is used to integrate input information from different sources in the same modality, integrating the low-level information and high-level information of the target vertex in a specific modality to obtain the representation vectors of the user vertices and short video vertices in different modalities;
  • the fusion layer is used to merge multiple modal representation vectors of the user vertex and the short video vertex.
  • An intelligent terminal, wherein the intelligent terminal includes the above-mentioned graph model-based short video recommendation system, and further includes: a memory, a processor, and a graph model-based short video recommendation program stored in the memory and capable of running on the processor, which implements the steps of the above-mentioned graph model-based short video recommendation method when executed by the processor.
  • A storage medium, wherein the storage medium stores a short video recommendation program based on a graph model; when the short video recommendation program based on a graph model is executed by a processor, the steps of the graph model-based short video recommendation method described above are implemented.
  • the present invention constructs a bipartite graph of the corresponding relationship between users and short videos according to the users' interactive behavior on the short videos; the aggregation layer outputs the high-order representation vector of the target vertex by aggregating the neighborhood information of the target vertex; the integration layer integrates the target node information with the neighborhood information; the fusion layer fuses the multiple modal information of the target vertex; the output layer calculates the similarity between the user vector and the short video vector, predicts the probability of the user interacting with the short video, and recommends short videos for the user.
  • the present invention constructs a bipartite graph and a corresponding graph convolution network for different modalities of short videos, learns vector representations of users and short video vertices in different modalities, and achieves the purpose of fine-grained personalized recommendation for users.
  • Fig. 1 is a flowchart of a preferred embodiment of a short video recommendation method based on a graph model of the present invention
  • FIG. 2 is a schematic diagram of the overall framework principle of the preferred embodiment of the short video recommendation method based on the graph model of the present invention
  • FIG. 3 is a schematic diagram of the bipartite graph model in the preferred embodiment of the short video recommendation method based on the graph model of the present invention
  • FIG. 4 is a schematic diagram of the construction of a "user-short video" interactive bipartite graph based on user interaction behavior in the preferred embodiment of the short video recommendation method based on the graph model of the present invention
  • FIG. 5 is a schematic diagram of the modal-level "user-short video" bipartite graph in the preferred embodiment of the short video recommendation method based on the graph model of the present invention
  • FIG. 6 is a schematic diagram of the aggregation layer in the preferred embodiment of the short video recommendation method based on the graph model of the present invention
  • FIG. 7 is a schematic diagram of the operating environment of the preferred embodiment of the smart terminal of the present invention.
  • the short video recommendation method based on a graph model includes the following steps:
  • Step S10 Construct a bipartite graph of the corresponding relationship between the user and the short video according to the user's interactive behavior on the short video;
  • Step S20 The aggregation layer outputs the high-level representation vector of the target vertex itself by aggregating the neighborhood information of the target vertex;
  • Step S30 the integration layer integrates the target node information with the neighborhood information
  • Step S40 The fusion layer fuses multiple modal information of the target vertex
  • Step S50 The output layer calculates the degree of similarity between the user vector and the short video vector, predicts the probability that the user will interact with the short video, and recommends the short video for the user.
  • the framework of the short video recommendation method based on the graph model of the present invention is composed of a bipartite graph (user-short video), an aggregation layer, an integration layer, a fusion layer and an output layer.
  • the bipartite graph is a special model in graph theory.
  • the vertex set V can be divided into two mutually disjoint subsets {A, B}
  • the two vertices i and j connected by any edge e_ij in the graph belong to these two different vertex sets (i ∈ A, j ∈ B)
  • the graph G is a bipartite graph
  • vertices i and j are first-order neighbors to each other.
  • a "user-short video” bipartite graph is constructed.
  • the vertices are divided into two subsets: the user vertex set and the short video vertex set. If a user has interacted with a short video (e.g., watched it completely or liked it), then there is an edge directly connecting the user vertex and the short video vertex in the "user-short video" bipartite graph.
  • the user's interaction history short video vertex set is the first-order neighborhood set of the user's vertex, and each short video vertex contains the attribute information of the short video.
  • the present invention constructs corresponding "user-short video" bipartite graphs for the different modalities of short videos (such as visual, textual, and auditory); the topological structure of the bipartite graphs of the different modalities is the same, and the vertices contain the attribute information under the corresponding modality.
  • the neighborhood is the set of neighbor vertices.
  • the neighbors of a vertex are simply the vertices directly connected to it.
  • the neighborhood is the set of all vertices directly connected to it.
  • the first-order neighborhood refers to the set of first-order neighbor vertices; because pooling aggregation computes over each neighbor vertex in a neighborhood, it measures the degree of influence of different neighbors on the target vertex.
  • the aggregation layer designed in the present invention aggregates the neighborhood information of the target vertex and outputs the high-order representation vector of the target vertex; the integration layer integrates the target node information with the neighborhood information; the fusion layer fuses the multiple modal information of the target vertex, learning user and short video vector representations that contain information from different aggregation levels and reflect the differences in the information contained in the different modalities of short videos; the output layer calculates the degree of similarity between the user vector and the short video vector, predicts the probability that the user will interact with the short video, and generates recommendations for the user.
  • a "user-short video" bipartite graph is constructed based on the user's interactive behavior on the short videos.
  • the interactive behavior is defined as the user fully watching a short video or liking a watched short video; the short videos a user has interacted with form a sequence of the form user 1: [video 1, video 2, ..., video n]. As shown in Figure 4, users and short videos correspond to the vertices of the graph, and there is a direct edge between a user vertex and each short video vertex it has interacted with, constructing the "user-short video" interaction bipartite graph.
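  • A minimal sketch of this construction (the user and video identifiers are hypothetical): each per-user interaction sequence becomes a set of edges, and a user's first-order neighborhood is exactly the set of videos it interacted with.

```python
def build_bipartite_graph(interactions):
    """interactions: dict mapping a user id to the list of short-video ids
    the user fully watched or liked. Returns adjacency sets for both
    vertex subsets of the "user-short video" bipartite graph."""
    user_adj, video_adj = {}, {}
    for user, videos in interactions.items():
        user_adj.setdefault(user, set())
        for v in videos:
            user_adj[user].add(v)                     # edge user -- video
            video_adj.setdefault(v, set()).add(user)  # same edge, video side
    return user_adj, video_adj

interactions = {"user1": ["vid1", "vid2"], "user2": ["vid2", "vid3"]}
user_adj, video_adj = build_bipartite_graph(interactions)
# user_adj["user1"] is the first-order neighborhood of user1
```

Because edges only ever connect a user vertex to a video vertex, the two adjacency maps together are the bipartite graph.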
  • Each source or form of information can be called a modality. People receive information through sight, hearing, smell and touch, and information is transmitted in forms such as images, text and voice.
  • Short video includes three types of modal information: visual modal information, text modal information, and auditory modal information.
  • each modality is represented by a vector of fixed dimension: the visual modal information is represented by the 128-dimensional vector output when the video cover picture is passed through a convolutional neural network; the text modal information is represented by the 128-dimensional vector output after the video title text is word-segmented and vectorized by a natural language processing model; the auditory modal information is represented by the 128-dimensional vector output after the background music and character speech are truncated and passed through a convolutional neural network.
  • the vertices are distinguished by modal type, where M = {V, T, A} is the set of modal types: V is the visual modality, T is the text modality, and A is the auditory modality.
  • the short video vertex attribute information in the bipartite graph is the corresponding modal information of the short video, and the distance between vertices in the graphs of different modalities represents the difference in information between the modalities of the vertices.
  • the present invention adopts a two-layer GCN (Graph Convolutional Network) structure to perform a two-level aggregation operation (Bi-level Aggregation, i.e., first-order and second-order neighborhood aggregation) on the vertices;
  • Figure 6 shows the aggregation operation from different viewing angles.
  • the role of the aggregation layer is to aggregate the neighborhood information of the target vertex to obtain a vector that characterizes the target neighborhood.
  • Each aggregation operation is composed of neighborhood aggregation and nonlinear processing.
  • Neighborhood aggregation: for the k-th order neighborhood N_k(v) of the target vertex v under modality m, the aggregation operation is performed by the aggregation function f_agg(·):

    h_{m,N(v)}^{(k)} = f_agg({h_{m,u}^{(k-1)} | u ∈ N_k(v)})

    where vertex u is a vertex in the k-th order neighborhood N_k(v) of the target vertex v, h_{m,u}^{(k-1)} is the representation vector of vertex u at the (k-1)-th layer (at layer 0 it is the original attribute feature x_{m,u} of the vertex under the specific modality), and h_{m,N(v)}^{(k)} is the aggregated information of the k-th order neighborhood of the target vertex v.
  • Non-linear processing: the first-order and second-order neighborhood information of the target vertex is obtained by the neighborhood aggregation operation; the original information of the target vertex is spliced with its neighborhood information and input into a single-layer neural network to obtain the high-order features of the target vertex:

    h_{m,v}^{(k)} = LeakyReLU(W^{(k)} [h_{m,v}^{(k-1)}, h_{m,N(v)}^{(k)}] + b)
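  • A minimal numeric sketch of one such aggregation operation, using the mean as the aggregation function; the parameters W, b and the LeakyReLU activation are illustrative assumptions:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def aggregate_step(h_v, neighbor_vecs, W, b):
    # One aggregation operation: mean-aggregate the neighborhood, splice
    # the target vertex's own vector with the neighborhood vector, and
    # pass the result through a single-layer network.
    h_nbr = np.mean(neighbor_vecs, axis=0)      # neighborhood aggregation
    spliced = np.concatenate([h_v, h_nbr])      # [self ; neighborhood]
    return leaky_relu(W @ spliced + b)          # non-linear processing

h_v = np.array([1.0, 1.0])
neighbors = np.array([[1.0, 3.0], [3.0, 1.0]])
W = np.ones((2, 4))                             # hypothetical parameters
b = np.zeros(2)
out = aggregate_step(h_v, neighbors, W, b)      # -> array([6., 6.])
```

Applying this step twice (once over first-order neighbors, once over second-order) yields the two-level aggregation the method describes.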
  • the output vector h_{m,v}^{(k)} of the k-th aggregation layer represents the high-order representation information of vertex v under modality m.
  • the constructed aggregation function f_agg(·) is permutation invariant, i.e., the output of the aggregation function does not depend on the order of the input neighbor vertices, and it can effectively capture neighbor vertex information.
  • the present invention constructs the aggregate function in the following three ways:
  • Average aggregation: the simplest and most intuitive way to aggregate neighbor information is to take each vertex u in the k-th order neighborhood N_k(v) of the target vertex v under modality m and average its (k-1)-th layer representation vectors element-wise:

    h_{m,N(v)}^{(k)} = mean({h_{m,u}^{(k-1)} | u ∈ N_k(v)})
  • after introducing a self-connection in the adjacency matrix of the target vertex, so that the target vertex's own information is retained, the aggregation function is transformed into:

    h_{m,N(v)}^{(k)} = mean({h_{m,v}^{(k-1)}} ∪ {h_{m,u}^{(k-1)} | u ∈ N_k(v)})
  • the aggregate function is equivalent to integrating the features of the target vertex into the neighborhood features.
  • in this case the neighborhood features are directly used as the input of the single-layer network, which avoids the noise introduced by the splicing operation and reduces the computational complexity at the same time.
  • the corresponding aggregation layer output is:

    h_{m,v}^{(k)} = LeakyReLU(W^{(k)} h_{m,N(v)}^{(k)} + b)
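  • The self-connection variant can be sketched as follows (a toy example with hypothetical 2-dimensional features): the target vertex is simply averaged in with its neighbors, so no later splicing with the self vector is needed.

```python
import numpy as np

def mean_aggregate_with_self(h_v, neighbor_vecs):
    # Average aggregation with a self-connection: the target vertex is
    # averaged together with its neighbors, folding its own features
    # directly into the neighborhood feature.
    return np.vstack([h_v[None, :], neighbor_vecs]).mean(axis=0)

h_v = np.array([0.0, 3.0])
neighbors = np.array([[2.0, 1.0], [4.0, 5.0]])
agg = mean_aggregate_with_self(h_v, neighbors)  # -> array([2., 3.])
```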
  • the pooling operation is usually used in deep neural networks to extract and compress the incoming information from the network layer.
  • the present invention introduces the maximum pooling aggregation operation into the single-layer network structure of the GCN:

    h_{m,N(v)}^{(k)} = max({LeakyReLU(W_pool h_{m,u}^{(k-1)} + b) | u ∈ N_k(v)})
  • W pool is the pooling parameter matrix
  • b is the bias.
  • the transmission of information in the network is equivalent to being encoded into features of multiple channels.
  • the present invention performs the element-wise maximum pooling operation on the features of the target vertex's neighbor set, so that the most significant neighbor vertex in a given feature dimension has the greatest influence on the target vertex in that dimension. Compared with average aggregation, maximum pooling aggregation distinguishes more effectively the contribution of different neighbors to the output in each feature dimension.
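  • A small sketch of max-pooling aggregation (W_pool and b are hypothetical learned parameters; an identity W_pool and zero bias are used here purely for illustration, and the activation is omitted for brevity):

```python
import numpy as np

def max_pool_aggregate(neighbor_vecs, W_pool, b):
    # Transform each neighbor with a single-layer network, then take the
    # element-wise maximum so that the most significant neighbor in each
    # feature dimension dominates that dimension.
    transformed = neighbor_vecs @ W_pool.T + b
    return transformed.max(axis=0)

neighbors = np.array([[1.0, 4.0], [3.0, 2.0]])
W_pool = np.eye(2)                              # identity, for illustration
b = np.zeros(2)
agg = max_pool_aggregate(neighbors, W_pool, b)  # -> array([3., 4.])
```

Note how each output dimension comes from a different neighbor, which is exactly what lets max pooling separate per-dimension contributions.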
  • Attention mechanism aggregation: in order to aggregate vertex neighborhood information more concisely and effectively, the present invention introduces attention scores between graph vertices in a node-wise manner to measure the similarity between the target vertex and its neighbor vertices. Assuming that vertex i is a neighbor of vertex v, the similarity sim_{v,i} between the two is defined as:

    sim_{v,i} = a(W_v h_v, W_i h_i)
  • W is the parameter matrix in the forward neural network
  • W v and W i are the corresponding parameter matrices of the vertices v and i in the forward propagation neural network, and are multiplied by the representation vector of the vertex to expand the feature dimension of the vertex
  • the function a(·,·) maps the spliced high-dimensional vector features to the real number domain; N_1(v) and N_2(v) are the first-order neighborhood and the second-order neighborhood of vertex v.
  • Neighbor-by-neighbor aggregation is then performed on the target vertex v, weighting each neighbor by the softmax of its attention score:

    h_{m,N(v)} = Σ_{i∈N(v)} softmax(sim_{v,i}) · W h_i
  • W is the same as W in the formula for calculating similarity.
  • the present invention introduces the multi-head attention mechanism into the aggregation operation and sets the number of attention heads to P, combining the outputs of the P independent heads into the final neighborhood representation.
  • if the number of neighbors of the target vertex is less than the set value, the number is filled up by repeated sampling; if the number of neighbors exceeds the set value and the aggregation method is average or maximum pooling, the set number of neighbors is randomly selected; if the aggregation method is the attention mechanism, the neighbor vertices with larger attention scores are preferentially selected.
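  • The attention-based aggregation above, including the multi-head variant, can be sketched as follows. The parameters W_v, W_i and the scoring vector a are hypothetical, and averaging the P head outputs (rather than concatenating them) is an assumption:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_aggregate(h_v, neighbor_vecs, W_v, W_i, a):
    # One attention head: score each neighbor i by splicing W_v h_v with
    # W_i h_i and mapping the result to a real number via a; softmax the
    # scores into weights; take the weighted sum of transformed neighbors.
    scores = np.array([a @ np.concatenate([W_v @ h_v, W_i @ h_i])
                       for h_i in neighbor_vecs])
    alpha = softmax(scores)
    return alpha @ np.array([W_i @ h_i for h_i in neighbor_vecs])

def multi_head_aggregate(h_v, neighbor_vecs, heads):
    # Multi-head variant with P = len(heads): average the head outputs.
    return np.mean([attention_aggregate(h_v, neighbor_vecs, *h)
                    for h in heads], axis=0)

h_v = np.array([1.0, 0.0])
neighbors = np.array([[2.0, 0.0], [0.0, 2.0]])
I = np.eye(2)
a = np.zeros(4)                  # zero scores -> uniform attention weights
one_head = attention_aggregate(h_v, neighbors, I, I, a)
two_heads = multi_head_aggregate(h_v, neighbors, [(I, I, a)] * 2)
```

With a zero scoring vector every neighbor gets the same weight, so the output reduces to the plain average of the neighbors, which makes the example easy to check by hand.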
  • the integration layer designed in the present invention integrates input information from different sources in the same modality:

    H_{m,v} = f_merge(h_{m,v}, x_{m,v}, h_{v,id})
  • f merge ( ⁇ ) is the integration function
  • the output of the integration layer is H_{m,v}, the representation vector of vertex v under modality m; h_{m,v} ∈ R^{d_m} (in the real number domain R, with dimension d_m) is the output of vertex v through the aggregation layer under modality m, representing the high-order aggregation information of the vertex; x_{m,v} is the original information contained in the vertex under modality m, which can be regarded as the zeroth-order information; h_{v,id} is the embedding vector of vertex v obtained by a graph embedding method in the "user-short video" bipartite graph, which can be regarded as the representation vector of the vertex's structural information.
  • the function of the integration layer in the model is to integrate the low-level information (own attribute information) and high-level information (neighborhood information) of the target vertex in a specific mode.
  • the present invention uses two integration functions to integrate the vertex information.
  • Hierarchical integration: the original information and ID embedding information of the vertex are defined as the low-level information of the vertex, and the vector generated by element-wise splicing the two and passing them through one layer of feedforward neural network is defined as the low-level representation containing the vertex's structure and content information:
  • h_{m,v,low} = LeakyReLU(W_merge [x_{m,v}, h_{v,id}] + b);
  • W merge is the parameter matrix of the single-layer neural network of the integration layer
  • b is the bias.
  • the low-level representation h_{m,v,low} of the vertex and the high-level information h_{m,v} of the vertex are spliced as the output of the integration layer:

    H_{m,v} = [h_{m,v,low}, h_{m,v}].
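  • The hierarchical integration above can be sketched with toy 2-dimensional inputs; W_merge, b and the LeakyReLU choice follow the formulas, while the specific values are illustrative assumptions:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def hierarchical_integrate(x_mv, h_id, h_high, W_merge, b):
    # Splice the vertex's original attributes with its ID embedding, pass
    # them through one feedforward layer to get the low-level
    # representation, then splice that with the aggregation-layer output
    # (the high-level information).
    h_low = leaky_relu(W_merge @ np.concatenate([x_mv, h_id]) + b)
    return np.concatenate([h_low, h_high])

x_mv = np.array([1.0, 0.0])      # original (zeroth-order) information
h_id = np.array([0.0, 1.0])      # ID embedding (structure information)
h_high = np.array([3.0, 3.0])    # aggregation-layer output
W_merge = np.ones((2, 4))
b = np.zeros(2)
H = hierarchical_integrate(x_mv, h_id, h_high, W_merge, b)
```

The output dimension is the sum of the low-level and high-level dimensions, matching the splicing in H_{m,v} = [h_{m,v,low}, h_{m,v}].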
  • the integration layer integrates the data from the different sources of a vertex in a specific modality to obtain the representation vectors of the user vertices and the short video vertices in the different modalities. The fusion layer then fuses the multiple modal representation vectors of the vertices (user vertices and short video vertices) into a single representation per vertex.
  • the representations of similar vertices in the "user-short video" bipartite graph become more similar, and the representations of vertices far apart from each other become more distinguishable.
  • a negative sampling method is used for unsupervised optimization. A short video vertex i_p directly connected to the user vertex u in the "user-short video" bipartite graph is defined as a positive sample; a negative sample is defined as a short video vertex i_n with a higher degree in the bipartite graph that has no direct edge to the target user vertex.
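  • A sketch of this sampling rule on a toy graph (user and video names are hypothetical; deterministically picking the single highest-degree candidate is a simplification of the degree-biased sampling described):

```python
def sample_negative(user, user_adj, video_adj):
    # Candidates are short-video vertices with no direct edge to the
    # user; among them, prefer a high-degree vertex (here the
    # highest-degree one, with ties broken by name for determinism).
    candidates = [v for v in video_adj if v not in user_adj[user]]
    return max(candidates, key=lambda v: (len(video_adj[v]), v))

user_adj = {"u": {"v1"}}
video_adj = {"v1": {"u"}, "v2": {"a", "b"}, "v3": {"a"}}
neg = sample_negative("u", user_adj, video_adj)   # -> "v2" (degree 2)
```

High-degree negatives are popular videos the user nevertheless skipped, which makes them informative "hard" negatives for the loss.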
  • the inner product of the optimized user vector z_u and the short video vector z_i to be inferred is taken, and the probability p(interact) of the user interacting with the short video is output.
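  • A minimal sketch of the output layer; squashing the inner product through a sigmoid is an assumption, since the text only states that an interaction probability is output:

```python
import numpy as np

def interaction_probability(z_u, z_i):
    # Inner product of the user vector and the short-video vector,
    # mapped to (0, 1) by a sigmoid (assumed squashing function).
    return 1.0 / (1.0 + np.exp(-(z_u @ z_i)))

z_u = np.array([1.0, 0.0])
z_i = np.array([0.0, 1.0])
p = interaction_probability(z_u, z_i)   # orthogonal vectors -> 0.5
```

Ranking candidate short videos by this probability for each user yields the final recommendation list.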
  • the representation learning of the vertices is carried out by constructing modal-level "user-short video" bipartite graphs. Due to the "semantic gap" between modalities in multi-modal data, it is difficult for existing graph convolutional network methods applied to recommendation to distinguish the differences in the information contained in different modalities or to model each modality separately.
  • the present invention constructs a bipartite graph and a corresponding graph convolution network for different modalities of short videos, learns vector representations of users and short video vertices in different modalities, and achieves the purpose of fine-grained personalized recommendation for users.
  • the present invention performs a second-level aggregation operation between the target vertex and its second-order neighbors in the graph, enhancing the role of the target vertex's second-order neighbor information in its representation learning and maintaining the integrity of high-order neighbor information transmission.
  • the method based on the attention mechanism of the present invention uses the attention scores between vertices as the weights in the aggregation process.
  • the introduction of a multi-head attention mechanism is equivalent to ensemble learning over multiple attention aggregation operations, making the learned vertex representation vectors more robust.
  • the outer product operation is performed on the content vector and structure vector of the vertex in the integration layer.
  • the graph embedding method is applied to the bipartite graph to learn the topological-structure representation of the target vertex as its structure vector; the original attribute vector of the target vertex is spliced with its high-order representation vector from the aggregation layer to form the vertex's content vector. The outer product of the two is, from the data point of view, equivalent to a feature-dimension expansion: the two one-dimensional representation vectors are mapped into a two-dimensional plane space and then transformed by one layer of feedforward neural network into a one-dimensional vector output H_{m,v} ∈ R^d containing both kinds of information, achieving the purpose of integrating the different source information of the target vertex.
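  • The outer-product integration variant can be sketched as follows (W and b are hypothetical learned parameters, and the feedforward activation is omitted for brevity):

```python
import numpy as np

def outer_product_integrate(content_vec, structure_vec, W, b):
    # The outer product maps the two 1-d vectors onto a 2-d plane
    # (a feature-dimension expansion); a feedforward layer then maps the
    # flattened plane back to a 1-d output vector.
    plane = np.outer(content_vec, structure_vec)
    return W @ plane.ravel() + b

content = np.array([1.0, 2.0])      # attributes + high-order representation
structure = np.array([3.0, 4.0])    # graph-embedding structure vector
W = np.ones((2, 4))
b = np.zeros(2)
H = outer_product_integrate(content, structure, W, b)
```

Unlike plain splicing, the outer product exposes every pairwise interaction between content and structure dimensions before the projection back to d dimensions.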
  • the present invention learns the representation of vertices by constructing a bipartite graph of "user-short video" at the modal level.
  • Other alternative variants can construct single-vertex-type graphs at the modal level, such as "user-user" and "short video-short video" graphs, and use graph convolutional networks to learn representations of the user or short video vertices.
  • the present invention performs two-level (first-order and second-order) aggregation operations on the vertices (user vertices and short video vertices) in the aggregation layer to quantify the influence of vertex neighbors and to model high-order representations of the vertices; a variant scheme can perform higher-order (third-order or above) aggregation on the vertices (user vertices and short video vertices) for representation learning.
  • the present invention also provides an intelligent terminal correspondingly.
  • the intelligent terminal includes a processor 10, a memory 20 and a display 30.
  • FIG. 7 only shows part of the components of the smart terminal, but it should be understood that it is not required to implement all the shown components, and more or fewer components may be implemented instead.
  • the memory 20 may be an internal storage unit of the smart terminal, such as a hard disk or a memory of the smart terminal.
  • the memory 20 may also be an external storage device of the smart terminal, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, etc. equipped on the smart terminal.
  • the memory 20 may also include both an internal storage unit of the smart terminal and an external storage device.
  • the memory 20 is used to store application software installed on the smart terminal and various types of data, such as the program code installed on the smart terminal.
  • the memory 20 can also be used to temporarily store data that has been output or will be output.
  • a graph model-based short video recommendation program 40 is stored in the memory 20, and the graph model-based short video recommendation program 40 can be executed by the processor 10, so as to implement the graph model-based short video recommendation method of this application.
  • the processor 10 may, in some embodiments, be a central processing unit (CPU), a microprocessor, or another data processing chip, used to run the program code stored in the memory 20 or to process data, for example to execute the graph model-based short video recommendation method.
  • the display 30 may be an LED display, a liquid crystal display, a touch liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, and the like.
  • the display 30 is used for displaying information on the smart terminal and for displaying a visualized user interface.
  • the components 10-30 of the smart terminal communicate with each other via a system bus.
  • the aggregation layer outputs the high-order representation vector of the target vertex itself by aggregating the neighborhood information of the target vertex;
  • the integration layer integrates target node information with neighborhood information;
  • the fusion layer fuses multiple modal information of the target vertex;
  • the output layer calculates the degree of similarity between the user vector and the short video vector, predicts the probability that the user will interact with the short video, and recommends the short video for the user.
  • the interactive behavior is defined as the user watching a short video in its entirety or performing a thumbs-up operation on the watched short video.
  • the constructing a bipartite graph of the corresponding relationship between the user and the short video according to the user's interactive behavior on the short video also includes:
  • the short video includes visual modal information, text modal information, and auditory modal information;
  • the visual modal information is represented by a 128-dimensional vector output from a video cover picture through a convolutional neural network
  • the text modal information is represented by a 128-dimensional vector outputted by word segmentation and natural language processing model vectorization of the video title text;
  • the auditory modal information is represented by a 128-dimensional vector output by a convolutional neural network after the background music and character speech are truncated.
  • the aggregation layer is used to aggregate the neighborhood information of the target vertex to obtain a vector representing the target neighborhood, and each aggregation operation consists of neighborhood aggregation and nonlinear processing.
  • the neighborhood aggregation is: performing an aggregation operation on the neighborhood of the target vertex through an aggregation function
  • the non-linear processing is: obtaining the first-order and second-order neighborhood information of the target vertex by the neighborhood aggregation operation, splicing the original information of the target vertex with its neighborhood information, and inputting the result into a single-layer neural network to obtain the high-order features of the target vertex.
  • the construction methods of the aggregation function include: average aggregation, maximum pooling aggregation, and attention mechanism aggregation.
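The three aggregation-function constructions named above might look as follows in a minimal numpy sketch; the exact transforms and weightings are assumptions, since the excerpt does not give their formulas.

```python
import numpy as np

def mean_agg(neigh):
    """Average aggregation: unweighted mean of the neighbour vectors."""
    return neigh.mean(axis=0)

def maxpool_agg(neigh, W):
    """Max-pooling aggregation: transform each neighbour, take the element-wise max."""
    return np.tanh(neigh @ W).max(axis=0)

def attention_agg(neigh, target):
    """Attention aggregation: weight neighbours by their similarity to the target vertex,
    which quantifies the influence of different neighbours."""
    scores = neigh @ target
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ neigh

rng = np.random.default_rng(2)
neigh = rng.normal(size=(3, 4))   # three neighbour vertices, 4-dim features
target = rng.normal(size=4)       # target vertex features
W = rng.normal(size=(4, 4))       # hypothetical pooling weights

for h in (mean_agg(neigh), maxpool_agg(neigh, W), attention_agg(neigh, target)):
    assert h.shape == (4,)
```

All three map a variable-size neighbourhood to a fixed-size vector, which is what the aggregation layer requires.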
  • the integration layer is used to integrate input information from different sources in the same modality, and to integrate the low-order information and high-order information of the target vertex in a specific modality, to obtain the representation vectors of user vertices and short video vertices in different modalities;
  • the fusion layer is used to merge multiple modal representation vectors of the user vertex and the short video vertex.
  • the present invention also provides a storage medium, wherein the storage medium stores a graph model-based short video recommendation program, and the graph model-based short video recommendation program is executed by a processor to implement the steps of the graph model-based short video recommendation method; the details are as described above.
  • the present invention provides a short video recommendation method based on a graph model, an intelligent terminal, and a storage medium.
  • the method includes: constructing a bipartite graph of the correspondence between users and short videos according to the users' interactive behavior on short videos; the aggregation layer outputting the high-order representation vector of a target vertex by aggregating the neighborhood information of the target vertex; the integration layer integrating target node information with neighborhood information; the fusion layer fusing multiple modal information of the target vertex; and the output layer calculating the similarity between the user vector and the short video vector, predicting the probability that the user will interact with the short video, and recommending short videos to the user.
  • by constructing bipartite graphs and corresponding graph convolutional networks for the different modalities of short videos, the present invention learns vector representations of user and short video vertices in each modality, achieving fine-grained personalized recommendation for users.
  • the processes in the methods of the above-mentioned embodiments can be implemented by instructing relevant hardware (such as a processor, a controller, etc.) through a computer program, and the program can be stored in a computer-readable storage medium.
  • the program may include the processes of the foregoing method embodiments when executed.
  • the storage medium mentioned may be a memory, a magnetic disk, an optical disk, and the like.


Abstract

A graph model-based short video recommendation method, an intelligent terminal and a storage medium, the method comprising: constructing a bipartite graph of the correspondence between users and short videos according to the users' interaction behavior with short videos (S10); an aggregation layer outputting a high-order representation vector of a target vertex by aggregating the neighborhood information of the target vertex (S20); an integration layer integrating the target node information and the neighborhood information (S30); a fusion layer fusing the multiple modal information of the target vertex (S40); and an output layer calculating the similarity between the user vector and a short video vector, predicting the probability that the user interacts with the short video, and recommending short videos to the user (S50). Bipartite graphs and corresponding graph convolutional networks are constructed for the different modalities of short videos, and vector representations of user and short video vertices in the different modalities are learned, thereby achieving fine-grained personalized recommendation for users.

Description

Graph model-based short video recommendation method, intelligent terminal and storage medium

Technical Field

The present invention relates to the technical field of information processing, and in particular to a graph model-based short video recommendation method, an intelligent terminal, and a storage medium.

Background
In the information age, faced with ever-increasing Internet information, personalized recommendation serves as a bridge between service providers and users. It enables companies to effectively mine useful information from massive data and put it to use: it can uncover users' interests and preferences, improve user experience, increase user stickiness, and thereby raise revenue; for users, it helps them quickly find items of interest in the platform's massive information repository. Personalized recommendation has become a core component of many online content-sharing services, such as picture, blog, and music recommendation. For example, the recently emerging short video sharing platforms Kuaishou and Douyin have drawn growing attention to short video recommendation methods. Unlike single-modality media content such as images and music, a short video contains rich multimedia information (the video cover picture, the background music, and the text description of the video), constituting content in the visual, auditory, and textual modalities. Integrating this multimodal information into the historical interaction behavior between users and short videos helps capture user preferences at a deeper level.
Traditional recommendation algorithms for short videos generally fall into two families: methods based on collaborative filtering (CF) and methods based on graph convolutional networks (GCN).
Collaborative filtering approaches can be roughly divided into two types. Both use the historical "user-video" interaction behavior to construct a "user-video" interaction matrix, and either recommend to the target user items liked by similar users (user-based collaborative filtering) or recommend items similar to the target user's preferred items (item-based collaborative filtering). Collaborative filtering models can make full use of users' explicit feedback (likes, follows, comments, etc.) and implicit feedback (browsing records, dwell time, etc.) to predict user-item interactions, but they are easily limited by data sparsity, so the recommendation results have certain limitations. For example, when explicit feedback is insufficient and user feedback is scarce, it is difficult for the recommendation algorithm to learn meaningful user preference information; relying on implicit feedback also tends to make the recommender system "short-sighted", i.e., the lists recommended to users are dominated by popular head items, sacrificing the personalization and diversity of recommendations. Although collaborative filtering is simple and fast, it can only exploit the users' interaction behavior with short videos and cannot exploit the rich multimodal information of short videos.
Graph convolutional network methods for recommendation generally construct a "user-video" bipartite graph from the users' interactions with items. In the bipartite graph, the attribute information of a target node's neighborhood set is aggregated as the node's own high-order representation, information is propagated between nodes, and the representation vectors of user nodes and video nodes are finally learned; by computing the similarity between a user vector and a video vector, the probability that the user will interact with the short video is predicted. Compared with collaborative filtering, graph convolutional network methods convert user interaction sequences, which are non-Euclidean behavioral data, into a bipartite graph structure for use, and propagate the attribute information of short videos between nodes in the graph through node neighborhood aggregation. However, currently proposed graph convolutional network methods generally concatenate the multimodal attribute information of short video nodes and propagate it as a whole, without considering the semantic gap between different modalities, that is, the differences in the information the modalities contain; as a result, the representation learning of users and short videos is not fine-grained enough.
Both collaborative filtering and graph convolutional network methods exploit the historical interaction behavior between users and videos (items), but in different forms: the former uses it to construct a "user-video" interaction matrix, while the latter converts it into a "user-video" bipartite graph. The interaction matrix constructed by collaborative filtering can only use interaction behavior information (e.g., it can only capture "user A clicked video 1") and cannot use the attribute information of videos (such as their visual, textual, auditory, and other multimodal information). A graph convolutional network is effectively an improvement on collaborative filtering: it can use the attribute information of videos to learn representation vectors of users and videos, but it generally feeds the multimodal information of a video into the model as a whole, without modeling the modalities separately.
The common problem of existing collaborative filtering and graph convolutional network methods is that neither learns user and short video representations at the modal level, so neither can measure the influence of modal differences on user preferences.

Therefore, the prior art still needs improvement and development.
Summary of the Invention

In view of the fact that the prior art does not learn user and short video representations at the modal level and cannot measure the influence of modal differences on user preferences, the present invention provides a graph model-based short video recommendation method, an intelligent terminal, and a storage medium.

The technical solution adopted by the present invention to solve the technical problem is as follows:
A graph model-based short video recommendation method, wherein the graph model-based short video recommendation method includes:

constructing a bipartite graph of the correspondence between users and short videos according to the users' interactive behavior on short videos;

an aggregation layer outputting the high-order representation vector of a target vertex by aggregating the neighborhood information of the target vertex;

an integration layer integrating the target node information with the neighborhood information;

a fusion layer fusing multiple modal information of the target vertex;

an output layer calculating the degree of similarity between the user vector and the short video vector, predicting the probability that the user will interact with the short video, and recommending short videos to the user.
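A minimal sketch of the output-layer step above: score each candidate short video by inner-product similarity with the user vector, squash the score to an interaction probability, and return the top-k candidates. The similarity measure and sigmoid are illustrative assumptions; the excerpt only says the layer computes similarity and predicts an interaction probability.

```python
import numpy as np

def recommend(user_vec, video_vecs, k=2):
    """Score videos by inner-product similarity with the user vector, convert to
    interaction probabilities, and return the indices of the top-k videos."""
    scores = video_vecs @ user_vec
    probs = 1.0 / (1.0 + np.exp(-scores))   # predicted interaction probability
    top = np.argsort(-probs)[:k]            # highest-probability videos first
    return top, probs

rng = np.random.default_rng(3)
user = rng.normal(size=8)            # learned user representation (hypothetical)
videos = rng.normal(size=(5, 8))     # learned short video representations

top, probs = recommend(user, videos)
assert len(top) == 2 and np.all((probs >= 0) & (probs <= 1))
```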
In the graph model-based short video recommendation method, the interactive behavior is defined as a user fully watching a short video or liking a watched short video.

In the graph model-based short video recommendation method, constructing the bipartite graph of the correspondence between users and short videos according to the users' interactive behavior on short videos further includes:

constructing modal-level bipartite graphs of the correspondence between users and short videos.
In the graph model-based short video recommendation method, the short video includes visual modal information, textual modal information, and auditory modal information;

the visual modal information is represented by a 128-dimensional vector output by a convolutional neural network from the video cover picture;

the textual modal information is represented by a 128-dimensional vector obtained by segmenting the video title text and vectorizing it with a natural language processing model;

the auditory modal information is represented by a 128-dimensional vector output by a convolutional neural network after the background music and character speech are truncated.
In the graph model-based short video recommendation method, the aggregation layer is used to aggregate the neighborhood information of the target vertex to obtain a vector representing the target neighborhood, and each aggregation operation consists of neighborhood aggregation and nonlinear processing.

The neighborhood aggregation is: performing an aggregation operation on the neighborhood of the target vertex through an aggregation function;

the nonlinear processing is: obtaining the first-order and second-order neighborhood information of the target vertex through the neighborhood aggregation operation, concatenating the original information of the target vertex with its neighborhood information, and feeding the result into a single-layer neural network to obtain high-order features of the target vertex.
In the graph model-based short video recommendation method, the aggregation function may be constructed by average aggregation, max-pooling aggregation, or attention-mechanism aggregation.
In the graph model-based short video recommendation method, the integration layer is used to integrate input information from different sources in the same modality, and to integrate the low-order and high-order information of the target vertex in a specific modality, obtaining the representation vectors of user vertices and short video vertices in different modalities;

the fusion layer is used to fuse the multiple modal representation vectors of the user vertices and short video vertices.
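One way the fusion layer could combine the per-modality representation vectors is a learned weighted sum. The excerpt does not fix the fusion operator, so the softmax weighting below is an assumption; concatenation would be another common choice.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4
# Modal representation vectors of one vertex: visual (V), textual (T), auditory (A).
h = {"V": rng.normal(size=d), "T": rng.normal(size=d), "A": rng.normal(size=d)}

# Hypothetical learned per-modality logits, normalized to weights that sum to 1,
# so the fused vector is a convex combination of the modal vectors.
logits = rng.normal(size=3)
alpha = np.exp(logits) / np.exp(logits).sum()
fused = sum(a * h[m] for a, m in zip(alpha, ("V", "T", "A")))

assert fused.shape == (d,)
```

Keeping the weights per modality lets the model express how much each modality contributes to the final vertex representation.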
An intelligent terminal, wherein the intelligent terminal includes the graph model-based short video recommendation system described above, and further includes: a memory, a processor, and a graph model-based short video recommendation program stored in the memory and runnable on the processor; when the graph model-based short video recommendation program is executed by the processor, the steps of the graph model-based short video recommendation method described above are implemented.

A storage medium, wherein the storage medium stores a graph model-based short video recommendation program; when the graph model-based short video recommendation program is executed by a processor, the steps of the graph model-based short video recommendation method described above are implemented.
According to the present invention, a bipartite graph of the correspondence between users and short videos is constructed from the users' interactive behavior on short videos; the aggregation layer outputs the high-order representation vector of a target vertex by aggregating the neighborhood information of the target vertex; the integration layer integrates the target node information with the neighborhood information; the fusion layer fuses multiple modal information of the target vertex; the output layer calculates the similarity between the user vector and the short video vector, predicts the probability that the user will interact with the short video, and recommends short videos to the user. By constructing bipartite graphs and corresponding graph convolutional networks for the different modalities of short videos, the present invention learns vector representations of user and short video vertices in the different modalities, achieving fine-grained personalized recommendation for users.
Description of the Drawings

Fig. 1 is a flowchart of a preferred embodiment of the graph model-based short video recommendation method of the present invention;

Fig. 2 is a schematic diagram of the overall framework of the preferred embodiment of the graph model-based short video recommendation method of the present invention;

Fig. 3 is a schematic diagram of the bipartite graph model in the preferred embodiment of the graph model-based short video recommendation method of the present invention;

Fig. 4 is a schematic diagram of constructing the "user-short video" interaction bipartite graph from user interaction behavior in the preferred embodiment of the graph model-based short video recommendation method of the present invention;

Fig. 5 is a schematic diagram of the modal-level "user-short video" bipartite graph in the preferred embodiment of the graph model-based short video recommendation method of the present invention;

Fig. 6 is a schematic diagram of the aggregation layer in the preferred embodiment of the graph model-based short video recommendation method of the present invention;

Fig. 7 is a schematic diagram of the operating environment of a preferred embodiment of the intelligent terminal of the present invention.
Detailed Description

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present invention.
As shown in Fig. 1, the graph model-based short video recommendation method according to a preferred embodiment of the present invention includes the following steps:
Step S10: constructing a bipartite graph of the correspondence between users and short videos according to the users' interactive behavior on short videos;

Step S20: the aggregation layer outputting the high-order representation vector of a target vertex by aggregating the neighborhood information of the target vertex;

Step S30: the integration layer integrating the target node information with the neighborhood information;

Step S40: the fusion layer fusing multiple modal information of the target vertex;

Step S50: the output layer calculating the degree of similarity between the user vector and the short video vector, predicting the probability that the user will interact with the short video, and recommending short videos to the user.
As shown in Fig. 2, the framework of the graph model-based short video recommendation method of the present invention consists of a bipartite graph (user-short video), an aggregation layer, an integration layer, a fusion layer, and an output layer.
A bipartite graph is a special model in graph theory. As shown in Fig. 3, assume a graph G=(V,E) is composed of a vertex set V and an edge set E. If the vertex set V can be partitioned into two disjoint subsets {A,B}, and for any edge e_ij in the graph the two vertices i and j it connects belong to the two different vertex sets (i∈A, j∈B), then G is a bipartite graph, and vertices i and j are first-order neighbors of each other.
Since a user's historical interaction behavior reflects the user's interests and preferences, a "user-short video" bipartite graph is constructed. In this graph, the vertices are divided into two subsets, the user vertex set and the short video vertex set; if a user has interacted with a short video (e.g., watched it in full or liked it), an edge directly connects the user vertex and that short video vertex in the "user-short video" bipartite graph. The set of short video vertices in a user's interaction history is the first-order neighborhood of that user vertex, and each short video vertex contains the attribute information of the short video. To measure the degree to which the attribute information of different modalities of a short video (such as the cover picture, title, and background music) influences user preferences, the present invention constructs a corresponding "user-short video" bipartite graph for each modality of the short video (visual, textual, and auditory); the bipartite graphs of the different modalities share the same topology, and their vertices contain the attribute information of the corresponding modality.

A neighborhood is the set of neighbor vertices. Simply put, the neighbors of a vertex are the vertices directly connected to it, and its neighborhood is the set of all vertices directly connected to it; the first-order neighborhood is the set of first-order neighbor vertices. Because pooling aggregation computes over every neighbor vertex within a given neighborhood, it measures the degree of influence of different neighbors on the target vertex.
Following the "aggregate/integrate/readout" structural idea of graph convolutional networks, the aggregation layer designed in the present invention aggregates the neighborhood information of the target vertex and outputs the high-order representation vector of the target vertex itself; the integration layer integrates the target node information with the neighborhood information; the fusion layer fuses multiple modal information of the target vertex, learning user and short video vector representations that contain information from different aggregation levels and reflect the differences in the information contained in the different modalities of short videos; the output layer calculates the similarity between the user vector and the short video vector, predicts the probability that the user will interact with the short video, and generates recommendations for the user.
Specifically, the "user-short video" bipartite graph is built from the users' interaction behaviors with short videos, where an interaction is defined as a user watching a short video in full or liking it. Each user's interacted short video sequence has the form user 1: [video 1, video 2, …, video n]. As shown in Figure 4, users and short videos are mapped to graph vertices, and a direct edge is placed between a user vertex and each short video vertex the user has interacted with, yielding the "user-short video" interaction bipartite graph.
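As a minimal sketch of this construction (the record format, function name, and identifiers below are illustrative assumptions, not part of the invention), the bipartite graph can be held as two adjacency maps, one per vertex subset:

```python
from collections import defaultdict

def build_bipartite_graph(interactions):
    """Build a "user-short video" bipartite graph from interaction records.

    `interactions` is an iterable of (user_id, video_id) pairs, one per
    qualifying interaction (full watch or like). The graph is returned as
    two adjacency maps: user -> set of videos (that user vertex's
    first-order neighborhood) and video -> set of users.
    """
    user_adj = defaultdict(set)
    video_adj = defaultdict(set)
    for user, video in interactions:
        user_adj[user].add(video)
        video_adj[video].add(user)
    return user_adj, video_adj

# hypothetical interaction log
logs = [("u1", "v1"), ("u1", "v2"), ("u2", "v2"), ("u2", "v3")]
user_adj, video_adj = build_bipartite_graph(logs)
```

Because the same edge set is shared across modalities, the three modal graphs described below can reuse these adjacency maps and differ only in the per-vertex attribute vectors.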
Next, "user-short video" bipartite graphs are constructed at the modality level. Any source or form of information can be called a modality: people receive information through sight, hearing, smell, and touch, and information can be conveyed as images, text, speech, and other forms. A short video carries three kinds of modal information, namely visual, textual, and auditory, and the information of each modality is represented by a fixed-dimensional vector: the visual modality is represented by a 128-dimensional vector obtained by passing the video cover image through a convolutional neural network; the textual modality is represented by a 128-dimensional vector obtained by segmenting the video title text and vectorizing it with a natural language processing model; the auditory modality is represented by a 128-dimensional vector obtained by truncating the background music and speech and passing them through a convolutional neural network. As shown in Figure 5, the vertices are distinguished by modality type m ∈ M = {V, T, A}, where M is the set of modality types, V denotes the visual modality, T the textual modality, and A the auditory modality. A modality-level "user-short video" bipartite graph G_m, m ∈ {V, T, A}, is constructed for each modality; the short video vertex attributes in each graph are the short video's information under the corresponding modality, and the distance between vertices in the different modal graphs represents the difference between the vertices' information in those modalities.
Further, as shown in Figure 6, following the recommendation-system principle that a user's historical interaction behavior reflects the user's interest preferences, the present invention builds a two-layer GCN (Graph Convolutional Network) on the bipartite graph of each modality and performs a bi-level aggregation operation on the vertices (first-order and second-order neighborhood aggregation); Figure 6 shows the aggregation operation from different viewing angles. The role of the aggregation layer is to aggregate the neighborhood information of the target vertex into a vector characterizing that neighborhood; each aggregation operation consists of two parts, neighborhood aggregation and nonlinear processing.
Neighborhood aggregation: for the k-order neighborhood N_k(v) of a target vertex v under modality m, aggregation is performed by an aggregation function f_agg(·):

h_{N_k(v)}^{(l)} = f_agg({h_{m,u}^{(l-1)} : u ∈ N_k(v)})

where l denotes the GCN layer index, u is a vertex in the k-order neighborhood N_k(v) of the target vertex v, h_{m,u}^{(l-1)} is the representation vector of vertex u at layer l-1 under modality m (when the layer index is 0, it is the vertex's raw attribute feature x_{m,u} under that modality), and h_{N_k(v)}^{(l)} is the aggregated k-order neighborhood information of the target vertex v.
Nonlinear processing: the neighborhood aggregation operation yields the first-order and second-order neighborhood information of the target vertex; the target vertex's original information is concatenated with its neighborhood information and fed into a single-layer neural network to obtain the target vertex's high-order features:

h_{m,v}^{(l)} = σ(W^{(l)} [h_{m,v}^{(l-1)}, h_{N_1(v)}^{(l)}, h_{N_2(v)}^{(l)}])

where W^{(l)} is the neural network parameter matrix, h_{m,v}^{(l-1)} is the representation vector of vertex v at layer l-1 under modality m, h_{N_1(v)}^{(l)} and h_{N_2(v)}^{(l)} are respectively the first-order and second-order neighborhood representation vectors of the target vertex v, [·,·] is the vector concatenation operation, σ(·) = max(0, ·) is the ReLU function, which applies a nonlinear transformation to the vector, and h_{m,v}^{(l)} is the output vector of the aggregation layer at GCN layer l for vertex v under modality m, representing the high-order representation information of vertex v in modality m.
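One aggregation-layer step can be sketched as follows. This is a hedged illustration only: it substitutes mean pooling for the generic f_agg(·), and the NumPy shapes, the seed, and the absence of a bias term are assumptions rather than details given in the invention.

```python
import numpy as np

def aggregation_layer(h_v, h_nbr1, h_nbr2, W):
    """One bi-level aggregation step: concatenate the target vertex's
    previous-layer vector with its first- and second-order neighborhood
    vectors (here obtained by mean aggregation), apply the parameter
    matrix W, then ReLU."""
    n1 = h_nbr1.mean(axis=0)            # first-order neighborhood vector
    n2 = h_nbr2.mean(axis=0)            # second-order neighborhood vector
    z = np.concatenate([h_v, n1, n2])   # [h_v, h_N1(v), h_N2(v)]
    return np.maximum(0.0, W @ z)       # sigma(W [h_v, h_N1, h_N2])

d = 4
rng = np.random.default_rng(0)
h_v = rng.normal(size=d)
h_nbr1 = rng.normal(size=(3, d))        # 3 first-order neighbors
h_nbr2 = rng.normal(size=(5, d))        # 5 second-order neighbors
W = rng.normal(size=(d, 3 * d))
out = aggregation_layer(h_v, h_nbr1, h_nbr2, W)
```

The output has the target vertex's representation dimension d and is non-negative because of the ReLU.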
Since the neighbors of a vertex in the "user-short video" bipartite graph are unordered, with no meaningful sequence among them, the constructed aggregation function f_agg(·) should be permutation invariant: its output must not change when the order of the input neighbor vertices changes, while still effectively capturing the neighbor vertex information. The present invention constructs the aggregation function in the following three ways:
(1) Mean aggregation: the simplest and most intuitive way to aggregate neighbor information is to take the vertices u in the k-order neighborhood N_k(v) of the target vertex v under modality m and average their layer-(l-1) representation vectors h_{m,u}^{(l-1)} element-wise:

h_{N_k(v)}^{(l)} = (1 / |N_k(v)|) Σ_{u ∈ N_k(v)} h_{m,u}^{(l-1)}

where h_{N_k(v)}^{(l)} is the k-order neighborhood representation vector of vertex v under modality m and |N_k(v)| is the number of k-order neighbors of vertex v.
Following the idea of introducing self-connections into the target vertex's adjacency matrix so as to retain the target vertex's own information, the aggregation function is modified to

h_{N_k(v)}^{(l)} = (1 / (|N_k(v)| + 1)) Σ_{u ∈ N_k(v) ∪ {v}} h_{m,u}^{(l-1)}

The modified aggregation function effectively blends the target vertex's own features into the neighborhood features, so the subsequent nonlinear processing can take the neighborhood feature directly as the input of the single-layer network; this avoids the noise introduced by the concatenation operation and reduces computational complexity. The corresponding aggregation layer output is

h_{m,v}^{(l)} = σ(W^{(l)} h_{N_k(v)}^{(l)})
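A minimal sketch of mean aggregation with a self-connection (array shapes and names are assumptions for illustration):

```python
import numpy as np

def mean_aggregate_with_self(h_v, h_neighbors):
    """Average the target vertex's own previous-layer vector together
    with its neighbors' vectors, element-wise, i.e. the mean over
    N_k(v) union {v}."""
    stacked = np.vstack([h_v[None, :], h_neighbors])
    return stacked.mean(axis=0)

h_v = np.array([1.0, 2.0])
h_nb = np.array([[3.0, 4.0], [5.0, 6.0]])
agg = mean_aggregate_with_self(h_v, h_nb)  # → array([3., 4.])
```

Because the mean is taken over an unordered set, the result is unchanged if the neighbor rows are permuted, satisfying the permutation-invariance requirement stated earlier.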
(2) Max-pooling aggregation: pooling operations are commonly used in deep neural networks to extract and compress the information passed into a network layer. The present invention introduces a max-pooling aggregation operation into the single-layer network structure of the GCN:

h_{N_k(v)}^{(l)} = max({σ(W_pool h_{m,u}^{(l-1)} + b) : u ∈ N_k(v)})

where W_pool is the pooling parameter matrix, b is the bias, and max(·) is taken element-wise.

Since a deep neural network can extract high-order features of its input, passing information through the network effectively encodes it into features of multiple channels. To measure intuitively how much different neighbors influence the target vertex, the present invention applies an element-wise max-pooling operation to the features of the target vertex's neighbor set: the neighbor vertex that is most prominent in a particular feature dimension exerts the greatest influence on the target vertex in that dimension. Compared with mean aggregation, max-pooling aggregation distinguishes more effectively, per feature dimension, the degree to which different neighbors contribute to the output.
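The max-pooling aggregation can be sketched as below; the identity W_pool and zero bias in the usage line are assumptions chosen only to make the element-wise behavior visible.

```python
import numpy as np

def maxpool_aggregate(h_neighbors, W_pool, b):
    """Max-pooling aggregation: pass each neighbor vector through a
    single-layer network ReLU(W_pool h + b), then take an element-wise
    max across neighbors, so each output dimension is dominated by the
    neighbor most prominent in that dimension."""
    transformed = np.maximum(0.0, h_neighbors @ W_pool.T + b)
    return transformed.max(axis=0)

h_nb = np.array([[1.0, -2.0],
                 [0.0,  3.0]])
agg = maxpool_aggregate(h_nb, np.eye(2), np.zeros(2))  # → array([1., 3.])
```

Note how dimension 0 is taken from the first neighbor and dimension 1 from the second, which is exactly the per-dimension influence measure described above.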
(3) Attention-based aggregation: to aggregate vertex neighborhood information more concisely and effectively, the present invention introduces attention scores between graph vertices in a node-wise manner to measure the similarity between the target vertex and its neighbor vertices. Assuming vertex i is a neighbor of vertex v, the similarity sim_{v,i} between the two is defined as

sim_{v,i} = a([W_v h_v, W_i h_i])

where W is the parameter matrix of the feed-forward neural network, W_v and W_i are the parameter matrices corresponding to vertices v and i in the feed-forward network, and multiplying them by the vertex representation vectors expands the vertices' feature dimensions; the function a(·,·) maps the concatenated high-dimensional vector features to the real domain, and N_1(v) and N_2(v) are the first-order and second-order neighborhoods of vertex v, respectively.
The similarity sim_{v,i} between vertices v and i is passed through the LeakyReLU activation function

LeakyReLU(x) = max(0, x) + β · min(0, x)

for nonlinear transformation, where x denotes the input and β is a small negative-slope coefficient. The resulting value is then normalized with the softmax function over the neighborhood, constraining the result to the interval [0, 1] and yielding the attention score α_{v,i} between vertices v and i:

α_{v,i} = exp(LeakyReLU(sim_{v,i})) / Σ_{j ∈ N(v)} exp(LeakyReLU(sim_{v,j}))

Neighbor-by-neighbor aggregation is then performed on the target vertex v:

h_{N_k(v)}^{(l)} = σ(Σ_{i ∈ N_k(v)} α_{v,i} W h_i)

where W is the same matrix as in the similarity formula above.
To make the aggregation result more robust, the present invention introduces a multi-head attention mechanism into the aggregation operation, setting the number of attention heads to P:

h_{N_k(v)}^{(l)} = (1/P) Σ_{p=1}^{P} σ(Σ_{u ∈ N_k(v)} α_{v,u}^{p} W^{p} h_u)

where α_{v,u}^{p} is the attention score, in the p-th attention space, between the target vertex v and a neighbor vertex u in its k-order neighborhood, and (1/P) Σ_{p} denotes the averaging operation over the P attention heads.
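A single attention head of this aggregation can be sketched as follows. The choice of a dot product with a learned vector `a_vec` for the mapping a(·,·), the shared projection W, and the random shapes are all assumptions for illustration.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def attention_aggregate(h_v, h_neighbors, W, a_vec):
    """Single-head attention aggregation: score each neighbor against
    the target via a(.) on the concatenated projected vectors, apply
    LeakyReLU, softmax-normalize over the neighborhood, then take the
    attention-weighted sum of projected neighbor vectors with ReLU."""
    pv = W @ h_v
    sims = np.array([a_vec @ np.concatenate([pv, W @ h_i])
                     for h_i in h_neighbors])        # sim_{v,i}
    e = np.exp(leaky_relu(sims))
    alpha = e / e.sum()                              # attention scores in [0, 1]
    agg = np.maximum(0.0, (alpha[:, None] * (h_neighbors @ W.T)).sum(axis=0))
    return alpha, agg

rng = np.random.default_rng(1)
d = 3
h_v = rng.normal(size=d)
h_nb = rng.normal(size=(4, d))
W = rng.normal(size=(d, d))
a_vec = rng.normal(size=2 * d)
alpha, agg = attention_aggregate(h_v, h_nb, W, a_vec)
```

The multi-head variant would run this P times with independent W and a_vec and average the P aggregated vectors.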
Optimization of the aggregation layer: in the aggregation layer, if the number of neighbors of a target vertex is not limited, the corresponding worst-case complexity is

O(|𝒱| · (|N_1(v)| + |N_1(v)| · |N_2(v)|))

where 𝒱 is the set of all vertices in the "user-short video" bipartite graph, |𝒱| is the number of all vertices, and |N_1(v)| and |N_2(v)| are the numbers of first-order and second-order neighbors of vertex v, respectively. When attention aggregation is used, P neighborhood aggregations are required, so the computational complexity is multiplied by P. Since different target vertices have different numbers of neighbors, the neighborhoods cannot be fed into the model directly. To balance computational complexity and accuracy, the present invention, based on practical results, fixes the number of sampled first-order neighbors and second-order neighbors of a target vertex to preset values and sets the number of attention heads to P = 3. For target vertices with fewer neighbors than the set value, the count is made up by repeated sampling; for target vertices with more neighbors than the set value, if the aggregation method is mean or max pooling, the set number of neighbors is chosen at random, while if the aggregation method is the attention mechanism, the neighbor vertices with the larger attention scores are chosen first.
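This neighbor-count normalization can be sketched as a sampling helper (the function name and the use of Python's `random` module are assumptions; the policy itself follows the description above, and a nonempty neighbor list is assumed):

```python
import random

def sample_neighbors(neighbors, size, scores=None, rng=None):
    """Fix a vertex's neighbor list to exactly `size` entries:
    - too few neighbors: pad by repeated sampling with replacement;
    - too many, mean/max-pool aggregation: pick `size` at random;
    - too many, attention aggregation (`scores` given): keep the
      neighbors with the largest attention scores."""
    rng = rng or random.Random(0)
    neighbors = list(neighbors)
    if len(neighbors) < size:
        return neighbors + [rng.choice(neighbors)
                            for _ in range(size - len(neighbors))]
    if scores is None:
        return rng.sample(neighbors, size)
    return sorted(neighbors, key=lambda n: scores[n], reverse=True)[:size]
```

Usage: `sample_neighbors(["a", "b"], 5)` pads to five entries, while `sample_neighbors(["a", "b", "c"], 2, scores={"a": 0.1, "b": 0.9, "c": 0.5})` keeps the two highest-scoring neighbors.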
Further, in the aggregation layer, the information contained in a vertex itself propagates between neighbor vertices at two levels through the GCN for high-order interaction. Previous GCN-type models used for recommendation, however, treat the attribute information of the recommended item and the structural information of the corresponding graph vertex as homogeneous information and feed them into the model as a whole, ignoring the influence that the item's different information sources have on the representation learning process. For this, the present invention designs an integration layer that integrates input information from different sources under the same modality:

H_{m,v} = f_merge(h_{m,v}, x_{m,v}, h_{v,id}),  H_{m,v} ∈ R^{d_m}

where f_merge(·) is the integration function and the output H_{m,v} of the integration layer is the representation vector of vertex v under modality m. Here h_{m,v} ∈ R^{d_m} (a vector of dimension d_m in the real domain R) is the output of the aggregation layer for vertex v under modality m, representing the vertex's high-order aggregated information; x_{m,v} is the original information of the vertex under modality m, which can be regarded as zeroth-order information; and h_{v,id} is the embedding vector of vertex v obtained by a graph embedding method on the "user-short video" bipartite graph, which can be treated as equivalent to a representation vector of the vertex's structural information. The function of the integration layer in the model is to integrate the low-order information (the vertex's own attribute information) and the high-order information (neighborhood information) of the target vertex under a specific modality. The present invention designs two integration functions for this purpose:
(1) Hierarchical integration: the vertex's original information and its ID embedding are defined as the vertex's low-order information; their element-wise concatenation, passed through one feed-forward neural network layer, yields a vector defined as the low-order representation containing the vertex's structure and content information:

h_{m,v,low} = LeakyReLU(W_merge [x_{m,v}, h_id] + b)

where W_merge is the parameter matrix of the integration layer's single-layer neural network and b is the bias. The vertex's low-order representation h_{m,v,low} is concatenated with the vertex's high-order information h_{m,v} as the output of the integration layer:

H_{m,v} = [h_{m,v,low}, h_{m,v}]
(2) Outer-product integration: the present invention divides a vertex's information under a specific modality into two classes, content information and structural information, crosses the vectors of the two classes via an outer product, and finally passes the result through one feed-forward neural network layer:

H_{m,v} = LeakyReLU(W_m (h_{m,v,c} ⊗ h_{v,s}) + b_m)

where h_{m,v,c} is the content information, h_{v,s} is the structural information, W_m is the parameter matrix learned by the integration layer, and b_m is the bias.
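The outer-product crossing can be sketched as follows; flattening the crossed matrix before the feed-forward layer, together with the all-ones W and example vectors, are assumptions for illustration.

```python
import numpy as np

def outer_product_merge(h_content, h_struct, W, b, slope=0.2):
    """Outer-product integration: cross the content and structure
    vectors with an outer product (a 2-D feature map), flatten it, and
    map it back to a 1-D vector with one feed-forward layer
    (LeakyReLU)."""
    crossed = np.outer(h_content, h_struct).ravel()  # feature crossing
    z = W @ crossed + b
    return np.where(z > 0, z, slope * z)

h_c, h_s = np.array([1.0, 2.0]), np.array([3.0, 4.0])
W, b = np.ones((3, 4)), np.zeros(3)
H = outer_product_merge(h_c, h_s, W, b)  # → array([21., 21., 21.])
```

Compared with plain concatenation, the outer product lets every content dimension interact multiplicatively with every structure dimension before the learned projection.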
Further, the integration layer integrates each vertex's different-source data under each specific modality, yielding the representation vectors of the user vertices and short video vertices under the different modalities. The multiple modal representation vectors of each vertex (user vertices and short video vertices) are then fused:

z_u = [H_{V,u}, H_{T,u}, H_{A,u}],  u ∈ 𝒰;  z_i = [H_{V,i}, H_{T,i}, H_{A,i}],  i ∈ ℐ

where 𝒰 and ℐ denote the set of user vertices and the set of short video vertices in the "user-short video" bipartite graph, respectively. For a user vertex u, its output z_u at the fusion layer is obtained by concatenating the integration layer output vectors H_{V,u}, H_{T,u}, and H_{A,u} under the visual, textual, and auditory modalities (V, T, A); similarly, for a short video vertex i, its fusion layer output z_i is obtained by concatenating the integration layer output vectors H_{V,i}, H_{T,i}, and H_{A,i} under the three modalities.
To model user vectors at a finer granularity, so that vertices close to each other in the "user-short video" bipartite graph have more similar representations while mutually separated vertices have more distinguishable ones, the fusion layer of the present invention uses negative sampling for unsupervised optimization. A short video vertex i_p that has a direct edge to the user vertex u in the "user-short video" bipartite graph is defined as a positive sample; a negative sample is defined as a short video vertex i_n of relatively high degree in the bipartite graph that has no direct edge to the target user vertex. The rationale is that a high-degree short video vertex has been interacted with many times and can be regarded as a popular item; if a user shows no behavior toward a popular item, this is generally taken as stronger evidence that the user is not interested in it. Based on experiments, to keep the numbers of positive and negative samples balanced, both are set to Q = 20, a ratio of 1:1, with negative samples drawn at random from the top 15% of vertices by degree. The loss function designed for optimization is

L = − Σ_{(u,i_p)} log σ(z_u · z_{i_p}) − Σ_{(u,i_n)} log σ(−z_u · z_{i_n})

where σ(·) is the sigmoid function, (u, i_p) denotes a "user-short video" pair formed with a short video vertex i_p that user u has interacted with, and (u, i_n) denotes that the short video vertex i_n has had no interaction with user vertex u and is selected as a negative sample.
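A minimal sketch of this negative-sampling objective (the concrete vectors below are fabricated toy values, and stacking the Q positive/negative vectors as array rows is an implementation assumption):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ns_loss(z_u, z_pos, z_neg):
    """Negative-sampling loss: pull the user vector toward interacted
    (positive) short video vectors and push it away from sampled
    popular, non-interacted (negative) ones. z_pos and z_neg hold the
    Q positive / Q negative video vectors as rows."""
    pos = -np.log(sigmoid(z_pos @ z_u)).sum()
    neg = -np.log(sigmoid(-(z_neg @ z_u))).sum()
    return pos + neg

z_u = np.array([1.0, 0.0])
z_pos = np.array([[1.0, 0.0], [0.9, 0.1]])    # aligned with the user
z_neg = np.array([[-1.0, 0.0], [-0.8, 0.2]])  # pushed away from the user
loss = ns_loss(z_u, z_pos, z_neg)
```

Swapping the positive and negative sets increases the loss, which is the gradient signal that separates interacted videos from popular but ignored ones.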
Further, the inner product of the optimized user vector z_u and the short video vector z_i to be inferred is taken, and the output is the probability p(interact) that the user will interact with the short video:

p(interact) = σ(z_u · z_i)

where the candidate short video i is one that has not been interacted with by user u.
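The final recommendation step can be sketched as scoring the non-interacted candidates and returning the top-k (the function name, dictionary layout, and toy vectors are assumptions for illustration):

```python
import numpy as np

def recommend(z_u, video_vecs, interacted, top_k=2):
    """Score each candidate short video the user has not interacted
    with by the sigmoid of its inner product with the user vector, and
    return the top-k video ids by score."""
    scores = {vid: 1.0 / (1.0 + np.exp(-(z_u @ z)))
              for vid, z in video_vecs.items() if vid not in interacted}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

z_u = np.array([1.0, 0.0])
videos = {"v1": np.array([0.9, 0.1]),
          "v2": np.array([-0.5, 0.3]),
          "v3": np.array([0.4, 0.0]),
          "v4": np.array([1.0, 0.0])}   # already interacted, excluded
recs = recommend(z_u, videos, interacted={"v4"})  # → ['v1', 'v3']
```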
Technical effects:
(1) Vertex representation learning is carried out by constructing modality-level "user-short video" bipartite graphs. Because of the "semantic gap" between modalities in multimodal data, existing methods that apply graph convolutional networks to recommendation have difficulty distinguishing the differences in the information carried by different modalities and modeling them separately. By constructing a separate bipartite graph and a corresponding graph convolutional network for each modality of the short videos, the present invention learns vector representations of user and short video vertices under the different modalities, achieving fine-grained personalized recommendation for users.
(2) A bi-level aggregation operation is performed on the vertices (user vertices and short video vertices) in the aggregation layer to quantify the influence of a vertex's neighbors and model the vertex's high-order representation. As the number of GCN layers increases, the efficiency of information transfer from high-order neighbors gradually decreases; high-order neighbor vertex information is prone to vanishing gradients during propagation and thus has difficulty contributing to the target vertex's representation learning. Inspired by the use of skip connections in convolutional neural networks to add information pathways and suppress vanishing gradients, the present invention performs a second-level aggregation operation between the target vertex and its second-order neighbors in the graph, strengthening the role of the second-order neighbor information in the target vertex's representation learning and preserving the integrity of high-order neighbor information transfer.
(3) The idea of the multi-head attention mechanism is introduced into the aggregation layer to construct the aggregation function. Compared with the mean aggregation and max-pooling aggregation methods commonly used in existing graph convolutional networks, the attention-based method of the present invention uses the attention scores between vertices as the measure during aggregation and takes the correlation constraints between vertex features into account, filtering out irrelevant neighbor information and strengthening the influence of relevant neighbors on the target vertex; introducing multi-head attention is equivalent to ensemble learning over multiple attention aggregation operations, making the learned vertex representation vectors more robust.
(4) An outer-product operation is applied to the vertex's content vector and structure vector in the integration layer. In the present invention, a graph embedding method is applied to the bipartite graph to learn a representation of the target vertex's topological structure in the graph as its structure vector; the target vertex's original attribute vector and its high-order representation vector from the aggregation layer are concatenated into the vertex's content vector. Taking the outer product of the two amounts, from a data point of view, to a feature-dimension expansion that maps the two one-dimensional representation vectors into a two-dimensional planar space; one feed-forward neural network layer then converts the result into a one-dimensional output vector H_{m,v} ∈ R^d containing both kinds of information, achieving the integration of the target vertex's different information sources.
The present invention performs vertex representation learning by constructing modality-level "user-short video" bipartite graphs; alternative variants may instead construct modality-level single-type vertex graphs, such as "user-user" or "short video-short video" graphs, and use graph convolutional networks to learn representations of user or short video vertices. The present invention performs a two-level (first-order and second-order) aggregation operation on the vertices (user vertices and short video vertices) in the aggregation layer to quantify neighbor influence and model the vertices' high-order representations; a variant may perform representation learning through higher-order (third-order or above) aggregation over the vertices.
Further, as shown in Figure 7, based on the above graph model-based short video recommendation method, the present invention correspondingly provides an intelligent terminal, which includes a processor 10, a memory 20, and a display 30. Figure 7 shows only some of the components of the intelligent terminal; it should be understood that implementing all of the shown components is not required, and more or fewer components may be implemented instead.
In some embodiments, the memory 20 may be an internal storage unit of the intelligent terminal, such as its hard disk or internal memory. In other embodiments, the memory 20 may also be an external storage device of the intelligent terminal, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the intelligent terminal. Further, the memory 20 may include both an internal storage unit and an external storage device of the intelligent terminal. The memory 20 is used to store application software installed on the intelligent terminal and various kinds of data, such as the program code installed on the intelligent terminal, and may also be used to temporarily store data that has been output or is to be output. In one embodiment, the memory 20 stores a graph model-based short video recommendation program 40, which can be executed by the processor 10 so as to implement the graph model-based short video recommendation method of this application.
In some embodiments, the processor 10 may be a central processing unit (CPU), a microprocessor, or another data processing chip, used to run the program code stored in the memory 20 or to process data, for example to execute the graph model-based short video recommendation method.
所述显示器30在一些实施例中可以是LED显示器、液晶显示器、触控式液 晶显示器以及OLED(Organic Light-Emitting Diode,有机发光二极管)触摸器等。所述显示器30用于显示在所述智能终端的信息以及用于显示可视化的用户界面。所述智能终端的部件10-30通过***总线相互通信。In some embodiments, the display 30 may be an LED display, a liquid crystal display, a touch liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light-emitting diode) touch device, and the like. The display 30 is used for displaying information on the smart terminal and for displaying a visualized user interface. The components 10-30 of the smart terminal communicate with each other via a system bus.
In one embodiment, when the processor 10 executes the graph model-based short video recommendation program 40 in the memory 20, the following steps are implemented:
constructing, according to the users' interaction behavior with short videos, a bipartite graph of the correspondence between users and short videos;
an aggregation layer outputting a high-order representation vector of a target vertex by aggregating the neighborhood information of the target vertex;
an integration layer integrating the target vertex's own information with its neighborhood information;
a fusion layer fusing the multiple modal information of the target vertex;
an output layer computing the degree of similarity between a user vector and a short video vector, predicting the probability that the user will interact with the short video, and recommending short videos to the user.
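The output-layer step above can be sketched as follows. This is a minimal illustration only: the function and variable names are our own, and inner-product scoring followed by a sigmoid is one common choice for "similarity plus interaction probability", not necessarily the exact form used by the invention.

```python
import numpy as np

def predict_interaction(user_vec, video_vecs):
    """Score candidate short videos for one user (illustrative sketch).

    Similarity is taken as the inner product between the user vector and
    each short video vector; a sigmoid maps each score to the predicted
    probability of an interaction (a full watch or a like).
    """
    scores = video_vecs @ user_vec            # one similarity score per candidate
    probs = 1.0 / (1.0 + np.exp(-scores))     # sigmoid -> interaction probability
    ranking = np.argsort(-probs)              # highest probability first
    return probs, ranking

user = np.array([0.2, -0.1, 0.4])
videos = np.array([[0.1, 0.0, 0.5],     # points roughly the same way as the user
                   [-0.3, 0.2, -0.4]])  # points the opposite way
probs, order = predict_interaction(user, videos)
```

Under this scoring, the first candidate (whose vector is more similar to the user's) receives the higher interaction probability and is recommended first.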
The interaction behavior is defined as the user watching a short video in full or liking the watched short video.
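With that definition, constructing the user-video bipartite graph reduces to filtering an interaction log. The sketch below uses a hypothetical log format of our own; the edge rule follows the interaction definition above.

```python
from collections import defaultdict

def build_bipartite_graph(logs):
    """Build the user-video bipartite graph from an interaction log.

    An edge (user, video) exists iff the user watched the video in full
    or liked it, matching the interaction definition above.
    """
    user_adj, video_adj = defaultdict(set), defaultdict(set)
    for user, video, watched_in_full, liked in logs:
        if watched_in_full or liked:
            user_adj[user].add(video)
            video_adj[video].add(user)
    return user_adj, video_adj

# Hypothetical log entries: (user, video, watched_in_full, liked).
logs = [("u1", "v1", True,  False),
        ("u1", "v2", False, True),
        ("u2", "v1", False, False),   # partial watch, no like -> no edge
        ("u2", "v3", True,  True)]
user_adj, video_adj = build_bipartite_graph(logs)
```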
Constructing the bipartite graph of the correspondence between users and short videos according to the users' interaction behavior further includes:
constructing a modality-level bipartite graph of the correspondence between users and short videos.
A short video includes visual modal information, textual modal information, and auditory modal information;
the visual modal information is represented by a 128-dimensional vector obtained by passing the video cover image through a convolutional neural network;
the textual modal information is represented by a 128-dimensional vector obtained by segmenting the video title text and vectorizing it with a natural language processing model;
the auditory modal information is represented by a 128-dimensional vector obtained by truncating the background music and speech and passing them through a convolutional neural network.
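The shape of these modality features can be sketched as follows. The encoder here is a deterministic placeholder of our own, standing in for the real encoders (the CNN over the cover image, the tokenizer plus NLP model over the title, and the CNN over the truncated audio track); only the interface and the 128-dimensional output are taken from the description.

```python
import numpy as np

DIM = 128  # every modality is embedded into a 128-dimensional vector

def placeholder_encoder(modality, raw_input):
    """Deterministic stand-in for the per-modality neural encoders."""
    rng = np.random.default_rng(abs(hash((modality, raw_input))) % 2**32)
    return rng.standard_normal(DIM)

def video_modal_features(cover_image, title, audio_clip):
    # One 128-dim vector per modality, keyed by modality name.
    return {
        "visual": placeholder_encoder("visual", cover_image),
        "text": placeholder_encoder("text", title),
        "audio": placeholder_encoder("audio", audio_clip),
    }

features = video_modal_features("cover_001.jpg", "cat plays piano", "audio_001.wav")
```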
The aggregation layer is used to aggregate the neighborhood information of the target vertex to obtain a vector characterizing the target neighborhood; each aggregation operation consists of neighborhood aggregation and nonlinear processing.
The neighborhood aggregation is: performing an aggregation operation on the neighborhood of the target vertex through an aggregation function.
The nonlinear processing is: obtaining the first-order and second-order neighborhood information of the target vertex through the neighborhood aggregation operation, concatenating the original information of the target vertex with its neighborhood information, and feeding the result into a single-layer neural network to obtain the high-order features of the target vertex.
The aggregation function may be constructed by mean aggregation, max-pooling aggregation, or attention-mechanism aggregation.
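The three aggregation-function constructions can be sketched as follows; these are standard textbook forms of mean, max-pooling, and attention aggregation over a neighborhood matrix (one row per neighbor), and the exact parameterization used by the invention may differ.

```python
import numpy as np

def mean_aggregate(neighbors):
    # Mean aggregation: element-wise average of the neighbor vectors.
    return neighbors.mean(axis=0)

def maxpool_aggregate(neighbors, W, b):
    # Max-pooling aggregation: transform each neighbor vector, then take
    # the element-wise maximum over the neighborhood.
    return np.tanh(neighbors @ W + b).max(axis=0)

def attention_aggregate(target, neighbors):
    # Attention aggregation: weight each neighbor by its softmax-normalized
    # inner-product similarity to the target vertex.
    logits = neighbors @ target
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ neighbors

neighbors = np.array([[0.0, 2.0],
                      [2.0, 0.0]])
target = np.array([1.0, 0.0])
```

With these inputs, mean aggregation yields [1, 1], while attention aggregation pulls the result toward the neighbor more similar to the target.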
The integration layer is used to integrate input information from different sources within the same modality, and to integrate the low-order and high-order information of the target vertex in a given modality, to obtain the representation vectors of user vertices and short video vertices in the different modalities;
the fusion layer is used to fuse the multiple modal representation vectors of the user vertices and short video vertices.
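The integration and fusion steps can be sketched as follows. The concrete operators (concatenation plus a single ReLU layer for integration, plain concatenation for fusion) are our own illustrative assumptions; weighted sums or gated combinations are equally plausible realizations.

```python
import numpy as np

rng = np.random.default_rng(0)

def integrate(low_order, high_order, W, b):
    """Integration: combine a vertex's own (low-order) features with its
    neighborhood-derived (high-order) features within one modality, here
    by concatenation followed by a single ReLU layer."""
    return np.maximum(0.0, np.concatenate([low_order, high_order]) @ W + b)

def fuse(modal_vectors):
    """Fusion: merge the per-modality representations into one vertex
    vector; plain concatenation is one simple choice."""
    return np.concatenate(modal_vectors)

d = 8  # toy embedding size for the sketch
low, high = rng.standard_normal(d), rng.standard_normal(d)
W, b = rng.standard_normal((2 * d, d)), rng.standard_normal(d)
visual = integrate(low, high, W, b)
text = integrate(low, high, W, b)
audio = integrate(low, high, W, b)
vertex_vec = fuse([visual, text, audio])
```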
The present invention further provides a storage medium, wherein the storage medium stores a graph model-based short video recommendation program which, when executed by a processor, implements the steps of the graph model-based short video recommendation method described above.
In summary, the present invention provides a graph model-based short video recommendation method, an intelligent terminal, and a storage medium. The method includes: constructing, according to the users' interaction behavior with short videos, a bipartite graph of the correspondence between users and short videos; an aggregation layer outputting a high-order representation vector of a target vertex by aggregating its neighborhood information; an integration layer integrating the target vertex's own information with its neighborhood information; a fusion layer fusing the multiple modal information of the target vertex; and an output layer computing the similarity between user vectors and short video vectors, predicting the probability that a user will interact with a short video, and recommending short videos to the user. By propagating and fusing multi-modal information over the user-video bipartite graph, the present invention can accurately model user preferences and produce precise short video recommendations.
Of course, those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing relevant hardware (such as a processor or controller). The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the foregoing method embodiments. The storage medium may be a memory, a magnetic disk, an optical disc, or the like.
It should be understood that the application of the present invention is not limited to the above examples; those of ordinary skill in the art can make improvements or variations based on the above description, and all such improvements and variations shall fall within the protection scope of the appended claims of the present invention.

Claims (10)

  1. A graph model-based short video recommendation method, characterized in that the method comprises:
    constructing, according to the users' interaction behavior with short videos, a bipartite graph of the correspondence between users and short videos;
    an aggregation layer outputting a high-order representation vector of a target vertex by aggregating the neighborhood information of the target vertex;
    an integration layer integrating the target vertex's own information with its neighborhood information;
    a fusion layer fusing the multiple modal information of the target vertex;
    an output layer computing the degree of similarity between a user vector and a short video vector, predicting the probability that the user will interact with the short video, and recommending short videos to the user.
  2. The graph model-based short video recommendation method according to claim 1, characterized in that the interaction behavior is defined as the user watching a short video in full or liking the watched short video.
  3. The graph model-based short video recommendation method according to claim 1, characterized in that constructing the bipartite graph of the correspondence between users and short videos according to the users' interaction behavior further comprises:
    constructing a modality-level bipartite graph of the correspondence between users and short videos.
  4. The graph model-based short video recommendation method according to claim 3, characterized in that the short video comprises visual modal information, textual modal information, and auditory modal information;
    the visual modal information is represented by a 128-dimensional vector obtained by passing the video cover image through a convolutional neural network;
    the textual modal information is represented by a 128-dimensional vector obtained by segmenting the video title text and vectorizing it with a natural language processing model;
    the auditory modal information is represented by a 128-dimensional vector obtained by truncating the background music and speech and passing them through a convolutional neural network.
  5. The graph model-based short video recommendation method according to claim 1, characterized in that the aggregation layer is used to aggregate the neighborhood information of the target vertex to obtain a vector characterizing the target neighborhood, each aggregation operation consisting of neighborhood aggregation and nonlinear processing.
  6. The graph model-based short video recommendation method according to claim 5, characterized in that the neighborhood aggregation is: performing an aggregation operation on the neighborhood of the target vertex through an aggregation function;
    the nonlinear processing is: obtaining the first-order and second-order neighborhood information of the target vertex through the neighborhood aggregation operation, concatenating the original information of the target vertex with its neighborhood information, and feeding the result into a single-layer neural network to obtain the high-order features of the target vertex.
  7. The graph model-based short video recommendation method according to claim 6, characterized in that the aggregation function may be constructed by mean aggregation, max-pooling aggregation, or attention-mechanism aggregation.
  8. The graph model-based short video recommendation method according to claim 1, characterized in that the integration layer is used to integrate input information from different sources within the same modality, and to integrate the low-order and high-order information of the target vertex in a given modality, to obtain the representation vectors of user vertices and short video vertices in the different modalities;
    the fusion layer is used to fuse the multiple modal representation vectors of the user vertices and short video vertices.
  9. An intelligent terminal, characterized in that the intelligent terminal comprises: a memory, a processor, and a graph model-based short video recommendation program stored in the memory and executable on the processor, wherein the graph model-based short video recommendation program, when executed by the processor, implements the steps of the graph model-based short video recommendation method according to any one of claims 1-8.
  10. A storage medium, characterized in that the storage medium stores a graph model-based short video recommendation program which, when executed by a processor, implements the steps of the graph model-based short video recommendation method according to any one of claims 1-8.
PCT/CN2020/125527 2020-03-10 2020-10-30 Graph model-based short video recommendation method, intelligent terminal and storage medium WO2021179640A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010161605.4A CN111382309B (en) 2020-03-10 2020-03-10 Short video recommendation method based on graph model, intelligent terminal and storage medium
CN202010161605.4 2020-03-10

Publications (1)

Publication Number Publication Date
WO2021179640A1 true WO2021179640A1 (en) 2021-09-16

Family

ID=71217236

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/125527 WO2021179640A1 (en) 2020-03-10 2020-10-30 Graph model-based short video recommendation method, intelligent terminal and storage medium

Country Status (2)

Country Link
CN (1) CN111382309B (en)
WO (1) WO2021179640A1 (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113987200A (en) * 2021-10-19 2022-01-28 云南大学 Recommendation method, system, terminal and medium combining neural network with knowledge graph
CN114153997A (en) * 2022-02-09 2022-03-08 中国传媒大学 Audio-visual recommendation system and method based on bilinear perception map neural network model
CN114647785A (en) * 2022-03-28 2022-06-21 北京工业大学 Short video praise quantity prediction method based on emotion analysis
CN114693397A (en) * 2022-03-16 2022-07-01 电子科技大学 Multi-view multi-modal commodity recommendation method based on attention neural network
CN114707427A (en) * 2022-05-25 2022-07-05 青岛科技大学 Personalized modeling method of graph neural network based on effective neighbor sampling maximization
CN114841778A (en) * 2022-05-23 2022-08-02 安徽农业大学 Commodity recommendation method based on dynamic graph neural network
CN115119013A (en) * 2022-03-26 2022-09-27 泰州可以信息科技有限公司 Multi-stage data machine control application system
CN115130663A (en) * 2022-08-30 2022-09-30 中国海洋大学 Heterogeneous network attribute completion method based on graph neural network and attention mechanism
CN118042229A (en) * 2024-01-24 2024-05-14 常州力开智能科技有限公司 Interactive network television service method and system
CN118069881A (en) * 2024-04-25 2024-05-24 山东科技大学 Music recommendation method based on heterogeneous graph model

Families Citing this family (14)

Publication number Priority date Publication date Assignee Title
CN111382309B (en) * 2020-03-10 2023-04-18 深圳大学 Short video recommendation method based on graph model, intelligent terminal and storage medium
CN111988668B (en) * 2020-08-28 2021-06-08 腾讯科技(深圳)有限公司 Video recommendation method and device, computer equipment and storage medium
CN112148998B (en) * 2020-09-08 2021-10-26 浙江工业大学 Online social platform user friend recommendation method based on multi-core graph convolutional network
CN113918764B (en) * 2020-12-31 2024-06-25 浙江大学 Movie recommendation system based on cross-modal fusion
CN113190730A (en) * 2021-04-30 2021-07-30 中国人民银行数字货币研究所 Method and device for classifying block chain addresses
CN113344177B (en) * 2021-05-10 2022-10-14 电子科技大学 Depth recommendation method based on graph attention
CN113868519B (en) * 2021-09-18 2023-11-14 北京百度网讯科技有限公司 Information searching method, device, electronic equipment and storage medium
CN115905680A (en) * 2021-09-29 2023-04-04 华为技术有限公司 Recommendation method and related device
CN116150425A (en) * 2021-11-19 2023-05-23 腾讯科技(深圳)有限公司 Recommended content selection method, apparatus, device, storage medium and program product
CN114385921B (en) * 2022-01-13 2023-03-24 中建电子商务有限责任公司 Bidding recommendation method, system, equipment and storage medium
CN114692007B (en) * 2022-06-01 2022-08-23 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for determining representation information
CN116932887B (en) * 2023-06-07 2024-06-18 哈尔滨工业大学(威海) Image recommendation system and method based on multi-modal image convolution
CN116561446B (en) * 2023-07-10 2023-10-20 中国传媒大学 Multi-mode project recommendation method, system and device and storage medium
CN117112834B (en) * 2023-10-24 2024-02-02 苏州元脑智能科技有限公司 Video recommendation method and device, storage medium and electronic device

Citations (4)

Publication number Priority date Publication date Assignee Title
CN110337016A (en) * 2019-06-13 2019-10-15 山东大学 Short-sighted frequency personalized recommendation method and system based on multi-modal figure convolutional network
CN110837578A (en) * 2019-11-06 2020-02-25 合肥工业大学 Video clip recommendation method based on graph convolution network
CN110866184A (en) * 2019-11-11 2020-03-06 湖南大学 Short video data label recommendation method and device, computer equipment and storage medium
CN111382309A (en) * 2020-03-10 2020-07-07 深圳大学 Short video recommendation method based on graph model, intelligent terminal and storage medium

Family Cites Families (8)

Publication number Priority date Publication date Assignee Title
US9378065B2 (en) * 2013-03-15 2016-06-28 Advanced Elemental Technologies, Inc. Purposeful computing
CN105224619B (en) * 2015-09-18 2018-06-05 中国科学院计算技术研究所 A kind of spatial relationship matching process and system suitable for video/image local feature
CN106295564B (en) * 2016-08-11 2019-06-07 南京理工大学 A kind of action identification method of neighborhood Gaussian structures and video features fusion
CN106529419B (en) * 2016-10-20 2019-07-26 北京航空航天大学 The object automatic testing method of saliency stacking-type polymerization
CN108470354B (en) * 2018-03-23 2021-04-27 云南大学 Video target tracking method and device and implementation device
CN108830790B (en) * 2018-05-16 2022-09-13 宁波大学 Rapid video super-resolution reconstruction method based on simplified convolutional neural network
CN109948489A (en) * 2019-03-09 2019-06-28 闽南理工学院 A kind of face identification system and method based on the fusion of video multiframe face characteristic
CN110334245B (en) * 2019-05-20 2023-04-07 山东大学 Short video analysis method and device of graph neural network based on time sequence attribute

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
CN110337016A (en) * 2019-06-13 2019-10-15 山东大学 Short-sighted frequency personalized recommendation method and system based on multi-modal figure convolutional network
CN110837578A (en) * 2019-11-06 2020-02-25 合肥工业大学 Video clip recommendation method based on graph convolution network
CN110866184A (en) * 2019-11-11 2020-03-06 湖南大学 Short video data label recommendation method and device, computer equipment and storage medium
CN111382309A (en) * 2020-03-10 2020-07-07 深圳大学 Short video recommendation method based on graph model, intelligent terminal and storage medium

Non-Patent Citations (1)

Title
YINWEI WEI ; XIANG WANG ; LIQIANG NIE ; XIANGNAN HE ; RICHANG HONG ; TAT-SENG CHUA: "MMGCN: multi-modal graph convolution network for personalized recommendation of micro-video", MM '19: PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 15 October 2019 (2019-10-15), pages 1437 - 1445, XP058442648, ISBN: 978-1-4503-6889-6 *

Cited By (14)

Publication number Priority date Publication date Assignee Title
CN113987200A (en) * 2021-10-19 2022-01-28 云南大学 Recommendation method, system, terminal and medium combining neural network with knowledge graph
CN113987200B (en) * 2021-10-19 2024-03-15 云南大学 Recommendation method, system, terminal and medium for combining neural network with knowledge graph
CN114153997A (en) * 2022-02-09 2022-03-08 中国传媒大学 Audio-visual recommendation system and method based on bilinear perception map neural network model
CN114693397A (en) * 2022-03-16 2022-07-01 电子科技大学 Multi-view multi-modal commodity recommendation method based on attention neural network
CN114693397B (en) * 2022-03-16 2023-04-28 电子科技大学 Attention neural network-based multi-view multi-mode commodity recommendation method
CN115119013A (en) * 2022-03-26 2022-09-27 泰州可以信息科技有限公司 Multi-stage data machine control application system
CN114647785A (en) * 2022-03-28 2022-06-21 北京工业大学 Short video praise quantity prediction method based on emotion analysis
CN114841778A (en) * 2022-05-23 2022-08-02 安徽农业大学 Commodity recommendation method based on dynamic graph neural network
CN114841778B (en) * 2022-05-23 2024-06-04 安徽农业大学 Commodity recommendation method based on dynamic graph neural network
CN114707427A (en) * 2022-05-25 2022-07-05 青岛科技大学 Personalized modeling method of graph neural network based on effective neighbor sampling maximization
CN115130663A (en) * 2022-08-30 2022-09-30 中国海洋大学 Heterogeneous network attribute completion method based on graph neural network and attention mechanism
CN115130663B (en) * 2022-08-30 2023-10-13 中国海洋大学 Heterogeneous network attribute completion method based on graph neural network and attention mechanism
CN118042229A (en) * 2024-01-24 2024-05-14 常州力开智能科技有限公司 Interactive network television service method and system
CN118069881A (en) * 2024-04-25 2024-05-24 山东科技大学 Music recommendation method based on heterogeneous graph model

Also Published As

Publication number Publication date
CN111382309A (en) 2020-07-07
CN111382309B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
WO2021179640A1 (en) Graph model-based short video recommendation method, intelligent terminal and storage medium
Zhang et al. Deep learning based recommender system: A survey and new perspectives
Xu et al. Course video recommendation with multimodal information in online learning platforms: A deep learning framework
Tao et al. Multi-modal knowledge-aware reinforcement learning network for explainable recommendation
US11574145B2 (en) Cross-modal weak supervision for media classification
CN112836120A (en) Multi-mode knowledge graph-based movie recommendation method, system and terminal
US11860928B2 (en) Dialog-based image retrieval with contextual information
US20220253722A1 (en) Recommendation system with adaptive thresholds for neighborhood selection
CN112733027B (en) Hybrid recommendation method based on local and global representation model joint learning
WO2021139415A1 (en) Data processing method and apparatus, computer readable storage medium, and electronic device
CN112417313A (en) Model hybrid recommendation method based on knowledge graph convolutional network
CN114637923A (en) Data information recommendation method and device based on hierarchical attention-graph neural network
CN116975615A (en) Task prediction method and device based on video multi-mode information
Zhang et al. Deep learning for recommender systems
CN115964560A (en) Information recommendation method and equipment based on multi-mode pre-training model
Wang et al. Deep meta-learning in recommendation systems: A survey
CN115238191A (en) Object recommendation method and device
US11983183B2 (en) Techniques for training machine learning models using actor data
US20240037133A1 (en) Method and apparatus for recommending cold start object, computer device, and storage medium
Zhang et al. MULTIFORM: few-shot knowledge graph completion via multi-modal contexts
CN114168804B (en) Similar information retrieval method and system based on heterogeneous subgraph neural network
CN115269984A (en) Professional information recommendation method and system
CN115599990A (en) Knowledge perception and deep reinforcement learning combined cross-domain recommendation method and system
Liu et al. Entity representation learning with multimodal neighbors for link prediction in knowledge graph
Ebesu Deep learning for recommender systems

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20924120

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 110123)

122 Ep: pct application non-entry in european phase

Ref document number: 20924120

Country of ref document: EP

Kind code of ref document: A1