WO2023236469A1 - Video action recognition method and apparatus, electronic device, and storage medium - Google Patents

Video action recognition method and apparatus, electronic device, and storage medium

Info

Publication number
WO2023236469A1
WO2023236469A1 · PCT/CN2022/137036 (CN2022137036W)
Authority
WO
WIPO (PCT)
Prior art keywords
predicted
video
action
behavior
features
Prior art date
Application number
PCT/CN2022/137036
Other languages
French (fr)
Chinese (zh)
Inventor
乔宇
马跃
王亚立
吴悦
吕子钰
陈思然
Original Assignee
深圳先进技术研究院 (Shenzhen Institute of Advanced Technology)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳先进技术研究院 (Shenzhen Institute of Advanced Technology)
Publication of WO2023236469A1 publication Critical patent/WO2023236469A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Definitions

  • The invention belongs to the field of artificial intelligence technology, and in particular relates to a video behavior recognition method, device, electronic device and storage medium.
  • Action recognition is an important problem in video understanding. With the rapid development of deep learning, this line of research has made great progress. However, most existing methods treat action recognition as a high-level video classification problem and focus on designing a backbone for representation learning. In fact, human behavior is the spatiotemporal evolution of body-part movements and object interactions. Without this explicit understanding, these approaches often run into performance bottlenecks when recognizing easily confused actions with complex dynamics.
  • The purpose of the embodiments of this specification is to provide a video behavior recognition method, device, electronic device and storage medium.
  • This application provides a video behavior recognition method, which includes:
  • The pre-built visual knowledge graph is constructed through the following steps:
  • All video-level action subgraphs together constitute the pre-built visual knowledge graph.
  • The action categories also include semantic knowledge.
  • Determining the predicted behavior set according to the predicted action and the pre-built visual knowledge graph includes:
  • according to the predicted action, determining the predicted body parts and predicted interaction objects corresponding to the predicted action.
  • The predicted behavior set includes at least one predicted behavior and a score corresponding to each predicted behavior.
  • After the weights corresponding to all predicted behaviors are summed and normalized, the score corresponding to each predicted behavior is obtained.
  • The weight corresponding to each predicted behavior is determined according to an uncertainty weighting mechanism.
  • Determining the behavior category according to the video features, the predicted action and the predicted behavior set includes:
  • according to the predicted action, determining the predicted body parts and predicted interaction objects corresponding to the predicted action;
  • according to the enhanced features, determining the behavior category.
  • This application provides a video behavior recognition device, which includes:
  • an extraction module, used to obtain the video to be recognized and extract the triplet features, overall image features and video features of the target from the video frames of the video to be recognized;
  • a prediction module, used to input the triplet features and the overall image features into the pre-built mid-level semantic recognition model to predict the target's action and obtain the predicted action;
  • a first determination module, used to determine the predicted behavior set according to the predicted action and the pre-built visual knowledge graph;
  • a second determination module, used to determine the behavior category according to the video features, the predicted action and the predicted behavior set.
  • The present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • When the processor executes the program, the video behavior recognition method of the first aspect is implemented.
  • The present application provides a readable storage medium on which a computer program is stored.
  • When the program is executed by a processor, the video behavior recognition method of the first aspect is implemented.
  • Figure 1 is a schematic flow chart of the video behavior recognition method provided by this application.
  • Figure 2 is a schematic structural diagram of the video behavior recognition device provided by this application.
  • Figure 3 is a schematic structural diagram of an electronic device provided by this application.
  • Behavior recognition refers to determining the behavior category of a target in a video, where the target can be a human or another animal.
  • In this embodiment, the target is a human, so the task can also be called Human Action Recognition.
  • In the related art, commonly used databases segment the actions in advance.
  • A video clip contains one clear action, is short in duration, and has a single definite label. The task can therefore also be regarded as a multi-class classification problem whose input is a video and whose output is an action label.
  • The actions in action recognition databases are generally quite clear, with relatively little surrounding interference; the task is somewhat like the Image Classification task in image analysis.
  • Parts in this application are all parts by mass unless otherwise specified.
  • Referring to Figure 1, a schematic flowchart of the video behavior recognition method provided by an embodiment of the present application is shown.
  • As shown in Figure 1, the video behavior recognition method can include:
  • The video to be recognized refers to the video on which behavior recognition is to be performed.
  • The video can be a video obtained in real time or a video stored in a storage server or storage medium; this is not limited here.
  • The triplet features are body parts, actions (i.e., the movements of body parts) and interaction objects (objects in contact with body parts). It can be understood that the triplet features can be named and defined based on the Kinetics-TPS data set. Since the goal of behavior recognition in this application is not to cover as many human behavior categories as possible, but to master the basic mechanism of constructing visual common-sense knowledge of human behavior, the Kinetics-TPS data set, which systematically studies 24 common human behavior categories, is the preferred data set. For example, Kinetics-TPS provides more than 150,000 part-level annotations for detailed human action understanding, including 7.9 million box annotations for body parts, 7.9 million part action tags (i.e., verbs) and 5,000 interaction objects.
  • The overall image features are the global features of a video frame.
  • For example, the overall image features are static image semantic features extracted with, e.g., ResNet50.
  • A frame extraction tool can be used to extract video frames from the video to be recognized, and all the extracted video frames can constitute a video frame sequence.
  • For example, the video to be recognized can be sampled into 8 video frames, and the frame extraction tool can be FFmpeg, as sketched below.
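  • As an illustration only, here is a minimal sketch of this frame-extraction step, assuming the FFmpeg and ffprobe command-line tools are installed; the even-stride sampling strategy, the helper name and the output pattern are assumptions of this sketch, since the text only states that a tool such as FFmpeg extracts e.g. 8 frames.

    import subprocess

    def extract_frames(video_path: str, out_pattern: str, num_frames: int = 8) -> None:
        """Evenly sample `num_frames` frames from a video (illustrative sketch)."""
        # Count decoded frames with ffprobe (may be slow; assumes a well-formed file).
        probe = subprocess.run(
            ["ffprobe", "-v", "error", "-count_frames", "-select_streams", "v:0",
             "-show_entries", "stream=nb_read_frames",
             "-of", "default=nokey=1:noprint_wrappers=1", video_path],
            capture_output=True, text=True, check=True)
        total = int(probe.stdout.strip())
        stride = max(1, total // num_frames)
        # Keep one frame every `stride` frames and stop after num_frames images.
        subprocess.run(
            ["ffmpeg", "-i", video_path,
             "-vf", f"select=not(mod(n\\,{stride}))",
             "-vsync", "vfr", "-frames:v", str(num_frames),
             out_pattern],  # e.g. "frames/%02d.jpg" (directory must exist)
            check=True)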
  • The Kinetics-TPS data set annotates the people in each video frame in detail; the annotation information of each human position box includes the body parts of the person, the action corresponding to each body part, and the position-box labels of the interaction objects.
  • The human position boxes can be cropped from the video frames using ROI pooling (Region of Interest pooling).
  • A visual learner can be used to extract the overall image features of the target in a video frame.
  • The visual learner can use a neural network feature extractor.
  • For example, the visual learner can use feature extractors such as ResNet (Residual Network), VGG16 (Visual Geometry Group), GoogLeNet or SqueezeNet; a sketch with a ResNet50 trunk follows.
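  • A hedged sketch of such a visual learner, assuming PyTorch and torchvision are available: the classification head of an ImageNet-pretrained ResNet50 is dropped and the pooled 2048-dimensional activation is taken as the global image feature. The weight choice and the preprocessing pipeline are illustrative assumptions, not part of the application.

    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image

    # ResNet50 trunk without the final fc layer: one 2048-d feature per frame.
    resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    backbone = torch.nn.Sequential(*list(resnet.children())[:-1])
    backbone.eval()

    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    @torch.no_grad()
    def global_feature(frame: Image.Image) -> torch.Tensor:
        x = preprocess(frame).unsqueeze(0)  # (1, 3, 224, 224)
        return backbone(x).flatten(1)       # (1, 2048) global image feature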
  • Video features can be extracted by a video learner, which can flexibly build human-level representations on each frame by adaptively aggregating their part-level features.
  • The video learner can use a video feature extractor.
  • For example, the video feature extractor can use TSN (Temporal Segment Networks), TSM (Temporal Shift Module), TEA (Temporal Excitation and Aggregation), etc.
  • For example, the video features are temporal video semantic features extracted with TSN; a sketch of TSN-style sampling and aggregation follows.
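  • A minimal sketch of the TSN-style temporal scheme, assuming per-frame features such as those produced by the trunk above: frames are sparsely sampled from equal segments and fused by an average consensus. The helper names are assumptions of this sketch.

    import torch

    def sample_segment_indices(num_frames: int, num_segments: int = 8) -> list:
        # TSN-style sparse sampling: one frame from the centre of each of
        # `num_segments` equal temporal chunks.
        seg = num_frames / num_segments
        return [min(num_frames - 1, int(seg * i + seg / 2)) for i in range(num_segments)]

    def tsn_consensus(frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (num_segments, feat_dim) per-segment features.
        # Average consensus yields a temporal, video-level feature.
        return frame_feats.mean(dim=0)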
  • The pre-built mid-level semantic recognition model is a pre-built neural network model, used to predict the action corresponding to the target based on the input triplet features and overall image features.
  • For example, the neural network model can adopt a self-attention network model, along the lines of the sketch below.
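  • A hedged sketch of such a self-attention model in PyTorch: body-part ROI features and the global frame feature are encoded together, and a head predicts a part action (verb) per body part. The layer sizes, head count and the 84-way verb head (matching the 84 verb nodes mentioned below) are assumptions of this sketch, since the application only names a self-attention network model.

    import torch
    import torch.nn as nn

    class MidLevelSemanticModel(nn.Module):
        def __init__(self, feat_dim: int = 2048, d_model: int = 512,
                     num_verbs: int = 84, nhead: int = 8):
            super().__init__()
            self.proj = nn.Linear(feat_dim, d_model)
            layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.verb_head = nn.Linear(d_model, num_verbs)

        def forward(self, part_feats: torch.Tensor, global_feat: torch.Tensor):
            # part_feats: (B, P, feat_dim) ROI features for P body parts
            # global_feat: (B, feat_dim) whole-frame context feature
            tokens = torch.cat([global_feat.unsqueeze(1), part_feats], dim=1)
            h = self.encoder(self.proj(tokens))   # self-attention over all tokens
            return self.verb_head(h[:, 1:])       # (B, P, num_verbs) verb logits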
  • The pre-built visual knowledge graph is a visual knowledge graph, built in advance, that summarizes the individual subgraphs of all human annotations to reflect the visual common-sense knowledge of human behavior in videos.
  • The pre-built visual knowledge graph is constructed through the following steps:
  • All video-level action subgraphs together constitute the pre-built visual knowledge graph.
  • The action categories also include semantic knowledge.
  • A knowledge graph is essentially a kind of knowledge base called a semantic network, that is, a knowledge base with a directed graph structure. In other words, a knowledge graph is a data structure composed of entities, relationships and attributes.
  • The data set can be the Kinetics-TPS data set, the Kinetics data set, the Action Genome data set, etc.
  • For example, the Kinetics-TPS data set is used to construct the pre-built visual knowledge graph; that is, the annotation files of the Kinetics-TPS data set are used as the raw data.
  • It can be understood that the annotation information in the data set can be deduplicated first.
  • The annotation information referred to below can be understood as the annotation information after deduplication.
  • In each such subgraph, nodes refer to body parts, verbs and interaction objects, and edges refer to the visual connections between them.
  • Video-level subgraphs are then obtained by integrating, for each video, the subgraphs of all annotated human instances and merging repeated paths. To reflect the action category of the video, an additional action node is added and connected to all nodes in the subgraph, resulting in a video-level action subgraph. Finally, all video-level action subgraphs in TPS are combined. This generates a visual knowledge graph containing 181 nodes and 4,532 edges, including 10 body-part nodes, 84 verb nodes, 73 interaction-object nodes and 24 action nodes.
  • Verbs are divided into transitive verbs (vt) and intransitive verbs (vi); their part of speech determines whether they interact with an object, giving the two patterns part-vt-object and part-vi. For each triplet, its nodes are connected with the action node corresponding to the video, finally forming a video-level action subgraph composed of two types of edge chains: <part, vt, object, action> and <part, vi, action>. A construction sketch follows.
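  • A minimal construction sketch using networkx, under the assumption that the annotation files have been parsed into (action, part, verb, object) tuples with object set to None for intransitive verbs; merging of repeated paths falls out of edge deduplication in the graph.

    import networkx as nx

    def build_knowledge_graph(annotations) -> nx.Graph:
        """annotations: iterable of (action, part, verb, obj) tuples; obj may be None."""
        g = nx.Graph()
        for action, part, verb, obj in annotations:
            g.add_edge(part, verb)        # part -- verb (both vt and vi cases)
            g.add_edge(action, part)      # the action node links to every node
            g.add_edge(action, verb)      # of its video-level subgraph
            if obj is not None:           # transitive case: part-vt-object
                g.add_edge(verb, obj)
                g.add_edge(action, obj)
        return g                          # repeated edges are merged automatically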
  • Video-level action subgraphs can also be further extended with semantic descriptions of human actions to provide rich and diverse contextual information.
  • Specifically, each action node is equipped with a text description from Wikipedia, which provides semantic knowledge. The pre-built visual knowledge graph therefore integrates visual and semantic action knowledge and leverages their collaboration to understand human actions in videos effectively.
  • The pre-built visual knowledge graph can be stored using RDF (Resource Description Framework), Neo4j or other methods; no restriction is placed on this here.
  • In this embodiment, the reasoning process over the pre-built visual knowledge graph has good interpretability, and the semantic knowledge in it is clear.
  • Determining the predicted behavior set according to the predicted action and the pre-built visual knowledge graph may include:
  • according to the predicted action, determining the predicted body parts and predicted interaction objects corresponding to the predicted action;
  • matching the predicted action, predicted body parts and predicted interaction objects against the pre-built visual knowledge graph to obtain the predicted behavior set.
  • The predicted behavior set includes at least one predicted behavior and a score corresponding to each predicted behavior. It can be understood that a predicted behavior is a predicted action category.
  • Matching the predicted action, predicted body parts and predicted interaction objects against the pre-built visual knowledge graph to obtain the predicted behavior set may include:
  • matching them against the graph to determine the weight corresponding to each predicted behavior, then summing and normalizing the weights so that the score corresponding to each predicted behavior is obtained.
  • Optionally, the weight corresponding to each predicted behavior can be determined based on an uncertainty weighting mechanism.
  • Specifically, the predicted action, predicted body parts and predicted interaction objects are together referred to as a prediction triplet.
  • After the obtained weights are summed and normalized, the score corresponding to each predicted behavior is obtained. Understandably, the scores of all predicted behaviors sum to 1.
  • S140. Determining the behavior category based on the video features, the predicted action and the predicted behavior set may include:
  • according to the predicted action, determining the predicted body parts and predicted interaction objects corresponding to the predicted action;
  • according to the enhanced features, determining the behavior category.
  • A multi-modal representation is based on single-modal representations and constrains the results of the single-modal representations.
  • Multi-modal representation refers to processing the information of each modality with modality-joint semantic representation or modality-constrained semantic representation, so that modal information with the same or similar semantics also has the same or similar representation results.
  • In this application, visual features and text features are mainly fused.
  • The pre-built high-level semantic recognition model is a high-level semantic recognition model, built in advance, used to fuse the video features and the semantic action knowledge; that is, by combining human behavior information with the temporal and spatial connections between people in the video, multi-modal features are obtained.
  • The high-level semantic recognition model can perform fusion using structures such as graph convolutional neural networks, tree fusion or transformers.
  • The enhanced features can be determined using a GCN (Graph Convolutional Network); that is, a GCN is used to learn the relationships of the human-body information across multiple video frames, so that the associations between people become closer and the information between frames is enhanced.
  • The obtained enhanced features are input into a classification layer to obtain the classification probability of each action category, and the behavior category can be determined from the classification probabilities. For example, the action category with the highest classification probability may be determined as the behavior category.
  • The activation functions in all the above neural networks are nonlinear activation functions.
  • For example, the ReLU activation function can be used.
  • The embodiments of this application use a pre-built visual knowledge graph to perform uncertainty reasoning on behavior, combine it with visual information, and then perform behavior classification, which can improve the classification effect.
  • The embodiments of this application use reasoning over the pre-built visual knowledge graph to increase robustness, and achieve better experimental results on small data sets.
  • The video behavior recognition method provided in this application can achieve accuracy improvements of more than 3-5% on a variety of popular 2D, 3D and transformer backbones for human action recognition, in both fully supervised and few-shot settings.
  • The video behavior recognition method provided by this application is also significantly better than recent knowledge-based action recognition frameworks; for example, it achieved 83.9% accuracy on Kinetics-TPS, while PaStaNet achieved 63.8%.
  • The video behavior recognition method provided by the embodiments of this application can be used in many different fields, including referee assistance systems, search-engine video review, and security.
  • The video behavior recognition method can be used to search for related prohibited videos, effectively avoiding direct brute-force manual search and filtering of videos, thereby improving efficiency and purifying the network environment.
  • The video behavior recognition method provided by the embodiments of the present application requires little computation, occupies few server resources, and can monitor video-related issues and content in real time, eliminating the need for humans to watch camera feeds directly. Some dangerous behaviors can be identified directly and an alarm raised, thereby minimizing losses.
  • Figure 2 shows a schematic structural diagram of a video behavior recognition device according to an embodiment of the present application.
  • The video behavior recognition device may include:
  • an extraction module 210, used to obtain the video to be recognized and extract the triplet features, overall image features and video features of the target from the video frames of the video to be recognized;
  • a prediction module 220, used to input the triplet features and the overall image features into the pre-built mid-level semantic recognition model, predict the target's action, and obtain the predicted action;
  • a first determination module 230, used to determine the predicted behavior set according to the predicted action and the pre-built visual knowledge graph;
  • a second determination module 240, used to determine the behavior category according to the video features, the predicted action and the predicted behavior set.
  • The pre-built visual knowledge graph is constructed through the following steps:
  • All video-level action subgraphs together constitute the pre-built visual knowledge graph.
  • The action categories also include semantic knowledge.
  • The first determination module 230 is also used to:
  • according to the predicted action, determine the predicted body parts and predicted interaction objects corresponding to the predicted action.
  • The predicted behavior set includes at least one predicted behavior and a score corresponding to each predicted behavior; the first determination module 230 is also used to:
  • obtain the score corresponding to each predicted behavior.
  • The second determination module 240 is also used to:
  • according to the predicted action, determine the predicted body parts and predicted interaction objects corresponding to the predicted action;
  • according to the enhanced features, determine the behavior category.
  • This embodiment provides a video behavior recognition device that can execute the above method embodiments; its implementation principles and technical effects are similar and are not described again here.
  • Figure 3 is a schematic structural diagram of an electronic device provided by an embodiment of the present application; it shows an electronic device 300 suitable for implementing embodiments of the present application.
  • The electronic device 300 includes a central processing unit (CPU) 301, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 302 or a program loaded from a storage portion 308 into a random access memory (RAM) 303.
  • The RAM 303 also stores various programs and data required for the operation of the device 300.
  • The CPU 301, ROM 302 and RAM 303 are connected to each other through a bus 304.
  • An input/output (I/O) interface 305 is also connected to the bus 304.
  • The following components are connected to the I/O interface 305: an input portion 306 including a keyboard, a mouse, etc.; an output portion 307 including a cathode ray tube (CRT), a liquid crystal display (LCD), speakers, etc.; a storage portion 308 including a hard disk, etc.; and a communication portion 309 including a network interface card such as a LAN card, a modem, etc.
  • The communication portion 309 performs communication processing via a network such as the Internet.
  • A drive 310 is also connected to the I/O interface 305 as needed.
  • Removable media 311, such as magnetic disks, optical disks, magneto-optical disks and semiconductor memories, are mounted on the drive 310 as needed, so that computer programs read from them can be installed into the storage portion 308 as needed.
  • The process described above with reference to Figure 1 may be implemented as a computer software program.
  • Embodiments of the present disclosure include a computer program product, which includes a computer program tangibly embodied on a machine-readable medium; the computer program includes program code for executing the above video behavior recognition method.
  • The computer program may be downloaded and installed from a network via the communication portion 309 and/or installed from the removable media 311.
  • Each block in the flowcharts or block diagrams may represent a module, segment, or portion of code that contains one or more executable instructions for implementing the specified logical function(s).
  • The functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may actually execute substantially in parallel, or they may sometimes execute in the reverse order, depending on the functionality involved.
  • Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in them, can be implemented by special-purpose hardware-based systems that perform the specified functions or operations, or by a combination of special-purpose hardware and computer instructions.
  • The units or modules described in the embodiments of this application can be implemented in software or in hardware.
  • The described units or modules may also be provided in a processor.
  • In certain circumstances, the names of these units or modules do not constitute a limitation on the units or modules themselves.
  • A typical implementation device is a computer.
  • The computer may be, for example, a personal computer, a laptop, a mobile phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or any combination of these devices.
  • This application also provides a storage medium.
  • The storage medium may be the storage medium included in the device of the above embodiment, or it may be a storage medium that exists independently and is not assembled into the device.
  • The storage medium stores one or more programs, and the programs are used by one or more processors to execute the video behavior recognition method described in this application.
  • Storage media include permanent and non-permanent, removable and non-removable media, and may be implemented by any method or technology for storing information.
  • Information may be computer-readable instructions, data structures, program modules or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic tape cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
  • Computer-readable media do not include transitory media such as modulated data signals and carrier waves.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

A video action recognition method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring a video to be recognized, and extracting a triple feature, an overall image feature, and a video feature of a target from a video frame of the video to be recognized (S110); inputting the triple feature and the overall image feature into a pre-constructed mid-level semantic recognition model to predict a motion of the target, so as to obtain a predicted motion (S120); determining a predicted action set according to the predicted motion and a pre-constructed visual knowledge graph (S130); and determining an action category according to the video feature, the predicted motion, and the predicted action set (S140). According to the method, a predicted action set is determined by using a pre-constructed visual knowledge graph, is combined with visual information, and is then subjected to action classification, such that the classification effect can be improved.

Description

A video behavior recognition method, device, electronic device and storage medium
Technical Field
The invention belongs to the field of artificial intelligence technology, and in particular relates to a video behavior recognition method, device, electronic device and storage medium.
Background Art
Action recognition is an important problem in video understanding. With the rapid development of deep learning, this line of research has made great progress. However, most existing methods treat action recognition as a high-level video classification problem and focus on designing a backbone for representation learning. In fact, human behavior is the spatiotemporal evolution of body-part movements and object interactions. Without this explicit understanding, these approaches often run into performance bottlenecks when recognizing easily confused actions with complex dynamics.
To alleviate the above problems, several methods have been proposed that imitate humans' compositional knowledge to understand actions in more detail. However, they mainly focus on hands (e.g., Something-Else) or the whole person in the scene (e.g., Action Genome), without studying body-part movements in videos in depth. Recently, PaStaNet has attempted to explore human activity knowledge through part-level state (i.e., motion/verb) annotations. But it targets human-object interaction (HOI) in the image domain and does not learn human dynamics in videos. More importantly, human behaviors refer to more abstract concepts than HOI categories. Therefore, PaStaNet lacks the common-sense knowledge needed to describe human behavior with discriminative semantics.
Technical Problem
The purpose of the embodiments of this specification is to provide a video behavior recognition method, device, electronic device and storage medium.
Technical Solution
To solve the above technical problem, the embodiments of this application are implemented as follows:
In a first aspect, this application provides a video behavior recognition method, which includes:
obtaining a video to be recognized, and extracting triplet features, overall image features and video features of a target from video frames of the video to be recognized;
inputting the triplet features and the overall image features into a pre-built mid-level semantic recognition model to predict an action of the target and obtain a predicted action;
determining a predicted behavior set according to the predicted action and a pre-built visual knowledge graph;
determining a behavior category according to the video features, the predicted action and the predicted behavior set.
In one embodiment, the pre-built visual knowledge graph is constructed through the following steps:
obtaining annotation information in a data set, the annotation information including body parts, actions and interaction objects;
connecting the body parts, actions and interaction objects to obtain video-level subgraphs;
obtaining action categories;
connecting each action category with all the annotation information in the corresponding video-level subgraph to obtain video-level action subgraphs;
all video-level action subgraphs together constituting the pre-built visual knowledge graph.
In one embodiment, the action categories also include semantic knowledge.
In one embodiment, determining the predicted behavior set according to the predicted action and the pre-built visual knowledge graph includes:
according to the predicted action, determining predicted body parts and predicted interaction objects corresponding to the predicted action;
matching the predicted action, predicted body parts and predicted interaction objects against the pre-built visual knowledge graph to obtain the predicted behavior set.
In one embodiment, the predicted behavior set includes at least one predicted behavior and a score corresponding to each predicted behavior;
matching the predicted action, predicted body parts and predicted interaction objects against the pre-built visual knowledge graph to obtain the predicted behavior set includes:
matching the predicted action, predicted body parts and predicted interaction objects against the pre-built visual knowledge graph to determine a weight corresponding to each predicted behavior;
summing and normalizing the weights corresponding to all predicted behaviors to obtain the score corresponding to each predicted behavior.
In one embodiment, the weight corresponding to each predicted behavior is determined according to an uncertainty weighting mechanism.
In one embodiment, determining the behavior category according to the video features, the predicted action and the predicted behavior set includes:
according to the predicted action, determining the predicted body parts and predicted interaction objects corresponding to the predicted action;
inputting the video features, the predicted action, the predicted body parts, the predicted interaction objects and the predicted behavior set into a pre-built high-level semantic recognition model to obtain multi-modal features;
determining enhanced features according to all the multi-modal features corresponding to all video frames of the video to be recognized;
determining the behavior category according to the enhanced features.
In a second aspect, this application provides a video behavior recognition device, which includes:
an extraction module, used to obtain a video to be recognized and extract triplet features, overall image features and video features of a target from video frames of the video to be recognized;
a prediction module, used to input the triplet features and the overall image features into a pre-built mid-level semantic recognition model to predict an action of the target and obtain a predicted action;
a first determination module, used to determine a predicted behavior set according to the predicted action and a pre-built visual knowledge graph;
a second determination module, used to determine a behavior category according to the video features, the predicted action and the predicted behavior set.
In a third aspect, this application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements the video behavior recognition method of the first aspect.
In a fourth aspect, this application provides a readable storage medium on which a computer program is stored, where the program, when executed by a processor, implements the video behavior recognition method of the first aspect.
Beneficial Effects
It can be seen from the technical solution provided by the above embodiments of this specification that the solution determines a predicted behavior set using a pre-built visual knowledge graph, combines it with visual information, and then performs behavior classification, which can improve the classification effect.
Brief Description of the Drawings
In order to explain the embodiments of this specification or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some of the embodiments recorded in this specification; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Figure 1 is a schematic flow chart of the video behavior recognition method provided by this application;
Figure 2 is a schematic structural diagram of the video behavior recognition device provided by this application;
Figure 3 is a schematic structural diagram of the electronic device provided by this application.
本发明的实施方式Embodiments of the invention
为了使本技术领域的人员更好地理解本说明书中的技术方案,下面将结合本说明书实施例中的附图,对本说明书实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本说明书一部分实施例,而不是全部的实施例。基于本说明书中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都应当属于本说明书保护的范围。In order to enable those skilled in the art to better understand the technical solutions in this specification, the technical solutions in the embodiments of this specification will be clearly and completely described below in conjunction with the accompanying drawings in the embodiments of this specification. Obviously, the described The embodiments are only some of the embodiments of this specification, but not all of the embodiments. Based on the embodiments in this specification, all other embodiments obtained by those of ordinary skill in the art without creative efforts should fall within the scope of protection of this specification.
以下描述中,为了说明而不是为了限定,提出了诸如特定***结构、技术之类的具体细节,以便透彻理解本申请实施例。然而,本领域的技术人员应当清楚,在没有这些具体细节的其它实施例中也可以实现本申请。在其它情况中,省略对众所周知的***、装置、电路以及方法的详细说明,以免不必要的细节妨碍本申请的描述。In the following description, for the purpose of explanation rather than limitation, specific details such as specific system structures and technologies are provided to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to those skilled in the art that the present application may be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
在不背离本申请的范围或精神的情况下,可对本申请说明书的具体实施方式做多种改进和变化,这对本领域技术人员而言是显而易见的。由本申请的说明书得到的其他实施方式对技术人员而言是显而易见得的。本申请说明书和实施例仅是示例性的。It will be obvious to those skilled in the art that various modifications and changes can be made to the specific embodiments described in the present application without departing from the scope or spirit of the present application. Other embodiments will be apparent to the skilled person from the description of this application. The specification and examples are intended to be illustrative only.
行为识别是指判断一段视频中目标的行为的类别,其中,目标可以为人或其他动物等,本实施例中目标以人为例,因此也可以称为Human Action Recognition(人类动作识别)。相关技术中,一般使用的数据库都先将动作分割好了,一个视频片断中包含一段明确的动作,时间较短且有唯一确定的label。所以也可以看作是输入为视频,输出为动作标签的多分类问题。此外,动作识别数据库中的动作一般都比较明确,周围的干扰也相对较少。有点像图像分析中的Image Classification(图像分类)任务。Behavior recognition refers to determining the behavior category of a target in a video, where the target can be a human or other animal, etc. In this embodiment, the target is a human, so it can also be called Human Action Recognition. In related technologies, commonly used databases first segment actions. A video clip contains a clear action, is short in duration, and has a unique label. Therefore, it can also be regarded as a multi-classification problem where the input is video and the output is action label. In addition, the actions in the action recognition database are generally relatively clear, and there are relatively few surrounding interferences. It's a bit like the Image Classification task in image analysis.
关于本文中所使用的“包含”、“包括”、“具有”、“含有”等等,均为开放性的用语,即意指包含但不限于。The words "includes", "includes", "has", "contains", etc. used in this article are all open terms, which mean including but not limited to.
本申请中的“份”如无特别说明,均按质量份计。"Parts" in this application are all parts by mass unless otherwise specified.
下面结合附图和实施例对本发明进一步详细说明。The present invention will be further described in detail below in conjunction with the accompanying drawings and examples.
Referring to Figure 1, a schematic flowchart of the video behavior recognition method provided by an embodiment of the present application is shown.
As shown in Figure 1, the video behavior recognition method can include:
S110. Obtain a video to be recognized, and extract triplet features, overall image features and video features of a target from video frames of the video to be recognized.
Specifically, the video to be recognized refers to the video on which behavior recognition is to be performed. It can be a video obtained in real time or a video stored in a storage server or storage medium; this is not limited here.
The triplet features are body parts, actions (i.e., the movements of body parts) and interaction objects (objects in contact with body parts). It can be understood that the triplet features can be named and defined based on the Kinetics-TPS data set. Since the goal of behavior recognition in this application is not to cover as many human behavior categories as possible, but to master the basic mechanism of constructing visual common-sense knowledge of human behavior, the Kinetics-TPS data set, which systematically studies 24 common human behavior categories, is the preferred data set. For example, Kinetics-TPS provides more than 150,000 part-level annotations for detailed human action understanding, including 7.9 million box annotations for body parts, 7.9 million part action tags (i.e., verbs) and 5,000 interaction objects.
The overall image features are the global features of a video frame; for example, they are static image semantic features extracted with, e.g., ResNet50.
It can be understood that a frame extraction tool can be used to extract video frames from the video to be recognized, and all extracted video frames can constitute a video frame sequence. For example, the video to be recognized can be sampled into 8 video frames, and the frame extraction tool can be FFmpeg. The Kinetics-TPS data set annotates the people in each video frame in detail: each video frame can contain N human position boxes (for example, N = 3). The annotation information of each human position box includes the body parts of the person, the action corresponding to each body part, and the position-box labels of the interaction objects; for example, one person is annotated with position-box labels for 10 body parts, for the action (verb) corresponding to each body part, and for the interaction objects. It can be understood that the human position boxes can be cropped from the video frames using ROI pooling (Region of Interest pooling), as sketched below.
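As an illustration only, the following sketch crops per-box features from a frame's convolutional feature map with torchvision's roi_align, a common ROI-pooling variant; the tensor shapes, the 1/32 feature-map stride and the pooled output size are assumptions of this sketch.

    import torch
    from torchvision.ops import roi_align

    feature_map = torch.randn(1, 2048, 7, 7)         # (B, C, H, W) backbone output
    boxes = [torch.tensor([[12., 30., 96., 200.]])]  # one xyxy box in image pixels
    # spatial_scale maps pixel coordinates onto the downsampled feature grid.
    part_feat = roi_align(feature_map, boxes, output_size=(3, 3),
                          spatial_scale=1.0 / 32)    # (K, 2048, 3, 3)
    part_vecs = part_feat.mean(dim=(2, 3))           # (K, 2048) one feature per box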
It can be understood that a visual learner can be used to extract the overall image features of the target in a video frame. The visual learner can use a neural network feature extractor, for example ResNet (Residual Network), VGG16 (Visual Geometry Group), GoogLeNet, SqueezeNet or other feature extractors.
Video features can be extracted by a video learner, which can flexibly build human-level representations on each frame by adaptively aggregating their part-level features. The video learner can use a video feature extractor, for example TSN (Temporal Segment Networks), TSM (Temporal Shift Module) or TEA (Temporal Excitation and Aggregation). In one example, the video features are temporal video semantic features extracted with TSN.
S120. Input the triplet features and the overall image features into the pre-built mid-level semantic recognition model to predict the target's action.
Specifically, the pre-built mid-level semantic recognition model is a pre-built neural network model, used to predict the action corresponding to the target based on the input triplet features and overall image features. For example, the neural network model can adopt a self-attention network model.
When the pre-built mid-level semantic recognition model performs mid-level semantic recognition on body parts, the interaction-object features and the global frame features are used as contextual information for the body parts and are fused into the body-part features for behavior recognition.
S130. Determine the predicted behavior set according to the predicted action and the pre-built visual knowledge graph.
The pre-built visual knowledge graph is a knowledge graph built in advance; it is a visual knowledge graph that summarizes the individual subgraphs of all human annotations to reflect the visual common-sense knowledge of human behavior in videos.
The pre-built visual knowledge graph is constructed through the following steps:
obtaining the annotation information in the data set, the annotation information including body parts, actions and interaction objects;
connecting the body parts, actions and interaction objects to obtain video-level subgraphs;
obtaining the action categories;
connecting each action category with all the annotation information in the corresponding video-level subgraph to obtain video-level action subgraphs;
all video-level action subgraphs together constituting the pre-built visual knowledge graph.
Optionally, the action categories also include semantic knowledge.
Specifically, a knowledge graph is essentially a kind of knowledge base called a semantic network, that is, a knowledge base with a directed graph structure; in other words, a knowledge graph is a data structure composed of entities, relationships and attributes.
The data set can be the Kinetics-TPS data set, the Kinetics data set, the Action Genome data set, etc. For example, the Kinetics-TPS data set is used to construct the pre-built visual knowledge graph; that is, the annotation files of the Kinetics-TPS data set are used as the raw data.
It can be understood that the annotation information in the data set (i.e., the annotated triplet information) can be deduplicated first; the annotation information referred to below can be understood as the annotation information after deduplication.
First, a part-verb-object subgraph is constructed from each annotated human instance; for simplicity, the movement of a body part is called a verb. In this subgraph, nodes refer to body parts, verbs and interaction objects, and edges refer to the visual connections between them.
Video-level subgraphs are then obtained by integrating, for each video, the subgraphs of all annotated human instances and merging repeated paths. To reflect the action category of the video, an additional action node is added and connected to all nodes in the subgraph, resulting in a video-level action subgraph. Finally, all video-level action subgraphs in TPS are combined. This generates a visual knowledge graph containing 181 nodes and 4,532 edges, including 10 body-part nodes, 84 verb nodes, 73 interaction-object nodes and 24 action nodes.
It can be understood that verbs are divided into transitive verbs (vt) and intransitive verbs (vi), and their part of speech determines whether they interact with an object, giving the two patterns part-vt-object and part-vi. For each triplet, its nodes are connected with the action node corresponding to the video, finally forming a video-level action subgraph composed of two types of edge chains: <part, vt, object, action> and <part, vi, action>.
The video-level action subgraphs can also be further extended with semantic descriptions of human actions to provide rich and diverse contextual information. Specifically, each action node is equipped with a text description from Wikipedia, which provides semantic knowledge. The pre-built visual knowledge graph therefore integrates visual and semantic action knowledge and leverages their collaboration to understand human actions in videos effectively.
The pre-built visual knowledge graph can be stored using methods such as RDF (Resource Description Framework) or Neo4j; no restriction is placed on this here. A small RDF example follows.
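As an illustration only, the following sketch stores a few graph edges as RDF triples with the rdflib library; the namespace, the predicate names and the node labels are assumptions of this sketch, not part of the application.

    from rdflib import Graph, Namespace

    KB = Namespace("http://example.org/action-kb/")       # hypothetical namespace
    g = Graph()
    g.add((KB["hand"], KB["performs"], KB["hold"]))       # part -> verb
    g.add((KB["hold"], KB["actsOn"], KB["basketball"]))   # verb -> object
    g.add((KB["dribbling"], KB["hasPart"], KB["hand"]))   # action-node link
    g.serialize(destination="action_kb.ttl", format="turtle")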
In this embodiment, the reasoning process over the pre-built visual knowledge graph has good interpretability, and the semantic knowledge in it is clear.
In one embodiment, determining the predicted behavior set according to the predicted action and the pre-built visual knowledge graph may include:
according to the predicted action, determining the predicted body parts and predicted interaction objects corresponding to the predicted action;
matching the predicted action, predicted body parts and predicted interaction objects against the pre-built visual knowledge graph to obtain the predicted behavior set.
The predicted behavior set includes at least one predicted behavior and a score corresponding to each predicted behavior. It can be understood that a predicted behavior is a predicted action category.
Optionally, matching the predicted action, predicted body parts and predicted interaction objects against the pre-built visual knowledge graph to obtain the predicted behavior set may include:
matching the predicted action, predicted body parts and predicted interaction objects against the pre-built visual knowledge graph to determine the weight corresponding to each predicted behavior;
summing and normalizing the weights corresponding to all predicted behaviors to obtain the score corresponding to each predicted behavior.
Optionally, the weight corresponding to each predicted behavior can be determined according to an uncertainty weighting mechanism.
Specifically, the predicted action, predicted body parts and predicted interaction objects are together referred to as a prediction triplet.
It can be understood that matching a prediction triplet against the pre-built visual knowledge graph can identify many possible action nodes, because some body actions are actually shared between different action categories. Obviously, this introduces uncertainty into human behavior reasoning. To alleviate this problem, a simple but effective uncertainty weighting mechanism can be introduced. If a prediction triplet matches only one action category (i.e., the predicted behavior set includes only one predicted behavior), that action category is assigned a weight of 100 for this triplet. As the number of matched action categories increases, the weight of each action category decreases; for example, if the prediction triplet matches 2 (or 3, 4 or 5) action categories, each matched action category is assigned a weight of 60 (or 30, 10 or 3) to account for the uncertainty. It can be understood that this weight setting widens the confidence gap and prevents valid inferences from being drowned out. If a prediction triplet matches more than five action categories, its inference result is not considered, because it carries too much uncertainty.
The obtained weights are then summed and normalized to give the score corresponding to each predicted behavior; understandably, the scores of all predicted behaviors sum to 1. A sketch of this weighting follows.
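A minimal sketch of this uncertainty weighting, using the 100/60/30/10/3 weights and the five-match cutoff stated above; the input format (one list of matched action labels per prediction triplet) is an assumption of this sketch.

    from collections import defaultdict

    WEIGHTS = {1: 100, 2: 60, 3: 30, 4: 10, 5: 3}  # weight by number of matches

    def score_behaviors(matched_actions_per_triplet):
        """matched_actions_per_triplet: list of lists of matched action labels."""
        votes = defaultdict(float)
        for actions in matched_actions_per_triplet:
            w = WEIGHTS.get(len(actions))
            if w is None:        # more than five matches: too uncertain, discard
                continue
            for action in actions:
                votes[action] += w
        total = sum(votes.values()) or 1.0
        return {a: v / total for a, v in votes.items()}  # scores sum to 1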
S140. Determining the behavior category according to the video features, the predicted action and the predicted behavior set may include:
according to the predicted action, determining the predicted body parts and predicted interaction objects corresponding to the predicted action;
inputting the video features, the predicted action, the predicted body parts, the predicted interaction objects and the predicted behavior set into a pre-built high-level semantic recognition model to obtain multi-modal features;
determining enhanced features according to all the multi-modal features corresponding to all video frames of the video to be recognized;
determining the behavior category according to the enhanced features.
Specifically, a multi-modal representation is based on single-modal representations and constrains their results. Multi-modal representation refers to processing the information of each modality with modality-joint semantic representation or modality-constrained semantic representation, so that modal information with the same or similar semantics also has the same or similar representation results. In this application, visual features and text features are mainly fused.
The pre-built high-level semantic recognition model is a high-level semantic recognition model built in advance, used to fuse the video features and the semantic action knowledge; that is, by combining human behavior information with the temporal and spatial connections between people in the video, multi-modal features are obtained. The high-level semantic recognition model can perform fusion using structures such as graph convolutional neural networks, tree fusion or transformers.
For example, with tree fusion, the body-part features can be fused following the natural composition of the human body: a fully connected layer first fuses the hand and the upper limb to form the hand features; the leg features and the foot features are fused through another fully connected layer to form the leg features; the hand features are then fused with the head features to form the upper-body features, and the leg features are fused with the hip (or waist) features to form the lower-body features; finally, the upper-body and lower-body features are all fused to form the human-body features, which are the high-level features.
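A minimal PyTorch sketch of this tree fusion follows; the feature dimension and the module name `TreeFusion` are illustrative assumptions, while the fusion order follows the description above:

```python
import torch
import torch.nn as nn

class TreeFusion(nn.Module):
    """Fuse body-part features along the body's natural hierarchy."""

    def __init__(self, dim: int = 256):
        super().__init__()

        def fuse():
            # One fusion step: concatenate two part features, project back to dim.
            return nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

        self.hand_arm = fuse()   # hand + upper limb  -> hand features
        self.leg_foot = fuse()   # leg + foot         -> leg features
        self.upper = fuse()      # hand + head        -> upper-body features
        self.lower = fuse()      # leg + hip/waist    -> lower-body features
        self.body = fuse()       # upper + lower body -> whole-body features

    def forward(self, hand, arm, leg, foot, head, hip):
        h = self.hand_arm(torch.cat([hand, arm], dim=-1))
        l = self.leg_foot(torch.cat([leg, foot], dim=-1))
        upper = self.upper(torch.cat([h, head], dim=-1))
        lower = self.lower(torch.cat([l, hip], dim=-1))
        return self.body(torch.cat([upper, lower], dim=-1))
```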
To determine the enhanced features from all the multi-modal features corresponding to all the video frames of the video to be recognized, a GCN (Graph Convolutional Network) can be used. That is, the GCN learns the relations among the human-body information across multiple video frames, so that the associations between people become tighter and the information between frames is enhanced.
Assume that the people across the video frames form a graph in which each node has degree 1; the GCN performs convolution over this graph and lets the person information from multiple frames interact, so that the information between frames is enhanced.
The resulting enhanced features are then fed into a classification layer to obtain the classification probability of each action category, and the behavior category is determined from these probabilities. Exemplarily, the action category with the highest classification probability may be determined as the behavior category.
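A sketch of this enhancement-and-classification step, assuming one node per person per frame, a single normalized-adjacency GCN layer, and illustrative dimensions and class count (none of which are fixed by the disclosure):

```python
import torch
import torch.nn as nn

class FrameGCNClassifier(nn.Module):
    """One GCN layer propagates person features across frames; the pooled,
    enhanced features are then classified into action categories."""

    def __init__(self, dim: int = 256, num_classes: int = 24):
        super().__init__()
        self.gcn = nn.Linear(dim, dim)
        self.relu = nn.ReLU()
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # feats: (N, dim) multi-modal features, one node per person per frame
        # adj:   (N, N) adjacency linking each person to itself across frames
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        enhanced = self.relu(self.gcn(adj @ feats / deg))   # mean-aggregated GCN step
        video_feat = enhanced.mean(dim=0)                   # pool over all nodes
        return self.classifier(video_feat).softmax(dim=-1)  # per-category probability
```

The behavior category is then simply the argmax over the returned probabilities, i.e., the action category with the highest classification probability.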
Understandably, the activation functions in all of the above neural networks are nonlinear activation functions; exemplarily, the ReLU activation function can be used.
The embodiments of this application use the pre-built visual knowledge graph to perform uncertainty reasoning over behaviors, combine the result with the visual information, and then classify the behavior, which improves the classification performance.
The embodiments of this application use reasoning over the pre-built visual knowledge graph to increase robustness, yielding better experimental results on small datasets.
The method was validated on the recent Kinetics-TPS benchmark, which contains body-part parsing annotations for a detailed understanding of human behavior in videos. The results show that the video behavior recognition method provided by this application achieves an accuracy improvement of more than 3-5% for human action recognition on various popular 2D, 3D, and transformer backbones, in both the supervised and the few-shot settings. Moreover, under the same settings, the method provided by this application clearly outperforms recent knowledge-based action recognition frameworks; for example, it achieves 83.9% accuracy on Kinetics-TPS, whereas PaStaNet achieves 63.8%.
The video behavior recognition method provided by the embodiments of this application can be used in many different fields, including referee assistance systems, search-engine video review, and security.
When assisting referees, the specific behavior of an action, as well as foul actions, can be captured effectively and promptly, improving the accuracy of refereeing decisions.
In search-engine video review, a wide variety of pictures and videos exist on the Internet, including some illegal or pornographic ones. The video behavior recognition method provided by the embodiments of this application can be used to search for such prohibited videos, effectively avoiding brute-force manual search and filtering, thereby improving efficiency and helping to purify the network environment.
In the security field, the video behavior recognition method provided by the embodiments of this application requires less computation, occupies fewer server resources, and can monitor video-related issues and content in real time, eliminating the need for humans to watch surveillance footage directly. Some dangerous behaviors can be identified directly and an alarm raised, minimizing losses.
Referring to Figure 2, a schematic structural diagram of a video behavior recognition apparatus according to an embodiment of the present application is shown.
As shown in Figure 2, the video behavior recognition apparatus may include:
an extraction module 210, configured to obtain a video to be recognized, and extract the target's triplet features, overall image features, and video features from the video frames of the video to be recognized;
a prediction module 220, configured to input the triplet features and the overall image features into a pre-built mid-level semantic recognition model, predict the action of the target, and obtain a predicted action;
a first determination module 230, configured to determine a predicted behavior set according to the predicted action and the pre-built visual knowledge graph;
a second determination module 240, configured to determine a behavior category according to the video features, the predicted action, and the predicted behavior set.
Optionally, the pre-built visual knowledge graph is constructed through the following steps (a construction sketch follows the list below):
obtaining the annotation information in a dataset, the annotation information including body parts, actions, and interaction objects;
connecting the body parts, the actions, and the interaction objects to obtain a video-level subgraph;
obtaining action categories;
connecting all the action categories with all the annotation information in the video-level subgraph to obtain a video-level action subgraph;
all the video-level action subgraphs constituting the pre-built visual knowledge graph.
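As a construction sketch, assume each dataset record carries an action category and its annotated (body part, action, interaction object) triplets; the record layout and the use of networkx are illustrative assumptions, not part of the original disclosure:

```python
import networkx as nx

def build_visual_knowledge_graph(dataset):
    """dataset: iterable of records such as
    {"category": "dribbling basketball",
     "triplets": [("hand", "hold", "basketball"), ...]}"""
    graph = nx.Graph()
    for record in dataset:
        category = record["category"]
        graph.add_node(category, kind="action_category")
        for part, action, obj in record["triplets"]:
            # Video-level subgraph: connect body part, action, and object.
            graph.add_edge(part, action)
            graph.add_edge(action, obj)
            # Link the action category to every annotated element,
            # forming the video-level action subgraph.
            for node in (part, action, obj):
                graph.add_edge(category, node)
    return graph  # union of all video-level action subgraphs
```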
Optionally, the action categories also include semantic knowledge.
Optionally, the first determination module 230 is further configured to:
determine, according to the predicted action, the predicted body part and the predicted interaction object corresponding to the predicted action;
match the predicted action, the predicted body part, and the predicted interaction object against the pre-built visual knowledge graph to obtain the predicted behavior set (a matching sketch follows below).
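Continuing the networkx sketch above, the matching can be read off the graph directly. The rule used here, that a category matches when it is connected to all three predicted elements, is an assumption for illustration; the disclosure does not pin down the exact matching criterion:

```python
def match_categories(graph, triplet):
    """Return the action categories whose video-level action subgraphs
    contain the whole predicted (body part, action, object) triplet."""
    part, action, obj = triplet
    matches = []
    for node, data in graph.nodes(data=True):
        if data.get("kind") != "action_category":
            continue
        if all(graph.has_edge(node, element) for element in (part, action, obj)):
            matches.append(node)
    return matches
```

A `functools.partial(match_categories, graph)` can then serve as the lookup callable in the scoring sketch given earlier.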
Optionally, the predicted behavior set includes at least one predicted behavior and a score corresponding to each predicted behavior; the first determination module 230 is further configured to:
match the predicted action, the predicted body part, and the predicted interaction object against the pre-built visual knowledge graph to determine the weight corresponding to each predicted behavior;
sum and normalize the weights corresponding to all the predicted behaviors to obtain the score corresponding to each predicted behavior.
Optionally, the weight corresponding to each predicted behavior is determined according to the uncertainty weighting mechanism.
Optionally, the second determination module 240 is further configured to:
determine, according to the predicted action, the predicted body part and the predicted interaction object corresponding to the predicted action;
input the video features, the predicted action, the predicted body part, the predicted interaction object, and the predicted behavior set into the pre-built high-level semantic recognition model to obtain multi-modal features;
determine enhanced features according to all the multi-modal features corresponding to all the video frames of the video to be recognized;
determine the behavior category according to the enhanced features.
The video behavior recognition apparatus provided in this embodiment can perform the embodiments of the above method; its implementation principles and technical effects are similar and are not repeated here.
Figure 3 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention. As shown in Figure 3, a schematic structural diagram of an electronic device 300 suitable for implementing the embodiments of the present application is shown.
As shown in Figure 3, the electronic device 300 includes a central processing unit (CPU) 301, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 302 or a program loaded from a storage section 308 into a random access memory (RAM) 303. The RAM 303 also stores the various programs and data required for the operation of the device 300. The CPU 301, the ROM 302, and the RAM 303 are connected to one another through a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.
The following components are connected to the I/O interface 305: an input section 306 including a keyboard, a mouse, and the like; an output section 307 including a cathode ray tube (CRT) or liquid crystal display (LCD), speakers, and the like; a storage section 308 including a hard disk and the like; and a communication section 309 including a network interface card such as a LAN card or a modem. The communication section 309 performs communication processing via a network such as the Internet. A drive 310 is also connected to the I/O interface 305 as needed. A removable medium 311, such as a magnetic disk, an optical disc, a magneto-optical disc, or a semiconductor memory, is mounted on the drive 310 as needed, so that a computer program read from it can be installed into the storage section 308 as required.
In particular, according to embodiments of the present disclosure, the process described above with reference to Figure 1 may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program containing program code for performing the above video behavior recognition method. In such embodiments, the computer program may be downloaded and installed from a network through the communication section 309, and/or installed from the removable medium 311.
The flowcharts and block diagrams in the figures illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units or modules described in the embodiments of this application may be implemented in software or in hardware. The described units or modules may also be provided in a processor, and their names do not, in certain cases, constitute a limitation on the units or modules themselves.
The systems, apparatuses, modules, or units described in the above embodiments may be implemented by computer chips or entities, or by products having certain functions. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a mobile phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
As another aspect, this application also provides a storage medium, which may be the storage medium contained in the apparatus of the above embodiments, or a storage medium that exists separately and is not assembled into the device. The storage medium stores one or more programs, which are used by one or more processors to perform the video behavior recognition method described in this application.
Storage media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should be noted that the terms "comprise", "include", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
Each embodiment in this specification is described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiments are substantially similar to the method embodiments, they are described relatively simply, and reference may be made to the relevant descriptions of the method embodiments.

Claims (10)

  1. A video behavior recognition method, characterized in that the method comprises:
    obtaining a video to be recognized, and extracting a target's triplet features, overall image features, and video features from video frames of the video to be recognized;
    inputting the triplet features and the overall image features into a pre-built mid-level semantic recognition model to predict an action of the target and obtain a predicted action;
    determining a predicted behavior set according to the predicted action and a pre-built visual knowledge graph;
    determining a behavior category according to the video features, the predicted action, and the predicted behavior set.
  2. The method according to claim 1, characterized in that the pre-built visual knowledge graph is constructed through the following steps:
    obtaining annotation information in a dataset, the annotation information including body parts, actions, and interaction objects;
    connecting the body parts, the actions, and the interaction objects to obtain a video-level subgraph;
    obtaining action categories;
    connecting all the action categories with all the annotation information in the video-level subgraph to obtain a video-level action subgraph;
    all the video-level action subgraphs constituting the pre-built visual knowledge graph.
  3. The method according to claim 2, characterized in that the action categories further include semantic knowledge.
  4. The method according to any one of claims 1-3, characterized in that determining the predicted behavior set according to the predicted action and the pre-built visual knowledge graph comprises:
    determining, according to the predicted action, a predicted body part and a predicted interaction object corresponding to the predicted action;
    matching the predicted action, the predicted body part, and the predicted interaction object against the pre-built visual knowledge graph to obtain the predicted behavior set.
  5. The method according to claim 4, characterized in that the predicted behavior set includes at least one predicted behavior and a score corresponding to each predicted behavior;
    the matching the predicted action, the predicted body part, and the predicted interaction object against the pre-built visual knowledge graph to obtain the predicted behavior set comprises:
    matching the predicted action, the predicted body part, and the predicted interaction object against the pre-built visual knowledge graph to determine a weight corresponding to each predicted behavior;
    summing and normalizing the weights corresponding to all the predicted behaviors to obtain the score corresponding to each predicted behavior.
  6. The method according to claim 5, characterized in that the weight corresponding to each predicted behavior is determined according to an uncertainty weighting mechanism.
  7. The method according to any one of claims 1-3, characterized in that determining the behavior category according to the video features, the predicted action, and the predicted behavior set comprises:
    determining, according to the predicted action, a predicted body part and a predicted interaction object corresponding to the predicted action;
    inputting the video features, the predicted action, the predicted body part, the predicted interaction object, and the predicted behavior set into a pre-built high-level semantic recognition model to obtain multi-modal features;
    determining enhanced features according to all the multi-modal features corresponding to all video frames of the video to be recognized;
    determining the behavior category according to the enhanced features.
  8. A video behavior recognition apparatus, characterized in that the apparatus comprises:
    an extraction module, configured to obtain a video to be recognized, and extract a target's triplet features, overall image features, and video features from video frames of the video to be recognized;
    a prediction module, configured to input the triplet features and the overall image features into a pre-built mid-level semantic recognition model, predict an action of the target, and obtain a predicted action;
    a first determination module, configured to determine a predicted behavior set according to the predicted action and a pre-built visual knowledge graph;
    a second determination module, configured to determine a behavior category according to the video features, the predicted action, and the predicted behavior set.
  9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the video behavior recognition method according to any one of claims 1-7.
  10. A readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the video behavior recognition method according to any one of claims 1-7.
PCT/CN2022/137036 2022-06-06 2022-12-06 Video action recognition method and apparatus, electronic device, and storage medium WO2023236469A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210628777.7 2022-06-06
CN202210628777.7A CN115188067A (en) 2022-06-06 2022-06-06 Video behavior identification method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023236469A1 2023-12-14

Family

ID=83512648

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/137036 WO2023236469A1 (en) 2022-06-06 2022-12-06 Video action recognition method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN115188067A (en)
WO (1) WO2023236469A1 (en)

Families Citing this family (2)

Publication number Priority date Publication date Assignee Title
CN115188067A (en) * 2022-06-06 2022-10-14 深圳先进技术研究院 Video behavior identification method and device, electronic equipment and storage medium
CN116704423B (en) * 2023-08-07 2023-11-28 中国科学技术大学 Hierarchical video character social interaction identification method, system, equipment and medium

Patent Citations (6)

US20220051061A1 (en) * 2019-10-30 2022-02-17 Tencent Technology (Shenzhen) Company Limited Artificial intelligence-based action recognition method and related apparatus
CN111259838A (en) * 2020-01-20 2020-06-09 山东大学 Method and system for deeply understanding human body behaviors in service robot service environment
CN111723729A (en) * 2020-06-18 2020-09-29 成都颜禾曦科技有限公司 Intelligent identification method for dog posture and behavior of surveillance video based on knowledge graph
CN111950482A (en) * 2020-08-18 2020-11-17 广东工业大学 Triple obtaining method and device based on video learning and text learning
CN113361344A (en) * 2021-05-21 2021-09-07 北京百度网讯科技有限公司 Video event identification method, device, equipment and storage medium
CN115188067A (en) * 2022-06-06 2022-10-14 深圳先进技术研究院 Video behavior identification method and device, electronic equipment and storage medium

Cited By (3)

Publication number Priority date Publication date Assignee Title
CN118095445A (en) * 2024-04-24 2024-05-28 武汉纺织大学 Knowledge-graph-based few-sample multi-hop reasoning optimization method
CN118097795A (en) * 2024-04-28 2024-05-28 常熟理工学院 Human body abnormal behavior recognition method, system and storage medium based on deep learning
CN118097795B (en) * 2024-04-28 2024-07-19 常熟理工学院 Human body abnormal behavior recognition method, system and storage medium based on deep learning

Also Published As

Publication number Publication date
CN115188067A (en) 2022-10-14

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22945597

Country of ref document: EP

Kind code of ref document: A1