WO2021088510A1 - Video classification method and apparatus, computer, and readable storage medium - Google Patents


Info

Publication number
WO2021088510A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
image
frame image
feature
target video
Application number
PCT/CN2020/114389
Other languages
French (fr)
Chinese (zh)
Inventor
王瑞琛 (Wang Ruichen)
王晓利 (Wang Xiaoli)
Original Assignee
Tencent Technology (Shenzhen) Company Limited (腾讯科技(深圳)有限公司)
Priority date
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited (腾讯科技(深圳)有限公司)
Publication of WO2021088510A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/75 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/73 Querying
    • G06F 16/732 Query formulation
    • G06F 16/7328 Query by example, e.g. a complete video frame or video sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 Retrieval characterised by using metadata automatically derived from the content
    • G06F 16/7847 Retrieval characterised by using metadata automatically derived from the content using low-level visual features of the video content

Definitions

  • This application relates to the field of computing technology, and in particular to a video classification method, device, computer, and readable storage medium.
  • the embodiments of the present application provide a video classification method and device, which can improve the efficiency of video classification.
  • A first aspect of the embodiments of the present application provides a video classification method, including: acquiring a key frame image from a target video; inputting the key frame image into an image search engine to obtain description information of the key frame image, and determining a keyword group of the key frame image according to the description information; acquiring a text content feature corresponding to the keyword group; and determining the video type tag of the target video according to the text content feature.
  • A second aspect of the embodiments of the present application provides a video classification device, the device including:
  • the first acquisition module, configured to acquire a key frame image from the target video;
  • the first determining module, configured to input the key frame image into an image search engine to obtain description information of the key frame image, and to determine the keyword group of the key frame image according to the description information;
  • the second acquisition module, configured to acquire the text content feature corresponding to the keyword group;
  • the second determining module, configured to determine the video type tag of the target video according to the text content feature.
  • A third aspect of the embodiments of the present application provides a computer, including a processor, a memory, and an input/output interface;
  • the processor is respectively connected to the memory and the input/output interface, wherein the input/output interface is used to input and output data, the memory is used to store program code, and the processor is used to call the program code to execute the video classification method described in the first aspect of the embodiments of the present application.
  • A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program; the computer program includes program instructions which, when executed by a processor, perform the video classification method described in the first aspect of the embodiments of the present application.
  • In the embodiments of the present application, a key frame image is obtained from a target video, the key frame image is input into an image search engine to obtain the description information of the key frame image, the keyword group of the key frame image is determined according to the description information, the text content feature corresponding to the keyword group is obtained, and the video type tag of the target video is determined according to the text content feature.
  • FIG. 1 is a video classification architecture diagram provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a scene for text-based video classification provided by an embodiment of the present application.
  • FIG. 3 is a flowchart of a video classification method provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a scenario of a keyword group obtaining method provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a scenario for determining text content features provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a specific process of video classification provided by an embodiment of the present application.
  • FIG. 7a is a schematic diagram of a key frame image acquisition scene provided by an embodiment of the present application.
  • FIG. 7b is a schematic diagram of a correlation matrix provided by an embodiment of the present application.
  • FIG. 8 is a structural diagram of text content feature acquisition provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of determining video content features provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a process for determining a video type tag provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of a video classification device provided by an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of a computer provided by an embodiment of the present application.
  • FIG. 1 is a video classification architecture diagram provided by an embodiment of the present application.
  • the embodiment of the present application may include a user terminal 101, a classification server 102, multiple receiving servers 103, and multiple target terminals 104.
  • the method of each embodiment may be implemented by one or more computing devices in the user terminal 101, the classification server 102, the multiple receiving servers 103, and the multiple target terminals 104.
  • Taking implementation by the classification server 102 as an example, the process is as follows.
  • When the user terminal 101 detects that there is a target video that needs to be classified, the target video is sent to the classification server 102.
  • the classification server 102 splits the target video to obtain multiple frame images that make up the target video.
  • the classification server 102 obtains key frame images from the multiple frame images, and uses the key frame images as representative images of the target video.
  • the classification server 102 inputs the key frame image into an image search engine to obtain the description information of the key frame image.
  • the description information consists of multiple phrases.
  • the classification server 102 determines the keyword group of the key frame image according to the description information, obtains the text content feature corresponding to the keyword group, and determines the video type tag of the target video according to the text content feature.
  • After the video type tag is determined, the classification server 102 can send the target video to the receiving server 103 or the target terminal 104 based on the video type tag, so that the receiving server 103 can add the target video to the corresponding category.
  • the target terminal 104 is a terminal marked with a video type tag, and it can be considered that a user of the target terminal is interested in a video related to the video type tag.
  • In this way, the target video can be pushed in a targeted manner, which can improve the intelligence of video management.
  • Both the user terminal 101 and the target terminal 104 may be an electronic device, including but not limited to a mobile phone, a tablet computer, a desktop computer, a notebook computer, a palmtop computer, a mobile internet device (MID), and a wearable device (for example, a smart watch or a smart bracelet).
  • the receiving server 103 may be a server corresponding to a video playback application.
  • the present application extracts the characteristics of the target video from the relevant text information of the target video, and determines the classification of the target video based on the characteristics of the target video, so as to increase the amount of training data that can be used for video classification.
  • Obtaining the relevant text information of the target video through the image search engine avoids the situation that the amount of training data may be insufficient when it is difficult to obtain more text descriptions of the target video.
  • the target video is classified based on the relevant text information of the target video, which improves the accuracy and effectiveness of the video classification.
  • FIG. 2 is a schematic diagram of a scene for text-based video classification provided by an embodiment of the present application.
  • the target video 201 is obtained by sequentially playing multiple frame images, that is, multiple frame images constitute the target video 201.
  • the target video 201 is split to obtain a plurality of frame images 202 constituting the target video 201, and each frame image 202 is a picture of the target video 201.
  • a key frame image 203 of the target video 201 is obtained from a plurality of frame images 202 constituting the target video 201, and the key frame image 203 is obtained based on a key frame determination model.
  • the obtained key frame images 203 are sequentially input to the image search engine 204 to obtain the description information of the key frame images 203.
  • the description information is composed of multiple phrases, and the description information includes multiple phrase descriptions corresponding to each frame image in the key frame image 203 in the image search engine 204.
  • the keyword group 205 of the key frame image is determined according to the description information, the text content feature corresponding to the keyword group 205 is obtained, and the video type tag 206 of the target video 201 is determined according to the text content feature.
  • the target video can be sent to the target server 207 and/or the target terminal 208 based on the video type tag 206, so that the application corresponding to the target server 207 can add the target video to the corresponding category based on the video type tag 206.
  • the target terminal 208 is a terminal marked with any one of the video type tags 206.
  • the video type tag 206 of the target video 201 includes two video type tags, namely a first video type tag and a third video type tag.
  • For example, the target video 201 and its video type tag 206 are sent to the target server 207. If the target server 207 is the server of the first video playback application, after receiving the video type tag 206 the target server 207 adds the target video 201 to the categories of the first video playback application corresponding to the first video type tag and the third video type tag.
  • The terminals marked with the video type tag 206 are then obtained, that is, the target terminals 208 marked with either the first video type tag or the third video type tag.
  • After the target terminal 208 is obtained, the target video 201 and its video type tag 206 are sent to the target terminal 208. If the target terminal 208 is marked with the first video type tag, the target terminal 208 displays the target video 201 under the first video type tag in its recommendation page.
  • FIG. 3 is a flowchart of a video classification method provided by an embodiment of the present application. As shown in FIG. 3, the video classification process includes the following steps.
  • Step S301 Obtain a key frame image from the target video.
  • the target video is split to obtain multiple frame images composing the target video, and key frame images are obtained from the multiple frame images.
  • The key frame images can be determined based on significant changes between adjacent images.
  • A key frame image refers to a frame image in which a key action of a moving or changing character or object in the video is located.
  • The key frame images of a video contain little redundant information and can represent the key content of the video. Since a video is composed of multiple frame images, and the content of consecutive frame images may differ very little, processing every frame image would cause unnecessary waste of resources. Therefore, a portion of the pictures can be extracted from the video as key frame images, the key frame images can be used as representative images of the video, and the determination result for the video can be derived from the processing result of the key frame images. A frame-splitting sketch is given below.
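  • A minimal sketch of this splitting step, assuming OpenCV is available; the mean-absolute-difference threshold is a purely illustrative stand-in for the key frame determination model described later.

```python
import cv2

def extract_key_frames(video_path, diff_threshold=30.0):
    """Split a video into frames and keep a frame when it differs
    noticeably from the previously kept frame (a hypothetical
    threshold-based stand-in for the key frame determination model)."""
    cap = cv2.VideoCapture(video_path)
    key_frames, prev_gray = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is None or cv2.absdiff(gray, prev_gray).mean() > diff_threshold:
            key_frames.append(frame)
            prev_gray = gray
    cap.release()
    return key_frames
```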
  • Step S302 Input the key frame image into the image search engine to obtain the description information of the key frame image, and determine the keyword group of the key frame image according to the description information.
  • Specifically, the key frame image is input into the image search engine for retrieval, the description information of the key frame image is obtained from the image search engine, and the keyword group of the key frame image is determined according to the description information.
  • The description information includes multiple phrases.
  • Specifically, the key frame image is input into the image search engine for retrieval to obtain a retrieval result.
  • The retrieval result is composed of description sentences related to the key frame image, and each description sentence includes multiple phrases. The number of occurrences of each phrase in the description information is counted, and the keyword group of the key frame image is determined based on those counts.
  • One option is to obtain the phrases whose number of occurrences is greater than the statistical threshold and determine them as the keyword group of the key frame image. Alternatively, the phrases appearing in the description information can be sorted by number of occurrences: when sorted from most to least, the first N phrases are taken as the keyword group of the key frame image; when sorted from least to most, the last N phrases are taken (a counting sketch is given below).
  • N is a positive integer, namely a preset number of phrases to obtain.
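  • A minimal counting sketch, assuming the description information has already been segmented into a list of phrases; the function name and the default N are illustrative.

```python
from collections import Counter

def keyword_groups(description_phrases, n=5, min_count=None):
    """Pick keyword groups either as phrases whose count exceeds a
    statistical threshold, or as the top-N most frequent phrases."""
    counts = Counter(description_phrases)
    if min_count is not None:
        return [p for p, c in counts.items() if c > min_count]
    return [p for p, _ in counts.most_common(n)]
```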
  • Optionally, when counting the number of occurrences of phrases, the currently counted phrase can be filtered, for example according to its part of speech, or according to other sentence analysis methods. This avoids phrases with no actual meaning being determined as keyword groups, thereby reducing the interference of such phrases with video classification.
  • For example, treating phrases whose part of speech is an adjective or adverb as meaningless phrases reduces the number of purely modifying phrases that are identified as keyword groups (a filtering sketch is given below).
  • Since phrases with no practical significance generally play a highlighting or supplementary role in the description information, they do not have a large impact on the main content of the video.
  • When describing images, such modifying phrases generally appear often, which may lead to them being determined as keyword groups; this can interfere with video classification and increase the amount of data to be processed, wasting resources. Therefore, when counting the number of occurrences of phrases in the description information, the phrases can be screened, reducing the amount of data to be processed and saving resources without affecting the video classification result.
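  • One possible part-of-speech screen, assuming the jieba package for tagging; the tag prefixes 'a' (adjective) and 'd' (adverb) follow jieba's convention, and the screening rule itself is an illustrative assumption.

```python
import jieba.posseg as pseg  # assumed available for part-of-speech tagging

def filter_meaningless(phrases):
    """Drop phrases made up entirely of adjectives ('a*') or adverbs
    ('d*'), per the screening idea described above."""
    kept = []
    for phrase in phrases:
        flags = [pair.flag for pair in pseg.cut(phrase)]
        if flags and all(f.startswith(('a', 'd')) for f in flags):
            continue  # purely modifying phrase: treated as meaningless
        kept.append(phrase)
    return kept
```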
  • A search engine refers to a system that automatically collects information from the Internet and, after organizing it, provides it to users for query.
  • An image search engine is a system for image retrieval based on a search engine.
  • the image search engine can receive an image and obtain the image data and related text description of the image by recognizing the image. After the key frame image is input into the image search engine, the image data of the key frame image and the related text description are obtained, and the text description is the description information of the key frame image.
  • Optionally, the web page corresponding to a piece of related information can be obtained, and the keyword group of the key frame image can be extracted from that web page, so that complete web content strongly correlated with the key frame image is used as part of the description information of the key frame image to enrich the characteristics of the key frame image.
  • If the description information is in English, morphological analysis can be performed on the English sentences based on their sentence patterns to obtain multiple phrases, and the keyword groups can be determined from those phrases;
  • if the description information is in Chinese, morphological analysis can be performed on the Chinese sentences based on their sentence patterns to obtain the multiple phrases that make up each sentence, and the keyword group of the key frame image can be determined from those phrases;
  • if the description information includes both Chinese and English description information, then for a mixed sentence composed of Chinese and English, morphological analysis is performed on the mixed sentence based on the sentence pattern of the Chinese sentence to obtain the multiple phrases that make up the mixed sentence, and the keyword group of the key frame image is determined from those phrases.
  • FIG. 4 is a schematic diagram of a scenario for obtaining a keyword group according to an embodiment of the present application.
  • the key frame image 401 is input to the image search engine for retrieval, and the retrieval display page 402 is obtained.
  • the retrieval display page 402 displays related information for the key frame image 401 retrieved based on the image search engine.
  • the relevant information displayed in the retrieval display page 402 is extracted, and the relevant information is the description information 403 of the key frame image 401.
  • Statistics are performed on each phrase in the description information 403 in turn to obtain the number of occurrences of each phrase.
  • Phrases in the description information 403 are screened according to their number of occurrences, and the phrases whose number of occurrences is greater than the statistical threshold are determined as the keyword group statistical information 404 of the key frame image.
  • the keyword group statistical information 404 records each determined keyword group and the number of appearances of each keyword group in the description information 403.
  • Step S303 Acquire the text content characteristics corresponding to the keyword group.
  • That is, the preset text type feature corresponding to the keyword groups is determined from among a plurality of preset text type features.
  • Specifically, when acquiring the text content feature corresponding to the keyword group, the keyword group can be input into the text classification model to extract the initial text feature corresponding to the keyword group.
  • The initial text feature is matched with the multiple to-be-matched type features in the text classification model to obtain a matching value for each.
  • The to-be-matched type feature with the largest matching value is determined as the text content feature corresponding to the keyword group.
  • Optionally, a preset number of to-be-matched type features with the largest matching values may also be used as the text content features corresponding to the keyword group.
  • If the preset number is 3, then after the matching values of the to-be-matched type features are obtained, the features are sorted by matching value; the first three features when sorted from largest to smallest (equivalently, the last three when sorted from smallest to largest) are determined as the text content features corresponding to the keyword group (a matching sketch is given below).
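  • A minimal matching sketch, assuming the initial text feature and the to-be-matched type features are vectors and that cosine similarity serves as the matching value; both assumptions are illustrative.

```python
import numpy as np

def match_type_features(initial_feature, type_features, top_k=1):
    """Score the initial text feature against each to-be-matched type
    feature and return the indices of the top-k matches, i.e. the
    type feature(s) taken as the text content feature."""
    scores = [
        np.dot(initial_feature, tf)
        / (np.linalg.norm(initial_feature) * np.linalg.norm(tf))
        for tf in type_features
    ]
    return list(np.argsort(scores)[::-1][:top_k])
```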
  • FIG. 5 is a schematic diagram of a scenario for determining text content features provided by an embodiment of the present application.
  • the keyword group 501 is converted into a keyword group vector 502.
  • The keyword group 501 includes keyword group 1, keyword group 2, ..., and keyword group m.
  • The keyword groups 501 are sequentially converted into keyword group vectors 502: keyword group vector 1, keyword group vector 2, ..., and keyword group vector m.
  • the keyword group vector 502 is input into the text classification model 503, and the initial text feature 5031 corresponding to the keyword group vector 502 is extracted through the text classification model 503.
  • Multiple to-be-matched type features are obtained in the text classification model 503, and the initial text feature 5031 is matched with each of them to obtain a matching value 5032 for each to-be-matched type feature. According to the matching values 5032, the to-be-matched type feature with the largest matching value is determined as the text content feature 504 corresponding to the keyword group.
  • The connection between the keyword group vector 502 and the initial text feature 5031 is the parameter used to extract features from the keyword group vector; the connection between the initial text feature 5031 and the matching values 5032 is the parameter used to determine, from the features of the input content, the matching values between the multiple to-be-matched type features and the input content.
  • This parameter includes but is not limited to a weight matrix.
  • The connection between the matching values 5032 and the text content feature 504 corresponding to the keyword group is a selection relationship, used to sort by matching value and obtain the to-be-matched type feature with the largest matching value.
  • Step S304 Determine the video type tag of the target video according to the characteristics of the text content.
  • the video type tag of the target video is determined according to the characteristics of the text content. Wherein, after the text content feature is obtained, the text content feature is used as the video type tag of the target video.
  • a key frame image is obtained from a target video, and the key frame image is input to an image search engine to obtain the description information of the key frame image.
  • the keyword group of the key frame image is determined according to the description information
  • the text content feature corresponding to the keyword group is obtained
  • the video type tag of the target video is determined according to the text content feature.
  • The text annotation of an image by the image search engine is an explanation of that image, so the relevant text information obtained in this way is itself an explanation of the content of the target video; therefore, determining the video type label of the target video through the relevant text information can further improve the effectiveness and accuracy of video classification.
  • FIG. 6 is a schematic diagram of a specific process of video classification provided by an embodiment of the present application.
  • Step S601, step S603, and step S604 are three parallel steps, used respectively to obtain the text content feature, the video content feature, and the voice content feature of the target video. This application does not limit their execution order; the three steps can be executed asynchronously or synchronously.
  • the video classification method includes the following steps.
  • Step S601 Obtain a key frame image from the target video.
  • the target video is split to obtain multiple frame images composing the target video, and key frame images are obtained from the multiple frame images.
  • Specifically, the multiple frame images that make up the target video are obtained and input into the feature extraction layer in the key frame determination model to obtain the image features of each frame image; the image features of each frame image are then input into the key value determination layer in the key frame determination model, the key value of each frame image is determined based on the attention mechanism in the key value determination layer, and the key frame images in the target video are determined according to the key values of the frame images.
  • Specifically, the correlation degrees between the image feature of the i-th frame image and the image features of the control images among the multiple frame images are determined, and the key value of the i-th frame image is obtained according to those correlation degrees.
  • A control image is any frame image other than the i-th frame image among the multiple frame images that make up the target video; i is a positive integer, and i is not greater than the number of frame images. When the i-th frame image is the last frame image, the key value of every frame image has been obtained, and the key frame images in the target video are determined according to the key values of the frame images.
  • Specifically, the correlation degrees between the image features of each frame image and the image features of its control images may be cached in a correlation matrix, and the key value of each frame image determined based on the correlation matrix.
  • First, an empty correlation matrix is created.
  • The correlation matrix is a two-dimensional matrix of size M*M, where M is the number of frame images that constitute the target video.
  • The frame images contained in one group can be considered similar images.
  • Within each group, the frame image whose relative position is earliest is determined as the key frame image for that group, so as to obtain the key frame images of the target video.
  • Determining key frame images via the correlation matrix is one possible method; key frame images may also be determined by other methods, such as a key frame acquisition model or a key frame acquisition application, which is not limited here. A sketch of the correlation matrix approach follows.
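  • A minimal sketch of the correlation matrix approach, assuming per-frame feature vectors and cosine similarity, with the 0.3/0.7 thresholds taken from the example below.

```python
import numpy as np

def key_frames_by_correlation(features, low=0.3, high=0.7):
    """Build the M*M correlation matrix over per-frame image features
    and pick key frames: a frame whose correlations with all other
    frames fall below `low` is a stand-alone key frame; frames linked
    by correlations above `high` form a group whose earliest frame
    becomes the group's key frame."""
    M = len(features)
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    corr = normed @ normed.T
    visited, keys = set(), []
    for i in range(M):
        if i in visited:
            continue
        others = [j for j in range(M) if j != i]
        if all(corr[i, j] < low for j in others):
            keys.append(i)  # stand-alone content: key frame
        else:
            group = [i] + [j for j in others if corr[i, j] > high]
            visited.update(group)
            keys.append(min(group))  # earliest frame in the group
    return keys
```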
  • FIG. 7a is a schematic diagram of a key frame image acquisition scene provided by an embodiment of the present application.
  • the key frame determination model includes a feature extraction layer 703 and a key value determination layer 704.
  • the target video 701 is split to obtain multiple frame images 702 that make up the target video 701, and the multiple frame images 702 are input to the feature extraction layer 703 to obtain the image features corresponding to each frame image.
  • When the key frame images are determined based on the correlation matrix in the key value determination layer 704, assume the correlation matrix shown in FIG. 7b is obtained; the key value of each frame image is obtained based on the correlation degrees between the image features of that frame image and the image features of its control images.
  • Specifically, the key value of the frame image corresponding to each row in the correlation matrix is obtained from the correlation degrees contained in that row. If every correlation degree in the row is less than the minimum similarity threshold, the similarity between that frame image and the other frame images is considered small, the frame image can be regarded as standalone content, and it is determined to be a key frame image.
  • Frame images whose mutual correlation degrees are greater than the maximum similarity threshold are grouped together, and within each resulting group the frame image that appears earliest in the target video is determined as the key frame image of that group.
  • For example, if the correlation degrees corresponding to the first frame image are all less than 0.3, the first frame image is determined as a key frame image;
  • if the correlation degrees corresponding to the second frame image are likewise all less than 0.3, the second frame image is determined as a key frame image;
  • if the correlation degrees among the third frame image, the fourth frame image, the fifth frame image, and the sixth frame image are greater than 0.7, these four frame images are divided into one group. The seventh frame image is then grouped in the same way, and so on until the M-th frame image is processed. The frame image in each single-frame group (the first frame image, the second frame image) is a key frame image, and the earliest frame image in each multi-frame group (for example, the third frame image in the group consisting of the third to sixth frame images) is a key frame image, until all key frame images are obtained.
  • Optionally, the key value of each frame image can also be obtained based on the correlation degrees and a key weight matrix.
  • For example, the key weight matrix can be designed so that when the correlation degrees between the currently processed frame image and the other frame images are small, the key value of the currently processed frame image is increased, and when those correlation degrees are large, the key value of the currently processed frame image is decreased.
  • Optionally, the multiple frame images are sorted by key value from large to small, and the first K frame images in the sorted sequence are selected as the key frame images of the target video (K is the threshold of the number of key frames), or a specified proportion of the frame images is selected as key frame images.
  • If the threshold of the number of key frames is 10, the 10 frame images whose key values are greater than the key values of the other frame images are selected from the multiple frame images as key frame images. If the specified proportion is 10%, one tenth of the frame images, taken from the front of the list sorted by key value from large to small, are selected as key frame images.
  • Step S602 Determine the keyword group corresponding to the key frame image, and determine the text content feature of the target video according to the keyword group.
  • the keyword group corresponding to the key frame image is determined, and the text content feature of the target video is determined according to the keyword group.
  • Specifically, the keyword group corresponding to the key frame image is determined, the keyword group is input into the text classification model to extract the initial text feature corresponding to the keyword group, the initial text feature is matched with the multiple to-be-matched type features in the text classification model to obtain matching values, and the to-be-matched type feature with the largest matching value is determined as the text content feature corresponding to the keyword group.
  • FIG. 8 is a structural diagram of text content feature acquisition provided by an embodiment of the present application.
  • As shown in FIG. 8, a method for determining the keyword group is obtained by combining the branches that lead from the key frame image to the keyword group.
  • The branches are: inputting the key frame image into the image search engine to obtain the description information of the key frame image and deriving the keyword group from the description information; recognizing the key frame image to obtain the image text in the key frame image and deriving the keyword group from the image text; and extracting the subtitle information in the key frame image and deriving the keyword group from the subtitle information.
  • the method for determining the keyword group is as follows:
  • the first keyword group determination method is to input the key frame image into an image search engine to obtain the description information of the key frame image, and determine the keyword group of the key frame image according to the description information.
  • Specifically, the number of occurrences of each phrase in the description information is counted, and the phrases whose number of occurrences is greater than the statistical threshold are determined as the keyword group of the key frame image. For this process, refer to the description of FIG. 4; it is not repeated here.
  • The second keyword group determination method is to input the key frame image into the image search engine to obtain the description information of the key frame image, identify the image text in the key frame image, and determine the keyword group of the key frame image based on the description information and the image text.
  • Specifically, the key frame image is input into the image search engine to obtain the description information of the key frame image, and the key frame image is input into an image text extraction tool to identify the image text in the key frame image.
  • The phrases in the description information and the phrases in the image text are added to a phrase set; as each phrase is added, its number of occurrences is counted, so that the final phrase set contains two parts: the phrases from the description information with their occurrence counts, and the phrases from the image text with their occurrence counts.
  • The evaluation value of each phrase from the description information is obtained by weighting its number of occurrences with the weight corresponding to the description information, and the evaluation value of each phrase from the image text is obtained by weighting its number of occurrences with the weight corresponding to the image text.
  • According to the evaluation values, the phrases in the phrase set are sorted, and the keyword group of the key frame image is determined from the phrase set according to the sorting result. In other words, the value obtained by multiplying the number of occurrences of a phrase in the phrase set by the type weight corresponding to that phrase can be used as the evaluation value of the phrase.
  • The third keyword group determination method is to input the key frame image into the image search engine to obtain the description information of the key frame image, obtain the subtitle information corresponding to the key frame image, and determine the keyword group of the key frame image according to the description information and the subtitle information.
  • Specifically, the key frame image is input into the image search engine to obtain the description information of the key frame image, and the subtitle information corresponding to the key frame image is obtained. The phrases in the description information and the phrases in the subtitle information are added to a phrase set, counting the number of occurrences of each phrase as it is added.
  • The evaluation value corresponding to each phrase in the phrase set is determined according to its number of occurrences and the weight corresponding to the description information or the weight corresponding to the subtitle information: phrases from the description information are weighted by the weight of the description information, and phrases from the subtitle information are weighted by the weight of the subtitle information.
  • The phrases in the phrase set are sorted by evaluation value, and the keyword group of the key frame image is determined from the phrase set according to the sorting result.
  • As before, the value obtained by multiplying the number of occurrences of a phrase in the phrase set by the type weight corresponding to that phrase can be used as the evaluation value of the phrase.
  • The fourth keyword group determination method is to input the key frame image into the image search engine to obtain the description information of the key frame image, identify the image text in the key frame image, obtain the subtitle information corresponding to the key frame image, and determine the keyword group of the key frame image according to the description information, the image text, and the subtitle information.
  • Specifically, the phrases in the description information, the phrases in the image text, and the phrases in the subtitle information are all added to a phrase set, and the evaluation value corresponding to each phrase in the phrase set is determined according to its number of occurrences and its type weight.
  • The type weights include the weight corresponding to the description information, the weight corresponding to the image text, and the weight corresponding to the subtitle information.
  • That is, the evaluation value of each phrase in the description information is obtained from its number of occurrences and the weight corresponding to the description information, the evaluation value of each phrase in the image text from its number of occurrences and the weight corresponding to the image text, and the evaluation value of each phrase in the subtitle information from its number of occurrences and the weight corresponding to the subtitle information. The phrases in the phrase set are sorted by evaluation value, and the keyword group of the key frame image is determined from the phrase set according to the sorting result.
  • Since the keyword group of the key frame image of the target video is determined from a combination of multiple texts (description information, image text, and/or subtitle information), different type weights can be assigned to the description information, the image text, and the subtitle information based on their importance.
  • For example, the weight of the description information can be set greater than the weight of the image text, which in turn is greater than the weight of the subtitle information. A weighting sketch follows.
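  • A minimal sketch of the weighted evaluation, assuming phrases have already been extracted per source; the concrete weight values only respect the ordering suggested above and are illustrative.

```python
from collections import Counter

# Hypothetical type weights; the text only requires
# description > image text > subtitle.
TYPE_WEIGHTS = {"description": 3.0, "image_text": 2.0, "subtitle": 1.0}

def rank_keyword_groups(phrases_by_source, top_n=5):
    """Merge phrases from the text sources into one phrase set, score
    each phrase as occurrences * source weight (summed over sources),
    and return the top-N phrases as the keyword group."""
    scores = Counter()
    for source, phrases in phrases_by_source.items():
        for phrase, count in Counter(phrases).items():
            scores[phrase] += count * TYPE_WEIGHTS[source]
    return [p for p, _ in scores.most_common(top_n)]
```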
  • each keyword group is converted into a keyword group vector, and the keyword group vector is input into the text classification model to obtain the text content characteristics of the target video.
  • the process of determining the text content feature can refer to the specific description in the process of determining the text content feature described in FIG. 5, which is not repeated here.
  • Step S603 Obtain video content characteristics corresponding to the target video.
  • multiple frame images that make up the target video are acquired, and the video content characteristics corresponding to the target video are acquired according to the content of each frame of the target video.
  • Specifically, at least one image pair in the target video is acquired, each image pair containing two adjacent frame images of the target video; the optical flow diagram between the two frame images of each image pair is acquired, and the optical flow diagrams corresponding to the image pairs constitute the optical flow diagram sequence of the target video. The frame image sequence and the optical flow diagram sequence of the target video are input into the video classification model to obtain the video content features corresponding to the target video, where the frame image sequence is composed of the frame images of the target video arranged in order.
  • An optical flow diagram represents the instantaneous velocity of the pixel movement of a spatially moving object on the observation imaging plane; optical flow uses the changes of pixels in the image sequence in the time domain and the correlation between adjacent frames to calculate the motion information of objects between adjacent frames. After at least one image pair of the target video is obtained, the optical flow diagram of each image pair is determined based on the previous frame image and the next frame image in the pair, the change of pixels in the time domain, and the correlation between the two frame images. A sketch using dense optical flow follows.
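  • A minimal optical flow sketch, assuming OpenCV; the Farneback dense method is one possible choice, not necessarily the one used by the embodiment.

```python
import cv2

def optical_flow_sequence(frames):
    """Compute a dense optical flow diagram for every pair of
    adjacent frames, yielding the optical flow diagram sequence."""
    flows = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        curr = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(
            prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)  # (H, W, 2) displacement field
        prev = curr
    return flows
```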
  • FIG. 9 is a schematic diagram of video content feature determination provided by an embodiment of the present application.
  • As shown in FIG. 9, the multiple frame images that make up the target video are acquired and sequentially combined to obtain the frame image sequence 901; at least one image pair is acquired from the multiple frame images, and an optical flow diagram is obtained for each image pair.
  • Based on the relative positions of the optical flow diagrams in the target video, the optical flow diagrams are sequentially combined into the optical flow diagram sequence 902; the frame image sequence 901 and the optical flow diagram sequence 902 are input into the video classification model 903 for learning, and the video content features of the target video are obtained.
  • Specifically, the frame image sequence 901 is input into the spatial convolutional layer in the video classification model 903 for feature extraction to obtain the spatial features of the target video, and the optical flow diagram sequence 902 is input into the time-domain convolutional layer in the video classification model 903 for feature extraction to obtain the time-domain features of the target video. The spatial features and the time-domain features are spliced, and the spliced features are processed based on the video classification model to obtain the video content features 9031.
  • In other words, the video is divided into consecutive frame images; for every two adjacent frame images, their optical flow diagram is calculated; two three-dimensional convolutional neural networks (the spatial convolutional layer and the time-domain convolutional layer) respectively perform feature extraction on the frame image sequence 901 and the optical flow diagram sequence 902; the two extracted features are spliced and finally classified to obtain the video content features of the target video.
  • The classification head structure in the video classification model processes the spliced features to obtain the video content features; the classification head structure usually consists of a fully connected layer and a softmax layer.
  • The video classification model uses a three-dimensional convolution model (3D ConvNet) for feature extraction, and the three-dimensional convolution kernels used in the video classification model can be obtained by "expanding" two-dimensional convolutions. A minimal two-stream sketch follows.
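  • A minimal two-stream sketch in PyTorch, assuming a (B, 3, T, H, W) RGB clip and a (B, 2, T-1, H, W) stacked optical flow tensor; the channel sizes, depths, and output dimension are illustrative assumptions rather than the embodiment's actual architecture.

```python
import torch
import torch.nn as nn

class TwoStream3D(nn.Module):
    """Spatial 3D-conv branch over frames plus a time-domain 3D-conv
    branch over optical flow; the two features are spliced and passed
    through a fully connected layer plus softmax, as the classification
    head structure described above."""
    def __init__(self, num_features=128):
        super().__init__()
        def branch(in_channels):
            return nn.Sequential(
                nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool3d(1),
                nn.Flatten())          # -> (B, 16)
        self.spatial = branch(3)       # RGB frame image sequence
        self.temporal = branch(2)      # (dx, dy) optical flow sequence
        self.head = nn.Sequential(
            nn.Linear(32, num_features),
            nn.Softmax(dim=1))

    def forward(self, frames, flows):
        fused = torch.cat(
            [self.spatial(frames), self.temporal(flows)], dim=1)
        return self.head(fused)        # video content features
```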
  • Step S604 Obtain the audio information of the target video, and obtain the voice content feature corresponding to the audio information based on the audio information.
  • the audio information in the target video is obtained, and the voice content feature corresponding to the audio information is obtained based on the audio information.
  • Optionally, the audio information can be input into a voice classification model to obtain the voice content features corresponding to the audio information; or the audio information can be converted into an image, and feature extraction is performed on that image in an image classification model that takes images as input.
  • The voice classification model may be an existing voice classification model, such as a deep feedforward sequential memory network (DFSMN).
  • Specifically, an existing voice classification model is obtained and trained on video audio samples and video category samples to obtain a voice classification model suitable for video classification: the audio sample features and the category feature of each video are obtained, and the voice classification model is trained on them, so that the finally obtained voice classification model outputs the corresponding category feature with the highest probability when it receives an audio sample. An audio front-end sketch follows.
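  • One possible audio front end, assuming the librosa package: the audio track is turned into a log-mel spectrogram "image" that either an image-style classifier or a sequence model such as a DFSMN could consume; the parameter values are illustrative.

```python
import numpy as np
import librosa  # assumed available for audio feature extraction

def audio_feature(audio_path, sr=16000, n_mels=64):
    """Convert the target video's audio track into a log-mel
    spectrogram, i.e. the 'audio as image' option described above."""
    y, _ = librosa.load(audio_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)
```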
  • Step S605 Obtain the fusion feature of the target video.
  • the fusion feature of the target video is obtained based on the text content feature obtained in step S602, the video content feature obtained in step S603, and the voice content feature obtained in step S604.
  • In a first possible fusion feature acquisition method, the text content feature and the video content feature are spliced to obtain the first fusion feature;
  • in a second possible fusion feature acquisition method, the text content feature and the voice content feature are spliced to obtain the second fusion feature;
  • in a third possible fusion feature acquisition method, the text content feature, the video content feature, and the voice content feature are all spliced to obtain the third fusion feature.
  • This application is described using the third possible fusion feature acquisition method, in which the text content feature, the video content feature, and the voice content feature are spliced together to obtain the third fusion feature.
  • The text content feature, the video content feature, and the voice content feature are each a vector, and the lengths of these three vectors may differ.
  • The three vectors are spliced to obtain a long feature vector.
  • The dimension of the long feature vector is equal to the sum of the dimensions of the three vectors.
  • When the target video contains no audio information, a preset vector can be used to represent the voice content feature.
  • The preset vector is a fixed vector, for example an all-zero vector of fixed length.
  • Optionally, default feature values are added at the first specified position in the text content feature to obtain a text content feature of the first specified length; default feature values are added at the second specified position in the voice content feature to obtain a voice content feature of the second specified length; and default feature values are added at the third specified position in the video content feature to obtain a video content feature of the third specified length. The text content feature of the first specified length, the voice content feature of the second specified length, and the video content feature of the third specified length are spliced to obtain the third fusion feature.
  • Since the text content feature length, the voice content feature length, and the video content feature length are preset fixed lengths, the features of the target video can be completed by supplementing default feature values regardless of whether the target video contains audio information.
  • For example, a third fusion feature with a dimension of 32 may have the value 0 at dimensions 12 to 15, dimensions 19 to 20, and dimensions 30 to 32. A padding sketch follows.
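  • A minimal padding-and-splicing sketch; the segment lengths 15, 5, and 12 are assumptions chosen so that they sum to the dimension-32 example above.

```python
import numpy as np

# Hypothetical specified lengths for the three feature segments
# (15 + 5 + 12 = 32, matching the dimension-32 example).
TEXT_LEN, VOICE_LEN, VIDEO_LEN = 15, 5, 12

def fuse(text_feat, voice_feat, video_feat, default=0.0):
    """Pad each feature vector with default values up to its specified
    length, then splice the three into the third fusion feature.
    Assumes each input is no longer than its specified length."""
    def pad(vec, length):
        vec = np.asarray(vec, dtype=float)
        return np.concatenate([vec, np.full(length - len(vec), default)])
    return np.concatenate([
        pad(text_feat, TEXT_LEN),
        pad(voice_feat, VOICE_LEN),
        pad(video_feat, VIDEO_LEN)])
```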
  • Step S606 Determine the video type label of the target video according to the fusion feature.
  • the third fusion feature is input into the classification model to obtain the video type label of the target video.
  • In other words, the long feature vector can be used to represent the target video and taken as the input of the classification model to classify the target video and obtain its video type label.
  • The classification model can be a traditional machine learning classification model, such as a support vector machine (SVM) or a logistic regression model, or it can be an end-to-end structure based on a neural network, such as a combination of several fully connected layers plus a softmax layer (a sketch follows).
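  • A minimal sketch of the neural-network variant of the classification model, in PyTorch; the hidden size and the number of video type labels are illustrative assumptions.

```python
import torch.nn as nn

# Several fully connected layers plus a softmax layer over the fused
# 32-dimensional third fusion feature.
classifier = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Linear(64, 16),   # 16 hypothetical video type labels
    nn.Softmax(dim=1))
```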
  • step S607 and step S608 are executed to apply the video type label of the target video.
  • Since the third fusion feature includes a text content feature of a first specified length, a voice content feature of a second specified length, and a video content feature of a third specified length, the weight matrix in the classification model can assign different weights to the text content feature, the voice content feature, and the video content feature.
  • Specifically, weights are assigned to the different features through different dimension ranges of the weight matrix in the classification model: the weight matrix includes three dimension-range weight parts, which are calculated against the text content feature, the voice content feature, and the video content feature respectively.
  • The weight part corresponding to each feature can be adjusted based on the importance of the text content feature, the voice content feature, and the video content feature to the target video. For example, because the audio information of a video is uncertain, the weight part corresponding to the text content feature can be made greater than the weight part corresponding to the video content feature, which in turn is greater than the weight part corresponding to the voice content feature, increasing the influence of the text content feature on the classification of the target video and thereby improving the accuracy of video classification.
  • For example, the third fusion feature with a dimension of 32 is input into the classification model, and the video type label of the target video is obtained based on the weight matrix in the classification model.
  • The weight matrix in the classification model can be considered a 32*1 matrix: the weight part from dimension 1 to dimension 15 is calculated against the text content feature, the weight part from dimension 16 to dimension 20 against the voice content feature, and the weight part from dimension 21 to dimension 32 against the video content feature. In this way, the classification model focuses on each feature of the video when performing video type classification, and the weight matrix can be adjusted as needed to tune the influence of each feature on the video classification result, thereby improving the accuracy of video classification. A sketch of such a weight matrix is given below.
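  • A minimal sketch of such a 32*1 weight matrix with its three dimension-range parts; the weight values only respect the ordering text > video > voice suggested above and are illustrative.

```python
import numpy as np

# Dimensions 1-15: text content feature part; 16-20: voice content
# feature part; 21-32: video content feature part.
weight_matrix = np.concatenate([
    np.full(15, 0.6),   # text content feature (weighted most)
    np.full(5, 0.1),    # voice content feature (weighted least)
    np.full(12, 0.3)])  # video content feature

def weighted_fusion(third_fusion_feature):
    """Element-wise weighting of the fused feature before the
    classification layers."""
    return third_fusion_feature * weight_matrix
```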
  • Step S607 Add the target video to the video category corresponding to the video type tag.
  • the target video is added to the video classification corresponding to the video type tag.
  • the video type tag of the target video may be sent to the target server.
  • the target server is used to manage the corresponding application.
  • Based on the video type tag received by the target server, the application may add the target video to the category corresponding to the tag, so that when the user uses the application, the target video can be found under that category. In this way, the efficiency of the application's video management can be improved.
  • The application can be a system tool, such as a video classification and labeling tool, a video recommendation system, or a video search system.
  • For example, a video classification and labeling tool can use the method implemented in this application to pre-judge a video to be annotated and then provide candidate answers for annotators based on the video type tag of that video, improving annotation efficiency;
  • as another example, after a video recommendation system classifies a video with this method, it can accurately recommend the video to users according to the video category and the user portrait.
  • Step S608 Push the target video to the target terminal.
  • In some embodiments, the target video is pushed to the target terminal based on the video type tag of the target video, where the target terminal is a terminal marked with the video type tag. The target terminal may be the user's personal terminal. The user of the target terminal maintains a watch list based on his or her own interest in watching videos, and the watch list includes multiple video type tags. For example, suppose user A is interested in videos related to comedy and martial arts, user B is interested in videos related to horror and reasoning, and user C is interested in videos related to idols and variety shows. When the video type tag of the target video is obtained and indicates that the target video is an action and martial arts video, the users marked with the video type tags of the target video are looked up: user A holds the action or martial arts video type tag, so the target video is sent to the terminal used by user A.
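  • A minimal sketch of this push logic, assuming watch lists are plain tag sets; the user names and tags come from the example above and are illustrative only.

```python
# Hypothetical watch lists: one tag set per user terminal.
watch_lists = {
    "user_A": {"comedy", "martial arts"},
    "user_B": {"horror", "reasoning"},
    "user_C": {"idol", "variety show"},
}

def terminals_to_push(video_tags):
    """Return users whose watch list shares at least one tag with the video."""
    return [user for user, tags in watch_lists.items() if tags & set(video_tags)]

# The target video is tagged as action and martial arts, so it is pushed
# to user A's terminal only.
print(terminals_to_push({"action", "martial arts"}))  # ['user_A']
```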
  • FIG. 10 is a schematic diagram of a video type tag determination process provided by an embodiment of the present application. The feature extraction process in the video type label generation process is divided into three branches. The first branch is based on the key frame images of the target video: the text information in the key frame images is obtained, the keyword group is extracted from that text information, and the keyword group is input into the text classification model to obtain the text content feature; the specific implementation of this branch can be seen in step S601 to step S602 shown in Fig. 6. The second branch obtains the video content feature of the target video based on the frame images of the target video and the video classification model, as in step S603 shown in Fig. 6. The third branch obtains the audio information of the target video and inputs the audio information into the voice classification model to obtain the voice content feature of the target video, as in step S604 shown in Fig. 6. Based on combinations of the above three branches, different video classification methods are obtained.
  • In the first video classification method, after the text content feature of the target video is obtained, the video type label of the target video is determined according to the text content feature. For the specific implementation of this method, refer to the description of each step in Figure 3, which will not be repeated here.
  • In the second video classification method, after the text content feature and the video content feature of the target video are obtained, the text content feature and the video content feature are spliced to obtain the first fusion feature, and the first fusion feature is input into the classification model to obtain the video type tag of the target video. Specifically, a default feature value is added at the first specified position in the text content feature to obtain a text content feature of the first specified length; a default feature value is added at the third specified position in the video content feature to obtain a video content feature of the third specified length; the text content feature of the first specified length and the video content feature of the third specified length are spliced to obtain the first fusion feature; and the first fusion feature is input into the classification model, where the video type label of the target video is obtained based on the classification weight matrix in the classification model.
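  • The padding-and-splicing step might look like the following sketch, where the specified lengths, the default feature value of zero, and tail-padding are illustrative assumptions.

```python
import numpy as np

DEFAULT_VALUE = 0.0  # assumed default feature value used for padding

def pad_to(feature, length, value=DEFAULT_VALUE):
    """Pad a 1-D feature with the default value up to the specified length."""
    padded = np.full(length, value)
    padded[:len(feature)] = feature   # assumes feature is not longer than length
    return padded

text_feat  = np.random.rand(12)  # raw text content feature
video_feat = np.random.rand(10)  # raw video content feature

# First fusion feature: text padded to the first specified length (here 15),
# video padded to the third specified length (here 12), then spliced.
fusion_1 = np.concatenate([pad_to(text_feat, 15), pad_to(video_feat, 12)])

# The classification weight matrix maps the fused feature to label scores.
num_labels = 4                              # hypothetical number of video types
W = np.random.rand(fusion_1.size, num_labels)
label = int(np.argmax(fusion_1 @ W))        # index of the video type label
```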
  • In this way, the video content feature is added as a basis for classifying the target video, so that the features extracted from the target video are more comprehensive, thereby improving the accuracy of the classification result of the target video.
  • The text content feature is a general description of the target video, while the video content feature is extracted from the picture of the target video itself. The picture is an essential part of any video: every type of video is composed of frame images, which carry the content the video displays. By combining the third-party features extracted for the target video (the text content feature) with the features extracted from the target video itself (the video content feature), the composition features of the target video can be captured as comprehensively as possible. Moreover, the process of extracting the video content feature attends to the change relationship between the frames of the target video, so as to reflect the overall change of the target video. Moving from the characteristics of a single frame to the characteristics of the changes between frames can further improve the accuracy of video classification.
  • In the third video classification method, after the text content feature and the voice content feature of the target video are obtained, the text content feature and the voice content feature are spliced to obtain the second fusion feature, and the second fusion feature is input into the classification model to obtain the video type tag of the target video. Specifically, a default feature value is added at the first specified position in the text content feature to obtain a text content feature of the first specified length; a default feature value is added at the second specified position in the voice content feature to obtain a voice content feature of the second specified length; the text content feature of the first specified length and the voice content feature of the second specified length are spliced to obtain the second fusion feature; and the second fusion feature is input into the classification model, where the video type label of the target video is obtained based on the classification weight matrix in the classification model.
  • The text content feature is a general description of the target video, while the voice content feature is extracted from the audio information of the target video. The audio information is generally related to the presentation of the target video, including monologues and lines. The audio information is an explanation of the target video itself and may not be reflected in the picture of the target video. Therefore, by combining the text content feature with the voice content feature, the classification result of the target video is not limited to what the frame images of the target video show, so that the classification of the target video is more accurate. For example, the picture of the target video may tell the story of a knight who acts out of justice, while the audio conveys details of that story that the picture alone cannot.
  • In the fourth video classification method, after the text content feature, video content feature, and voice content feature of the target video are obtained, the three features are spliced to obtain the third fusion feature, and the third fusion feature is input into the classification model to obtain the video type label of the target video.
  • After the video type label is obtained by any of the above methods, step S607 and step S608 in FIG. 6 can be performed to apply the video classification label of the target video.
  • In the embodiments of the present application, a key frame image is obtained from a target video and input into an image search engine to obtain the description information of the key frame image; the keyword group of the key frame image is determined according to the description information; the text content feature corresponding to the keyword group is obtained; and the video type tag of the target video is determined according to the text content feature. This application uses the relevant text information of the target video as one basis for classifying the target video, so that the text information of video samples can be added as training samples during training, increasing the amount of trainable data and improving the accuracy of video classification. Moreover, because the text information is obtained from an image search engine, and a search engine's text annotation of an image is an explanation of that image, the relevant text information of the target video is itself an explanation of the target video's content; determining the video type label through this text information therefore further improves the effectiveness and accuracy of video classification.
  • Furthermore, multi-modal classification of the video is realized, based on combining the text information, video frame images, and audio information of the target video. Each model used for video classification is trained on combinations of the text information, video frame images, audio information, and content samples of other dimensions from video samples. This multi-modal training effectively increases the amount of trainable data and further improves the prediction accuracy of each model used for video classification.
  • FIG. 11 is a schematic diagram of a video classification device provided by an embodiment of the present application. The video classification device 110 can be used in the computer in the embodiment corresponding to FIG. 3 or FIG. 6. In some implementations, the video classification device 110 may include: a first acquisition module 11, a first determination module 12, a second acquisition module 13, and a second determination module 14.
  • the first obtaining module 11 is used to obtain key frame images from the target video
  • the first determining module 12 is configured to input the key frame image into an image search engine to obtain description information of the key frame image, and determine the keyword group of the key frame image according to the description information;
  • the second acquiring module 13 is configured to acquire the text content characteristics corresponding to the keyword group
  • the second determining module 14 is configured to determine the video type tag of the target video according to the text content feature.
  • the device 110 further includes:
  • the third obtaining module 15 is configured to obtain the video content characteristics corresponding to the target video according to the content of each frame of the target video;
  • the second determining module 14 includes:
  • a splicing unit 141 configured to splice the text content feature and the video content feature to obtain a first fusion feature
  • the first training unit 142 is configured to input the first fusion feature into the classification model to obtain the video type label of the target video.
  • the third obtaining module 15 includes:
  • the first obtaining unit 151 is configured to obtain at least one image pair in the target video, and each image pair includes two adjacent frames of images in the target video;
  • the second acquiring unit 152 is configured to acquire an optical flow diagram between two frames of images in the at least one image pair, and compose the optical flow diagrams corresponding to the at least one image pair into an optical flow diagram sequence of the target video;
  • the second training unit 153 is configured to input the frame image sequence of the target video and the optical flow diagram sequence into a video classification model to obtain the video content feature corresponding to the target video, where the frame image sequence is composed of each frame image of the target video arranged in sequence.
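  • As a hedged sketch of building the optical flow diagram sequence described above, the following uses OpenCV's dense Farneback flow; the patent does not prescribe a particular optical flow algorithm, so the choice of cv2.calcOpticalFlowFarneback and its parameters is an assumption.

```python
import cv2

def optical_flow_sequence(frames):
    """Compute an optical flow diagram for each pair of adjacent frames."""
    flows = []
    for prev, nxt in zip(frames, frames[1:]):  # each adjacent image pair
        prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
        next_gray = cv2.cvtColor(nxt, cv2.COLOR_BGR2GRAY)
        # Dense optical flow between the two frames of the image pair.
        flow = cv2.calcOpticalFlowFarneback(
            prev_gray, next_gray, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        flows.append(flow)  # shape (H, W, 2): per-pixel displacement
    return flows  # the optical flow diagram sequence of the target video
```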
  • the device 110 further includes:
  • the fourth obtaining module 16 is configured to obtain audio information of the target video, input the audio information into a voice classification model, and obtain voice content characteristics corresponding to the audio information;
  • the second determining module 14 includes:
  • the splicing unit 141 is further configured to splice the text content feature and the voice content feature to obtain a second fusion feature
  • the first training unit 142 is further configured to input the second fusion feature into a classification model to obtain the video type label of the target video.
  • the device 110 further includes:
  • the fifth acquiring module 17 is configured to identify the image text in the key frame image, and acquire the subtitle information corresponding to the key frame image;
  • the first determining module 12 is specifically configured to:
  • the keyword group of the key frame image is determined according to the description information, the image text, and the caption information.
  • the first determining module 12 includes:
  • the adding unit 121 is configured to add the phrase in the description information, the phrase in the image text, and the phrase in the caption information to the phrase set;
  • the first determining unit 122 is configured to determine the evaluation value corresponding to each phrase in the phrase set according to the number of occurrences of each phrase in the phrase set and the type weight; the type weight includes the weight corresponding to the description information, the weight corresponding to the image text, and the weight corresponding to the subtitle information;
  • the second determining unit 123 is configured to sort each phrase in the phrase set according to the evaluation value, and determine the keyword group of the key frame image from the phrase set according to the sorting result.
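  • One plausible realization of the evaluation value, assuming it is the occurrence count of a phrase weighted by the type weight of its source; the weights and the top-N selection below are illustrative assumptions, not values from the patent.

```python
from collections import Counter

# Hypothetical type weights for the three text sources.
TYPE_WEIGHTS = {"description": 1.0, "image_text": 1.5, "subtitle": 2.0}

def keyword_groups(phrases_by_source, top_n=5):
    """Score each phrase by occurrences * type weight, then take the top N."""
    scores = Counter()
    for source, phrases in phrases_by_source.items():
        for phrase in phrases:                       # phrase set per source
            scores[phrase] += TYPE_WEIGHTS[source]   # one occurrence's value
    # Sort by evaluation value and keep the best-scoring phrases.
    return [p for p, _ in scores.most_common(top_n)]

demo = {
    "description": ["knight", "sword", "knight"],
    "image_text": ["sword"],
    "subtitle": ["justice", "knight"],
}
print(keyword_groups(demo, top_n=2))  # ['knight', 'sword']
```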
  • the first obtaining module 11 includes:
  • the third acquiring unit 111 is configured to acquire multiple frame images that make up the target video, and input the multiple frame images into the feature extraction layer in the key frame determination model to obtain the image features of each frame image;
  • the third determining unit 112 is configured to input the image characteristics of each frame image into a key value determination layer in the key frame determination model, and determine each frame in the key value determination layer based on an attention mechanism The key value of the image;
  • the fourth determining unit 113 is configured to determine the key frame image in the target video according to the key value of each frame image.
  • the third determining unit 112 is specifically configured to: in the key value determination layer, determine the correlation between the image feature of the i-th frame image among the plurality of frame images and the image feature of each control image, and obtain the key value of the i-th frame image according to that correlation; the control image is a frame image other than the i-th frame image among the multiple frame images that make up the target video, i is a positive integer, and i is not greater than the number of the multiple frame images; by traversing i, the key value of each frame image is obtained.
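  • A minimal sketch of the attention-style key value computation described above; the patent specifies only a "correlation" between image features, so the use of cosine similarity and mean pooling here is an assumption.

```python
import numpy as np

def key_values(image_feats):
    """image_feats: (num_frames, dim) image features from the extraction layer."""
    # L2-normalize so dot products become cosine correlations.
    normed = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    sim = normed @ normed.T              # pairwise correlation matrix
    np.fill_diagonal(sim, 0.0)           # control images exclude frame i itself
    # Key value of frame i: mean correlation with all control images.
    return sim.sum(axis=1) / (len(image_feats) - 1)

feats = np.random.rand(8, 128)           # 8 frames, 128-dim image features
kv = key_values(feats)
print(int(np.argmax(kv)))                # index of a candidate key frame
```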
  • the first determining module 12 includes:
  • the statistics unit 124 is configured to count the number of occurrences of each phrase among the phrases included in the description information, and determine the phrases whose number of occurrences is greater than the statistical number threshold as the keyword group of the key frame image.
  • the second obtaining module 13 includes:
  • the extraction unit 131 is configured to input the keyword group into a text classification model, and extract the initial text feature corresponding to the keyword group;
  • the matching unit 132 is configured to match the initial text feature with multiple types of features to be matched in the text classification model to obtain a matching value
  • the fifth determining unit 133 is configured to determine the feature of the type to be matched with the largest matching value as the text content feature corresponding to the keyword group.
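  • The matching step can be pictured as comparing the initial text feature against each to-be-matched type feature and keeping the type with the largest matching value; in this sketch the matching value is a dot product, which is an assumption.

```python
import numpy as np

def best_type(initial_feat, type_feats):
    """type_feats: (num_types, dim) to-be-matched type features in the model."""
    match_values = type_feats @ initial_feat   # one matching value per type
    best = int(np.argmax(match_values))        # type with the largest value
    return best, match_values[best]

type_feats = np.random.rand(6, 64)   # 6 candidate text content features
initial = np.random.rand(64)         # initial text feature of the keyword group
print(best_type(initial, type_feats))
```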
  • the splicing unit 141 includes:
  • the first generating subunit 1411 is configured to add a default feature value at the first specified position in the text content feature to obtain the text content feature of the first specified length;
  • the first generating subunit 1411 is further configured to add the default feature value to a second specified position in the voice content feature to obtain a voice content feature of a second specified length;
  • the second generation subunit 1412 is configured to splice the text content feature of the first specified length and the voice content feature of the second specified length to obtain the second fusion feature;
  • the first training unit 142 is specifically configured to:
  • the second fusion feature is input into the classification model, and the video type label of the target video is obtained based on the classification weight matrix in the classification model.
  • the device 110 further includes:
  • the adding module 18 is configured to add the target video to the video classification corresponding to the video type tag based on the video type tag of the target video; or,
  • the sending module 19 is configured to push the target video to a target terminal, and the target terminal is a terminal marked with the video type tag.
  • The embodiment of the application provides a video classification device. The device obtains a key frame image from a target video, inputs the key frame image into an image search engine, obtains the description information of the key frame image, determines the keyword group of the key frame image according to the description information, obtains the text content feature corresponding to the keyword group, and determines the video type tag of the target video according to the text content feature. This application uses the relevant text information of the target video as one basis for classifying the target video, so that the text information of video samples can be added as training samples during training, increasing the amount of trainable data and improving the accuracy of video classification. Moreover, because the text information is obtained from an image search engine, and a search engine's text annotation of an image is an explanation of that image, the relevant text information of the target video is itself an explanation of the target video's content; determining the video type label through this text information therefore further improves the effectiveness and accuracy of video classification.
  • FIG. 12 is a schematic structural diagram of a computer provided by an embodiment of the present application.
  • the computer in the embodiment of the present application may include: one or more processors 1201, a memory 1202, and an input/output interface 1203.
  • the aforementioned processor 1201, memory 1202, and input/output interface 1203 are connected through a bus 1204.
  • the memory 1202 is used to store computer programs, which include program instructions.
  • the input/output interface 1203 is used to input and output data, specifically the input and output of data for each model used in this application; the processor 1201 is used to execute the program instructions stored in the memory 1202 and perform the following operations:
  • obtain a key frame image from the target video; input the key frame image into the image search engine to obtain the description information of the key frame image, and determine the keyword group of the key frame image according to the description information; obtain the text content feature corresponding to the keyword group; and determine the video type tag of the target video according to the text content feature.
  • The foregoing processor 1201 may be a central processing unit (CPU); the processor may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
  • the memory 1202 may include a read-only memory and a random access memory, and provides instructions and data to the processor 1201 and the input/output interface 1203. A part of the memory 1202 may also include a non-volatile random access memory. For example, the memory 1202 can also store device type information.
  • In a specific implementation, the above-mentioned computer can execute, through its built-in functional modules, the implementation manners provided in each step of Figure 3 or Figure 6. For details, refer to the implementation manners provided in each step of Figure 3 or Figure 6, which will not be repeated here.
  • the embodiment of the present application provides a computer including a processor, an input and output interface, and a memory.
  • the processor obtains computer instructions in the memory and executes each step of the method shown in FIG. 3 or FIG. 6 to perform video classification operations.
  • The processor executes the following steps: obtaining a key frame image from the target video; inputting the key frame image into the image search engine to obtain the description information of the key frame image; determining the keyword group of the key frame image according to the description information; obtaining the text content feature corresponding to the keyword group; and determining the video type tag of the target video according to the text content feature.
  • This application uses the relevant text information of the target video as one basis for classifying the target video, so that the text information of video samples can be added as training samples during training, increasing the amount of trainable data and improving the accuracy of video classification. Moreover, because the text information is obtained from an image search engine, and a search engine's text annotation of an image is an explanation of that image, the relevant text information of the target video is itself an explanation of the target video's content; determining the video type label through this text information therefore further improves the effectiveness and accuracy of video classification.
  • An embodiment of the present application also provides a computer-readable storage medium. The computer-readable storage medium stores a computer program, the computer program includes program instructions, and the program instructions are executed by a processor to implement the video classification method shown in FIG. 3 or FIG. 6. For details, refer to the implementation manner provided in each step of FIG. 3 or FIG. 6, which will not be repeated here.
  • The foregoing computer-readable storage medium may be the video classification device provided in any of the foregoing embodiments or an internal storage unit of the foregoing computer, such as the computer's hard disk or memory. The computer-readable storage medium may also be an external storage device of the computer, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer.
  • the computer-readable storage medium may also include both an internal storage unit of the computer and an external storage device.
  • the computer-readable storage medium is used to store the computer program and other programs and data required by the computer.
  • the computer-readable storage medium can also be used to temporarily store data that has been output or will be output.
  • Each process and/or block in the method flowcharts and/or schematic structural diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment produce a device for implementing the functions specified in one or more processes in the flowcharts and/or one or more blocks in the schematic structural diagrams. These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more processes in the flowcharts and/or one or more blocks in the schematic structural diagrams. These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, and the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes in the flowcharts and/or one or more blocks in the schematic structural diagrams.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

A video classification method, comprising: acquiring a key frame image from a target video; inputting the key frame image into an image search engine to obtain descriptive information of the key frame image, and determining keyword groups of the key frame image according to the descriptive information; determining, from a plurality of preset text type features, the preset text type features corresponding to the keyword groups; and determining a video type label of the target video according to the text type features. By means of the method, on the basis of a key frame image among the plurality of frame images constituting a target video, the text type features of the target video can be determined according to the corresponding descriptive information in an image search engine, so as to obtain a video type label of the target video, thereby improving video classification efficiency.

Description

Video classification method, device, computer and readable storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on November 5, 2019, with application number 201911071940.9 and the invention title "Video Classification Method, Device, Computer, and Readable Storage Medium", the entire content of which is incorporated into this application by reference.
Technical Field
This application relates to the field of computing technology, and in particular to a video classification method, device, computer, and readable storage medium.
Background of the Invention
With the increasing variety and number of videos, the videos that people can watch and the video playback applications they use are becoming more and more diversified. Everyone likes different types of videos; if a user has to search a large number of videos for one they want to watch, it takes a lot of time, and they may even lose interest in watching. Therefore, a video playback application classifies its large number of videos, making it easier and more convenient for people to find videos of interest, and enabling video push according to each user's preferences. However, as videos keep increasing, manual video classification wastes a great deal of time and energy, so classifying videos efficiently and quickly is very important.
Summary of the Invention
The embodiments of the present application provide a video classification method and device, which can improve the efficiency of video classification.
The first aspect of the embodiments of the present application provides a video classification method, including:
obtaining a key frame image from a target video;
inputting the key frame image into an image search engine to obtain description information of the key frame image, and determining the keyword group of the key frame image according to the description information;
acquiring the text content feature corresponding to the keyword group; and
determining the video type tag of the target video according to the text content feature.
A second aspect of the embodiments of the present application provides a video classification device, the device including:
a first acquisition module, used to acquire a key frame image from a target video;
a first determining module, configured to input the key frame image into an image search engine to obtain description information of the key frame image, and determine the keyword group of the key frame image according to the description information;
a second acquisition module, used to acquire the text content feature corresponding to the keyword group; and
a second determining module, configured to determine the video type tag of the target video according to the text content feature.
A third aspect of the embodiments of the present application provides a computer, including a processor, a memory, and an input/output interface;
the processor is respectively connected to the memory and the input/output interface, where the input/output interface is used to input and output data, the memory is used to store program code, and the processor is used to call the program code to execute the video classification method described in the first aspect of the embodiments of the present application.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium. The computer-readable storage medium stores a computer program, the computer program includes program instructions, and the program instructions, when executed by a processor, execute the video classification method described in the first aspect of the embodiments of the present application.
In the embodiments of the present application, a key frame image is obtained from a target video and input into an image search engine to obtain the description information of the key frame image; the keyword group of the key frame image is determined according to the description information; the text content feature corresponding to the keyword group is obtained; and the video type tag of the target video is determined according to the text content feature. This application classifies the target video based on its text information and automates the video classification process, avoiding the large amount of time consumed by manual classification and thereby improving the efficiency of video classification.
Brief Description of the Drawings
In order to more clearly describe the technical solutions in the embodiments of the present application or in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative work.
Figure 1 is a video classification architecture diagram provided by an embodiment of the present application;
Figure 2 is a schematic diagram of a text-based video classification scene provided by an embodiment of the present application;
Figure 3 is a flowchart of a video classification method provided by an embodiment of the present application;
Figure 4 is a schematic diagram of a keyword group acquisition scenario provided by an embodiment of the present application;
Figure 5 is a schematic diagram of a text content feature determination scenario provided by an embodiment of the present application;
Figure 6 is a schematic diagram of a specific video classification process provided by an embodiment of the present application;
Figure 7a is a schematic diagram of a key frame image acquisition scene provided by an embodiment of the present application;
Figure 7b is a schematic diagram of an association matrix provided by an embodiment of the present application;
Figure 8 is an architecture diagram of text content feature acquisition provided by an embodiment of the present application;
Figure 9 is a schematic diagram of video content feature determination provided by an embodiment of the present application;
Figure 10 is a schematic diagram of a video type tag determination process provided by an embodiment of the present application;
Figure 11 is a schematic diagram of a video classification device provided by an embodiment of the present application;
Figure 12 is a schematic structural diagram of a computer provided by an embodiment of the present application.
Ways to Implement the Invention
The technical solutions in the embodiments of the present application will be described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present application, rather than all of them. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative work shall fall within the protection scope of this application.
In some embodiments, Figure 1 is a video classification architecture diagram provided by an embodiment of the present application. As shown in Figure 1, the embodiment of the present application may include a user terminal 101, a classification server 102, multiple receiving servers 103, and multiple target terminals 104. The method of each embodiment may be implemented by one or more computing devices among the user terminal 101, the classification server 102, the receiving servers 103, and the target terminals 104. An implementation by the classification server 102 is as follows. When the user terminal 101 detects a target video that needs to be classified, it sends the target video to the classification server 102. After receiving the target video, the classification server 102 splits it to obtain the multiple frame images that make up the target video. The classification server 102 obtains a key frame image from the multiple frame images and uses it as a representative component image of the target video. The classification server 102 inputs the key frame image into an image search engine to obtain the description information of the key frame image; the description information consists of multiple phrases. The classification server 102 determines the keyword group of the key frame image according to the description information, obtains the text content feature corresponding to the keyword group, and determines the video type tag of the target video according to the text content feature. In some embodiments, after the classification server 102 determines the video type tag of the target video, it can send the target video to the receiving server 103 or the target terminal 104 based on the video type tag, so that the receiving server 103 can add the target video to the corresponding category. The target terminal 104 is a terminal marked with the video type tag, and its user can be considered interested in videos related to that tag. Pushing the target video in a targeted way after its video type tag is determined improves the intelligence of video management. The user terminal 101 and the target terminal 104 may each be an electronic device, including but not limited to a mobile phone, tablet computer, desktop computer, notebook computer, palmtop computer, mobile internet device (MID), or wearable device (for example, a smart watch or smart bracelet); the receiving server 103 may be a server corresponding to a video playback application.
This application extracts the features of the target video from its relevant text information and determines the classification of the target video based on those features, so as to increase the amount of training data that can be used for video classification. Obtaining the relevant text information of the target video through an image search engine avoids the shortage of training data that can arise when few text descriptions of the target video are available. With a sufficient amount of training data for video classification, classifying the target video based on its relevant text information improves the accuracy and effectiveness of video classification.
Figure 2 is a schematic diagram of a text-based video classification scene provided by an embodiment of the present application. As shown in Figure 2, the target video 201 is played as a sequence of frame images, that is, multiple frame images constitute the target video 201. After the target video 201 is obtained, it is split into the multiple frame images 202 that make it up, each frame image 202 being one picture of the target video 201. The key frame images 203 of the target video 201 are obtained from the frame images 202 based on a key frame determination model. The key frame images 203 are input into the image search engine 204 in turn to obtain their description information. The description information is composed of multiple phrases and includes, for each key frame image 203, the phrase descriptions found by the image search engine 204. The keyword group 205 of the key frame images is determined from the description information, the text content feature corresponding to the keyword group 205 is obtained, and the video type tags 206 of the target video 201 are determined from the text content feature. Based on the video type tags 206, the target video can be sent to the target server 207 and/or the target terminal 208, so that the application corresponding to the target server 207 can add the target video to the corresponding categories. The target terminal 208 is a terminal marked with any one of the video type tags 206. For example, assume the video type tags 206 of the target video 201 include a first video type tag and a third video type tag. After the tags are obtained, the target video 201 and its video type tags 206 are sent to the target server 207. If the target server 207 is the server of a first video playback application, then upon receiving the tags 206 it adds the target video 201 to the categories corresponding to the first and third video type tags in that application. Alternatively, after the tags 206 are obtained, the terminals marked with any of them are looked up, that is, the target terminal 208 marked with the first or the third video type tag is obtained. After the target terminal 208 is obtained, the target video 201 and its video type tags 206 are sent to the target terminal 208; the target terminal 208 is marked with the first and second video type tags, so it displays the target video 201 at the first video type tag position in its recommendation page.
Figure 3 is a flowchart of a video classification method provided by an embodiment of the present application. As shown in Figure 3, the video classification process includes the following steps.
Step S301: Obtain a key frame image from the target video.
In some embodiments, after the target video is obtained, it is split into the multiple frame images that make it up, and a key frame image is obtained from those frame images. The key frame image can be determined based on significant changes between adjacent images. A key frame image is the frame in which a key action in the motion or change of a character or object in the video occurs. The key frame images of a video contain little redundant information and can represent the video's key content. Since a video is composed of many frame images, and consecutive frames may differ little in content, processing every frame would waste resources unnecessarily. Therefore, some pictures can be extracted from the video as key frame images, used as representative images of the video, and the result for the video determined from the processing results of the key frame images.
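As a toy illustration of this idea (the embodiments herein use a learned key frame determination model, not this heuristic), a simple adjacent-frame difference can already drop near-duplicate frames; the threshold value below is arbitrary.

```python
import numpy as np

def sample_key_frames(frames, threshold=12.0):
    """Keep a frame only if it differs enough from the last kept frame."""
    kept = [frames[0]]
    for frame in frames[1:]:
        # Mean absolute pixel difference against the last kept frame.
        diff = np.abs(frame.astype(float) - kept[-1].astype(float)).mean()
        if diff > threshold:        # significant change between adjacent images
            kept.append(frame)
    return kept                     # candidate key frame images
```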
Step S302: Input the key frame image into the image search engine to obtain the description information of the key frame image, and determine the keyword group of the key frame image according to the description information.
In some embodiments, the key frame image is input into an image search engine for retrieval to obtain its description information, and the keyword group of the key frame image is determined according to the description information. The description information includes multiple phrases: the key frame image is input into the image search engine, and the retrieval result consists of description sentences related to the key frame image, each containing multiple phrases. The number of occurrences of each phrase in the description information is counted, and the keyword group of the key frame image is determined based on those counts. For example, phrases whose number of occurrences is greater than a statistical number threshold are determined as the keyword group; or the phrases appearing in the description information are sorted by their number of occurrences, and when sorted from most to least, the first N phrases are taken as the keyword group of the key frame image (when sorted from least to most, the last N phrases are taken). Here N is a positive integer, a preset number of phrases to obtain. When counting the occurrences of phrases in the description information, the phrases can be filtered, for example by part of speech or by other sentence analysis methods, to prevent phrases with no practical meaning from being determined as keyword groups and thus to reduce their interference with video classification. For example, when the description information consists of Chinese sentences, treating phrases whose part of speech is an adjective or adverb as meaningless phrases can greatly reduce the number of phrases containing "的" or "地" that are identified as keyword groups. Phrases with no practical meaning generally serve to emphasize or supplement the description and have little impact on the main content of the video, yet such modifier phrases tend to appear frequently in image descriptions and could otherwise be selected as keyword groups. Since such modifier phrases may interfere with video classification and increase the amount of data to process, filtering them when counting occurrences reduces the data to be processed and saves resources without affecting the classification result.
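A minimal sketch of this counting-and-filtering step, assuming whitespace tokenization and an illustrative stop-word list standing in for the part-of-speech filtering; both are assumptions, not the patent's method.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and"}   # illustrative meaningless phrases

def keywords_by_threshold(description_sentences, min_count=3):
    """Count phrase occurrences and keep those above the statistical threshold."""
    counts = Counter(
        word
        for sentence in description_sentences
        for word in sentence.lower().split()
        if word not in STOP_WORDS        # filter before counting
    )
    return [w for w, c in counts.items() if c > min_count]
```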
A search engine is a system that automatically collects information from the Internet and, after organizing it, provides it to users for queries. An image search engine is a search-engine-based system for image retrieval; it can receive an image and, by recognizing it, obtain the image's data and related text descriptions. After the key frame image is input into the image search engine, the image data and related text descriptions of the key frame image are obtained, and the text descriptions serve as the description information of the key frame image. In some embodiments, after the keyword group of the key frame image is determined from the description information, if the number of keyword groups appearing in a piece of related information is greater than an occurrence threshold, the web page corresponding to that piece of related information can be obtained, and the keyword group of the key frame image extracted from that page, so that the complete web content strongly correlated with the key frame image becomes part of the description information and enriches the features of the key frame image.
If the description information is in English, lexical analysis can be performed on the English sentences based on English sentence patterns to obtain multiple phrases, from which the keyword group is determined. If the description information is in Chinese, lexical analysis can be performed on the Chinese sentences based on their format to obtain the phrases composing them, from which the keyword group of the key frame image is determined. If the description information includes both Chinese and English description information, then for mixed sentences composed of Chinese and English, lexical analysis is performed based on Chinese sentence patterns to obtain the phrases composing the mixed sentences, from which the keyword group of the key frame image is determined.
In some embodiments, Figure 4 is a schematic diagram of a keyword group acquisition scenario provided by an embodiment of the present application. As shown in Figure 4, taking one key frame image of the target video as an example: after the key frame image 401 is determined, it is input into the image search engine for retrieval, yielding the retrieval display page 402, which shows the related information retrieved for the key frame image 401. The related information displayed on the retrieval display page 402 is extracted as the description information 403 of the key frame image 401. Each phrase in the description information 403 is counted in turn to obtain its number of occurrences. The description information 403 is filtered by these counts, and the phrases whose number of occurrences is greater than the statistical number threshold are recorded as the keyword group statistical information 404 of the key frame image, which lists each determined keyword group and its number of occurrences in the description information 403. Taking Figure 4 as an example, after the description information 403 is statistically sorted, multiple keyword groups are determined and the keyword group statistical information 404 is generated from them and their occurrence counts, including the keyword group "relevant" appearing 4 times, the keyword group "XXX" appearing 8 times, the keyword group "XXXX XXXXX" appearing 5 times, the keyword group "XX" appearing 6 times, and so on.
Step S303: Acquire the text content feature corresponding to the keyword group.
Step S303 determines, from a plurality of preset text type features, the preset text type features corresponding to the keyword groups. In some embodiments, when obtaining the text content feature corresponding to the keyword group, the keyword group can be input into a text classification model to extract its initial text feature. The initial text feature is matched against multiple to-be-matched type features in the text classification model to obtain matching values, and the to-be-matched type feature with the largest matching value is determined as the text content feature corresponding to the keyword group. In some embodiments, a preset number of to-be-matched type features with the largest matching values may instead be taken as the text content features corresponding to the keyword group. For example, if the preset number is 3, then after the matching values for the to-be-matched type features are obtained, the features are sorted by matching value, and the first 3 features when sorted from largest to smallest (equivalently, the last 3 when sorted from smallest to largest) are determined as the text content features corresponding to the keyword group.
FIG. 5 is a schematic diagram of a text content feature determination scenario provided by an embodiment of this application. As shown in FIG. 5, the keyword groups 501 are converted into keyword group vectors 502: keyword group 1, keyword group 2, ..., and keyword group m in 501 are converted in turn into keyword group vector 1, keyword group vector 2, ..., and keyword group vector m in 502. The keyword group vectors 502 are input into the text classification model 503, which extracts the initial text features 5031 corresponding to the vectors. The multiple candidate type features in the text classification model 503 are obtained, and the initial text features 5031 are matched against them to obtain a matching value 5032 for each candidate type feature. According to these matching values, the candidate type feature with the largest matching value is determined as the text content feature 504 corresponding to the keyword groups. The connection between the keyword group vectors 502 and the initial text features 5031 represents the parameters used to extract features from the keyword group vectors; the connection between the initial text features 5031 and the matching values 5032 represents the parameters used to determine, from the features of the input content, the matching values between the candidate type features and that input. These parameters include, but are not limited to, a weight matrix. The connection between the matching values 5032 and the text content feature 504 is a selection relationship: the candidates are sorted by matching value to obtain the candidate type feature with the largest matching value.
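A hedged sketch of this matching step, using cosine similarity as a stand-in for the unspecified matching computation (all names and the choice of similarity measure are assumptions for illustration):

```python
import numpy as np

def match_text_features(initial_feature, candidate_type_features, top_k=1):
    """Match an initial text feature against the candidate type features and
    return the indices and matching values of the top_k best candidates."""
    f = initial_feature / np.linalg.norm(initial_feature)
    c = candidate_type_features / np.linalg.norm(
        candidate_type_features, axis=1, keepdims=True)
    matching_values = c @ f                    # one matching value per candidate
    order = np.argsort(matching_values)[::-1]  # descending by matching value
    return order[:top_k], matching_values[order[:top_k]]

# Example: match one 8-dimensional initial text feature against 4 candidates.
rng = np.random.default_rng(0)
idx, vals = match_text_features(rng.normal(size=8), rng.normal(size=(4, 8)), top_k=3)
```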
Step S304: Determine the video type tag of the target video according to the text content features.
In some embodiments, the video type tag of the target video is determined according to the text content features: after the text content features are obtained, they are used as the video type tag of the target video.
In the embodiments of this application, a key frame image is obtained from the target video and input into an image search engine to obtain its description information. The keyword groups of the key frame image are determined from the description information, the corresponding text content features are obtained, and the video type tag of the target video is determined from those features. This application uses the text information related to the target video as one basis for classifying it, so that during training the text information of video samples can be added as training samples, increasing the amount of trainable data and improving the accuracy of video classification. Moreover, because this text information is obtained from an image search engine, whose textual annotation of an image is an interpretation of that image, the text information related to the target video is itself a description of the video's content. Determining the video type tag of the target video from this text information therefore further improves the effectiveness and accuracy of video classification.
FIG. 6 is a schematic flowchart of a specific video classification process provided by an embodiment of this application. As shown in FIG. 6, steps S601, S603, and S604 are three parallel steps, used respectively to obtain the text content features, video content features, and voice content features of the target video. They are executed in no particular order in this application: the three steps may be executed asynchronously or synchronously, and the order of synchronous execution is not limited. The video classification method includes the following steps.
Step S601: Obtain key frame images from the target video.
In some embodiments, the target video is split into the multiple frame images composing it, and key frame images are obtained from these frame images. In some embodiments, the multiple frame images composing the target video are input into the feature extraction layer of a key frame determination model to obtain the image feature of each frame image; the image features are then input into the key value determination layer of the model, which determines a key value for each frame image based on an attention mechanism; the key frame images in the target video are determined according to these key values. Specifically, in the key value determination layer, the attention mechanism determines the degrees of association between the image feature of the i-th frame image and the image features of the comparison images, and the key value of the i-th frame image is obtained from these degrees of association. The comparison images are the frame images other than the i-th frame image among the multiple frame images composing the target video; i is a positive integer no greater than the number of frame images. When the i-th frame image is the last of the multiple frame images, the key value of every frame image has been obtained, and the key frame images in the target video are determined from these key values.
In some embodiments, the degrees of association between the image feature of each frame image and the image features of its comparison images may be cached in an association matrix, and the key value of each frame image determined from this matrix. In some embodiments, an empty association matrix is created as a two-dimensional matrix of size M*M, where M is the number of frame images composing the target video. The degrees of association between the first frame image and the second through M-th frame images are obtained and recorded in the first row and first column of the matrix, representing the degrees of association between the image feature of the first frame image and those of the second through M-th frame images, and vice versa. Next, the degrees of association between the image feature of the second frame image and those of the third through M-th frame images are recorded at positions [2][3] through [2][M] and [3][2] through [M][2], and so on, until the degree of association at position [M][M] of the matrix is obtained. Based on this matrix, the key value of each frame image is obtained. For example, the frame images may be grouped based on the matrix, the frame images within each group being considered similar images; for each group, the frame image whose relative position in the target video is earliest is determined as that group's key frame image, thereby obtaining the key frame images of the target video.
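Such an association matrix might be built as follows; cosine similarity between per-frame image features stands in for the unspecified degree-of-association measure, and zeroing the diagonal (a frame is never its own comparison image) is an assumption of this sketch:

```python
import numpy as np

def build_association_matrix(frame_features):
    """frame_features: an (M, d) array, one image feature per frame image.
    Returns the M*M matrix of degrees of association between frames."""
    f = frame_features / np.linalg.norm(frame_features, axis=1, keepdims=True)
    assoc = f @ f.T               # assoc[i][j]: association of frame i with frame j
    np.fill_diagonal(assoc, 0.0)  # comparison images exclude the frame itself
    return assoc
```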
The above association-matrix approach is one possible method for determining key frame images; other methods may also be used, such as a key frame acquisition model or a key frame acquisition application, and no limitation is imposed here.
In some embodiments, refer to FIG. 7a, which is a schematic diagram of a key frame image acquisition scenario provided by an embodiment of this application. As shown in FIG. 7a, the key frame determination model includes a feature extraction layer 703 and a key value determination layer 704. After the target video 701 is obtained, it is split into the multiple frame images 702 composing it. The frame images 702 are input into the feature extraction layer 703 to obtain the image feature of each frame image, and these features are input into the key value determination layer 704. The image feature of each frame image is compared in turn with the image features of its comparison images to determine their degrees of association, and the key value of each frame image is obtained from these degrees of association. The positions 705 of the key frame images in the target video are determined from the key values obtained in the key value determination layer 704, and the key frame images 706 are obtained from the target video according to those positions.
In some embodiments, if key frame images are determined in the key value determination layer 704 based on an association matrix, suppose the association matrix shown schematically in FIG. 7b is obtained; the key value of each frame image is derived from the degrees of association between its image feature and those of its comparison images. In some embodiments, based on the matrix shown in FIG. 7b, the key value of the frame image corresponding to each row is obtained from the degrees of association contained in that row. If every degree of association in the row is below a minimum similarity threshold, the frame image is considered to have little similarity to the other frame images and can be treated as standalone content, so it is taken as a key frame image. If some degrees of association exceed a maximum similarity threshold, the frame images corresponding to those degrees are grouped together, and within each resulting group the frame image whose relative position in the target video is earliest is determined as that group's key frame image. Suppose the minimum similarity threshold is 0.3 and the maximum similarity threshold is 0.7. Based on the matrix in FIG. 7b: all degrees of association for the first frame image are below 0.3, so it is determined to be a key frame image; all degrees of association for the second frame image are likewise below 0.3, so it too is a key frame image. The degrees of association between the third frame image and the fourth, fifth, sixth, and subsequent frame images exceed 0.7, so the third, fourth, fifth, sixth, and subsequent frame images are grouped together. Grouping then continues with the seventh frame image, and so on, until the M-th frame image has been processed. The result: the first frame image (a single-frame group) is a key frame image, the second frame image (a single-frame group) is a key frame image, the third frame image is the key frame image of the multi-frame group (the third, fourth, fifth, sixth frame images, etc.), ..., until all key frame images of the target video are obtained.
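A sketch of this threshold-based grouping over the association matrix. The text does not say how associations falling between the two thresholds are handled, so this illustration simply makes any frame not absorbed into an earlier group a key frame, which reproduces both of the cases described above:

```python
def group_key_frames(assoc, max_sim=0.7):
    """Scan the association matrix row by row: frames whose mutual
    association exceeds max_sim form a group whose earliest frame is the
    key frame; a frame similar to nothing stands alone as a key frame."""
    M = assoc.shape[0]
    key_frames, grouped = [], set()
    for i in range(M):
        if i in grouped:
            continue                  # already covered by an earlier group
        partners = [j for j in range(i + 1, M) if assoc[i][j] > max_sim]
        grouped.update(partners)      # frame i is the earliest of its group
        key_frames.append(i)
    return key_frames
```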
In some embodiments, after the degrees of association between each frame image's feature and those of the other frame images are obtained, the key value of each frame image may be computed from these degrees of association and a key weight matrix. The key weight matrix can increase the key value of the frame image currently being processed when its association with the preceding frame images is small, and decrease its key value when that association is large. After the key value of each frame image is obtained, the frame images are sorted by key value, and the leading frame images among the sorted frame images are determined as the key frame images of the target video, a leading frame image being one that ranks early when the frame images are sorted by key value in descending order. Either the first K frame images may be selected as key frame images (K being a key frame count threshold), or a specified proportion of the frame images may be selected. For example, if the key frame count threshold is 10, the 10 frame images whose key values exceed those of the other frame images are selected as key frame images; if the specified proportion is 10%, one tenth of the frame images, taken in descending order of key value, are selected as key frame images.
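The selection of leading frames by key value can be illustrated directly (function and variable names are illustrative):

```python
import numpy as np

def select_key_frames(key_values, k=None, proportion=None):
    """Sort the frames by key value in descending order and keep either the
    first k frames or a specified proportion of all frames, returning their
    indices in original video order."""
    order = np.argsort(key_values)[::-1]
    if k is None:
        k = max(1, int(len(key_values) * proportion))
    return sorted(order[:k])

key_values = np.random.default_rng(1).random(40)
print(select_key_frames(key_values, k=10))            # key frame count threshold of 10
print(select_key_frames(key_values, proportion=0.1))  # one tenth of the frames
```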
Step S602: Determine the keyword groups corresponding to the key frame images, and determine the text content features of the target video according to the keyword groups.
In some embodiments, the keyword groups corresponding to the key frame images are determined, and the text content features of the target video are determined according to the keyword groups. In some embodiments, the keyword groups corresponding to the key frame images are determined and input into the text classification model; the initial text features corresponding to the keyword groups are extracted and matched against the multiple candidate type features in the model to obtain matching values, and the candidate type feature with the largest matching value is determined as the text content feature corresponding to the keyword groups.
The keyword group determination methods above are several methods obtained from different combinations of the description information with the image text and/or subtitle information. For details, refer to FIG. 8, which is an architecture diagram for text content feature acquisition provided by an embodiment of this application. As shown in FIG. 8, the keyword group determination methods are obtained by combining the branches leading from the key frame image to the keyword groups. The branches are: inputting the key frame image into an image search engine to obtain its description information and deriving the keyword groups from that description information; recognizing the key frame image to obtain the image text it contains and deriving the keyword groups from that image text; and extracting the subtitle information of the key frame image and deriving the keyword groups from that subtitle information. The keyword group determination methods are as follows.
In the first keyword group determination method, the key frame image is input into an image search engine to obtain its description information, and the keyword groups of the key frame image are determined according to the description information. In some embodiments, the number of occurrences of each phrase contained in the description information is counted, and the phrases whose occurrence count exceeds a statistical count threshold are determined as the keyword groups of the key frame image. This process is described in detail with reference to FIG. 4 and is not repeated here.
In the second keyword group determination method, the key frame image is input into an image search engine to obtain its description information, the image text in the key frame image is recognized, and the keyword groups of the key frame image are determined according to the description information together with the image text. In some embodiments, the key frame image is input into the image search engine to obtain its description information and into an image text extraction tool to recognize the image text it contains. The phrases in the description information and the phrases in the image text are added to a phrase set, and an evaluation value is determined for each phrase in the set according to its number of occurrences and the weight corresponding to its source (the weight of the description information or the weight of the image text). The phrases in the set are sorted by evaluation value, and the keyword groups of the key frame image are determined from the phrase set according to the sorted result. As shown in FIG. 8, after the description information and the image text are obtained, both being composed of phrases, their phrases are added to the phrase set while counting the occurrences of each phrase, so that the final phrase set contains two parts: the phrases from the description information with their occurrence counts, and the phrases from the image text with theirs. Each phrase from the description information is weighted by its occurrence count and the weight of the description information (weight 1) to obtain its evaluation value, and each phrase from the image text is weighted by its occurrence count and the weight of the image text (weight 2) to obtain its evaluation value. The phrases in the set are then sorted by evaluation value, and the keyword groups are determined from the sorted result. The evaluation value of each phrase in the set may be the product of its occurrence count and the type weight corresponding to that phrase.
In the third keyword group determination method, the key frame image is input into an image search engine to obtain its description information, the subtitle information corresponding to the key frame image is obtained, and the keyword groups of the key frame image are determined according to the description information together with the subtitle information. In some embodiments, the phrases in the description information and the phrases in the subtitle information are added to a phrase set, and an evaluation value is determined for each phrase in the set according to its number of occurrences and the weight corresponding to its source (the weight of the description information or the weight of the subtitle information). The phrases in the set are sorted by evaluation value, and the keyword groups of the key frame image are determined from the phrase set according to the sorted result. As shown in FIG. 8, after the description information and the subtitle information are obtained, both being composed of phrases, their phrases are added to the phrase set while counting the occurrences of each phrase, so that the final phrase set contains two parts: the phrases from the description information with their occurrence counts, and the phrases from the subtitle information with theirs. Each phrase from the description information is weighted by its occurrence count and the weight of the description information (weight 1) to obtain its evaluation value, and each phrase from the subtitle information is weighted by its occurrence count and the weight of the subtitle information (weight 3) to obtain its evaluation value. The phrases in the set are then sorted by evaluation value, and the keyword groups are determined from the sorted result. Again, the evaluation value of each phrase in the set may be the product of its occurrence count and the type weight corresponding to that phrase.
In the fourth keyword group determination method, the key frame image is input into an image search engine to obtain its description information, the image text in the key frame image is recognized, the subtitle information corresponding to the key frame image is obtained, and the keyword groups of the key frame image are determined according to the description information, the image text, and the subtitle information together. In some embodiments, the phrases in the description information, the phrases in the image text, and the phrases in the subtitle information are added to a phrase set; the evaluation value of each phrase in the set is determined according to its number of occurrences and its type weight; the phrases are sorted by evaluation value; and the keyword groups of the key frame image are determined from the phrase set according to the sorted result. The type weights comprise the weight corresponding to the description information, the weight corresponding to the image text, and the weight corresponding to the subtitle information. As shown in FIG. 8, the evaluation values of the phrases from the description information are obtained from their occurrence counts and weight 1, those of the phrases from the image text from their occurrence counts and weight 2, and those of the phrases from the subtitle information from their occurrence counts and weight 3; the phrases in the set are sorted by evaluation value, and the keyword groups of the key frame image are determined from the sorted result.
When the keyword groups of the key frame image are determined from a combination of the description information with the image text and/or subtitle information, different type weights may be assigned to the description information, image text, and subtitle information according to their importance; for example, the weight of the description information may be set greater than the weight of the image text, which in turn may be set greater than the weight of the subtitle information.
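The weighted evaluation described in the second through fourth methods can be sketched as below. The concrete weight values, and the decision to sum contributions when the same phrase occurs in several sources, are assumptions of this illustration:

```python
from collections import Counter

# Illustrative type weights: description > image text > subtitles.
TYPE_WEIGHTS = {"description": 3.0, "image_text": 2.0, "subtitle": 1.0}

def rank_phrases(sources, top_n=5):
    """sources maps a source type to its list of phrases. The evaluation
    value of a phrase is its occurrence count multiplied by the type weight
    of its source; phrases are then sorted by evaluation value."""
    scores = Counter()
    for source_type, phrases in sources.items():
        for phrase, count in Counter(phrases).items():
            scores[phrase] += count * TYPE_WEIGHTS[source_type]
    return [phrase for phrase, _ in scores.most_common(top_n)]
```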
After the keyword groups of the key frame image are determined, each keyword group is converted into a keyword group vector, and the keyword group vectors are input into the text classification model to obtain the text content features of the target video. For the determination of the text content features, refer to the specific description given with FIG. 5; it is not repeated here.
Step S603: Obtain the video content features corresponding to the target video.
In some embodiments, the multiple frame images composing the target video are obtained, and the video content features corresponding to the target video are obtained according to the content of each frame image. In some embodiments, at least one image pair is obtained from the target video, each image pair containing two adjacent frame images of the target video; the optical flow map between the two frame images of each image pair is obtained, and the optical flow maps corresponding to the image pairs compose the optical flow map sequence of the target video. The frame image sequence and the optical flow map sequence of the target video are input into a video classification model to obtain the video content features corresponding to the target video, the frame image sequence being the frame images of the target video arranged in order. An optical flow map captures the instantaneous velocity of the pixel motion of spatially moving objects on the observed imaging plane; optical flow methods use the changes of pixels in the time domain across an image sequence, together with the correlation between adjacent frames, to find the correspondence between the previous frame and the current frame and thereby compute the motion information of objects between adjacent frames. After the image pairs of the target video are obtained, the changes of pixels in the time domain from the earlier to the later frame image of each pair, together with the correlation between the two frame images, determine the optical flow map of that pair.
In some embodiments, refer to FIG. 9, which is a schematic diagram of video content feature determination provided by an embodiment of this application. As shown in FIG. 9, the frame images composing the target video are obtained and combined in order into the frame image sequence 901; at least one image pair is obtained from the frame images, the optical flow map of each image pair is obtained, and the optical flow maps are combined, in the order of their relative positions in the target video, into the optical flow map sequence 902. The frame image sequence 901 and the optical flow map sequence 902 are input into the video classification model 903 for learning, yielding the video content features of the target video. Specifically, the frame image sequence 901 is input into the spatial-stream convolutional layers of the video classification model 903 for feature extraction, yielding the spatial features of the target video, and the optical flow map sequence 902 is input into the temporal-stream convolutional layers for feature extraction, yielding the temporal features of the target video. The spatial features and temporal features are concatenated and processed by the video classification model to obtain the video content features 9031. In other words, the video is divided into consecutive frame images; for every two adjacent frame images, their optical flow map is computed; two three-dimensional convolutional neural networks (the spatial-stream and temporal-stream convolutional layers) extract features from the time-ordered frame images 901 and the optical flow maps 902 respectively; the two extracted features are concatenated; and classification is finally performed to obtain the video content features of the target video. The concatenated features are processed by the classification header of the video classification model, which in such video classification network structures typically consists of a fully connected layer and a softmax layer. The video classification model uses a three-dimensional convolution model (3D ConvNet) for feature extraction, and the three-dimensional convolution kernels used in the model may be constructed by "inflating" two-dimensional convolutions.
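The optical flow map sequence for the image pairs might be computed as follows. OpenCV's Farneback dense optical flow is used here as a stand-in, since the specification does not name a particular algorithm, and the parameter values are the library's conventional defaults:

```python
import cv2

def optical_flow_sequence(video_path):
    """Split the video into frame images and compute a dense optical flow
    map for every pair of adjacent frames."""
    cap = cv2.VideoCapture(video_path)
    frames, flows = [], []
    ok, frame = cap.read()
    while ok:
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
        ok, frame = cap.read()
    cap.release()
    for prev, nxt in zip(frames, frames[1:]):   # each adjacent image pair
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)                      # an H x W x 2 motion map
    return frames, flows
```

The frame image sequence and optical flow map sequence returned here would then be fed to the spatial-stream and temporal-stream layers respectively.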
Step S604: Obtain the audio information of the target video, and obtain the voice content features corresponding to the audio information.
In some embodiments, the audio information in the target video is obtained, and the voice content features corresponding to the audio information are obtained from it. In some embodiments, after the audio information in the target video is obtained, it is input into a voice classification model to obtain the corresponding voice content features; alternatively, the audio information is converted into an image, and the converted image is input into a related image classification model for feature extraction. The voice classification model may be an existing voice classification model, such as Deep Feedforward Sequential Memory Networks (DFSMN). An existing voice classification model may be obtained and trained on video audio samples and video category samples to produce a voice classification model suited to video classification: specifically, the audio sample features of each video and the category features of that video are obtained, and the voice classification model is trained on the audio sample features and category features of each video, so that the final model maximizes the probability of producing the corresponding category features when it receives the audio samples.
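The alternative of converting the audio information into an image for an image classification model is commonly realized as a log-mel spectrogram; the sketch below assumes librosa and illustrative parameter values, none of which are prescribed by the text:

```python
import numpy as np
import librosa

def audio_to_spectrogram_image(audio_path, sr=16000, n_mels=64):
    """Convert the target video's audio track into a log-mel spectrogram
    'image' suitable as input to an image classification model."""
    y, sr = librosa.load(audio_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, time_steps)
```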
Step S605: Obtain the fusion features of the target video.
In some embodiments, the fusion features of the target video are obtained from the text content features obtained in step S602, the video content features obtained in step S603, and the voice content features obtained in step S604. In a first possible fusion feature acquisition method, the text content features and the video content features are concatenated to obtain the first fusion feature; in a second possible method, the text content features and the voice content features are concatenated to obtain the second fusion feature; in a third possible method, the text content features, video content features, and voice content features are concatenated to obtain the third fusion feature. This application is described in terms of the third possible method: the text content features, video content features, and voice content features are concatenated to obtain the third fusion feature.
The text content features, video content features, and voice content features are each a vector, and the lengths of these three vectors may differ. The three vectors are concatenated into one long feature vector whose dimension equals the sum of the dimensions of the three vectors. Since not every video contains audio information, when the target video contains none, the voice content features may be represented by a preset vector, which is a fixed vector such as an all-zero vector of fixed length.
In some embodiments, default feature values are added at a first specified position in the text content features to obtain text content features of a first specified length; default feature values are added at a second specified position in the voice content features to obtain voice content features of a second specified length; and default feature values are added at a third specified position in the video content features to obtain video content features of a third specified length. The text content features of the first specified length, the voice content features of the second specified length, and the video content features of the third specified length are concatenated to obtain the third fusion feature. In these embodiments, because the lengths of the text content features, video content features, and voice content features are preset fixed lengths, the features can be padded directly with the default feature values whether or not the target video contains audio information.
As an optional example, suppose text content features of length 11, voice content features of length 3, and video content features of length 9 are obtained, the first specified length is 15, the second specified length is 5, the third specified length is 12, and the default feature value is 0; the long feature vector then has dimension 32. Under these assumptions, 4 default feature values of 0 are appended to the text content features to obtain an (11+4)-dimensional text content feature, 2 default feature values of 0 are appended to the voice content features to obtain a (3+2)-dimensional voice content feature, and 3 default feature values of 0 are appended to the video content features to obtain a (9+3)-dimensional video content feature. The (11+4) text content feature, the (3+2) voice content feature, and the (9+3) video content feature are concatenated into a third fusion feature of dimension 32, whose values at dimensions 12 to 15, 19 to 20, and 30 to 32 can be considered to be 0.
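This padding and concatenation is mechanical enough to restate in code; the sketch reproduces exactly the numbers of the example above, with dummy all-ones vectors standing in for the real features:

```python
import numpy as np

def pad_to(vec, length, fill=0.0):
    """Append default feature values so that vec reaches the specified length."""
    return np.concatenate([vec, np.full(length - len(vec), fill)])

text  = np.ones(11)   # text content features, length 11 (first specified length 15)
voice = np.ones(3)    # voice content features, length 3 (second specified length 5)
video = np.ones(9)    # video content features, length 9 (third specified length 12)

fusion = np.concatenate([pad_to(text, 15), pad_to(voice, 5), pad_to(video, 12)])
assert fusion.shape == (32,)
# 1-based dimensions 12-15, 19-20 and 30-32 of the fusion feature hold 0.
```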
Step S606: Determine the video type tag of the target video according to the fusion features.
In some embodiments, the third fusion feature is input into a classification model to obtain the video type tag of the target video. In some embodiments, after the long feature vector is obtained in the above steps, it can be used to represent the target video and serve as the input of the classification model, which classifies the target video to obtain its video type tag. The classification model may be a traditional machine learning classifier, such as a Support Vector Machine (SVM) or a logistic regression model, or an end-to-end structure based on a neural network, such as several fully connected layers followed by a softmax layer. After the video type tag of the target video is obtained, steps S607 and S608 are performed to apply the video type tag of the target video.
In some embodiments, if the third fusion feature includes the text content features of the first specified length, the voice content features of the second specified length, and the video content features of the third specified length, the weight matrix in the classification model assigns different weights to the text content features, voice content features, and video content features according to their importance to the target video. Based on the first, second, and third specified lengths, different weights are assigned to the different features over the corresponding dimension ranges of the weight matrix; the weight matrix in the classification model thus comprises three weight sections covering three dimension ranges, which are computed against the text content features, voice content features, and video content features respectively. The weight sections corresponding to the text content features, video content features, and voice content features may be adjusted according to the importance of each feature to the target video. For example, because of the uncertainty of a video's audio information, the weight section corresponding to the text content features may be made larger than the weight section corresponding to the video content features, which in turn may be made larger than the weight section corresponding to the voice content features, increasing the influence of the text content features on the classification of the target video and thereby improving the accuracy of video classification.
For example, suppose the first specified length is 15, the second specified length is 5, and the third specified length is 12. Following the example in step S605 above, the third fusion feature of dimension 32 is input into the classification model, and the video type tag of the target video is obtained based on the model's weight matrix. This weight matrix can be regarded as a 32*1 matrix: the weight section covering dimensions 1 to 15 is computed against the text content features, the section covering dimensions 16 to 20 against the voice content features, and the section covering dimensions 21 to 32 against the video content features. The classification model can thereby emphasize particular features of the video when classifying its type, and the weight matrix can be adjusted as needed to tune the influence of each feature on the video classification result, further improving the accuracy of video classification.
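A hypothetical linear classification head over the 32-dimensional fusion feature makes the sectioned weight matrix concrete. Here W generalizes the 32*1 matrix to one row per candidate video type tag; the column partition mirrors the dimension ranges above, and in practice the weights would be learned rather than sampled:

```python
import numpy as np

def classify(fusion, W):
    """W: (num_tags, 32). Columns 0-14 weight the text content features,
    columns 15-19 the voice content features, and columns 20-31 the video
    content features (0-based forms of the 1-based ranges above)."""
    logits = W @ fusion
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # softmax over the video type tags
    return int(np.argmax(probs)), probs

rng = np.random.default_rng(2)
W = rng.normal(size=(6, 32))             # 6 hypothetical video type tags
tag_index, probs = classify(rng.random(32), W)
```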
Step S607: Add the target video to the video category corresponding to the video type tag.
In some embodiments, based on the video type tag of the target video, the target video is added to the video category corresponding to the tag. In some embodiments, the video type tag of the target video may be sent to a target server that manages a corresponding application; based on the video type tag received by the target server, the application adds the target video to the category corresponding to the tag, so that when users use the application they can find the target video under the category corresponding to the video type tag. In this way the efficiency of the application's video management can be improved. The application may be a system tool, such as a video classification and annotation tool, a video recommendation system, or a video search system. Taking a video classification and annotation tool as an example, the method implemented in this application can pre-judge a video to be annotated and then offer annotators candidate answers based on the video type tag of that video, speeding up annotation. As another example, after videos are classified using this application, a video recommendation system can make precise video recommendations to users according to the video categories and user portraits.
Step S608: Push the target video to the target terminals.
In some embodiments, based on the video type tag of the target video, the target video is pushed to target terminals, a target terminal being a terminal marked with that video type tag. A target terminal may be a user's personal terminal. The user of the target terminal adds a watch list based on their own interest in watching videos, and the watch list includes multiple video type tags. When the video type tag of the target video is obtained, at least one target terminal marked with that video type tag is obtained, and the target video is sent to the at least one target terminal.
For example, suppose user A is interested in videos related to comedy and martial arts, user B in videos related to horror and mystery, and user C in videos related to idols and variety shows. When the video type tag of the target video is obtained and indicates that the target video is action and martial arts, the users marked with the target video's type tags are looked up; it is found that the user marked with the action or martial arts video type tags is user A, so the target video is sent to the target terminal used by user A.
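The tag matching in this example reduces to a set intersection between a video's type tags and each terminal's watch list (all identifiers below are illustrative):

```python
def terminals_for_video(video_tags, watch_lists):
    """watch_lists maps a terminal to the video type tags its user follows;
    return the terminals whose watch list intersects the video's tags."""
    return [t for t, tags in watch_lists.items() if set(video_tags) & set(tags)]

watch_lists = {"terminal_A": {"comedy", "martial arts"},
               "terminal_B": {"horror", "mystery"},
               "terminal_C": {"idol", "variety"}}
# A video tagged action / martial arts is pushed to user A's terminal only.
print(terminals_for_video({"action", "martial arts"}, watch_lists))
```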
In some embodiments, refer to FIG. 10, which is a schematic diagram of a video type tag determination process provided by an embodiment of this application. As shown in FIG. 10, the feature extraction in the video type tag generation process is divided into three branches. The first branch obtains, from the key frame images of the target video, the text information they contain, extracts the keyword groups from that text information, and inputs the keyword groups into the text classification model to obtain the text content features; for the specific implementation, see steps S601 to S602 shown in FIG. 6. The second branch obtains the video content features of the target video based on the frame images composing it and the video classification model; for the specific implementation, see step S603 shown in FIG. 6. The third branch obtains the audio information of the target video and inputs it into the voice classification model to obtain the voice content features of the target video; for the specific implementation, see step S604 shown in FIG. 6. Different video classification methods are obtained from combinations of these three branches.
In the first video classification method, after the text content features of the target video are obtained, the video type tag of the target video is determined according to those features; for the specific implementation, see the description of the steps in FIG. 3, not repeated here.
In the second video classification method, after the text content features and video content features of the target video are obtained, they are concatenated into the first fusion feature, which is input into the classification model to obtain the video type tag of the target video. In some embodiments, default feature values are added at the first specified position in the text content features to obtain text content features of the first specified length, and at the third specified position in the video content features to obtain video content features of the third specified length; the two are concatenated into the first fusion feature, which is input into the classification model, and the video type tag of the target video is obtained based on the model's classification weight matrix. Adding the video content features to the text content features as a basis for classifying the target video makes the features extracted for the video more comprehensive and thereby improves the accuracy of the classification result. The text content features are a general description of the target video, while the video content features are extracted from the video's pictures themselves. For any video, the pictures are an essential component: every kind of video is composed of pictures, namely the frame images that make up the video and present its content. Combining the third-party feature description extracted for the target video (the text content features) with the video's own feature description (the video content features) therefore yields as comprehensive a set of compositional features of the target video as possible. Moreover, the extraction of video content features attends to the relationships of change between the frames of the target video and so can express its overall variation; processing the target video jointly, from the features of its individual frame images to the features of change between frames, can further improve the accuracy of video classification.
In the third video classification method, after the text content feature and the speech content feature of the target video are obtained, the text content feature and the speech content feature are concatenated to obtain a second fusion feature, and the second fusion feature is input into a classification model to obtain the video type tag of the target video. In some embodiments, a default feature value is added at a first specified position in the text content feature to obtain a text content feature of a first specified length; a default feature value is added at a second specified position in the speech content feature to obtain a speech content feature of a second specified length; the text content feature of the first specified length and the speech content feature of the second specified length are concatenated to obtain the second fusion feature; and the second fusion feature is input into the classification model, where the video type tag of the target video is obtained based on a classification weight matrix in the classification model. By adding the speech content feature to the text content feature as a basis for classifying the target video, the features extracted for the target video are more comprehensive when the target video is classified, thereby improving the accuracy of the classification result. The text content feature is a general description of the target video, whereas the speech content feature is extracted from the audio information of the target video. This audio information is generally an introduction related to the target video, including monologues, dialogue lines, and the like; it is an explanation of the target video itself that often cannot be reflected in the pictures of the target video. By combining the text content feature with the speech content feature, the features of the frame images that make up the target video can be obtained from the text content feature, while the speech content feature constrains the classification result to the content of the target video itself, making the classification of the target video more precise. For example, suppose the pictures of the target video tell the story of a chivalrous swordsman righting wrongs. Extracting the text information of the target video and obtaining the text content feature might yield only the single classification "martial arts". However, a piece of audio at the beginning of the target video narrates the background of the story, such as: "In the XX dynasty, war raged everywhere, the common people were displaced by the fighting, and some took advantage of the chaos to oppress them; one swordsman...". From this audio information, the time, place, and so on of the target video can be extracted, so that another classification of the target video, "history", can be obtained, making the classification result of the target video more comprehensive and accurate.
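The embodiment does not commit to a particular audio front end for the speech content feature. As a hedged sketch only, MFCC statistics pooled over the soundtrack could serve as a stand-in for the input to the speech classification model; the sampling rate and coefficient count are invented for the example:

```python
import librosa
import numpy as np

def speech_content_feature(audio_path, sr=16000, n_mfcc=40):
    """Load the target video's soundtrack and pool MFCC frames into one
    fixed-length vector as a stand-in for the speech content feature."""
    waveform, _ = librosa.load(audio_path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=n_mfcc)
    # Mean and standard deviation over time give a fixed-length summary.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
```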
In the fourth video classification method, after the text content feature, the video content feature, and the speech content feature of the target video are obtained, the three features are concatenated to obtain a third fusion feature, and the third fusion feature is input into a classification model to obtain the video type tag of the target video. For a specific implementation of this method, reference may be made to the description of the steps in FIG. 6, which will not be repeated here.

After the video type tag of the target video has been obtained by any of the above video classification methods, step S607 and step S608 in FIG. 6 may be performed to apply the video type tag of the target video.

In this embodiment of the present application, a key frame image is obtained from a target video, the key frame image is input into an image search engine to obtain description information of the key frame image, a keyword group of the key frame image is determined according to the description information, the text content feature corresponding to the keyword group is obtained, and the video type tag of the target video is determined according to the text content feature. The present application uses text information related to the target video as one basis for classifying the target video, so that during training the text information of video samples can be added as training samples, increasing the amount of trainable data and improving the accuracy of video classification. Moreover, because this text information is obtained from an image search engine, and the engine's textual annotation of an image is an explanation of that image, the text information related to the target video is itself a description of the target video's content; determining the video type tag of the target video from this text information can therefore further improve the effectiveness and accuracy of video classification. At the same time, in the present application, the features of the different dimensions of content that make up the target video can be combined in various ways to realize multi-modal classification of the video. Specifically, the target video is classified based on the combination of its text information, video frame images, audio information, and other dimensions of content. With more features available for classification, the features used to classify the target video are more comprehensive, combining the advantages of text content features with video content features and of text content features with speech content features, thereby improving the accuracy and effectiveness of video classification. Furthermore, the models used for video classification are also trained on combinations of samples across these dimensions (text information, video frame images, audio information, and so on); this multi-modal training effectively increases the amount of trainable data and further improves the prediction accuracy of each model used for video classification.
Referring to FIG. 11, FIG. 11 is a schematic diagram of a video classification apparatus provided by an embodiment of the present application. As shown in FIG. 11, the video classification apparatus 110 may be used in the computer of the embodiment corresponding to FIG. 3 or FIG. 6. In some embodiments, the video classification apparatus 110 may include: a first acquisition module 11, a first determination module 12, a second acquisition module 13, and a second determination module 14.

The first acquisition module 11 is configured to acquire a key frame image from a target video.

The first determination module 12 is configured to input the key frame image into an image search engine to obtain description information of the key frame image, and to determine a keyword group of the key frame image according to the description information.

The second acquisition module 13 is configured to acquire a text content feature corresponding to the keyword group.

The second determination module 14 is configured to determine a video type tag of the target video according to the text content feature.

The apparatus 110 further includes:

a third acquisition module 15, configured to acquire a video content feature corresponding to the target video according to the content of each frame image in the target video.

The second determination module 14 then includes:

a concatenation unit 141, configured to concatenate the text content feature and the video content feature to obtain a first fusion feature; and

a first training unit 142, configured to input the first fusion feature into a classification model to obtain the video type tag of the target video.
The third acquisition module 15 includes:

a first acquisition unit 151, configured to acquire at least one image pair in the target video, each image pair containing two adjacent frame images in the target video;

a second acquisition unit 152, configured to acquire an optical flow map between the two frame images in each of the at least one image pair, and to assemble the optical flow maps corresponding to the at least one image pair into an optical flow map sequence of the target video; and

a second training unit 153, configured to input the frame image sequence of the target video and the optical flow map sequence into a video classification model to obtain the video content feature corresponding to the target video, the frame image sequence being obtained by arranging the frame images that make up the target video in order.
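The embodiment leaves the choice of optical flow algorithm open; the following sketch uses OpenCV's dense Farneback flow purely as a stand-in to show how the optical flow maps for adjacent frame pairs might be assembled into a sequence:

```python
import cv2

def optical_flow_sequence(frames):
    """Compute a dense optical flow map for each pair of adjacent frames.

    frames: BGR frame images (numpy arrays) in playback order.
    Returns one H x W x 2 flow field per adjacent image pair.
    """
    flows = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        curr = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(
            prev, curr, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        flows.append(flow)
        prev = curr
    return flows
```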
The apparatus 110 further includes:

a fourth acquisition module 16, configured to acquire audio information of the target video, and to input the audio information into a speech classification model to obtain a speech content feature corresponding to the audio information.

The second determination module 14 includes:

the concatenation unit 141, further configured to concatenate the text content feature and the speech content feature to obtain a second fusion feature; and

the first training unit 142, further configured to input the second fusion feature into the classification model to obtain the video type tag of the target video.

The apparatus 110 further includes:

a fifth acquisition module 17, configured to recognize image text in the key frame image and to acquire subtitle information corresponding to the key frame image.

In determining the keyword group of the key frame image according to the description information, the first determination module 12 is specifically configured to:

determine the keyword group of the key frame image according to the description information, the image text, and the subtitle information.
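The OCR engine used to recognize the image text is likewise unspecified. As an assumption-laden sketch, Tesseract (via pytesseract) could be applied to the key frame image, with the subtitle text taken from a presumed bottom strip of the frame — both the engine choice and the strip location are hypothetical:

```python
import cv2
import pytesseract

def image_text_and_subtitles(frame_bgr, subtitle_fraction=0.2):
    """Recognize image text over the whole key frame, plus subtitle text
    from an assumed strip at the bottom of the frame. Requires the
    appropriate Tesseract language data to be installed."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    image_text = pytesseract.image_to_string(gray)
    strip = gray[int(gray.shape[0] * (1 - subtitle_fraction)):, :]
    subtitle_text = pytesseract.image_to_string(strip)
    return image_text, subtitle_text
```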
The first determination module 12 includes:

an adding unit 121, configured to add the phrases in the description information, the phrases in the image text, and the phrases in the subtitle information to a phrase set;

a first determination unit 122, configured to determine an evaluation value corresponding to each phrase in the phrase set according to the number of occurrences of the phrase and a type weight, the type weight including a weight corresponding to the description information, a weight corresponding to the image text, and a weight corresponding to the subtitle information; and

a second determination unit 123, configured to rank the phrases in the phrase set according to the evaluation values, and to determine the keyword group of the key frame image from the phrase set according to the ranking result.
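One plausible, non-authoritative reading of this weighted evaluation is that each occurrence of a phrase contributes the type weight of the source it came from, and the top-ranked phrases form the keyword group; the weights and the cutoff below are invented for illustration:

```python
from collections import Counter

def keyword_group(desc_phrases, image_phrases, subtitle_phrases,
                  weights=(1.0, 0.8, 0.6), top_k=5):
    """Score each phrase as occurrence count times the type weight of
    its source, then keep the top-ranked phrases as the keyword group."""
    scores = Counter()
    for phrases, weight in zip(
            (desc_phrases, image_phrases, subtitle_phrases), weights):
        for phrase, count in Counter(phrases).items():
            scores[phrase] += count * weight  # evaluation value
    return [phrase for phrase, _ in scores.most_common(top_k)]
```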
The first acquisition module 11 includes:

a third acquisition unit 111, configured to acquire the multiple frame images that make up the target video, and to input the multiple frame images into a feature extraction layer in a key frame determination model to obtain an image feature of each frame image;

a third determination unit 112, configured to input the image feature of each frame image into a key value determination layer in the key frame determination model, the key value of each frame image being determined in the key value determination layer based on an attention mechanism; and

a fourth determination unit 113, configured to determine the key frame image in the target video according to the key value of each frame image.

The third determination unit 112 is specifically configured to:

in the key value determination layer, based on the attention mechanism, determine the degree of association between the image feature of the i-th frame image among the multiple frame images and the image feature of each reference image, and obtain the key value of the i-th frame image from these degrees of association, where the reference images are the frame images, among the multiple frame images that make up the target video, other than the i-th frame image, i is a positive integer, and i is not greater than the number of the multiple frame images; and

when the i-th frame image is the last of the multiple frame images, obtain the key value of each frame image.
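A minimal sketch of such an attention-style key value follows; the association degree is taken here to be cosine similarity averaged over all reference images, which is an assumption of the example rather than the scoring function fixed by the key value determination layer:

```python
import numpy as np

def frame_key_values(image_features):
    """image_features: (num_frames, dim) array of per-frame features.

    Each frame's key value is its average association degree (cosine
    similarity) with every other frame, i.e. with its reference images."""
    feats = np.asarray(image_features, dtype=np.float32)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8
    sim = feats @ feats.T                 # pairwise association degrees
    n = sim.shape[0]
    # Exclude each frame's similarity with itself before averaging.
    return (sim.sum(axis=1) - sim.diagonal()) / max(n - 1, 1)
```

A key frame could then be selected as the frame (or frames) with the highest key value.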
In determining the keyword group of the key frame image according to the description information, the first determination module 12 includes:

a statistics unit 124, configured to count the number of occurrences of each phrase contained in the description information, and to determine the phrases in the description information whose number of occurrences is greater than a statistical count threshold as the keyword group of the key frame image.
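For example, this counting rule might be sketched as follows, with a hypothetical threshold of 2:

```python
from collections import Counter

def keywords_by_count(description_phrases, count_threshold=2):
    """Keep the phrases whose occurrence count exceeds the threshold."""
    counts = Counter(description_phrases)
    return [phrase for phrase, c in counts.items() if c > count_threshold]
```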
The second acquisition module 13 includes:

an extraction unit 131, configured to input the keyword group into a text classification model and to extract an initial text feature corresponding to the keyword group;

a matching unit 132, configured to match the initial text feature against multiple type features to be matched in the text classification model to obtain matching values; and

a fifth determination unit 133, configured to determine the type feature to be matched that has the largest matching value as the text content feature corresponding to the keyword group.
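Illustratively, the matching value could be a cosine similarity between the initial text feature and each type feature to be matched; the similarity measure is an assumption of this sketch, not something the embodiment prescribes:

```python
import numpy as np

def best_matching_type_feature(initial_feature, type_features):
    """type_features: (num_types, dim) matrix of type features to be
    matched. Returns the type feature with the largest matching value."""
    q = np.asarray(initial_feature, dtype=np.float32)
    t = np.asarray(type_features, dtype=np.float32)
    q = q / (np.linalg.norm(q) + 1e-8)
    t = t / (np.linalg.norm(t, axis=1, keepdims=True) + 1e-8)
    matching_values = t @ q               # cosine similarity per type
    return type_features[int(np.argmax(matching_values))]
```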
In concatenating the text content feature and the speech content feature to obtain the second fusion feature, the concatenation unit 141 includes:

a first generation subunit 1411, configured to add a default feature value at a first specified position in the text content feature to obtain a text content feature of a first specified length;

the first generation subunit 1411, further configured to add the default feature value at a second specified position in the speech content feature to obtain a speech content feature of a second specified length; and

a second generation subunit 1412, configured to concatenate the text content feature of the first specified length and the speech content feature of the second specified length to obtain the second fusion feature.

The first training unit 142 is specifically configured to:

input the second fusion feature into the classification model, and obtain the video type tag of the target video based on a classification weight matrix in the classification model.

The apparatus 110 further includes:

an adding module 18, configured to add the target video, based on its video type tag, to the video category corresponding to that video type tag; or

a sending module 19, configured to push the target video to a target terminal, the target terminal being a terminal marked with the video type tag.

An embodiment of the present application provides a video classification apparatus. The apparatus obtains a key frame image from a target video, inputs the key frame image into an image search engine to obtain description information of the key frame image, determines a keyword group of the key frame image according to the description information, obtains the text content feature corresponding to the keyword group, and determines the video type tag of the target video according to the text content feature. The present application uses text information related to the target video as one basis for classifying the target video, so that during training the text information of video samples can be added as training samples, increasing the amount of trainable data and improving the accuracy of video classification. Moreover, because this text information is obtained from an image search engine, and the engine's textual annotation of an image is an explanation of that image, the text information related to the target video is itself a description of the target video's content; determining the video type tag of the target video from this text information can therefore further improve the effectiveness and accuracy of video classification.
Referring to FIG. 12, FIG. 12 is a schematic structural diagram of a computer provided by an embodiment of the present application. As shown in FIG. 12, the computer in this embodiment may include: one or more processors 1201, a memory 1202, and an input/output interface 1203. The processor 1201, the memory 1202, and the input/output interface 1203 are connected by a bus 1204. The memory 1202 is configured to store a computer program that includes program instructions; the input/output interface 1203 is configured to input and output data, in particular the data input to and output from each model used in the present application; and the processor 1201 is configured to execute the program instructions stored in the memory 1202 to perform the following operations:

acquiring a key frame image from a target video;

inputting the key frame image into an image search engine to obtain description information of the key frame image, and determining a keyword group of the key frame image according to the description information;

acquiring a text content feature corresponding to the keyword group; and

determining a video type tag of the target video according to the text content feature.

In some feasible implementations, the processor 1201 may be a central processing unit (CPU); the processor may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.

The memory 1202 may include a read-only memory and a random access memory, and provides instructions and data to the processor 1201 and the input/output interface 1203. A portion of the memory 1202 may also include a non-volatile random access memory. For example, the memory 1202 may also store information about the device type.

In a specific implementation, the computer may execute, through its built-in functional modules, the implementations provided in the steps of FIG. 3 or FIG. 6. For details, reference may be made to the implementations provided in those steps, which will not be repeated here.

An embodiment of the present application provides a computer, including a processor, an input/output interface, and a memory. The processor retrieves the computer instructions in the memory and executes the steps of the method shown in FIG. 3 or FIG. 6 to perform video classification operations. Through the computer instructions in the memory, the processor executes the following steps: acquiring a key frame image from a target video; inputting the key frame image into an image search engine to obtain description information of the key frame image; determining a keyword group of the key frame image according to the description information; acquiring the text content feature corresponding to the keyword group; and determining the video type tag of the target video according to the text content feature. The present application uses text information related to the target video as one basis for classifying the target video, so that during training the text information of video samples can be added as training samples, increasing the amount of trainable data and improving the accuracy of video classification. Moreover, because this text information is obtained from an image search engine, and the engine's textual annotation of an image is an explanation of that image, the text information related to the target video is itself a description of the target video's content; determining the video type tag of the target video from this text information can therefore further improve the effectiveness and accuracy of video classification.

An embodiment of the present application further provides a computer-readable storage medium storing a computer program. The computer program includes program instructions which, when executed by a processor, implement the video classification method provided in the steps of FIG. 3 or FIG. 6. For details, reference may be made to the implementations provided in those steps, which will not be repeated here.

The computer-readable storage medium may be the video classification apparatus provided in any of the foregoing embodiments, or an internal storage unit of the computer, such as the hard disk or memory of the computer. The computer-readable storage medium may also be an external storage device of the computer, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the computer. Further, the computer-readable storage medium may include both an internal storage unit of the computer and an external storage device. The computer-readable storage medium is used to store the computer program and other programs and data required by the computer, and may also be used to temporarily store data that has been output or is to be output.
The terms "first", "second", and so on in the specification, claims, and drawings of the embodiments of the present application are used to distinguish different objects, not to describe a particular order. In addition, the term "include" and any variants thereof are intended to cover non-exclusive inclusion. For example, a process, method, apparatus, product, or device comprising a series of steps or units is not limited to the listed steps or modules, but may optionally include steps or modules that are not listed, or may optionally include other steps or units inherent to such a process, method, apparatus, product, or device.

A person of ordinary skill in the art may appreciate that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.

The methods and related apparatuses provided in the embodiments of the present application are described with reference to the method flowcharts and/or structural schematic diagrams provided in the embodiments; specifically, each flow and/or block of the method flowcharts and/or structural schematic diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the structural schematic diagrams. These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the structural schematic diagrams. These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operating steps are executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the structural schematic diagrams.

What is disclosed above is merely a preferred embodiment of the present application, which certainly cannot be used to limit the scope of the rights of the present application; therefore, equivalent changes made in accordance with the claims of the present application still fall within the scope covered by the present application.

Claims (15)

  1. A video classification method, executed by one or more computing devices, characterized in that the method comprises:
    acquiring a key frame image from a target video;
    inputting the key frame image into an image search engine to obtain description information of the key frame image, and determining a plurality of keyword groups of the key frame image according to the description information;
    determining, from a plurality of preset text type features, preset text type features corresponding to the plurality of keyword groups; and
    determining a video type tag of the target video according to the preset text type features.
  2. The method according to claim 1, characterized in that the method further comprises:
    acquiring, according to the content of each frame image in the target video, a video content feature corresponding to the target video;
    wherein determining the video type tag of the target video according to the preset text type features comprises:
    concatenating the text type features and the video content feature to obtain a first fusion feature; and
    inputting the first fusion feature into a classification model to obtain the video type tag of the target video.
  3. The method according to claim 2, characterized in that acquiring, according to the content of each frame image in the target video, the video content feature corresponding to the target video comprises:
    acquiring at least one image pair in the target video, each image pair containing two adjacent frame images in the target video;
    acquiring an optical flow map between the two frame images in each of the at least one image pair, and assembling the optical flow maps corresponding to the at least one image pair into an optical flow map sequence of the target video; and
    inputting a frame image sequence of the target video and the optical flow map sequence into a video classification model to obtain the video content feature corresponding to the target video, the frame image sequence being obtained by arranging the frame images that make up the target video in order.
  4. The method according to claim 1, characterized in that the method further comprises:
    acquiring audio information of the target video, and inputting the audio information into a speech classification model to obtain a speech content feature corresponding to the audio information;
    wherein determining the video type tag of the target video according to the text type features comprises:
    concatenating the text type features and the speech content feature to obtain a second fusion feature; and
    inputting the second fusion feature into a classification model to obtain the video type tag of the target video.
  5. The method according to claim 1, characterized in that the method further comprises:
    recognizing image text in the key frame image, and acquiring subtitle information corresponding to the key frame image;
    wherein determining the keyword groups of the key frame image according to the description information comprises:
    determining the keyword groups of the key frame image according to the description information, the image text, and the subtitle information.
  6. The method according to claim 5, characterized in that determining the keyword groups of the key frame image according to the description information, the image text, and the subtitle information comprises:
    adding the phrases in the description information, the phrases in the image text, and the phrases in the subtitle information to a phrase set;
    determining an evaluation value corresponding to each phrase in the phrase set according to the number of occurrences and the type weight of the phrase, the type weight comprising a weight corresponding to the description information, a weight corresponding to the image text, and a weight corresponding to the subtitle information; and
    ranking the phrases in the phrase set according to the evaluation values, and determining the keyword groups of the key frame image from the phrase set according to the ranking result.
  7. The method according to claim 1, characterized in that acquiring the key frame image from the target video comprises:
    acquiring multiple frame images that make up the target video, and inputting the multiple frame images into a feature extraction layer in a key frame determination model to obtain an image feature of each frame image;
    inputting the image feature of each frame image into a key value determination layer in the key frame determination model, and determining the key value of each frame image in the key value determination layer based on an attention mechanism; and
    determining the key frame image in the target video according to the key value of each frame image.
  8. The method according to claim 7, characterized in that determining the key value of each frame image in the key value determination layer based on the attention mechanism comprises:
    in the key value determination layer, based on the attention mechanism, determining the degree of association between the image feature of the i-th frame image among the multiple frame images and the image feature of each reference image, and obtaining the key value of the i-th frame image from the degrees of association between the image feature of the i-th frame image and the image features of the reference images, the reference images being the frame images, among the multiple frame images that make up the target video, other than the i-th frame image, i being a positive integer not greater than the number of the multiple frame images; and
    when the i-th frame image is the last of the multiple frame images, obtaining the key value of each frame image.
  9. The method according to claim 1, characterized in that determining the keyword groups of the key frame image according to the description information comprises:
    counting the number of occurrences of each phrase contained in the description information, and determining the phrases in the description information whose number of occurrences is greater than a statistical count threshold as the keyword groups of the key frame image.
  10. The method according to claim 1, characterized in that determining, from the plurality of preset text type features, the preset text type features corresponding to the plurality of keyword groups comprises:
    inputting the keyword groups into a text classification model, and extracting initial text features corresponding to the keyword groups;
    matching the initial text features against multiple type features to be matched in the text classification model to obtain matching values; and
    determining the type feature to be matched that has the largest matching value as the text type feature corresponding to the keyword groups.
  11. The method according to claim 4, characterized in that concatenating the text type features and the speech content feature to obtain the second fusion feature comprises:
    adding a default feature value at a first specified position in the text type feature to obtain a text type feature of a first specified length;
    adding the default feature value at a second specified position in the speech content feature to obtain a speech content feature of a second specified length; and
    concatenating the text type feature of the first specified length and the speech content feature of the second specified length to obtain the second fusion feature;
    wherein inputting the second fusion feature into the classification model to obtain the video type tag of the target video comprises:
    inputting the second fusion feature into the classification model, and obtaining the video type tag of the target video based on a classification weight matrix in the classification model.
  12. The method according to claim 1, characterized in that the method further comprises:
    adding the target video, based on the video type tag of the target video, to the video category corresponding to the video type tag; or
    pushing the target video to a target terminal, the target terminal being a terminal marked with the video type tag.
  13. A video classification apparatus, characterized in that the apparatus comprises:
    a first acquisition module, configured to acquire a key frame image from a target video;
    a first determination module, configured to input the key frame image into an image search engine to obtain description information of the key frame image, and to determine keyword groups of the key frame image according to the description information;
    a second acquisition module, configured to determine, from a plurality of preset text type features, preset text type features corresponding to the plurality of keyword groups; and
    a second determination module, configured to determine a video type tag of the target video according to the preset text type features.
  14. A computer, characterized in that it comprises a processor, a memory, and an input/output interface;
    the processor being connected to the memory and the input/output interface, respectively, wherein the input/output interface is configured to input and output data, the memory is configured to store program code, and the processor is configured to invoke the program code to perform the method according to any one of claims 1 to 12.
  15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program comprising program instructions that, when executed by a processor, perform the method according to any one of claims 1 to 12.
PCT/CN2020/114389 2019-11-05 2020-09-10 Video classification method and apparatus, computer, and readable storage medium WO2021088510A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911071940.9A CN110837579B (en) 2019-11-05 2019-11-05 Video classification method, apparatus, computer and readable storage medium
CN201911071940.9 2019-11-05

Publications (1)

Publication Number Publication Date
WO2021088510A1 (en) 2021-05-14

Family

ID=69576324

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/114389 WO2021088510A1 (en) 2019-11-05 2020-09-10 Video classification method and apparatus, computer, and readable storage medium

Country Status (2)

Country Link
CN (1) CN110837579B (en)
WO (1) WO2021088510A1 (en)

Families Citing this family (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110837579B (en) * 2019-11-05 2024-07-23 腾讯科技(深圳)有限公司 Video classification method, apparatus, computer and readable storage medium
CN111460889B (en) * 2020-02-27 2023-10-31 平安科技(深圳)有限公司 Abnormal behavior recognition method, device and equipment based on voice and image characteristics
CN113365102B (en) * 2020-03-04 2022-08-16 阿里巴巴集团控股有限公司 Video processing method and device and label processing method and device
CN111538896B (en) * 2020-03-12 2021-04-27 成都云帆数联科技有限公司 Intelligent extraction method of news video fine-grained labels based on deep learning
CN111274442B (en) * 2020-03-19 2023-10-27 聚好看科技股份有限公司 Method for determining video tag, server and storage medium
CN111444878B (en) * 2020-04-09 2023-07-18 Oppo广东移动通信有限公司 Video classification method, device and computer readable storage medium
CN111556377A (en) * 2020-04-24 2020-08-18 珠海横琴电享科技有限公司 Short video labeling method based on machine learning
CN111695422B (en) * 2020-05-06 2023-08-18 Oppo(重庆)智能科技有限公司 Video tag acquisition method and device, storage medium and server
CN111611262B (en) * 2020-05-24 2023-09-15 山东三宏信息科技有限公司 Garbage classification and identification system based on text decoupling and image processing
CN113746874B (en) * 2020-05-27 2024-04-05 百度在线网络技术(北京)有限公司 Voice package recommendation method, device, equipment and storage medium
CN111767726B (en) * 2020-06-24 2024-02-06 北京奇艺世纪科技有限公司 Data processing method and device
CN111611436B (en) * 2020-06-24 2023-07-11 深圳市雅阅科技有限公司 Label data processing method and device and computer readable storage medium
CN111797765B (en) * 2020-07-03 2024-04-16 北京达佳互联信息技术有限公司 Image processing method, device, server and storage medium
CN113919338B (en) * 2020-07-09 2024-05-24 腾讯科技(深圳)有限公司 Method and device for processing text data
CN111914682B (en) * 2020-07-13 2024-01-05 完美世界控股集团有限公司 Teaching video segmentation method, device and equipment containing presentation file
CN112149653B (en) * 2020-09-16 2024-03-29 北京达佳互联信息技术有限公司 Information processing method, information processing device, electronic equipment and storage medium
CN112183275A (en) * 2020-09-21 2021-01-05 北京达佳互联信息技术有限公司 Video description information generation method and device and server
CN114648712B (en) * 2020-12-18 2023-07-28 抖音视界有限公司 Video classification method, device, electronic equipment and computer readable storage medium
CN113011254B (en) * 2021-02-04 2023-11-07 腾讯科技(深圳)有限公司 Video data processing method, computer equipment and readable storage medium
CN113569091A (en) * 2021-02-08 2021-10-29 腾讯科技(深圳)有限公司 Video data processing method and device
CN112801017B (en) * 2021-02-09 2023-08-04 成都视海芯图微电子有限公司 Visual scene description method and system
CN115082930B (en) * 2021-03-11 2024-05-28 腾讯科技(深圳)有限公司 Image classification method, device, electronic equipment and storage medium
CN113127663B (en) * 2021-04-01 2024-02-27 深圳力维智联技术有限公司 Target image searching method, device, equipment and computer readable storage medium
CN112989117B (en) * 2021-04-14 2021-08-13 北京世纪好未来教育科技有限公司 Video classification method and device, electronic equipment and computer storage medium
CN113821675B (en) * 2021-06-30 2024-06-07 腾讯科技(北京)有限公司 Video identification method, device, electronic equipment and computer readable storage medium
CN113723259A (en) * 2021-08-24 2021-11-30 罗家泳 Monitoring video processing method and device, computer equipment and storage medium
CN116150428B (en) * 2021-11-16 2024-06-07 腾讯科技(深圳)有限公司 Video tag acquisition method and device, electronic equipment and storage medium
CN114238690A (en) * 2021-12-08 2022-03-25 腾讯科技(深圳)有限公司 Video classification method, device and storage medium
CN114494982B (en) * 2022-04-08 2022-12-20 华夏文广传媒集团股份有限公司 Live video big data accurate recommendation method and system based on artificial intelligence
CN114880496A (en) * 2022-04-28 2022-08-09 国家计算机网络与信息安全管理中心 Multimedia information topic analysis method, device, equipment and storage medium
CN114996514A (en) * 2022-05-31 2022-09-02 北京达佳互联信息技术有限公司 Text generation method and device, computer equipment and medium
CN115222838A (en) * 2022-07-14 2022-10-21 维沃移动通信有限公司 Video generation method, device, electronic equipment and medium
CN115544303A (en) * 2022-09-30 2022-12-30 华为技术有限公司 Method, apparatus, device and medium for determining label of video
CN116112763A (en) * 2022-11-15 2023-05-12 国家计算机网络与信息安全管理中心 Method and system for automatically generating short video content labels
CN115935008B (en) * 2023-02-16 2023-05-30 杭州网之易创新科技有限公司 Video label generation method, device, medium and computing equipment
CN117708376A (en) * 2023-07-17 2024-03-15 荣耀终端有限公司 Video processing method, readable storage medium and electronic device
CN117115565B (en) * 2023-10-19 2024-07-23 南方科技大学 Autonomous perception-based image classification method and device and intelligent terminal


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109947991A (en) * 2017-10-31 2019-06-28 腾讯科技(深圳)有限公司 A kind of extraction method of key frame, device and storage medium
CN109753985A (en) * 2017-11-07 2019-05-14 北京京东尚科信息技术有限公司 Video classification methods and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763325A (en) * 2018-05-04 2018-11-06 北京达佳互联信息技术有限公司 A kind of network object processing method and processing device
CN109359592A (en) * 2018-10-16 2019-02-19 北京达佳互联信息技术有限公司 Processing method, device, electronic equipment and the storage medium of video frame
CN110162669A (en) * 2019-04-04 2019-08-23 腾讯科技(深圳)有限公司 Visual classification processing method, device, computer equipment and storage medium
CN110287788A (en) * 2019-05-23 2019-09-27 厦门网宿有限公司 A kind of video classification methods and device
CN110399526A (en) * 2019-07-26 2019-11-01 腾讯科技(深圳)有限公司 Generation method, device and the computer readable storage medium of video title
CN110837579A (en) * 2019-11-05 2020-02-25 腾讯科技(深圳)有限公司 Video classification method, device, computer and readable storage medium

Also Published As

Publication number Publication date
CN110837579B (en) 2024-07-23
CN110837579A (en) 2020-02-25

Similar Documents

Publication Publication Date Title
WO2021088510A1 (en) Video classification method and apparatus, computer, and readable storage medium
CN109117777B (en) Method and device for generating information
CN111062871B (en) Image processing method and device, computer equipment and readable storage medium
US12008810B2 (en) Video sequence selection method, computer device, and storage medium
CN111798879B (en) Method and apparatus for generating video
CN111611436B (en) Label data processing method and device and computer readable storage medium
CN112989209B (en) Content recommendation method, device and storage medium
WO2021169347A1 (en) Method and device for extracting text keywords
WO2018177139A1 (en) Method and apparatus for generating video abstract, server and storage medium
CN112115299A (en) Video searching method and device, recommendation method, electronic device and storage medium
CN110737783A (en) method, device and computing equipment for recommending multimedia content
US20200073485A1 (en) Emoji prediction and visual sentiment analysis
CN111372141B (en) Expression image generation method and device and electronic equipment
CN111866610A (en) Method and apparatus for generating information
CN113806588A (en) Method and device for searching video
CN116977701A (en) Video classification model training method, video classification method and device
CN116955707A (en) Content tag determination method, device, equipment, medium and program product
CN113688231A (en) Abstract extraction method and device of answer text, electronic equipment and medium
CN107656760A (en) Data processing method and device, electronic equipment
CN116977992A (en) Text information identification method, apparatus, computer device and storage medium
CN117648504A (en) Method, device, computer equipment and storage medium for generating media resource sequence
CN114329049A (en) Video search method and device, computer equipment and storage medium
US20200074218A1 (en) Information processing system, information processing apparatus, and non-transitory computer readable medium
CN112445921A (en) Abstract generation method and device
Kalkhorani et al. Beyond the Frame: Single and multiple video summarization method with user-defined length

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 20884680; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 20884680; Country of ref document: EP; Kind code of ref document: A1