WO2023134088A1

WO2023134088A1 - Video summary generation method and apparatus, electronic device, and storage medium

Info

Publication number: WO2023134088A1
Application number: PCT/CN2022/090761
Authority: WO
Inventors: 舒畅; 陈又新
Original assignee: 平安科技（深圳）有限公司
Priority date: 2022-01-11
Filing date: 2022-04-29
Publication date: 2023-07-20
Also published as: CN114359810A; CN114359810B

Abstract

Embodiments of the present application relate to the technical field of artificial intelligence, and provide a video summary generation method and apparatus, an electronic device, and a storage medium. The method comprises: obtaining video data; performing video extraction on the video data by means of a preset video extraction model to obtain a plurality of video clips; encoding the video clips to obtain video hidden feature vectors; performing matrix multiplication processing on the video hidden feature vectors and a preset reference word vector to obtain video description word segments; performing text recognition processing on the video description word segments by means of a preset text recognition model to obtain video summary statements; and splicing the video summary statements according to a preset splicing sequence to obtain a video summary text. According to the embodiments of the present application, the accuracy with which a video summary is generated can be improved.

Description

Video summary generation method and apparatus, electronic device, and storage medium

This application claims priority to Chinese patent application No. 202210028911.X, filed with the China Patent Office on January 11, 2022 and entitled "Video summary generation method and apparatus, electronic device, and storage medium", the entire contents of which are incorporated herein by reference.

Technical Field

The present application relates to the technical field of artificial intelligence, and in particular to a video summary generation method and apparatus, an electronic device, and a storage medium.

Background

At present, video summaries are often extracted by supervised learning: video data is input into a trained supervised learning model, which processes the video data to obtain a video summary. The inventors realized that a supervised learning model places high demands on the manual annotation of the training set, and that manual annotation often introduces considerable human error, which affects the accuracy of the video summary. How to improve the accuracy of generated video summaries has therefore become an urgent technical problem.

Technical Problem

The following is the technical problem of the prior art recognized by the inventors: a supervised learning model places high demands on the manual annotation of the training set, and manual annotation often introduces considerable human error, which affects the accuracy of the video summary.

Technical Solution

In a first aspect, an embodiment of the present application provides a video summary generation method, the method comprising:

acquiring video data;

performing video extraction on the video data through a preset video extraction model to obtain a plurality of video clips;

encoding the video clips to obtain video hidden feature vectors;

performing matrix multiplication on the video hidden feature vectors and a preset reference word vector to obtain video description word segments;

performing text recognition on the video description word segments through a preset text recognition model to obtain video summary sentences;

splicing the video summary sentences according to a preset splicing order to obtain a video summary text.

In a second aspect, an embodiment of the present application provides a video summary generation apparatus, the apparatus comprising:

a video data acquisition module, configured to acquire video data;

a video extraction module, configured to perform video extraction on the video data through a preset video extraction model to obtain a plurality of video clips;

an encoding module, configured to encode the video clips to obtain video hidden feature vectors;

a matrix multiplication module, configured to perform matrix multiplication on the video hidden feature vectors and a preset reference word vector to obtain video description word segments;

a text recognition module, configured to perform text recognition on the video description word segments through a preset text recognition model to obtain video summary sentences;

a splicing module, configured to splice the video summary sentences according to a preset splicing order to obtain a video summary text.

In a third aspect, an embodiment of the present application provides an electronic device comprising a memory, a processor, a program stored in the memory and executable on the processor, and a data bus for connection and communication between the processor and the memory. When executed by the processor, the program implements a video summary generation method, the method comprising: acquiring video data; performing video extraction on the video data through a preset video extraction model to obtain a plurality of video clips; encoding the video clips to obtain video hidden feature vectors; performing matrix multiplication on the video hidden feature vectors and a preset reference word vector to obtain video description word segments; performing text recognition on the video description word segments through a preset text recognition model to obtain video summary sentences; and splicing the video summary sentences according to a preset splicing order to obtain a video summary text.

In a fourth aspect, an embodiment of the present application provides a storage medium, the storage medium being a computer-readable storage medium that stores one or more programs executable by one or more processors to implement a video summary generation method, the method comprising: acquiring video data; performing video extraction on the video data through a preset video extraction model to obtain a plurality of video clips; encoding the video clips to obtain video hidden feature vectors; performing matrix multiplication on the video hidden feature vectors and a preset reference word vector to obtain video description word segments; performing text recognition on the video description word segments through a preset text recognition model to obtain video summary sentences; and splicing the video summary sentences according to a preset splicing order to obtain a video summary text.

Beneficial Effects

The video summary generation method and apparatus, electronic device, and storage medium proposed in this application make it easy to obtain video description word segments that meet requirements, thereby improving both the accuracy and the efficiency of video summary generation. The resulting video summary sentences better highlight the main content of the video, which improves the quality of the video summary text.

Description of Drawings

The accompanying drawings provide a further understanding of the technical solution of the present application and constitute a part of the specification. Together with the embodiments of the present application, they serve to explain the technical solution of the present application and do not limit it.

FIG. 1 is a flowchart of the video summary generation method provided by an embodiment of the present application;

FIG. 2 is a flowchart of step S102 in FIG. 1;

FIG. 3 is a flowchart of step S105 in FIG. 1;

FIG. 4 is a flowchart of step S302 in FIG. 3;

FIG. 5 is a flowchart of step S304 in FIG. 3;

FIG. 6 is a flowchart of step S502 in FIG. 5;

FIG. 7 is a flowchart of step S106 in FIG. 1;

FIG. 8 is a schematic structural diagram of the video summary generation apparatus provided by an embodiment of the present application;

FIG. 9 is a schematic diagram of the hardware structure of the electronic device provided by an embodiment of the present application.

Embodiments of the Invention

To make the purpose, technical solution, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present application and are not intended to limit it.

It should be noted that although functional modules are divided in the apparatus schematic and a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed with a module division different from that in the apparatus, or in an order different from that in the flowcharts. The terms "first", "second", and the like in the specification, the claims, and the above drawings are used to distinguish similar objects and do not necessarily describe a specific order or sequence.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by a person skilled in the technical field to which this application belongs. The terms used herein are only for the purpose of describing the embodiments of the present application and are not intended to limit the present application.

At present, events in a video are often labeled by supervised learning, that is, video images and the like are input into a trained model to generate a video summary. This approach places high demands on the manual annotation of the training set, and manual annotation often introduces considerable human error, which affects the accuracy of the video summary. How to improve the accuracy of generated video summaries has therefore become an urgent technical problem.

On this basis, the embodiments of the present application provide a video summary generation method and apparatus, an electronic device, and a storage medium, aiming to improve the accuracy of generated video summaries.

The video summary generation method and apparatus, electronic device, and storage medium provided by the embodiments of the present application are described through the following embodiments, beginning with the video summary generation method.

The embodiments of the present application may acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.

Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning/deep learning.

The video summary generation method provided by the embodiments of the present application relates to the technical field of artificial intelligence. The method can be applied in a terminal or on a server side, and can also be software running in a terminal or on a server side. In some embodiments, the terminal may be a smartphone, tablet computer, notebook computer, desktop computer, or the like. The server side may be configured as an independent physical server, as a server cluster or distributed system composed of multiple physical servers, or as a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The software may be an application that implements the video summary generation method, but is not limited to the above forms.

The present application can be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices. The present application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.

FIG. 1 is an optional flowchart of the video summary generation method provided by an embodiment of the present application. The method in FIG. 1 may include, but is not limited to, steps S101 to S106.

Step S101: acquiring video data;

Step S102: performing video extraction on the video data through a preset video extraction model to obtain a plurality of video clips;

Step S103: encoding the video clips to obtain video hidden feature vectors;

Step S104: performing matrix multiplication on the video hidden feature vectors and a preset reference word vector to obtain video description word segments;

Step S105: performing text recognition on the video description word segments through a preset text recognition model to obtain video summary sentences;

Step S106: splicing the video summary sentences according to a preset splicing order to obtain a video summary text.

In steps S101 to S106 of the embodiments of the present application, performing video extraction on the video data through the preset video extraction model to obtain a plurality of video clips effectively removes weakly relevant data from the video data, which reduces the total amount of data and makes the data more reasonable. The video clips are then encoded to obtain video hidden feature vectors, and the video hidden feature vectors are matrix-multiplied with the preset reference word vector to obtain video description word segments. In this way, video description word segments that meet requirements can be obtained easily, which improves both the accuracy and the efficiency of video summary generation. Text recognition is performed on the video description word segments through the preset text recognition model to obtain video summary sentences, so that the resulting sentences better highlight the main content of the video. Finally, the video summary sentences are spliced according to the preset splicing order to obtain the video summary text, which further improves the quality of the video summary text.
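
The flow of steps S101 to S106 can be sketched as a simple pipeline. This is only an illustration of how the stages chain together: the function name `generate_video_summary` and the callables passed in (`extract_clips`, `encode`, `decode`, `recognize`) are hypothetical stand-ins for the preset models and are not defined by this application.

```python
from typing import Callable, List, Sequence


def generate_video_summary(
    video: Sequence,
    extract_clips: Callable[[Sequence], List[Sequence]],  # S102: video extraction model
    encode: Callable[[Sequence], list],                   # S103: clip -> hidden feature vector
    decode: Callable[[list], str],                        # S104: hidden vector -> description segment
    recognize: Callable[[List[str]], List[str]],          # S105: segments -> summary sentences
    joiner: str = " ",
) -> str:
    """Chain the stages of steps S102 to S106 over already-acquired video data (S101)."""
    clips = extract_clips(video)
    hidden_vectors = [encode(clip) for clip in clips]
    segments = [decode(vector) for vector in hidden_vectors]
    sentences = recognize(segments)
    # S106: splice the sentences in the order in which they were produced.
    return joiner.join(sentences)
```

Passing the models in as callables keeps the sketch independent of any particular framework; a real system would substitute trained models for each stage.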

In step S101 of some embodiments, the video data can be obtained by writing a web crawler, configuring the data source, and then crawling the data in a targeted manner. The video data may also be obtained in other ways; the method is not limited in this respect.

Referring to FIG. 2, in some embodiments the video extraction model is a BMN model (a temporal action detection model) that includes a two-stream network, a BM layer, a convolutional layer, and a preset function. Step S102 may include, but is not limited to, steps S201 to S205:

Step S201: performing feature extraction on the video data through the two-stream network to obtain video features;

Step S202: performing a dot product of a preset weight matrix and the video features through the BM layer to obtain a video feature map;

Step S203: performing convolution on the video feature map through the convolutional layer to obtain a video feature confidence map;

Step S204: computing a feature probability for each temporal position of the video features through the preset function to obtain temporal probability values;

Step S205: segmenting the video data according to the video feature confidence map and the temporal probability values to obtain video clips.

Specifically, the video data can be defined as $X = \{x_n\}_{n=1}^{l_v}$, where $x_n$ denotes a frame of the video data and $l_v$ denotes the total number of frames of the video data.

In step S201 of some embodiments, the BMN model uses a two-stream network to extract video features. One stream extracts the spatial features of the video images and the other extracts the temporal motion features of the video optical flow, so that video image features and video optical flow features are extracted at each position of the video clip. This process can be expressed as formula (1):

[Formula (1) is not reproduced in this text.]

Here, $x_{t_n}$ denotes the video image at frame $t_n$ of the video data, and $o_{t_n}$ denotes the video optical flow computed from consecutive images centered on that frame. $x_{t_n}$ and $o_{t_n}$ are each input into the two-stream network to extract video features, and the features of the two streams are then merged to obtain the video feature $f_{t_n}$. Computing $x_{t_n}$ and $o_{t_n}$ and extracting $f_{t_n}$ for every temporal position of the video data yields the temporal features of the video, $\{f_{t_n}\}_{n=1}^{l_s}$.

To reduce the amount of computation, the video features can be extracted by interval sampling with a stride of $\sigma$; the video feature sequence sampled in this way has a total of $l_s = l_v / \sigma$ frames.
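
The interval sampling can be illustrated with a few lines of Python. This is a minimal sketch: the helper name `sample_positions` is hypothetical, and sampling is assumed to start at frame 0.

```python
def sample_positions(total_frames: int, stride: int) -> list:
    """Return the frame indices kept when every `stride`-th frame is sampled."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    # Keep frames 0, stride, 2 * stride, ... below total_frames.
    return list(range(0, total_frames, stride))
```

When the total frame count is a multiple of the stride, the number of kept positions equals $l_v / \sigma$; otherwise this sketch keeps one extra position for the trailing partial interval.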

In step S202 of some embodiments, the temporal feature sequence in the extracted video features is first defined as $S_F \in \mathbb{R}^{C \times T}$. For each video segment, $N$ feature points are sampled between its start time and end time to obtain the video feature of that segment, a matrix in $\mathbb{R}^{C \times N}$.

Applying the same sampling process to every video segment yields the video feature map $M_F \in \mathbb{R}^{C \times N \times D \times T}$, where $C$ is the number of input channels of the original video feature sequence, $N$ is the number of sampling points per video segment, $D$ is a hyperparameter defining the maximum duration of a video segment, and $T$ is the length of the original video feature sequence.

Specifically, the boundary of the video segment is first extended to $[t_s - 0.25d,\ t_e + 0.25d]$, where $t_s$ is the start boundary point, $t_e$ is the end boundary point, and $d = t_e - t_s$. $N$ points are sampled uniformly in this interval and a weight matrix corresponding to the sampling points is generated. The video feature at each sampling point is then computed as a weighted combination of the corresponding temporal features, using the weights of that sampling point. The weight matrix $w_{i,j} \in \mathbb{R}^{N \times T}$ is computed as shown in formula (2), reconstructed here from the description that follows it:

$$
w_{i,j}[n, t] =
\begin{cases}
1 - \mathrm{dec}(t_n), & t = \lfloor t_n \rfloor \\
\mathrm{dec}(t_n), & t = \lfloor t_n \rfloor + 1 \\
0, & \text{otherwise}
\end{cases}
\tag{2}
$$

For the $N$ points sampled within a video segment, if the $n$-th sampling point satisfies $t < t_n < t + 1$, the sampling point falls between positions $t$ and $t + 1$ of the time series. The fractional part of $t_n$, written $\mathrm{dec}(t_n)$, measures how close $t_n$ is to $t$ and to $t + 1$, and the sample at $t_n$ is represented by a weighted combination of the features at $t$ and $t + 1$. The larger the fractional part of $t_n$, the farther $t_n$ is from $t$, so the smaller the weight $1 - \mathrm{dec}(t_n)$ at time $t$ and the larger the weight $\mathrm{dec}(t_n)$ at time $t + 1$.
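
The interpolation weights described above can be sketched in plain Python. The function name `sampling_weights` is hypothetical. Two details are assumptions rather than statements of the application: the $N$ uniform sample points include both ends of the extended interval, and sample points that fall outside the feature sequence simply receive zero weight.

```python
import math


def sampling_weights(t_start: float, t_end: float, num_samples: int, length: int) -> list:
    """Build an N x T weight matrix for one segment using linear interpolation."""
    d = t_end - t_start
    lo, hi = t_start - 0.25 * d, t_end + 0.25 * d  # extended boundary
    weights = [[0.0] * length for _ in range(num_samples)]
    for n in range(num_samples):
        # Uniform sample point t_n inside [lo, hi] (endpoints included by assumption).
        t_n = lo if num_samples == 1 else lo + (hi - lo) * n / (num_samples - 1)
        base = math.floor(t_n)
        frac = t_n - base  # dec(t_n), the fractional part
        if 0 <= base < length:
            weights[n][base] += 1.0 - frac  # weight of position t
        if 0 <= base + 1 < length:
            weights[n][base + 1] += frac    # weight of position t + 1
    return weights
```

For example, a segment with start 2 and end 4 is extended to [1.5, 4.5]; with two sample points, the first falls halfway between positions 1 and 2 and therefore gives each of them a weight of 0.5.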

For each video segment, taking the dot product of the weight matrix $w_{i,j} \in \mathbb{R}^{N \times T}$ and the temporal feature $S_F \in \mathbb{R}^{C \times T}$ of that segment yields the video feature of the segment, as shown in formula (3):

[Formula (3) is not reproduced in this text.]

Extending this process to the two-dimensional BM map, the extended weight matrix is $w \in \mathbb{R}^{C \times N \times D \times T}$; taking the dot product of this weight matrix and the temporal feature $S_F \in \mathbb{R}^{C \times T}$ yields the video feature map $M_F \in \mathbb{R}^{C \times N \times D \times T}$.
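
The per-segment product can be sketched as follows. The sketch reads the "dot product" as a sum over the temporal dimension, so that a $C \times T$ feature sequence and an $N \times T$ weight matrix give a $C \times N$ segment feature; this reading and the function name `segment_feature` are assumptions.

```python
def segment_feature(seq_feature: list, weights: list) -> list:
    """Compute a C x N segment feature from a C x T sequence and N x T weights."""
    return [
        # For each channel, each sample point is a weighted sum over time.
        [sum(f_t * w_t for f_t, w_t in zip(channel, row)) for row in weights]
        for channel in seq_feature
    ]
```

Repeating this for every candidate pair of start position and duration would fill the full feature map one segment at a time; a practical implementation would typically batch this as a single tensor operation.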

In step S203 of some embodiments, 2D convolution and 3D convolution are applied to the video feature map to compute the video feature confidence map.

In step S204 of some embodiments, a sigmoid function is used to predict, for each temporal position of the video temporal features, the probability that the position is an action start and the probability that it is an action end, giving the temporal probability values. That is, two channels are output at each temporal position and converted into probability values with the sigmoid function; the two resulting values denote the start probability and the end probability, respectively.
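
As a minimal sketch of this step, the following converts two raw channel outputs per temporal position into start and end probabilities. The function names and the representation of the channels as a list of `(start_logit, end_logit)` pairs are assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def boundary_probabilities(logits: list) -> tuple:
    """Map per-position (start_logit, end_logit) pairs to two probability lists."""
    starts = [sigmoid(s) for s, _ in logits]
    ends = [sigmoid(e) for _, e in logits]
    return starts, ends
```

The two-branch form of `sigmoid` avoids overflow in `math.exp` for large negative inputs.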

In step S205 of some embodiments, the regression confidence and classification confidence of each video segment are obtained from the video feature confidence map, and the start node probability and end node probability are obtained from the temporal probability values. The confidence value of a video segment is then computed from its start node probability, end node probability, classification confidence $p_{cc}$, and regression confidence $p_{cr}$:

[The confidence formula is not reproduced in this text.]

The video data is then predicted and segmented according to the soft-NMS function and the confidence values to obtain video clips. For example, positions whose probability value exceeds a set threshold, or that are peaks, are selected as start points and end points, forming the start point sequence and end point sequence of the candidate video segments. The start points and end points are then paired to obtain candidate video segments, which can be represented as:

[The candidate-segment expression is not reproduced in this text.]

Finally, the candidate video segments are separated from the video data through the soft-NMS function to obtain a plurality of video clips.
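
The candidate generation and suppression described above can be sketched as follows. Several parts are assumptions, since the corresponding formulas are not reproduced in this text: `fused_score` uses the product form from the published BMN paper (boundary probabilities times the square root of $p_{cc} \cdot p_{cr}$), `soft_nms` uses the Gaussian decay variant of Soft-NMS with a temporal IoU, and all function names and default parameters are hypothetical.

```python
import math


def boundary_points(probs: list, threshold: float) -> list:
    """Indices whose probability exceeds the threshold or is a local peak."""
    points = []
    for t, p in enumerate(probs):
        is_peak = 0 < t < len(probs) - 1 and probs[t - 1] < p > probs[t + 1]
        if p > threshold or is_peak:
            points.append(t)
    return points


def candidate_segments(starts: list, ends: list) -> list:
    """Pair every start point with every later end point."""
    return [(s, e) for s in starts for e in ends if s < e]


def fused_score(p_start: float, p_end: float, p_cc: float, p_cr: float) -> float:
    """Assumed fusion: boundary probabilities times sqrt(p_cc * p_cr)."""
    return p_start * p_end * math.sqrt(p_cc * p_cr)


def temporal_iou(a: tuple, b: tuple) -> float:
    """Intersection over union of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(segments: list, scores: list, sigma: float = 0.5,
             min_score: float = 1e-3) -> list:
    """Gaussian Soft-NMS: decay scores of overlapping segments instead of dropping them."""
    pool = list(zip(segments, scores))
    kept = []
    while pool:
        best = max(range(len(pool)), key=lambda i: pool[i][1])
        seg, score = pool.pop(best)
        if score < min_score:
            break  # every remaining score is at most this one
        kept.append((seg, score))
        pool = [(s, sc * math.exp(-temporal_iou(seg, s) ** 2 / sigma))
                for s, sc in pool]
    return kept
```

Unlike hard non-maximum suppression, the decay keeps heavily overlapping segments in the ranking with a reduced score, so a later threshold or top-k selection decides how many clips survive.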

Before step S103 in some embodiments, the method further includes pre-training a language model. The language model is used to encode the video clips to obtain video hidden feature vectors, and to perform matrix multiplication on the video hidden feature vectors and the preset reference word vector to obtain video description word segments. Specifically, the language model may be a transformer model, an LSTM model, a GPT-2 model, or the like.

Taking the transformer model as an example, the transformer model includes an encoder and a decoder. The language model is trained as follows. Sample video clips are obtained and input into the transformer model. The encoder of the transformer model encodes the sample video clips to obtain sample video hidden feature vectors, and the decoder decodes the sample video hidden feature vectors by matrix-multiplying them with the reference word vector to obtain sample video description word segments. The loss function of the transformer model is used to compute the similarity between the sample video description word segments and reference video description word segments. The transformer model is optimized according to this similarity: the model loss is back-propagated and the model parameters are adjusted continually until the similarity is greater than or equal to a similarity threshold, at which point optimization stops and a language model that meets requirements is obtained.

In step S103 of some embodiments, the video clips are input into the encoder of the language model, which encodes them to obtain video hidden feature vectors.

In step S104 of some embodiments, the generated video hidden feature vectors are input into the decoder of the language model, and the video description word segments are obtained by matrix-multiplying the video hidden feature vectors with the preset reference word vector. Specifically, at time $t$ the decoder generates the video description word segment $w_t$ from the current video hidden feature vector combined with the word vector of the video description word segment generated at time $t-1$. When no video description word segment has yet been generated at time $t-1$, an all-zero word vector is used as the reference word vector. Once $w_t$ has been generated, the decoder at time $t+1$ likewise matrix-multiplies the video hidden feature vector at time $t+1$ with the word vector of $w_t$ to generate $w_{t+1}$, and so on, until the generated video description word segment $w_n$ is the terminator <eos>, which marks the end of the sentence.
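
The step-by-step generation can be sketched as a greedy decoding loop. This is a toy illustration, not the trained decoder: the function names are hypothetical, the hidden vector and the previous word vector are combined by simple addition (an assumption, since the text does not specify how they are combined), and each word is chosen as the vocabulary entry whose word vector scores highest in the matrix product.

```python
def matvec(matrix: list, vector: list) -> list:
    """Multiply a V x d matrix by a length-d vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def greedy_decode(hidden_states: list, embeddings: dict,
                  eos: str = "<eos>", max_len: int = 20) -> list:
    """hidden_states: one hidden feature vector per time step.
    embeddings: token -> word vector with the same dimension as the hidden vectors."""
    vocab = list(embeddings)
    table = [embeddings[token] for token in vocab]  # V x d matrix of word vectors
    prev = [0.0] * len(table[0])                    # all-zero vector before any word exists
    words = []
    for hidden in hidden_states[:max_len]:
        query = [h + p for h, p in zip(hidden, prev)]  # assumed combination rule
        scores = matvec(table, query)                  # matrix multiplication with word vectors
        token = vocab[max(range(len(vocab)), key=scores.__getitem__)]
        if token == eos:
            break  # the terminator ends the sentence
        words.append(token)
        prev = embeddings[token]
    return words
```

For instance, with three one-hot word vectors for `a`, `b`, and `<eos>`, hidden states that point toward `a`, then `b`, then `<eos>` produce the two-word output `["a", "b"]`, and any later hidden states are ignored.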

It should be noted that each video description word segment obtained in the above manner carries a pair of separators, which are used to separate multiple video description word segments from one another. Specifically, the separators may be a pair of placeholders: a first placeholder [CLS] and a second placeholder [SEP], where [CLS] marks the beginning of the video description word segment and [SEP] marks its end. CLS (classifier token), also called the classifier identifier or identifier, is a special token whose word embedding is usually used for classification tasks; SEP (sentence separator), also called the sentence separation identifier or separator, is likewise a special token and can be used to separate two video description word segments. For example, the sequence for a video description word segment W_1 can be expressed as [CLS]-[W_1.1]-[W_1.2]-[SEP], where [CLS] is the beginning of the video description word segment W_1, [SEP] is its end, [W_1.1] is the first video phrase of W_1, and [W_1.2] is the second video phrase of W_1.
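As a small illustration of the separator convention above, the sketch below wraps the phrases of one word segment in [CLS] and [SEP] and recovers the segments from a flat token stream. The helper names `wrap_segment` and `split_segments` are hypothetical and are not taken from the application.

```python
CLS, SEP = "[CLS]", "[SEP]"


def wrap_segment(phrases):
    """Return the token sequence [CLS], phrase_1, ..., phrase_n, [SEP]."""
    return [CLS, *phrases, SEP]


def split_segments(tokens):
    """Recover the phrase lists from a flat stream of wrapped segments."""
    segments, current = [], None
    for tok in tokens:
        if tok == CLS:
            current = []          # a new segment starts
        elif tok == SEP:
            if current is not None:
                segments.append(current)
            current = None        # the segment is closed
        elif current is not None:
            current.append(tok)
    return segments
```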

Referring to FIG. 3, in some embodiments the text recognition model is a BERT model comprising a Bert layer and a Transformer layer, and step S105 may include, but is not limited to, steps S301 to S304:

Step S301: perform word vectorization on the video description word segments to obtain a video description word vector corresponding to each video description word segment;

Step S302: embed the video description word vectors through the Bert layer to obtain video description representation vectors;

Step S303: calculate a text score for each video description representation vector through the Transformer layer;

Step S304: screen the video description word segments according to the text scores to obtain the video summary sentences.

In step S301 of some embodiments, word vectorization is performed on the video description word segments using an LSTM algorithm, a transformer algorithm, or the like, to obtain the video description word vector corresponding to each video description word segment.

In step S302 of some embodiments, the video description word vectors are embedded through the Bert layer. The embedding processing includes segment embedding, position embedding, and so on; the embedding vectors obtained from the different types of embedding processing are then spliced, which makes it convenient to obtain the video description representation vector.

In step S303 of some embodiments, the text score of each video description representation vector may be calculated through a preset function in the Transformer layer. The preset function may be a sigmoid function: the sigmoid function computes the similarity probability value between the video description representation vector and a reference representation vector, and this similarity probability value serves as the text score of the video description representation vector.
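A minimal sketch of the scoring step follows. The text only states that a sigmoid function yields a similarity probability; using the dot product of the two vectors as the sigmoid input is an assumption made for illustration, and `text_score` is a hypothetical name.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function with values in (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def text_score(representation, reference):
    """Similarity probability between a representation vector and a reference vector.

    Assumption: the raw similarity is the dot product of the two vectors.
    """
    dot = sum(a * b for a, b in zip(representation, reference))
    return sigmoid(dot)
```

Under this assumption, orthogonal vectors score exactly 0.5, and vectors that align with the reference score closer to 1.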

In step S304 of some embodiments, the video description word segments may be sorted in descending order of text score to obtain a video description word segment sequence. The text score of each video description word segment in the sequence is then compared with a preset text score threshold to screen the video description word segments, and the segments that remain after screening are supplemented into full sentences to obtain the video summary sentences.

Referring to FIG. 4, in some embodiments step S302 may include, but is not limited to, steps S401 to S403:

Step S401: perform segment embedding on the video description word vectors using the reference segment embedding vectors preset in the Bert layer to obtain video segment embedding vectors;

Step S402: perform position embedding on the video description word vectors using the feature dimension preset in the Bert layer to obtain video position embedding vectors;

Step S403: combine the video description word vector, the video segment embedding vector, and the video position embedding vector to obtain the video description representation vector.

In step S401 of some embodiments, segment embedding is performed on the video description word vectors using the reference segment embedding vectors preset in the Bert layer to obtain the video segment embedding vectors. That is, word segmentation is performed on the video description word vectors according to the reference segment embedding vectors to obtain the video phrases (W_1.1, W_1.2, ..., W_50.1, W_50.2, W_50.3) of the video description word segments cap, and a different video segment embedding vector is assigned to each video description word segment. The video segment embedding vectors can be denoted E_A, E_B, ..., with values starting from 1 and increasing in turn: for the video phrases W_1.1 and W_1.2 of the first video description word segment cap1, the reference segment embedding vector is E_A, which equals 1; for the first video phrase W_2.1 of the second video description word segment cap2, the reference segment embedding vector is E_B, which equals 2; and so on.
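The segment numbering described above can be sketched as a small helper that gives every phrase of the k-th word segment the value k, starting from 1. The function name `segment_ids` and the list-of-lists input layout are assumptions for illustration.

```python
def segment_ids(segments):
    """Assign segment value 1, 2, ... to every phrase of each word segment."""
    ids = []
    for index, phrases in enumerate(segments, start=1):
        ids.extend([index] * len(phrases))
    return ids
```

For cap1 with phrases W_1.1 and W_1.2 and cap2 with phrase W_2.1, the helper returns 1, 1, 2, matching E_A = 1 for both phrases of cap1 and E_B = 2 for cap2.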

In step S402 of some embodiments, the feature dimension corresponding to each video description word vector and the position data of the word vector are obtained, and position embedding is performed on the video description word vector according to the feature dimension, a preset function, and the word vector position data to obtain the video position embedding vector. For even-numbered positions of the feature dimension, the preset function is a sine function, which can be expressed as formula (4); for odd-numbered positions, the preset function is a cosine function, which can be expressed as formula (5):

PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right)    Formula (4)

PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)    Formula (5)

Here, 2i denotes the even-numbered positions of the feature dimension and 2i+1 the odd-numbered positions, with positions counted from 0; d_model (written dmodel in the source) denotes the feature dimension, typically 512; and pos denotes the position data of the word vector.
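Formulas (4) and (5) can be evaluated directly. The sketch below computes the position embedding vector for a single position in pure Python; the function name `position_embedding` is hypothetical, and the default of 512 dimensions follows the typical value mentioned in the text.

```python
import math


def position_embedding(pos, d_model=512):
    """Sinusoidal position embedding for one position, per formulas (4) and (5)."""
    pe = [0.0] * d_model
    for k in range(d_model):
        i = k // 2  # dimensions 2i and 2i+1 share the same frequency
        angle = pos / (10000 ** (2 * i / d_model))
        # Even dimension index: sine (formula 4); odd index: cosine (formula 5).
        pe[k] = math.sin(angle) if k % 2 == 0 else math.cos(angle)
    return pe
```

At position 0 every sine entry is 0 and every cosine entry is 1, which gives a quick sanity check for an implementation.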

In step S403 of some embodiments, the video description word vector, the video segment embedding vector, and the video position embedding vector are added together by vector superposition to obtain the video description representation vector.

For example, suppose there are 50 video description word segments, denoted cap1, ..., cap50. Word segmentation is performed on each video description word segment to obtain its video phrases W; for instance, segmenting the first video description word segment cap1 yields the video phrases W_1.1 and W_1.2, where E_1.1 denotes the word vector of the first video phrase W_1.1 in cap1 and E_1.2 denotes the word vector of the second video phrase W_1.2 in cap1. The video segment embedding vector corresponding to cap1 is E_A, and the video position embedding vectors E_1 and E_2 corresponding to the video phrases W_1.1 and W_1.2 can be calculated through formulas (4) and (5) above. The video description representation vector of the first video phrase W_1.1 in cap1 can therefore be expressed as E_1.1 + E_A + E_1, and that of the second video phrase W_1.2 in cap1 as E_1.2 + E_A + E_2.
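The sum E_1.1 + E_A + E_1 in this example is an element-wise addition of three vectors of equal length. The sketch below shows that operation; the name `combine_embeddings` is hypothetical, and representing the segment embedding as a vector filled with the segment value is an assumption.

```python
def combine_embeddings(word_vec, segment_vec, position_vec):
    """Element-wise sum of word, segment, and position embedding vectors."""
    if not (len(word_vec) == len(segment_vec) == len(position_vec)):
        raise ValueError("all three embedding vectors must have the same length")
    return [w + s + p for w, s, p in zip(word_vec, segment_vec, position_vec)]
```

For a two-dimensional toy case, a word vector [1, 2], a segment vector [1, 1] (segment value 1), and a position vector [0, 1] combine to [2, 4].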

Referring to FIG. 5, in some embodiments step S304 may further include, but is not limited to, steps S501 to S502:

Step S501: sort the video description word segments in descending order of text score to obtain a video description word segment sequence;

Step S502: screen the video description word segment sequence according to preset screening conditions to obtain the video summary sentences.

In step S501 of some embodiments, the video description word segments are sorted by text score from largest to smallest to obtain the video description word segment sequence.

In step S502 of some embodiments, the video description word segment sequence may be screened according to how each text score compares with the text score threshold, to obtain the video summary sentences.

Referring to FIG. 6, in some embodiments step S502 further includes, but is not limited to, steps S601 to S602:

Step S601: compare the text score of each video description word segment in the video description word segment sequence with the preset text score threshold;

Step S602: supplement the video description word segments whose text score is greater than or equal to the text score threshold to obtain the video summary sentences.

In step S601 of some embodiments, the text score of each video description word segment in the video description word segment sequence is compared with the preset text score threshold, and the video description word segments whose text score is greater than or equal to the text score threshold are placed in the same set to obtain a video summary sentence set. The text score threshold may be, for example, 75.
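The sorting and threshold comparison of steps S501 and S601 can be sketched as below. The function name `select_segments` and the (segment, score) pair layout are hypothetical; scores are assumed to be on a 0 to 100 scale so that the example threshold of 75 applies directly.

```python
def select_segments(scored_segments, threshold=75):
    """Sort (segment, score) pairs by score descending and keep score >= threshold."""
    ranked = sorted(scored_segments, key=lambda pair: pair[1], reverse=True)
    return [segment for segment, score in ranked if score >= threshold]
```

Because `sorted` is stable, segments with equal scores keep their original relative order.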

In step S602 of some embodiments, according to the standard fields in a preset text field database, the video description word segments in the video summary sentence set undergo sentence completion, entity rewriting, case rewriting, synonym conversion, and similar processing. This supplements the video description word segments to produce the video summary sentences and improves their completeness as sentences.

Referring to FIG. 7, in some embodiments step S106 may further include, but is not limited to, steps S701 to S702:

Step S701: obtain the preset separator of each video summary sentence;

Step S702: splice the video summary sentences according to the preset splicing order and the preset separators to obtain the video summary text.

In step S701 of some embodiments, since each video description word segment is provided with a corresponding placeholder [CLS], the video summary sentences are input into a database platform, through which the separator [CLS] of each video summary sentence can be obtained. The format of each video summary sentence is then converted to obtain a complete video summary sentence, with the separator [CLS] embedded at its start position.

In step S702 of some embodiments, to make the video summary more coherent, the complete video summary sentences may be spliced according to a preset splicing order and a splicing function. The preset splicing order may be, for example, the chronological order in which the video clips were acquired, and the preset splicing function may be the CONCAT() function or the CONCAT_WS() function. For example, on the database platform, the complete video summary sentences are labeled according to the chronological order in which the video clips were acquired, so that each complete video summary sentence carries a sequence label; the sequence label may be an Arabic numeral sequence (1, 2, 3, ...) or an English letter sequence (A, B, C, ...). The CONCAT() function then splices the labeled complete video summary sentences in sequence-label order to obtain the video summary text.
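The label-ordered splicing can be sketched in Python as a sort followed by a join, which mirrors the separator-joining behaviour of a CONCAT_WS()-style function. This is an illustrative analogue rather than the database call itself; the name `splice_summary` and the (label, sentence) pair layout are assumptions.

```python
def splice_summary(labelled_sentences, separator=" "):
    """Join sentences in ascending order of their sequence label."""
    ordered = sorted(labelled_sentences, key=lambda item: item[0])
    return separator.join(sentence for _, sentence in ordered)
```

The same helper works with numeric labels (1, 2, 3, ...) or letter labels (A, B, C, ...), provided all labels in one call are of the same type.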

In the embodiments of the present application, video data is acquired and video extraction is performed on it through a preset video extraction model to obtain multiple video clips, which effectively removes weakly relevant data from the video data, reduces the total amount of data, and improves the rationality of the data. The video clips are then encoded to obtain video hidden feature vectors, and the video hidden feature vectors are matrix-multiplied with the preset reference word vector to obtain video description word segments; in this way, video description word segments that meet the requirements can be obtained conveniently, improving the accuracy and efficiency of video summary generation. Next, text recognition processing is performed on the video description word segments through a preset text recognition model to obtain video summary sentences, so that the resulting sentences better highlight the main content of the video. Finally, the video summary sentences are spliced according to a preset splicing order to obtain the video summary text, which further improves the quality of the video summary text.

Referring to FIG. 8, an embodiment of the present application further provides a video summary generation apparatus that can implement the above video summary generation method. The apparatus includes:

a video data acquisition module 801, configured to acquire video data;

a video extraction module 802, configured to perform video extraction on the video data through a preset video extraction model to obtain multiple video clips;

an encoding module 803, configured to encode the video clips to obtain video hidden feature vectors;

a matrix multiplication module 804, configured to matrix-multiply the video hidden feature vectors with a preset reference word vector to obtain video description word segments;

a text recognition module 805, configured to perform text recognition processing on the video description word segments through a preset text recognition model to obtain video summary sentences;

a splicing module 806, configured to splice the video summary sentences according to a preset splicing order to obtain the video summary text.

The specific implementation of the video summary generation apparatus is substantially the same as the specific embodiments of the video summary generation method described above, and is not repeated here.

An embodiment of the present application further provides an electronic device. The electronic device includes a memory, a processor, a program stored in the memory and executable on the processor, and a data bus for connection and communication between the processor and the memory; when the program is executed by the processor, the above video summary generation method is implemented. The electronic device may be any intelligent terminal, including a tablet computer, an in-vehicle computer, and the like.

Referring to FIG. 9, which illustrates the hardware structure of an electronic device according to another embodiment, the electronic device includes:

a processor 901, which may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and which executes the relevant programs to implement the technical solutions provided by the embodiments of the present application;

a memory 902, which may be implemented as a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 902 may store an operating system and other application programs. When the technical solutions provided by the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory 902 and is called by the processor 901 to execute a video summary generation method, where the video summary generation method includes: acquiring video data; performing video extraction on the video data through a preset video extraction model to obtain multiple video clips; encoding the video clips to obtain video hidden feature vectors; matrix-multiplying the video hidden feature vectors with a preset reference word vector to obtain video description word segments; performing text recognition processing on the video description word segments through a preset text recognition model to obtain video summary sentences; and splicing the video summary sentences according to a preset splicing order to obtain the video summary text;

an input/output interface 903, configured to input and output information;

a communication interface 904, configured to enable communication between this device and other devices, either in a wired manner (for example USB or a network cable) or in a wireless manner (for example a mobile network, Wi-Fi, or Bluetooth);

a bus 905, which transfers information between the components of the device (for example the processor 901, the memory 902, the input/output interface 903, and the communication interface 904);

where the processor 901, the memory 902, the input/output interface 903, and the communication interface 904 are communicatively connected to one another within the device through the bus 905.

An embodiment of the present application further provides a storage medium. The storage medium is a computer-readable storage medium, which may be non-volatile or volatile. The storage medium stores one or more programs that can be executed by one or more processors to implement a video summary generation method, where the video summary generation method includes: acquiring video data; performing video extraction on the video data through a preset video extraction model to obtain multiple video clips; encoding the video clips to obtain video hidden feature vectors; matrix-multiplying the video hidden feature vectors with a preset reference word vector to obtain video description word segments; performing text recognition processing on the video description word segments through a preset text recognition model to obtain video summary sentences; and splicing the video summary sentences according to a preset splicing order to obtain the video summary text.

As a non-transitory computer-readable storage medium, the memory can be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device.

In some embodiments, the memory optionally includes memory located remotely from the processor, and such remote memory may be connected to the processor through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The video summary generation method, video summary generation apparatus, electronic device, and storage medium provided in the embodiments of the present application acquire video data and perform video extraction on it through a preset video extraction model to obtain multiple video clips, which effectively removes weakly relevant data from the video data, reduces the total amount of data, and improves the rationality of the data. The video clips are then encoded to obtain video hidden feature vectors, and the video hidden feature vectors are matrix-multiplied with the preset reference word vector to obtain video description word segments; in this way, video description word segments that meet the requirements can be obtained conveniently, improving the accuracy and efficiency of video summary generation. Next, text recognition processing is performed on the video description word segments through a preset text recognition model to obtain video summary sentences, so that the resulting sentences better highlight the main content of the video. Finally, the video summary sentences are spliced according to a preset splicing order to obtain the video summary text, which further improves the quality of the video summary text.

The embodiments described herein are intended to explain the technical solutions of the embodiments of the present application more clearly and do not limit those technical solutions. Those skilled in the art will appreciate that, as technology evolves and new application scenarios emerge, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.

Those skilled in the art will understand that the technical solutions shown in FIGS. 1-7 do not limit the embodiments of the present application, which may include more or fewer steps than illustrated, combine certain steps, or use different steps.

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

Those of ordinary skill in the art will understand that all or some of the steps of the methods disclosed above, and the functional modules/units of the systems and devices, may be implemented as software, firmware, hardware, or an appropriate combination thereof.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes any medium that can store programs, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

The preferred embodiments of the present application have been described above with reference to the accompanying drawings, and this description does not limit the scope of rights of the embodiments of the present application. Any modification, equivalent replacement, or improvement made by those skilled in the art without departing from the scope and essence of the embodiments of the present application shall fall within the scope of rights of the embodiments of the present application.

Claims

A video summary generation method, comprising:

acquiring video data;

performing video extraction on the video data through a preset video extraction model to obtain a plurality of video clips;

encoding the video clips to obtain video hidden feature vectors;

matrix-multiplying the video hidden feature vectors with a preset reference word vector to obtain video description word segments;

performing text recognition processing on the video description word segments through a preset text recognition model to obtain video summary sentences;

splicing the video summary sentences according to a preset splicing order to obtain a video summary text.
2. The video summary generation method according to claim 1, wherein the video extraction model comprises a two-stream network, a BM layer, a convolutional layer, and a preset function, and the step of performing video extraction on the video data by means of the preset video extraction model to obtain a plurality of video clips comprises:

performing feature extraction on the video data by means of the two-stream network to obtain video features;

performing, by means of the BM layer, dot-product processing on a preset weight matrix and the video features to obtain a video feature map;

performing convolution on the video feature map by means of the convolutional layer to obtain a video feature confidence map;

performing, by means of the preset function, feature probability calculation on each temporal position of the video features to obtain temporal probability values; and

segmenting the video data according to the video feature confidence map and the temporal probability values to obtain the video clips.
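The final segmentation step of claim 2 can be illustrated with a small sketch. It assumes that the temporal probability values are given as a start probability and an end probability per temporal position, that the confidence map holds one confidence value per (start, end) pair, and that a candidate clip is kept when the product of the three values reaches a threshold. None of these specifics is stated in the claim; the function name `select_clips`, the scoring rule, and the threshold are hypothetical.

```python
from typing import List, Sequence, Tuple


def select_clips(
    start_prob: Sequence[float],
    end_prob: Sequence[float],
    confidence: Sequence[Sequence[float]],
    threshold: float = 0.5,
) -> List[Tuple[int, int, float]]:
    """Return (start, end, score) for each candidate span that passes the threshold.

    The score start_prob[s] * end_prob[e] * confidence[s][e] is an assumed
    combination rule used only for illustration.
    """
    n = len(start_prob)
    clips = []
    for s in range(n):
        for e in range(s + 1, n):  # a clip must end after it starts
            score = start_prob[s] * end_prob[e] * confidence[s][e]
            if score >= threshold:
                clips.append((s, e, score))
    # Highest-scoring candidates first.
    return sorted(clips, key=lambda clip: clip[2], reverse=True)


# Toy data (hypothetical): four temporal positions.
START = [0.9, 0.1, 0.8, 0.1]
END = [0.1, 0.9, 0.1, 0.9]
CONFIDENCE = [
    [0.1, 0.9, 0.1, 0.3],
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.8],
    [0.1, 0.1, 0.1, 0.1],
]

if __name__ == "__main__":
    for start, end, score in select_clips(START, END, CONFIDENCE):
        print(start, end, round(score, 3))
```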
3. The video summary generation method according to claim 1, wherein the text recognition model comprises a Bert layer and a Transformer layer, and the step of performing text recognition on the video description word segments by means of the preset text recognition model to obtain video summary sentences comprises:

performing word vectorization on the video description word segments to obtain a video description word vector corresponding to each video description word segment;

embedding the video description word vectors by means of the Bert layer to obtain video description representation vectors;

performing, by means of the Transformer layer, text score calculation on each video description representation vector to obtain a text score of each video description representation vector; and

screening the video description word segments according to the text scores to obtain the video summary sentences.
4. The video summary generation method according to claim 3, wherein the step of embedding the video description word vectors by means of the Bert layer to obtain video description representation vectors comprises:

performing segment embedding on the video description word vectors by means of a reference segment embedding vector preset in the Bert layer to obtain video segment embedding vectors;

performing position embedding on the video description word vectors by means of a feature dimension preset in the Bert layer to obtain video position embedding vectors; and

combining the video description word vectors, the video segment embedding vectors, and the video position embedding vectors to obtain the video description representation vectors.
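The combination step of claim 4 can be sketched as follows. The sketch assumes that the three vectors are combined by element-wise addition, which is how BERT-style input embeddings are usually formed, and that the position embedding is the sinusoidal encoding from the original Transformer, computed from the position and the feature dimension. The claim itself does not fix either choice, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def position_embedding(position: int, dim: int) -> List[float]:
    """Sinusoidal position encoding (an assumption for this sketch).

    Even indices use sine and odd indices use cosine, each pair sharing one
    frequency that depends on the feature dimension.
    """
    return [
        math.sin(position / 10000 ** (i / dim))
        if i % 2 == 0
        else math.cos(position / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def combine_embeddings(
    word_vec: Sequence[float], segment_vec: Sequence[float], position: int
) -> List[float]:
    """Element-wise sum of word, segment, and position embeddings."""
    if len(word_vec) != len(segment_vec):
        raise ValueError("word and segment vectors must share one dimension")
    pos_vec = position_embedding(position, len(word_vec))
    return [w + s + p for w, s, p in zip(word_vec, segment_vec, pos_vec)]


if __name__ == "__main__":
    # At position 0 the encoding is 0 at even indices and 1 at odd indices.
    print(combine_embeddings([1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5], 0))
```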
5. The video summary generation method according to claim 3, wherein the step of screening the video description word segments according to the text scores to obtain video summary sentences comprises:

sorting the video description word segments in descending order of text score to obtain a video description word segment sequence; and

screening the video description word segment sequence according to a preset screening condition to obtain the video summary sentences.
6. The video summary generation method according to claim 5, wherein the step of screening the video description word segment sequence according to the preset screening condition to obtain the video summary sentences comprises:

comparing the text score of each video description word segment in the video description word segment sequence with a preset text threshold; and

supplementing the video description word segments whose text scores are greater than or equal to the text score threshold to obtain the video summary sentences.
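The ranking and threshold comparison described in claims 5 and 6 can be sketched in a few lines. The sketch assumes that each video description word segment arrives paired with its text score; the function name `screen_segments`, the sample segments, and the sample scores are hypothetical, and the supplementing step of claim 6 is not modeled.

```python
from typing import List, Sequence, Tuple


def screen_segments(scored: Sequence[Tuple[str, float]], threshold: float) -> List[str]:
    """Sort segments by score in descending order, then keep those at or above the threshold."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [segment for segment, score in ranked if score >= threshold]


# Toy data (hypothetical segments and scores).
SCORED = [
    ("in the park", 0.6),
    ("blurry frame", 0.2),
    ("a dog runs", 0.9),
]

if __name__ == "__main__":
    print(screen_segments(SCORED, threshold=0.6))
```

Sorting before filtering mirrors the order of the two claims; filtering first and sorting afterwards would give the same result with less work on long inputs.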
7. The video summary generation method according to any one of claims 1 to 6, wherein the step of splicing the video summary sentences according to the preset splicing order to obtain a video summary text comprises:

acquiring a preset delimiter of each video summary sentence; and

splicing the video summary sentences according to the preset splicing order and the preset delimiters to obtain the video summary text.
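The splicing of claim 7 can be illustrated with the sketch below. It assumes that the preset splicing order is a list of sentence indices and that each sentence carries its own delimiter, which is written after the sentence unless the sentence is the last one in the order. The function name `splice_summary` and this delimiter placement rule are assumptions made for illustration.

```python
from typing import List, Sequence


def splice_summary(
    sentences: Sequence[str], delimiters: Sequence[str], order: Sequence[int]
) -> str:
    """Join sentences in the given order, separating them with their own delimiters."""
    if len(sentences) != len(delimiters):
        raise ValueError("each sentence needs exactly one delimiter")
    parts: List[str] = []
    for rank, index in enumerate(order):
        parts.append(sentences[index])
        if rank < len(order) - 1:  # no trailing delimiter after the last sentence
            parts.append(delimiters[index])
    return "".join(parts)


# Toy data (hypothetical sentences, delimiters, and order).
SENTENCES = ["a dog runs", "it reaches the park", "the owner follows"]
DELIMITERS = [", ", "; ", ". "]

if __name__ == "__main__":
    print(splice_summary(SENTENCES, DELIMITERS, [0, 1, 2]))
```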
8. A video summary generation apparatus, comprising:

a video data acquisition module, configured to acquire video data;

a video extraction module, configured to perform video extraction on the video data by means of a preset video extraction model to obtain a plurality of video clips;

an encoding module, configured to encode the video clips to obtain video hidden feature vectors;

a matrix multiplication module, configured to perform matrix multiplication on the video hidden feature vectors and a preset reference word vector to obtain video description word segments;

a text recognition module, configured to perform text recognition on the video description word segments by means of a preset text recognition model to obtain video summary sentences; and

a splicing module, configured to splice the video summary sentences according to a preset splicing order to obtain a video summary text.
9. An electronic device, comprising a memory, a processor, a program stored in the memory and executable on the processor, and a data bus for implementing connection and communication between the processor and the memory, wherein the program, when executed by the processor, implements a video summary generation method, the video summary generation method comprising:

acquiring video data;

performing video extraction on the video data by means of a preset video extraction model to obtain a plurality of video clips;

encoding the video clips to obtain video hidden feature vectors;

performing matrix multiplication on the video hidden feature vectors and a preset reference word vector to obtain video description word segments;

performing text recognition on the video description word segments by means of a preset text recognition model to obtain video summary sentences; and

splicing the video summary sentences according to a preset splicing order to obtain a video summary text.
10. The electronic device according to claim 9, wherein the video extraction model comprises a two-stream network, a BM layer, a convolutional layer, and a preset function, and the step of performing video extraction on the video data by means of the preset video extraction model to obtain a plurality of video clips comprises:

performing feature extraction on the video data by means of the two-stream network to obtain video features;

performing, by means of the BM layer, dot-product processing on a preset weight matrix and the video features to obtain a video feature map;

performing convolution on the video feature map by means of the convolutional layer to obtain a video feature confidence map;

performing, by means of the preset function, feature probability calculation on each temporal position of the video features to obtain temporal probability values; and

segmenting the video data according to the video feature confidence map and the temporal probability values to obtain the video clips.
11. The electronic device according to claim 9, wherein the text recognition model comprises a Bert layer and a Transformer layer, and the step of performing text recognition on the video description word segments by means of the preset text recognition model to obtain video summary sentences comprises:

performing word vectorization on the video description word segments to obtain a video description word vector corresponding to each video description word segment;

embedding the video description word vectors by means of the Bert layer to obtain video description representation vectors;

performing, by means of the Transformer layer, text score calculation on each video description representation vector to obtain a text score of each video description representation vector; and

screening the video description word segments according to the text scores to obtain the video summary sentences.
12. The electronic device according to claim 11, wherein the step of embedding the video description word vectors by means of the Bert layer to obtain video description representation vectors comprises:

performing segment embedding on the video description word vectors by means of a reference segment embedding vector preset in the Bert layer to obtain video segment embedding vectors;

performing position embedding on the video description word vectors by means of a feature dimension preset in the Bert layer to obtain video position embedding vectors; and

combining the video description word vectors, the video segment embedding vectors, and the video position embedding vectors to obtain the video description representation vectors.
13. The electronic device according to claim 11, wherein the step of screening the video description word segments according to the text scores to obtain video summary sentences comprises:

sorting the video description word segments in descending order of text score to obtain a video description word segment sequence; and

screening the video description word segment sequence according to a preset screening condition to obtain the video summary sentences.
14. The electronic device according to claim 13, wherein the step of screening the video description word segment sequence according to the preset screening condition to obtain the video summary sentences comprises:

comparing the text score of each video description word segment in the video description word segment sequence with a preset text threshold; and

supplementing the video description word segments whose text scores are greater than or equal to the text score threshold to obtain the video summary sentences.
15. A storage medium, the storage medium being a computer-readable storage medium for computer-readable storage, wherein the storage medium stores one or more programs executable by one or more processors to implement a video summary generation method, the video summary generation method comprising:

acquiring video data;

performing video extraction on the video data by means of a preset video extraction model to obtain a plurality of video clips;

encoding the video clips to obtain video hidden feature vectors;

performing matrix multiplication on the video hidden feature vectors and a preset reference word vector to obtain video description word segments;

performing text recognition on the video description word segments by means of a preset text recognition model to obtain video summary sentences; and

splicing the video summary sentences according to a preset splicing order to obtain a video summary text.
16. The storage medium according to claim 15, wherein the video extraction model comprises a two-stream network, a BM layer, a convolutional layer, and a preset function, and the step of performing video extraction on the video data by means of the preset video extraction model to obtain a plurality of video clips comprises:

performing feature extraction on the video data by means of the two-stream network to obtain video features;

performing, by means of the BM layer, dot-product processing on a preset weight matrix and the video features to obtain a video feature map;

performing convolution on the video feature map by means of the convolutional layer to obtain a video feature confidence map;

performing, by means of the preset function, feature probability calculation on each temporal position of the video features to obtain temporal probability values; and

segmenting the video data according to the video feature confidence map and the temporal probability values to obtain the video clips.
17. The storage medium according to claim 15, wherein the text recognition model comprises a Bert layer and a Transformer layer, and the step of performing text recognition on the video description word segments by means of the preset text recognition model to obtain video summary sentences comprises:

performing word vectorization on the video description word segments to obtain a video description word vector corresponding to each video description word segment;

embedding the video description word vectors by means of the Bert layer to obtain video description representation vectors;

performing, by means of the Transformer layer, text score calculation on each video description representation vector to obtain a text score of each video description representation vector; and

screening the video description word segments according to the text scores to obtain the video summary sentences.
18. The storage medium according to claim 17, wherein the step of embedding the video description word vectors by means of the Bert layer to obtain video description representation vectors comprises:

performing segment embedding on the video description word vectors by means of a reference segment embedding vector preset in the Bert layer to obtain video segment embedding vectors;

performing position embedding on the video description word vectors by means of a feature dimension preset in the Bert layer to obtain video position embedding vectors; and

combining the video description word vectors, the video segment embedding vectors, and the video position embedding vectors to obtain the video description representation vectors.
19. The storage medium according to claim 17, wherein the step of screening the video description word segments according to the text scores to obtain video summary sentences comprises:

sorting the video description word segments in descending order of text score to obtain a video description word segment sequence; and

screening the video description word segment sequence according to a preset screening condition to obtain the video summary sentences.
20. The storage medium according to claim 19, wherein the step of screening the video description word segment sequence according to the preset screening condition to obtain the video summary sentences comprises:

comparing the text score of each video description word segment in the video description word segment sequence with a preset text threshold; and

supplementing the video description word segments whose text scores are greater than or equal to the text score threshold to obtain the video summary sentences.