CN114329070A - Video feature extraction method and device, computer equipment and storage medium

Info

Publication number
CN114329070A
Authority
CN
China
Prior art keywords
video
characteristic information
feature
information
feature information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111408061.8A
Other languages
Chinese (zh)
Inventor
李传俊
许有疆
胡智超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Wuhan Co Ltd
Original Assignee
Tencent Technology Wuhan Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Wuhan Co Ltd filed Critical Tencent Technology Wuhan Co Ltd
Priority to CN202111408061.8A priority Critical patent/CN114329070A/en
Publication of CN114329070A publication Critical patent/CN114329070A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The application relates to a video feature extraction method and device, computer equipment, and a storage medium. The method comprises the following steps: acquiring video data; disassembling the video data to obtain disassembled data corresponding to the video data, wherein the disassembled data comprises a video frame set and a video segment set; performing frame feature extraction on the video frames in the video frame set to obtain visual feature information, and performing segment feature extraction on the video segments in the video segment set to obtain segment feature information; convolving the visual feature information over the video frame number dimension to obtain first feature information; convolving the segment feature information over the video segment number dimension to obtain second feature information; and obtaining video feature information according to the first feature information and the second feature information. By adopting the method, the video data recognition rate can be improved.

Description

Video feature extraction method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for extracting video features, a computer device, and a storage medium.
Background
With the development of artificial intelligence technology, video feature extraction techniques have emerged. Video feature extraction extracts feature information from video data so that the video data can be identified using that feature information; for example, the feature information may be used to classify the video data and determine its category.
However, when video features are extracted by conventional methods, the extracted image features cannot sufficiently describe the video data, so identification based on such feature information is inaccurate and the video data recognition rate is reduced.
Disclosure of Invention
In view of the above technical problems, it is desirable to provide a video feature extraction method, apparatus, computer device, storage medium, and computer program product capable of improving the video data recognition rate.
A method of video feature extraction, the method comprising:
acquiring video data;
disassembling the video data to obtain disassembled data corresponding to the video data, wherein the disassembled data comprises a video frame set and a video segment set;
performing frame feature extraction on video frames in the video frame set to obtain visual feature information, and performing segment feature extraction on video segments in the video segment set to obtain segment feature information;
performing convolution on the visual characteristic information on the number dimension of the video frames to obtain first characteristic information;
convolving the segment feature information on the video segment number dimension to obtain second feature information;
and obtaining video characteristic information according to the first characteristic information and the second characteristic information.
A video feature extraction apparatus, the apparatus comprising:
the acquisition module is used for acquiring video data;
the disassembling module is used for disassembling the video data to obtain disassembled data corresponding to the video data, wherein the disassembled data comprises a video frame set and a video segment set;
the characteristic extraction module is used for extracting the frame characteristics of the video frames in the video frame set to obtain visual characteristic information and extracting the segment characteristics of the video segments in the video segment set to obtain segment characteristic information;
the first convolution module is used for performing convolution on the visual characteristic information on the number dimension of the video frames to obtain first characteristic information;
the second convolution module is used for performing convolution on the segment feature information on the video segment number dimension to obtain second feature information;
and the processing module is used for obtaining the video characteristic information according to the first characteristic information and the second characteristic information.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring video data;
disassembling the video data to obtain disassembled data corresponding to the video data, wherein the disassembled data comprises a video frame set and a video segment set;
performing frame feature extraction on video frames in the video frame set to obtain visual feature information, and performing segment feature extraction on video segments in the video segment set to obtain segment feature information;
performing convolution on the visual characteristic information on the number dimension of the video frames to obtain first characteristic information;
convolving the segment feature information on the video segment number dimension to obtain second feature information;
and obtaining video characteristic information according to the first characteristic information and the second characteristic information.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring video data;
disassembling the video data to obtain disassembled data corresponding to the video data, wherein the disassembled data comprises a video frame set and a video segment set;
performing frame feature extraction on video frames in the video frame set to obtain visual feature information, and performing segment feature extraction on video segments in the video segment set to obtain segment feature information;
performing convolution on the visual characteristic information on the number dimension of the video frames to obtain first characteristic information;
convolving the segment feature information on the video segment number dimension to obtain second feature information;
and obtaining video characteristic information according to the first characteristic information and the second characteristic information.
A computer program product comprising a computer program which when executed by a processor performs the steps of:
acquiring video data;
disassembling the video data to obtain disassembled data corresponding to the video data, wherein the disassembled data comprises a video frame set and a video segment set;
performing frame feature extraction on video frames in the video frame set to obtain visual feature information, and performing segment feature extraction on video segments in the video segment set to obtain segment feature information;
performing convolution on the visual characteristic information on the number dimension of the video frames to obtain first characteristic information;
convolving the segment feature information on the video segment number dimension to obtain second feature information;
and obtaining video characteristic information according to the first characteristic information and the second characteristic information.
With the video feature extraction method, apparatus, computer device, storage medium, and program product, video data is acquired and disassembled into disassembled data comprising a video frame set and a video segment set; frame feature extraction on the video frames in the video frame set yields visual feature information, and segment feature extraction on the video segments in the video segment set yields segment feature information. The visual feature information is then convolved over the video frame number dimension, which models the relationship between consecutive video frames and yields first feature information carrying timing information; the segment feature information is convolved over the video segment number dimension, which models consecutive video frames at the segment level and yields second feature information carrying timing information. Video feature information that fully describes the video data is then obtained from the first feature information and the second feature information, so the video data recognition rate can be improved.
Drawings
FIG. 1 is a schematic flow chart of a video feature extraction method according to an embodiment;
FIG. 2 is a diagram illustrating a ViT (Vision Transformer) network, according to one embodiment;
FIG. 3 is a diagram of a TSM (Temporal Shift Module for Efficient Video Understanding) network in one embodiment;
FIG. 4 is a diagram illustrating the convolution of visual characteristic information according to a predetermined convolution kernel in one embodiment;
FIG. 5 is a diagram that illustrates extraction of textual features, in one embodiment;
FIG. 6 is a diagram of a video classification model in one embodiment;
FIG. 7 is a diagram illustrating a process for obtaining feature information in one embodiment;
FIG. 8 is a flowchart illustrating a method for extracting video features according to another embodiment;
FIG. 9 is a block diagram showing the structure of a video feature extraction apparatus according to an embodiment;
FIG. 10 is a diagram showing an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The application relates to the technical field of artificial intelligence. Artificial Intelligence (AI) is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines can perceive, reason, and make decisions. Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer vision is the science of studying how to make machines "see". Specifically, it uses cameras and computers instead of human eyes to identify, track, and measure targets, and further processes the captured images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, as well as common biometric technologies such as face recognition and fingerprint recognition.
In an embodiment, as shown in fig. 1, a video feature extraction method is provided. This embodiment is illustrated by applying the method to a server; it is to be understood that the method may also be applied to a terminal, or to a system including a terminal and a server and implemented through interaction between the terminal and the server. The terminal can be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer, or a portable wearable device; the server can be implemented by an independent server or a server cluster formed by a plurality of servers, or can be a node on a blockchain. In this embodiment, the method includes the following steps:
step 102, video data is acquired.
Specifically, when video feature extraction needs to be performed on video data, the server acquires the video data.
And 104, disassembling the video data to obtain disassembled data corresponding to the video data, wherein the disassembled data comprises a video frame set and a video segment set.
Here, disassembling the video data refers to splitting and parsing the video data to obtain the disassembled data required for video feature extraction. The disassembled data comprises a video frame set and a video segment set. The video frame set is a set of video frames obtained by frame extraction from the video data; when frames are extracted, the number of frames sampled per second can be set as needed, for example 1 frame per second. The video segment set is a set of video segments obtained by segmenting the video data; when the video data is segmented, the duration of each video segment can be set as needed, for example 8 seconds, so that one video segment is intercepted every 8 seconds.
Specifically, after obtaining the video data, the server disassembles it: the server performs frame extraction on the video data according to a preset number of frames per second to obtain the video frame set, and divides the video data into segments according to a preset video segment duration to obtain the video segment set.
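As an illustrative sketch only (not taken from the patent text), the frame extraction and segment division described above could be implemented along the following lines, assuming OpenCV for decoding; the 1-frame-per-second sampling and 8-second segment length are the example settings mentioned above.

```python
import cv2

def disassemble_video(path, frames_per_second=1, segment_seconds=8):
    """Split a video file into a sampled frame set and a list of fixed-length segments."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0            # fall back if FPS is unavailable
    frame_step = max(1, int(round(fps / frames_per_second)))
    frames_per_segment = int(round(fps * segment_seconds))

    frame_set, segment_set, current_segment = [], [], []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % frame_step == 0:                    # frame extraction (1 fps here)
            frame_set.append(frame)
        current_segment.append(frame)                  # accumulate frames for segments
        if len(current_segment) == frames_per_segment: # close an 8-second segment
            segment_set.append(current_segment)
            current_segment = []
        index += 1
    if current_segment:                                # keep the trailing partial segment
        segment_set.append(current_segment)
    cap.release()
    return frame_set, segment_set
```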
And 106, extracting frame characteristics of the video frames in the video frame set to obtain visual characteristic information, and extracting segment characteristics of the video segments in the video segment set to obtain segment characteristic information.
Specifically, the server may perform frame feature extraction on each video frame in the video frame set to obtain visual feature information corresponding to each frame. Frame feature extraction here may specifically refer to visual semantic feature extraction: the server splits a video frame into a plurality of small patches and establishes connections between the different patches to obtain the visual feature information corresponding to the video frame. For example, frame feature extraction can be implemented by a pre-trained ViT network, which is a successful application of the Transformer structure to image tasks. As shown in fig. 2, the ViT network represents a video frame as N x N small patches, establishes links between the different patches through a Transformer module, and finally obtains a feature expression of the whole video frame and the corresponding prediction result.
The structure of the Transformer module is shown in fig. 2 and includes an Embedded Patches module, a Norm module, a Multi-Head Attention module, and Norm and MLP (multilayer perceptron) modules. The Embedded Patches module divides a video frame into multiple patches; each patch is flattened into a one-dimensional patch embedding by concatenating all pixel channels in the patch and is then linearly projected to the required input dimension. The Norm module performs normalization, and the Multi-Head Attention module establishes connections between the different patches through the attention mechanism, which helps capture richer features. The MLP module is used to handle nonlinearly separable problems; in this embodiment, it mainly maps the inputs to the outputs. In this embodiment, specifically, a pre-trained ViT 384 network may be used to extract the features of the video frames, and the features of the penultimate hidden layer of the ViT 384 network may be extracted as the video frame features, that is, the visual feature information.
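For illustration, a toy PyTorch sketch of the ViT-style processing described above is given below: patch embedding, a stack of Transformer encoder layers, and the penultimate hidden layer taken as the frame feature. This is an assumption-laden simplification, not the pre-trained ViT 384 network used in the embodiment.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Toy ViT-style encoder: patch embedding + Transformer layers.
    Returns the CLS token of the penultimate hidden layer as the frame feature."""
    def __init__(self, image_size=384, patch_size=16, dim=768, depth=12, heads=12):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            for _ in range(depth)
        ])

    def forward(self, images):                    # images: (B, 3, 384, 384)
        x = self.patch_embed(images)              # (B, dim, 24, 24) patch grid
        x = x.flatten(2).transpose(1, 2)          # (B, 576, dim) patch embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        hidden_states = []
        for block in self.blocks:                 # attention links the patches
            x = block(x)
            hidden_states.append(x)
        return hidden_states[-2][:, 0]            # penultimate hidden layer, CLS token
```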
Specifically, the server may perform segment feature extraction on each video segment in the video segment set to obtain segment feature information corresponding to each video segment. Segment feature extraction here may specifically refer to segment feature extraction based on moving features across video frames. In each video segment, after frame feature extraction has been performed on every frame to obtain the corresponding frame features, part of the features of the previous video frame is retained and replaces part of the features of the current video frame, so that the current video frame obtains timing information from the previous frame; likewise, part of the features of the current video frame is retained and replaces part of the features of the next video frame, so that the next frame obtains timing information from the current frame. The features of the consecutive video frames are replaced in turn, so that every video frame retains part of the features of its preceding frame. The movement described above is unidirectional; a bidirectional movement may also be adopted in this embodiment, that is, after the features of the last video frame in the segment have been replaced, part of the features of the last frame is retained and replaces part of the features of the preceding frame, and the features of the consecutive video frames are replaced in turn until the first video frame of the segment has been replaced, completing the bidirectional movement.
For example, segment feature extraction in this embodiment may be implemented by a pre-trained TSM network, which increases the network's ability to model timing information by moving features across consecutive video frames, so that segment feature information carrying timing information can be obtained. As shown in fig. 3, (a) shows the features of the video frames in a video segment before any movement, (b) is a schematic diagram of bidirectional movement across consecutive video frames, and (c) is a schematic diagram of unidirectional movement.
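A minimal PyTorch sketch of the temporal shift idea described above (a re-implementation of the published TSM concept, not code from the patent); the choice of shifting one eighth of the channels in each direction is an assumption.

```python
import torch

def temporal_shift(x, shift_div=8):
    """Bidirectional temporal shift.

    x: (batch, time, channels, ...) features of consecutive frames in a segment.
    A fraction of the channels is shifted to the next frame, a fraction to the
    previous frame, and the rest is kept, so each frame carries information
    from its neighbours.
    """
    b, t, c = x.shape[:3]
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                    # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]    # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]               # unshifted channels
    return out
```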
And 108, performing convolution on the visual characteristic information in the dimension of the number of the video frames to obtain first characteristic information.
Convolving over the video frame number dimension changes the purely linear expansion of the visual feature information: the original dimension expansion is replaced by expansion operations over several different temporal ranges, so that timing information is recovered and features of the video frames that directly carry a temporal order can be extracted.
Specifically, in the video frame number dimension, the server convolves the visual feature information with each of a plurality of preset convolution kernels to obtain first convolution feature information corresponding to each preset convolution kernel, then splices the first convolution feature information and performs feature aggregation on the spliced result to obtain the first feature information. Performing the convolution over the video frame number dimension is equivalent to modeling the relationship between consecutive video frames, so first convolution feature information carrying timing information can be obtained. The size and number of the preset convolution kernels may be set as required; for example, the preset convolution kernels may be one-dimensional convolutions with kernel sizes of 1 × 1, 1 × 3, and 1 × 5. In that case, when the visual feature information is convolved over the video frame number dimension, the server performs convolutions spanning one, three, and five frames respectively to obtain the first convolution feature information. For example, as shown in fig. 4, X represents the visual feature information, X1, X2, and X3 are the first convolution feature information corresponding to the respective preset convolution kernels, and splicing them yields the spliced first convolution feature information X4.
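The multi-kernel convolution over the frame number dimension can be sketched as follows in PyTorch; the module name, the padding choices, and the axis along which X1, X2, and X3 are concatenated are illustrative assumptions. The same pattern applies when segment-level or audio-level features are convolved over their respective count dimensions later in the method.

```python
import torch
import torch.nn as nn

class TemporalMultiScaleConv(nn.Module):
    """Convolve frame-level features over the number-of-frames dimension with
    several kernel sizes and concatenate the results (X1, X2, X3 -> X4)."""
    def __init__(self, dim, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(dim, dim, kernel_size=k, padding=k // 2) for k in kernel_sizes
        ])

    def forward(self, x):                # x: (batch, num_frames, dim) visual features
        x = x.transpose(1, 2)            # Conv1d expects (batch, dim, num_frames)
        outs = [branch(x) for branch in self.branches]
        x4 = torch.cat(outs, dim=2)      # concatenation axis is an assumption
        return x4.transpose(1, 2)        # back to (batch, 3 * num_frames, dim)
```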
And step 110, performing convolution on the segment characteristic information on the video segment number dimension to obtain second characteristic information.
Convolving over the video segment number dimension changes the purely linear expansion of the segment feature information: the original dimension expansion is replaced by expansion operations over several different temporal ranges, so that timing information is recovered and features that directly carry a temporal order can be extracted.
Specifically, in the video segment number dimension, the server convolves the segment feature information with each of a plurality of preset convolution kernels to obtain second convolution feature information corresponding to each preset convolution kernel, then splices the second convolution feature information and performs feature aggregation on the spliced result to obtain the second feature information. Performing the convolution over the video segment number dimension is equivalent to modeling consecutive video frames at the segment level, so second convolution feature information carrying timing information can be obtained. The size and number of the preset convolution kernels may be set as required; for example, the preset convolution kernels may be one-dimensional convolutions with kernel sizes of 1 × 1, 1 × 3, and 1 × 5. In that case, when the segment feature information is convolved over the video segment number dimension, the server performs convolutions spanning one, three, and five segments respectively to obtain the second convolution feature information.
And step 112, obtaining video characteristic information according to the first characteristic information and the second characteristic information.
Specifically, after obtaining the first feature information and the second feature information, the server splices them to obtain spliced feature information, which comprises multi-channel feature information. The server then calculates a weighting parameter for each channel in the spliced feature information, updates the multi-channel feature information with the weighting parameters to obtain updated spliced feature information, and obtains the video feature information according to the updated spliced feature information. Calculating the weighting parameter for each channel in the spliced feature information means automatically learning the importance of each feature channel, and then, according to that importance, promoting useful features and suppressing features that are not useful for the current task.
For example, the server may calculate the weighting parameter for each channel in the spliced feature information with a pre-trained SENet (Squeeze-and-Excitation Networks) network, which processes features as follows: first, a Squeeze operation is applied to the convolved feature map to obtain channel-level global features; then an Excitation operation is applied to the global features to learn the relationships among the channels and obtain the weights of the different channels; finally, the weights are multiplied by the original feature map to obtain the final features. In this embodiment, the Squeeze operation is performed on the spliced feature information to obtain channel-level global features, the Excitation operation is then performed on the global features to learn the relationships among the channels and obtain the weighting parameter for each channel in the spliced feature information, and finally each weighting parameter is multiplied by the feature information of the corresponding channel to obtain the updated spliced feature information. In essence, the SENet module performs an attention or gating operation in the channel dimension, which allows the model to focus more on the most informative channel features while suppressing the unimportant ones.
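A hedged PyTorch sketch of SE-style channel weighting applied to a concatenated feature vector; the reduction ratio and the treatment of the input as a flat vector are assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    """SE-style gating: learn a weight per channel and rescale the features."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # squeeze to a bottleneck
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # excite back to all channels
            nn.Sigmoid(),                                # per-channel weights in (0, 1)
        )

    def forward(self, x):            # x: (batch, channels) spliced feature vector
        weights = self.fc(x)
        return x * weights           # up-weight informative channels, suppress the rest
```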
With the video feature extraction method, video data is acquired and disassembled into disassembled data comprising a video frame set and a video segment set; visual feature information is obtained by frame feature extraction on the video frames in the video frame set, and segment feature information is obtained by segment feature extraction on the video segments in the video segment set. The visual feature information is then convolved over the video frame number dimension, modeling consecutive video frames to obtain first feature information carrying timing information, and the segment feature information is convolved over the video segment number dimension, modeling consecutive video frames at the segment level to obtain second feature information carrying timing information. Video feature information that fully describes the video data is thus obtained from the first feature information and the second feature information, and the video data recognition rate can be improved.
In one embodiment, convolving the visual characteristic information in the video frame number dimension to obtain the first characteristic information includes:
on the video frame number dimension, respectively convolving the visual characteristic information according to a plurality of preset convolution kernels to obtain first convolution characteristic information corresponding to the preset convolution kernels;
and splicing the first convolution characteristic information, and performing characteristic aggregation on the spliced first convolution characteristic information to obtain first characteristic information.
Specifically, in the video frame number dimension, the server convolves the visual feature information with each of a plurality of preset convolution kernels to obtain first convolution feature information corresponding to each preset convolution kernel, splices the first convolution feature information, performs feature clustering on the spliced first convolution feature information, and converts the spliced first convolution feature information into global feature information, namely the first feature information, using the clustering result. The feature clustering may specifically be K-means clustering: K cluster centers corresponding to the spliced first convolution feature information are obtained by K-means clustering, the distribution of the differences between the spliced first convolution feature information and the K cluster centers is then calculated, and the spliced first convolution feature information is thereby converted into global feature information, that is, the first feature information.
For example, in this embodiment, a pre-trained NeXtVLAD (Next Vector of Locally Aggregated Descriptors) network may be used to perform feature clustering on the spliced first convolution feature information to obtain the first feature information. The NeXtVLAD network is an improvement on the VLAD (Vector of Locally Aggregated Descriptors) network, one of the classic image feature extraction methods, whose calculation flow is as follows: K-means clustering is first performed on the N × D feature map to obtain K cluster centers, and the N × D local feature map is then converted into a global feature map V of size K × D using the following formula:
V(j, k) = Σ_{i=1}^{N} a_k(x_i) · (x_i(j) − c_k(j))
where x_i denotes the i-th local feature and c_k denotes the k-th cluster center; x_i and c_k are both D-dimensional vectors, and a_k(x_i) is a sign function: a_k(x_i) = 0 if x_i does not belong to cluster center c_k, and a_k(x_i) = 1 if x_i belongs to cluster center c_k.
The improvement of the NeXtVLAD network over the VLAD network is that a_k(x_i) is no longer a simple sign function but a weight function, so that the closer x_i is to c_k, the closer a_k(x_i) is to 1, and the farther away it is, the closer a_k(x_i) is to 0. This adds a nonlinear parameter to the VLAD layer while reducing the parameters of the output layer, so the overall number of parameters is reduced. The weight function here may be set as required, provided it satisfies the above condition; this embodiment is not specifically limited in this respect. With this improvement, the parameter c_k obtained by clustering is still required: K classes are set so that the local feature differences can be calculated, and the global feature V(j, k) is then obtained by training the NeXtVLAD network.
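For illustration, a simplified VLAD-style aggregation layer with a soft assignment weight a_k(x_i), sketched in PyTorch; it follows the formula above but is not the NeXtVLAD implementation referenced in the embodiment (the grouping and gating tricks of NeXtVLAD are omitted).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftVLAD(nn.Module):
    """Aggregate N local D-dim features into a K x D global descriptor:
    V(j, k) = sum_i a_k(x_i) * (x_i(j) - c_k(j)), with a_k a soft weight."""
    def __init__(self, dim, num_clusters):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_clusters, dim))   # c_k
        self.assign = nn.Linear(dim, num_clusters)                    # logits for a_k(x_i)

    def forward(self, x):                          # x: (batch, N, dim) local features
        a = F.softmax(self.assign(x), dim=-1)      # (batch, N, K) soft assignments
        residual = x.unsqueeze(2) - self.centers   # (batch, N, K, dim): x_i - c_k
        v = (a.unsqueeze(-1) * residual).sum(dim=1)   # (batch, K, dim) descriptor
        v = F.normalize(v, dim=-1)                 # intra-normalize each cluster
        return v.flatten(1)                        # (batch, K * dim) global feature
```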
In this embodiment, by convolving the visual feature information with a plurality of preset convolution kernels in the video frame number dimension, modeling can be performed between consecutive video frames and first convolution feature information that directly carries the temporal order of the video frames can be extracted. The first convolution feature information is then spliced and feature aggregation is performed on the spliced result to obtain the first feature information, and the feature aggregation achieves the conversion of the feature dimensions.
In one embodiment, convolving the segment feature information in the video segment number dimension, and obtaining the second feature information includes:
on the video segment number dimension, respectively convolving the segment characteristic information according to a plurality of preset convolution kernels to obtain second convolution characteristic information corresponding to the preset convolution kernels;
and splicing the second convolution characteristic information, and performing characteristic aggregation on the spliced second convolution characteristic information to obtain second characteristic information.
Specifically, in the video segment number dimension, the server convolves the segment feature information with each of a plurality of preset convolution kernels to obtain second convolution feature information corresponding to each preset convolution kernel, splices the second convolution feature information, performs feature clustering on the spliced second convolution feature information, and converts it into global feature information, namely the second feature information, using the clustering result. The feature clustering may specifically be K-means clustering: K cluster centers corresponding to the spliced second convolution feature information are obtained by K-means clustering, the distribution of the differences between the spliced second convolution feature information and the K cluster centers is then calculated, and the spliced second convolution feature information is thereby converted into global feature information, that is, the second feature information. For example, in this embodiment, a pre-trained NeXtVLAD network may be used to perform feature clustering on the spliced second convolution feature information to obtain the second feature information.
In this embodiment, by convolving the segment feature information with a plurality of preset convolution kernels in the video segment number dimension, modeling can be performed between consecutive video frames at the segment level and second convolution feature information that directly carries the temporal order of the video segments can be extracted. The second convolution feature information is then spliced and feature aggregation is performed on the spliced result to obtain the second feature information, and the feature aggregation achieves the conversion of the feature dimensions.
In one embodiment, obtaining the video feature information according to the first feature information and the second feature information comprises:
splicing the first characteristic information and the second characteristic information to obtain spliced characteristic information, wherein the spliced characteristic information comprises multi-channel characteristic information;
determining a weighting parameter corresponding to each channel in the splicing characteristic information;
updating the multi-channel characteristic information according to the weighting parameters to obtain updated splicing characteristic information;
and obtaining video characteristic information according to the updated splicing characteristic information.
Specifically, the server splices the first feature information and the second feature information to obtain spliced feature information, which comprises multi-channel feature information. After obtaining the spliced feature information, the server calculates a weighting parameter for each channel in the spliced feature information, updates the multi-channel feature information with the weighting parameters to obtain updated spliced feature information, and reduces the dimensionality of the updated spliced feature information to obtain the video feature information. Calculating the weighting parameter for each channel means automatically learning the importance of each feature channel and then, according to that importance, promoting useful features and suppressing features that are not useful for the current task. The dimensionality reduction of the updated spliced feature information can be performed with a fully connected layer: a pre-trained fully connected layer is attached after the updated spliced feature information and classifies it, so before the classification result is produced the features of the penultimate hidden layer of the fully connected layer can be taken as the video feature information.
In this embodiment, the first feature information and the second feature information are spliced to obtain the spliced feature information, the weighting parameter for each channel in the spliced feature information is determined, and the spliced feature information can be updated with the weighting parameters so that useful features are promoted and less useful features are suppressed; the video feature information is then obtained from the updated spliced feature information.
In one embodiment, the split data further comprises a set of audio segments;
the video feature extraction method further comprises:
performing audio feature extraction on the audio segments in the audio segment set to obtain audio feature information;
convolving the audio feature information on the dimensionality of the audio fragment number to obtain third feature information;
obtaining the video feature information according to the first feature information and the second feature information includes:
and splicing the first characteristic information, the second characteristic information and the third characteristic information to obtain video characteristic information.
The audio segment set is a set of audio segments obtained by segmenting the audio corresponding to the video data. When the audio is segmented, the duration of each audio segment can be set as needed, for example 3 seconds, so that one audio segment is intercepted every 3 seconds.
Specifically, when disassembling the video data, the server converts the video data to obtain the audio corresponding to the video data, divides that audio into segments to obtain the audio segment set, performs a Fourier transform on the audio segments in the audio segment set to obtain the spectrograms corresponding to the audio segments, performs feature extraction on the spectrograms to obtain audio feature information, and convolves the audio feature information over the audio segment number dimension to obtain third feature information. After obtaining the third feature information, the server splices the first feature information, the second feature information, and the third feature information to obtain spliced feature information comprising multi-channel feature information. The server then calculates a weighting parameter for each channel in the spliced feature information, updates the multi-channel feature information with the weighting parameters to obtain updated spliced feature information, and obtains the video feature information according to the updated spliced feature information. Calculating the weighting parameter for each channel means automatically learning the importance of each feature channel and then, according to that importance, promoting useful features and suppressing features that are not useful for the current task. For example, the server may calculate the weighting parameter for each channel with a pre-trained SENet network.
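A minimal sketch of the audio-side preprocessing described above, assuming plain PyTorch: the audio track is cut into fixed-length segments and each segment is converted into a log-magnitude spectrogram with a short-time Fourier transform. The window and hop sizes are assumptions.

```python
import torch

def split_audio(waveform, sample_rate, segment_seconds=3):
    """Cut a mono audio track (1-D tensor) into fixed-length segments (3 s here)."""
    step = int(segment_seconds * sample_rate)
    return [waveform[i:i + step] for i in range(0, waveform.numel(), step)]

def segment_to_spectrogram(segment, n_fft=1024, hop_length=512):
    """Short-time Fourier transform of one audio segment -> log-magnitude spectrogram."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(segment, n_fft=n_fft, hop_length=hop_length,
                      window=window, return_complex=True)
    return torch.log1p(spec.abs())                # (freq_bins, time_steps)
```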
In this embodiment, the third feature information is obtained according to the audio segment set, and the first feature information, the second feature information, and the third feature information are spliced to obtain the video feature information, so that the video feature can be described by combining the audio frame and the video frame at the same time, and more comprehensive video feature information can be obtained.
In one embodiment, the split data further comprises video text data;
the video feature extraction method further comprises:
performing text feature extraction on the video text data to obtain fourth feature information;
obtaining the video feature information according to the first feature information and the second feature information includes:
and splicing the first characteristic information, the second characteristic information and the fourth characteristic information to obtain video characteristic information.
The video text data refers to data obtained by performing text recognition on an obtained video frame after the video frame is obtained by performing frame extraction on the video data.
Specifically, when disassembling the video data, after the video frames are obtained by frame extraction, the server can further perform text recognition on the video frames to extract the video text data corresponding to the video data, and then perform text feature extraction on the video text data to obtain fourth feature information, so that spliced feature information can be obtained from the first feature information, the second feature information, and the fourth feature information, the spliced feature information comprising multi-channel feature information.
Specifically, after the splicing characteristic information is obtained, the server calculates weighting parameters corresponding to each channel in the splicing characteristic information, updates the multi-channel characteristic information by using the weighting parameters to obtain updated splicing characteristic information, and obtains video characteristic information according to the updated splicing characteristic information. The calculation of the weighting parameters corresponding to the channels in the spliced feature information means that the importance degree of each feature channel is automatically obtained in a learning mode, and then useful features are promoted according to the importance degree and the features which are not useful for the current task are suppressed. For example, the server may calculate the weighting parameter corresponding to each channel in the splicing feature information by training the SENet network in advance.
In this embodiment, the fourth feature information is obtained by performing text feature extraction on the video text data, and the first feature information, the second feature information, and the fourth feature information are spliced to obtain the video feature information.
In one embodiment, the split data further comprises a set of audio segments and video text data;
the video feature extraction method further comprises:
performing audio feature extraction on the audio segments in the audio segment set to obtain audio feature information;
convolving the audio feature information on the dimensionality of the audio fragment number to obtain third feature information;
performing text feature extraction on the video text data to obtain fourth feature information;
obtaining the video feature information according to the first feature information and the second feature information includes:
and splicing the first characteristic information, the second characteristic information, the third characteristic information and the fourth characteristic information to obtain the video characteristic information.
Specifically, the split data further comprises an audio segment set and video text data. The server obtains third feature information according to the audio segment set, performs text feature extraction on the video text data to obtain fourth feature information, and splices the first feature information, the second feature information, the third feature information, and the fourth feature information to obtain spliced feature information comprising multi-channel feature information. After obtaining the spliced feature information, the server calculates a weighting parameter for each channel in the spliced feature information, updates the multi-channel feature information with the weighting parameters to obtain updated spliced feature information, and obtains the video feature information by reducing the dimensionality of the updated spliced feature information. Calculating the weighting parameter for each channel means automatically learning the importance of each feature channel and then, according to that importance, promoting useful features and suppressing features that are not useful for the current task. For example, the server may calculate the weighting parameter for each channel with a pre-trained SENet network.
In this embodiment, the video characteristics are described by simultaneously combining the audio, video frames and video text data, so that more comprehensive video characteristic information can be obtained.
In one embodiment, convolving the audio feature information in the audio segment number dimension, and obtaining the third feature information comprises:
on the dimensionality of the audio fragment number, respectively convolving the audio characteristic information according to a plurality of preset convolution kernels to obtain third convolution characteristic information corresponding to the preset convolution kernels;
and splicing the third convolution characteristic information, and performing characteristic aggregation on the spliced third convolution characteristic information to obtain third characteristic information.
Specifically, in the audio segment number dimension, the server convolves the audio feature information with each of a plurality of preset convolution kernels to obtain third convolution feature information corresponding to each preset convolution kernel, then splices the third convolution feature information and performs feature aggregation on the spliced third convolution feature information to obtain the third feature information. The manner of convolving the audio feature information with the plurality of preset convolution kernels to obtain the third convolution feature information is similar to the manner of convolving the segment feature information to obtain the second convolution feature information in the above embodiment, and is not described again here. The manner of performing feature aggregation on the spliced third convolution feature information to obtain the third feature information is likewise similar to the manner of performing feature aggregation on the spliced second convolution feature information to obtain the second feature information, and is not described again here. The feature extraction on the spectrograms to obtain the audio feature information can be performed with a pre-trained MusiCNN network (a musically motivated convolutional neural network pre-trained for music audio tagging): after the MusiCNN network is pre-trained, the features of its last fully connected layer are extracted as the audio feature information.
In this embodiment, by convolving the audio feature information with a plurality of preset convolution kernels in the audio segment number dimension, modeling can be performed across consecutive audio segments and third convolution feature information that directly carries the temporal order of the audio segments can be extracted. The third convolution feature information is then spliced and feature aggregation is performed on the spliced result to obtain the third feature information, and the feature aggregation achieves the conversion of the feature dimensions.
In one embodiment, performing text feature extraction on the video text data to obtain fourth feature information includes:
extracting text characteristics of each section of text data in the video text data;
and performing feature dimension conversion on the text features to obtain fourth feature information.
Specifically, the server performs feature extraction on each segment of text data in the video text data, extracts text features of each segment of text data in the video text data, performs feature dimension conversion on the text features, and converts the text features into text features with specific dimensions, that is, fourth feature information.
Specifically, the server may first extract the features of each text segment in the video text data using a pre-trained BERT (Bidirectional Encoder Representations from Transformers) network, and then convert the extracted features of each text segment into text features of a specific dimension, that is, the fourth feature information, using a pre-trained TextCNN (text convolutional neural network). The specific dimension can be set as needed; for example, it can be consistent with the feature dimension of a single feature category of the first feature information and the second feature information.
The BERT network is a pre-trained language representation model: it is pre-trained with an MLM (masked language model) objective and builds the whole model from deep bidirectional Transformer components, so it generates deep bidirectional language representations that fuse left and right context information. In this embodiment, when the features of each text segment in the video text data are extracted with the pre-trained BERT network, the features are mainly taken from the penultimate layer of the BERT network. The TextCNN network comprises an embedding layer, a convolutional layer, a max-pooling layer, and a fully connected layer. The embedding layer encodes the input data to obtain its embedded representation; the convolutional layer extracts features of the input data based on the embedded representation; the max-pooling layer takes the maximum value of each convolved feature and splices the maxima together as its output; and the fully connected layer obtains the final output, that is, the fourth feature information, based on the output of the max-pooling layer.
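A hedged PyTorch sketch of a TextCNN-style head that maps a sequence of token embeddings (for example, those produced by a BERT encoder) to a fixed-dimension text feature; the kernel sizes, channel count, and 768-dimensional output are assumptions.

```python
import torch
import torch.nn as nn

class TextCNNHead(nn.Module):
    """Map a sequence of token embeddings to a fixed-dimension text feature:
    parallel 1-D convolutions, max-pooling over time, then a linear projection."""
    def __init__(self, embed_dim=768, out_dim=768, kernel_sizes=(2, 3, 4), channels=256):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(embed_dim, channels, kernel_size=k) for k in kernel_sizes
        ])
        self.fc = nn.Linear(channels * len(kernel_sizes), out_dim)

    def forward(self, token_embeddings):           # (batch, seq_len, embed_dim)
        x = token_embeddings.transpose(1, 2)       # Conv1d expects (batch, dim, seq_len)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))   # (batch, out_dim) text feature
```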
For example, as shown in fig. 5, a schematic diagram of extracting text features according to video text data to obtain fourth feature information is obtained, where a specific dimension is assumed to be 1 × 768 dimensions. After the video frame is obtained through frame extraction, the server further performs text recognition on the video frame to extract video text data corresponding to the video data, extracts the features of each section of text data in the video text data by using a pre-trained BERT network, and converts the extracted features of each section of text data into text features of a specific dimension by using a pre-trained TextCNN network.
In this embodiment, the text features of each segment of text data in the video text data are extracted first, and then feature dimension conversion is performed on the text features, so that fourth feature information corresponding to the video text data can be acquired.
In an embodiment, the video feature extraction method in the present application may be implemented based on a pre-trained video classification model, where the video classification model takes sample video data carrying video category labels as training samples and is pre-trained on those samples. As shown in FIG. 6, the video classification model includes a ViT network, a TSM network, NetVLAD+ modules, a MusiCNN network, a BERT network, a TextCNN network, a SENet network, and a fully connected layer. The ViT network performs frame feature extraction on the video frames in the video frame set to obtain the visual feature information; the TSM network performs segment feature extraction on the video segments in the video segment set to obtain the segment feature information; the MusiCNN network performs feature extraction on the spectrograms obtained from the audio segments in the audio segment set to obtain the audio feature information; and the BERT network and the TextCNN network perform text feature extraction on the video text data to obtain the fourth feature information. A NetVLAD+ module is connected after each of the ViT, TSM, and MusiCNN networks and performs convolution and feature aggregation on the output visual feature information, segment feature information, and audio feature information to obtain the corresponding first feature information, second feature information, and third feature information. The SENet network learns the spliced feature information (obtained by splicing the first, second, third, and fourth feature information), determines the weighting parameter for each channel in the spliced feature information, and updates the spliced feature information with the weighting parameters. The fully connected layer classifies the updated spliced feature information to obtain a video classification result, and the video classification model is trained by comparing the video classification result with the video category labels carried by the sample video data. It should be noted that the features of the penultimate hidden layer of the fully connected layer can be used as the video feature information; therefore, after the video classification model has been pre-trained, the server inputs the video data into the pre-trained video classification model and extracts the features of the penultimate hidden layer of the fully connected layer to obtain the video feature information corresponding to the video data.
Further, the feature dimensions of the first feature information, the second feature information and the third feature information are the same, and the feature dimension of the fourth feature information is consistent with the feature dimension of a single feature in the first feature information, the second feature information and the third feature information. The process of obtaining the corresponding feature information through the ViT network, the TSM network, the MusicNN network, the BERT network, the TextCNN network and the NetVLAD+ modules is shown in fig. 7, where it is assumed that the video data includes M video frames.
For video frame feature extraction, the server first performs frame extraction on the video data to obtain the video frame set (including M video frames), and then performs frame feature extraction on the video frames in the video frame set by using the ViT network to obtain visual feature information of dimension M × D, where D is the feature dimension corresponding to the visual feature information of a single frame. A NetVLAD+ module is connected after the ViT network to convert the visual feature information from the M × D dimension to a fixed K × D dimension.
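As an illustration of the aggregation step just described (a generic NetVLAD-style aggregation rather than the exact NetVLAD+ module of fig. 7), the following sketch converts variable-length M × D frame features into a fixed K × D representation; the values of M, K and D and the soft-assignment layout are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NetVLADAggregation(nn.Module):
    """Aggregates a variable number of D-dim descriptors into a fixed K x D code."""

    def __init__(self, num_clusters=8, dim=768):
        super().__init__()
        self.assign = nn.Linear(dim, num_clusters)          # soft-assignment logits
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x):                                    # x: (batch, M, D)
        a = F.softmax(self.assign(x), dim=-1)                # (batch, M, K)
        residual = x.unsqueeze(2) - self.centroids           # (batch, M, K, D)
        vlad = (a.unsqueeze(-1) * residual).sum(dim=1)       # (batch, K, D)
        return F.normalize(vlad, dim=-1)                     # intra-normalized descriptor


frame_features = torch.randn(2, 120, 768)    # M = 120 frames of 768-dim ViT features (assumed)
vlad = NetVLADAggregation(num_clusters=8, dim=768)
fixed = vlad(frame_features)                  # (2, 8, 768): fixed K x D regardless of M
print(fixed.shape)
```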
Meanwhile, the server divides the video data to obtain the video segment set; if every 8 seconds of video data is taken as one video segment, M/8 video segments can be obtained. The server extracts segment features of the video segments by using the TSM network to obtain segment feature information of dimension (M/8) × D, where D is the feature dimension corresponding to a single video segment, and a NetVLAD+ module is connected after the TSM network to convert the segment feature information into the fixed K × D dimension.
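The TSM network is used here as an off-the-shelf segment encoder; purely to illustrate the idea of temporal modelling inside such an encoder, the following sketch shows the channel-wise temporal shift that characterises TSM-style models. The split ratio, tensor layout and clip count are assumptions, and real TSM implementations apply this shift inside 2D convolutional blocks rather than on pooled clip features.

```python
import torch


def temporal_shift(x, shift_div=8):
    """Shifts a fraction of the channels one step forward/backward along the time axis.

    x: (batch, time, channels) clip features; returns a tensor of the same shape.
    """
    b, t, c = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                    # shift toward the future
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]    # shift toward the past
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]               # remaining channels untouched
    return out


# Assumed: a 64-second video split into 8-second clips gives 8 clips,
# each already encoded into a 768-dim feature before the shift is applied.
clip_features = torch.randn(1, 8, 768)
shifted = temporal_shift(clip_features)
print(shifted.shape)                                         # torch.Size([1, 8, 768])
```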
For audio feature extraction, the server converts the video data to obtain the audio data corresponding to the video data and divides the audio data to obtain the audio segment set; assuming that every 3 seconds of audio data is taken as one audio segment, M/3 audio segments can be obtained. The server converts the audio segments into corresponding spectrograms through Fourier transform, and then performs feature extraction on the spectrograms through the MusicNN network to obtain audio feature information. A NetVLAD+ module is connected after the MusicNN network to convert the audio feature information into the fixed K × D dimension.
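A minimal sketch of the spectrogram step, assuming 3-second mono clips at 16 kHz and using torchaudio as one convenient way to apply the short-time Fourier transform and mel filterbank; the FFT size, hop length and mel-bin count are illustrative choices rather than values disclosed by the embodiment.

```python
import torch
import torchaudio


def clip_to_log_mel(waveform, sample_rate=16000, n_mels=96):
    """Converts a mono audio clip (1, num_samples) to a log-mel spectrogram."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=1024, hop_length=512, n_mels=n_mels
    )(waveform)
    return torch.log(mel + 1e-6)               # log compression, (1, n_mels, frames)


# Assumed: 3-second clips at 16 kHz, obtained by slicing the audio track of the video.
clip = torch.randn(1, 3 * 16000)               # stand-in for one decoded 3-second clip
spec = clip_to_log_mel(clip)
print(spec.shape)                               # roughly torch.Size([1, 96, 94])
```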
For the text in the video data, the server extracts video text data (assumed to include M sections of text) from the video data by using text recognition, extracts the features of each section of text data in the video text data by using the BERT network (dimension M × D, where D is the feature dimension of a single section of text data), and converts the features of each section of text data into the fourth feature information of a specific dimension (D dimensions) by using the TextCNN network.
The video feature information extracted by the video feature extraction method of the present application can be applied to video classification, and can improve the accuracy of video data labeling and classification; labels related to time in particular are obviously improved, so that the indexes of tasks such as video content recommendation can be improved and accurate video content recommendation can be realized. For example, when video data of dance actions is classified, the video feature information extracted by the video feature extraction method of the present application contains time sequence features and can therefore better identify the dance type, whereas features without time sequence information yield a lower recognition rate.
In an embodiment, as shown in fig. 8, a video feature extraction method according to the present application is described by a flowchart, and the video feature extraction method specifically includes the following steps (a code sketch of the multi-kernel temporal convolution used in steps 808 through 814 is given after the list):
step 802, acquiring video data;
step 804, disassembling the video data to obtain disassembled data corresponding to the video data, wherein the disassembled data comprises a video frame set, a video fragment set, an audio fragment set and video text data;
step 806, performing frame feature extraction on the video frames in the video frame set to obtain visual feature information, and performing segment feature extraction on the video segments in the video segment set to obtain segment feature information;
step 808, respectively convolving the visual characteristic information according to a plurality of preset convolution kernels in the dimension of the number of the video frames to obtain first convolution characteristic information corresponding to the preset convolution kernels;
step 810, splicing the first convolution characteristic information, and performing characteristic aggregation on the spliced first convolution characteristic information to obtain first characteristic information;
step 812, respectively convolving the segment characteristic information according to a plurality of preset convolution kernels in the dimension of the number of video segments to obtain second convolution characteristic information corresponding to the preset convolution kernels;
step 814, splicing the second convolution characteristic information, and performing characteristic aggregation on the spliced second convolution characteristic information to obtain second characteristic information;
step 816, extracting audio features of the audio segments in the audio segment set to obtain audio feature information;
step 818, convolving the audio feature information respectively according to a plurality of preset convolution kernels in the dimension of the audio fragment number to obtain third convolution feature information corresponding to the preset convolution kernels;
step 820, splicing the third convolution characteristic information, and performing characteristic aggregation on the spliced third convolution characteristic information to obtain third characteristic information;
step 822, performing text feature extraction on the video text data to obtain fourth feature information;
step 824, splicing the first characteristic information, the second characteristic information, the third characteristic information and the fourth characteristic information to obtain spliced characteristic information, wherein the spliced characteristic information comprises multi-channel characteristic information;
step 826, determining weighting parameters corresponding to each channel in the splicing characteristic information;
step 828, updating the multi-channel characteristic information according to the weighting parameters to obtain updated splicing characteristic information;
and step 830, obtaining video characteristic information according to the updated splicing characteristic information.
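The multi-kernel temporal convolution of steps 808 through 814 (and of step 818 for the audio branch) can be illustrated with the following sketch, which convolves an M × D feature map with several preset kernel sizes along the frame-number dimension, splices the results, and aggregates them into a fixed-length vector; the kernel sizes, channel counts and the simple mean pooling used as the aggregation step are assumptions for illustration.

```python
import torch
import torch.nn as nn


class MultiKernelTemporalConv(nn.Module):
    """Convolves (batch, M, D) features along the M dimension with several
    preset kernel sizes, splices the results, and aggregates them."""

    def __init__(self, dim=768, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):                         # x: (batch, M, D)
        x = x.transpose(1, 2)                     # -> (batch, D, M), convolve over M
        branches = [torch.relu(conv(x)) for conv in self.convs]
        spliced = torch.cat(branches, dim=1)      # spliced convolution characteristic information
        return spliced.mean(dim=-1)               # aggregation to a fixed-length vector


visual = torch.randn(2, 120, 768)                 # M = 120 frames, D = 768 (assumed)
first_feature = MultiKernelTemporalConv()(visual)
print(first_feature.shape)                        # torch.Size([2, 2304]) = 3 kernels x 768
```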
It should be understood that, although the steps in the flowcharts related to the above embodiments are shown in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated otherwise, the execution of these steps is not strictly limited to the order shown, and they may be executed in other orders. Moreover, at least a part of the steps in the flowcharts related to the above embodiments may include multiple sub-steps or stages, which are not necessarily executed at the same time but may be executed at different times; the order of executing these sub-steps or stages is also not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least a part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 9, there is provided a video feature extraction apparatus, which may be implemented as part of a computer device in the form of a software module, a hardware module, or a combination of the two, and which specifically includes: an obtaining module 902, a disassembling module 904, a feature extraction module 906, a first convolution module 908, a second convolution module 910 and a processing module 912, wherein:
an obtaining module 902, configured to obtain video data;
a disassembling module 904, configured to disassemble the video data to obtain disassembled data corresponding to the video data, where the disassembled data includes a video frame set and a video clip set;
the feature extraction module 906 is configured to perform frame feature extraction on video frames in the video frame set to obtain visual feature information, and perform segment feature extraction on video segments in the video segment set to obtain segment feature information;
a first convolution module 908, configured to convolve the visual feature information in the number dimension of the video frame to obtain first feature information;
a second convolution module 910, configured to convolve the segment feature information on the video segment number dimension to obtain second feature information;
the processing module 912 is configured to obtain video feature information according to the first feature information and the second feature information.
The video feature extraction apparatus obtains the video data and disassembles the video data to obtain disassembled data including a video frame set and a video segment set. It performs frame feature extraction on the video frames in the video frame set to obtain visual feature information, and performs segment feature extraction on the video segments in the video segment set to obtain segment feature information. By convolving the visual feature information on the video frame number dimension, modeling between continuous video frames is further performed to obtain first feature information with a time sequence; by convolving the segment feature information on the video segment number dimension, modeling between continuous video segments is performed to obtain second feature information with a time sequence. Video feature information that fully describes the features of the video data is thus obtained according to the first feature information and the second feature information carrying the time sequence, so that the video data recognition rate can be improved.
In an embodiment, the first convolution module is further configured to, in a video frame number dimension, convolve the visual feature information according to a plurality of preset convolution kernels to obtain first convolution feature information corresponding to the preset convolution kernels, splice the first convolution feature information, and perform feature aggregation on the spliced first convolution feature information to obtain the first feature information.
In one embodiment, the second convolution module is further configured to, in the video segment number dimension, convolve the segment feature information according to a plurality of preset convolution kernels respectively to obtain second convolution feature information corresponding to the preset convolution kernels, splice the second convolution feature information, and perform feature aggregation on the spliced second convolution feature information to obtain second feature information.
In an embodiment, the processing module is further configured to splice the first feature information and the second feature information to obtain spliced feature information, where the spliced feature information includes multi-channel feature information, determine a weighting parameter corresponding to each channel in the spliced feature information, update the multi-channel feature information according to the weighting parameter to obtain updated spliced feature information, and obtain video feature information according to the updated spliced feature information.
In one embodiment, the split data further comprises a set of audio segments; the feature extraction module is further used for performing audio feature extraction on the audio clips in the audio clip set to obtain audio feature information, and performing convolution on the audio feature information on the audio clip number dimension to obtain third feature information; the processing module is further used for splicing the first characteristic information, the second characteristic information and the third characteristic information to obtain video characteristic information.
In one embodiment, the split data further comprises video text data; the characteristic extraction module is also used for extracting the text characteristic of the video text data to obtain fourth characteristic information; the processing module is further used for splicing the first characteristic information, the second characteristic information and the fourth characteristic information to obtain video characteristic information.
In one embodiment, the split data further comprises a set of audio segments and video text data; the feature extraction module is further used for performing audio feature extraction on the audio segments in the audio segment set to obtain audio feature information, performing convolution on the audio feature information on the audio segment number dimension to obtain third feature information, and performing text feature extraction on video text data to obtain fourth feature information; the processing module is further used for splicing the first characteristic information, the second characteristic information, the third characteristic information and the fourth characteristic information to obtain video characteristic information.
In one embodiment, the feature extraction module is further configured to, in the dimension of the number of audio segments, perform convolution on the audio feature information according to the plurality of preset convolution kernels respectively to obtain third convolution feature information corresponding to the preset convolution kernels, splice the third convolution feature information, and perform feature aggregation on the spliced third convolution feature information to obtain third feature information.
In an embodiment, the feature extraction module is further configured to extract a text feature of each text data segment in the video text data, and perform feature dimension conversion on the text feature to obtain fourth feature information.
For specific limitations of the video feature extraction apparatus, reference may be made to the above limitations of the video feature extraction method, which are not repeated here. The modules in the video feature extraction apparatus can be wholly or partially implemented by software, hardware, or a combination thereof. The modules can be embedded in hardware form in, or independent of, a processor in the computer device, or can be stored in software form in a memory of the computer device, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data such as video characteristic information. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a video feature extraction method.
Those skilled in the art will appreciate that the structure shown in fig. 10 is merely a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied; a particular computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps in the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction between the combinations of these technical features, they should be considered to be within the scope of the present specification.
The above-mentioned embodiments only express several implementations of the present application, and the description thereof is specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that, for a person of ordinary skill in the art, several variations and improvements can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method for video feature extraction, the method comprising:
acquiring video data;
disassembling the video data to obtain disassembled data corresponding to the video data, wherein the disassembled data comprises a video frame set and a video clip set;
performing frame feature extraction on video frames in the video frame set to obtain visual feature information, and performing segment feature extraction on video segments in the video segment set to obtain segment feature information;
convolving the visual characteristic information on the video frame number dimension to obtain first characteristic information;
convolving the fragment feature information on the video fragment number dimension to obtain second feature information;
and obtaining video characteristic information according to the first characteristic information and the second characteristic information.
2. The method of claim 1, wherein convolving the visual characteristic information over a video frame number dimension to obtain first characteristic information comprises:
on the dimension of the number of video frames, according to a plurality of preset convolution kernels, respectively convolving the visual characteristic information to obtain first convolution characteristic information corresponding to the preset convolution kernels;
and splicing the first convolution characteristic information, and performing characteristic aggregation on the spliced first convolution characteristic information to obtain first characteristic information.
3. The method of claim 1, wherein convolving the segment feature information over a video segment number dimension to obtain second feature information comprises:
on the video segment number dimension, respectively convolving the segment feature information according to a plurality of preset convolution kernels to obtain second convolution feature information corresponding to the preset convolution kernels;
and splicing the second convolution characteristic information, and performing characteristic aggregation on the spliced second convolution characteristic information to obtain second characteristic information.
4. The method of claim 1, wherein obtaining video feature information according to the first feature information and the second feature information comprises:
splicing the first characteristic information and the second characteristic information to obtain spliced characteristic information, wherein the spliced characteristic information comprises multi-channel characteristic information;
determining a weighting parameter corresponding to each channel in the splicing characteristic information;
updating the multi-channel characteristic information according to the weighting parameters to obtain updated splicing characteristic information;
and obtaining video characteristic information according to the updated splicing characteristic information.
5. The method of claim 1, wherein the disassembled data further comprises a set of audio segments and video text data;
the video feature extraction method further comprises:
performing audio feature extraction on the audio clips in the audio clip set to obtain audio feature information;
convolving the audio feature information on the audio segment number dimension to obtain third feature information;
performing text feature extraction on the video text data to obtain fourth feature information;
the obtaining of the video feature information according to the first feature information and the second feature information includes:
and splicing the first characteristic information, the second characteristic information, the third characteristic information and the fourth characteristic information to obtain video characteristic information.
6. The method of claim 5, wherein convolving the audio feature information in an audio segment number dimension to obtain third feature information comprises:
on the dimensionality of the audio fragment number, respectively convolving the audio characteristic information according to a plurality of preset convolution kernels to obtain third convolution characteristic information corresponding to the preset convolution kernels;
and splicing the third convolution characteristic information, and performing characteristic aggregation on the spliced third convolution characteristic information to obtain third characteristic information.
7. The method of claim 5, wherein the performing text feature extraction on the video text data to obtain fourth feature information comprises:
extracting text characteristics of each section of text data in the video text data;
and performing feature dimension conversion on the text features to obtain fourth feature information.
8. A video feature extraction apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring video data;
the disassembling module is used for disassembling the video data to obtain disassembling data corresponding to the video data, and the disassembling data comprises a video frame set and a video clip set;
the characteristic extraction module is used for extracting the frame characteristics of the video frames in the video frame set to obtain visual characteristic information and extracting the segment characteristics of the video segments in the video segment set to obtain segment characteristic information;
the first convolution module is used for performing convolution on the visual characteristic information in the video frame number dimension to obtain first characteristic information;
the second convolution module is used for performing convolution on the fragment feature information on the video fragment number dimension to obtain second feature information;
and the processing module is used for obtaining video characteristic information according to the first characteristic information and the second characteristic information.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202111408061.8A 2021-11-24 2021-11-24 Video feature extraction method and device, computer equipment and storage medium Pending CN114329070A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111408061.8A CN114329070A (en) 2021-11-24 2021-11-24 Video feature extraction method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111408061.8A CN114329070A (en) 2021-11-24 2021-11-24 Video feature extraction method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114329070A true CN114329070A (en) 2022-04-12

Family

ID=81047108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111408061.8A Pending CN114329070A (en) 2021-11-24 2021-11-24 Video feature extraction method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114329070A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115115967A (en) * 2022-05-13 2022-09-27 清华大学 Video motion analysis method, device, equipment and medium for model creature

Similar Documents

Publication Publication Date Title
CN110532996B (en) Video classification method, information processing method and server
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
CN110472531B (en) Video processing method, device, electronic equipment and storage medium
CN111062871B (en) Image processing method and device, computer equipment and readable storage medium
KR102591961B1 (en) Model training method and device, and terminal and storage medium for the same
Ge et al. An attention mechanism based convolutional LSTM network for video action recognition
JP7286013B2 (en) Video content recognition method, apparatus, program and computer device
US20230077849A1 (en) Content recognition method and apparatus, computer device, and storage medium
CN110503076B (en) Video classification method, device, equipment and medium based on artificial intelligence
CN111553267B (en) Image processing method, image processing model training method and device
CN114663670A (en) Image detection method and device, electronic equipment and storage medium
CN113836992B (en) Label identification method, label identification model training method, device and equipment
CN113095346A (en) Data labeling method and data labeling device
CN111476806B (en) Image processing method, image processing device, computer equipment and storage medium
JP7150840B2 (en) Video summary generation method and apparatus, electronic equipment and computer storage medium
CN112804558B (en) Video splitting method, device and equipment
CN116580257A (en) Feature fusion model training and sample retrieval method and device and computer equipment
CN114283351A (en) Video scene segmentation method, device, equipment and computer readable storage medium
CN110765882A (en) Video tag determination method, device, server and storage medium
CN114332670A (en) Video behavior recognition method and device, computer equipment and storage medium
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN115083435A (en) Audio data processing method and device, computer equipment and storage medium
CN111444957B (en) Image data processing method, device, computer equipment and storage medium
CN115578770A (en) Small sample facial expression recognition method and system based on self-supervision
CN114363695B (en) Video processing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination