CN111008280B - Video classification method, device, equipment and storage medium - Google Patents

Video classification method, device, equipment and storage medium

Info

Publication number
CN111008280B
Authority
CN
China
Prior art keywords
model
video
data
tsm
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911228426.1A
Other languages
Chinese (zh)
Other versions
CN111008280A (en)
Inventor
迟至真
李甫
孙昊
何栋梁
龙翔
周志超
王平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201911228426.1A priority Critical patent/CN111008280B/en
Publication of CN111008280A publication Critical patent/CN111008280A/en
Application granted granted Critical
Publication of CN111008280B publication Critical patent/CN111008280B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a video classification method, a device, equipment and a storage medium, and relates to the technical field of video classification. The specific implementation scheme is as follows: performing frame extraction processing on the video to be classified to obtain frame data of a plurality of video frames of the video to be classified; inputting frame data of a plurality of video frames of the video to be classified into a feature extraction model trained in advance to obtain feature data of the video to be classified; the feature extraction model comprises a TSM model; each time sequence convolution layer of the TSM model carries out time sequence offset operation with random direction on input data of the layer; and inputting the characteristic data into a sequence model trained in advance to obtain a classification result of the video to be classified. According to the embodiment of the application, the TSM model is introduced during feature extraction, and the time sequence convolution layers of the TSM model perform time sequence offset operation with random directions on input data of each layer, so that the data expansion is facilitated, the extracted feature data is richer and more comprehensive, and the accuracy of video classification results is further improved.

Description

Video classification method, device, equipment and storage medium
Technical Field
The application relates to a data processing technology, in particular to the technical field of video classification.
Background
The video classification technology outputs specific category information for the video by analyzing and understanding image characteristics, audio characteristics or user barrage information and the like of the video.
In the prior art, the implementation scheme of video classification mainly comprises the following three types: firstly, inputting feature data of a key frame of a video into a classification model to obtain an output classification result; secondly, inputting the title/attribute of the video into a classification model to obtain an output classification result; third, the classification information of the video is identified from the tags uploaded by the user.
However, the above three schemes have a problem of low accuracy of video classification.
Disclosure of Invention
The embodiment of the application provides a video classification method, a device, equipment and a storage medium, which are used for improving the accuracy of video classification results.
In a first aspect, the present application provides a video classification method, including:
performing frame extraction processing on the video to be classified to obtain frame data of a plurality of video frames of the video to be classified;
inputting frame data of a plurality of video frames of a video to be classified into a feature extraction model trained in advance, and obtaining feature data of the video to be classified output by the feature extraction model; the feature extraction model comprises a time sequence conversion model TSM model; each time sequence convolution layer of the TSM model carries out time sequence offset operation with random direction on input data of the layer;
And inputting the characteristic data into a pre-trained sequence model to obtain a classification result of the video to be classified, which is output by the sequence model.
According to the embodiment of the application, frame extraction processing is carried out on the video to be classified to obtain frame data of a plurality of video frames of the video to be classified; inputting frame data of a plurality of video frames of the video to be classified into a feature extraction model trained in advance to obtain feature data of the video to be classified output by the feature extraction model; the feature extraction model comprises a TSM model; each time sequence convolution layer of the TSM model carries out time sequence offset operation with random direction on input data of the layer; and inputting the characteristic data into a pre-trained sequence model to obtain a classification result of the video to be classified, which is output by the sequence model. According to the technical scheme, the feature extraction model comprising the TSM model is introduced, in the feature extraction process, the time sequence convolution layers of the TSM model carry out time sequence offset operation with random directions on input data of each layer, so that data augmentation is facilitated, hidden information in the input data of each layer is conveniently mined, extracted feature data is richer and more comprehensive, and accuracy of video classification results is further improved.
Optionally, each time sequence convolution layer in the TSM model performs a time sequence offset operation with random direction on the first 1/N of the elements of the input data of the layer, wherein the value of N is 2 or 3.
In an optional implementation manner in the above application, in each time sequence convolution layer in the TSM model, for the first 1/2 or 1/3 element of the input data of the layer, a time sequence offset operation with random direction is performed, so that the use mechanism of the TSM model is perfected, and the input data of each time sequence convolution layer has a larger receptive field in the time dimension.
Optionally, before performing frame extraction processing on the video to be classified, the method further includes:
performing frame extraction processing on the first sample video to obtain frame data of a plurality of video frames of the first sample video;
and training the initially established feature extraction model by taking frame data of a plurality of video frames of the first sample video and feature data of the first sample video as sample data.
In an optional embodiment of the above application, before performing frame extraction processing on the video to be classified, frame extraction processing is performed on the first sample video, and the obtained frame data and feature data of the first sample video are used as sample data to train the feature extraction model which is initially built, so that a training mechanism of the feature extraction model is perfected, and a guarantee is provided for normal use of the feature extraction model.
Optionally, the feature extraction model includes a first TSM model; training the feature extraction model includes training a first TSM model;
the input data of the first TSM model is RGB images of each frame in a plurality of video frames of a first sample video;
and the feature data of the video to be classified output by the feature extraction model is the feature data output by the first TSM model.
In an alternative embodiment of the above application, the feature extraction model is refined to include the first TSM model, so as to perfect the construction details and training content of the feature extraction model. Training a first TSM model when the feature extraction model is carried out, and taking RGB images of each frame in a plurality of video frames of a first sample video as input data when the first TSM model is trained, so that a training mechanism of the first TSM model is perfected; in the use process of the feature extraction model, the feature data output by the first TSM model is directly used as the feature data of the video to be classified output by the feature extraction model, so that the possibility is provided for classifying the video based on RGB images in the video frame.
Optionally, the feature extraction model includes a first TSM model and a second TSM model; training the feature extraction model, including training the first TSM model and the second TSM model;
The input data of the first TSM model is RGB images of each frame in a plurality of video frames of a first sample video;
the input data of the second TSM model is an optical flow field image of each frame in a plurality of video frames of the first sample video;
the feature data of the video to be classified, which is output by the feature extraction model, is feature data obtained by superposing feature data output by the first TSM model and feature data output by the second TSM model.
In an optional implementation manner in the above application, the feature extraction model is refined to include the first TSM model and the second TSM model, so that the construction details of the feature extraction model are perfected; training the first TSM model and the second TSM model respectively, wherein in the training process of the feature extraction model, an RGB image is used as input data of the first TSM model, and an optical flow field image is used as input data of the second TSM model, so that a model training mechanism of the feature extraction model is further improved; in the use process of the feature extraction model, the feature data of the first TSM model and the feature data of the second TSM model are overlapped to obtain the final feature data of the video to be classified, so that the spatial information and the time sequence information are combined in the feature extraction process, the robustness of the classification model is obviously improved, and the accuracy and the recall rate of the classification model are guaranteed.
Optionally, before inputting the feature data into the pre-trained sequence model, the method further comprises:
performing frame extraction processing on the second sample video to obtain frame data of a plurality of video frames of the second sample video;
inputting frame data of a plurality of video frames of the second sample video into a trained feature extraction model to obtain feature data of the second sample video;
and training the initially established sequence model by taking the characteristic data of the second sample video and the classification label information of the second sample video as sample data.
In an optional implementation manner in the above application, before the feature data is input into the pre-trained sequence model, frame extraction processing is additionally performed on the second sample video, and feature extraction operation is performed on the obtained frame data of a plurality of video frames, so that the training operation of the sequence model is performed by adopting the feature data and the classification label information of the second sample video, the training mechanism of the sequence model is perfected, and the guarantee is provided for normal use of the sequence model.
Optionally, the sequence model includes a NeXtVLAD layer that decomposes the input feature vectors into a set number of low-dimensional feature vectors.
In an optional embodiment of the above application, the sequence model is refined to include a NeXtVLAD layer that decomposes the input feature vector; the low-dimensional decomposition of the high-dimensional feature vector reduces the time complexity and space complexity of the video classification process and improves classification efficiency.
Optionally, the sequence model includes: the system comprises a whitening layer, a NeXtVLAD layer, a filtering layer, a full connection layer, a context gating layer and a classifier which are connected in sequence.
In an optional embodiment of the above application, a focus mechanism is introduced through a context gating layer to ensure that different features have better differentiation on different categories, thereby improving accuracy of video classification results.
Optionally, the set number is an integer value greater than 8.
In an optional implementation manner in the above application, the time complexity and the space complexity of the sequence model can be remarkably reduced by specifically decomposing the feature vector input by the NextVLAD layer into more than 8 low-dimensional feature vectors, and the video classification efficiency is improved.
In a second aspect, an embodiment of the present application further provides a video classification apparatus, including:
the video frame extracting module is used for carrying out frame extracting processing on the video to be classified to obtain frame data of a plurality of video frames of the video to be classified;
The feature extraction module is used for inputting frame data of a plurality of video frames of the video to be classified into a feature extraction model trained in advance to obtain feature data of the video to be classified output by the feature extraction model; the feature extraction model comprises a time sequence conversion model TSM model; each time sequence convolution layer of the TSM model carries out time sequence offset operation with random direction on input data of the layer;
and the classification module is used for inputting the characteristic data into a pre-trained sequence model to obtain a classification result of the video to be classified, which is output by the sequence model.
In a third aspect, an embodiment of the present application further provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a video classification method as provided by embodiments of the first aspect.
In a fourth aspect, embodiments of the present application also provide a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform a video classification method as provided by the embodiments of the first aspect.
Other effects of the above alternative will be described below in connection with specific embodiments.
Drawings
The drawings are included to provide a better understanding of the present application and are not to be construed as limiting the application. Wherein:
FIG. 1 is a flow chart of a video classification method according to a first embodiment of the application;
FIG. 2A is a flow chart of a video classification method according to a second embodiment of the application;
FIG. 2B is a flowchart illustrating a timing offset operation according to a second embodiment of the present application;
FIG. 3 is a flow chart of a video classification method according to a third embodiment of the application;
FIG. 4A is a flow chart of a video classification method according to a fourth embodiment of the application;
FIG. 4B is a block diagram of a sequence model in a fourth embodiment of the application;
FIG. 5A is a flow chart of a video classification method according to a fifth embodiment of the application;
FIG. 5B is a schematic diagram of a video classification model framework in accordance with a fifth embodiment of the application;
fig. 5C is a schematic structural diagram of a TSN model in a fifth embodiment of the present application;
FIG. 5D is a diagram illustrating a structure of a NetVLAD layer according to a fifth embodiment of the application;
FIG. 5E is a schematic diagram of a NextVLAD layer according to a fifth embodiment of the application;
fig. 6 is a block diagram of a video classification apparatus according to a sixth embodiment of the present application;
Fig. 7 is a block diagram of an electronic device for implementing a video classification method according to an embodiment of the application.
Detailed Description
Exemplary embodiments of the present application will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present application are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Example 1
Fig. 1 is a flowchart of a video classification method according to a first embodiment of the present application, where the method is applicable to classifying videos (such as cartoons) in combination with a TSM model, and the method is implemented by using a video classification device, where the device is implemented by software and/or hardware, and is specifically configured in an electronic device with a certain data computing capability.
A video classification method as shown in fig. 1, comprising:
s101, performing frame extraction processing on the video to be classified to obtain frame data of a plurality of video frames of the video to be classified.
The frame extraction processing is performed on the video to be classified, and the frame extraction processing can be performed on the whole video to be classified in the time dimension, so that the whole video can be covered in the time dimension. Of course, in order to ensure that frame data obtained by frame extraction can be uniformly spread over the time dimension of the video to be classified, typically, the video may be equally divided into multiple segments, and one video frame is randomly extracted from each segment, so as to realize equally-spaced frame extraction sampling.
It can be understood that in order to capture global information of the video as much as possible, and avoid a great amount of redundancy in the video classification process, the data operand is reduced, and typically, a sparse sampling video frame mode is adopted to replace a dense sampling mode, so that the video frames of the video to be classified are subjected to frame extraction processing.
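As an illustration of the equally spaced, sparse sampling described above, the following Python sketch divides a video into segments and randomly picks one frame index per segment (function and parameter names are hypothetical and not part of the patent):

```python
import random

def sparse_sample_indices(num_frames: int, num_segments: int = 8) -> list[int]:
    """Split [0, num_frames) into equal segments and draw one random index per segment."""
    seg_len = num_frames / num_segments
    indices = []
    for s in range(num_segments):
        start = int(s * seg_len)
        end = max(start + 1, int((s + 1) * seg_len))  # guard against empty segments
        indices.append(random.randrange(start, end))
    return indices

# Example: sample 8 frame indices from a 240-frame video
print(sparse_sample_indices(240, 8))
```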
S102, inputting frame data of a plurality of video frames of a video to be classified into a feature extraction model trained in advance, and obtaining feature data of the video to be classified output by the feature extraction model; the feature extraction model comprises a time sequence conversion model TSM model; and each time sequence convolution layer of the TSM model carries out time sequence offset operation with random direction on the input data of the layer.
The feature extraction model is used for extracting feature data in frame data of each video frame as feature data of the video to be classified. The input data of the feature extraction model is the frame data of each video frame; the output result is characteristic data of each frame of data.
It will be appreciated that when using the feature extraction model to perform feature extraction, a large amount of data is required to perform model training, so that the trained model meets the accuracy requirement.
In the training phase of the feature extraction model, the following manner can be adopted to realize: performing frame extraction processing on the first sample video to obtain frame data of a plurality of video frames of the first sample video; and training the initially established feature extraction model by taking frame data of a plurality of video frames of the first sample video and feature data of the first sample video as sample data.
And performing frame extraction processing on the first sample video, and performing frame extraction processing on the whole first sample video in the time dimension so as to ensure that the whole video can be covered in the time dimension. Of course, in order to ensure that the frame data obtained by frame extraction can be uniformly spread over the time dimension of the first sample video, typically, the video may be equally spaced into multiple segments, and one video frame is randomly extracted from each segment, so as to implement equally spaced frame extraction sampling.
When the feature extraction model is trained, frame data of a plurality of video frames of a first sample video and feature data of the first sample video are adopted as sample data, model parameters of the feature extraction model which are initially established are trained, and the model parameters of the feature extraction model are continuously adjusted according to training results, so that the distance between the feature data output by the model and the feature data in the sample data is gradually approximated and tends to be stable, and a final feature extraction model is obtained.
It can be understood that, in order to ensure the feature extraction effect of the feature extraction model, when the feature extraction model is used to extract features of the video to be classified, the generation mode of the input data of the feature extraction model needs to be consistent with the model training process, that is, the mode of extracting frames of the video to be classified in the model use stage is the same as the mode of extracting frames of the first sample video in the model training stage.
It should be noted that, since the feature extraction model includes a time sequence conversion model (Temporal Shift Module, TSM), and each time sequence convolution layer of the TSM model performs a time sequence offset operation with random direction on the input data of the layer, when the feature extraction model including the TSM model is adopted to perform feature extraction, more abundant and comprehensive feature data can be obtained, so as to provide a guarantee for further improving the model precision of the feature extraction model.
Because the TSM model can carry out time sequence offset operation on the input data of the layer in the using process of each time sequence convolution layer, the input data of each layer can have larger receptive field in the time dimension, hidden information in the input data of the layer is convenient to mine, and therefore the finally extracted characteristic data is richer and more comprehensive.
In the time sequence offset process, the embodiment of the application is realized in a random offset mode, does not limit the time sequence offset direction, and is beneficial to data augmentation, thereby realizing further excavation of hidden information of input data of each layer.
It should be noted that, the TSM model used in the present application can be obtained by performing model improvement on the basis of the existing time sequence segmented network (Temporal Segment Network, TSN) model, that is, embedding a time sequence offset layer in a convolution layer, and does not introduce new network parameters and data calculation amount, thereby realizing the upgrade of the network structure more flexibly.
When each time sequence convolution layer of the TSM model performs a time sequence offset operation with random direction on the input data (i.e. the input tensor) of the layer, the position of the time sequence offset is usually limited to a certain extent, so as to avoid losing time sequence information through excessive random operation and thereby degrading the video classification result. In an optional implementation manner of the embodiment of the present application, each time sequence convolution layer in the TSM model may perform a time sequence offset operation with random direction on the first 1/N of the elements of the input data (i.e. the input tensor) of the layer, where the value of N is 2 or 3. That is, the time sequence convolution layer performs a random-direction time sequence offset operation on only the first 1/2 or 1/3 of the elements of the input data of the layer.
It can be understood that by setting the position of the timing offset operation in the first 1/2 or the first 1/3 of the input data of the present layer, the influence of the confusion of the timing information can be avoided while the receptive field of the time dimension is enlarged, so that the receptive field of the time dimension and the richness of the timing information are balanced, and meanwhile, the data augmentation is facilitated, so that the overall recognition capability of the video classification model is remarkably improved.
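A minimal NumPy sketch of the random-direction temporal shift described above, assuming zero padding and a shift applied only to the first 1/N of the channels (names and layout are illustrative, not the patent's implementation):

```python
import numpy as np

def random_direction_shift(x: np.ndarray, n: int = 2) -> np.ndarray:
    """x has shape (T, C): T time steps, C channels.
    Shift the first C//n channels by one step forward or backward in time,
    with the direction chosen at random, and zero-pad the vacated position."""
    t, c = x.shape
    shifted = x.copy()
    num_shift = c // n                      # only the first 1/n of the elements
    direction = np.random.choice([-1, 1])   # random temporal direction
    if direction == 1:                      # shift toward later time steps
        shifted[1:, :num_shift] = x[:-1, :num_shift]
        shifted[0, :num_shift] = 0.0
    else:                                   # shift toward earlier time steps
        shifted[:-1, :num_shift] = x[1:, :num_shift]
        shifted[-1, :num_shift] = 0.0
    return shifted

# Example: 8 frames, 16 channels
out = random_direction_shift(np.random.rand(8, 16), n=2)
```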
S103, inputting the characteristic data into a pre-trained sequence model to obtain a classification result of the video to be classified, which is output by the sequence model.
The sequence model is used for classifying the feature data of each video to be classified, so as to obtain a video classification result of each video to be classified. The sequence model may be a single classification model, or a multi-class model obtained by combining a plurality of classification models, so as to realize multi-category recognition of the video to be classified.
The classification result of the video may be coarse-grained classification, such as category labels of "education category", "life category", "game category", and "cartoon category". Of course, the classification result of the video may be fine-grained classification, for example, directly outputting the video name as the classification result. For example, when a cartoon video is to be classified, the video classification result may be the cartoon name of the cartoon video.
According to the embodiment of the application, frame extraction processing is carried out on the video to be classified to obtain frame data of a plurality of video frames of the video to be classified; inputting frame data of a plurality of video frames of the video to be classified into a feature extraction model trained in advance to obtain feature data of the video to be classified output by the feature extraction model; the feature extraction model comprises a TSM model; each time sequence convolution layer of the TSM model carries out time sequence offset operation with random direction on input data of the layer; and inputting the characteristic data into a pre-trained sequence model to obtain a classification result of the video to be classified, which is output by the sequence model. According to the technical scheme, the feature extraction model comprising the TSM model is introduced, in the feature extraction process, the time sequence convolution layers of the TSM model carry out time sequence offset operation with random directions on input data of each layer, data augmentation is facilitated, hidden information in the input data of each layer is conveniently mined, extracted feature data is richer and more comprehensive, and accuracy of video classification results is further improved.
Example two
Fig. 2A is a flowchart of a video classification method in a second embodiment of the present application, where the embodiment of the present application is optimized and improved based on the technical solutions of the foregoing embodiments.
Further, before the operation of performing frame extraction processing on the video to be classified, adding the frame extraction processing on the first sample video to obtain frame data of a plurality of video frames of the first sample video; and training the initially established feature extraction model by taking frame data of a plurality of video frames of the first sample video and feature data of the first sample video as sample data so as to perfect a training mechanism of the feature extraction model.
Further, the feature extraction model is refined to be ' the feature extraction model comprises a first TSM model ', and the training of the feature extraction model is refined to be ' the training of the feature extraction model comprises the training of the first TSM model; the input data of the first TSM model is RGB image of each frame in a plurality of video frames of a first sample video; correspondingly, in the use process of the feature extraction model, the output of the feature extraction model is further refined into 'feature data of the video to be classified output by the feature extraction model is feature data output by the first TSM model', so that the construction details and training content of the feature extraction model are improved.
A video classification method as shown in fig. 2A, comprising:
S201, performing frame extraction processing on the first sample video to obtain frame data of a plurality of video frames of the first sample video.
The frame data includes frame data corresponding to the RGB image.
S202, training a first TSM model which is initially built by taking RGB images of each frame in a plurality of video frames of a first sample video and characteristic data of the first sample video as sample data. And each time sequence convolution layer of the first TSM model performs time sequence offset operation with random direction on the input data of the layer.
Wherein the feature extraction model comprises a first TSM model. When the first TSM model is trained, RGB images of frames in video frames of a first sample video and characteristic data of the first sample video are used as sample data, the first TSM model which is built initially is trained, and the distance between the characteristic data output by the model and the characteristic data of the first sample video is gradually approximated and a numerical result tends to be stable by continuously adjusting model parameters of the first TSM model, so that a final first TSM model is obtained.
In the process of training the first TSM model, a random time sequence offset operation is carried out on the input data of each time sequence convolution layer of the first TSM model. Referring to the schematic diagram of the timing shift operation shown in fig. 2B, C represents the channel dimension and T represents the timing dimension. Squares of the same color in a row belong to the same video frame, and each square represents a different channel.
Fig. 2B (1) shows the normal two-dimensional convolution of the input data, where each convolution operation processes the features on frame T_i (i.e., the features of each row); fig. 2B (2) shows the time sequence shift with zero padding, in which the vacated foremost and last positions are filled with zeros; fig. 2B (3) shows the cyclic shift, in which, compared with zero padding, the elements pushed out of one end are appended to the other end, so that the size of the input data remains unchanged.
In order to avoid the loss of timing information caused by excessive timing offset, the position of the timing offset is generally limited. The embodiment of the application only carries out the time sequence offset operation on the first 1/2 or the first 1/3 of the elements of each time sequence convolution layer. Fig. 2B is illustrated with the first 1/2 as an example.
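The difference between the two padding strategies in fig. 2B can be illustrated with a short NumPy sketch (illustrative only):

```python
import numpy as np

x = np.arange(5)                         # one channel over 5 time steps: [0 1 2 3 4]

# (2) temporal shift with zero padding: the vacated slot is filled with zero
zero_pad = np.concatenate(([0], x[:-1]))  # [0 0 1 2 3]

# (3) cyclic shift: the element pushed out re-enters at the other end,
#     so the size of the input data stays unchanged
cyclic = np.roll(x, 1)                    # [4 0 1 2 3]
```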
S203, performing frame extraction processing on the video to be classified to obtain frame data of a plurality of video frames of the video to be classified.
S204, inputting RGB images of a plurality of video frames of the video to be classified into a first TSM model trained in advance, and obtaining feature data of the video to be classified, which is output by the first TSM model.
S205, inputting the characteristic data into a pre-trained sequence model to obtain a classification result of the video to be classified, which is output by the sequence model.
According to the embodiment of the application, the feature extraction model is refined to comprise the first TSM model, and the first TSM model is subjected to model training through the RGB images of each frame in a plurality of video frames of the first sample video and the feature data of the first sample video, so that the trained first TSM model is used for carrying out feature extraction on the RGB images of each video frame of the video to be classified, and the feature data of the video to be classified is obtained. The technical scheme improves the construction details and training content of the feature extraction model, and simultaneously provides possibility for video classification based on RGB images in video frames by improving the model training of the first TSM model.
Example III
Fig. 3 is a flowchart of a video classification method in a third embodiment of the present application, where the embodiment of the present application is optimized and improved based on the technical solutions of the foregoing embodiments.
Further, the feature extraction model is refined to "the feature extraction model includes a first TSM model and a second TSM model"; the training of the feature extraction model is thinned to 'training of the feature extraction model', including training of a first TSM model and a second TSM model; the input data of the first TSM model is RGB images of each frame in a plurality of video frames of a first sample video; the input data of the second TSM model is an optical flow field image of each frame in a plurality of video frames of the first sample video; in the use process of the feature extraction model, the output of the feature extraction model is further refined into feature data of the video to be classified, which is output by the feature extraction model, wherein the feature data is obtained by superposing the feature data output by the first TSM model and the feature data output by the second TSM model, so that the comprehensiveness of the extracted features is improved in a multi-mode, and the classification capability of the model is improved.
A video classification method as shown in fig. 3, comprising:
s301, performing frame extraction processing on the first sample video to obtain frame data of a plurality of video frames of the first sample video.
The frame data comprises frame data corresponding to RGB images and frame data corresponding to optical flow field images.
S302, training a first TSM model which is initially built by taking RGB images of each frame in a plurality of video frames of a first sample video and characteristic data of the first sample video as sample data. And each time sequence convolution layer of the first TSM model performs time sequence offset operation with random direction on the input data of the layer.
The training process of the first TSM model is referred to the foregoing embodiments, and will not be described herein.
S303, training the initially established second TSM model by taking the optical flow field image of each frame in a plurality of video frames of the first sample video and the characteristic data of the first sample video as sample data. And each time sequence convolution layer of the second TSM model performs time sequence offset operation with random direction on the input data of the layer.
The optical flow field image can find the corresponding relation between the previous frame and the current frame by utilizing the change of the pixels of the image sequence in the video frame in the time domain and the correlation between the adjacent frames, so as to calculate the motion information of the object between the adjacent frames and obtain the optical flow field image corresponding to the video frame.
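As an illustration of how an optical flow field image can be derived from two adjacent frames, the following sketch uses OpenCV's dense Farneback flow; this is one possible choice, as the patent does not specify the flow algorithm:

```python
import cv2

def optical_flow_image(prev_frame, cur_frame):
    """Compute a dense optical flow field between two adjacent BGR frames."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    cur_gray = cv2.cvtColor(cur_frame, cv2.COLOR_BGR2GRAY)
    # Positional arguments: flow=None, pyr_scale=0.5, levels=3, winsize=15,
    # iterations=3, poly_n=5, poly_sigma=1.2, flags=0
    flow = cv2.calcOpticalFlowFarneback(prev_gray, cur_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return flow  # shape (H, W, 2): per-pixel horizontal and vertical displacement
```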
It should be noted that, because the optical flow field image is substantially different from the RGB image, it is necessary to additionally construct a second TSM model for training the optical flow field image, and extract feature data by using the trained second TSM model and the first TSM model, respectively.
When the second TSM model is trained, the optical flow field image of each frame in the video frame of the first sample video and the characteristic data of the first sample video are used as sample data, the initially constructed second TSM model is trained, and the distance between the characteristic data output by the model and the characteristic data of the first sample video is gradually approximated and the numerical result tends to be stable by continuously adjusting the model parameters of the second TSM model, so that the final second TSM model is obtained.
S304, performing frame extraction processing on the video to be classified to obtain frame data of a plurality of video frames of the video to be classified.
S305, inputting RGB images of each frame in a plurality of video frames of the video to be classified into a first TSM model trained in advance to output characteristic data of the video to be classified.
S306, inputting the optical flow field image of each frame in the plurality of video frames of the video to be classified into a second TSM model trained in advance to output the characteristic data of the video to be classified.
S307, the characteristic data output by the first TSM model and the characteristic data output by the second TSM model are overlapped.
Because the feature extraction model adopts a two-stream structure and comprises a first TSM model and a second TSM model, the two streams need to be combined, namely, the feature data output by the first TSM model and the feature data output by the second TSM model are superimposed to obtain the final feature data.
It should be noted that, since the RGB image only includes static information at a certain time in the video frame, and lacks context information, and the optical flow field image can provide time sequence information of the previous and subsequent frames, the extracted feature data is richer and more comprehensive by superposing the feature data output by the first TSM model and the feature data output by the second TSM model.
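A minimal sketch of the superposition step, assuming the two streams produce frame-level feature matrices of the same shape; the patent describes the operation only as superposing, so element-wise addition and channel concatenation are shown as two common realizations:

```python
import numpy as np

def fuse_features(rgb_feat: np.ndarray, flow_feat: np.ndarray) -> np.ndarray:
    """rgb_feat, flow_feat: (T, D) frame-level features from the two TSM streams."""
    assert rgb_feat.shape == flow_feat.shape
    # element-wise superposition of the spatial (RGB) and temporal (flow) streams
    return rgb_feat + flow_feat

def fuse_by_concat(rgb_feat, flow_feat):
    # alternative: keep both streams by concatenating along the feature dimension
    return np.concatenate([rgb_feat, flow_feat], axis=-1)   # (T, 2D)
```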
S308, inputting the superimposed characteristic data into a pre-trained sequence model to obtain a classification result of the video to be classified, which is output by the sequence model.
It can be understood that, since the finally determined feature data incorporates more information in the time dimension and is therefore richer in content, the features referred to in the video classification are richer and more comprehensive, thereby indirectly improving the accuracy and recall rate of the sequence model; the robustness of the classification model is also improved by introducing the time dimension information.
According to the embodiment of the application, the feature extraction model is refined to comprise the first TSM model and the second TSM model, the first TSM model is trained through RGB images of each frame in a plurality of video frames of the first sample video and the feature data of the first sample video, the second TSM model is trained through the optical flow field images of each frame in the plurality of video frames of the first sample video and the feature data of the first sample video, so that feature extraction is carried out on videos to be classified by using the trained first TSM model and second TSM model, and the extracted feature data of the first TSM model and the extracted feature data of the second TSM model are overlapped, so that the finally obtained feature data contains space information and time information in the videos, the comprehensiveness of extracted features is improved, the robustness of the classification model is further improved, and the accuracy and recall rate of the classification model are guaranteed.
Example IV
Fig. 4A is a flowchart of a video classification method according to a fourth embodiment of the present application, where the embodiment of the present application performs optimization and improvement based on the technical solutions of the foregoing embodiments.
Further, before the operation of inputting the characteristic data into a pre-trained sequence model, adding the frame extraction processing to the second sample video to obtain frame data of a plurality of video frames of the second sample video; inputting frame data of a plurality of video frames of the second sample video into a trained feature extraction model to obtain feature data of the second sample video; and training the initially established sequence model by taking the characteristic data of the second sample video and the classification label information of the second sample video as sample data so as to perfect a model training mechanism of the sequence model.
A video classification method as shown in fig. 4A, comprising
S401, performing frame extraction processing on the second sample video to obtain frame data of a plurality of video frames of the second sample video.
And performing frame extraction processing on the second sample video, wherein the frame extraction processing can be performed on the whole second sample video in the time dimension so as to ensure that the whole video can be covered in the time dimension. Of course, in order to ensure that the frame data obtained by frame extraction can be uniformly spread over the time dimension of the second sample video, typically, the video may be equally spaced into multiple segments, and one video frame is randomly extracted from each segment, so as to implement equally spaced frame extraction sampling.
S402, inputting frame data of a plurality of video frames of the second sample video into the trained feature extraction model to obtain feature data of the second sample video.
The feature extraction model comprises a time sequence conversion model TSM model; and each time sequence convolution layer of the TSM model carries out time sequence offset operation with random direction on the input data of the layer.
The feature extraction model may include a first TSM model, or the feature extraction model may also include a second TSM model, for example. The training manner of the first TSM model and the second TSM model can be referred to the foregoing embodiments, and will not be described herein.
S403, training the initially established sequence model by taking the characteristic data of the second sample video and the classification label information of the second sample video as sample data.
When the sequence model is trained, the feature data of the second sample video and the classification label information of the second sample video are used as sample data, the initially established sequence model is trained, and model parameters in the sequence model are continuously adjusted according to training results, so that an error value between a classification result output by the model and the classification label information of the second sample video meets the set precision requirement. The setting accuracy can be set by a technician according to the requirement or an empirical value.
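The training loop described above can be sketched as follows. The network here is a deliberately simplified stand-in for the sequence model, and the feature dimension, number of categories, loss, and precision threshold are all assumed values used only to illustrate training until the error meets a set requirement:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, 30))
criterion = nn.BCEWithLogitsLoss()              # multi-label classification loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
target_precision = 0.05                         # assumed "set precision" threshold

features = torch.randn(64, 2048)                # stand-in for second-sample-video features
labels = torch.randint(0, 2, (64, 30)).float()  # stand-in for classification label information

for epoch in range(100):
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()                            # continuously adjust model parameters
    if loss.item() < target_precision:          # stop once the error meets the requirement
        break
```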
See the block diagram of the sequence model shown in fig. 4B, where the sequence model includes a whitening layer, a NetVLAD layer or NeXtVLAD layer, a filtering layer, a fully connected layer, a context gating layer, and a classifier, connected in sequence.
And a whitening layer for removing redundant information between the feature data to reduce correlation between the input feature data. Alternatively, a reverse whitening layer (Reverse Whitening) may be employed as the whitening layer.
The NetVLAD layer or the NeXtVLAD layer is used for fusing frame-level features into video-level features, and emphasizes the distribution association among the features; the main idea is to learn a clustering of the video frames and use the cluster center vectors as the video-level features.
The context gating layer is used for introducing an attention mechanism in the channel dimension and aims at modeling the dependence between classification results so as to learn better characteristic representation and ensure that different characteristics have better differentiation on different classes. Alternatively, a SE context gating layer (SE Context Gating) may be employed.
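A minimal sketch of a context gating operation of this kind (the simple one-layer form; the SE variant mentioned above adds a squeeze-excitation bottleneck, which is omitted here):

```python
import torch
from torch import nn

class ContextGating(nn.Module):
    """Re-weight each feature dimension by a learned, input-dependent gate in (0, 1)."""
    def __init__(self, dim: int):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):
        gates = torch.sigmoid(self.fc(x))   # attention over the channel dimension
        return gates * x                    # gated feature representation
```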
Alternatively, the classifier may be implemented in a logistic regression manner.
In an alternative implementation manner of the embodiment of the present application, the NeXtVLAD layer may decompose the input feature vector into a set number of low-dimensional feature vectors based on NetVLAD aggregation, so as to reduce the amount of data computation when converting the frame-level feature vectors into a video-level feature vector.
In the process of executing the embodiment of the application, the time complexity and the space complexity of the sequence model in the data operation process have different amplitude changes due to the different numbers of the low-dimensional feature vectors. To ensure a reduction in both temporal and spatial complexity during data operations, the number is typically set to an integer value greater than 8.
It should be noted that, when the features are extracted, the RGB image and/or the optical flow field image of the video frame are processed, so that the extracted feature data only includes the video feature data, and does not include the audio feature data, thereby reducing the data calculation amount of video classification by adopting the sequence model.
S404, performing frame extraction processing on the video to be classified to obtain frame data of a plurality of video frames of the video to be classified.
S405, inputting frame data of a plurality of video frames of a video to be classified into a feature extraction model trained in advance, and obtaining feature data of the video to be classified output by the feature extraction model; the feature extraction model comprises a time sequence conversion model TSM model; and each time sequence convolution layer of the TSM model carries out time sequence offset operation with random direction on the input data of the layer.
S406, inputting the characteristic data into a pre-trained sequence model to obtain a classification result of the video to be classified, which is output by the sequence model.
Before feature data is input into a pre-trained sequence model, the embodiment of the application additionally carries out frame extraction processing on the second sample video and carries out feature extraction operation on the frame data of a plurality of obtained video frames, so that the training operation of the sequence model is carried out by adopting the feature data and the classification label information of the second sample video, the training mechanism of the sequence model is perfected, and the normal use of the sequence model is ensured.
Example five
Fig. 5A is a flowchart of a video classification method in a fifth embodiment of the present application, which provides a preferred implementation manner based on the technical solutions of the foregoing embodiments, and is described in detail with reference to the schematic view of the video classification model frame shown in fig. 5B.
A video classification method as shown in fig. 5A, comprising:
s510, a video preprocessing stage;
s520, model training stage;
s530, a model use stage.
Wherein, in the video preprocessing stage, the method comprises the following steps:
s511, acquiring videos to be trained and videos to be classified, equally dividing each video into a plurality of sections, and randomly extracting a video frame from each section.
S512, determining the RGB image and the optical flow field image of each extracted video frame.
Wherein, in the training stage of the TSM model, the method comprises the following steps:
s521, the RGB image of each video frame in the video to be trained and the characteristic data of the video to be trained are input into the first TSM model which is initially constructed as training samples, and the first TSM model is trained.
S522, inputting the optical flow field image of each video frame in the video to be trained and the characteristic data of the video to be trained into the first TSM model which is built as training samples, and training the first TSM model.
The principle of the first TSM model is the same as that of the second TSM model, and the first TSM model will be described in detail as an example.
The first TSM model adopted in the application is obtained based on TSN model improvement, and the TSN model is firstly described in detail.
Referring to the schematic structure of the TSN model shown in fig. 5C, the input data of the TSN model is the frame data of the video segment obtained by splitting the complete video and selecting a part of video frames from the split video segment.
The TSN model 500 includes feature extraction 51 and feature fusion 52. Wherein the feature extraction 51 includes a timing feature extraction 511 and a semantic feature extraction 512. Wherein, the time sequence feature extraction 511 is used for extracting the time sequence feature information in the video clip; semantic feature extraction 512 is used to extract spatial feature information in the video segments. And the feature fusion 52 is used for fusing the extracted time sequence feature information and the space information to obtain final feature data.
Specifically, an input video is divided into several segments, and a snippet is randomly sampled from each segment. To ensure video-level input, the features of the entire video are typically covered in a uniformly sampled manner. That is, given a video V, it is equally divided into k segments, and video snippets {S_1, S_2, S_3, …, S_k} are randomly extracted, one from each segment. Modeling is then performed with reference to the following:
TSN(S_1, S_2, …, S_k) = H(g(F(S_1; W), F(S_2; W), …, F(S_k; W)))
where H(·), g(·) and F(·) are the functions of the corresponding layers, and W denotes the model parameters to be trained.
After the features of each frame are obtained, since the TSN is differentiable, the model parameters can be optimized by backpropagation; the loss function is as follows:
where y is the label of a certain output category, G is the output of the softmax layer in the TSN model, and C is the number of categories.
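The loss formula itself appears only as an image in the original publication; the standard TSN cross-entropy formulation (an assumption here, since the image is not reproduced) is:

```latex
L(y, G) = -\sum_{i=1}^{C} y_i \Bigl( G_i - \log \sum_{j=1}^{C} \exp G_j \Bigr)
```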
The TSM model adds a timing shift operation with random direction on the basis of the TSN model, that is, before the convolution operation of each timing convolution layer of the TSN model is performed, the timing shift operation with random direction is performed on the first 1/2 of the input data of the layer; the specific timing shift operation can be referred to fig. 2B and the related description of the foregoing embodiment, and will not be repeated here.
It should be noted that, the TSM model improves the classification capability through a multi-mode manner, and uses the RGB image as the input data of the spatial convolution network (i.e., the first TSM model), and uses the optical flow field image as the input data of the temporal convolution network (i.e., the second TSM model), so that the RGB image provides static information of a certain time, provides time sequence information of the front and rear frames through the optical flow field image, and superimposes the feature data output by the first TSM model and the feature data output by the second TSM model, so that more information is added to the feature data in the time dimension, that is, more abundant and comprehensive feature information can be obtained under the condition of the same receptive field, thereby being beneficial to improving the recognition capability of the model.
Wherein, in the training stage of the sequence model, the method comprises the following steps:
s523, training the initially established sequence model by taking the characteristic data and the classification label information of the sample to be trained as sample data.
The model structure of the sequence model can be seen in fig. 4B. The sequence model comprises a whitening layer, a NetVLAD layer or a NextVLAD layer, a filtering layer, a full connection layer, a context gating layer and a classifier which are connected in sequence.
And a whitening layer for removing redundant information between the feature data to reduce correlation between the input feature data. Alternatively, a reverse whitening layer (Reverse Whitening) may be employed as the whitening layer.
The NetVLAD layer or the NeXtVLAD layer is used for fusing frame-level features into video-level features, and emphasizes the distribution association among the features; the main idea is to learn a clustering of the video frames and use the cluster center vectors as the video-level features.
A context gating layer for introducing an attention mechanism in the channel dimension for context gating, aiming at modeling the dependency between classification results to learn better feature representation, thereby ensuring better differentiation of different features to different classes. Alternatively, a SE context gating layer (SE Context Gating) may be employed.
Referring to the schematic structure of the NetVLAD layer shown in fig. 5D: given the N-dimensional feature vectors x of an M-frame video, in a NetVLAD aggregation with K clusters, each frame is first converted into an N×K-dimensional feature vector using the following formula:
v_ijk = α_k(x_i)(x_ij − c_kj),
i ∈ {1, 2, …, M}, j ∈ {1, 2, …, N}, k ∈ {1, 2, …, K}
where x_ij is the j-th feature value of the i-th frame, and c_kj is the j-th value of the k-th cluster center.
α_k(x_i) can be understood as the weight with which the i-th frame belongs to the k-th cluster, and represents the proximity of x_i to cluster k.
α_k(x_i) can be modeled by a single fully connected layer activated by softmax, for example constructed as (the formula appears only as an image in the original; this is the standard NetVLAD soft assignment):
α_k(x_i) = exp(w_k^T x_i + b_k) / Σ_{k'=1..K} exp(w_{k'}^T x_i + b_{k'})
where w_k and b_k are parameters of the model to be trained.
Next, a video-level feature vector can be obtained by summing all frame-level features: y_jk = Σ_{i=1..M} v_ijk.
Then, bursts are suppressed by means of intra-frame normalization (l2). Meanwhile, the video-level feature vector is reduced to an H-dimensional hidden layer vector through a fully connected layer (FC).
Specifically, the total number of parameters of the NetVLAD layer is: N × K × (H + 2).
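A compact NumPy sketch of the NetVLAD aggregation just described (array shapes and names are illustrative; the actual layer is a trainable network module):

```python
import numpy as np

def netvlad_aggregate(x, centers, w, b):
    """x: (M, N) frame features; centers: (K, N) cluster centers;
    w: (N, K), b: (K,) parameters of the soft-assignment layer."""
    logits = x @ w + b                                     # (M, K)
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    alpha = np.exp(logits)
    alpha /= alpha.sum(axis=1, keepdims=True)               # softmax soft assignment
    resid = x[:, None, :] - centers[None, :, :]             # (M, K, N) residuals to centers
    v = (alpha[:, :, None] * resid).sum(axis=0)              # (K, N): sum over frames
    v /= np.linalg.norm(v, axis=1, keepdims=True) + 1e-12    # intra-normalization
    return v.reshape(-1)                                     # (K*N,) video-level vector

# Example with hypothetical sizes: 32 frames, 128-dim features, 8 clusters
M, N, K = 32, 128, 8
rng = np.random.default_rng(0)
video_vec = netvlad_aggregate(rng.normal(size=(M, N)), rng.normal(size=(K, N)),
                              rng.normal(size=(N, K)), rng.normal(size=K))
```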
Referring to the schematic structure of the NeXtVLAD layer shown in fig. 5E: in the NeXtVLAD layer, the input vector x_i is first expanded to a vector ẋ_i of dimension λN, where λ is a width coefficient that can be determined from an empirical value or by trial. A reshape operation then converts the high-dimensional feature vector ẋ_i into G low-dimensional feature vectors x̃_i^g, where G is the number of low-dimensional feature vectors (groups).
Next, the residual of each low-dimensional feature vector from the cluster centers is determined:
v_ijk^g = α_g(ẋ_i) α_gk(ẋ_i)(x̃_ij^g − c_kj)
where the proximity to cluster k consists of two parts:
α_gk(ẋ_i), which measures the soft assignment of x̃_i^g to cluster k, and α_g(ẋ_i) = σ(w_g^T ẋ_i + b_g), which serves as an attention function over each group of low-dimensional feature vectors; here σ(·) is an activation function whose output lies between 0 and 1, and w_g and b_g are parameters of the model to be trained.
Then, the video-level feature vector is obtained by aggregating over the time dimension and the group dimension: y_jk = Σ_i Σ_g v_ijk^g.
Finally, bursts are suppressed by means of intra-frame normalization (l2). Meanwhile, as in the NetVLAD layer, a fully connected layer (FC) reduces the dimensionality of the video-level feature vector.
Specifically, the total number of parameters of the NeXtVLAD layer is: λN(N + G + K(G + (H + 1)/G)). Since G is much smaller than H and N, the number of parameters of NeXtVLAD is generally about G/λ times smaller than that of NetVLAD.
It should be noted that, on the basis of NetVLAD aggregation, the NeXtVLAD layer can decompose the input feature vector into more than 8 (i.e., G > 8) low-dimensional feature vectors, so that the time complexity and space complexity of the sequence model are reduced.
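A compact NeXtVLAD-style sketch following the description above is given below; it is an illustrative reimplementation under stated assumptions (sigmoid group attention, softmax cluster assignment, λ = 2), not the patent's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeXtVLAD(nn.Module):
    def __init__(self, N: int, K: int, G: int, H: int, expansion: int = 2):
        super().__init__()
        self.G, self.K = G, K
        self.D = expansion * N // G                      # dimension of each low-dimensional vector
        self.expand = nn.Linear(N, expansion * N)        # x_i -> expanded lambda*N vector
        self.group_attn = nn.Linear(expansion * N, G)    # attention alpha_g over the G groups
        self.assign = nn.Linear(expansion * N, G * K)    # soft assignment alpha_gk to K clusters
        self.centers = nn.Parameter(torch.randn(K, self.D))
        self.fc = nn.Linear(K * self.D, H)

    def forward(self, x):                                # x: (M, N) frame-level features
        M = x.shape[0]
        x = self.expand(x)                               # (M, lambda*N)
        attn = torch.sigmoid(self.group_attn(x))         # (M, G): per-group attention in (0, 1)
        alpha = F.softmax(self.assign(x).view(M, self.G, self.K), dim=-1)  # (M, G, K)
        alpha = alpha * attn.unsqueeze(-1)               # combine assignment and group attention
        xg = x.view(M, self.G, self.D)                   # G low-dimensional vectors per frame
        residual = xg.unsqueeze(2) - self.centers.view(1, 1, self.K, self.D)  # (M, G, K, D)
        V = (alpha.unsqueeze(-1) * residual).sum(dim=(0, 1))  # aggregate over time and groups
        V = F.normalize(V, p=2, dim=-1)                  # intra-normalization
        return self.fc(V.flatten())                      # (H,) video-level feature vector

nextvlad = NeXtVLAD(N=1024, K=64, G=16, H=512)           # G > 8 low-dimensional vectors
video_feature = nextvlad(torch.randn(30, 1024))
```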
It should be noted that, during feature extraction, only the RGB images and/or optical flow field images of the video frames are processed, so that the extracted feature data contains only video feature data and no audio feature data, which reduces the amount of computation required for video classification with the sequence model.
In the model use stage, the method includes the following steps (a minimal sketch of this pipeline is given after the steps):
S531, inputting RGB images of a plurality of video frames of the video to be classified into the trained first TSM model, and outputting first feature data;
S532, inputting optical flow field images of a plurality of video frames of the video to be classified into the trained second TSM model, and outputting second feature data;
S533, superposing the first feature data and the second feature data to obtain target feature data;
S534, inputting the target feature data into the trained sequence model to obtain a classification result of the video to be classified.
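Putting S531–S534 together, one possible sketch of the use stage is shown below; `tsm_rgb`, `tsm_flow`, and `sequence_model` are assumed stand-ins for the trained models (any callables with the indicated shapes would do), and frame extraction is represented abstractly by pre-computed tensors.

```python
import torch

def classify_video(rgb_frames, flow_frames, tsm_rgb, tsm_flow, sequence_model):
    """S531-S534: two-stream feature extraction, superposition, sequence-model classification."""
    first_feature = tsm_rgb(rgb_frames)               # S531: (T, N) features from the RGB images
    second_feature = tsm_flow(flow_frames)             # S532: (T, N) features from the optical flow images
    target_feature = first_feature + second_feature    # S533: superposed target feature data
    logits = sequence_model(target_feature)            # S534: video-level class scores
    return int(logits.argmax(dim=-1))                  # index of the predicted class
```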
Example six
Fig. 6 is a block diagram of a video classification apparatus according to a sixth embodiment of the present application, where the embodiment of the present application is applicable to a case of classifying video (for example, animation) in combination with a TSM model, and the apparatus is implemented by software and/or hardware, and is specifically configured in an electronic device having a certain data computing capability.
The video classification apparatus 600 shown in fig. 6 comprises: a video frame extraction module 601, a feature extraction module 602, and a classification module 603. Wherein:
the video frame extraction module 601 is configured to perform frame extraction processing on a video to be classified to obtain frame data of a plurality of video frames of the video to be classified;
the feature extraction module 602 is configured to input frame data of a plurality of video frames of a video to be classified into a feature extraction model trained in advance, and obtain feature data of the video to be classified output by the feature extraction model; the feature extraction model comprises a time sequence conversion model TSM model; each time sequence convolution layer of the TSM model carries out time sequence offset operation with random direction on input data of the layer;
and the classification module 603 is configured to input the feature data to a pre-trained sequence model, and obtain a classification result of the video to be classified, which is output by the sequence model.
According to the embodiment of the application, the video to be classified is subjected to frame extraction processing through the video frame extraction module, so that frame data of a plurality of video frames of the video to be classified are obtained; inputting frame data of a plurality of video frames of the video to be classified into a feature extraction model trained in advance through a feature extraction module to obtain feature data of the video to be classified output by the feature extraction model; the feature extraction model comprises a TSM model; each time sequence convolution layer of the TSM model carries out time sequence offset operation with random direction on input data of the layer; and inputting the characteristic data into a pre-trained sequence model through a classification module to obtain a classification result of the video to be classified, which is output by the sequence model. According to the technical scheme, the feature extraction model comprising the TSM model is introduced, in the feature extraction process, the time sequence convolution layers of the TSM model carry out time sequence offset operation with random directions on input data of each layer, so that data augmentation is facilitated, hidden information in the input data is conveniently mined, extracted feature data is richer and more comprehensive, and the accuracy of video classification results is further improved.
Further, each time sequence convolution layer in the TSM model performs time sequence offset operation with random direction aiming at the first 1/N element of the input data of the layer, wherein the value of N is 2 or 3.
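As an illustration of this operation, a sketch is given below. It assumes the shift is applied along the frame (time) dimension to the leading 1/N of the channels of the layer input, with the shift direction drawn at random, and that the wrap-around behaviour of torch.roll is an acceptable simplification.

```python
import random
import torch

def random_direction_shift(x: torch.Tensor, n: int = 2) -> torch.Tensor:
    """Shift the first 1/n of the channels by one step along the time axis,
    choosing the shift direction (forward or backward in time) at random.

    x: (batch, time, channels, height, width)
    """
    c = x.shape[2] // n                       # number of leading channels to shift
    direction = random.choice([1, -1])        # +1: shift forward in time, -1: backward
    out = x.clone()
    # torch.roll wraps around at the clip boundary; zero-padding the vacated
    # frames would be an alternative implementation choice.
    out[:, :, :c] = torch.roll(x[:, :, :c], shifts=direction, dims=1)
    return out

clip = torch.randn(2, 8, 64, 56, 56)          # 2 clips, 8 frames, 64 channels per frame
shifted = random_direction_shift(clip, n=2)   # n = 2 or 3 per the description above
```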
Further, the device also comprises a feature extraction model training module for:
before frame extraction processing is carried out on the video to be classified, frame extraction processing is carried out on the first sample video, and frame data of a plurality of video frames of the first sample video are obtained;
and training the initially established feature extraction model by taking frame data of a plurality of video frames of the first sample video and feature data of the first sample video as sample data.
Further, the feature extraction model includes a first TSM model; training the feature extraction model includes training a first TSM model;
the input data of the first TSM model is RGB images of each frame in a plurality of video frames of a first sample video;
and the feature data of the video to be classified output by the feature extraction model is the feature data output by the first TSM model.
Further, the feature extraction model comprises a first TSM model and a second TSM model; training the feature extraction model, including training the first TSM model and the second TSM model;
The input data of the first TSM model is RGB images of each frame in a plurality of video frames of a first sample video;
the input data of the second TSM model is an optical flow field image of each frame in a plurality of video frames of the first sample video;
the feature data of the video to be classified, which is output by the feature extraction model, is feature data obtained by superposing feature data output by the first TSM model and feature data output by the second TSM model.
Further, the device also comprises a sequence model training module for:
before the characteristic data are input into a pre-trained sequence model, performing frame extraction processing on the second sample video to obtain frame data of a plurality of video frames of the second sample video;
inputting frame data of a plurality of video frames of the second sample video into a trained feature extraction module to obtain feature data of the second sample video;
and training the initially established sequence model by taking the characteristic data of the second sample video and the classification label information of the second sample video as sample data.
Further, the sequence model includes a NeXtVLAD layer that decomposes an input feature vector into a set number of low-dimensional feature vectors.
Further, the sequence model includes: the reverse whitening layer, the NeXtVLAD layer, the filtering layer, the full connection layer, the context gating layer and the classifier are connected in sequence.
Further, the set number is an integer value greater than 8.
The video classification device can execute the video classification method provided by any embodiment of the application, and has the corresponding functional modules and beneficial effects of executing the video classification method.
Example seven
According to an embodiment of the present application, the present application also provides an electronic device and a readable storage medium.
As shown in fig. 7, a block diagram of an electronic device performing a video classification method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the applications described and/or claimed herein.
As shown in fig. 7, the electronic device includes: one or more processors 701, memory 702, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executing within the electronic device, including instructions stored in or on memory to display graphical information of the GUI on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used along with multiple memories, if desired. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 701 is illustrated in fig. 7.
Memory 702 is a non-transitory computer readable storage medium provided by the present application. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the video classification method provided by the present application. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to execute the video classification method provided by the present application.
The memory 702 is used as a non-transitory computer readable storage medium for storing non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the video classification method according to the embodiment of the present application (e.g., the video frame extraction module 601, the feature extraction module 602, and the classification module 603 shown in fig. 6). The processor 701 executes various functional applications of the server and data processing, i.e., implements the video classification method in the above-described method embodiments, by running the non-transitory software programs, instructions, and modules stored in the memory 702.
Memory 702 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created by use of an electronic device performing the video classification method, and the like. In addition, the memory 702 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 702 optionally includes memory remotely located relative to processor 701, which may be connected to an electronic device performing the video classification method via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device performing the video classification method may further include: an input device 703 and an output device 704. The processor 701, the memory 702, the input device 703 and the output device 704 may be connected by a bus or otherwise, in fig. 7 by way of example.
The input device 703 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device performing the video classification method; examples of such input devices include a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a trackball, and a joystick. The output device 704 may include a display apparatus, auxiliary lighting devices (e.g., LEDs), haptic feedback devices (e.g., vibration motors), and the like. The display apparatus may include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display, and a plasma display. In some implementations, the display apparatus may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computing programs (also referred to as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical scheme of the embodiment of the application, frame data of a plurality of video frames of the video to be classified are obtained by performing frame extraction processing on the video to be classified; inputting frame data of a plurality of video frames of the video to be classified into a feature extraction model trained in advance to obtain feature data of the video to be classified output by the feature extraction model; the feature extraction model comprises a TSM model; each time sequence convolution layer of the TSM model carries out time sequence offset operation with random direction on input data of the layer; and inputting the characteristic data into a pre-trained sequence model to obtain a classification result of the video to be classified, which is output by the sequence model. According to the technical scheme, the feature extraction model comprising the TSM model is introduced, in the feature extraction process, the time sequence convolution layers of the TSM model carry out time sequence offset operation with random directions on input data of each layer, so that data augmentation is facilitated, hidden information in the input data is conveniently mined, extracted feature data is richer and more comprehensive, and the accuracy of video classification results is further improved.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed embodiments are achieved, and are not limited herein.
The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application should be included in the scope of the present application.

Claims (9)

1. A method of video classification, comprising:
performing frame extraction processing on the video to be classified to obtain frame data of a plurality of video frames of the video to be classified;
inputting frame data of a plurality of video frames of a video to be classified into a feature extraction model trained in advance, and obtaining feature data of the video to be classified output by the feature extraction model; the feature extraction model comprises a time sequence conversion model TSM model; each time sequence convolution layer of the TSM model carries out time sequence offset operation with random direction on input data of the layer;
Inputting the characteristic data into a pre-trained sequence model to obtain a classification result of the video to be classified, which is output by the sequence model;
when the feature extraction model is independently subjected to model training, frame data of a plurality of video frames of a first sample video and feature data of the first sample video are used as sample data, and model parameters of the feature extraction model which is initially established are trained;
the feature extraction model comprises a first TSM model and a second TSM model; training the feature extraction model, including training the first TSM model and the second TSM model;
the input data of the first TSM model is RGB images of each frame in a plurality of video frames of a first sample video;
the input data of the second TSM model is an optical flow field image of each frame in a plurality of video frames of the first sample video;
the feature data of the video to be classified, which is output by the feature extraction model, is feature data obtained by superposing the feature data output by the first TSM model and the feature data output by the second TSM model;
when the sequence model is independently subjected to model training, performing frame extraction processing on the second sample video to obtain frame data of a plurality of video frames of the second sample video; inputting frame data of a plurality of video frames of the second sample video into a trained feature extraction module to obtain feature data of the second sample video; and training the initially established sequence model by taking the characteristic data of the second sample video and the classification label information of the second sample video as sample data.
2. The method of claim 1, wherein each time sequence convolution layer in the TSM model performs a time sequence offset operation with random direction for the first 1/N element of the input data of the layer, where N has a value of 2 or 3.
3. The method of claim 1, wherein the feature extraction model comprises a first TSM model; training the feature extraction model includes training a first TSM model;
the input data of the first TSM model is RGB images of each frame in a plurality of video frames of a first sample video;
and the feature data of the video to be classified output by the feature extraction model is the feature data output by the first TSM model.
4. The method of claim 1, wherein the sequence model includes a nextvld layer that decomposes an input feature vector into a set number of low-dimensional feature vectors.
5. The method of claim 4, wherein the sequence model comprises: the system comprises a whitening layer, a NeXtVLAD layer, a filtering layer, a full connection layer, a context gating layer and a classifier which are connected in sequence.
6. The method of claim 4, wherein the set number is an integer value greater than 8.
7. A video classification apparatus, comprising:
the video frame extracting module is used for carrying out frame extracting processing on the video to be classified to obtain frame data of a plurality of video frames of the video to be classified;
the feature extraction module is used for inputting frame data of a plurality of video frames of the video to be classified into a feature extraction model trained in advance to obtain feature data of the video to be classified output by the feature extraction model; the feature extraction model comprises a time sequence conversion model TSM model; each time sequence convolution layer of the TSM model carries out time sequence offset operation with random direction on input data of the layer;
the classification module is used for inputting the characteristic data into a pre-trained sequence model to obtain a classification result of the video to be classified, which is output by the sequence model;
wherein, the video classification device further includes: the feature extraction model training module and the sequence model training module;
the feature extraction model training module is used for training an initially established feature extraction model by taking frame data of a plurality of video frames of a first sample video and feature data of the first sample video as sample data;
The feature extraction model comprises a first TSM model and a second TSM model; training the feature extraction model, including training the first TSM model and the second TSM model;
the input data of the first TSM model is RGB images of each frame in a plurality of video frames of a first sample video;
the input data of the second TSM model is an optical flow field image of each frame in a plurality of video frames of the first sample video;
the feature data of the video to be classified, which is output by the feature extraction model, is feature data obtained by superposing the feature data output by the first TSM model and the feature data output by the second TSM model;
the sequence model training module is used for performing frame extraction processing on the second sample video to obtain frame data of a plurality of video frames of the second sample video; inputting frame data of a plurality of video frames of the second sample video into a trained feature extraction module to obtain feature data of the second sample video; and training the initially established sequence model by taking the characteristic data of the second sample video and the classification label information of the second sample video as sample data.
8. An electronic device, comprising:
At least one processor; and
a memory communicatively coupled to the at least one processor; wherein,,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a video classification method according to any one of claims 1-6.
9. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform a video classification method according to any one of claims 1-6.