CN115861901B

CN115861901B - Video classification method, device, equipment and storage medium

Info

Publication number: CN115861901B
Application number: CN202211721670.3A
Authority: CN
Inventors: 骆剑平; 杨玉琪
Original assignee: Shenzhen University
Current assignee: Shenzhen University
Priority date: 2022-12-30
Filing date: 2022-12-30
Publication date: 2023-06-30
Anticipated expiration: 2042-12-30
Also published as: CN115861901A

Abstract

The embodiment of the disclosure provides a video classification method, a video classification device, video classification equipment and a storage medium. The method comprises the following steps: acquiring videos to be classified; the content of the video to be classified comprises at least one behavior action of a target object; inputting a first video frame corresponding to the video to be classified into a target video classification model to obtain an action classification result corresponding to the video to be classified; the target video classification model sequentially comprises a sparse sampling layer, a two-way excitation channel grouping layer and a segmentation consensus layer, wherein the sparse sampling layer, the two-way excitation channel grouping layer and the segmentation consensus layer are connected in cascade. According to the embodiment, through the two-way excitation channel grouping layer, huge time consumption of optical flow calculation and occupation of storage resources are avoided, difficulties caused by independent training of a multi-flow network are avoided, the calculated amount can be greatly reduced, and meanwhile, the reasoning speed and the classification accuracy are further improved.

Description

Video classification method, device, equipment and storage medium

Technical Field

The embodiment of the disclosure relates to the field of artificial intelligence, in particular to a video classification method, a device, equipment and a storage medium.

Background

One of the goals of artificial intelligence is: a machine is built that can accurately understand human behavior and intent to better serve humans. It is the problem that behavior recognition needs to be studied and discussed to construct a model that can understand human behavior.

When the human body behaviors are identified on the video, due to the richness and complexity of the human body behaviors, the factors such as the shielding of the visual field, the clutter of the background and the like, the human body behaviors are more difficult and challenging compared with the human body behaviors in the image. One of the mainstream technologies of the human behavior recognition method is a deep learning technology. Currently, the mainstream human behavior recognition technology based on deep learning can be divided into two types: one is to independently learn the characteristics of space, continuous optical flow and the like through a double-flow network, and perform characteristic fusion in the later stage; the other is to extract context information between adjacent frames in the video frame by high-dimensional convolution modeling time dimension.

However, the multi-stream network performs feature fusion after each branch independently extracts features in the training process, so that the training difficulty is high, the process of calculating the inter-frame optical flow information is very time-consuming, the extracted optical flow features must be stored in a disk, and the requirements on the storage cost and the calculation cost are high; the high-dimensional convolution such as the 3-dimensional convolution has large parameter quantity and calculation amount, and only local information of the video can be learned. In the practical application process, the behavior characteristics are directly extracted through the 3-dimensional convolutional neural network, and the problems of gradient disappearance, gradient explosion, overfitting and the like are also easy to be caused.

Disclosure of Invention

The embodiment of the disclosure provides a video classification method, a device, equipment and a storage medium, which can improve the speed and precision of video classification.

In a first aspect, an embodiment of the present disclosure provides a video classification method, including: acquiring videos to be classified; the content of the video to be classified comprises at least one behavior action of a target object; inputting a first video frame corresponding to the video to be classified into a target video classification model to obtain an action classification result corresponding to the video to be classified; the target video classification model sequentially comprises a sparse sampling layer, a two-way excitation channel grouping layer and a segmentation consensus layer, wherein the sparse sampling layer, the two-way excitation channel grouping layer and the segmentation consensus layer are connected in cascade.

In a second aspect, an embodiment of the present disclosure further provides a video classification apparatus, including: the video to be classified acquisition module is used for acquiring videos to be classified; the content of the video to be classified comprises at least one behavior action of a target object; the action classification result obtaining module is used for inputting a first video frame corresponding to the video to be classified into a target video classification model to obtain an action classification result corresponding to the video to be classified; the target video classification model sequentially comprises a sparse sampling layer, a two-way excitation channel grouping layer and a segmentation consensus layer, wherein the sparse sampling layer, the two-way excitation channel grouping layer and the segmentation consensus layer are connected in cascade.

In a third aspect, embodiments of the present disclosure further provide an electronic device, including:

one or more processors;

storage means for storing one or more programs,

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the video classification method as described in embodiments of the present disclosure.

In a fourth aspect, the disclosed embodiments also provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are for performing the video classification method as described in the disclosed embodiments.

According to the technical scheme, videos to be classified are obtained; the content of the video to be classified comprises at least one behavior action of a target object; inputting a first video frame corresponding to the video to be classified into a target video classification model to obtain an action classification result corresponding to the video to be classified; the target video classification model sequentially comprises a sparse sampling layer, a two-way excitation channel grouping layer and a segmentation consensus layer, wherein the sparse sampling layer, the two-way excitation channel grouping layer and the segmentation consensus layer are connected in cascade. According to the embodiment of the disclosure, through the two-way excitation channel grouping layer, not only is the key motion information among video frames, the time dependence relationship among channels and the long-distance space-time information of the video utilized, but also the end-to-end efficient video classification is realized with fewer input frames. According to the embodiment, through the two-way excitation channel grouping layer, huge time consumption of optical flow calculation and occupation of storage resources are avoided, difficulties caused by independent training of a multi-flow network are avoided, the calculated amount can be greatly reduced, and meanwhile, the reasoning speed and the classification accuracy are further improved.

Drawings

The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.

Fig. 1 is a schematic flow chart of a video classification method according to an embodiment of the disclosure;

fig. 2 is a schematic diagram of a video classification method according to an embodiment of the present invention;

fig. 3 is a schematic diagram of a bottleneck unit network structure according to an embodiment of the present invention;

fig. 4 is a schematic structural diagram of a video classification device according to an embodiment of the disclosure;

fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure have been shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but are provided to provide a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.

It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.

The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below.

It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.

It should be noted that references to "one", "a plurality" and "a plurality" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that "one or more" is intended to be understood as "one or more" unless the context clearly indicates otherwise.

It will be appreciated that the data (including but not limited to the data itself, the acquisition or use of the data) involved in the present technical solution should comply with the corresponding legal regulations and the requirements of the relevant regulations.

Fig. 1 is a schematic flow chart of a video classification method provided by an embodiment of the present disclosure, where the embodiment of the present disclosure is applicable to a case of video classification, for example, classification of behaviors of a target object in a video, and the method may be performed by a video classification device, where the device may be implemented in a form of software and/or hardware, and optionally, implemented by an electronic device, where the electronic device may be a mobile terminal, a PC side, a server, or the like.

As shown in fig. 1, the method includes:

s110, acquiring videos to be classified.

The content of the video to be classified comprises at least one behavior action of the target object. The target object may be a person, an animal, or the like, and the behavior of the target object may be an "open door" action, a "close door" action, or the like, as an example of a person. The present embodiment is not limited to this, nor is it limited to the type of behavior action.

S120, inputting a first video frame corresponding to the video to be classified into a target video classification model to obtain an action classification result corresponding to the video to be classified.

The target video classification model sequentially comprises a sparse sampling layer, a two-way excitation channel grouping layer and a segmentation consensus module, wherein the sparse sampling layer, the two-way excitation channel grouping layer and the segmentation consensus layer are connected in cascade.

In this embodiment, the duration of classifying the video may be arbitrary, which is not limited in this embodiment, and the number of the first video frames may be arbitrary, which is not limited in this embodiment. In this embodiment, the video to be classified may be extracted through a script to obtain a first video frame. For example, the duration of the video to be classified is 3 seconds, and the first video frame may be 80 frames. After the first video frame is obtained, the first video frame is input into a target video classification model, and an action classification result corresponding to the classified video is obtained.

The target video classification model may be a time-series segmented network-based model (Temporal Segment Networks, TSN) with inputs of video frames (images) only, and the backbone of the TSN network may be represented by a ResNet50 network. It should be noted that the two-way excitation channel packet layer may be understood as a layer that is an improvement over the ResNet50 network.

Optionally, inputting a first video frame corresponding to the video to be classified into the target video classification model to obtain an action classification result corresponding to the video to be classified, including: the sparse sampling layer performs random sampling on the first video frame to obtain a second video frame, and performs data enhancement processing on the second video frame to obtain an enhanced second video frame; the two-way excitation channel grouping layer performs deep feature extraction based on the enhanced second video frame to obtain deep features; the segmentation consensus layer calculates the average score of each video frame corresponding to the video to be classified on the same category according to the deep features; converting the average score into a probability value based on a set function; based on the probability values of the videos to be classified on all the categories, taking the action category corresponding to the maximum probability value as an action classification result, and outputting the action classification result.

Specifically, the sparse sampling layer divides the first video frame into several segments by a sparse sampling strategy, and randomly samples one frame (picture) from each segment. For example, the first video frame is 80 frames, the 80 frames (pictures) are divided into 8 segments by a sparse sampling strategy, each segment comprises 10 frames (pictures), and one frame is randomly sampled from each segment, so that 8 frames (pictures) are obtained. The 8 frames (pictures) may be the second video frame. After obtaining the second video frame, the second video frame is processed through a data enhancement strategy, and the height and width of the video frame are adjusted to be uniform, for example 224×224, so as to obtain the enhanced second video frame. Wherein the data enhancement includes random flipping and/or angle cropping operations. Wherein the video frames comprise time information, for example 8 frames of the second video frame, it can be understood that 8 time information.

In this embodiment, after deep features are obtained through the two-way excitation channel grouping layer, the deep features are mapped to specific sample spaces (the number of sample spaces is the total number of categories of the data set) through the full connection layer, so as to obtain features of the full connection layer. And the segmentation consensus layer calculates the average score of each video frame corresponding to the video to be classified on the same category according to the characteristics of the full connection layer through the segmentation consensus function. The segment consensus function is an Average Pooling mean function.

By way of example, if the category is 2, the second video frame is 3 frames, the prediction score of the first frame picture on the a category is 0.5, and the prediction score on the B category is 0.3; the second frame picture has a prediction score of 0.4 on the a-category, the B-category, the third frame picture has a prediction score of 0.6 on the a-category, and the B-category, the prediction score of 0.4 on the B-category, the average score of 3 video frames on the a-category is (0.5+0.4+0.6)/3=0.5. The average score of 3 video frames over category B is (0.3+0.5+0.4)/3=0.4.

In this embodiment, after calculating the average score of each video frame on the same category, the average score is converted into a probability value by a softmax function (normalization function), so as to obtain the probability value of the video to be classified on each category, and the category corresponding to the maximum probability value is used as the action classification result, and the action classification result is output.

Optionally, the two-way excitation channel grouping layer includes at least four two-way excitation channel grouping modules, and an input of a subsequent two-way excitation channel grouping module in the adjacent two-way excitation channel grouping modules is an output of a previous two-way excitation channel grouping module. Optionally, the two-way excitation channel grouping layer performs deep feature extraction based on the enhanced second video frame to obtain deep features, including: and the two-way excitation channel grouping module performs deep feature extraction based on the enhanced second video frame to obtain deep sub-features.

It should be noted that, for the number of two-way excitation channel grouping modules, the setting may be performed according to the backbone network of the TSN network, for example, if the backbone network of the TSN network may be represented by the ResNet50 network, the two-way excitation channel grouping layer may include 4 two-way excitation channel grouping modules. For the input of the two-way excitation channel grouping module, except that the first two-way excitation channel grouping module is the enhanced second video frame, the input of the next two-way excitation channel grouping module in the adjacent two-way excitation channel grouping modules is the output of the previous two-way excitation channel grouping module. Fig. 2 is a schematic diagram of a video classification method according to an embodiment of the present invention, as shown in fig. 2. The two-way excitation channel grouping layer comprises 4 two-way excitation channel grouping modules, and category distribution represents distribution of scores of videos to be classified on each category.

Optionally, the two-way excitation channel grouping module includes a plurality of bottleneck units, each bottleneck unit is connected in cascade, and an input of a subsequent bottleneck unit in the adjacent bottleneck units is an output of a previous bottleneck unit; the bottleneck unit comprises a first two-dimensional convolution subunit, a motion excitation subunit, a channel grouping subunit and a second two-dimensional convolution subunit; the input of the motion excitation subunit and the input of the channel excitation subunit are both the output of the first two-dimensional convolution subunit, the output of the motion excitation subunit and the output of the channel excitation subunit are added, the added output is used as the input of the channel grouping subunit, and the output of the channel grouping subunit is the input of the second two-dimensional convolution subunit; the two-way excitation channel grouping module performs deep feature extraction based on the enhanced second video frame to obtain deep sub-features, including: if the bottleneck unit to which the first two-dimensional convolution subunit belongs is the first bottleneck unit, the first two-dimensional convolution subunit performs feature extraction based on the enhanced second video frame to obtain a first convolution feature; otherwise, the first two-dimensional convolution subunit performs feature extraction based on the output of the previous bottleneck unit to obtain a first convolution feature; the motion excitation subunit performs feature extraction based on the first convolution feature to obtain a motion feature; the channel excitation subunit performs feature extraction based on the first convolution feature to obtain a channel feature; the channel grouping subunit performs feature extraction based on the motion feature and the feature after the channel feature addition to obtain long-distance space-time features; and the second two-dimensional convolution subunit performs feature extraction based on the long-distance space-time features to obtain second convolution features.

It should be noted that, for the number of bottleneck units, the setting may be performed according to the backbone network of the TSN network, for example, if the backbone network of the TSN network may be represented by the res net50 network, the number of bottleneck units of the 4 two-way excitation channel packet modules is respectively: 3. 4, 6 and 3, and all bottleneck units are connected in cascade. The input of the next bottleneck unit in the adjacent bottleneck units is the output of the previous bottleneck unit.

In this embodiment, the bottleneck unit includes a first two-dimensional convolution subunit, a motion excitation subunit, a channel grouping subunit, and a second two-dimensional convolution subunit; the first two-dimensional convolution subunit and the second two-dimensional convolution subunit may each be a two-dimensional convolution having a convolution kernel size 1*1.

Specifically, first, quaternary operation framing preprocessing is performed on the enhanced second video frame, then the preprocessed features are input into a first two-way excitation channel grouping module for feature extraction, then the first two-way excitation channel grouping module is output as a second two-way excitation channel grouping module for input, and so on until the last two-way excitation channel grouping module performs feature extraction. Wherein, the quaternary operation group frame is Conv-BN-ReLU-MaxPool, conv can be a convolution kernel with the size of 7 multiplied by 7 and the step length of 2, BN is processed by batch normalization (Batch Normalization, BN), reLU is a modified linear unit activation function, maxPool is a pooling kernel with the size of 3 multiplied by 3 and the step length of 2. Wherein the feature sequence shape of the second video frame may be [ N, T, C, H, W ], N being a batch size, e.g., 128, T and C representing a time dimension and a channel dimension, respectively, H and W being the height and width of the spatial shape.

Specifically, for a first two-dimensional convolution subunit in bottleneck units in the two-way excitation channel grouping module, if the bottleneck unit to which the first two-dimensional convolution subunit belongs is the first bottleneck unit, the first two-dimensional convolution subunit performs feature extraction based on features after quaternary operation framing preprocessing to obtain a first convolution feature; otherwise, the first two-dimensional convolution subunit performs feature extraction based on the output of the previous bottleneck unit to obtain a first convolution feature. After the first convolution feature is obtained, the motion excitation subunit performs feature extraction based on the first convolution feature to obtain a motion feature; the channel excitation subunit performs feature extraction based on the first convolution feature to obtain a channel feature; the channel grouping subunit performs feature extraction based on the motion feature and the feature after the channel feature addition to obtain long-distance space-time features; and the second two-dimensional convolution subunit performs feature extraction based on the long-distance space-time features to obtain second convolution features. Fig. 3 is a schematic diagram of a bottleneck unit network according to an embodiment of the present invention, as shown in fig. 3. In the figure "+" indicates that the output of the motion-excitation subunit (motion feature) and the output of the channel-excitation subunit (channel feature) are added, and the channel-grouping subunit performs feature extraction based on the feature obtained by adding the motion feature and the channel feature.

Optionally, the motion excitation subunit performs feature extraction based on the first convolution feature to obtain a motion feature, including: compressing the channel number of the first convolution feature through third two-dimensional convolution to obtain a channel compression feature; for the channel compression characteristics at the adjacent moment, extracting the characteristics of the channel compression characteristics at the moment t+1 through fourth convolution to obtain fourth convolution characteristics; subtracting the fourth convolution characteristic from the channel compression characteristic at the time t to obtain a plurality of motion sub-characteristics; wherein t is a positive integer, and the value range of t is between a first set value and a second set value; splicing the plurality of motion sub-features in the time dimension to obtain a first complete motion sub-feature; setting the motion characteristic at the last moment as a third set value, and obtaining the motion sub-characteristic at the last moment; the first complete motion sub-feature is connected with the motion sub-feature at the last moment in series to obtain a second complete motion sub-feature; processing the second complete sub-motion features through global average pooling to obtain first pooled features; adjusting the channel number of the first pooling feature based on the fifth convolution to obtain an adjusted first pooling feature; extracting the first pooled feature by the feature extraction of the attention mechanism to obtain an enhanced motion sub-feature; and carrying out residual connection based on the first convolution characteristic and the enhanced motion sub-characteristic to obtain the motion characteristic.

Specifically, the motion excitation subunit performs feature extraction on the first convolution feature, and the process of obtaining the motion feature is as follows: and (3) compressing the channel number of the first convolution characteristic to 1/16 of the original channel number through a third two-dimensional convolution (two-dimensional convolution with the size of 1 multiplied by 1 and the step length of 1), so as to reduce the calculation cost, improve the calculation efficiency and obtain the channel compression characteristic. The formula is as follows:

X ^r ＝conv _red *X,X ^r ∈R ^{N×T×C/r×H×W} (1)

wherein X is _r Channel compression feature, X is a first convolution feature, conv _red For the third two-dimensional convolution, the convolution operation is represented. 1/r is the channel reduction ratio, which may be 16.

Specifically, a fourth two-dimensional convolution is firstly applied to the channel compression characteristic at the time t+1 to obtain a fourth convolution characteristic, and then the channel compression characteristic at the time t is subtracted from the fourth convolution characteristic to obtain a motion sub-characteristic at the time t. And executing the operation on the channel compression characteristics at each adjacent moment to obtain a plurality of motion sub-characteristics. Wherein the first setting value is 1, and the second setting value is the number of second video frames minus 1. The specific formula is as follows:

M(t)＝conv _trans *X ^r (t+1)-X ^r (t),1≤t≤T-1 (2)

wherein M (t) ∈R ^N×C/r×H×W Representing the motion sub-feature at time T, T representing the number of second video frames minus 1, i.e. the second set point, conv _trans Is a two-dimensional convolution of size 3 x 3 with a step size of 1. Compression features for every two adjacent channels And (3) executing the formula (2) to obtain t-1 motion characteristic representations, and splicing a plurality of motion sub-characteristics in the time dimension to obtain a first complete motion sub-characteristic. In order to make the time dimension of the first complete motion sub-feature be the same as that of the first convolution feature, the motion feature at the last moment is set as a third set value, and the motion sub-feature at the last moment is obtained. Wherein the third set value is 0, that is, M (T) =0, and the first complete motion sub-feature is connected in series with the motion sub-feature at the last moment, obtaining a second complete motor sub-feature, constructing a final second complete motor sub-feature M, i.e. M (T) = [ M (1), M (2), & M (T)]。

Because the goal of the motion-activated subunit is to activate the motion-sensitive channels, the network is more aware of the motion information, without regard to detailed spatial layout. Thus, the second complete sub-motion feature may be processed by global averaging pooling as follows:

M ^s ＝Pool(M),M ^s ∈R ^{N×T×C/r×1×1} (3)

wherein M is the second complete motion sub-feature, M ^s Representing a first pooling feature.

Then, the channel number of the first pooling feature is adjusted based on the fifth convolution, and the adjusted first pooling feature is obtained, so that the channel number of the first pooling feature is restored to the original size. And obtains the motion attention weight by activating the function Sigmoid (). The fifth convolution is a two-dimensional convolution with a size of 1×1 and a step size of 1. The formula is as follows:

A＝2δ(conv _exp *M ^s )-1,A∈R ^{N×T×C×1×1} (4)

Wherein conv _exp For the fifth convolution, δ is the activation function, and a is the exercise attention weight a, i.e., the exercise attention mechanism weight.

Then, multiplying the first convolution feature with the motion attention mechanism weight may result in an enhanced motion sub-feature. And finally, carrying out residual connection based on the first convolution characteristic and the enhanced motion sub-characteristic to obtain the motion characteristic, and retaining original information and enhancing motion information through residual connection. The formula is as follows:

wherein,,

is the output of the motion-excitation subunit, +..

Optionally, the channel excitation subunit performs feature extraction based on the first convolution feature to obtain a channel feature, including: processing the first convolution feature through global average pooling to obtain a second pooled feature; adjusting the channel number of the second pooling feature through sixth convolution to obtain an adjusted second pooling feature; exchanging the positions of the channel dimension and the time dimension of the adjusted second pooling feature to obtain a third pooling feature; wherein the second pooling feature includes a batch size dimension, a time dimension, a channel dimension, a height dimension, and a width dimension; processing the third pooling feature through seventh convolution to obtain a first channel sub-feature; exchanging the positions of the channel dimension and the time dimension of the first channel sub-feature to obtain a second channel sub-feature; extracting the characteristics of the attention mechanism from the second channel sub-characteristics to obtain enhanced channel sub-characteristics; and carrying out residual connection based on the first convolution characteristic and the enhanced channel sub-characteristic to obtain the channel characteristic.

Specifically, the channel excitation subunit performs feature extraction on the first convolution feature, and the process of obtaining the channel feature is as follows: first, the first convolution feature is processed through global average pooling to obtain a second pooled feature, and the formula is as follows:

where ":" represents the full value of the feature, the first ":" represents the full value of the batch size, the second ":" represents the full feature of the time dimension, and the third ":" represents the full feature of the channel dimension. F is a second pooling feature.

Next, the number of channels of the second pooled feature is compressed by a sixth convolution, which is a two-dimensional convolution of size 1×1, to obtain an adjusted second pooled feature. The formula is as follows:

F _r ＝K ₁ *F,F _r ∈R ^N,T,C/r,1,1 (7)

wherein K is ₁ Is a sixth convolution, 1/r is the compression ratio, and r is 16.

And then, exchanging the positions of the channel dimension and the time dimension of the adjusted second pooling feature to obtain a third pooling feature so as to support time reasoning. The third pooling feature has a shape of [ N, C/r, T,1]And is recorded as

The third pooled feature is then processed by a seventh convolution to obtain the first channel sub-feature. Wherein the seventh convolution is a one-dimensional convolution with a kernel size of 3. The formula is as follows:

Wherein K is ₂ For the seventh convolution of the data set,

is a first channel sub-feature.

Next, the process will

Dimension is adjusted to [ N, T, C/r,1]Obtain a second channel sub-feature and record as F _temp 。

And then, channel activation is carried out on the second channel sub-feature through an eighth convolution and activation function Sigmoid, so that the channel attention mechanism weight is obtained. Wherein the eighth convolution is a two-dimensional convolution of size 1 x 1. The formula is as follows:

wherein K is ₃ For the eighth convolution, F _o An eighth convolved feature is applied to the second channel sub-feature, M being the channel attention mechanism weight.

Finally, extracting the characteristics of the attention mechanism from the second channel sub-characteristics to obtain enhanced channel sub-characteristics; and carrying out residual connection based on the first convolution characteristic and the enhanced channel sub-characteristic to obtain the channel characteristic. The formula is as follows:

wherein,,

the output of the subunit is stimulated for the channel.

Optionally, the channel grouping subunit performs feature extraction based on the motion feature and the feature after the channel feature is added to obtain a long-distance space-time feature, including: dividing the feature obtained by adding the motion feature and the channel feature in the channel dimension to obtain a set number of long-distance space-time sub-features; for the second long-distance space-time sub-feature, sequentially processing through channel-level time sequence sub-convolution and space sub-convolution to obtain a new second long-distance space-time sub-feature; for the N long-distance space-time sub-feature, adding the new previous-distance space-time sub-feature and the N long-distance space-time sub-feature to obtain a residual feature; processing the residual features sequentially through channel-level sequential sub-convolution and space sub-convolution to obtain new N long-distance space-time sub-features; wherein N is a positive integer greater than 2; and splicing the first long-distance space-time sub-feature, the new second long-distance space-time sub-feature and the new N long-distance space-time sub-feature in the channel dimension to obtain the long-distance space-time feature.

Specifically, the channel grouping subunit performs feature extraction based on the motion feature and the feature after the channel feature is added, and the process of obtaining the long-distance space-time feature is as follows: firstly, the feature obtained by adding the motion feature and the channel feature is divided in the channel dimension to obtain a set number of long-distance space-time sub-features. Wherein the set number is 4. Each long-range spatio-temporal sub-feature has a shape of [ N, T, C/4, H, W ],

for the first long-range spatio-temporal sub-feature, the formula is as follows:

wherein, when i=1, X _i Representing a first long-range spatiotemporal sub-feature,

is a new first long-range spatiotemporal sub-feature. I.e. the new first long-range spatiotemporal sub-feature is the same as the first long-range spatiotemporal sub-feature, the receptive field of the new first long-range spatiotemporal sub-feature is therefore 1 x 1.

And for the second long-distance space-time sub-feature, sequentially processing through channel-level time sequence sub-convolution and space sub-convolution to obtain a new second long-distance space-time sub-feature. Wherein the channel level sequential sub-convolution represents a one-dimensional convolution of size 3 and the spatial sub-convolution represents a two-dimensional convolution of size 3 x 3. The formula is as follows:

wherein conv _temp For channel-level sequential sub-convolution, conv _spa Is a spatial sub-convolution. When i=2, X _i Representing a second long-range spatiotemporal sub-feature.

Is a new first long-range spatiotemporal sub-feature.

And for the third long-distance space-time sub-feature, adding the third long-distance space-time sub-feature and the new second long-distance space-time sub-feature to obtain a residual feature, and processing the residual feature sequentially through channel-level time sequence sub-convolution and space sub-convolution to obtain the new third long-distance space-time sub-feature.

And for the fourth long-distance space-time sub-feature, adding the fourth long-distance space-time sub-feature and the new third long-distance space-time sub-feature to obtain a residual feature, and processing the residual feature sequentially through channel-level time sequence sub-convolution and space sub-convolution to obtain the new fourth long-distance space-time sub-feature. The formula is as follows:

wherein,,

representing a new previous distance spatiotemporal sub-feature, when i=3,/is->

Representing a new third long-range spatio-temporal sub-feature, when i=4,/is>

Representing a new fourth long-range spatiotemporal sub-feature.

In this embodiment, for the nth long-distance spatio-temporal sub-feature, the channel grouping sub-unit is converted from the parallel structure to the stacked structure by adding the new previous-distance spatio-temporal sub-feature to the nth long-distance spatio-temporal sub-feature, i.e., adding the residual connection. The receptive field of the new fourth long-range spatiotemporal sub-feature is tripled by residual connection. I.e. different long distance time space features have different receptive fields.

And finally, splicing the first long-distance space-time sub-feature, the new second long-distance space-time sub-feature and the new Nth long-distance space-time sub-feature in the channel dimension in a serial mode to obtain the long-distance space-time feature. The formula is as follows:

wherein,,

and->

Representing a new first long-distance spatiotemporal sub-feature, a new second long-distance spatiotemporal sub-feature, a new third long-distance spatiotemporal sub-feature, and a new fourth long-distance spatiotemporal sub-feature, respectively, X ^o Is a long distance space-time feature. Long-range spatiotemporal features capture spatiotemporal information at different times.

It should be noted that, the second bottleneck unit, the third bottleneck unit, and the subsequent second two-way excitation channel grouping module, the third two-way excitation channel grouping module, and the fourth two-way excitation channel grouping module in the first two-way excitation channel grouping module perform further feature extraction, where the extraction process is consistent with the process of the above formula (1) -formula (15), and the difference is that: the number of channels and the shape dimensions vary continuously.

In this embodiment, the training mode of the target video classification model is: training the target video classification model based on the training set to obtain a trained target video classification model; and testing the trained target image processing model based on the test set. The loss function in this embodiment is a cross entropy loss function.

In the embodiment, key motion information, channel information and long-distance space-time information in the classification process are enhanced through a two-way excitation channel grouping layer. When a motion excitation subunit is constructed, the motion information in the time dimension is enhanced by segmenting the features in the time dimension and carrying out adjacent frame difference, the feature activation function is used for carrying out feature activation on local motion between adjacent frames, so that the motion features of the second video inter-frame short time sequence are extracted efficiently, and meanwhile, the residual structure is adopted to store the static scene information of the original frame (the first convolution feature); when a channel excitation subunit is constructed, time information of channel characteristics is represented by one-dimensional convolution, the channel characteristics are adaptively calibrated by using a Sigmoid activation function, and time dependence among channels is represented; when the channel grouping subunit is constructed, the long-distance space-time sub-feature and the corresponding local convolution (channel level time sequence sub-convolution and space sub-convolution) are divided into a group of subsets, and a multi-level residual error structure is added, so that the original multi-level cascade structure is converted into a multi-level parallel structure, the multi-scale representation capability of a convolution kernel is improved, and the equivalent receptive field of the time dimension is correspondingly enlarged.

In this embodiment, the motion excitation subunit extracts motion information of short time sequence between adjacent video frames, the channel excitation subunit adaptively adjusts time dependency relationship between channels, and the channel grouping subunit extracts time-space information of long time sequence, integrates the three subunits in the bottleneck unit, and builds an efficient target video classification model by stacking the bottleneck units.

Fig. 4 is a schematic structural diagram of a video classification device according to an embodiment of the disclosure, as shown in fig. 4, where the device includes: a video acquisition module 410 to be classified and an action classification result acquisition module 420;

The video to be classified acquisition module 410 is configured to acquire a video to be classified; the content of the video to be classified comprises at least one behavior action of a target object;

the action classification result obtaining module 420 is configured to input a first video frame corresponding to the video to be classified into a target video classification model, and obtain an action classification result corresponding to the video to be classified; the target video classification model sequentially comprises a sparse sampling layer, a two-way excitation channel grouping layer and a segmentation consensus layer, wherein the sparse sampling layer, the two-way excitation channel grouping layer and the segmentation consensus layer are connected in cascade.

According to the technical scheme, the video to be classified is acquired through the video to be classified acquisition module; the content of the video to be classified comprises at least one behavior action of a target object; inputting a first video frame corresponding to the video to be classified into a target video classification model through an action classification result obtaining module to obtain an action classification result corresponding to the video to be classified; the target video classification model sequentially comprises a sparse sampling layer, a two-way excitation channel grouping layer and a segmentation consensus layer, wherein the sparse sampling layer, the two-way excitation channel grouping layer and the segmentation consensus layer are connected in cascade. According to the embodiment of the disclosure, through the two-way excitation channel grouping layer, not only is the key motion information among video frames, the time dependence relationship among channels and the long-distance space-time information of the video utilized, but also the end-to-end efficient video classification is realized with fewer input frames. According to the embodiment, through the two-way excitation channel grouping layer, huge time consumption of optical flow calculation and occupation of storage resources are avoided, difficulties caused by independent training of a multi-flow network are avoided, the calculated amount can be greatly reduced, and meanwhile, the reasoning speed and the classification accuracy are further improved.

Optionally, the action classification result obtaining module is specifically configured to: the sparse sampling layer performs random sampling on the first video frame to obtain a second video frame, and performs data enhancement processing on the second video frame to obtain an enhanced video frame; the data enhancement includes random flipping and/or angle cropping operations; wherein the video frame includes time information; the two-way excitation channel grouping layer performs deep feature extraction based on the enhanced second video frame to obtain deep features; the segmentation consensus layer calculates the average score of each video frame corresponding to the video to be classified on the same category according to the deep features; converting the average score into a probability value based on a set function; and taking the action category corresponding to the maximum probability value as an action classification result based on the probability values of the videos to be classified on all categories, and outputting the action classification result.

Optionally, the two-way excitation channel grouping layer includes at least four two-way excitation channel grouping modules, and an input of a subsequent two-way excitation channel grouping module in the adjacent two-way excitation channel grouping modules is an output of a previous two-way excitation channel grouping module. The action classification result obtaining module is further used for: and the two-way excitation channel grouping module performs deep feature extraction based on the enhanced second video frame to obtain deep sub-features.

Optionally, the two-way excitation channel grouping module includes a plurality of bottleneck units, each bottleneck unit is connected in cascade, and an input of a subsequent bottleneck unit in the adjacent bottleneck units is an output of a previous bottleneck unit; the bottleneck unit comprises a first two-dimensional convolution subunit, a motion excitation subunit, a channel grouping subunit and a second two-dimensional convolution subunit; the input of the motion excitation subunit and the input of the channel excitation subunit are both the output of the first two-dimensional convolution subunit, the motion excitation subunit output and the output of the channel excitation subunit are added, the added output is used as the input of the channel grouping subunit, and the output of the channel grouping subunit is the input of the second two-dimensional convolution subunit.

Optionally, the action classification result obtaining module is further configured to: if the bottleneck unit to which the first two-dimensional convolution subunit belongs is the first bottleneck unit, the first two-dimensional convolution subunit performs feature extraction based on the enhanced second video frame to obtain a first convolution feature; otherwise, the first two-dimensional convolution subunit performs feature extraction based on the output of the previous bottleneck unit to obtain a first convolution feature; the motion excitation subunit performs feature extraction based on the first convolution feature to obtain a motion feature; the channel excitation subunit performs feature extraction based on the first convolution feature to obtain a channel feature; the channel grouping subunit performs feature extraction based on the motion feature and the feature added by the channel feature to obtain long-distance space-time feature; and the second two-dimensional convolution subunit performs feature extraction based on the long-distance space-time feature to obtain a second convolution feature.

Optionally, the action classification result obtaining module is further configured to: carrying out channel number compression on the first convolution characteristic through third two-dimensional convolution to obtain a channel compression characteristic; for the channel compression characteristics at the adjacent moment, extracting the characteristics of the channel compression characteristics at the moment t+1 through fourth two-dimensional convolution to obtain fourth convolution characteristics; subtracting the fourth convolution characteristic from the channel compression characteristic at the time t to obtain a plurality of motion sub-characteristics; wherein t is a positive integer, and the value range of t is between a first set value and a second set value; splicing the plurality of motion sub-features in the time dimension to obtain a first complete motion sub-feature; setting the motion characteristic at the last moment as a third set value, and obtaining the motion sub-characteristic at the last moment; the first complete motion sub-feature and the motion sub-feature at the last moment are connected in series to obtain a second complete motion sub-feature; processing the second complete sub-motion features through global average pooling to obtain first pooled features; adjusting the channel number of the first pooling feature based on a fifth convolution to obtain an adjusted first pooling feature; extracting the first pooled feature by a feature extraction mechanism to obtain an enhanced motion sub-feature; and carrying out residual connection based on the first convolution characteristic and the enhanced motion sub-characteristic to obtain a motion characteristic.

Optionally, the action classification result obtaining module is further configured to: processing the first convolution feature through global average pooling to obtain a second pooled feature; adjusting the channel number of the second pooling feature through sixth convolution to obtain an adjusted second pooling feature; exchanging the positions of the channel dimension and the time dimension of the adjusted second pooling feature to obtain a third pooling feature; wherein the second pooling feature includes a batch size dimension, a time dimension, a channel dimension, a height dimension, and a width dimension; processing the third pooling feature through seventh convolution to obtain a first channel sub-feature; exchanging the positions of the channel dimension and the time dimension of the first channel sub-feature to obtain a second channel sub-feature; extracting the characteristics of the attention mechanism from the second channel sub-characteristics to obtain enhanced channel sub-characteristics; and carrying out residual connection based on the first convolution characteristic and the enhanced channel sub-characteristic to obtain a channel characteristic.

Optionally, the action classification result obtaining module is further configured to: dividing the feature obtained by adding the motion feature and the channel feature in the channel dimension to obtain a set number of long-distance space-time sub-features; for the second long-distance space-time sub-feature, sequentially processing through channel-level time sequence sub-convolution and space sub-convolution to obtain a new second long-distance space-time sub-feature; for the N long-distance space-time sub-feature, adding the new previous-distance space-time sub-feature and the N long-distance space-time sub-feature to obtain a residual feature; processing the residual features sequentially through channel-level sequential sub-convolution and space sub-convolution to obtain new N long-distance space-time sub-features; wherein N is a positive integer greater than 2; and splicing the first long-distance space-time sub-feature, the new second long-distance space-time sub-feature and the new N long-distance space-time sub-feature in the channel dimension to obtain the long-distance space-time feature.

The video classification device provided by the embodiment of the disclosure can execute the video classification method provided by any embodiment of the disclosure, and has the corresponding functional modules and beneficial effects of the execution method.

It should be noted that each unit and module included in the above apparatus are only divided according to the functional logic, but not limited to the above division, so long as the corresponding functions can be implemented; in addition, the specific names of the functional units are also only for convenience of distinguishing from each other, and are not used to limit the protection scope of the embodiments of the present disclosure.

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure. Referring now to fig. 5, a schematic diagram of an electronic device (e.g., a terminal device or server in fig. 5) 500 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 5 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.

As shown in fig. 5, the electronic device 500 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 501, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the electronic apparatus 500 are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other via a bus 504. An edit/output (I/O) interface 505 is also connected to bus 504.

In general, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 507 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 508 including, for example, magnetic tape, hard disk, etc.; and communication means 509. The communication means 509 may allow the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 shows an electronic device 500 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.

In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or from the storage means 508, or from the ROM 502. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 501.

The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.

The electronic device provided by the embodiment of the present disclosure and the video classification method provided by the foregoing embodiment belong to the same inventive concept, and technical details not described in detail in the present embodiment may be referred to the foregoing embodiment, and the present embodiment has the same beneficial effects as the foregoing embodiment.

The embodiment of the present disclosure provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the video classification method provided by the above embodiment.

It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.

In some implementations, the clients, servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol ), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), the internet (e.g., the internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.

The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.

The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring videos to be classified; the content of the video to be classified comprises at least one behavior action of a target object; inputting a first video frame corresponding to the video to be classified into a target video classification model to obtain an action classification result corresponding to the video to be classified; the target video classification model sequentially comprises a sparse sampling layer, a two-way excitation channel grouping layer and a segmentation consensus layer, wherein the sparse sampling layer, the two-way excitation channel grouping layer and the segmentation consensus layer are connected in cascade.

Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, an object oriented programming language such as Java, smalltalk, C ++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. The name of the unit does not in any way constitute a limitation of the unit itself, for example the first acquisition unit may also be described as "unit acquiring at least two internet protocol addresses".

The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The foregoing description is only of the preferred embodiments of the present disclosure and description of the principles of the technology being employed. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to in this disclosure is not limited to the specific combinations of features described above, but also covers other embodiments which may be formed by any combination of features described above or equivalents thereof without departing from the spirit of the disclosure. Such as those described above, are mutually substituted with the technical features having similar functions disclosed in the present disclosure (but not limited thereto).

Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims

1. A method of video classification, comprising:

acquiring videos to be classified; the content of the video to be classified comprises at least one behavior action of a target object;

inputting a first video frame corresponding to the video to be classified into a target video classification model to obtain an action classification result corresponding to the video to be classified; the target video classification model sequentially comprises a sparse sampling layer, a two-way excitation channel grouping layer and a segmentation consensus layer, wherein the sparse sampling layer, the two-way excitation channel grouping layer and the segmentation consensus layer are connected in cascade; the two-way excitation channel grouping layer comprises a two-way excitation channel grouping module;

the two-way excitation channel grouping module comprises a plurality of bottleneck units, wherein each bottleneck unit is connected in cascade, and the input of the next bottleneck unit in the adjacent bottleneck units is the output of the previous bottleneck unit; the bottleneck unit comprises a first two-dimensional convolution subunit, a motion excitation subunit, a channel grouping subunit and a second two-dimensional convolution subunit; the input of the motion excitation subunit and the input of the channel excitation subunit are both the output of the first two-dimensional convolution subunit, the motion excitation subunit output and the output of the channel excitation subunit are added, the added output is used as the input of the channel grouping subunit, and the output of the channel grouping subunit is the input of the second two-dimensional convolution subunit;

If the bottleneck unit to which the first two-dimensional convolution subunit belongs is the first bottleneck unit, the first two-dimensional convolution subunit performs feature extraction based on the enhanced second video frame to obtain a first convolution feature; the enhanced second video frame is obtained by the following steps: processing the first video frame through a sparse sampling layer to obtain the first video frame;

otherwise, the first two-dimensional convolution subunit performs feature extraction based on the output of the previous bottleneck unit to obtain a first convolution feature;

the motion excitation subunit performs feature extraction based on the first convolution feature to obtain a motion feature;

the channel excitation subunit performs feature extraction based on the first convolution feature to obtain a channel feature;

the channel grouping subunit performs feature extraction based on the motion feature and the feature added by the channel feature to obtain long-distance space-time feature;

the second two-dimensional convolution subunit performs feature extraction based on the long-distance space-time feature to obtain a second convolution feature;

the channel excitation subunit performs feature extraction based on the first convolution feature to obtain a channel feature, including:

processing the first convolution feature through global average pooling to obtain a second pooled feature;

Adjusting the channel number of the second pooling feature through sixth convolution to obtain an adjusted second pooling feature;

exchanging the positions of the channel dimension and the time dimension of the adjusted second pooling feature to obtain a third pooling feature; wherein the second pooling feature includes a batch size dimension, a time dimension, a channel dimension, a height dimension, and a width dimension;

processing the third pooling feature through seventh convolution to obtain a first channel sub-feature;

exchanging the positions of the channel dimension and the time dimension of the first channel sub-feature to obtain a second channel sub-feature;

extracting the characteristics of the attention mechanism from the second channel sub-characteristics to obtain enhanced channel sub-characteristics;

residual connection is carried out based on the first convolution characteristic and the enhanced channel sub-characteristic, so that a channel characteristic is obtained;

the channel grouping subunit performs feature extraction based on the motion feature and the feature after the channel feature is added to obtain a long-distance space-time feature, and the method comprises the following steps:

dividing the feature obtained by adding the motion feature and the channel feature in the channel dimension to obtain a set number of long-distance space-time sub-features;

Regarding the first long-distance space-time sub-feature, taking the first long-distance space-time sub-feature as a new first long-distance space-time sub-feature;

for the second long-distance space-time sub-feature, sequentially processing through channel-level time sequence sub-convolution and space sub-convolution to obtain a new second long-distance space-time sub-feature;

for the N long-distance space-time sub-feature, adding the new previous-distance space-time sub-feature and the N long-distance space-time sub-feature to obtain a residual feature;

processing the residual features sequentially through channel-level sequential sub-convolution and space sub-convolution to obtain new N long-distance space-time sub-features; wherein N is a positive integer greater than 2;

and splicing the new first long-distance space-time sub-feature, the new second long-distance space-time sub-feature and the new N long-distance space-time sub-feature in the channel dimension to obtain the long-distance space-time feature.

2. The method of claim 1, wherein inputting the first video frame corresponding to the video to be classified into a target video classification model to obtain the action classification result corresponding to the video to be classified, comprises:

the sparse sampling layer performs random sampling on the first video frame to obtain a second video frame, and performs data enhancement processing on the second video frame to obtain an enhanced second video frame; the data enhancement includes random flipping and/or angle cropping operations; wherein the video frame includes time information;

The two-way excitation channel grouping layer performs deep feature extraction based on the enhanced second video frame to obtain deep features;

the segmentation consensus layer calculates the average score of each video frame corresponding to the video to be classified on the same category according to the deep features;

converting the average score into a probability value based on a set function;

and taking the action category corresponding to the maximum probability value as an action classification result based on the probability values of the videos to be classified on all categories, and outputting the action classification result.

3. The method of claim 2, wherein the two-way excitation channel group layer includes at least four two-way excitation channel group modules, an input of a subsequent two-way excitation channel group module of the adjacent two-way excitation channel group modules being an output of a previous two-way excitation channel group module.

4. The method of claim 1, wherein the motion excitation subunit performs feature extraction based on the first convolution feature to obtain a motion feature, comprising:

carrying out channel number compression on the first convolution characteristic through third two-dimensional convolution to obtain a channel compression characteristic;

for the channel compression characteristics at the adjacent moment, extracting the characteristics of the channel compression characteristics at the moment t+1 through fourth two-dimensional convolution to obtain fourth convolution characteristics;

Subtracting the fourth convolution characteristic from the channel compression characteristic at the time t to obtain a plurality of motion sub-characteristics; wherein t is a positive integer, and the value range of t is between a first set value and a second set value;

splicing the plurality of motion sub-features in the time dimension to obtain a first complete motion sub-feature;

setting the motion characteristic at the last moment as a third set value, and obtaining the motion sub-characteristic at the last moment;

the first complete motion sub-feature and the motion sub-feature at the last moment are connected in series to obtain a second complete motion sub-feature;

processing the second complete motion sub-feature through global average pooling to obtain a first pooled feature;

adjusting the channel number of the first pooling feature based on a fifth convolution to obtain an adjusted first pooling feature;

extracting the first pooled feature by a feature extraction mechanism to obtain an enhanced motion sub-feature;

and carrying out residual connection based on the first convolution characteristic and the enhanced motion sub-characteristic to obtain a motion characteristic.

5. A video classification apparatus, comprising:

the video to be classified acquisition module is used for acquiring videos to be classified; the content of the video to be classified comprises at least one behavior action of a target object;

The action classification result obtaining module is used for inputting a first video frame corresponding to the video to be classified into a target video classification model to obtain an action classification result corresponding to the video to be classified; the target video classification model sequentially comprises a sparse sampling layer, a two-way excitation channel grouping layer and a segmentation consensus layer, wherein the sparse sampling layer, the two-way excitation channel grouping layer and the segmentation consensus layer are connected in cascade; the two-way excitation channel grouping layer comprises a two-way excitation channel grouping module;

The action classification result obtaining module is further used for: if the bottleneck unit to which the first two-dimensional convolution subunit belongs is the first bottleneck unit, the first two-dimensional convolution subunit performs feature extraction based on the enhanced second video frame to obtain a first convolution feature; the enhanced second video frame is obtained by the following steps: processing the first video frame through a sparse sampling layer to obtain the first video frame;

the action classification result obtaining module is further used for: dividing the feature obtained by adding the motion feature and the channel feature in the channel dimension to obtain a set number of long-distance space-time sub-features;

6. An electronic device, the electronic device comprising:

one or more processors;

storage means for storing one or more programs,

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the video classification method of any of claims 1-4.

7. A storage medium containing computer executable instructions for performing the video classification method of any of claims 1-4 when executed by a computer processor.