CN112733595A - Video action recognition method based on time segmentation network and storage medium - Google Patents

Video action recognition method based on time segmentation network and storage medium

Info

Publication number
CN112733595A
CN112733595A (application CN202011388953.1A)
Authority
CN
China
Prior art keywords
network
time
video
segments
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011388953.1A
Other languages
Chinese (zh)
Inventor
欧阳黎
程莺
彭冰莉
符娅娅
刘扬华
杨蓓
贺浩
周小艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Hunan Electric Power Co Ltd
Metering Center of State Grid Hunan Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Hunan Electric Power Co Ltd
Metering Center of State Grid Hunan Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Hunan Electric Power Co Ltd, Metering Center of State Grid Hunan Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN202011388953.1A priority Critical patent/CN112733595A/en
Publication of CN112733595A publication Critical patent/CN112733595A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a video action recognition method based on a time segmentation network and a storage medium. The method comprises the following steps: first, segmenting the input video data at equal intervals and randomly sampling a plurality of sub-segments from the video segments; second, modeling the sub-segments of each video segment with the time segmentation network to obtain a plurality of modeled sub-segments; then, setting the initial parameters of the time segmentation network; then, training the time segmentation network and dynamically adjusting the initial parameters with a stochastic gradient optimization method until the segment consensus loss function is minimized; and finally, inputting the modeled sub-segments into the trained time segmentation network, combining the outputs of the video segments through the segment consensus function, and obtaining the action class with the highest probability through Softmax fusion, which is the action recognition result. The method effectively solves the problem that existing video action recognition methods cannot model long-range temporal information in videos.

Description

Video action recognition method based on time segmentation network and storage medium
Technical Field
The invention relates to the technical field of motion recognition, in particular to a video motion recognition method based on a time segmentation network and a storage medium.
Background
The rapid rise of computer vision in recent years has laid the foundation for the development of human action recognition. Human action recognition has gradually shifted from traditional methods to deep neural networks: traditional methods require manual feature extraction, whereas deep neural networks achieve end-to-end training and recognition while keeping accuracy at a high level.
In video action recognition there are two important and complementary modalities: appearance (images) and optical flow. The performance of a recognition system depends to a large extent on whether the relevant information can be extracted from the video and exploited. However, extracting this information is difficult because of complexities such as scale changes, viewpoint changes and camera motion. In recent years deep convolutional neural networks have achieved great success in recognizing objects, scenes and other complex content in images, demonstrating strong modeling capability: with the help of large-scale supervised data sets they can learn discriminative representations directly from raw visual data. Mainstream convolutional network frameworks, however, usually focus on single images and short-term optical flow and lack the capability of modeling long-range temporal structure. Some methods have been proposed for this problem, but they rely primarily on dense temporal sampling with a predefined sampling interval; when applied to longer video sequences this incurs excessive computational cost, which limits practical application, and for videos exceeding the maximum sequence length there is a risk of losing important information. In addition, training a deep neural network in practice requires a large number of training samples to reach optimal performance, yet the publicly available action recognition data sets remain limited in size and diversity because data collection and annotation are difficult. Thus, although deep neural networks have had significant success in image classification, they also face the risk of overfitting here.
Disclosure of Invention
Technical problem to be solved
In view of the above drawbacks and deficiencies of the prior art, the present invention provides a video action recognition method based on a time segmentation network and a storage medium, which solve the technical problem that existing video action recognition methods cannot model long-range video information.
(II) technical scheme
In order to achieve the purpose, the invention adopts the main technical scheme that:
in a first aspect, an embodiment of the present invention provides a video action identification method based on a time-segmentation network, which includes:
s1, dividing the input video data into a plurality of video segments at equal intervals, and then executing the same random sampling operation on each video segment to obtain a plurality of sub-segments of each video segment;
s2, modeling the plurality of sub-segments of each video segment by using a time segmentation network to obtain a plurality of modeled sub-segments;
s3, setting initial parameters of the time segmentation network based on the parameters of the BN-Inception network;
s4, training a time segmentation network, and dynamically adjusting initial parameters based on a random gradient optimization method until a segment consensus loss function is minimum;
and S5, inputting the plurality of modeled sub-segments into the trained time segmentation network, combining the output of the plurality of video segments through a segment consensus loss function, and obtaining the action type with the highest probability in the video data through a Softmax fusion function, namely the action recognition result of the video data.
Optionally, step S1 includes:
s11, dividing the input video data into K video segments at equal intervals, which are expressed as:
{S1, S2, …, SK}, 3 ≤ K ≤ 10;
s12, using a sparse sampling strategy, randomly sampling one sub-segment from each video segment to obtain K sub-segments in total, expressed as:
{T1, T2, …, TK},
wherein each sub-segment comprises one frame of RGB image and an optical flow sequence;
and S13, performing the same data enhancement operation on each sub-segment to obtain a data-enhanced RGB image and optical flow sequence.
Optionally, in step S13, the data enhancement operation is:
s131, performing corner cropping on each sub-segment based on the corner points or the center of the RGB image and of the optical flow images in the optical flow sequence;
s132, with the size of the corner-cropped RGB image and optical flow sequence fixed, randomly selecting the width and the height of the cropping region from {256, 224, 192, 168} and performing scale jittering;
s133, scaling the scale-jittered cropping region to a fixed size to obtain the data-enhanced RGB image and optical flow sequence, as illustrated in the sketch below.
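For illustration only, and not as part of the claimed method, the corner cropping and scale jittering of steps S131 to S133 can be sketched in Python as follows; the function name, the nearest-neighbour resizing and the fixed output size of 224 are assumptions made for this sketch.

    import random
    import numpy as np

    CROP_SIZES = [256, 224, 192, 168]   # candidate widths/heights for scale jittering
    FIXED_SIZE = 224                    # assumed final input size, as in the embodiment

    def corner_crop_scale_jitter(frame: np.ndarray) -> np.ndarray:
        """Corner cropping plus scale jittering for one RGB frame or optical-flow image (H x W x C)."""
        h, w = frame.shape[:2]
        crop_h = min(random.choice(CROP_SIZES), h)
        crop_w = min(random.choice(CROP_SIZES), w)
        # candidate crop origins: the four corners and the centre of the image
        origins = [(0, 0), (0, w - crop_w), (h - crop_h, 0),
                   (h - crop_h, w - crop_w), ((h - crop_h) // 2, (w - crop_w) // 2)]
        top, left = random.choice(origins)
        crop = frame[top:top + crop_h, left:left + crop_w]
        # resize the jittered crop back to the fixed input size (nearest neighbour, for brevity)
        rows = (np.arange(FIXED_SIZE) * crop_h / FIXED_SIZE).astype(int)
        cols = (np.arange(FIXED_SIZE) * crop_w / FIXED_SIZE).astype(int)
        return crop[rows][:, cols]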
Optionally, in step S2, the sub-segments are modeled using a time-slicing network as follows:
TSN(T1, T2, …, TK) = Softmax(g(F(T1; W), F(T2; W), …, F(TK; W))),
G = g(F(T1; W), F(T2; W), …, F(TK; W)),
wherein TSN is the time segmentation network, which comprises a spatial stream network and a temporal stream network; F(TK; W) is a two-dimensional convolution function with parameters W acting on sub-segment TK; g is the aggregation function and G is the segment consensus it produces; and Softmax is the dual-stream fusion function.
Optionally, step S3 includes:
s31, pre-training the BN-Inception network on the ImageNet data set;
s32, taking the pre-trained BN-Inception network parameters as the initialization parameters of the spatial stream network, and fine-tuning them with the data-enhanced RGB images to obtain the adjusted spatial stream network parameters;
s33, taking the average of the first convolutional layer weights in the adjusted spatial stream network parameters as the initialization parameters of the temporal stream network, and fine-tuning them with the data-enhanced optical flow sequences to obtain the adjusted temporal stream network parameters;
and S34, taking the adjusted spatial stream network parameters and the adjusted temporal stream network parameters as the initial parameters of the spatial stream network and of the temporal stream network, respectively.
Optionally, step S4 includes:
s41, training the time segmentation network based on the data-enhanced RGB images and optical flow sequences, and dynamically adjusting the initial parameters of the time segmentation network with a stochastic gradient optimization method, wherein the batch size is set to 256 and the momentum to 0.9;
s42, for the spatial stream network, the learning rate is initialized to 0.001, the segment consensus loss function reaches its minimum when the number of training iterations reaches 4500, and the iteration stops; for the temporal stream network, the learning rate is initialized to 0.005, the segment consensus loss function reaches its minimum when the number of training iterations reaches 20000, and the iteration stops.
Optionally, in step S4, the formula of the segment consensus loss function is:
L(y, G) = -Σ(i=1 to C) yi (Gi - log Σ(j=1 to C) exp Gj),
where C is the total number of action classes, yi is the ground-truth label of action class i, Gi is the value of the i-th dimension of the segment consensus G and represents the mean of the class-i scores over the K sub-segments, Gi = g(Fi(T1), Fi(T2), …, Fi(TK)), Fi(Tj) (1 ≤ j ≤ K) denotes the probability score of the i-th class judged on sub-segment Tj, and Gj is the value of the j-th dimension of G.
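Purely as an illustrative aid, and assuming an averaging aggregation function g and the tensor shapes shown in the comments, the segment consensus loss above can be written in PyTorch as follows; the function name and shapes are assumptions for this sketch.

    import torch

    def segment_consensus_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        """Cross-entropy over the segment consensus.
        scores: (batch, K, C) class scores F(Tj; W) for every sub-segment
        labels: (batch,) ground-truth class indices
        """
        G = scores.mean(dim=1)                                    # consensus Gi: mean over the K sub-segments
        log_prob = G - torch.logsumexp(G, dim=1, keepdim=True)    # Gi - log sum_j exp(Gj)
        return -log_prob.gather(1, labels.unsqueeze(1)).mean()    # negative log-likelihood of the true class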
Optionally, step S5 includes:
s51, inputting the plurality of modeled sub-segments into the trained time segmentation network to calculate action class scores, wherein the RGB images of the sub-segments are sent into the spatial stream network to calculate action class scores, and the optical flow sequences are sent into the temporal stream network to calculate action class scores;
s52, combining the spatial stream network outputs and the temporal stream network outputs of the K video segments, each through the segment consensus function G, to obtain the consensus of action classes;
and S53, combining the consensus of the action classes with the dual-stream fusion function Softmax by means of weighted averaging, wherein the action class with the highest probability is the action recognition result of the video.
Optionally, the weight ratio of the weighted average is a ratio h of the score output by the spatial stream network to the score output by the temporal stream network, where h is greater than or equal to 0.5 and less than or equal to 1.
In a second aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, including: at least one processor; and at least one memory communicatively coupled to the processor, wherein the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform a video action recognition method based on a time-slicing network as described above.
(III) advantageous effects
The invention has the beneficial effects that: the video action recognition method based on the time segmentation network provided by the invention solves the problem that the traditional two-stream network can hardly learn long-range temporal information from a video. The method adopts a sparse sampling strategy and video-level supervision, which improves accuracy without increasing the amount of computation; the video segment results are first predicted by the time segmentation network, and the final result is obtained with a fusion function. The computation process is simple and the prediction of long-duration video actions is accurate, providing a new method for video action recognition.
Drawings
Fig. 1 is a schematic flow chart of a video motion recognition method based on a time-slicing network according to the present invention;
fig. 2 is a detailed flowchart of step S1 of a video motion recognition method based on a time-slice network according to the present invention;
fig. 3 is a specific flowchart of the data enhancement operation of step S13 of the video motion recognition method based on the time-slice network according to the present invention;
fig. 4 is a detailed flowchart of step S3 of a video motion recognition method based on a time-slice network according to the present invention;
fig. 5 is a flowchart illustrating a step S4 of a video motion recognition method based on a time-slice network according to the present invention;
fig. 6 is a detailed flowchart of step S5 of a video motion recognition method based on a time-slice network according to the present invention;
fig. 7 is a network diagram of a video motion recognition method based on a time-segment network according to the present invention.
Detailed Description
For the purpose of better explaining the present invention and to facilitate understanding, the present invention will be described in detail by way of specific embodiments with reference to the accompanying drawings.
Fig. 1 is a schematic flow diagram of a video motion recognition method based on a time-segment network according to an embodiment of the present invention, and as shown in fig. 1, the method includes: firstly, segmenting input video data at equal intervals, and randomly acquiring a plurality of sub-segments from each video segment; secondly, modeling a plurality of sub-segments by using a time segmentation network; then, setting initial parameters of the time-segmentation network based on the parameters of the BN-inclusion network; then, training a time segmentation network, and dynamically adjusting initial parameters based on a random gradient optimization method until a segment consensus loss function is minimum; and finally, fusing the time segmentation network to obtain the action recognition result of the video segmentation.
The video action recognition method based on the time segmentation network provided by the invention solves the problem that the traditional two-stream network can hardly learn long-range temporal information from a video. The time segmentation network is an end-to-end deep learning network; its specific supervision method, applied at the video level, is embodied in the training stage of the network. The invention adopts a sparse sampling strategy and video-level supervision, improves accuracy without increasing the amount of computation, performs a preliminary prediction of the video segment results through the time segmentation network, and obtains the final result with a fusion function. The computation process is simple and the prediction of long-duration video actions is accurate, providing a new method for video action recognition.
For a better understanding of the above-described technical solutions, exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
Specifically, the invention discloses a video motion identification method based on a time segmentation network, which comprises the following steps:
and S1, dividing the input video data into a plurality of video segments at equal intervals, and executing the same random sampling operation on each video segment to obtain a plurality of sub-segments of each video segment.
Fig. 2 is a detailed flowchart of step S1 of the video motion recognition method based on the time-slice network according to the present invention, as shown in fig. 2, step S1 includes:
s11, dividing the input video data into K video segments at equal intervals, which are expressed as:
{S1, S2, …, SK}, 3 ≤ K ≤ 10;
s12, using a sparse sampling strategy, randomly sampling one sub-segment from each video segment to obtain K sub-segments in total, expressed as:
{T1, T2, …, TK},
In an embodiment of the invention, sub-segment T1 is randomly sampled from video segment S1, sub-segment T2 is randomly sampled from video segment S2, and so on, until sub-segment TK is randomly sampled from video segment SK; that is, TK is the result of randomly sampling the corresponding video segment SK in {S1, S2, …, SK}.
Each sub-segment comprises an RGB image and an optical flow sequence, wherein the RGB image is a single frame and the optical flow sequence comprises stacked optical flow and warped optical flow. In general, optical flow arises from the movement of foreground objects in the scene, from camera motion, or from both. The warped optical flow estimates the displacement of the camera and compensates the optical flow generated by camera motion, so that the motion information represented by the flow is concentrated on the foreground objects.
And S13, the same data enhancement operation is performed on each sub-segment to obtain a data-enhanced RGB image and optical flow sequence; the data enhancement provides more data for subsequent network training and prevents the overfitting caused by an insufficient number of samples.
Fig. 3 is a specific flowchart of the data enhancement operation of step S13 of the video motion recognition method based on the time-slicing network according to the present invention, as shown in fig. 3, in step S13, the data enhancement operation is:
s131, corner cropping is performed on each sub-segment based on the corner points or the center of the RGB image and of the optical flow images in the optical flow sequence.
S132, with the size of the corner-cropped image and optical flow sequence fixed, the width and the height of the cropping region are randomly selected from {256, 224, 192, 168} and scale jittering is performed.
S133, the scale-jittered cropping region is scaled to a fixed size, yielding the data-enhanced RGB image and optical flow sequence.
In the embodiment of the invention, a 5 s high-jump video is selected from the high-jump category of the UCF101 data set as input; the video size is 320 × 240 and the frame rate is 25 fps. The input video is divided evenly into 3 segments, one RGB frame is randomly sampled from each segment with the sparse sampling strategy, the 5 frames following each sampled RGB frame are taken in order, and the stacked and warped optical flows between these frames are extracted. The sampled RGB images and optical flows are first corner-cropped, then scale-jittered, and finally scaled to 224 × 224, laying the foundation for subsequent network training.
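As an illustrative sketch of the equal-interval segmentation and sparse sampling used in this embodiment (3 segments, one random RGB frame per segment, and the 5 following frames for the optical flow); the function name and the clamping of the frame budget are assumptions rather than part of the patent:

    import random

    def sparse_sample_indices(num_frames: int, k: int = 3, flow_len: int = 5):
        """Pick one random RGB frame index per equal-interval segment, together with
        the indices of the following frames used for the stacked/warped optical flow."""
        seg_len = num_frames // k
        samples = []
        for s in range(k):
            start = s * seg_len
            # leave room inside the segment for the frames used to compute optical flow
            idx = random.randint(start, start + max(seg_len - flow_len - 1, 0))
            samples.append({"rgb": idx, "flow": list(range(idx, idx + flow_len))})
        return samples

    # e.g. a 5 s clip at 25 fps has 125 frames
    print(sparse_sample_indices(125, k=3))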
S2, modeling the plurality of sub-segments of each video segment by using a time segmentation network to obtain a plurality of modeled sub-segments. Specifically, the modeling approach is as follows:
TSN(T1, T2, …, TK) = Softmax(g(F(T1; W), F(T2; W), …, F(TK; W))),
G = g(F(T1; W), F(T2; W), …, F(TK; W)),
where TSN is the time segmentation network, F(TK; W) is a two-dimensional convolution function with parameters W acting on sub-segment TK, g is the aggregation function, G is the segment consensus it produces, and Softmax is the dual-stream fusion function.
In step S2, feature extraction and action recognition are performed with a BN-Inception network pre-trained on a large image data set; the spatial stream network outputs a score for each action class from the RGB images of the K sub-segments, and the temporal stream network outputs a score for each action class from the optical flow sequences of the K sub-segments. In the embodiment of the invention, the running category scores higher than the other action categories in the outputs for sub-segments T1 and T2, and the jumping category scores higher than the other action categories in the output for sub-segment T3. In this embodiment the high jump consists of two parts, running and jumping, defined here as follows: jumping is the movement in which both feet leave the ground together and the body moves mainly upward; running is the movement in which one foot leaves the ground and strides forward, driving the body forward.
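To make the structure of a single stream concrete, the following PyTorch sketch applies a shared 2D backbone to each of the K sub-segments and averages the per-sub-segment class scores into the consensus G; the class name, the assumption that the backbone returns a flat feature vector, and the averaging consensus are illustrative choices, not the patent's implementation.

    import torch
    import torch.nn as nn

    class SegmentConsensusStream(nn.Module):
        """One TSN-style stream: a shared 2D CNN F(.; W) applied to every sub-segment,
        followed by an averaging consensus g and a Softmax over classes."""

        def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
            super().__init__()
            self.backbone = backbone                    # stands in for BN-Inception; assumed to map (N, C, H, W) to (N, feat_dim)
            self.classifier = nn.Linear(feat_dim, num_classes)

        def forward(self, clips: torch.Tensor) -> torch.Tensor:
            # clips: (batch, K, C, H, W) -- one frame (or one stacked flow volume) per sub-segment
            b, k = clips.shape[:2]
            x = clips.flatten(0, 1)                     # treat the K sub-segments as extra batch items
            scores = self.classifier(self.backbone(x))  # F(Tj; W) for every sub-segment
            scores = scores.view(b, k, -1)
            consensus = scores.mean(dim=1)              # aggregation function g = average, giving G
            return torch.softmax(consensus, dim=1)      # class probabilities for this stream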
And S3, setting the initial parameters of the time segmentation network based on the parameters of the BN-Inception network. The initial parameters are set by initializing the temporal stream network from the spatial stream network: first, the optical flow fields are discretized into the interval from 0 to 255 through a linear transformation so that their range is the same as that of the RGB images; then the weights of the first convolutional layer of the RGB model are averaged across the input channels, and this average is replicated as the initial parameters of each input channel of the temporal stream network.
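A minimal sketch of this cross-modality initialization of the temporal stream's first convolutional layer is given below; the helper name and the choice of 10 optical-flow input channels (5 stacked flow frames x 2 directions) are assumptions made for illustration.

    import torch
    import torch.nn as nn

    def init_temporal_first_conv(rgb_conv: nn.Conv2d, flow_channels: int = 10) -> nn.Conv2d:
        """Average the pretrained RGB first-layer weights over their 3 input channels
        and replicate that mean for every optical-flow input channel."""
        flow_conv = nn.Conv2d(flow_channels, rgb_conv.out_channels,
                              kernel_size=rgb_conv.kernel_size,
                              stride=rgb_conv.stride,
                              padding=rgb_conv.padding,
                              bias=rgb_conv.bias is not None)
        with torch.no_grad():
            mean_w = rgb_conv.weight.mean(dim=1, keepdim=True)            # (out, 1, kH, kW)
            flow_conv.weight.copy_(mean_w.repeat(1, flow_channels, 1, 1)) # replicate across flow channels
            if rgb_conv.bias is not None:
                flow_conv.bias.copy_(rgb_conv.bias)
        return flow_conv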
Fig. 4 is a detailed flowchart of step S3 of the video motion recognition method based on the time-slice network according to the present invention, as shown in fig. 4, step S3 includes:
s31: the BN-inclusion network was pre-trained on the ImageNet dataset. The use of the ImageNet dataset in the embodiments of the present invention is based on the following considerations: if viewed from the bottom logic, the deep learning network has a great difficulty in convergence, and the pre-training can be regarded as a relatively complete parameter initialization process, so as to try to prevent the network from failing to train convergence due to poor initialization.
S32: and taking the pre-trained BN-inclusion network parameters as the start-up parameters of the spatial stream network, and adjusting the start-up parameters of the spatial stream network by using the RGB images after data enhancement to obtain the adjusted spatial stream network parameters.
S33: and taking the average value of the first convolution layer weight in the adjusted spatial stream network parameters as the initial adjustment parameter of the time stream network, and adjusting the initial adjustment parameter of the time stream network by using the optical stream sequence after data enhancement to obtain the adjusted time stream network parameters.
S34: and taking the adjusted spatial stream network parameters and the adjusted time stream network parameters as initial parameters of the spatial stream network and initial parameters of the time stream network respectively.
S4, training the time segmentation network, and dynamically adjusting the initial parameters based on a random gradient optimization method until the segment consensus loss function is minimum.
Fig. 5 is a detailed flowchart of step S4 of the video motion recognition method based on the time-slice network according to the present invention, as shown in fig. 5, step S4 includes:
s41, training the time segmentation network based on the data-enhanced RGB images and optical flow sequences, and dynamically adjusting the initial parameters of the time segmentation network with mini-batch stochastic gradient descent, wherein the batch size is set to 256 and the momentum to 0.9;
s42, for the spatial stream network, the learning rate is initialized to 0.001, the segment consensus loss function reaches its minimum after 4500 training iterations, and the iteration stops; for the temporal stream network, the learning rate is initialized to 0.005, the segment consensus loss function reaches its minimum after 20000 training iterations, and the iteration stops. A sketch of this configuration is given below.
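For illustration, the training configuration of steps S41 and S42 can be set up as sketched below; the helper name and the use of an iteration cap as the stopping criterion are assumptions, while the numerical values (batch size 256, momentum 0.9, learning rates 0.001 and 0.005, 4500 and 20000 iterations) are taken from the text.

    import torch

    def make_optimizer(model: torch.nn.Module, stream: str):
        """Mini-batch SGD configured with the hyper-parameters of steps S41/S42."""
        lr = 0.001 if stream == "spatial" else 0.005
        max_iters = 4500 if stream == "spatial" else 20000
        optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
        return optimizer, max_iters

    # the batch size of 256 would be applied when building the data loader, e.g.
    # loader = torch.utils.data.DataLoader(dataset, batch_size=256, shuffle=True)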
And S5, inputting the plurality of modeled sub-segments into the trained time segmentation network, combining the outputs of the plurality of video segments through the segment consensus function, and obtaining the action class with the highest probability in the video data through the Softmax fusion function, which is the action recognition result of the video data. In an embodiment of the present invention, the segment consensus function G combines the outputs of the K sub-segments: by averaging, the consensus derives the score of the running category from T1 and T2, while the jumping category, identified from T3, is the second-ranked action category, its score being lower than that of running. The dual-stream fusion function Softmax combines the results of the segment consensus function G by weighted averaging; preferably the weight ratio is set to space : time = 1 : 1.5, and when the stacked optical flow and the warped optical flow are used simultaneously the temporal weight of 1.5 is split into 1 for the stacked optical flow and 0.5 for the warped optical flow. The Softmax fusion is a normalization process; the probability of the high-jump action category in the input video obtained through Softmax fusion is the highest, so high jump is the category to which the action of the video belongs.
Fig. 6 is a detailed flowchart of step S5 of the video motion recognition method based on the time-slice network according to the present invention, as shown in fig. 6, step S5 includes:
and S51, inputting the plurality of modeled sub-segments into the trained time segmentation network to calculate action class scores, wherein the RGB images of the sub-segments are sent into the spatial stream network to calculate action class scores and the optical flow sequences are sent into the temporal stream network to calculate action class scores.
And S52, combining the spatial stream network outputs and the temporal stream network outputs of the K video segments, each through the segment consensus function G, to obtain the consensus of action classes.
And S53, combining the consensus of the action classes with the dual-stream fusion function Softmax by means of weighted averaging, wherein the action class with the highest probability is the action recognition result of the video.
Further, the weight proportion of the weighted average is the proportion h of the score output by the spatial stream network to the score output by the temporal stream network, wherein h is more than or equal to 0.5 and less than or equal to 1.
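The weighted dual-stream fusion described above can be sketched as follows; it assumes the consensus score vectors of the three cues are already computed and uses the example weights of the embodiment (spatial 1.0, stacked flow 1.0, warped flow 0.5, i.e. spatial : temporal = 1 : 1.5).

    import numpy as np

    def fuse_streams(spatial_G: np.ndarray, stacked_G: np.ndarray, warped_G: np.ndarray) -> int:
        """Weighted average of the per-stream consensus scores followed by Softmax
        normalisation; returns the index of the most probable action class."""
        fused = 1.0 * spatial_G + 1.0 * stacked_G + 0.5 * warped_G
        probs = np.exp(fused - fused.max())
        probs /= probs.sum()
        return int(np.argmax(probs))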
Further, the present invention also provides a non-transitory computer-readable storage medium comprising: at least one processor; and at least one memory communicatively coupled to the processor, wherein the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform a video motion recognition method based on a time-slicing network as described above.
To sum up, the present invention discloses a video action recognition method based on a time segmentation network and a storage medium. Fig. 7 is a network diagram of the method; as shown in fig. 7, the method includes: first, segmenting the input video data, randomly sampling a sub-segment from each video segment with the sparse sampling strategy, and modeling the sub-segments with the time segmentation network; then, performing feature extraction and action recognition with the spatial stream network and the temporal stream network; combining the outputs of the spatial stream networks and of the temporal stream networks into a consensus of action classes; and finally fusing the consensus of action classes through the dual-stream fusion function Softmax to obtain the video action recognition result.
The method solves the problem that the long-range temporal information of a video is difficult to learn with a traditional two-stream network. The time segmentation network models the long-range temporal structure of the video and combines a sparse sampling strategy with video-level supervision, so that action recognition over the whole video is efficient. The method is trained end to end, is highly automated, has a wide application range and achieves high recognition accuracy.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions.
It should be noted that in the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the terms first, second, third and the like are for convenience only and do not denote any order. These words may be understood as part of the name of the component.
Furthermore, it should be noted that in the description of the present specification, the description of the term "one embodiment", "some embodiments", "examples", "specific examples" or "some examples", etc., means that a specific feature, structure, material or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Moreover, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, the claims should be construed to include preferred embodiments and all changes and modifications that fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention should also include such modifications and variations.

Claims (10)

1. A video motion recognition method based on a time segmentation network is characterized by comprising the following steps:
s1, dividing the input video data into a plurality of video segments at equal intervals, and then executing the same random sampling operation on each video segment to obtain a plurality of sub-segments of each video segment;
s2, modeling the plurality of sub-segments of each video segment by using a time segmentation network to obtain a plurality of modeled sub-segments;
s3, setting initial parameters of the time segmentation network based on the parameters of the BN-Inception network;
s4, training a time segmentation network, and dynamically adjusting initial parameters based on a random gradient optimization method until a segment consensus loss function is minimum;
and S5, inputting the plurality of modeled sub-segments into the trained time segmentation network, combining the output of the plurality of video segments through a segment consensus loss function, and obtaining the action type with the highest probability in the video data through a Softmax fusion function, namely the action recognition result of the video data.
2. The video motion recognition method based on the time-slicing network as claimed in claim 1, wherein the step S1 comprises:
s11, dividing the input video data into K video segments at equal intervals, which are expressed as:
{S1, S2, …, SK}, 3 ≤ K ≤ 10;
s12, using a sparse sampling strategy, randomly sampling one sub-segment from each video segment to obtain K sub-segments in total, expressed as:
{T1, T2, …, TK},
wherein each sub-segment comprises one frame of RGB image and an optical flow sequence;
and S13, performing the same data enhancement operation on each sub-segment to obtain a data-enhanced RGB image and optical flow sequence.
3. The video motion recognition method based on time-slicing network as claimed in claim 2, wherein in step S13, the data enhancement operation is:
s131, performing corner cropping on each sub-segment based on the corner points or the center of the RGB image and of the optical flow images in the optical flow sequence;
s132, with the size of the corner-cropped RGB image and optical flow sequence fixed, randomly selecting the width and the height of the cropping region from {256, 224, 192, 168} and performing scale jittering;
s133, scaling the scale-jittered cropping region to a fixed size to obtain the data-enhanced RGB image and optical flow sequence.
4. The video motion recognition method based on time-slicing network as claimed in claim 3, wherein in step S2, the sub-segments are modeled by using the time-slicing network as follows:
TSN(T1, T2, …, TK) = Softmax(g(F(T1; W), F(T2; W), …, F(TK; W))),
G = g(F(T1; W), F(T2; W), …, F(TK; W)),
wherein TSN is the time segmentation network, which comprises a spatial stream network and a temporal stream network; F(TK; W) is a two-dimensional convolution function with parameters W acting on sub-segment TK; g is the aggregation function and G is the segment consensus it produces; and Softmax is the dual-stream fusion function.
5. The video motion recognition method based on the time-slicing network as claimed in claim 4, wherein the step S3 comprises:
s31, pre-training the BN-Inception network on the ImageNet data set;
s32, taking the pre-trained BN-Inception network parameters as the initialization parameters of the spatial stream network, and fine-tuning them with the data-enhanced RGB images to obtain the adjusted spatial stream network parameters;
s33, taking the average of the first convolutional layer weights in the adjusted spatial stream network parameters as the initialization parameters of the temporal stream network, and fine-tuning them with the data-enhanced optical flow sequences to obtain the adjusted temporal stream network parameters;
and S34, taking the adjusted spatial stream network parameters and the adjusted temporal stream network parameters as the initial parameters of the spatial stream network and of the temporal stream network, respectively.
6. The video motion recognition method based on the time-slicing network as claimed in claim 5, wherein the step S4 comprises:
s41, training the time segmentation network based on the data-enhanced RGB images and optical flow sequences, and dynamically adjusting the initial parameters of the time segmentation network with a stochastic gradient optimization method, wherein the batch size is set to 256 and the momentum to 0.9;
s42, for the spatial stream network, the learning rate is initialized to 0.001, the segment consensus loss function reaches its minimum when the number of training iterations reaches 4500, and the iteration stops; for the temporal stream network, the learning rate is initialized to 0.005, the segment consensus loss function reaches its minimum when the number of training iterations reaches 20000, and the iteration stops.
7. The method according to claim 6, wherein in step S4, the formula of the segment consensus loss function is:
L(y, G) = -Σ(i=1 to C) yi (Gi - log Σ(j=1 to C) exp Gj),
where C is the total number of action classes, yi is the ground-truth label of action class i, Gi is the value of the i-th dimension of the segment consensus G and represents the mean of the class-i scores over the K sub-segments, Gi = g(Fi(T1), Fi(T2), …, Fi(TK)), Fi(Tj) (1 ≤ j ≤ K) denotes the probability score of the i-th class judged on sub-segment Tj, and Gj is the value of the j-th dimension of G.
8. The video motion recognition method based on the time-slicing network as claimed in claim 7, wherein the step S5 comprises:
s51, inputting the plurality of modeled sub-segments into the trained time segmentation network to calculate action class scores, wherein the RGB images of the sub-segments are sent into the spatial stream network to calculate action class scores and the optical flow sequences are sent into the temporal stream network to calculate action class scores;
s52, combining the spatial stream network outputs and the temporal stream network outputs of the K video segments, each through the segment consensus function G, to obtain the consensus of action classes;
and S53, combining the consensus of the action classes with the dual-stream fusion function Softmax by means of weighted averaging, wherein the action class with the highest probability is the action recognition result of the video.
9. The method as claimed in claim 8, wherein the weighted average has a weight ratio h of the score of the spatial stream network output to the score of the temporal stream network output, wherein h is greater than or equal to 0.5 and less than or equal to 1.
10. A non-transitory computer-readable storage medium, comprising:
at least one processor;
and at least one memory communicatively coupled to the processor, wherein the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform a method for video motion recognition based on a time-slicing network according to any one of claims 1 to 9.
CN202011388953.1A 2020-12-02 2020-12-02 Video action recognition method based on time segmentation network and storage medium Pending CN112733595A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011388953.1A CN112733595A (en) 2020-12-02 2020-12-02 Video action recognition method based on time segmentation network and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011388953.1A CN112733595A (en) 2020-12-02 2020-12-02 Video action recognition method based on time segmentation network and storage medium

Publications (1)

Publication Number Publication Date
CN112733595A (en) 2021-04-30

Family

ID=75598109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011388953.1A Pending CN112733595A (en) 2020-12-02 2020-12-02 Video action recognition method based on time segmentation network and storage medium

Country Status (1)

Country Link
CN (1) CN112733595A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170255832A1 (en) * 2016-03-02 2017-09-07 Mitsubishi Electric Research Laboratories, Inc. Method and System for Detecting Actions in Videos
US20190147235A1 (en) * 2016-06-02 2019-05-16 Intel Corporation Recognition of activity in a video image sequence using depth information
CN107480642A (en) * 2017-08-18 2017-12-15 深圳市唯特视科技有限公司 A kind of video actions recognition methods based on Time Domain Piecewise network
CN108764128A (en) * 2018-05-25 2018-11-06 华中科技大学 A kind of video actions recognition methods based on sparse time slice network
CN110032942A (en) * 2019-03-15 2019-07-19 中山大学 Action identification method based on Time Domain Piecewise and signature differential
CN109993077A (en) * 2019-03-18 2019-07-09 南京信息工程大学 A kind of Activity recognition method based on binary-flow network
CN110188654A (en) * 2019-05-27 2019-08-30 东南大学 A kind of video behavior recognition methods not cutting network based on movement
CN110647903A (en) * 2019-06-20 2020-01-03 杭州趣维科技有限公司 Short video frequency classification method
CN111462183A (en) * 2020-03-31 2020-07-28 山东大学 Behavior identification method and system based on attention mechanism double-current network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIMIN WANG 等: "Temporal Segment Networks for Action Recognition in Videos", 《IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE》 *
焦红虹 et al.: "基于光流场的时间分段网络行为识别" [Action recognition with temporal segment networks based on optical flow fields], 《云南大学学报(自然科学版)》 [Journal of Yunnan University (Natural Sciences Edition)] *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination