CN112733595A - Video action recognition method based on time segmentation network and storage medium - Google Patents

Video action recognition method based on time segmentation network and storage medium

Info

Publication number
CN112733595A
CN112733595A (application CN202011388953.1A)
Authority
CN
China
Prior art keywords
network
time
video
segments
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011388953.1A
Other languages
Chinese (zh)
Inventor
欧阳黎
程莺
彭冰莉
符娅娅
刘扬华
杨蓓
贺浩
周小艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Hunan Electric Power Co Ltd
Metering Center of State Grid Hunan Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Hunan Electric Power Co Ltd
Metering Center of State Grid Hunan Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Hunan Electric Power Co Ltd, Metering Center of State Grid Hunan Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN202011388953.1A priority Critical patent/CN112733595A/en
Publication of CN112733595A publication Critical patent/CN112733595A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a video action recognition method based on a time segmentation network and a storage medium. The method comprises the following steps: first, segmenting the input video data at equal intervals and randomly sampling a plurality of sub-segments from the video segments; second, modeling the sub-segments of each video segment with the time segmentation network to obtain a plurality of modeled sub-segments; then, setting the initial parameters of the time segmentation network; then, training the time segmentation network and dynamically adjusting the initial parameters with a stochastic gradient optimization method until the segment consensus loss function is minimized; and finally, inputting the modeled sub-segments into the trained time segmentation network, combining the outputs of the video segments through the segment consensus function, and obtaining the action class with the highest probability through Softmax fusion, which is the action recognition result. The method effectively solves the problem that existing video action recognition methods cannot model long-range temporal information in videos.

Description

Video action recognition method based on time segmentation network and storage medium
Technical Field
The invention relates to the technical field of motion recognition, in particular to a video motion recognition method based on a time segmentation network and a storage medium.
Background
The rapid rise of computer vision in recent years has laid the foundation for the development of human action recognition. Human action recognition has gradually shifted from traditional methods to deep neural networks: traditional methods require manual feature extraction, whereas deep neural networks achieve end-to-end training and recognition while keeping accuracy at a high level.
In video action recognition there are two important and complementary modalities: appearance (images) and optical flow. The performance of a recognition system depends to a large extent on whether the relevant information can be extracted from the video and exploited. However, extracting this information is difficult because of complexities such as scale changes, viewpoint changes and camera motion. In recent years deep convolutional neural networks have achieved great success in recognizing objects, scenes and other complex content in images, demonstrating strong modeling capability: with the help of large-scale supervised data sets they can learn discriminative representations directly from raw visual data. Mainstream convolutional network frameworks, however, usually focus on single images and short-term optical flow and lack the capability of modeling long-range temporal structure. Some methods have been proposed for this problem, but they rely primarily on dense temporal sampling with a predefined sampling interval; when applied to longer video sequences this incurs excessive computational cost, which limits practical application, and for videos exceeding the maximum sequence length there is a risk of losing important information. In addition, training a deep neural network in practice requires a large number of training samples to reach optimal performance, yet the publicly available action recognition data sets remain limited in size and diversity because data collection and annotation are difficult. Thus, although deep neural networks have had significant success in image classification, they also face the risk of overfitting here.
Disclosure of Invention
Technical problem to be solved
In view of the above drawbacks and deficiencies of the prior art, the present invention provides a video action recognition method based on a time segmentation network and a storage medium, which solve the technical problem that existing video action recognition methods cannot model long-range video information.
(II) technical scheme
In order to achieve the purpose, the invention adopts the main technical scheme that:
in a first aspect, an embodiment of the present invention provides a video action identification method based on a time-segmentation network, which includes:
s1, dividing the input video data into a plurality of video segments at equal intervals, and then executing the same random sampling operation on each video segment to obtain a plurality of sub-segments of each video segment;
s2, modeling the plurality of sub-segments of each video segment by using a time segmentation network to obtain a plurality of modeled sub-segments;
s3, setting initial parameters of the time segmentation network based on the parameters of the BN-Inception network;
s4, training a time segmentation network, and dynamically adjusting initial parameters based on a random gradient optimization method until a segment consensus loss function is minimum;
and S5, inputting the plurality of modeled sub-segments into the trained time segmentation network, combining the output of the plurality of video segments through a segment consensus loss function, and obtaining the action type with the highest probability in the video data through a Softmax fusion function, namely the action recognition result of the video data.
Optionally, step S1 includes:
s11, dividing the input video data into K video segments at equal intervals, which are expressed as:
{S1, S2, …, SK}, 3 ≤ K ≤ 10;
s12, using a sparse sampling strategy, randomly sampling one sub-segment from each video segment to obtain K sub-segments in total, expressed as:
{T1, T2, …, TK},
wherein each sub-segment comprises one frame of RGB image and an optical flow sequence;
and S13, performing the same data enhancement operation on each sub-segment to obtain a data-enhanced RGB image and optical flow sequence.
Optionally, in step S13, the data enhancement operation is:
s131, performing corner cropping on each sub-segment based on the corner points or the center of the RGB image and of the optical flow images in the optical flow sequence;
s132, with the size of the corner-cropped RGB image and optical flow sequence fixed, randomly selecting the width and the height of the cropping region from {256, 224, 192, 168} and performing scale jittering;
s133, scaling the scale-jittered cropping region to a fixed size to obtain the data-enhanced RGB image and optical flow sequence, as illustrated in the sketch below.
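For illustration only, and not as part of the claimed method, the corner cropping and scale jittering of steps S131 to S133 can be sketched in Python as follows; the function name, the nearest-neighbour resizing and the fixed output size of 224 are assumptions made for this sketch.

    import random
    import numpy as np

    CROP_SIZES = [256, 224, 192, 168]   # candidate widths/heights for scale jittering
    FIXED_SIZE = 224                    # assumed final input size, as in the embodiment

    def corner_crop_scale_jitter(frame: np.ndarray) -> np.ndarray:
        """Corner cropping plus scale jittering for one RGB frame or optical-flow image (H x W x C)."""
        h, w = frame.shape[:2]
        crop_h = min(random.choice(CROP_SIZES), h)
        crop_w = min(random.choice(CROP_SIZES), w)
        # candidate crop origins: the four corners and the centre of the image
        origins = [(0, 0), (0, w - crop_w), (h - crop_h, 0),
                   (h - crop_h, w - crop_w), ((h - crop_h) // 2, (w - crop_w) // 2)]
        top, left = random.choice(origins)
        crop = frame[top:top + crop_h, left:left + crop_w]
        # resize the jittered crop back to the fixed input size (nearest neighbour, for brevity)
        rows = (np.arange(FIXED_SIZE) * crop_h / FIXED_SIZE).astype(int)
        cols = (np.arange(FIXED_SIZE) * crop_w / FIXED_SIZE).astype(int)
        return crop[rows][:, cols]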
Optionally, in step S2, the sub-segments are modeled using a time-slicing network as follows:
TSN(T1, T2, …, TK) = Softmax(g(F(T1; W), F(T2; W), …, F(TK; W))),
G = g(F(T1; W), F(T2; W), …, F(TK; W)),
wherein TSN is the time segmentation network, which comprises a spatial stream network and a temporal stream network; F(TK; W) is a two-dimensional convolution function with parameters W acting on sub-segment TK; g is the aggregation function and G is the segment consensus it produces; and Softmax is the dual-stream fusion function.
Optionally, step S3 includes:
s31, pre-training the BN-Inception network on the ImageNet data set;
s32, taking the pre-trained BN-Inception network parameters as the initialization parameters of the spatial stream network, and fine-tuning them with the data-enhanced RGB images to obtain the adjusted spatial stream network parameters;
s33, taking the average of the first convolutional layer weights in the adjusted spatial stream network parameters as the initialization parameters of the temporal stream network, and fine-tuning them with the data-enhanced optical flow sequences to obtain the adjusted temporal stream network parameters;
and S34, taking the adjusted spatial stream network parameters and the adjusted temporal stream network parameters as the initial parameters of the spatial stream network and of the temporal stream network, respectively.
Optionally, step S4 includes:
s41, training the time segmentation network based on the data-enhanced RGB images and optical flow sequences, and dynamically adjusting the initial parameters of the time segmentation network with a stochastic gradient optimization method, wherein the batch size is set to 256 and the momentum to 0.9;
s42, for the spatial stream network, the learning rate is initialized to 0.001, the segment consensus loss function reaches its minimum when the number of training iterations reaches 4500, and the iteration stops; for the temporal stream network, the learning rate is initialized to 0.005, the segment consensus loss function reaches its minimum when the number of training iterations reaches 20000, and the iteration stops.
Optionally, in step S4, the formula of the segment consensus loss function is:
L(y, G) = -Σ(i=1 to C) yi (Gi - log Σ(j=1 to C) exp Gj),
where C is the total number of action classes, yi is the ground-truth label of action class i, Gi is the value of the i-th dimension of the segment consensus G and represents the mean of the class-i scores over the K sub-segments, Gi = g(Fi(T1), Fi(T2), …, Fi(TK)), Fi(Tj) (1 ≤ j ≤ K) denotes the probability score of the i-th class judged on sub-segment Tj, and Gj is the value of the j-th dimension of G.
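Purely as an illustrative aid, and assuming an averaging aggregation function g and the tensor shapes shown in the comments, the segment consensus loss above can be written in PyTorch as follows; the function name and shapes are assumptions for this sketch.

    import torch

    def segment_consensus_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        """Cross-entropy over the segment consensus.
        scores: (batch, K, C) class scores F(Tj; W) for every sub-segment
        labels: (batch,) ground-truth class indices
        """
        G = scores.mean(dim=1)                                    # consensus Gi: mean over the K sub-segments
        log_prob = G - torch.logsumexp(G, dim=1, keepdim=True)    # Gi - log sum_j exp(Gj)
        return -log_prob.gather(1, labels.unsqueeze(1)).mean()    # negative log-likelihood of the true class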
Optionally, step S5 includes:
s51, inputting the plurality of modeled sub-segments into the trained time segmentation network to calculate action class scores, wherein the RGB images of the sub-segments are sent into the spatial stream network to calculate action class scores, and the optical flow sequences are sent into the temporal stream network to calculate action class scores;
s52, combining the spatial stream network outputs and the temporal stream network outputs of the K video segments, each through the segment consensus function G, to obtain the consensus of action classes;
and S53, combining the consensus of the action classes with the dual-stream fusion function Softmax by means of weighted averaging, wherein the action class with the highest probability is the action recognition result of the video.
Optionally, the weight ratio of the weighted average is a ratio h of the score output by the spatial stream network to the score output by the temporal stream network, where h is greater than or equal to 0.5 and less than or equal to 1.
In a second aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, including: at least one processor; and at least one memory communicatively coupled to the processor, wherein the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform a video action recognition method based on a time-slicing network as described above.
(III) advantageous effects
The invention has the beneficial effects that: the video action recognition method based on the time segmentation network provided by the invention solves the problem that the traditional two-stream network can hardly learn long-range temporal information from a video. The method adopts a sparse sampling strategy and video-level supervision, which improves accuracy without increasing the amount of computation; the video segment results are first predicted by the time segmentation network, and the final result is obtained with a fusion function. The computation process is simple and the prediction of long-duration video actions is accurate, providing a new method for video action recognition.
Drawings
Fig. 1 is a schematic flow chart of a video motion recognition method based on a time-slicing network according to the present invention;
fig. 2 is a detailed flowchart of step S1 of a video motion recognition method based on a time-slice network according to the present invention;
fig. 3 is a specific flowchart of the data enhancement operation of step S13 of the video motion recognition method based on the time-slice network according to the present invention;
fig. 4 is a detailed flowchart of step S3 of a video motion recognition method based on a time-slice network according to the present invention;
fig. 5 is a flowchart illustrating a step S4 of a video motion recognition method based on a time-slice network according to the present invention;
fig. 6 is a detailed flowchart of step S5 of a video motion recognition method based on a time-slice network according to the present invention;
fig. 7 is a network diagram of a video motion recognition method based on a time-segment network according to the present invention.
Detailed Description
For the purpose of better explaining the present invention and to facilitate understanding, the present invention will be described in detail by way of specific embodiments with reference to the accompanying drawings.
Fig. 1 is a schematic flow diagram of a video motion recognition method based on a time-segment network according to an embodiment of the present invention, and as shown in fig. 1, the method includes: firstly, segmenting input video data at equal intervals, and randomly acquiring a plurality of sub-segments from each video segment; secondly, modeling a plurality of sub-segments by using a time segmentation network; then, setting initial parameters of the time-segmentation network based on the parameters of the BN-inclusion network; then, training a time segmentation network, and dynamically adjusting initial parameters based on a random gradient optimization method until a segment consensus loss function is minimum; and finally, fusing the time segmentation network to obtain the action recognition result of the video segmentation.
The video action recognition method based on the time segmentation network provided by the invention solves the problem that the traditional two-stream network can hardly learn long-range temporal information from a video. The time segmentation network is an end-to-end deep learning network; its specific supervision method, applied at the video level, is embodied in the training stage of the network. The invention adopts a sparse sampling strategy and video-level supervision, improves accuracy without increasing the amount of computation, performs a preliminary prediction of the video segment results through the time segmentation network, and obtains the final result with a fusion function. The computation process is simple and the prediction of long-duration video actions is accurate, providing a new method for video action recognition.
For a better understanding of the above-described technical solutions, exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
Specifically, the invention discloses a video motion identification method based on a time segmentation network, which comprises the following steps:
and S1, dividing the input video data into a plurality of video segments at equal intervals, and executing the same random sampling operation on each video segment to obtain a plurality of sub-segments of each video segment.
Fig. 2 is a detailed flowchart of step S1 of the video motion recognition method based on the time-slice network according to the present invention, as shown in fig. 2, step S1 includes:
s11, dividing the input video data into K video segments at equal intervals, which are expressed as:
{S1, S2, …, SK}, 3 ≤ K ≤ 10;
s12, using a sparse sampling strategy, randomly sampling one sub-segment from each video segment to obtain K sub-segments in total, expressed as:
{T1, T2, …, TK},
In an embodiment of the invention, sub-segment T1 is randomly sampled from video segment S1, sub-segment T2 is randomly sampled from video segment S2, and so on, until sub-segment TK is randomly sampled from video segment SK; that is, TK is the result of randomly sampling the corresponding video segment SK in {S1, S2, …, SK}.
Each sub-segment comprises an RGB image and an optical flow sequence, wherein the RGB image is a single frame and the optical flow sequence comprises stacked optical flow and warped optical flow. In general, optical flow arises from the movement of foreground objects in the scene, from camera motion, or from both. The warped optical flow estimates the displacement of the camera and compensates the optical flow generated by camera motion, so that the motion information represented by the flow is concentrated on the foreground objects.
And S13, the same data enhancement operation is performed on each sub-segment to obtain a data-enhanced RGB image and optical flow sequence; the data enhancement provides more data for subsequent network training and prevents the overfitting caused by an insufficient number of samples.
Fig. 3 is a specific flowchart of the data enhancement operation of step S13 of the video motion recognition method based on the time-slicing network according to the present invention, as shown in fig. 3, in step S13, the data enhancement operation is:
s131, corner cropping is performed on each sub-segment based on the corner points or the center of the RGB image and of the optical flow images in the optical flow sequence.
S132, with the size of the corner-cropped image and optical flow sequence fixed, the width and the height of the cropping region are randomly selected from {256, 224, 192, 168} and scale jittering is performed.
S133, the scale-jittered cropping region is scaled to a fixed size, yielding the data-enhanced RGB image and optical flow sequence.
In the embodiment of the invention, a 5 s high-jump video is selected from the high-jump category of the UCF101 data set as input; the video size is 320 × 240 and the frame rate is 25 fps. The input video is divided evenly into 3 segments, one RGB frame is randomly sampled from each segment with the sparse sampling strategy, the 5 frames following each sampled RGB frame are taken in order, and the stacked and warped optical flows between these frames are extracted. The sampled RGB images and optical flows are first corner-cropped, then scale-jittered, and finally scaled to 224 × 224, laying the foundation for subsequent network training.
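As an illustrative sketch of the equal-interval segmentation and sparse sampling used in this embodiment (3 segments, one random RGB frame per segment, and the 5 following frames for the optical flow); the function name and the clamping of the frame budget are assumptions rather than part of the patent:

    import random

    def sparse_sample_indices(num_frames: int, k: int = 3, flow_len: int = 5):
        """Pick one random RGB frame index per equal-interval segment, together with
        the indices of the following frames used for the stacked/warped optical flow."""
        seg_len = num_frames // k
        samples = []
        for s in range(k):
            start = s * seg_len
            # leave room inside the segment for the frames used to compute optical flow
            idx = random.randint(start, start + max(seg_len - flow_len - 1, 0))
            samples.append({"rgb": idx, "flow": list(range(idx, idx + flow_len))})
        return samples

    # e.g. a 5 s clip at 25 fps has 125 frames
    print(sparse_sample_indices(125, k=3))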
S2, modeling the plurality of sub-segments of each video segment by using a time segmentation network to obtain a plurality of modeled sub-segments. Specifically, the modeling approach is as follows:
TSN(T1, T2, …, TK) = Softmax(g(F(T1; W), F(T2; W), …, F(TK; W))),
G = g(F(T1; W), F(T2; W), …, F(TK; W)),
where TSN is the time segmentation network, F(TK; W) is a two-dimensional convolution function with parameters W acting on sub-segment TK, g is the aggregation function, G is the segment consensus it produces, and Softmax is the dual-stream fusion function.
In step S2, feature extraction and action recognition are performed with a BN-Inception network pre-trained on a large image data set; the spatial stream network outputs a score for each action class from the RGB images of the K sub-segments, and the temporal stream network outputs a score for each action class from the optical flow sequences of the K sub-segments. In the embodiment of the invention, the running category scores higher than the other action categories in the outputs for sub-segments T1 and T2, and the jumping category scores higher than the other action categories in the output for sub-segment T3. In this embodiment the high jump consists of two parts, running and jumping, defined here as follows: jumping is the movement in which both feet leave the ground together and the body moves mainly upward; running is the movement in which one foot leaves the ground and strides forward, driving the body forward.
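To make the structure of a single stream concrete, the following PyTorch sketch applies a shared 2D backbone to each of the K sub-segments and averages the per-sub-segment class scores into the consensus G; the class name, the assumption that the backbone returns a flat feature vector, and the averaging consensus are illustrative choices, not the patent's implementation.

    import torch
    import torch.nn as nn

    class SegmentConsensusStream(nn.Module):
        """One TSN-style stream: a shared 2D CNN F(.; W) applied to every sub-segment,
        followed by an averaging consensus g and a Softmax over classes."""

        def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
            super().__init__()
            self.backbone = backbone                    # stands in for BN-Inception; assumed to map (N, C, H, W) to (N, feat_dim)
            self.classifier = nn.Linear(feat_dim, num_classes)

        def forward(self, clips: torch.Tensor) -> torch.Tensor:
            # clips: (batch, K, C, H, W) -- one frame (or one stacked flow volume) per sub-segment
            b, k = clips.shape[:2]
            x = clips.flatten(0, 1)                     # treat the K sub-segments as extra batch items
            scores = self.classifier(self.backbone(x))  # F(Tj; W) for every sub-segment
            scores = scores.view(b, k, -1)
            consensus = scores.mean(dim=1)              # aggregation function g = average, giving G
            return torch.softmax(consensus, dim=1)      # class probabilities for this stream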
And S3, setting the initial parameters of the time segmentation network based on the parameters of the BN-Inception network. The initial parameters are set by initializing the temporal stream network from the spatial stream network: first, the optical flow fields are discretized into the interval from 0 to 255 through a linear transformation so that their range is the same as that of the RGB images; then the weights of the first convolutional layer of the RGB model are averaged across the input channels, and this average is replicated as the initial parameters of each input channel of the temporal stream network.
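A minimal sketch of this cross-modality initialization of the temporal stream's first convolutional layer is given below; the helper name and the choice of 10 optical-flow input channels (5 stacked flow frames x 2 directions) are assumptions made for illustration.

    import torch
    import torch.nn as nn

    def init_temporal_first_conv(rgb_conv: nn.Conv2d, flow_channels: int = 10) -> nn.Conv2d:
        """Average the pretrained RGB first-layer weights over their 3 input channels
        and replicate that mean for every optical-flow input channel."""
        flow_conv = nn.Conv2d(flow_channels, rgb_conv.out_channels,
                              kernel_size=rgb_conv.kernel_size,
                              stride=rgb_conv.stride,
                              padding=rgb_conv.padding,
                              bias=rgb_conv.bias is not None)
        with torch.no_grad():
            mean_w = rgb_conv.weight.mean(dim=1, keepdim=True)            # (out, 1, kH, kW)
            flow_conv.weight.copy_(mean_w.repeat(1, flow_channels, 1, 1)) # replicate across flow channels
            if rgb_conv.bias is not None:
                flow_conv.bias.copy_(rgb_conv.bias)
        return flow_conv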
Fig. 4 is a detailed flowchart of step S3 of the video motion recognition method based on the time-slice network according to the present invention, as shown in fig. 4, step S3 includes:
s31: the BN-inclusion network was pre-trained on the ImageNet dataset. The use of the ImageNet dataset in the embodiments of the present invention is based on the following considerations: if viewed from the bottom logic, the deep learning network has a great difficulty in convergence, and the pre-training can be regarded as a relatively complete parameter initialization process, so as to try to prevent the network from failing to train convergence due to poor initialization.
S32: and taking the pre-trained BN-inclusion network parameters as the start-up parameters of the spatial stream network, and adjusting the start-up parameters of the spatial stream network by using the RGB images after data enhancement to obtain the adjusted spatial stream network parameters.
S33: and taking the average value of the first convolution layer weight in the adjusted spatial stream network parameters as the initial adjustment parameter of the time stream network, and adjusting the initial adjustment parameter of the time stream network by using the optical stream sequence after data enhancement to obtain the adjusted time stream network parameters.
S34: and taking the adjusted spatial stream network parameters and the adjusted time stream network parameters as initial parameters of the spatial stream network and initial parameters of the time stream network respectively.
S4, training the time segmentation network, and dynamically adjusting the initial parameters based on a random gradient optimization method until the segment consensus loss function is minimum.
Fig. 5 is a detailed flowchart of step S4 of the video motion recognition method based on the time-slice network according to the present invention, as shown in fig. 5, step S4 includes:
s41, training the time segmentation network based on the data-enhanced RGB images and optical flow sequences, and dynamically adjusting the initial parameters of the time segmentation network with mini-batch stochastic gradient descent, wherein the batch size is set to 256 and the momentum to 0.9;
s42, for the spatial stream network, the learning rate is initialized to 0.001, the segment consensus loss function reaches its minimum after 4500 training iterations, and the iteration stops; for the temporal stream network, the learning rate is initialized to 0.005, the segment consensus loss function reaches its minimum after 20000 training iterations, and the iteration stops. A sketch of this configuration is given below.
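For illustration, the training configuration of steps S41 and S42 can be set up as sketched below; the helper name and the use of an iteration cap as the stopping criterion are assumptions, while the numerical values (batch size 256, momentum 0.9, learning rates 0.001 and 0.005, 4500 and 20000 iterations) are taken from the text.

    import torch

    def make_optimizer(model: torch.nn.Module, stream: str):
        """Mini-batch SGD configured with the hyper-parameters of steps S41/S42."""
        lr = 0.001 if stream == "spatial" else 0.005
        max_iters = 4500 if stream == "spatial" else 20000
        optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
        return optimizer, max_iters

    # the batch size of 256 would be applied when building the data loader, e.g.
    # loader = torch.utils.data.DataLoader(dataset, batch_size=256, shuffle=True)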
And S5, inputting the plurality of modeled sub-segments into the trained time segmentation network, combining the outputs of the plurality of video segments through the segment consensus function, and obtaining the action class with the highest probability in the video data through the Softmax fusion function, which is the action recognition result of the video data. In an embodiment of the present invention, the segment consensus function G combines the outputs of the K sub-segments: by averaging, the consensus derives the score of the running category from T1 and T2, while the jumping category, identified from T3, is the second-ranked action category, its score being lower than that of running. The dual-stream fusion function Softmax combines the results of the segment consensus function G by weighted averaging; preferably the weight ratio is set to space : time = 1 : 1.5, and when the stacked optical flow and the warped optical flow are used simultaneously the temporal weight of 1.5 is split into 1 for the stacked optical flow and 0.5 for the warped optical flow. The Softmax fusion is a normalization process; the probability of the high-jump action category in the input video obtained through Softmax fusion is the highest, so high jump is the category to which the action of the video belongs.
Fig. 6 is a detailed flowchart of step S5 of the video motion recognition method based on the time-slice network according to the present invention, as shown in fig. 6, step S5 includes:
and S51, inputting the plurality of modeled sub-segments into the trained time segmentation network to calculate action class scores, wherein the RGB images of the sub-segments are sent into the spatial stream network to calculate action class scores and the optical flow sequences are sent into the temporal stream network to calculate action class scores.
And S52, combining the spatial stream network outputs and the temporal stream network outputs of the K video segments, each through the segment consensus function G, to obtain the consensus of action classes.
And S53, combining the consensus of the action classes with the dual-stream fusion function Softmax by means of weighted averaging, wherein the action class with the highest probability is the action recognition result of the video.
Further, the weight proportion of the weighted average is the proportion h of the score output by the spatial stream network to the score output by the temporal stream network, wherein h is more than or equal to 0.5 and less than or equal to 1.
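The weighted dual-stream fusion described above can be sketched as follows; it assumes the consensus score vectors of the three cues are already computed and uses the example weights of the embodiment (spatial 1.0, stacked flow 1.0, warped flow 0.5, i.e. spatial : temporal = 1 : 1.5).

    import numpy as np

    def fuse_streams(spatial_G: np.ndarray, stacked_G: np.ndarray, warped_G: np.ndarray) -> int:
        """Weighted average of the per-stream consensus scores followed by Softmax
        normalisation; returns the index of the most probable action class."""
        fused = 1.0 * spatial_G + 1.0 * stacked_G + 0.5 * warped_G
        probs = np.exp(fused - fused.max())
        probs /= probs.sum()
        return int(np.argmax(probs))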
Further, the present invention also provides a non-transitory computer-readable storage medium comprising: at least one processor; and at least one memory communicatively coupled to the processor, wherein the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform a video motion recognition method based on a time-slicing network as described above.
To sum up, the present invention discloses a video action recognition method based on a time segmentation network and a storage medium. Fig. 7 is a network diagram of the method; as shown in fig. 7, the method includes: first, segmenting the input video data, randomly sampling a sub-segment from each video segment with the sparse sampling strategy, and modeling the sub-segments with the time segmentation network; then, performing feature extraction and action recognition with the spatial stream network and the temporal stream network; combining the outputs of the spatial stream networks and of the temporal stream networks into a consensus of action classes; and finally fusing the consensus of action classes through the dual-stream fusion function Softmax to obtain the video action recognition result.
The method solves the problem that the long-range temporal information of a video is difficult to learn with a traditional two-stream network. The time segmentation network models the long-range temporal structure of the video and combines a sparse sampling strategy with video-level supervision, so that action recognition over the whole video is efficient. The method is trained end to end, is highly automated, has a wide application range and achieves high recognition accuracy.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions.
It should be noted that in the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the terms first, second, third and the like are for convenience only and do not denote any order. These words may be understood as part of the name of the component.
Furthermore, it should be noted that in the description of the present specification, the description of the term "one embodiment", "some embodiments", "examples", "specific examples" or "some examples", etc., means that a specific feature, structure, material or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Moreover, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, the claims should be construed to include preferred embodiments and all changes and modifications that fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention should also include such modifications and variations.

Claims (10)

1. A video motion recognition method based on a time segmentation network is characterized by comprising the following steps:
s1, dividing the input video data into a plurality of video segments at equal intervals, and then executing the same random sampling operation on each video segment to obtain a plurality of sub-segments of each video segment;
s2, modeling the plurality of sub-segments of each video segment by using a time segmentation network to obtain a plurality of modeled sub-segments;
s3, setting initial parameters of the time segmentation network based on the parameters of the BN-Inception network;
s4, training a time segmentation network, and dynamically adjusting initial parameters based on a random gradient optimization method until a segment consensus loss function is minimum;
and S5, inputting the plurality of modeled sub-segments into the trained time segmentation network, combining the output of the plurality of video segments through a segment consensus loss function, and obtaining the action type with the highest probability in the video data through a Softmax fusion function, namely the action recognition result of the video data.
2. The video motion recognition method based on the time-slicing network as claimed in claim 1, wherein the step S1 comprises:
s11, dividing the input video data into K video segments at equal intervals, which are expressed as:
{S1, S2, …, SK}, 3 ≤ K ≤ 10;
s12, using a sparse sampling strategy, randomly sampling one sub-segment from each video segment to obtain K sub-segments in total, expressed as:
{T1, T2, …, TK},
wherein each sub-segment comprises one frame of RGB image and an optical flow sequence;
and S13, performing the same data enhancement operation on each sub-segment to obtain a data-enhanced RGB image and optical flow sequence.
3. The video motion recognition method based on time-slicing network as claimed in claim 2, wherein in step S13, the data enhancement operation is:
s131, performing corner cropping on each sub-segment based on the corner points or the center of the RGB image and of the optical flow images in the optical flow sequence;
s132, with the size of the corner-cropped RGB image and optical flow sequence fixed, randomly selecting the width and the height of the cropping region from {256, 224, 192, 168} and performing scale jittering;
s133, scaling the scale-jittered cropping region to a fixed size to obtain the data-enhanced RGB image and optical flow sequence.
4. The video motion recognition method based on time-slicing network as claimed in claim 3, wherein in step S2, the sub-segments are modeled by using the time-slicing network as follows:
TSN(T1, T2, …, TK) = Softmax(g(F(T1; W), F(T2; W), …, F(TK; W))),
G = g(F(T1; W), F(T2; W), …, F(TK; W)),
wherein TSN is the time segmentation network, which comprises a spatial stream network and a temporal stream network; F(TK; W) is a two-dimensional convolution function with parameters W acting on sub-segment TK; g is the aggregation function and G is the segment consensus it produces; and Softmax is the dual-stream fusion function.
5. The video motion recognition method based on the time-slicing network as claimed in claim 4, wherein the step S3 comprises:
s31, pre-training the BN-Inception network on the ImageNet data set;
s32, taking the pre-trained BN-Inception network parameters as the initialization parameters of the spatial stream network, and fine-tuning them with the data-enhanced RGB images to obtain the adjusted spatial stream network parameters;
s33, taking the average of the first convolutional layer weights in the adjusted spatial stream network parameters as the initialization parameters of the temporal stream network, and fine-tuning them with the data-enhanced optical flow sequences to obtain the adjusted temporal stream network parameters;
and S34, taking the adjusted spatial stream network parameters and the adjusted temporal stream network parameters as the initial parameters of the spatial stream network and of the temporal stream network, respectively.
6. The video motion recognition method based on the time-slicing network as claimed in claim 5, wherein the step S4 comprises:
s41, training the time segmentation network based on the data-enhanced RGB images and optical flow sequences, and dynamically adjusting the initial parameters of the time segmentation network with a stochastic gradient optimization method, wherein the batch size is set to 256 and the momentum to 0.9;
s42, for the spatial stream network, the learning rate is initialized to 0.001, the segment consensus loss function reaches its minimum when the number of training iterations reaches 4500, and the iteration stops; for the temporal stream network, the learning rate is initialized to 0.005, the segment consensus loss function reaches its minimum when the number of training iterations reaches 20000, and the iteration stops.
7. The method according to claim 6, wherein in step S4, the formula of the segment consensus loss function is:
L(y, G) = -Σ(i=1 to C) yi (Gi - log Σ(j=1 to C) exp Gj),
where C is the total number of action classes, yi is the ground-truth label of action class i, Gi is the value of the i-th dimension of the segment consensus G and represents the mean of the class-i scores over the K sub-segments, Gi = g(Fi(T1), Fi(T2), …, Fi(TK)), Fi(Tj) (1 ≤ j ≤ K) denotes the probability score of the i-th class judged on sub-segment Tj, and Gj is the value of the j-th dimension of G.
8. The video motion recognition method based on the time-slicing network as claimed in claim 7, wherein the step S5 comprises:
s51, inputting the plurality of modeled sub-segments into the trained time segmentation network to calculate action class scores, wherein the RGB images of the sub-segments are sent into the spatial stream network to calculate action class scores and the optical flow sequences are sent into the temporal stream network to calculate action class scores;
s52, combining the spatial stream network outputs and the temporal stream network outputs of the K video segments, each through the segment consensus function G, to obtain the consensus of action classes;
and S53, combining the consensus of the action classes with the dual-stream fusion function Softmax by means of weighted averaging, wherein the action class with the highest probability is the action recognition result of the video.
9. The method as claimed in claim 8, wherein the weighted average has a weight ratio h of the score of the spatial stream network output to the score of the temporal stream network output, wherein h is greater than or equal to 0.5 and less than or equal to 1.
10. A non-transitory computer-readable storage medium, comprising:
at least one processor;
and at least one memory communicatively coupled to the processor, wherein the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform a method for video motion recognition based on a time-slicing network according to any one of claims 1 to 9.
CN202011388953.1A 2020-12-02 2020-12-02 Video action recognition method based on time segmentation network and storage medium Pending CN112733595A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011388953.1A CN112733595A (en) 2020-12-02 2020-12-02 Video action recognition method based on time segmentation network and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011388953.1A CN112733595A (en) 2020-12-02 2020-12-02 Video action recognition method based on time segmentation network and storage medium

Publications (1)

Publication Number Publication Date
CN112733595A (en) 2021-04-30

Family

ID=75598109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011388953.1A Pending CN112733595A (en) 2020-12-02 2020-12-02 Video action recognition method based on time segmentation network and storage medium

Country Status (1)

Country Link
CN (1) CN112733595A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170255832A1 (en) * 2016-03-02 2017-09-07 Mitsubishi Electric Research Laboratories, Inc. Method and System for Detecting Actions in Videos
US20190147235A1 (en) * 2016-06-02 2019-05-16 Intel Corporation Recognition of activity in a video image sequence using depth information
CN107480642A (en) * 2017-08-18 2017-12-15 深圳市唯特视科技有限公司 A kind of video actions recognition methods based on Time Domain Piecewise network
CN108764128A (en) * 2018-05-25 2018-11-06 华中科技大学 A kind of video actions recognition methods based on sparse time slice network
CN110032942A (en) * 2019-03-15 2019-07-19 中山大学 Action identification method based on Time Domain Piecewise and signature differential
CN109993077A (en) * 2019-03-18 2019-07-09 南京信息工程大学 A kind of Activity recognition method based on binary-flow network
CN110188654A (en) * 2019-05-27 2019-08-30 东南大学 A kind of video behavior recognition methods not cutting network based on movement
CN110647903A (en) * 2019-06-20 2020-01-03 杭州趣维科技有限公司 Short video frequency classification method
CN111462183A (en) * 2020-03-31 2020-07-28 山东大学 Behavior identification method and system based on attention mechanism double-current network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIMIN WANG 等: "Temporal Segment Networks for Action Recognition in Videos", 《IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE》 *
焦红虹 et al.: "基于光流场的时间分段网络行为识别" [Action recognition with temporal segment networks based on optical flow fields], 《云南大学学报(自然科学版)》 [Journal of Yunnan University (Natural Sciences Edition)] *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination