CN113673559B - Video character space-time characteristic extraction method based on residual error network - Google Patents

Video character space-time characteristic extraction method based on residual error network

Info

Publication number
CN113673559B
CN113673559B
Authority
CN
China
Prior art keywords
residual
network
video
hourglass
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110793379.6A
Other languages
Chinese (zh)
Other versions
CN113673559A (en)
Inventor
陈志�
江婧
岳文静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202110793379.6A priority Critical patent/CN113673559B/en
Publication of CN113673559A publication Critical patent/CN113673559A/en
Application granted granted Critical
Publication of CN113673559B publication Critical patent/CN113673559B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a method for extracting space-time features of video characters based on a residual network, which addresses the high computation cost and high memory requirements of extracting space-time features from video. The invention first decomposes the 3D filter into spatial and temporal forms, then designs three different forms of residual blocks for the decomposed (2D+1D) convolution kernels based on the residual network, and places each residual block at a different location throughout the ResNet structure. Finally, the residual blocks are combined with a designed hourglass structure, which adds a depth convolution at the tail end of the residual path, to form a new 3D residual network for extracting the space-time features of video characters. The invention enhances the diversity of the network structure, so that the whole network can be used for various video analysis tasks with improved performance and time efficiency.

Description

Video character space-time characteristic extraction method based on residual error network
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a video character space-time characteristic extraction method based on a residual network.
Background
At present, feature extraction, particularly for image recognition, is a major hot spot in the field of computer vision. The quality of feature extraction has a great impact on generalization ability; its task is to build informative, non-redundant features from an initial set of data, thereby facilitating subsequent detection or classification tasks.
Most feature extraction methods operate on images; the main methods are HOG (histogram of oriented gradients), SIFT (scale-invariant feature transform), Haar features, and the like. Current methods that extract features directly from video include TSN (temporal segment network) and C3D.
The TSN network consists of a temporal-stream convolutional network and a spatial-stream convolutional network. TSN first randomly samples several segments from a given video, each selected segment then makes a preliminary prediction of the category from its own information, and the segment-level predictions are finally combined into the prediction for the whole video. TSN models long-range temporal structure, using a sparse sampling strategy and video-level supervision to make learning over a given video efficient and effective.
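To make the sparse-sampling idea concrete, the following is a minimal PyTorch sketch of TSN-style segment sampling and score averaging; the segment count, the toy classifier and the tensor shapes are illustrative assumptions, not details taken from the TSN paper or from this patent.

```python
# Sketch of TSN-style sparse sampling and consensus: split the video into
# segments, sample one snippet per segment, classify each snippet, and
# average the snippet scores into a video-level prediction.
import torch
import torch.nn as nn

def tsn_predict(video_frames: torch.Tensor, snippet_model: nn.Module, k: int = 3):
    # video_frames: (n, c, h, w); split the n frames into k roughly equal segments.
    segments = torch.chunk(video_frames, k, dim=0)
    # Randomly sample one frame (snippet) per segment.
    snippets = torch.stack([seg[torch.randint(len(seg), (1,))].squeeze(0)
                            for seg in segments])
    scores = snippet_model(snippets)   # (k, num_classes) snippet-level scores
    return scores.mean(dim=0)          # segmental consensus by averaging

toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 112 * 112, 10))
video = torch.randn(48, 3, 112, 112)
print(tsn_predict(video, toy_model).shape)  # torch.Size([10])
```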
The other method, C3D, uses a 3D convolutional neural network to construct the network structure, which is more suitable for extracting space-time features than a 2D convolutional neural network: a 2D convolution discards temporal information after each operation, whereas 3D convolution and pooling operations model temporal information more effectively. C3D found that the best-performing convolution kernel size is 3 × 3 × 3.
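The contrast between frame-wise 2D convolution and 3D convolution on a video tensor can be seen in a few lines. The sketch below only illustrates the shape handling and the fact that a 3 × 3 × 3 kernel also aggregates across neighbouring frames; the channel counts and resolution are arbitrary choices.

```python
# 2D convolution treats each frame independently (no temporal mixing),
# while 3D convolution with a 3x3x3 kernel mixes neighbouring frames.
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)                # (batch, c, n, h, w)

conv2d = nn.Conv2d(3, 64, kernel_size=3, padding=1)
per_frame = torch.stack([conv2d(clip[:, :, t]) for t in range(clip.shape[2])], dim=2)
print(per_frame.shape)   # frames processed independently: (1, 64, 16, 112, 112)

conv3d = nn.Conv3d(3, 64, kernel_size=3, padding=1)    # 3x3x3 kernel mixes time
print(conv3d(clip).shape)                              # (1, 64, 16, 112, 112)
```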
With the intellectualization of various devices and the rapid growth of multimedia on the Internet, video has gradually become a brand-new way for users to communicate, which not only encourages the development of leading-edge technology but also poses a great test for it. A video consists of many time-series frames, is more complex than a single picture, and its shots switch frequently, all of which adds difficulty to training a general, powerful classifier for extracting space-time features. A common way to extract space-time information from video is to train a new 3D convolutional neural network, which can access the temporal information presented within each video frame and between successive frames; however, training a 3D CNN from scratch is computationally expensive, and the model size grows about 2 times compared with a 2D CNN. These problems all need to be solved.
Disclosure of Invention
Technical problem: the invention aims to solve the technical problems of high computation cost and high memory requirements in extracting space-time features from video, and provides a video character space-time characteristic extraction method based on a residual error network.
The technical scheme is as follows: in order to achieve the above purpose, the present invention adopts the following technical scheme:
a video character space-time characteristic extraction method based on a residual error network comprises the following steps:
step 1) inputting a video V, wherein the video V is a multi-person video containing two or more persons, the video size is c × n × h × w, c is the number of channels, n is the number of frames in a single video, and h and w are the height and width of each frame;
step 2) decomposing a 3D convolution filter of size 3 × 3 × 3 into spatial and temporal (2D+1D) forms, i.e. a spatial 2-dimensional convolution filter and a temporal 1-dimensional convolution filter, using a 1 × 3 × 3 convolution filter and a 3 × 1 × 1 convolution filter instead of the 3 × 3 × 3 convolution filter (a code sketch of this decomposition is given after these steps);
step 3) combining the decoupled spatial 2-dimensional convolution filter and the temporal 1-dimensional convolution filter with a residual network, and designing 3 different 3D residual blocks: a 3D serial residual block, a 3D parallel residual block and a 3D serial-parallel residual block;
step 4) combining the 3 residual blocks in step 3) with an hourglass structure respectively, and placing the shortcut so that it connects the high-dimensional representation, to obtain 3 hourglass residual structures: an hourglass residual series structure HRS-I, an hourglass residual parallel structure HRS-II, and an hourglass residual series-parallel structure HRS-III;
step 5) respectively integrating the 3 hourglass residual structures in the step 4) into residual networks to form three new residual networks; combining the 3 hourglass residual structures in the step 4) and then merging the combined residual structures into a residual network to form another new residual network; comparing the four residual error networks to obtain the residual error network with the best performance;
step 6) training the residual network with the best performance obtained in step 5) on a 1080 Ti GPU using a data set, wherein 70% of the data set is used as a training set, 10% as a validation set, and 20% as a test set;
step 7) extracting the space-time features of the video V using the trained new residual network.
Further, the step 3) specifically includes the following steps:
step 31) let the residual function be F(x_l); then x_{l+1} = H(x_l) = F(x_l) + x_l, where H(x_l) is the feature learned by the residual network and x_{l+1} is the output of the l-th residual unit;
step 32) when F(x_l) = 0, H(x_l) = x_l; the output of the l-th residual unit can be written as x_{l+1} = x_l + F'*x_l, where F'*x_l denotes the result of applying the residual function F to x_l;
step 33) designing the serial residual block, which connects the one-dimensional convolution filter and the two-dimensional convolution filter in a serial manner; let the residual function be T(S(x_l)), with the output expressed as x_{l+1} = x_l(1 + T'S'), where T denotes the one-dimensional (temporal) filter, S denotes the two-dimensional (spatial) filter, and T', S' are the results of applying the residual functions T and S respectively;
step 34) designing the parallel residual block, which arranges the two convolution filters in parallel on different paths, so that there is no direct influence between them, only an indirect one, and their outputs are accumulated into the final output; let the residual function be T(x_l) + S(x_l), with the output expressed as x_{l+1} = x_l(1 + T' + S');
step 35) designing the serial-parallel residual block, which simultaneously builds direct influences from both the one-dimensional and the two-dimensional convolution filters to the final output, realizing a shortcut connection in the spatial dimension; let the residual function be S(x_l) + T(S(x_l)), with the output expressed as x_{l+1} = x_l(1 + T'S' + S').
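The three residual block forms of steps 33)-35) can be sketched as follows, with S the spatial 1 × 3 × 3 filter and T the temporal 3 × 1 × 1 filter; omitting batch normalization and activations is a simplification, and the channel handling is an assumption made for illustration.

```python
# Minimal sketch of the 3D serial, parallel and serial-parallel residual blocks.
import torch
import torch.nn as nn

def spatial_conv(c):   # S: 2D spatial filter (1x3x3)
    return nn.Conv3d(c, c, kernel_size=(1, 3, 3), padding=(0, 1, 1))

def temporal_conv(c):  # T: 1D temporal filter (3x1x1)
    return nn.Conv3d(c, c, kernel_size=(3, 1, 1), padding=(1, 0, 0))

class SerialBlock(nn.Module):           # x_{l+1} = x_l + T(S(x_l))
    def __init__(self, c):
        super().__init__()
        self.S, self.T = spatial_conv(c), temporal_conv(c)
    def forward(self, x):
        return x + self.T(self.S(x))

class ParallelBlock(nn.Module):         # x_{l+1} = x_l + T(x_l) + S(x_l)
    def __init__(self, c):
        super().__init__()
        self.S, self.T = spatial_conv(c), temporal_conv(c)
    def forward(self, x):
        return x + self.T(x) + self.S(x)

class SerialParallelBlock(nn.Module):   # x_{l+1} = x_l + S(x_l) + T(S(x_l))
    def __init__(self, c):
        super().__init__()
        self.S, self.T = spatial_conv(c), temporal_conv(c)
    def forward(self, x):
        s = self.S(x)
        return x + s + self.T(s)

x = torch.randn(1, 64, 16, 28, 28)
for Block in (SerialBlock, ParallelBlock, SerialParallelBlock):
    assert Block(64)(x).shape == x.shape  # all three preserve the feature shape
```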
Further, the step 4) specifically includes the following steps:
step 41) in order to ensure that the shortcut connects the high-dimensional representation, the order of the two point-by-point convolutions is reversed; the point-by-point convolution is a 1 × 1 convolution that performs feature extraction at a single point, yielding a feature map;
step 42) let F be the input tensor and G the output tensor of the residual structure, where D_f × D_f × M is the size of the feature map obtained in step 41); ignoring the depth convolution layers and the activation layers, the hourglass structure is expressed as G = φ_e(φ_r(F)) + F, where φ_e is the point convolution that expands the channels and φ_r is the point convolution that reduces the channels;
step 43) adding a depth convolution at the tail end of the residual path and designing the point-direction convolutions in the middle of the depth convolutions; the hourglass structure can then be expressed as G = φ_d2(φ_p2(φ_p1(φ_d1(F)))) + F, where φ_p1 is the 1st point-direction convolution, φ_d1 is the 1st depth-direction convolution, φ_p2 is the 2nd point-direction convolution, and φ_d2 is the 2nd depth-direction convolution.
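A rough sketch of the hourglass residual structure of steps 41)-43) is given below: depth convolutions sit at the ends of the residual path, two point-direction convolutions in the middle first reduce and then expand the channels, and the shortcut stays on the high-dimensional representation. The reduction ratio and the use of 3 × 3 × 3 depthwise convolutions are assumptions, not values from the patent.

```python
# Minimal sketch of an hourglass residual block: depthwise convs at both ends,
# pointwise reduce/expand in the middle, shortcut on the high-dimensional tensor.
import torch
import torch.nn as nn

class HourglassResidual(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        mid = channels // reduction
        # 1st depth-direction convolution (depthwise, groups=channels).
        self.d1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1,
                            groups=channels)
        # 1st point-direction convolution: reduce channels.
        self.p1 = nn.Conv3d(channels, mid, kernel_size=1)
        # 2nd point-direction convolution: expand channels back.
        self.p2 = nn.Conv3d(mid, channels, kernel_size=1)
        # 2nd depth-direction convolution at the tail end of the residual path.
        self.d2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1,
                            groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # G = phi_d2(phi_p2(phi_p1(phi_d1(F)))) + F
        return self.d2(self.p2(self.p1(self.d1(x)))) + x

x = torch.randn(1, 64, 8, 28, 28)
print(HourglassResidual(64)(x).shape)  # torch.Size([1, 64, 8, 28, 28])
```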
Further, the step 5) specifically includes the following steps:
step 51) after the three residual blocks are combined with the hourglass structure, they are respectively called the serial hourglass residual structure HRS-I, the parallel hourglass residual structure HRS-II and the serial-parallel hourglass residual structure HRS-III; all residual units in ResNet-50 are replaced by HRS-I, HRS-II and HRS-III respectively to form three new residual networks;
step 52) arranging HRS-I, HRS-II and HRS-III in sequence to form a new hourglass residual structure chain that replaces all residual units in ResNet-50, obtaining another new residual network;
step 53) comparing the three new residual networks formed in step 51) with the residual network obtained in step 52) to obtain the residual network with the best performance.
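The replacement scheme of steps 51)-52) can be sketched as follows; the HRS classes here are placeholders standing in for the blocks sketched earlier, the 3-4-6-3 stage layout mirrors ResNet-50, and everything else is assumed for illustration.

```python
# Sketch of building the four candidate networks: three single-type variants
# and the HRS-I -> HRS-II -> HRS-III chain of step 52).
import itertools
import torch.nn as nn

class HRS_I(nn.Module):            # serial hourglass residual structure (placeholder)
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Identity()  # stand-in for the real hourglass residual path
    def forward(self, x):
        return x + self.body(x)

class HRS_II(HRS_I):               # parallel hourglass residual structure (placeholder)
    pass

class HRS_III(HRS_I):              # serial-parallel hourglass residual structure (placeholder)
    pass

def make_stage(block_types, channels, num_units):
    # Cycle through the given block type(s) for the residual units of one stage.
    blocks = itertools.cycle(block_types)
    return nn.Sequential(*[next(blocks)(channels) for _ in range(num_units)])

# ResNet-50 has 3, 4, 6 and 3 residual units in its four stages.
stages = [(256, 3), (512, 4), (1024, 6), (2048, 3)]
candidates = {
    "HRS-I": [HRS_I], "HRS-II": [HRS_II], "HRS-III": [HRS_III],
    "HRS-chain": [HRS_I, HRS_II, HRS_III],   # step 52): I -> II -> III in sequence
}
networks = {name: nn.Sequential(*[make_stage(types, c, n) for c, n in stages])
            for name, types in candidates.items()}
print([type(b).__name__ for b in networks["HRS-chain"][1]])
# ['HRS_I', 'HRS_II', 'HRS_III', 'HRS_I']
```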
Further, in the step 6), the residual network with the best performance obtained in the step 5) is trained efficiently, and five 5-second short clips are randomly selected from each video.
Further, the new residual network is trained in the step 6), and the dropout rate is empirically set to 0.1.
Further, the new residual network is trained in the step 6), and the learning rate is initialized to 0.001 empirically.
The beneficial effects are that: compared with the prior art, the invention has the following beneficial effects:
According to the invention, the 3D filter is decomposed into spatial and temporal forms, three forms of residual blocks are designed for the decomposed (2D+1D) convolution, and these residual blocks are then combined with a designed hourglass structure, which adds a depth convolution at the tail end of the residual path, to form a new 3D residual network for space-time feature extraction.
Drawings
Fig. 1 is a flow chart of a method for extracting temporal and spatial characteristics of a video character based on a residual network.
Fig. 2 is a diagram of the decoupled (2D+1D) form in combination with a residual network.
Fig. 3 is a diagram of a residual block in combination with an hourglass structure.
Detailed Description
The technical scheme of the invention is further described in detail below with reference to the accompanying drawings:
a video character space-time characteristic extraction method based on a residual error network comprises the following steps:
Fig. 1 is a flowchart of the method for extracting space-time features of video characters based on a residual network. First, a clipped video is input; the video size is c × n × h × w, where c is the number of channels, n is the number of frames in a single video, and h and w are the height and width of each frame. The videos are taken from the large Sports-1M data set. A 3D convolution filter of size 3 × 3 × 3 is then decomposed into spatial and temporal (2D+1D) forms, and a 1 × 3 × 3 convolution filter and a 3 × 1 × 1 convolution filter are used instead of the 3 × 3 × 3 convolution filter. The decoupled (2D+1D) form is then combined with the residual network; the combined network is shown in fig. 2. Three different 3D residual blocks are designed: a 3D serial residual block, a 3D parallel residual block and a 3D serial-parallel residual block;
A residual network with an hourglass structure similar to the classical bottleneck structure is then designed; the hourglass residual structure differs from the bottleneck structure in that a depth convolution is added at the tail end of the residual path. The decomposed structure shown in fig. 2 is combined with the hourglass structure, and the shortcut is placed on the connected high-dimensional representation. To ensure that the shortcut connects the high-dimensional representation, the order of the two point-by-point (1 × 1) convolutions is reversed so that feature extraction is performed point by point; depth convolutions are added at the tail end of the residual path, and the point-direction convolutions are designed in the middle of the depth convolutions.
The hourglass residual structures are then integrated into the residual network to form a new 3D residual network, and the new network is trained on the Sports-1M data set, with five 5-second short clips randomly selected from each video. During training, the mini-batch is set to 128 frames/clip and the dropout rate is set to 0.1. The learning rate is initialized to 0.001 and divided by 10 every 60K iterations.
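A minimal sketch of this training setup follows; the dropout rate, initial learning rate, decay schedule and batch size come from the text above, while the optimizer choice, the toy stand-in model and the class count (487 for Sports-1M) are assumptions added for illustration.

```python
# Sketch of the training configuration described above.
import torch
import torch.nn as nn

# Toy stand-in for the 3D residual network (the real network is the one built above).
model = nn.Sequential(nn.Conv3d(3, 64, kernel_size=3, padding=1),
                      nn.AdaptiveAvgPool3d(1), nn.Flatten(),
                      nn.Dropout(p=0.1),              # dropout rate 0.1
                      nn.Linear(64, 487))             # 487 classes in Sports-1M
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# Divide the learning rate by 10 every 60K iterations.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60_000, gamma=0.1)
criterion = nn.CrossEntropyLoss()

def train_step(clips, labels):
    # clips: a mini-batch of randomly sampled 5-second clips, shape (128, c, n, h, w).
    optimizer.zero_grad()
    loss = criterion(model(clips), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()                                  # per-iteration schedule
    return loss.item()
```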
And finally, extracting the space-time characteristics of the video by the trained network.
The foregoing is only a preferred embodiment of the invention, it being noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the invention.

Claims (5)

1. The method for extracting the space-time characteristics of the video characters based on the residual error network is characterized by comprising the following steps of:
step 1) inputting a video V, wherein the video V is a multi-person video containing two or more persons, the video size is c × n × h × w, c is the number of channels, n is the number of frames in a single video, and h and w are the height and width of each frame;
step 2) decomposing a 3D convolution filter of size 3 × 3 × 3 into spatial and temporal (2D+1D) forms, i.e. a spatial 2-dimensional convolution filter and a temporal 1-dimensional convolution filter, using a 1 × 3 × 3 convolution filter and a 3 × 1 × 1 convolution filter instead of the 3 × 3 × 3 convolution filter;
step 3) combining the decoupled spatial 2-dimensional convolution filter and the temporal 1-dimensional convolution filter with a residual network, and designing 3 different 3D residual blocks: a 3D serial residual block, a 3D parallel residual block and a 3D serial-parallel residual block;
step 4) combining the 3 residual blocks in step 3) with an hourglass structure respectively, and placing the shortcut so that it connects the high-dimensional representation, to obtain 3 hourglass residual structures: an hourglass residual series structure HRS-I, an hourglass residual parallel structure HRS-II, and an hourglass residual series-parallel structure HRS-III;
step 5), respectively integrating the 3 hourglass residual structures in the step 4) into residual networks to form three new residual networks; combining the 3 hourglass residual structures in the step 4) and then merging the combined residual structures into a residual network to form another new residual network; comparing the four residual error networks to obtain the residual error network with the best performance;
step 6) training the residual network with the best performance obtained in step 5) on a 1080 Ti GPU using a data set, wherein 70% of the data set is used as a training set, 10% as a validation set, and 20% as a test set;
step 7), extracting space-time characteristics of the video V by using a trained new residual error network;
the step 3) specifically comprises the following steps:
step 31) let the residual function be F(x_l); then x_{l+1} = H(x_l) = F(x_l) + x_l, where H(x_l) is the feature learned by the residual network and x_{l+1} is the output of the l-th residual unit;
step 32) when F(x_l) = 0, H(x_l) = x_l; the output of the l-th residual unit can be written as x_{l+1} = x_l + F'*x_l, where F'*x_l denotes the result of applying the residual function F to x_l;
step 33) designing the serial residual block, which connects the one-dimensional convolution filter and the two-dimensional convolution filter in a serial manner; let the residual function be T(S(x_l)), with the output expressed as x_{l+1} = x_l(1 + T'S'), where T denotes the one-dimensional filter, S denotes the two-dimensional filter, and T', S' are the results of applying the residual functions T and S respectively;
step 34) designing the parallel residual block, which arranges the two convolution filters in parallel on different paths, so that there is no direct influence between them, only an indirect one, and their outputs are accumulated into the final output; let the residual function be T(x_l) + S(x_l), with the output expressed as x_{l+1} = x_l(1 + T' + S');
step 35) designing the serial-parallel residual block, which simultaneously builds direct influences from both the one-dimensional and the two-dimensional convolution filters to the final output, realizing a shortcut connection in the spatial dimension; let the residual function be S(x_l) + T(S(x_l)), with the output expressed as x_{l+1} = x_l(1 + T'S' + S');
The step 4) specifically comprises the following steps:
step 41) in order to ensure that the shortcut connects the high-dimensional representation, the order of the two point-by-point convolutions is reversed; the point-by-point convolution is a 1 × 1 convolution that performs feature extraction at a single point, yielding a feature map;
step 42) let F be the input tensor and G the output tensor of the residual structure, where D_f × D_f × M is the size of the feature map obtained in step 41); ignoring the depth convolution layers and the activation layers, the hourglass structure is expressed as G = φ_e(φ_r(F)) + F, where φ_e is the point convolution that expands the channels and φ_r is the point convolution that reduces the channels;
step 43) adding a depth convolution at the tail end of the residual path and designing the point-direction convolutions in the middle of the depth convolutions; the hourglass structure can then be expressed as G = φ_d2(φ_p2(φ_p1(φ_d1(F)))) + F, where φ_p1 is the 1st point-direction convolution, φ_d1 is the 1st depth-direction convolution, φ_p2 is the 2nd point-direction convolution, and φ_d2 is the 2nd depth-direction convolution.
2. The method for extracting the spatial and temporal characteristics of the video character based on the residual network according to claim 1, wherein the step 5) specifically comprises the following steps:
step 51) after the three residual blocks are combined with the hourglass structure, they are respectively called the serial hourglass residual structure HRS-I, the parallel hourglass residual structure HRS-II and the serial-parallel hourglass residual structure HRS-III; all residual units in ResNet-50 are replaced by HRS-I, HRS-II and HRS-III respectively to form three new residual networks;
step 52) arranging HRS-I, HRS-II and HRS-III in sequence to form a new hourglass residual structure chain that replaces all residual units in ResNet-50, obtaining another new residual network;
step 53) comparing the three new residual networks formed in step 51) with the residual network obtained in step 52) to obtain the residual network with the best performance.
3. The method for extracting the spatial and temporal characteristics of the video character based on the residual network according to claim 1, wherein in the step 6), the residual network with the best performance obtained in the step 5) is trained efficiently, and five 5-second short clips are randomly selected from each video.
4. The method for extracting spatiotemporal features of video characters based on residual network according to claim 1, wherein said step 6) trains a new residual network, and empirically sets a dropout rate to 0.1.
5. The method for extracting spatiotemporal features of video characters based on residual network of claim 1, wherein the new residual network is trained in the step 6), and the learning rate is initialized to 0.001 empirically.
CN202110793379.6A 2021-07-14 2021-07-14 Video character space-time characteristic extraction method based on residual error network Active CN113673559B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110793379.6A CN113673559B (en) 2021-07-14 2021-07-14 Video character space-time characteristic extraction method based on residual error network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110793379.6A CN113673559B (en) 2021-07-14 2021-07-14 Video character space-time characteristic extraction method based on residual error network

Publications (2)

Publication Number Publication Date
CN113673559A CN113673559A (en) 2021-11-19
CN113673559B (en) 2023-08-25

Family

ID=78539265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110793379.6A Active CN113673559B (en) 2021-07-14 2021-07-14 Video character space-time characteristic extraction method based on residual error network

Country Status (1)

Country Link
CN (1) CN113673559B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112149504A (en) * 2020-08-21 2020-12-29 浙江理工大学 Motion video identification method combining residual error network and attention of mixed convolution
CN112348766A (en) * 2020-11-06 2021-02-09 天津大学 Progressive feature stream depth fusion network for surveillance video enhancement
CN112883929A (en) * 2021-03-26 2021-06-01 全球能源互联网研究院有限公司 Online video abnormal behavior detection model training and abnormal detection method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10528846B2 (en) * 2016-11-14 2020-01-07 Samsung Electronics Co., Ltd. Method and apparatus for analyzing facial image

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112149504A (en) * 2020-08-21 2020-12-29 浙江理工大学 Motion video identification method combining residual error network and attention of mixed convolution
CN112348766A (en) * 2020-11-06 2021-02-09 天津大学 Progressive feature stream depth fusion network for surveillance video enhancement
CN112883929A (en) * 2021-03-26 2021-06-01 全球能源互联网研究院有限公司 Online video abnormal behavior detection model training and abnormal detection method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
(2+1)D multi-spatiotemporal information fusion model and its application in action recognition; Tan Yongdong; Wang Yongxiong; Chen Shuyi; Miao Yinlong; Information and Control (No. 06); full text *

Also Published As

Publication number Publication date
CN113673559A (en) 2021-11-19

Similar Documents

Publication Publication Date Title
Hara et al. Learning spatio-temporal features with 3d residual networks for action recognition
CN107506712B (en) Human behavior identification method based on 3D deep convolutional network
CN111259782B (en) Video behavior identification method based on mixed multi-scale time sequence separable convolution operation
CN105095862B (en) A kind of human motion recognition method based on depth convolution condition random field
Tran et al. Two-stream flow-guided convolutional attention networks for action recognition
CN111079674B (en) Target detection method based on global and local information fusion
CN112149504A (en) Motion video identification method combining residual error network and attention of mixed convolution
CN111325165B (en) Urban remote sensing image scene classification method considering spatial relationship information
CN110378208B (en) Behavior identification method based on deep residual error network
CN111639564B (en) Video pedestrian re-identification method based on multi-attention heterogeneous network
Ju et al. Fusing global and local features for generalized ai-synthesized image detection
CN113688894B (en) Fine granularity image classification method integrating multiple granularity features
Kulhare et al. Key frame extraction for salient activity recognition
CN113920581B (en) Method for identifying actions in video by using space-time convolution attention network
CN113537008A (en) Micro-expression identification method based on adaptive motion amplification and convolutional neural network
CN106650617A (en) Pedestrian abnormity identification method based on probabilistic latent semantic analysis
Wang et al. Multiscale deep alternative neural network for large-scale video classification
CN114241564A (en) Facial expression recognition method based on inter-class difference strengthening network
Salem et al. Semantic image inpainting using self-learning encoder-decoder and adversarial loss
CN109002808B (en) Human behavior recognition method and system
Omi et al. Model-agnostic multi-domain learning with domain-specific adapters for action recognition
Dastbaravardeh et al. Channel Attention‐Based Approach with Autoencoder Network for Human Action Recognition in Low‐Resolution Frames
CN114005142A (en) Pedestrian re-identification model and identification method based on multi-scale and attention feature aggregation
Saealal et al. Three-Dimensional Convolutional Approaches for the Verification of Deepfake Videos: The Effect of Image Depth Size on Authentication Performance
CN113673559B (en) Video character space-time characteristic extraction method based on residual error network

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant