CN113298017A - Behavior proposal generation method for video behavior detection - Google Patents
Behavior proposal generation method for video behavior detection
- Publication number: CN113298017A
- Application number: CN202110647905.8A
- Authority
- CN
- China
- Prior art keywords
- proposal
- behavior
- sequence
- time
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06V20/42—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a behavior proposal generation method for video behavior detection. In the feature extraction stage, the spatial information and temporal information of a video are extracted by a slow channel and a fast channel, respectively. In the behavior proposal generation stage, the extracted spatial and temporal information is first preprocessed by different pipelines and fused at two different stages; a PFG layer then samples each behavior proposal to generate proposal features, which are passed to the TEM and the PEM to predict a boundary likelihood sequence and a boundary-matching confidence map, respectively; finally, confidence fusion is performed on the prediction results to generate candidate behavior proposals, which are screened with the Soft-NMS algorithm. The method can generate behavior proposals for an original uncut video, segment the video clips that contain behaviors, and localize the start time and end time of each behavior.
Description
Technical Field
The invention relates to a behavior proposal generation method, in particular to a behavior proposal generation method for video behavior detection, and belongs to the fields of image processing and computer vision.
Background
With the development of the information age, short-video apps such as Douyin (TikTok) and Kuaishou are increasingly popular, generating a large amount of video data, and the demand for video behavior detection grows accordingly. In video surveillance, behavior detection judges whether abnormal situations such as violence or fighting appear in a video, so that potentially dangerous behaviors are detected in real time and warnings are sent to supervisory personnel. In automatic driving, behavior detection on the objects in the images captured by the vehicle predicts their next motion trajectories, so that a safe and reliable driving route can be planned, pedestrians are avoided, and driving safety improves. In sports commentary, behavior detection on the athletes in a game, such as three-pointers, blocks and steals in basketball, makes real-time automated commentary possible. Within video behavior detection, behavior proposal generation is the most critical technology: it localizes the segments of a video where behaviors are likely to occur, removes the noise segments of the original uncut video, and divides it into video clips containing only behaviors.
The currently mainstream behavior proposal generation methods comprise two processes: first, feature extraction on the original uncut video, usually with a two-stream convolutional neural network; however, the two-stream network must compute optical flow between consecutive video frames as input, which costs a large amount of computation time and optical-flow storage and is therefore very inefficient; second, proposal generation from the extracted depth features, for which existing methods are not yet mature.
The existing behavior proposal generation method mainly faces the following difficulties:
1. Temporal nature of video: unlike images, which contain only spatial information, video also requires attention to temporal information.
2. Computational complexity: a video is a stack of frame images, and most current algorithms either perform complex optical flow computation or process the temporal dimension of the video with three-dimensional convolution kernels. Optical flow computation is a complex process requiring a large amount of computation time, while three-dimensional convolution kernels add a temporal dimension that greatly increases the number of network parameters, placing higher demands on computer hardware.
3. Proposal generation: research on proposal generation is still scarce; most methods evolve from image object detection algorithms and achieve unsatisfactory results, so reasonable network prediction outputs and a proposal generation method that is as accurate as possible need to be designed.
Disclosure of Invention
Aiming at the problems and difficulties above, the invention provides a behavior proposal generation method for video behavior detection which, given an original uncut video, can generate behavior proposals, remove the noise segments of the video, segment the video clips containing behaviors, and localize the start time and end time of each behavior in the video.
In order to achieve the purpose, the technical scheme of the invention is as follows:
A method of behavior proposal generation for video behavior detection, the method comprising the steps of:
Step 1: construct a SlowFast neural network with a slow channel and a fast channel, each channel using 3D ResNet-50 as its backbone network; train the SlowFast network on the Kinetics-600 dataset until convergence to obtain a SlowFast depth feature extraction model;
Step 2: perform feature extraction on the ActivityNet dataset with the SlowFast depth feature extraction model trained in step 1 to obtain an ActivityNet depth feature dataset;
Step 3: construct a BMNPlus neural network and a dedicated loss function, and train the BMNPlus network on the ActivityNet depth feature dataset of step 2 until convergence to obtain a behavior proposal generation model;
Step 4: sample the original uncut video at two different frame rates to obtain a low-frame-rate sampled video and a high-frame-rate sampled video;
Step 5: input the low-frame-rate sampled video of step 4 into the slow channel of step 1 to obtain a slow depth feature sequence, and the high-frame-rate sampled video of step 4 into the fast channel of step 1 to obtain a fast depth feature sequence;
Step 6: preprocess the slow and fast depth feature sequences of step 5 with two different stacks of three convolution layers; fuse them after the second convolution layer to obtain the PEM fusion feature sequence, and fuse a second time after the third convolution layer to obtain the TEM fusion feature sequence;
Step 7: design a PFG layer to sample the TEM fusion feature sequence and the PEM fusion feature sequence: 8 points each in the start time region and the end time region, and 16 points in the duration region, generating the TEM proposal feature sequence and the PEM proposal feature sequence respectively;
Step 8: input the TEM proposal feature sequence of step 7 into the TEM to output a boundary likelihood sequence, and the PEM proposal feature sequence of step 7 into the PEM to output a boundary-matching confidence map;
Step 9: combine the boundary likelihood sequence and the boundary-matching confidence map of step 8 to generate a fusion confidence for each behavior proposal, and screen the candidate behavior proposals with the Soft-NMS algorithm to generate the final behavior proposals.
Further, the specific process of step 3 is as follows: construct a BMNPlus network for behavior proposal generation, divided into three modules, BM, TEM and PEM; design the loss function; then train and tune the network on the ActivityNet depth feature dataset to obtain a converged model. The loss function Loss is designed as:

Loss = L_TEM + λ1·L_PEM + λ2·L2(θ)

where L_TEM denotes the loss of the boundary likelihood sequences generated by the TEM, constraining the likelihood of each time node being a proposal start or end point; L_PEM denotes the loss of the boundary-matching confidence maps generated by the PEM, constraining the confidence score of each behavior proposal; L2(θ) is the L2 regularization term, preventing model overfitting; λ1 is set to 1 and λ2 to 0.0001. L_TEM is constructed as follows:

L_TEM = L_s(P_S, G_S) + L_s(P_E, G_E)

where P_S is the predicted start likelihood sequence and G_S its ground truth, and P_E is the predicted end likelihood sequence and G_E its ground truth. L_s is the weighted binary logistic regression loss:

L_s(P, G) = −(1/T)·Σ_{i=1..T} [ (T/n+)·b_i·log(p_i) + (T/n−)·(1−b_i)·log(1−p_i) ]

where T is the number of time nodes of the video, b_i = sign(g_i − γ) is a binarization converting g_i from [0,1] to {0,1} with γ set to 0.5, n+ = Σ b_i and n− = T − n+. L_PEM is constructed as follows:

L_PEM = L_s(M_CC, G_C) + λ·L_R(M_CR, G_C)

where M_CC and M_CR are the two BM confidence maps predicted by the PEM, G_C is the ground truth shared by M_CC and M_CR, L_R is the L2 regression loss function, and λ is set to 10.
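For illustration (not part of the claims), the weighted logistic loss L_s above can be sketched in a few lines of numpy; the function name, the explicit negative sign and the eps clamp are assumptions of this sketch:

```python
import numpy as np

def weighted_logistic_loss(pred, gt, gamma=0.5, eps=1e-8):
    """L_s(P, G): weighted binary logistic regression loss over T time nodes.

    pred, gt: arrays of shape (T,) with values in [0, 1].
    b_i = sign(g_i - gamma) binarizes the ground truth; the positive and
    negative terms are re-weighted by T/n+ and T/n- to balance the classes.
    """
    T = len(pred)
    b = (gt > gamma).astype(np.float64)     # b_i in {0, 1}
    n_pos = max(b.sum(), 1.0)               # n+
    n_neg = max(T - b.sum(), 1.0)           # n-
    terms = (T / n_pos) * b * np.log(pred + eps) \
          + (T / n_neg) * (1.0 - b) * np.log(1.0 - pred + eps)
    return -terms.sum() / T
```

Confident correct predictions incur a small loss, uninformative predictions a larger one.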
Further, the specific process of step 6 is as follows: input the slow depth feature sequence and the fast depth feature sequence into the BM preprocessing module, which applies a different preprocessing pipeline of three convolution layers to each; feature fusion is performed after the second convolution layer to obtain the PEM fusion feature sequence, and a second feature fusion after the third convolution layer to obtain the TEM fusion feature sequence. The whole BM module can be expressed as the following process:

Let the slow feature sequence and fast feature sequence input to the BM be denoted sf1 and ff1, respectively. sf1 passes through the conv1d11 and conv1d12 convolution layers to give the depth feature sequence sf2, constructed as:

sf2 = F_conv1d12(F_conv1d11(sf1))

where F denotes a convolution layer operation and the subscript of F gives the convolution layer name;

ff1 passes through the conv1d21 and conv1d22 convolution layers to give the depth feature sequence ff2, constructed as:

ff2 = F_conv1d22(F_conv1d21(ff1))

sf2 and ff2 are summed to obtain the PEM fusion feature sequence, denoted pemf:

pemf = sf2 + ff2

sf2, ff2 and pemf pass through the conv1d13, conv1d23 and conv1d33 convolution layers respectively, yielding the new feature sequences F_conv1d13(sf2), F_conv1d23(ff2) and F_conv1d33(pemf); averaging these three sequences gives the final TEM fusion feature sequence, denoted temf:

temf = (F_conv1d13(sf2) + F_conv1d23(ff2) + F_conv1d33(pemf)) / 3
further, the specific process of step 7 is as follows: for each behavior proposal, designing a PFG layer sampling method, sampling 8 points from a proposed start time region, sampling 8 points from a proposed end time region, sampling 16 points from a proposed duration region, sampling 32 points altogether, generating a proposed feature sequence for each behavior proposal, obtaining a TEM proposed feature sequence after PFG layer sampling of a TEM fusion feature sequence, and obtaining a PEM proposed feature sequence after PFG layer sampling of a PEM fusion feature sequence;the sampling process of the PFG layer is as follows: first, for each behavior proposalWherein t issIndicating a proposed start time, teRepresenting the proposed end time, from the left time region r by linear interpolations=[ts-dg/k,ts+dg/k]Middle sampling 8 points, time region r from righte=[te-dg/k,te+dg/k]Middle sampling 8 points, from the middle region ra=[ts,te]Middle sampling 16 points, where dg=te-tsK is 5; then, the behavior proposal using these 32 sampling pointsGenerating an offer feature, assuming an offerThe generated proposal features thatAll T behavior proposals are generated with the feature fpThe input characteristic of the PFG layer is finWhereinHas a dimension of NxC, fpHas dimensions of T × T × N × C, finThe dimension of (1) is T multiplied by C, C represents the number of feature channels, and the specific proposed feature construction process is as follows:
where n denotes the nth sample point,presenting offer featuresThe value at the coordinates (n, c),representing input features finAt the coordinate (t)lThe value in c),representing input features finAt the coordinate (t)rThe value in c), wlTo representWeight of (1), wrRepresentation ofWeight of (1), tl、wl、tr、wrThe construction of (a) is represented as follows:
tr=1+tl
wr=1-wl
wherein, set Nl=Nr=8,Nc=16,N=Nl+Nr+NcSince the start time of a behavior proposal cannot be later than the end time, 32, if a proposal is madeT in (1)s≥teThe proposed features of the proposal need to be combinedIs set to 0.
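For illustration, the PFG-layer sampling above can be sketched in numpy; placing the sampling points with np.linspace and clipping them to [0, T−1] are assumptions of this sketch:

```python
import numpy as np

def pfg_sample(f_in, t_s, t_e, n_l=8, n_c=16, n_r=8, k=5.0):
    """PFG-layer sampling for one proposal (t_s, t_e).

    f_in: input feature sequence of shape (T, C).
    Returns a proposal feature of shape (N, C), N = n_l + n_c + n_r = 32,
    by linear interpolation over the start, duration and end regions.
    """
    T, C = f_in.shape
    if t_s >= t_e:                      # invalid proposal -> zero feature
        return np.zeros((n_l + n_c + n_r, C))
    d_g = t_e - t_s
    xs = np.concatenate([
        np.linspace(t_s - d_g / k, t_s + d_g / k, n_l),  # start region r_s
        np.linspace(t_s, t_e, n_c),                      # duration region r_a
        np.linspace(t_e - d_g / k, t_e + d_g / k, n_r),  # end region r_e
    ])
    xs = np.clip(xs, 0, T - 1)
    t_l = np.floor(xs).astype(int)      # left neighboring time node
    t_r = np.minimum(t_l + 1, T - 1)    # right neighboring time node
    w_l = 1.0 - (xs - t_l)              # linear-interpolation weights
    w_r = 1.0 - w_l
    return w_l[:, None] * f_in[t_l] + w_r[:, None] * f_in[t_r]
```

On a linear feature ramp the interpolation reproduces the sampled positions exactly, which gives a simple sanity check.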
Further, the specific process of step 8 is as follows: the PEM module takes the PEM proposal feature sequence as input and, after passing through the PEM, outputs a T×T×2 BM confidence map, i.e. two T×T BM confidence maps, denoted M_CC ∈ R^(T×T) and M_CR ∈ R^(T×T); the TEM module takes the TEM proposal feature sequence as input and, after passing through the TEM, outputs a T×2 boundary likelihood sequence, i.e. two T×1 boundary likelihood sequences, denoted P_S and P_E, where P_S(i) represents the likelihood that the i-th time node is a behavior proposal start time and P_E(i) the likelihood that it is a behavior proposal end time.
Further, the specific process of step 9 is as follows: perform confidence fusion on the four outputs generated by the BMNPlus network, P_S, P_E, M_CC and M_CR, and then screen all candidate proposals with the Soft-NMS algorithm to generate the final behavior proposals. Specifically:

9.1 From P_S, select time nodes t_n to form a new set B_S = {t_n | n = 1, …, N_S}, where N_S is the number of selected nodes; a selected node t_n must satisfy p_S(t_n) > 0.5·max_{k=1..T} p_S(t_k), where max denotes the maximum operation, p_S(t_n) is the likelihood that time node t_n is a behavior proposal start time and p_S(t_k) that of the k-th time node. Similarly, from P_E select time nodes t_m to form a new set B_E = {t_m | m = 1, …, N_E}, where N_E is the number of selected nodes and a selected node must satisfy p_E(t_m) > 0.5·max_{k=1..T} p_E(t_k), with p_E(t_m) the likelihood that time node t_m is a behavior proposal end time.

9.2 Select a time node t_s from B_S as a start time and a time node t_e from B_E as an end time, with t_s < t_e, and construct a behavior proposal φ = (t_s, t_e, p_S(t_s), p_E(t_e), p_cc, p_cr), where p_S(t_s) is the likelihood of t_s being a proposal start time, p_E(t_e) the likelihood of t_e being a proposal end time, p_cc the value of the M_CC confidence map at coordinates (t_e − t_s, t_s), and p_cr the value of the M_CR confidence map at the same coordinates. This finally yields N_p candidate proposals.

9.3 For each candidate proposal φ, perform confidence fusion, combining the boundary likelihoods p_S(t_s) and p_E(t_e) with the matching confidences p_cc and p_cr, to obtain the fusion confidence p_f.

9.4 Screen the candidate proposals with the Soft-NMS algorithm to generate the final behavior proposal set.
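For illustration, steps 9.3 and 9.4 can be sketched in numpy. The exact fusion formula of BMNPlus is not reproduced in the text above; the product-with-square-root form below follows the related BMN method and is an assumption, as are all function names:

```python
import numpy as np

def fuse_confidence(p_s, p_e, p_cc, p_cr):
    """Fusion confidence p_f for one candidate proposal (assumed form)."""
    return p_s * p_e * np.sqrt(p_cc * p_cr)

def soft_nms(proposals, sigma=0.5, score_floor=0.001, top_k=100):
    """Soft-NMS over (t_s, t_e, score) triples: instead of discarding
    proposals that overlap a selected one, their scores are decayed by a
    Gaussian of the temporal IoU with the selected proposal."""
    props = [list(p) for p in proposals]
    keep = []
    while props and len(keep) < top_k:
        best = max(range(len(props)), key=lambda i: props[i][2])
        ts, te, sc = props.pop(best)
        if sc < score_floor:
            break
        keep.append((ts, te, sc))
        for p in props:  # decay the scores of overlapping proposals
            inter = max(0.0, min(te, p[1]) - max(ts, p[0]))
            union = (te - ts) + (p[1] - p[0]) - inter
            iou = inter / union if union > 0 else 0.0
            p[2] *= np.exp(-(iou ** 2) / sigma)
    return keep
```

A near-duplicate of the top proposal is decayed below a distant, lower-scored one, which is the behavior that distinguishes Soft-NMS from hard NMS.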
The method first extracts the slow and fast features of the video, then predicts the boundary likelihood sequences and boundary-matching confidence maps, and finally generates the behavior proposals. After an original uncut video is input, its slow and fast features, representing the spatial and temporal information of the video respectively, are extracted by the SlowFast model; these features are then input into BMNPlus to predict two boundary likelihood sequences P_S and P_E and two boundary-matching confidence maps M_CC and M_CR, from which the final behavior proposals are generated.
The invention has the following advantages: 1) the depth feature extraction network SlowFast designed by the invention is divided into a slow channel and a fast channel, both of which take raw video frames as input without computing extra optical flow information, saving a large amount of computation time and storage cost and achieving higher efficiency; 2) the method applies different preprocessing pipelines to the extracted slow and fast features and fuses them at different stages, obtaining a more reasonable preprocessed feature sequence; 3) for proposal feature generation, the invention designs a more accurate sampling and proposal feature computation scheme, the PFG layer, which makes full use of the start time region, end time region and duration region of each behavior proposal, improving the quality of the generated behavior proposals and segmenting the video more accurately.
Drawings
FIG. 1 is an overall flow chart of the present invention.
Fig. 2 is a flow chart of the depth feature extraction network SlowFast of the present invention.
Fig. 3 is a detailed view of the 3D ResNet-50 network of the present invention.
Fig. 4 is a flow chart of the proposed generation network BMNPlus of the present invention.
Fig. 5 is a detailed diagram of the BM preprocessing module of the present invention.
Detailed Description
The invention is explained in detail below with reference to the drawings, and the specific steps are as follows.
Example 1:
step 1: and constructing and designing a Slowfast neural network, designing the Slowfast neural network into a slow channel and a fast channel, and using 3D-Resnet50 as a backbone network for the two channels. The 3D-Resnet50 comprises 1 convolutional layer conv3D1, 1 pooling layer pool, 3 res1 residual blocks, 4 res2 residual blocks, 6 res3 residual blocks, 3 res4 residual blocks and 16 residual blocks in total, wherein each residual block is formed by stacking 3 three-dimensional convolutional layers by using a bottleeck design mode, and the total number of the network is 50. In the calculation of each convolution layer, a batch normalization operation is used, and each residual block calculates an activation mapping value in a pre-activation mode. The whole 3D-Resnet50 network is input as a video frame sequence 224 × 224 × T, and output as an extracted video feature sequence T × C, where T represents the number of input video frames, i.e., the time dimension, and C represents the number of feature channels finally extracted for each frame. For a slow channel, sampling an original video by using a low frame rate for an input video frame sequence, sampling one frame every 16 frames, namely if the original video has f frames, sampling to obtain f/16 frames as input, and setting the number C of characteristic channels to be 2048; for the fast channel, the input video frame sequence samples the original video with a high frame rate, one frame is sampled every 2 frames, that is, if the original video has f frames, f/2 frames are obtained by sampling as input, and the number C of the feature channels is set to 256. And carrying out convergence training on the constructed Slowfast network on a Kinetics-600 data set, wherein a Cross Engine Loss is adopted as a Loss function, and finally a depth feature extraction model is obtained.
Step 2: inputting each video sample of the activityNet data set into a depth feature extraction model, and extracting to obtain depth feature data corresponding to each sample. The resolution of the original video samples is first scaled to 224 x 224. Then, sampling one frame of the original video every 16 frames, inputting the sampled video frame sequence into a slow channel, and extracting to obtain a slow depth characteristic sequence; sampling one frame of original video every 2 frames, inputting the sampled video frame sequence into a fast channel, and extracting to obtain a fast depth feature sequence. Therefore, for each video sample in the ActivityNet dataset, after passing through the depth feature extraction model, a slow depth feature sequence and a fast depth feature sequence are respectively obtained, and finally, all samples and the corresponding feature sequences thereof form the depth feature dataset of ActivityNet.
And step 3: the loss functions for designing the BMNPlus proposed generating network and training were constructed. The whole BMNPlus network is divided into a BM preprocessing module, a TEM module and a PEM module, wherein the BM is used for preprocessing depth features, the TEM is used for generating a boundary possibility sequence, and the PEM is used for generating a boundary matching confidence map. The Loss function Loss for BMNPlus is designed to consist of three parts: l isTEMLoss of boundary likelihood sequence, LPEMLoss of boundary matching confidence maps, L2 regularization term. The composition of the Loss of Loss function is expressed as follows:
Loss=LTEM+λ1·LPEM+λ2·L2(θ)
wherein L isTEMRepresenting the loss of TEM-generated boundary likelihood sequences, constraining the likelihood of proposed start and end times, LPEMTo representLoss of the BM confidence map generated by the PEM, constrains the probability of each behavior proposal being correct, L2(θ) represents the L2 regularization term to prevent overfitting. Wherein λ1Is set to 1, lambda2Set to 0.0001. L isTEMThe constitution of (a) is as follows:
LTEM=Ls(Ps,Gs)+Ls(PE,GE)
wherein P isSPrediction value representing a starting possibility sequence, GSGroup-truth, P, representing a start probability sequenceEPredicted value representing ending probability sequence, GEGroup-truth, L indicating a sequence of end possibilitiessThe constitution of (a) is as follows:
where T represents the time dimension of the video, i.e. the number of frames, bi=sign(gi- γ) is a binary function of giFrom [0,1 ]]Convert to {0,1}, and set γ to 0.5. Suppose n is+=∑bi,n-=T-n+Then, thenLPEMThe constitution of (a) is as follows:
LpEM=Ls(Mcc,Cc)+λLR(MCR,CC)
wherein M isCCAnd MCRTwo BM confidence maps for PEM prediction, GCIs MCCAnd MCRGroup-route, LRλ is set to 10 for the L2 regression loss function.
To enable supervised training of the model, the original ground-truth labels of the ActivityNet dataset need to be converted into the three ground-truth labels required by the BMNPlus network: G_C, G_S and G_E. Assume the original ground-truth label of each video sample of the ActivityNet dataset is a set of annotated behavior proposals Ψ = {φ_k = (t_s, t_e)}, k = 1, 2, …, n, where t_s denotes the start time of a behavior proposal and t_e its end time. For each behavior proposal φ_k, first compute the behavior duration d_g = t_e − t_s, then the start time region r_s = [t_s − d_g/5, t_s + d_g/5] and the end time region r_e = [t_e − d_g/5, t_e + d_g/5], finally obtaining n start time regions r_s and n end time regions r_e.
Generation of the ground truth of P_S and P_E, i.e. G_S and G_E: for each time node t_i among the T time nodes, compute its span r_i = [t_i − d_f/2, t_i + d_f/2], where d_f = t_i − t_{i−1}. Take the maximum IoR value between r_i and the n regions r_s as G_S(i), the ground truth of the likelihood that time node t_i is a behavior proposal start time; take the maximum IoR value between r_i and the n regions r_e as G_E(i), the ground truth of the likelihood that t_i is a behavior proposal end time. Here IoR is defined as an overlap ratio: in generating G_S, IoR is the intersection of r_i and r_s divided by the length of r_i; in generating G_E, IoR is the intersection of r_i and r_e divided by the length of r_i.
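For illustration, the IoR-based generation of G_S and G_E can be sketched in numpy, assuming uniformly spaced time nodes; the function names are illustrative:

```python
import numpy as np

def ior(region, anchor):
    """IoR: length of the intersection of region and anchor, divided by
    the length of anchor (here the anchor is the time-node span r_i)."""
    inter = max(0.0, min(region[1], anchor[1]) - max(region[0], anchor[0]))
    return inter / (anchor[1] - anchor[0])

def boundary_ground_truth(nodes, gt_proposals, k=5.0):
    """G_S / G_E sketch: for each time node, the max IoR of its span
    against the start (resp. end) regions of the annotated proposals."""
    d_f = nodes[1] - nodes[0]                 # uniform node spacing
    g_s, g_e = [], []
    for t in nodes:
        r_i = (t - d_f / 2, t + d_f / 2)      # span of this time node
        s_vals, e_vals = [0.0], [0.0]
        for ts, te in gt_proposals:
            d_g = te - ts
            s_vals.append(ior((ts - d_g / k, ts + d_g / k), r_i))
            e_vals.append(ior((te - d_g / k, te + d_g / k), r_i))
        g_s.append(max(s_vals))
        g_e.append(max(e_vals))
    return np.array(g_s), np.array(g_e)
```

Nodes whose span lies entirely inside a boundary region receive label 1, nodes far from any boundary receive 0.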
Generation of the ground truth of M_CC and M_CR, i.e. G_C: M_CC and M_CR are BM confidence maps and share the same ground-truth label G_C. For the point (d, t) on G_C, i.e. the behavior proposal starting at time node t with duration d, take the maximum IoU between this proposal and the annotated proposals in Ψ as the value of point (d, t).
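For illustration, the generation of G_C can be sketched in numpy; scoring each point (d, t) by its maximum IoU against the annotated proposals is assumed here (as in the related BMN method), and the discrete duration convention (d + 1 nodes) is an illustrative choice:

```python
import numpy as np

def iou_1d(a, b):
    """Temporal IoU of two intervals a = (start, end), b = (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def bm_label_map(T, gt_proposals):
    """G_C sketch: a T x T map where entry (d, t) scores the proposal
    starting at node t with duration d + 1 nodes by its best IoU
    against the annotated proposals."""
    g_c = np.zeros((T, T))
    for d in range(T):
        for t in range(T):
            if t + d + 1 <= T:    # proposal must end within the video
                prop = (float(t), float(t + d + 1))
                g_c[d, t] = max((iou_1d(prop, g) for g in gt_proposals),
                                default=0.0)
    return g_c
```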
Based on the BMNPlus model, the loss function and the ground-truth labels above, train BMNPlus on the ActivityNet depth feature dataset of step 2 until convergence, finally obtaining the behavior proposal generation model.
Step 4: sample the original video one frame every 16 frames to obtain the low-frame-rate sampled video, and one frame every 2 frames to obtain the high-frame-rate sampled video. Assuming the original video has f frames, low-frame-rate sampling yields a video frame sequence of f/16 frames and high-frame-rate sampling a sequence of f/2 frames.
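For illustration, the dual-frame-rate sampling described above reduces to strided slicing over the frame axis:

```python
import numpy as np

def dual_rate_sample(frames, slow_stride=16, fast_stride=2):
    """Sample a frame array of shape (f, H, W, 3) at the two rates used by
    the slow and fast channels: every 16th frame and every 2nd frame."""
    return frames[::slow_stride], frames[::fast_stride]
```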
Step 5: input the low-frame-rate sampled video of step 4 into the slow channel of the depth feature extraction model to obtain the slow depth feature sequence of the original video, of dimension (f/16) × 2048; input the high-frame-rate sampled video of step 4 into the fast channel to obtain the fast depth feature sequence, of dimension (f/2) × 256. To feed the fast depth feature sequence into BMNPlus for proposal feature generation, average sampling along the time dimension is also needed to map the fast features onto the same time dimension as the slow features, so the fast feature sequence finally produced by the fast channel is (f/16) × 256. Writing T = f/16, the slow depth feature sequence has dimension T × 2048 and the fast depth feature sequence T × 256.
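For illustration, the "time dimension average sampling" that maps the fast features from f/2 to f/16 time steps can be read as average pooling over every 8 consecutive steps; that reading is an assumption of this sketch:

```python
import numpy as np

def align_fast_features(fast_feat, ratio=8):
    """Map the fast feature sequence (f/2 steps) onto the slow time axis
    (f/16 steps) by averaging every `ratio` = 8 consecutive steps.

    fast_feat: shape (T_fast, C), with T_fast divisible by ratio.
    Returns shape (T_fast // ratio, C)."""
    t, c = fast_feat.shape
    return fast_feat.reshape(t // ratio, ratio, c).mean(axis=1)
```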
Step 6: inputting the slow depth characteristic sequence and the fast depth characteristic sequence into a BM preprocessing module, respectively carrying out different three-layer convolution layer preprocessing processes on the slow depth characteristic sequence and the fast depth characteristic sequence by the BM, carrying out first characteristic fusion after the two convolution layer preprocessing to obtain a PEM fusion characteristic sequence, and carrying out second characteristic fusion by combining the PEM fusion diagram characteristic sequence after the three convolution layer preprocessing to obtain a TEM fusion characteristic sequence. The whole BM module can be represented as the following process:
Assume the slow feature sequence and fast feature sequence input to the BM module are denoted sf1 and ff1 respectively. sf1 passes through the two convolution layers conv1d11 and conv1d12 to obtain the depth feature sequence sf2, where conv1d11 uses a one-dimensional convolution kernel of size 3 with the number of feature channels set to 256, and conv1d12 uses a kernel of size 3 with the number of feature channels set to 128. The construction of sf2 is expressed as follows:
sf2=Fconv1d12(Fconv1d11(sf1))
where F denotes a convolutional layer operation and the subscript of F denotes the name of the particular convolutional layer.
ff1 passes through the two convolution layers conv1d21 and conv1d22 to obtain the depth feature sequence ff2, where conv1d21 uses a one-dimensional convolution kernel of size 3 with 256 feature channels and conv1d22 uses a one-dimensional kernel of size 3 with 128 feature channels. The construction of ff2 is expressed as follows:
ff2=Fconv1d22(Fconv1d21(ff1))
The sum of sf2 and ff2 gives the PEM fusion feature sequence, denoted pemf; the construction of pemf is shown below:
pemf=sf2+ff2
sf2, ff2 and pemf pass through the convolution layers conv1d13, conv1d23 and conv1d33 respectively to obtain the new feature sequences Fconv1d13(sf2), Fconv1d23(ff2) and Fconv1d33(pemf), where conv1d13, conv1d23 and conv1d33 all use a one-dimensional convolution kernel of size 1 with the number of feature channels set to 1. The three new feature sequences are averaged to obtain the final TEM fusion feature sequence, denoted temf, whose construction is expressed as follows:

temf = (Fconv1d13(sf2) + Fconv1d23(ff2) + Fconv1d33(pemf)) / 3

After BM preprocessing, a PEM fusion feature sequence of dimension T × 128 and a TEM fusion feature sequence of dimension T × 1 are obtained.
Step 7: for each action proposal, a PFG layer is used to sample 8 points from the start-time region of the proposal, 8 points from the end-time region, and 16 points from the duration region, 32 points in total, to generate a proposal feature for each action proposal.
First, for each behavior proposal φ = (t_s, t_e), where t_s denotes the proposal start time and t_e the proposal end time, linear interpolation is used to sample 8 points from the left time region r_s = [t_s - d_g/k, t_s + d_g/k], 8 points from the right time region r_e = [t_e - d_g/k, t_e + d_g/k], and 16 points from the middle region r_a = [t_s, t_e], where d_g = t_e - t_s and k = 5. These 32 sampling points are then used to generate a proposal feature for the proposal. Suppose the proposal feature generated for proposal φ is f_φ; the features generated for all candidate proposals together form f_p, and the input feature of the PFG layer is f_in, where f_φ has dimension N × C, f_p has dimension T × T × N × C, and f_in has dimension T × C, with C the number of feature channels. The specific proposal feature construction process is as follows:
f_φ(n, c) = w_l · f_in(t_l, c) + w_r · f_in(t_r, c)

where n denotes the n-th sampling point, located at the (generally non-integer) time position t_n; f_φ(n, c) is the value of the proposal feature at coordinate (n, c); f_in(t_l, c) and f_in(t_r, c) are the values of the input feature f_in at coordinates (t_l, c) and (t_r, c); and w_l and w_r are the weights of f_in(t_l, c) and f_in(t_r, c) respectively. t_l, w_l, t_r and w_r are constructed as follows:

t_l=⌊t_n⌋
w_l=1-(t_n-t_l)
t_r=1+t_l
w_r=1-w_l
Here N_l = N_r = 8, N_c = 16, and N = N_l + N_r + N_c = 32. Since the start time of a behavior proposal must not be later than its end time, if a proposal φ has t_s ≥ t_e, the proposal feature f_φ of that proposal is set to 0.
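A pure-Python sketch of the PFG sampling just described, for a single-channel input feature; the even spacing of the sampled points within each region and the clamping of positions to the valid index range are assumptions of this sketch:

```python
import math

def pfg_sample(feat, t_s, t_e, k=5, n_l=8, n_c=16, n_r=8):
    """PFG-style sampling: linearly interpolate `feat` (a list indexed by
    time node) at n_l points in the start region, n_c points in [t_s, t_e],
    and n_r points in the end region; returns all zeros when t_s >= t_e,
    as required above for invalid proposals."""
    if t_s >= t_e:
        return [0.0] * (n_l + n_c + n_r)

    def interp(t):
        t = min(max(t, 0.0), len(feat) - 1.0)   # clamp to valid range
        t_l = math.floor(t)
        t_r = min(t_l + 1, len(feat) - 1)
        w_r = t - t_l                            # w_l = 1 - w_r
        return (1.0 - w_r) * feat[t_l] + w_r * feat[t_r]

    d_g = t_e - t_s
    regions = [(t_s - d_g / k, t_s + d_g / k, n_l),   # start region
               (t_s, t_e, n_c),                       # duration region
               (t_e - d_g / k, t_e + d_g / k, n_r)]   # end region
    out = []
    for lo, hi, n in regions:
        out += [interp(lo + (hi - lo) * i / (n - 1)) for i in range(n)]
    return out
```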
After PFG-layer sampling, the TEM fusion feature sequence of dimension T × 1 from step 6 yields a TEM proposal feature sequence of dimension T × 32, and the PEM fusion feature sequence of dimension T × 128 yields a PEM proposal feature sequence of dimension T × 32 × 128.
And 8: inputting the PEM proposal characteristic sequence of T multiplied by 32 multiplied by 128 into a PEM module, outputting and obtaining a BM confidence map of T multiplied by 2, namely two BM confidence maps of T multiplied by T, which are marked as MCC∈RT×TAnd MCR∈RT×TThe PEM module consists of three convolutional layers, conv3d uses a three-dimensional convolutional kernel size of 1 × 1 × 32, the number of characteristic channels is set to 512, conv2d2 uses a two-dimensional convolutional kernel size of 1 × 1, the number of characteristic channels is set to 256, conv2d3 uses a two-dimensional convolutional kernel size of 1 × 1, and the number of characteristic channels is set to 2; inputting the TEM proposal characteristic sequence of T multiplied by 32 into a TEM module, outputting two boundary possibility sequences of T multiplied by 1 which are boundary possibility sequences of T multiplied by 2, and recording the boundary possibility sequences asAndthe TEM consists of one feature compression operation and two convolutional layers, the squeeze operation averages the second dimension of the TEM proposed feature sequence T × 32 to compress the feature information, conv1d1 uses a one-dimensional convolution kernel size of 1, the number of feature channels is set to 256, conv1d2 is set to a one-dimensional convolution kernel size of 3, and the number of feature channels is set to 2.
Step 9: the four outputs generated by the BMNPlus network, namely the two boundary probability sequences P_S and P_E and the two BM confidence maps M_CC and M_CR, are used to generate fusion confidences, after which the Soft-NMS algorithm screens all behavior proposals. Specifically:
9.1 From P_S, select time nodes t_n to form a new set B_S = {t_n}_{n=1}^{N_S}, where N_S is the number of time nodes finally selected. Each selected t_n must satisfy p_s^{t_n} ≥ 0.5 · max_k(p_s^k), where max denotes the maximum operation, k runs from 1 to T, p_s^{t_n} is the probability that time node t_n is the start time of a behavior proposal, and p_s^k is the probability that the k-th time node is a behavior proposal start time. Similarly, from P_E select time nodes t_m to form a new set B_E = {t_m}_{m=1}^{N_E}, where N_E is the number of time nodes finally selected; each t_m must satisfy p_e^{t_m} ≥ 0.5 · max_k(p_e^k), where p_e^{t_m} is the probability that time node t_m is the end time of a behavior proposal and p_e^k is the probability that the k-th time node is a behavior proposal end time.
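A sketch of the boundary-node selection in 9.1; the exact threshold is partly elided in the text, so keeping every node whose probability reaches a fixed fraction of the sequence maximum is used here as an assumption:

```python
def select_boundary_nodes(probs, ratio=0.5):
    """Candidate boundary selection: keep time node t when its boundary
    probability reaches `ratio` times the sequence maximum. Run once on
    P_S to build B_S and once on P_E to build B_E. The 0.5 default is an
    assumed value; local-peak selection is another common criterion."""
    thr = ratio * max(probs)
    return [t for t, p in enumerate(probs) if p >= thr]
```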
9.2 From B_S select a time node t_s as the start time and from B_E select a time node t_e as the end time, constructing a behavior proposal denoted φ = (t_s, t_e, p_s^{t_s}, p_e^{t_e}, p_cc, p_cr), where t_s and t_e must satisfy t_s < t_e, p_s^{t_s} is the probability that time node t_s is a behavior proposal start time, p_e^{t_e} is the probability that time node t_e is a behavior proposal end time, p_cc is the value of the M_CC confidence map at coordinate (t_e - t_s, t_s), and p_cr is the value of the M_CR confidence map at coordinate (t_e - t_s, t_s). Finally N_p candidate proposals are obtained.
9.3 For each candidate proposal φ, confidence fusion is performed to obtain the fusion confidence p_f, in the following manner:

p_f = p_s^{t_s} · p_e^{t_e} · sqrt(p_cc · p_cr)
9.4 The candidate proposals are screened using the Soft-NMS algorithm to generate the final action proposal set.
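Steps 9.3 and 9.4 can be sketched as follows; the product-and-square-root fusion form and the Gaussian decay parameters are assumptions of this sketch, since the patent's formula images are not reproduced in the extracted text:

```python
import math

def fuse_confidence(p_s, p_e, p_cc, p_cr):
    """Fusion confidence for 9.3 (assumed form: boundary probabilities
    times the geometric mean of the two BM confidence-map values)."""
    return p_s * p_e * math.sqrt(p_cc * p_cr)

def soft_nms(proposals, sigma=0.5, thresh=0.1):
    """Gaussian Soft-NMS over (t_s, t_e, score) proposals for 9.4;
    sigma and thresh are illustrative defaults, not values from the text."""
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    props = [list(p) for p in proposals]
    kept = []
    while props:
        best = max(props, key=lambda p: p[2])   # highest remaining score
        props.remove(best)
        kept.append(tuple(best))
        for p in props:                          # decay overlapping scores
            p[2] *= math.exp(-iou(best, p) ** 2 / sigma)
    return [p for p in kept if p[2] >= thresh]
```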
It should be noted that the above-mentioned embodiments are only preferred embodiments of the present invention and are not intended to limit the scope of the present invention; all equivalent substitutions or replacements made on the basis of the above technical solutions fall within the protection scope of the present invention.
Claims (6)
1. A method of behavior proposal generation for video behavior detection, the method comprising the steps of:
step 1: constructing and designing a SlowFast neural network, designing the SlowFast neural network into a slow channel and a fast channel, wherein each channel adopts 3DResnet-50 as a main network, and training the SlowFast network on a Kinetics-600 data set until convergence to obtain a SlowFast depth feature extraction model;
step 2: performing feature extraction on an activityNet data set by using the SlowFast depth feature extraction model trained in the step 1 to obtain an activityNet depth feature data set;
Step 3: constructing and designing a BMNPlus neural network and a specific loss function, and training the BMNPlus network on the ActivityNet depth feature data set of step 2 until convergence to obtain a behavior proposal generation model;
and 4, step 4: sampling an original uncut video by using two different frame rates to respectively obtain a low frame rate sampling video and a high frame rate sampling video;
and 5: inputting the low frame rate sampling video in the step 4 into the slow channel in the step 1 to obtain a slow depth characteristic sequence, and inputting the high frame rate sampling video in the step 4 into the fast channel in the step 1 to obtain a fast depth characteristic sequence;
step 6: respectively preprocessing the slow depth characteristic sequence and the fast depth characteristic sequence in the step 5 by using different three convolution layers, fusing the slow depth characteristic sequence and the fast depth characteristic sequence after the second convolution layer to obtain a PEM fusion characteristic sequence, and fusing the PEM fusion characteristic sequence for the second time after the third convolution layer to obtain a TEM fusion characteristic sequence;
and 7: designing a PFG layer to respectively sample a TEM fusion characteristic sequence and a PEM fusion characteristic sequence, respectively sampling 8 points in a starting time region and an ending time region, sampling 16 points in a duration region, and respectively generating a TEM proposal characteristic sequence and a PEM proposal characteristic sequence;
and 8: inputting the TEM proposed feature sequence in the step 7 into a TEM, outputting to obtain a boundary possibility sequence, inputting the PEM proposed feature sequence in the step 7 into a PEM, and outputting to obtain a boundary matching confidence map;
and step 9: and (4) combining the boundary possibility sequence and the boundary matching confidence map in the step 8 to generate a fusion confidence for each behavior proposal, and screening the candidate behavior proposals by using a Soft-NMS algorithm to generate a final behavior proposal.
2. The behavior proposal generation method for video behavior detection according to claim 1, wherein the specific process of step 3 is as follows: constructing a BMNPlus network for behavior proposal generation, the whole BMNPlus network being divided into three modules, BM, TEM and PEM; designing a loss function; and training and tuning the network on the ActivityNet depth feature data set to obtain a converged model, wherein the loss function Loss is designed as follows:
Loss = L_TEM + λ1·L_PEM + λ2·L2(θ)
wherein L_TEM represents the loss of the TEM-generated boundary probability sequences, constraining the probability of each time node being a proposal start or end time point; L_PEM represents the loss of the PEM-generated boundary matching confidence maps, constraining the confidence score of each behavior proposal; L2(θ) represents the L2 regularization term, used to prevent model overfitting; λ1 is set to 1 and λ2 is set to 0.0001. L_TEM is constructed as follows:
L_TEM = L_s(P_S, G_S) + L_s(P_E, G_E)
wherein P_S represents the predicted start probability sequence, G_S the ground-truth of the start probability sequence, P_E the predicted end probability sequence, and G_E the ground-truth of the end probability sequence. L_s is constructed as follows:
L_s(P, G) = (1/T) · Σ_{i=1}^{T} (α^+ · b_i · log(p_i) + α^- · (1 - b_i) · log(1 - p_i))

where T represents the number of time nodes of the video, p_i and g_i are the i-th elements of P and G, and b_i = sign(g_i - γ) is a binarization function converting g_i from [0,1] to {0,1}, with γ set to 0.5. Letting n^+ = Σ b_i and n^- = T - n^+, the weights are α^+ = T/n^+ and α^- = T/n^-.
L_PEM is constructed as follows:
L_PEM = L_s(M_CC, G_C) + λ·L_R(M_CR, G_C)
wherein M_CC and M_CR are the two BM confidence maps predicted by the PEM, G_C is the ground-truth of M_CC and M_CR, L_R is the L2 regression loss function, and λ is set to 10.
3. The behavior proposal generation method for video behavior detection according to claim 1, characterized in that the specific process of step 6 is as follows: inputting the slow depth characteristic sequence and the fast depth characteristic sequence into a BM preprocessing module, and respectively performing different preprocessing processes on the slow depth characteristic sequence and the fast depth characteristic sequence by the BM, wherein the preprocessing process comprises three convolution layers, performing characteristic fusion after a second convolution layer to obtain a PEM fusion characteristic sequence, and performing second characteristic fusion after a third convolution layer to obtain a TEM fusion characteristic sequence, and the whole BM module can be expressed as the following process:
assuming that the slow feature sequence and fast feature sequence input to the BM are denoted sf1 and ff1 respectively, sf1 passes through the conv1d11 and conv1d12 convolution layers to obtain a depth feature sequence sf2, and the construction of sf2 is expressed as follows:
sf2=Fconv1d12(Fconv1d11(sf1))
wherein F represents a convolutional layer operation, and the subscript of F indicates the convolutional layer name;
ff1 after passing through conv1d21 and conv1d22 two convolutional layers, a depth characteristic sequence ff2 is obtained, and the construction of ff2 is expressed as follows:
ff2=Fconv1d22(Fconv1d21(ff1))
sf2 and ff2 are summed by sum to obtain a PEM fusion signature sequence denoted pemf, and the construction of pemf is shown below:
pemf=sf2+ff2
sf2, ff2 and pemf pass through the conv1d13, conv1d23 and conv1d33 convolution layers respectively to obtain the new feature sequences Fconv1d13(sf2), Fconv1d23(ff2) and Fconv1d33(pemf); averaging the three new feature sequences gives the final TEM fusion feature sequence, denoted temf, whose construction is expressed as follows:

temf = (Fconv1d13(sf2) + Fconv1d23(ff2) + Fconv1d33(pemf)) / 3
4. The behavior proposal generation method for video behavior detection according to claim 1, wherein the specific process of step 7 is as follows: for each behavior proposal, a PFG layer sampling method is designed, which samples 8 points from the start-time region of the proposal, 8 points from the end-time region, and 16 points from the duration region, 32 points in total, to obtain a proposal feature for each behavior proposal; after PFG-layer sampling, the TEM fusion feature sequence yields the TEM proposal feature sequence and the PEM fusion feature sequence yields the PEM proposal feature sequence. The sampling process of the PFG layer is as follows: first, for each behavior proposal φ = (t_s, t_e), where t_s denotes the proposal start time and t_e the proposal end time, linear interpolation is used to sample 8 points from the left time region r_s = [t_s - d_g/k, t_s + d_g/k], 8 points from the right time region r_e = [t_e - d_g/k, t_e + d_g/k], and 16 points from the middle region r_a = [t_s, t_e], where d_g = t_e - t_s and k = 5; then a proposal feature is generated for the behavior proposal using these 32 sampling points. Suppose the proposal feature generated for proposal φ is f_φ; the features generated for all candidate proposals together form f_p, and the input feature of the PFG layer is f_in, where f_φ has dimension N × C, f_p has dimension T × T × N × C, and f_in has dimension T × C, with C the number of feature channels. The specific proposal feature construction process is as follows:
f_φ(n, c) = w_l · f_in(t_l, c) + w_r · f_in(t_r, c)

where n denotes the n-th sampling point, located at the time position t_n; f_φ(n, c) is the value of the proposal feature at coordinate (n, c); f_in(t_l, c) and f_in(t_r, c) are the values of the input feature f_in at coordinates (t_l, c) and (t_r, c); and w_l and w_r are the corresponding weights. t_l, w_l, t_r and w_r are constructed as follows:

t_l=⌊t_n⌋
w_l=1-(t_n-t_l)
t_r=1+t_l
w_r=1-w_l
5. The behavior proposal generation method for video behavior detection according to claim 1, wherein the specific process of step 8 is as follows: the PEM module takes the PEM proposal feature sequence as input and, after passing through the PEM module, produces a BM confidence map of dimension T × T × 2, i.e. two T × T BM confidence maps, denoted M_CC ∈ R^{T×T} and M_CR ∈ R^{T×T}; the TEM module takes the TEM proposal feature sequence as input and, after passing through the TEM module, produces a boundary probability sequence of dimension T × 2, i.e. two T × 1 boundary probability sequences, denoted P_S = {p_s^i}_{i=1}^{T} and P_E = {p_e^i}_{i=1}^{T}, where p_s^i represents the probability that the i-th time node is the start time of a behavior proposal and p_e^i represents the probability that the i-th time node is the end time of a behavior proposal.
6. The behavior proposal generation method for video behavior detection according to claim 1, wherein the specific process of step 9 is as follows: the four outputs generated by the BMNPlus network, namely P_S, P_E, M_CC and M_CR, undergo confidence fusion, and then the Soft-NMS algorithm screens all candidate proposals to generate the final behavior proposals, specifically:
9.1 From P_S, select time nodes t_n to form a new set B_S = {t_n}_{n=1}^{N_S}, where N_S is the number of time nodes finally selected. Each selected t_n must satisfy p_s^{t_n} ≥ 0.5 · max_k(p_s^k), where max denotes the maximum operation, k runs from 1 to T, p_s^{t_n} is the probability that time node t_n is the start time of a behavior proposal, and p_s^k is the probability that the k-th time node is a behavior proposal start time. Similarly, from P_E select time nodes t_m to form a new set B_E = {t_m}_{m=1}^{N_E}, where N_E is the number of time nodes finally selected; each t_m must satisfy p_e^{t_m} ≥ 0.5 · max_k(p_e^k), where p_e^{t_m} is the probability that time node t_m is the end time of a behavior proposal and p_e^k is the probability that the k-th time node is a behavior proposal end time.
9.2 From B_S select a time node t_s as the start time and from B_E select a time node t_e as the end time, constructing a behavior proposal denoted φ = (t_s, t_e, p_s^{t_s}, p_e^{t_e}, p_cc, p_cr), where t_s and t_e must satisfy t_s < t_e, p_s^{t_s} is the probability that time node t_s is a behavior proposal start time, p_e^{t_e} is the probability that time node t_e is a behavior proposal end time, p_cc is the value of the M_CC confidence map at coordinate (t_e - t_s, t_s), and p_cr is the value of the M_CR confidence map at coordinate (t_e - t_s, t_s). Finally N_p candidate proposals are obtained.
9.3 For each candidate proposal φ, confidence fusion is performed to obtain the fusion confidence p_f, in the following manner:

p_f = p_s^{t_s} · p_e^{t_e} · sqrt(p_cc · p_cr)
9.4 screening the candidate proposals by using the Soft-NMS algorithm to generate a final action proposal set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110647905.8A CN113298017B (en) | 2021-06-10 | 2021-06-10 | Behavior proposal generation method for video behavior detection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113298017A true CN113298017A (en) | 2021-08-24 |
CN113298017B CN113298017B (en) | 2024-04-23 |
Family
ID=77327868
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113627410A (en) * | 2021-10-14 | 2021-11-09 | 江苏奥斯汀光电科技股份有限公司 | Method for recognizing and retrieving action semantics in video |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109784269A (en) * | 2019-01-11 | 2019-05-21 | 中国石油大学(华东) | One kind is based on the united human action detection of space-time and localization method |
CN109919122A (en) * | 2019-03-18 | 2019-06-21 | 中国石油大学(华东) | A kind of timing behavioral value method based on 3D human body key point |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||