CN113298017A - Behavior proposal generation method for video behavior detection - Google Patents

Behavior proposal generation method for video behavior detection Download PDF

Info

Publication number
CN113298017A
Authority
CN
China
Prior art keywords
proposal
behavior
sequence
time
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110647905.8A
Other languages
Chinese (zh)
Other versions
CN113298017B (en)
Inventor
姚莉
范文鸿
杨俊宴
吴含前
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202110647905.8A priority Critical patent/CN113298017B/en
Publication of CN113298017A publication Critical patent/CN113298017A/en
Application granted granted Critical
Publication of CN113298017B publication Critical patent/CN113298017B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a behavior proposal generation method for video behavior detection. In the feature extraction stage, the spatial information and temporal information of a video are extracted by a slow channel and a fast channel, respectively. In the behavior proposal generation stage, the extracted spatial and temporal information first undergoes different preprocessing processes and is fused at two different stages; a PFG layer then samples each behavior proposal to generate proposal features, which are passed to the TEM and the PEM to predict a boundary likelihood sequence and a boundary matching confidence map, respectively; finally, confidence fusion is performed on the prediction results to generate candidate behavior proposals, which are screened with the Soft-NMS algorithm. Given an original uncut video, the method can generate behavior proposals, segment the video segments that contain behaviors, and locate the start time and end time of each behavior in the video.

Description

Behavior proposal generation method for video behavior detection
Technical Field
The invention relates to behavior proposal generation methods, and in particular to a behavior proposal generation method for video behavior detection, belonging to the field of image processing and computer vision.
Background
With the development of the information age, short-video apps such as Douyin and Kuaishou are increasingly popular, which produces a large amount of video data, and the demand for video behavior detection grows accordingly. In video surveillance, behavior detection judges whether abnormal situations such as violence or fighting occur in a video, so that possible dangerous behaviors are detected in real time and warnings are sent to supervisory personnel. In automatic driving, behavior detection is performed on the objects in the images captured by the vehicle and their next motion trajectories are predicted, so that a safe and reliable driving route can be planned, pedestrians can be avoided, and the safety of automatic driving is improved. In sports commentary, detecting the behaviors of athletes in a game, such as three-pointers, blocks and steals in a basketball game, makes real-time robot commentary possible. In video behavior detection, behavior proposal generation is the most critical technology: it locates the segments of a video in which behaviors are likely to occur, removes the noise segments of the original uncut video, and divides the video into segments that contain only behaviors.
The currently mainstream behavior proposal generation methods involve two processes. First, features are extracted from the original uncut video; a two-stream convolutional neural network is usually adopted here, but the two-stream network needs to compute optical flow between consecutive video frames as input, which requires a large amount of computation time and optical-flow storage cost and is therefore very inefficient. Second, proposals are generated from the extracted depth features, and research on this step is not yet mature.
The existing behavior proposal generation method mainly faces the following difficulties:
1. time-sequence of video: video needs to focus on timing information more than if the image contains only spatial information.
2. Computational complexity: a video is a stack of frame images, and most current algorithms either perform costly optical flow computation or process the temporal dimension of the video with three-dimensional convolution kernels. Optical flow computation is a complex process that requires a large amount of computation time, while adding a temporal dimension to the convolution kernel greatly increases the number of network parameters and thus places higher demands on computer hardware (a worked parameter-count example is given after this list).
3. Proposal generation: research specifically devoted to proposal generation is still scarce, most methods are adapted from image object detection algorithms and their results are not satisfactory, so a reasonable network prediction output and a proposal generation method that is as accurate as possible need to be designed.
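As a rough illustration of the parameter growth mentioned in difficulty 2 (the channel counts here are illustrative assumptions, not values fixed by this invention): a two-dimensional 3×3 convolution with 256 input and 256 output channels has 3·3·256·256 ≈ 5.9×10^5 weights, whereas the corresponding three-dimensional 3×3×3 convolution has 3·3·3·256·256 ≈ 1.77×10^6 weights, i.e., three times as many parameters for a single layer, before counting the additional activation memory along the time dimension.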
Disclosure of Invention
In view of the above problems and difficulties, the invention provides a behavior proposal generation method for video behavior detection which, given an original uncut video, can generate behavior proposals, remove the noise segments of the video, segment the video segments that contain behaviors, and locate the start time and end time of each behavior in the video.
In order to achieve the purpose, the technical scheme of the invention is as follows:
a method of behavior proposal generation for video behavior detection, the method comprising the steps of:
step 1: constructing and designing a SlowFast neural network, designing the SlowFast neural network into a slow channel and a fast channel, wherein each channel adopts 3DResnet-50 as a main network, and training the SlowFast network on a Kinetics-600 data set until convergence to obtain a SlowFast depth feature extraction model;
step 2: performing feature extraction on an activityNet data set by using the SlowFast depth feature extraction model trained in the step 1 to obtain an activityNet depth feature data set;
step 3: constructing and designing a BMNPlus neural network and a dedicated loss function, and training the BMNPlus network on the ActivityNet depth feature dataset of step 2 until convergence to obtain a behavior proposal generation model;
step 4: sampling the original uncut video at two different frame rates to obtain a low-frame-rate sampled video and a high-frame-rate sampled video respectively;
step 5: inputting the low-frame-rate sampled video of step 4 into the slow channel of step 1 to obtain a slow depth feature sequence, and inputting the high-frame-rate sampled video of step 4 into the fast channel of step 1 to obtain a fast depth feature sequence;
step 6: preprocessing the slow depth feature sequence and the fast depth feature sequence of step 5 with two different sets of three convolutional layers, fusing them after the second convolutional layer to obtain a PEM fusion feature sequence, and fusing them a second time together with the PEM fusion feature sequence after the third convolutional layer to obtain a TEM fusion feature sequence;
step 7: designing a PFG layer to sample the TEM fusion feature sequence and the PEM fusion feature sequence respectively, sampling 8 points each in the start time region and the end time region and 16 points in the duration region, to generate a TEM proposal feature sequence and a PEM proposal feature sequence respectively;
step 8: inputting the TEM proposal feature sequence of step 7 into the TEM and outputting a boundary likelihood sequence, and inputting the PEM proposal feature sequence of step 7 into the PEM and outputting a boundary matching confidence map;
step 9: combining the boundary likelihood sequence and the boundary matching confidence map of step 8 to generate a fused confidence for each behavior proposal, and screening the candidate behavior proposals with the Soft-NMS algorithm to generate the final behavior proposals (an end-to-end sketch of these steps is given below).
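For orientation only, the nine steps can be outlined by the following minimal Python sketch. Every argument after video_frames is a callable supplied by the caller; none of these helper names are defined by the invention, and the sketch is an assumption-laden outline rather than the actual implementation.

def generate_proposals(video_frames, slowfast, bm, pfg, tem, pem, fuse_and_nms):
    # Hypothetical end-to-end outline of steps 4-9; the callables passed in
    # (slowfast, bm, pfg, tem, pem, fuse_and_nms) are placeholders.
    slow_clip = video_frames[::16]                         # step 4: low frame rate sampling
    fast_clip = video_frames[::2]                          # step 4: high frame rate sampling
    slow_feat, fast_feat = slowfast(slow_clip, fast_clip)  # step 5: SlowFast features
    pem_feat, tem_feat = bm(slow_feat, fast_feat)          # step 6: BM preprocessing
    tem_prop, pem_prop = pfg(tem_feat), pfg(pem_feat)      # step 7: PFG sampling
    p_s, p_e = tem(tem_prop)                               # step 8: boundary likelihoods
    m_cc, m_cr = pem(pem_prop)                             # step 8: BM confidence maps
    return fuse_and_nms(p_s, p_e, m_cc, m_cr)              # step 9: fusion and Soft-NMS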
Further, the specific process of step 3 is as follows: construct a BMNPlus network for behavior proposal generation, divide the whole BMNPlus network into three modules, BM, TEM and PEM, design a loss function, and train and tune the network on the ActivityNet depth feature dataset to obtain a converged model, wherein the loss function Loss is designed as follows:
Loss = L_TEM + λ1·L_PEM + λ2·L2(θ)
wherein L_TEM represents the loss of the TEM-generated boundary likelihood sequences and constrains the likelihood of each time node being a proposal start or end time point, L_PEM represents the loss of the PEM-generated boundary matching confidence maps and constrains the confidence score of each behavior proposal, L2(θ) represents the L2 regularization term which prevents model overfitting, λ1 is set to 1 and λ2 is set to 0.0001; L_TEM is constructed as follows:
L_TEM = L_s(P_S, G_S) + L_s(P_E, G_E)
wherein P_S represents the predicted start likelihood sequence, G_S represents the ground truth of the start likelihood sequence, P_E represents the predicted end likelihood sequence, G_E represents the ground truth of the end likelihood sequence, and L_s is constructed as follows:
L_s = -(1/T) · Σ_{i=1}^{T} ( α+ · b_i · log(p_i) + α- · (1 - b_i) · log(1 - p_i) )
where T represents the number of time nodes of the video, p_i is the predicted probability at the i-th time node, g_i is its ground-truth value, and b_i = sign(g_i - γ) is a binarization function that converts g_i from [0,1] to {0,1}, with γ set to 0.5; letting n+ = Σ b_i and n- = T - n+, then
α+ = T / n+, α- = T / n-
L_PEM is constructed as follows:
L_PEM = L_s(M_CC, G_C) + λ·L_R(M_CR, G_C)
wherein M_CC and M_CR are the two BM confidence maps predicted by the PEM, G_C is the ground truth of M_CC and M_CR, L_R is the L2 regression loss function, and λ is set to 10.
Further, the specific process of step 6 is as follows: the slow depth feature sequence and the fast depth feature sequence are input into the BM preprocessing module, which applies a different preprocessing process to each of them; each preprocessing process comprises three convolutional layers, feature fusion is performed after the second convolutional layer to obtain the PEM fusion feature sequence, and a second feature fusion is performed after the third convolutional layer to obtain the TEM fusion feature sequence; the whole BM module can be expressed as the following process:
assuming the slow feature sequence and the fast feature sequence input to the BM are denoted sf1 and ff1 respectively, sf1 passes through the two convolutional layers conv1d11 and conv1d12 to obtain the depth feature sequence sf2, whose construction is expressed as follows:
sf2=Fconv1d12(Fconv1d11(sf1))
wherein F represents the convolutional layer operation, and the lower right corner of the F symbol represents the convolutional layer name;
ff1 passes through the two convolutional layers conv1d21 and conv1d22 to obtain the depth feature sequence ff2, whose construction is expressed as follows:
ff2=Fconv1d22(Fconv1d21(ff1))
sf2 and ff2 are summed to obtain the PEM fusion feature sequence, denoted pemf, whose construction is shown below:
pemf=sf2+ff2
sf2, ff2 and pemf pass through the convolutional layers conv1d13, conv1d23 and conv1d33 respectively to obtain the new feature sequences Fconv1d13(sf2), Fconv1d23(ff2) and Fconv1d33(pemf); the three new feature sequences are averaged to obtain the final TEM fusion feature sequence, denoted temf, whose construction is expressed as follows:
temf = ( Fconv1d13(sf2) + Fconv1d23(ff2) + Fconv1d33(pemf) ) / 3
further, the specific process of step 7 is as follows: for each behavior proposal, designing a PFG layer sampling method, sampling 8 points from a proposed start time region, sampling 8 points from a proposed end time region, sampling 16 points from a proposed duration region, sampling 32 points altogether, generating a proposed feature sequence for each behavior proposal, obtaining a TEM proposed feature sequence after PFG layer sampling of a TEM fusion feature sequence, and obtaining a PEM proposed feature sequence after PFG layer sampling of a PEM fusion feature sequence;the sampling process of the PFG layer is as follows: first, for each behavior proposal
Figure BDA0003109940120000041
Wherein t issIndicating a proposed start time, teRepresenting the proposed end time, from the left time region r by linear interpolations=[ts-dg/k,ts+dg/k]Middle sampling 8 points, time region r from righte=[te-dg/k,te+dg/k]Middle sampling 8 points, from the middle region ra=[ts,te]Middle sampling 16 points, where dg=te-tsK is 5; then, the behavior proposal using these 32 sampling points
Figure BDA0003109940120000042
Generating an offer feature, assuming an offer
Figure BDA0003109940120000043
The generated proposal features that
Figure BDA0003109940120000044
All T behavior proposals are generated with the feature fpThe input characteristic of the PFG layer is finWherein
Figure BDA0003109940120000045
Has a dimension of NxC, fpHas dimensions of T × T × N × C, finThe dimension of (1) is T multiplied by C, C represents the number of feature channels, and the specific proposed feature construction process is as follows:
Figure BDA0003109940120000046
where n denotes the nth sample point,
Figure BDA0003109940120000047
presenting offer features
Figure BDA0003109940120000048
The value at the coordinates (n, c),
Figure BDA0003109940120000049
representing input features finAt the coordinate (t)lThe value in c),
Figure BDA00031099401200000410
representing input features finAt the coordinate (t)rThe value in c), wlTo represent
Figure BDA00031099401200000411
Weight of (1), wrRepresentation of
Figure BDA00031099401200000412
Weight of (1), tl、wl、tr、wrThe construction of (a) is represented as follows:
Figure BDA00031099401200000413
Figure BDA00031099401200000414
tr=1+tl
wr=1-wl
wherein, set Nl=Nr=8,Nc=16,N=Nl+Nr+NcSince the start time of a behavior proposal cannot be later than the end time, 32, if a proposal is made
Figure BDA00031099401200000415
T in (1)s≥teThe proposed features of the proposal need to be combined
Figure BDA00031099401200000416
Is set to 0.
Further, the specific process of step 8 is as follows: the PEM module takes the PEM proposal feature sequence as input and, after the PEM module, outputs a T × T × 2 BM confidence map, i.e., two T × T BM confidence maps, denoted M_CC ∈ R^(T×T) and M_CR ∈ R^(T×T); the TEM module takes the TEM proposal feature sequence as input and, after the TEM module, outputs a T × 2 boundary likelihood sequence, i.e., two T × 1 boundary likelihood sequences, denoted P_S = {p^s_i}_{i=1}^{T} and P_E = {p^e_i}_{i=1}^{T}, where p^s_i represents the likelihood probability that the i-th time node is a behavior proposal start time and p^e_i represents the likelihood probability that the i-th time node is a behavior proposal end time.
Further, the specific process of step 9 is as follows: confidence fusion is performed on the four outputs generated by the BMNPlus network, P_S, P_E, M_CC and M_CR, and the Soft-NMS algorithm is then used to screen all candidate proposals and generate the final behavior proposals, specifically:
9.1 From P_S, select time nodes t_n to form a new set B_S = {t_n}, i = 1, ..., N_S, where N_S represents the number of time nodes finally selected; each selected t_n must satisfy
p^s_{t_n} > 0.5 · max_{k} p^s_{t_k}
where max denotes the maximum operation, k is taken from 1 to T, p^s_{t_n} represents the likelihood probability that time node t_n is a behavior proposal start time, and p^s_{t_k} represents the likelihood probability that the k-th time node is a behavior proposal start time. Similarly, from P_E, select time nodes t_m to form a new set B_E = {t_m}, j = 1, ..., N_E, where N_E represents the number of time nodes finally selected; each selected t_m must satisfy
p^e_{t_m} > 0.5 · max_{k} p^e_{t_k}
where p^e_{t_m} represents the likelihood probability that time node t_m is a behavior proposal end time and p^e_{t_k} represents the likelihood probability that the k-th time node is a behavior proposal end time.
9.2 Select a time node t_s from B_S as the start time and a time node t_e from B_E as the end time to construct a behavior proposal, recorded as φ = (t_s, t_e, p^s_{t_s}, p^e_{t_e}, p_cc, p_cr), where t_s and t_e must satisfy t_s < t_e, p^s_{t_s} represents the likelihood probability that time node t_s is a behavior proposal start time, p^e_{t_e} represents the likelihood probability that time node t_e is a behavior proposal end time, p_cc denotes the value of the M_CC confidence map at coordinate (t_e - t_s, t_s), and p_cr denotes the value of the M_CR confidence map at coordinate (t_e - t_s, t_s); N_p candidate proposals are finally obtained.
9.3 For each candidate proposal φ = (t_s, t_e, p^s_{t_s}, p^e_{t_e}, p_cc, p_cr), confidence fusion is performed to obtain the fused confidence p_f in the following way:
p_f = p^s_{t_s} · p^e_{t_e} · sqrt(p_cc · p_cr)
After fusion, each candidate proposal can be represented as φ = (t_s, t_e, p_f).
9.4 The Soft-NMS algorithm is used to screen the candidate proposals and generate the final behavior proposal set.
The method comprises the steps of extracting slow and fast features of the video, predicting a boundary likelihood sequence and a boundary matching confidence map, and finally generating behavior proposals. After an original uncut video is input, the slow and fast features of the video, which respectively represent its spatial information and temporal information, are extracted by the SlowFast model; the features are then input into BMNPlus, which predicts two boundary likelihood sequences P_S and P_E and two boundary matching confidence maps M_CC and M_CR; finally, the final behavior proposals are generated from these prediction results.
The invention has the following advantages: 1) the depth feature extraction network SlowFast designed by the invention is divided into a slow channel and a fast channel, both of which take original video frames as input without computing extra optical flow information, which saves a large amount of computation time and storage cost and is therefore more efficient; 2) the method applies different preprocessing processes to the extracted slow and fast features and fuses them at different stages, yielding more reasonable preprocessed feature sequences; 3) for proposal feature generation, the invention designs the PFG layer, a more accurate sampling and proposal-feature computation scheme that makes full use of the start time region, end time region and duration region of each behavior proposal, improving the quality of the generated behavior proposals and making the video segmentation more accurate.
Drawings
FIG. 1 is an overall flow chart of the present invention.
Fig. 2 is a flow chart of the depth feature extraction network SlowFast of the present invention.
Fig. 3 is a detailed view of the 3D ResNet-50 network of the present invention.
Fig. 4 is a flow chart of the proposed generation network BMNPlus of the present invention.
Fig. 5 is a detailed diagram of the BM preprocessing module of the present invention.
Detailed Description
The invention is explained in detail below with reference to the drawings, and the specific steps are as follows.
Example 1:
step 1: and constructing and designing a Slowfast neural network, designing the Slowfast neural network into a slow channel and a fast channel, and using 3D-Resnet50 as a backbone network for the two channels. The 3D-Resnet50 comprises 1 convolutional layer conv3D1, 1 pooling layer pool, 3 res1 residual blocks, 4 res2 residual blocks, 6 res3 residual blocks, 3 res4 residual blocks and 16 residual blocks in total, wherein each residual block is formed by stacking 3 three-dimensional convolutional layers by using a bottleeck design mode, and the total number of the network is 50. In the calculation of each convolution layer, a batch normalization operation is used, and each residual block calculates an activation mapping value in a pre-activation mode. The whole 3D-Resnet50 network is input as a video frame sequence 224 × 224 × T, and output as an extracted video feature sequence T × C, where T represents the number of input video frames, i.e., the time dimension, and C represents the number of feature channels finally extracted for each frame. For a slow channel, sampling an original video by using a low frame rate for an input video frame sequence, sampling one frame every 16 frames, namely if the original video has f frames, sampling to obtain f/16 frames as input, and setting the number C of characteristic channels to be 2048; for the fast channel, the input video frame sequence samples the original video with a high frame rate, one frame is sampled every 2 frames, that is, if the original video has f frames, f/2 frames are obtained by sampling as input, and the number C of the feature channels is set to 256. And carrying out convergence training on the constructed Slowfast network on a Kinetics-600 data set, wherein a Cross Engine Loss is adopted as a Loss function, and finally a depth feature extraction model is obtained.
Step 2: inputting each video sample of the activityNet data set into a depth feature extraction model, and extracting to obtain depth feature data corresponding to each sample. The resolution of the original video samples is first scaled to 224 x 224. Then, sampling one frame of the original video every 16 frames, inputting the sampled video frame sequence into a slow channel, and extracting to obtain a slow depth characteristic sequence; sampling one frame of original video every 2 frames, inputting the sampled video frame sequence into a fast channel, and extracting to obtain a fast depth feature sequence. Therefore, for each video sample in the ActivityNet dataset, after passing through the depth feature extraction model, a slow depth feature sequence and a fast depth feature sequence are respectively obtained, and finally, all samples and the corresponding feature sequences thereof form the depth feature dataset of ActivityNet.
Step 3: construct the BMNPlus proposal generation network and design the loss function for training. The whole BMNPlus network is divided into a BM preprocessing module, a TEM module and a PEM module, wherein the BM preprocesses the depth features, the TEM generates the boundary likelihood sequences, and the PEM generates the boundary matching confidence maps. The loss function Loss of BMNPlus is designed to consist of three parts: the loss L_TEM of the boundary likelihood sequences, the loss L_PEM of the boundary matching confidence maps, and an L2 regularization term. The composition of the loss function Loss is expressed as follows:
Loss = L_TEM + λ1·L_PEM + λ2·L2(θ)
wherein L_TEM represents the loss of the TEM-generated boundary likelihood sequences and constrains the likelihood of the proposal start and end times, L_PEM represents the loss of the BM confidence maps generated by the PEM and constrains the confidence that each behavior proposal is correct, and L2(θ) represents the L2 regularization term that prevents overfitting, with λ1 set to 1 and λ2 set to 0.0001. L_TEM is constructed as follows:
L_TEM = L_s(P_S, G_S) + L_s(P_E, G_E)
wherein P_S represents the predicted start likelihood sequence, G_S represents the ground truth of the start likelihood sequence, P_E represents the predicted end likelihood sequence, G_E represents the ground truth of the end likelihood sequence, and L_s is constructed as follows:
L_s = -(1/T) · Σ_{i=1}^{T} ( α+ · b_i · log(p_i) + α- · (1 - b_i) · log(1 - p_i) )
where T represents the time dimension of the video, i.e., the number of time nodes, p_i is the predicted probability at the i-th time node, g_i is its ground-truth value, and b_i = sign(g_i - γ) is a binarization function that converts g_i from [0,1] to {0,1}, with γ set to 0.5. Letting n+ = Σ b_i and n- = T - n+, then
α+ = T / n+, α- = T / n-
L_PEM is constructed as follows:
L_PEM = L_s(M_CC, G_C) + λ·L_R(M_CR, G_C)
wherein M_CC and M_CR are the two BM confidence maps predicted by the PEM, G_C is the ground truth of M_CC and M_CR, L_R is the L2 regression loss function, and λ is set to 10.
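The composite loss can be sketched in PyTorch as follows. This is a minimal sketch under the assumptions stated above: the weighted binary logistic form of L_s is a reconstruction consistent with the surrounding definitions, and taking the L2 regression loss L_R as a mean-squared-error term is an assumption.

import torch
import torch.nn.functional as F

def weighted_logistic_loss(p, g, gamma=0.5, eps=1e-6):
    # Sketch of L_s: p and g are tensors of predicted probabilities and
    # ground-truth values in [0, 1] with T elements each.
    b = (g > gamma).float()                   # b_i: binarized ground truth
    t = p.numel()
    n_pos = b.sum().clamp(min=1.0)            # n+
    n_neg = (t - n_pos).clamp(min=1.0)        # n-
    alpha_pos, alpha_neg = t / n_pos, t / n_neg
    return -(alpha_pos * b * torch.log(p + eps)
             + alpha_neg * (1 - b) * torch.log(1 - p + eps)).mean()

def bmnplus_loss(p_s, p_e, m_cc, m_cr, g_s, g_e, g_c, model_params,
                 lam=10.0, lam1=1.0, lam2=1e-4):
    # Loss = L_TEM + lam1 * L_PEM + lam2 * L2(theta), as described above.
    l_tem = weighted_logistic_loss(p_s, g_s) + weighted_logistic_loss(p_e, g_e)
    l_pem = (weighted_logistic_loss(m_cc.flatten(), g_c.flatten())
             + lam * F.mse_loss(m_cr, g_c))   # L_R taken as mean-squared error (assumption)
    l2 = sum((w ** 2).sum() for w in model_params)
    return l_tem + lam1 * l_pem + lam2 * l2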
In order to realize supervised training of the model, the original ground-truth labels of the ActivityNet dataset need to be converted into the three ground-truth labels G_C, G_S and G_E required by the BMNPlus network. Assume the original ground-truth annotation of each video sample of the ActivityNet dataset is the set {φ_k = (t_s, t_e)}, k = 1, 2, 3, ..., n, where t_s represents the start time of a behavior proposal and t_e represents the end time of the behavior proposal. For each behavior proposal φ_k = (t_s, t_e), the duration of the behavior d_g = t_e - t_s is calculated first, then the start time region r_s = [t_s - d_g/5, t_s + d_g/5] and the end time region r_e = [t_e - d_g/5, t_e + d_g/5] of the behavior are calculated, finally yielding n start time regions r_s and n end time regions r_e.
Ground-truth generation for P_S and P_E, i.e., generation of G_S and G_E: for the time node t_i among the T time nodes, calculate its time span r_i = [t_i - d_f/2, t_i + d_f/2], where d_f = t_i - t_{i-1}. Take the maximum IOR value of r_i with the n regions r_s as g^s_i, i.e., g^s_i represents the ground truth of the likelihood probability that time node t_i is a behavior proposal start time; take the maximum IOR value of r_i with the n regions r_e as g^e_i, i.e., g^e_i represents the ground truth of the likelihood probability that time node t_i is a behavior proposal end time. Here IOR is defined as the overlap ratio: in the generation of G_S, IOR is the length of the intersection of r_i and r_s divided by the length of the region r_i; in the generation of G_E, IOR is the length of the intersection of r_i and r_e divided by the length of the region r_i.
Ground-truth generation for M_CC and M_CR, i.e., generation of G_C: M_CC and M_CR are BM confidence maps and both use the same ground-truth label G_C. For the value of point (d, t) on G_C, i.e., the behavior proposal with start time t and duration d, take the maximum overlap of this proposal with the ground-truth proposals φ_k = (t_s, t_e) as the value of point (d, t).
Based on the above BMNPlus model, loss function and ground-truth labels, BMNPlus is trained on the ActivityNet depth feature dataset of step 2 until convergence, finally yielding the behavior proposal generation model.
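The G_S / G_E label generation just described can be sketched as follows. The uniform spacing of time nodes assumed here (so that d_f is constant) and the helper names are illustrative only.

import numpy as np

def max_ior(region, anchors):
    # Maximum IOR of `region` with each anchor region: intersection length
    # divided by the length of `region`, as defined above.
    lo = np.maximum(region[0], anchors[:, 0])
    hi = np.minimum(region[1], anchors[:, 1])
    inter = np.clip(hi - lo, 0.0, None)
    return float(np.max(inter / (region[1] - region[0] + 1e-8)))

def boundary_labels(gt, t_nodes):
    # Sketch of G_S / G_E generation; gt is an (n, 2) array of (t_s, t_e) pairs
    # and t_nodes holds the (assumed uniformly spaced) time-node positions.
    d_g = gt[:, 1] - gt[:, 0]
    r_s = np.stack([gt[:, 0] - d_g / 5, gt[:, 0] + d_g / 5], axis=1)  # start regions
    r_e = np.stack([gt[:, 1] - d_g / 5, gt[:, 1] + d_g / 5], axis=1)  # end regions
    d_f = t_nodes[1] - t_nodes[0]                                     # node spacing
    g_s = np.array([max_ior((t - d_f / 2, t + d_f / 2), r_s) for t in t_nodes])
    g_e = np.array([max_ior((t - d_f / 2, t + d_f / 2), r_e) for t in t_nodes])
    return g_s, g_e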
Step 4: sample one frame every 16 frames of the original video to obtain the low-frame-rate sampled video, and sample one frame every 2 frames of the original video to obtain the high-frame-rate sampled video; assuming the original video has f frames, the low-frame-rate sampling yields a video frame sequence of f/16 frames and the high-frame-rate sampling yields a video frame sequence of f/2 frames.
Step 5: input the low-frame-rate sampled video obtained in step 4 into the slow channel of the depth feature extraction model to obtain the slow depth feature sequence of the original video, with dimension (f/16) × 2048, and input the high-frame-rate sampled video obtained in step 4 into the fast channel of the depth feature extraction model to obtain the fast depth feature sequence of the original video, with dimension (f/2) × 256. In order to input the fast depth feature sequence into BMNPlus for behavior proposal feature generation, average sampling along the time dimension is also needed to map the fast features to the same time dimension as the slow features, so the fast feature sequence finally obtained from the fast channel is (f/16) × 256. Letting T = f/16, the slow depth feature sequence has dimension T × 2048 and the fast depth feature sequence has dimension T × 256.
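Steps 4 and 5 can be sketched as follows; the frame layout (f, 3, 224, 224) and the use of adaptive average pooling for the temporal averaging are assumptions for illustration, not details prescribed above.

import torch
import torch.nn.functional as F

def two_rate_sampling(frames):
    # Step 4: frames is assumed to be an (f, 3, 224, 224) tensor of video frames.
    slow_clip = frames[::16]   # one frame every 16 -> f/16 frames for the slow channel
    fast_clip = frames[::2]    # one frame every 2  -> f/2 frames for the fast channel
    return slow_clip, fast_clip

def align_fast_features(fast_feat, target_t):
    # Average the (f/2, 256) fast feature sequence along time so that it has the
    # same temporal length T = f/16 as the slow features.
    x = fast_feat.t().unsqueeze(0)            # (1, 256, f/2)
    x = F.adaptive_avg_pool1d(x, target_t)    # (1, 256, T)
    return x.squeeze(0).t()                   # (T, 256)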
Step 6: input the slow depth feature sequence and the fast depth feature sequence into the BM preprocessing module; the BM applies a different three-layer convolutional preprocessing process to each of them, a first feature fusion is performed after the second convolutional layer to obtain the PEM fusion feature sequence, and a second feature fusion is performed after the third convolutional layer in combination with the PEM fusion feature sequence to obtain the TEM fusion feature sequence. The whole BM module can be expressed as the following process:
assuming that the slow feature sequence and the fast feature sequence input by the BM are respectively denoted as sf1 and ff1, sf1 obtains a depth feature sequence sf2 after passing through two convolution layers of conv1d11 and conv1d12, where conv1d11 uses a one-dimensional convolution kernel size of 3, the number of feature channels is designed to be 256, conv1d12 uses a convolution kernel size of 3, the number of feature channels is designed to be 128, and the structure of sf2 is expressed as follows:
sf2=Fconv1d12(Fconv1d11(sf1))
where F represents the convolutional layer operation and the lower right hand corner of the F symbol represents the particular convolutional layer name.
ff1 passes through the two convolutional layers conv1d21 and conv1d22 to obtain the depth feature sequence ff2, wherein conv1d21 uses a one-dimensional convolution kernel of size 3 with 256 feature channels and conv1d22 uses a one-dimensional convolution kernel of size 3 with 128 feature channels; the construction of ff2 is expressed as follows:
ff2=Fconv1d22(Fconv1d21(ff1))
sf2 and ff2 are summed to obtain the PEM fusion feature sequence, denoted pemf, whose construction is shown below:
pemf=sf2+ff2
sf2, ff2 and pemf pass through the convolutional layers conv1d13, conv1d23 and conv1d33 respectively to obtain the new feature sequences Fconv1d13(sf2), Fconv1d23(ff2) and Fconv1d33(pemf), wherein conv1d13, conv1d23 and conv1d33 all use a one-dimensional convolution kernel of size 1 and the number of feature channels is set to 1 for each. The three new feature sequences are averaged to obtain the final TEM fusion feature sequence, denoted temf, whose construction is expressed as follows:
temf = ( Fconv1d13(sf2) + Fconv1d23(ff2) + Fconv1d33(pemf) ) / 3
After BM preprocessing, a PEM fusion feature sequence of dimension T × 128 and a TEM fusion feature sequence of dimension T × 1 are obtained.
Step 7: for each behavior proposal, the PFG layer samples 8 points from the proposal start time region, 8 points from the proposal end time region and 16 points from the proposal duration region, 32 points in total, to generate the proposal feature of each behavior proposal.
First, for each behavior proposal φ = (t_s, t_e), where t_s denotes the proposal start time and t_e denotes the proposal end time, 8 points are sampled by linear interpolation from the left time region r_s = [t_s - d_g/k, t_s + d_g/k], 8 points from the right time region r_e = [t_e - d_g/k, t_e + d_g/k] and 16 points from the middle region r_a = [t_s, t_e], where d_g = t_e - t_s and k = 5. Then these 32 sampling points are used to generate the proposal feature of the proposal φ = (t_s, t_e). Assume the proposal feature generated for proposal φ = (t_s, t_e) is f^p_φ, the features generated for all candidate behavior proposals form f_p, and the input feature of the PFG layer is f_in, where f^p_φ has dimension N × C, f_p has dimension T × T × N × C and f_in has dimension T × C, with C denoting the number of feature channels. The specific proposal feature construction process is as follows:
f^p_φ[n, c] = w_l · f_in[t_l, c] + w_r · f_in[t_r, c]
where n denotes the n-th sampling point, f^p_φ[n, c] denotes the value of the proposal feature f^p_φ at coordinate (n, c), f_in[t_l, c] denotes the value of the input feature f_in at coordinate (t_l, c), f_in[t_r, c] denotes the value of the input feature f_in at coordinate (t_r, c), w_l denotes the weight of f_in[t_l, c], and w_r denotes the weight of f_in[t_r, c]. With x_n denoting the position of the n-th sampling point in its sampling region, t_l, w_l, t_r and w_r are constructed as follows:
t_l = floor(x_n)
w_l = 1 - (x_n - t_l)
t_r = 1 + t_l
w_r = 1 - w_l
wherein N_l = N_r = 8, N_c = 16, and N = N_l + N_r + N_c = 32. Since the start time of a behavior proposal cannot be later than its end time, if a proposal φ = (t_s, t_e) has t_s ≥ t_e, the proposal feature f^p_φ of that proposal is set to 0.
After PFG layer sampling, the T × 1 TEM fusion feature sequence of step 7 yields a T × 32 TEM proposal feature sequence, and the T × 128 PEM fusion feature sequence yields a T × T × 32 × 128 PEM proposal feature sequence.
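A minimal NumPy sketch of the PFG sampling for a single proposal is given below; the uniform placement of sampling positions within each region (via linspace) is an assumption, since the text only states the number of points per region.

import numpy as np

def pfg_sample_positions(t_s, t_e, n_l=8, n_c=16, n_r=8, k=5):
    # Sampling positions of the PFG layer for one proposal (t_s, t_e): n_l points
    # in the start region, n_c in the duration region, n_r in the end region.
    d_g = t_e - t_s
    left = np.linspace(t_s - d_g / k, t_s + d_g / k, n_l)
    mid = np.linspace(t_s, t_e, n_c)
    right = np.linspace(t_e - d_g / k, t_e + d_g / k, n_r)
    return np.concatenate([left, mid, right])              # N = 32 positions

def pfg_sample(f_in, t_s, t_e):
    # Linear-interpolation sampling: f_in is a (T, C) feature sequence; returns the
    # (N, C) proposal feature, zeroed when t_s >= t_e as described above.
    t_len = f_in.shape[0]
    positions = pfg_sample_positions(t_s, t_e)
    if t_s >= t_e:
        return np.zeros((positions.size, f_in.shape[1]), dtype=f_in.dtype)
    t_l = np.clip(np.floor(positions).astype(int), 0, t_len - 1)
    t_r = np.clip(t_l + 1, 0, t_len - 1)
    w_l = (1.0 - (positions - np.floor(positions)))[:, None]
    w_r = 1.0 - w_l
    return w_l * f_in[t_l] + w_r * f_in[t_r]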
Step 8: input the T × T × 32 × 128 PEM proposal feature sequence into the PEM module and output a T × T × 2 BM confidence map, i.e., two T × T BM confidence maps, denoted M_CC ∈ R^(T×T) and M_CR ∈ R^(T×T). The PEM module consists of three convolutional layers: conv3d uses a three-dimensional convolution kernel of size 1 × 1 × 32 with the number of feature channels set to 512, conv2d2 uses a two-dimensional convolution kernel of size 1 × 1 with the number of feature channels set to 256, and conv2d3 uses a two-dimensional convolution kernel of size 1 × 1 with the number of feature channels set to 2. Input the T × 32 TEM proposal feature sequence into the TEM module and output a T × 2 boundary likelihood sequence, i.e., two T × 1 boundary likelihood sequences, denoted P_S = {p^s_i}_{i=1}^{T} and P_E = {p^e_i}_{i=1}^{T}. The TEM consists of one feature compression operation and two convolutional layers: the squeeze operation averages the second dimension of the T × 32 TEM proposal feature sequence to compress the feature information, conv1d1 uses a one-dimensional convolution kernel of size 1 with the number of feature channels set to 256, and conv1d2 uses a one-dimensional convolution kernel of size 3 with the number of feature channels set to 2.
Step 9: from the four outputs generated by the BMNPlus network, namely the two boundary likelihood sequences P_S and P_E and the two BM confidence maps M_CC and M_CR, fused confidences are generated, and the Soft-NMS algorithm then screens all behavior proposals. Specifically, the method comprises the following steps:
9.1 From P_S, select time nodes t_n to form a new set B_S = {t_n}, i = 1, ..., N_S, where N_S represents the number of time nodes finally selected; each selected t_n must satisfy
p^s_{t_n} > 0.5 · max_{k} p^s_{t_k}
where max denotes the maximum operation, k is taken from 1 to T, p^s_{t_n} represents the likelihood probability that time node t_n is a behavior proposal start time, and p^s_{t_k} represents the likelihood probability that the k-th time node is a behavior proposal start time. Similarly, from P_E, select time nodes t_m to form a new set B_E = {t_m}, j = 1, ..., N_E, where N_E represents the number of time nodes finally selected; each selected t_m must satisfy
p^e_{t_m} > 0.5 · max_{k} p^e_{t_k}
where p^e_{t_m} represents the likelihood probability that time node t_m is a behavior proposal end time and p^e_{t_k} represents the likelihood probability that the k-th time node is a behavior proposal end time.
9.2 Select a time node t_s from B_S as the start time and a time node t_e from B_E as the end time to construct a behavior proposal, recorded as φ = (t_s, t_e, p^s_{t_s}, p^e_{t_e}, p_cc, p_cr), where t_s and t_e must satisfy t_s < t_e, p^s_{t_s} represents the likelihood probability that time node t_s is a behavior proposal start time, p^e_{t_e} represents the likelihood probability that time node t_e is a behavior proposal end time, p_cc denotes the value of the M_CC confidence map at coordinate (t_e - t_s, t_s), and p_cr denotes the value of the M_CR confidence map at coordinate (t_e - t_s, t_s); N_p candidate proposals are finally obtained.
9.3 For each candidate proposal φ = (t_s, t_e, p^s_{t_s}, p^e_{t_e}, p_cc, p_cr), confidence fusion is performed to obtain the fused confidence p_f in the following way:
p_f = p^s_{t_s} · p^e_{t_e} · sqrt(p_cc · p_cr)
After fusion, each candidate proposal can be represented as φ = (t_s, t_e, p_f).
9.4 The Soft-NMS algorithm is used to screen the candidate proposals and generate the final behavior proposal set.
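Steps 9.1 to 9.4 can be sketched as follows; the 0.5 · max selection threshold, the fusion formula and the Gaussian Soft-NMS parameters are assumptions consistent with the description above rather than text quoted from it.

import numpy as np

def fuse_proposals(p_s, p_e, m_cc, m_cr, thresh=0.5):
    # Steps 9.1-9.3: p_s and p_e are length-T boundary likelihood arrays, m_cc and
    # m_cr are T x T confidence maps indexed as [duration, start].
    t = len(p_s)
    starts = [i for i in range(t) if p_s[i] > thresh * p_s.max()]
    ends = [j for j in range(t) if p_e[j] > thresh * p_e.max()]
    proposals = []
    for ts in starts:
        for te in ends:
            if ts < te:
                p_f = p_s[ts] * p_e[te] * np.sqrt(m_cc[te - ts, ts] * m_cr[te - ts, ts])
                proposals.append((ts, te, float(p_f)))
    return proposals

def soft_nms(proposals, sigma=0.4, top_k=100):
    # Step 9.4: Gaussian Soft-NMS over (t_s, t_e, score) triples (standard sketch).
    props = sorted(proposals, key=lambda p: p[2], reverse=True)
    kept = []
    while props and len(kept) < top_k:
        best = props.pop(0)
        kept.append(best)
        rescored = []
        for ts, te, s in props:
            inter = max(0.0, min(best[1], te) - max(best[0], ts))
            union = (best[1] - best[0]) + (te - ts) - inter
            iou = inter / union if union > 0 else 0.0
            rescored.append((ts, te, s * np.exp(-(iou ** 2) / sigma)))  # decay by overlap
        props = sorted(rescored, key=lambda p: p[2], reverse=True)
    return kept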
It should be noted that the above-mentioned embodiment is only a preferred embodiment of the present invention and is not intended to limit the scope of the present invention; all equivalent replacements or substitutions made on the basis of the above-mentioned technical solution belong to the scope of the present invention.

Claims (6)

1. A method of behavior proposal generation for video behavior detection, the method comprising the steps of:
step 1: constructing and designing a SlowFast neural network, designing the SlowFast neural network into a slow channel and a fast channel, wherein each channel adopts 3DResnet-50 as a main network, and training the SlowFast network on a Kinetics-600 data set until convergence to obtain a SlowFast depth feature extraction model;
step 2: performing feature extraction on an activityNet data set by using the SlowFast depth feature extraction model trained in the step 1 to obtain an activityNet depth feature data set;
step 3: constructing and designing a BMNPlus neural network and a dedicated loss function, and training the BMNPlus network on the ActivityNet depth feature dataset of step 2 until convergence to obtain a behavior proposal generation model;
step 4: sampling the original uncut video at two different frame rates to obtain a low-frame-rate sampled video and a high-frame-rate sampled video respectively;
step 5: inputting the low-frame-rate sampled video of step 4 into the slow channel of step 1 to obtain a slow depth feature sequence, and inputting the high-frame-rate sampled video of step 4 into the fast channel of step 1 to obtain a fast depth feature sequence;
step 6: preprocessing the slow depth feature sequence and the fast depth feature sequence of step 5 with two different sets of three convolutional layers, fusing them after the second convolutional layer to obtain a PEM fusion feature sequence, and fusing them a second time together with the PEM fusion feature sequence after the third convolutional layer to obtain a TEM fusion feature sequence;
step 7: designing a PFG layer to sample the TEM fusion feature sequence and the PEM fusion feature sequence respectively, sampling 8 points each in the start time region and the end time region and 16 points in the duration region, to generate a TEM proposal feature sequence and a PEM proposal feature sequence respectively;
step 8: inputting the TEM proposal feature sequence of step 7 into the TEM and outputting a boundary likelihood sequence, and inputting the PEM proposal feature sequence of step 7 into the PEM and outputting a boundary matching confidence map;
step 9: combining the boundary likelihood sequence and the boundary matching confidence map of step 8 to generate a fused confidence for each behavior proposal, and screening the candidate behavior proposals with the Soft-NMS algorithm to generate the final behavior proposals.
2. The behavior proposal generation method for video behavior detection according to claim 1, characterized in that the specific process of step 3 is as follows: constructing a BMNPlus network for behavior proposal generation, dividing the whole BMNPlus network into three modules, BM, TEM and PEM, designing a loss function, and training and tuning the network on the ActivityNet depth feature dataset to obtain a converged model, wherein the loss function Loss is designed as follows:
Loss = L_TEM + λ1·L_PEM + λ2·L2(θ)
wherein L_TEM represents the loss of the TEM-generated boundary likelihood sequences and constrains the likelihood of each time node being a proposal start or end time point, L_PEM represents the loss of the PEM-generated boundary matching confidence maps and constrains the confidence score of each behavior proposal, L2(θ) represents the L2 regularization term which prevents model overfitting, λ1 is set to 1 and λ2 is set to 0.0001; L_TEM is constructed as follows:
L_TEM = L_s(P_S, G_S) + L_s(P_E, G_E)
wherein P_S represents the predicted start likelihood sequence, G_S represents the ground truth of the start likelihood sequence, P_E represents the predicted end likelihood sequence, G_E represents the ground truth of the end likelihood sequence, and L_s is constructed as follows:
L_s = -(1/T) · Σ_{i=1}^{T} ( α+ · b_i · log(p_i) + α- · (1 - b_i) · log(1 - p_i) )
where T represents the number of time nodes of the video, p_i is the predicted probability at the i-th time node, g_i is its ground-truth value, and b_i = sign(g_i - γ) is a binarization function that converts g_i from [0,1] to {0,1}, with γ set to 0.5; letting n+ = Σ b_i and n- = T - n+, then
α+ = T / n+, α- = T / n-
L_PEM is constructed as follows:
L_PEM = L_s(M_CC, G_C) + λ·L_R(M_CR, G_C)
wherein M_CC and M_CR are the two BM confidence maps predicted by the PEM, G_C is the ground truth of M_CC and M_CR, L_R is the L2 regression loss function, and λ is set to 10.
3. The behavior proposal generation method for video behavior detection according to claim 1, characterized in that the specific process of step 6 is as follows: the slow depth feature sequence and the fast depth feature sequence are input into the BM preprocessing module, which applies a different preprocessing process to each of them; each preprocessing process comprises three convolutional layers, feature fusion is performed after the second convolutional layer to obtain the PEM fusion feature sequence, and a second feature fusion is performed after the third convolutional layer to obtain the TEM fusion feature sequence; the whole BM module can be expressed as the following process:
assuming the slow feature sequence and the fast feature sequence input to the BM are denoted sf1 and ff1 respectively, sf1 passes through the two convolutional layers conv1d11 and conv1d12 to obtain the depth feature sequence sf2, whose construction is expressed as follows:
sf2=Fconv1d12(Fconv1d11(sf1))
wherein F represents the convolutional layer operation, and the lower right corner of the F symbol represents the convolutional layer name;
ff1 passes through the two convolutional layers conv1d21 and conv1d22 to obtain the depth feature sequence ff2, whose construction is expressed as follows:
ff2=Fconv1d22(Fconv1d21(ff1))
sf2 and ff2 are summed to obtain the PEM fusion feature sequence, denoted pemf, whose construction is shown below:
pemf=sf2+ff2
sf2, ff2 and pemf pass through the convolutional layers conv1d13, conv1d23 and conv1d33 respectively to obtain the new feature sequences Fconv1d13(sf2), Fconv1d23(ff2) and Fconv1d33(pemf); the three new feature sequences are averaged to obtain the final TEM fusion feature sequence, denoted temf, whose construction is expressed as follows:
temf = ( Fconv1d13(sf2) + Fconv1d23(ff2) + Fconv1d33(pemf) ) / 3
4. The behavior proposal generation method for video behavior detection according to claim 1, characterized in that the specific process of step 7 is as follows: for each behavior proposal, a PFG layer sampling method is designed, which samples 8 points from the proposal start time region, 8 points from the proposal end time region and 16 points from the proposal duration region, 32 points in total, to obtain a proposal feature sequence for each behavior proposal; the TEM fusion feature sequence yields the TEM proposal feature sequence after PFG layer sampling, and the PEM fusion feature sequence yields the PEM proposal feature sequence after PFG layer sampling; the sampling process of the PFG layer is as follows: first, for each behavior proposal φ = (t_s, t_e), where t_s denotes the proposal start time and t_e denotes the proposal end time, 8 points are sampled by linear interpolation from the left time region r_s = [t_s - d_g/k, t_s + d_g/k], 8 points from the right time region r_e = [t_e - d_g/k, t_e + d_g/k] and 16 points from the middle region r_a = [t_s, t_e], where d_g = t_e - t_s and k = 5; then these 32 sampling points are used to generate the proposal feature of the behavior proposal φ = (t_s, t_e); assume the proposal feature generated for proposal φ = (t_s, t_e) is f^p_φ, the features generated for all candidate behavior proposals form f_p, and the input feature of the PFG layer is f_in, where f^p_φ has dimension N × C, f_p has dimension T × T × N × C and f_in has dimension T × C, with C denoting the number of feature channels; the specific proposal feature construction process is as follows:
f^p_φ[n, c] = w_l · f_in[t_l, c] + w_r · f_in[t_r, c]
where n denotes the n-th sampling point, f^p_φ[n, c] denotes the value of the proposal feature f^p_φ at coordinate (n, c), f_in[t_l, c] denotes the value of the input feature f_in at coordinate (t_l, c), f_in[t_r, c] denotes the value of the input feature f_in at coordinate (t_r, c), w_l denotes the weight of f_in[t_l, c], and w_r denotes the weight of f_in[t_r, c]; with x_n denoting the position of the n-th sampling point in its sampling region, t_l, w_l, t_r and w_r are constructed as follows:
t_l = floor(x_n)
w_l = 1 - (x_n - t_l)
t_r = 1 + t_l
w_r = 1 - w_l
wherein N_l = N_r = 8, N_c = 16, and N = N_l + N_r + N_c = 32; since the start time of a behavior proposal cannot be later than its end time, if a proposal φ = (t_s, t_e) has t_s ≥ t_e, the proposal feature f^p_φ of that proposal is set to 0.
5. The behavior proposal generation method for video behavior detection according to claim 1, characterized in that the specific process of step 8 is as follows: the PEM module takes the PEM proposal feature sequence as input and, after the PEM module, outputs a T × T × 2 BM confidence map, i.e., two T × T BM confidence maps, denoted M_CC ∈ R^(T×T) and M_CR ∈ R^(T×T); the TEM module takes the TEM proposal feature sequence as input and, after the TEM module, outputs a T × 2 boundary likelihood sequence, i.e., two T × 1 boundary likelihood sequences, denoted P_S = {p^s_i}_{i=1}^{T} and P_E = {p^e_i}_{i=1}^{T}, where p^s_i represents the likelihood probability that the i-th time node is a behavior proposal start time and p^e_i represents the likelihood probability that the i-th time node is a behavior proposal end time.
6. The behavior proposal generation method for video behavior detection according to claim 1, characterized in that the specific process of step 9 is as follows: confidence fusion is performed on the four outputs generated by the BMNPlus network, P_S, P_E, M_CC and M_CR, and the Soft-NMS algorithm is then used to screen all candidate proposals and generate the final behavior proposals, specifically:
9.1 From P_S, select time nodes t_n to form a new set B_S = {t_n}, i = 1, ..., N_S, where N_S represents the number of time nodes finally selected; each selected t_n must satisfy
p^s_{t_n} > 0.5 · max_{k} p^s_{t_k}
where max denotes the maximum operation, k is taken from 1 to T, p^s_{t_n} represents the likelihood probability that time node t_n is a behavior proposal start time, and p^s_{t_k} represents the likelihood probability that the k-th time node is a behavior proposal start time; similarly, from P_E, select time nodes t_m to form a new set B_E = {t_m}, j = 1, ..., N_E, where N_E represents the number of time nodes finally selected; each selected t_m must satisfy
p^e_{t_m} > 0.5 · max_{k} p^e_{t_k}
where p^e_{t_m} represents the likelihood probability that time node t_m is a behavior proposal end time and p^e_{t_k} represents the likelihood probability that the k-th time node is a behavior proposal end time;
9.2 select a time node t_s from B_S as the start time and a time node t_e from B_E as the end time to construct a behavior proposal, recorded as φ = (t_s, t_e, p^s_{t_s}, p^e_{t_e}, p_cc, p_cr), where t_s and t_e must satisfy t_s < t_e, p^s_{t_s} represents the likelihood probability that time node t_s is a behavior proposal start time, p^e_{t_e} represents the likelihood probability that time node t_e is a behavior proposal end time, p_cc denotes the value of the M_CC confidence map at coordinate (t_e - t_s, t_s), and p_cr denotes the value of the M_CR confidence map at coordinate (t_e - t_s, t_s); N_p candidate proposals are finally obtained;
9.3 for each candidate proposal φ = (t_s, t_e, p^s_{t_s}, p^e_{t_e}, p_cc, p_cr), confidence fusion is performed to obtain the fused confidence p_f in the following way:
p_f = p^s_{t_s} · p^e_{t_e} · sqrt(p_cc · p_cr)
after fusion, each candidate proposal can be represented as φ = (t_s, t_e, p_f);
9.4 the Soft-NMS algorithm is used to screen the candidate proposals and generate the final behavior proposal set.
CN202110647905.8A 2021-06-10 2021-06-10 Behavior proposal generation method for video behavior detection Active CN113298017B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110647905.8A CN113298017B (en) 2021-06-10 2021-06-10 Behavior proposal generation method for video behavior detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110647905.8A CN113298017B (en) 2021-06-10 2021-06-10 Behavior proposal generation method for video behavior detection

Publications (2)

Publication Number Publication Date
CN113298017A true CN113298017A (en) 2021-08-24
CN113298017B CN113298017B (en) 2024-04-23

Family

ID=77327868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110647905.8A Active CN113298017B (en) 2021-06-10 2021-06-10 Behavior proposal generation method for video behavior detection

Country Status (1)

Country Link
CN (1) CN113298017B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627410A (en) * 2021-10-14 2021-11-09 江苏奥斯汀光电科技股份有限公司 Method for recognizing and retrieving action semantics in video

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784269A (en) * 2019-01-11 2019-05-21 中国石油大学(华东) One kind is based on the united human action detection of space-time and localization method
CN109919122A (en) * 2019-03-18 2019-06-21 中国石油大学(华东) A kind of timing behavioral value method based on 3D human body key point

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784269A (en) * 2019-01-11 2019-05-21 中国石油大学(华东) One kind is based on the united human action detection of space-time and localization method
CN109919122A (en) * 2019-03-18 2019-06-21 中国石油大学(华东) A kind of timing behavioral value method based on 3D human body key point

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627410A (en) * 2021-10-14 2021-11-09 江苏奥斯汀光电科技股份有限公司 Method for recognizing and retrieving action semantics in video
CN113627410B (en) * 2021-10-14 2022-03-18 江苏奥斯汀光电科技股份有限公司 Method for recognizing and retrieving action semantics in video

Also Published As

Publication number Publication date
CN113298017B (en) 2024-04-23

Similar Documents

Publication Publication Date Title
CN110781838B (en) Multi-mode track prediction method for pedestrians in complex scene
CN112001339B (en) Pedestrian social distance real-time monitoring method based on YOLO v4
CN110516536B (en) Weak supervision video behavior detection method based on time sequence class activation graph complementation
CN108230291B (en) Object recognition system training method, object recognition method, device and electronic equipment
CN111931602B (en) Attention mechanism-based multi-flow segmented network human body action recognition method and system
CN109740419A (en) A kind of video behavior recognition methods based on Attention-LSTM network
CN110188654B (en) Video behavior identification method based on mobile uncut network
CN112446342B (en) Key frame recognition model training method, recognition method and device
CN109977895B (en) Wild animal video target detection method based on multi-feature map fusion
CN111539290B (en) Video motion recognition method and device, electronic equipment and storage medium
CN110110648B (en) Action nomination method based on visual perception and artificial intelligence
CN112801063B (en) Neural network system and image crowd counting method based on neural network system
CN112200096B (en) Method, device and storage medium for realizing real-time abnormal behavior identification based on compressed video
Suratkar et al. Employing transfer-learning based CNN architectures to enhance the generalizability of deepfake detection
Lin et al. Joint learning of local and global context for temporal action proposal generation
CN111914731A (en) Multi-mode LSTM video motion prediction method based on self-attention mechanism
CN116168329A (en) Video motion detection method, equipment and medium based on key frame screening pixel block
US20230154139A1 (en) Systems and methods for contrastive pretraining with video tracking supervision
CN115293986A (en) Multi-temporal remote sensing image cloud region reconstruction method
CN113298017A (en) Behavior proposal generation method for video behavior detection
CN113569758A (en) Time sequence action positioning method, system, equipment and medium based on action triple guidance
CN116189281B (en) End-to-end human behavior classification method and system based on space-time self-adaptive fusion
CN112528077A (en) Video face retrieval method and system based on video embedding
CN114120076B (en) Cross-view video gait recognition method based on gait motion estimation
CN111209886A (en) Rapid pedestrian re-identification method based on deep neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant