CN113298017A - Behavior proposal generation method for video behavior detection - Google Patents
Behavior proposal generation method for video behavior detection
- Publication number: CN113298017A
- Application number: CN202110647905.8A
- Authority
- CN
- China
- Prior art keywords
- proposal
- behavior
- sequence
- time
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06V20/42—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a behavior proposal generation method for video behavior detection. In the feature extraction stage, the spatial information and temporal information of a video are extracted by a slow channel and a fast channel, respectively. In the behavior proposal generation stage, the extracted spatial and temporal information is first preprocessed by different pipelines and fused at two different stages; a PFG layer then samples each behavior proposal to generate proposal features, which are passed to the TEM and the PEM to predict a boundary likelihood sequence and a boundary-matching confidence map, respectively; finally, confidence fusion is performed on the prediction results to generate candidate behavior proposals, which are screened with the Soft-NMS algorithm. The method can generate behavior proposals for an original uncut video, segment the video clips that contain behaviors, and localize the start time and end time of each behavior.
Description
Technical Field
The invention relates to a behavior proposal generation method, in particular to a behavior proposal generation method for video behavior detection, and belongs to the fields of image processing and computer vision.
Background
With the development of the information age, short-video apps such as Douyin (TikTok) and Kuaishou are increasingly popular, generating a large amount of video data, and the demand for video behavior detection grows accordingly. In video surveillance, behavior detection judges whether abnormal situations such as violence or fighting appear in a video, so that potentially dangerous behaviors are detected in real time and warnings are sent to supervisory personnel. In automatic driving, behavior detection on the objects in the images captured by the vehicle predicts their next motion trajectories, so that a safe and reliable driving route can be planned, pedestrians are avoided, and driving safety improves. In sports commentary, behavior detection on the athletes in a game, such as three-pointers, blocks and steals in basketball, makes real-time automated commentary possible. Within video behavior detection, behavior proposal generation is the most critical technology: it localizes the segments of a video where behaviors are likely to occur, removes the noise segments of the original uncut video, and divides it into video clips containing only behaviors.
The currently mainstream behavior proposal generation methods comprise two processes: first, feature extraction on the original uncut video, usually with a two-stream convolutional neural network; however, the two-stream network must compute optical flow between consecutive video frames as input, which costs a large amount of computation time and optical-flow storage and is therefore very inefficient; second, proposal generation from the extracted depth features, for which existing methods are not yet mature.
The existing behavior proposal generation method mainly faces the following difficulties:
1. Temporal nature of video: unlike images, which contain only spatial information, video also requires attention to temporal information.
2. Computational complexity: a video is a stack of frame images, and most current algorithms either perform complex optical flow computation or process the temporal dimension of the video with three-dimensional convolution kernels. Optical flow computation is a complex process requiring a large amount of computation time, while three-dimensional convolution kernels add a temporal dimension that greatly increases the number of network parameters, placing higher demands on computer hardware.
3. Proposal generation: research on proposal generation is still scarce; most methods evolve from image object detection algorithms and achieve unsatisfactory results, so reasonable network prediction outputs and a proposal generation method that is as accurate as possible need to be designed.
Disclosure of Invention
Aiming at the problems and difficulties above, the invention provides a behavior proposal generation method for video behavior detection which, given an original uncut video, can generate behavior proposals, remove the noise segments of the video, segment the video clips containing behaviors, and localize the start time and end time of each behavior in the video.
In order to achieve the purpose, the technical scheme of the invention is as follows:
A method of behavior proposal generation for video behavior detection, the method comprising the steps of:
Step 1: construct a SlowFast neural network with a slow channel and a fast channel, each channel using 3D ResNet-50 as its backbone network; train the SlowFast network on the Kinetics-600 dataset until convergence to obtain a SlowFast depth feature extraction model;
Step 2: perform feature extraction on the ActivityNet dataset with the SlowFast depth feature extraction model trained in step 1 to obtain an ActivityNet depth feature dataset;
Step 3: construct a BMNPlus neural network and a dedicated loss function, and train the BMNPlus network on the ActivityNet depth feature dataset of step 2 until convergence to obtain a behavior proposal generation model;
Step 4: sample the original uncut video at two different frame rates to obtain a low-frame-rate sampled video and a high-frame-rate sampled video;
Step 5: input the low-frame-rate sampled video of step 4 into the slow channel of step 1 to obtain a slow depth feature sequence, and the high-frame-rate sampled video of step 4 into the fast channel of step 1 to obtain a fast depth feature sequence;
Step 6: preprocess the slow and fast depth feature sequences of step 5 with two different stacks of three convolution layers; fuse them after the second convolution layer to obtain the PEM fusion feature sequence, and fuse a second time after the third convolution layer to obtain the TEM fusion feature sequence;
Step 7: design a PFG layer to sample the TEM fusion feature sequence and the PEM fusion feature sequence: 8 points each in the start time region and the end time region, and 16 points in the duration region, generating the TEM proposal feature sequence and the PEM proposal feature sequence respectively;
Step 8: input the TEM proposal feature sequence of step 7 into the TEM to output a boundary likelihood sequence, and the PEM proposal feature sequence of step 7 into the PEM to output a boundary-matching confidence map;
Step 9: combine the boundary likelihood sequence and the boundary-matching confidence map of step 8 to generate a fusion confidence for each behavior proposal, and screen the candidate behavior proposals with the Soft-NMS algorithm to generate the final behavior proposals.
Further, the specific process of step 3 is as follows: construct a BMNPlus network for behavior proposal generation, divided into three modules, BM, TEM and PEM; design the loss function; then train and tune the network on the ActivityNet depth feature dataset to obtain a converged model. The loss function Loss is designed as:

Loss = L_TEM + λ1·L_PEM + λ2·L2(θ)

where L_TEM denotes the loss of the boundary likelihood sequences generated by the TEM, constraining the likelihood of each time node being a proposal start or end point; L_PEM denotes the loss of the boundary-matching confidence maps generated by the PEM, constraining the confidence score of each behavior proposal; L2(θ) is the L2 regularization term, preventing model overfitting; λ1 is set to 1 and λ2 to 0.0001. L_TEM is constructed as follows:

L_TEM = L_s(P_S, G_S) + L_s(P_E, G_E)

where P_S is the predicted start likelihood sequence and G_S its ground truth, and P_E is the predicted end likelihood sequence and G_E its ground truth. L_s is the weighted binary logistic regression loss:

L_s(P, G) = −(1/T)·Σ_{i=1..T} [ (T/n+)·b_i·log(p_i) + (T/n−)·(1−b_i)·log(1−p_i) ]

where T is the number of time nodes of the video, b_i = sign(g_i − γ) is a binarization converting g_i from [0,1] to {0,1} with γ set to 0.5, n+ = Σ b_i and n− = T − n+. L_PEM is constructed as follows:

L_PEM = L_s(M_CC, G_C) + λ·L_R(M_CR, G_C)

where M_CC and M_CR are the two BM confidence maps predicted by the PEM, G_C is the ground truth shared by M_CC and M_CR, L_R is the L2 regression loss function, and λ is set to 10.
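For illustration (not part of the claims), the weighted logistic loss L_s above can be sketched in a few lines of numpy; the function name, the explicit negative sign and the eps clamp are assumptions of this sketch:

```python
import numpy as np

def weighted_logistic_loss(pred, gt, gamma=0.5, eps=1e-8):
    """L_s(P, G): weighted binary logistic regression loss over T time nodes.

    pred, gt: arrays of shape (T,) with values in [0, 1].
    b_i = sign(g_i - gamma) binarizes the ground truth; the positive and
    negative terms are re-weighted by T/n+ and T/n- to balance the classes.
    """
    T = len(pred)
    b = (gt > gamma).astype(np.float64)     # b_i in {0, 1}
    n_pos = max(b.sum(), 1.0)               # n+
    n_neg = max(T - b.sum(), 1.0)           # n-
    terms = (T / n_pos) * b * np.log(pred + eps) \
          + (T / n_neg) * (1.0 - b) * np.log(1.0 - pred + eps)
    return -terms.sum() / T
```

Confident correct predictions incur a small loss, uninformative predictions a larger one.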
Further, the specific process of step 6 is as follows: input the slow depth feature sequence and the fast depth feature sequence into the BM preprocessing module, which applies a different preprocessing pipeline of three convolution layers to each; feature fusion is performed after the second convolution layer to obtain the PEM fusion feature sequence, and a second feature fusion after the third convolution layer to obtain the TEM fusion feature sequence. The whole BM module can be expressed as the following process:

Let the slow feature sequence and fast feature sequence input to the BM be denoted sf1 and ff1, respectively. sf1 passes through the conv1d11 and conv1d12 convolution layers to give the depth feature sequence sf2, constructed as:

sf2 = F_conv1d12(F_conv1d11(sf1))

where F denotes a convolution layer operation and the subscript of F gives the convolution layer name;

ff1 passes through the conv1d21 and conv1d22 convolution layers to give the depth feature sequence ff2, constructed as:

ff2 = F_conv1d22(F_conv1d21(ff1))

sf2 and ff2 are summed to obtain the PEM fusion feature sequence, denoted pemf:

pemf = sf2 + ff2

sf2, ff2 and pemf pass through the conv1d13, conv1d23 and conv1d33 convolution layers respectively, yielding the new feature sequences F_conv1d13(sf2), F_conv1d23(ff2) and F_conv1d33(pemf); averaging these three sequences gives the final TEM fusion feature sequence, denoted temf:

temf = (F_conv1d13(sf2) + F_conv1d23(ff2) + F_conv1d33(pemf)) / 3
further, the specific process of step 7 is as follows: for each behavior proposal, designing a PFG layer sampling method, sampling 8 points from a proposed start time region, sampling 8 points from a proposed end time region, sampling 16 points from a proposed duration region, sampling 32 points altogether, generating a proposed feature sequence for each behavior proposal, obtaining a TEM proposed feature sequence after PFG layer sampling of a TEM fusion feature sequence, and obtaining a PEM proposed feature sequence after PFG layer sampling of a PEM fusion feature sequence;the sampling process of the PFG layer is as follows: first, for each behavior proposalWherein t issIndicating a proposed start time, teRepresenting the proposed end time, from the left time region r by linear interpolations=[ts-dg/k,ts+dg/k]Middle sampling 8 points, time region r from righte=[te-dg/k,te+dg/k]Middle sampling 8 points, from the middle region ra=[ts,te]Middle sampling 16 points, where dg=te-tsK is 5; then, the behavior proposal using these 32 sampling pointsGenerating an offer feature, assuming an offerThe generated proposal features thatAll T behavior proposals are generated with the feature fpThe input characteristic of the PFG layer is finWhereinHas a dimension of NxC, fpHas dimensions of T × T × N × C, finThe dimension of (1) is T multiplied by C, C represents the number of feature channels, and the specific proposed feature construction process is as follows:
where n denotes the nth sample point,presenting offer featuresThe value at the coordinates (n, c),representing input features finAt the coordinate (t)lThe value in c),representing input features finAt the coordinate (t)rThe value in c), wlTo representWeight of (1), wrRepresentation ofWeight of (1), tl、wl、tr、wrThe construction of (a) is represented as follows:
tr=1+tl
wr=1-wl
wherein, set Nl=Nr=8,Nc=16,N=Nl+Nr+NcSince the start time of a behavior proposal cannot be later than the end time, 32, if a proposal is madeT in (1)s≥teThe proposed features of the proposal need to be combinedIs set to 0.
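For illustration, the PFG-layer sampling above can be sketched in numpy; placing the sampling points with np.linspace and clipping them to [0, T−1] are assumptions of this sketch:

```python
import numpy as np

def pfg_sample(f_in, t_s, t_e, n_l=8, n_c=16, n_r=8, k=5.0):
    """PFG-layer sampling for one proposal (t_s, t_e).

    f_in: input feature sequence of shape (T, C).
    Returns a proposal feature of shape (N, C), N = n_l + n_c + n_r = 32,
    by linear interpolation over the start, duration and end regions.
    """
    T, C = f_in.shape
    if t_s >= t_e:                      # invalid proposal -> zero feature
        return np.zeros((n_l + n_c + n_r, C))
    d_g = t_e - t_s
    xs = np.concatenate([
        np.linspace(t_s - d_g / k, t_s + d_g / k, n_l),  # start region r_s
        np.linspace(t_s, t_e, n_c),                      # duration region r_a
        np.linspace(t_e - d_g / k, t_e + d_g / k, n_r),  # end region r_e
    ])
    xs = np.clip(xs, 0, T - 1)
    t_l = np.floor(xs).astype(int)      # left neighboring time node
    t_r = np.minimum(t_l + 1, T - 1)    # right neighboring time node
    w_l = 1.0 - (xs - t_l)              # linear-interpolation weights
    w_r = 1.0 - w_l
    return w_l[:, None] * f_in[t_l] + w_r[:, None] * f_in[t_r]
```

On a linear feature ramp the interpolation reproduces the sampled positions exactly, which gives a simple sanity check.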
Further, the specific process of step 8 is as follows: the PEM module takes the PEM proposal feature sequence as input and, after passing through the PEM, outputs a T×T×2 BM confidence map, i.e. two T×T BM confidence maps, denoted M_CC ∈ R^(T×T) and M_CR ∈ R^(T×T); the TEM module takes the TEM proposal feature sequence as input and, after passing through the TEM, outputs a T×2 boundary likelihood sequence, i.e. two T×1 boundary likelihood sequences, denoted P_S and P_E, where P_S(i) represents the likelihood that the i-th time node is a behavior proposal start time and P_E(i) the likelihood that it is a behavior proposal end time.
Further, the specific process of step 9 is as follows: perform confidence fusion on the four outputs generated by the BMNPlus network, P_S, P_E, M_CC and M_CR, and then screen all candidate proposals with the Soft-NMS algorithm to generate the final behavior proposals. Specifically:

9.1 From P_S, select time nodes t_n to form a new set B_S = {t_n | n = 1, …, N_S}, where N_S is the number of selected nodes; a selected node t_n must satisfy p_S(t_n) > 0.5·max_{k=1..T} p_S(t_k), where max denotes the maximum operation, p_S(t_n) is the likelihood that time node t_n is a behavior proposal start time and p_S(t_k) that of the k-th time node. Similarly, from P_E select time nodes t_m to form a new set B_E = {t_m | m = 1, …, N_E}, where N_E is the number of selected nodes and a selected node must satisfy p_E(t_m) > 0.5·max_{k=1..T} p_E(t_k), with p_E(t_m) the likelihood that time node t_m is a behavior proposal end time.

9.2 Select a time node t_s from B_S as a start time and a time node t_e from B_E as an end time, with t_s < t_e, and construct a behavior proposal φ = (t_s, t_e, p_S(t_s), p_E(t_e), p_cc, p_cr), where p_S(t_s) is the likelihood of t_s being a proposal start time, p_E(t_e) the likelihood of t_e being a proposal end time, p_cc the value of the M_CC confidence map at coordinates (t_e − t_s, t_s), and p_cr the value of the M_CR confidence map at the same coordinates. This finally yields N_p candidate proposals.

9.3 For each candidate proposal φ, perform confidence fusion, combining the boundary likelihoods p_S(t_s) and p_E(t_e) with the matching confidences p_cc and p_cr, to obtain the fusion confidence p_f.

9.4 Screen the candidate proposals with the Soft-NMS algorithm to generate the final behavior proposal set.
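For illustration, steps 9.3 and 9.4 can be sketched in numpy. The exact fusion formula of BMNPlus is not reproduced in the text above; the product-with-square-root form below follows the related BMN method and is an assumption, as are all function names:

```python
import numpy as np

def fuse_confidence(p_s, p_e, p_cc, p_cr):
    """Fusion confidence p_f for one candidate proposal (assumed form)."""
    return p_s * p_e * np.sqrt(p_cc * p_cr)

def soft_nms(proposals, sigma=0.5, score_floor=0.001, top_k=100):
    """Soft-NMS over (t_s, t_e, score) triples: instead of discarding
    proposals that overlap a selected one, their scores are decayed by a
    Gaussian of the temporal IoU with the selected proposal."""
    props = [list(p) for p in proposals]
    keep = []
    while props and len(keep) < top_k:
        best = max(range(len(props)), key=lambda i: props[i][2])
        ts, te, sc = props.pop(best)
        if sc < score_floor:
            break
        keep.append((ts, te, sc))
        for p in props:  # decay the scores of overlapping proposals
            inter = max(0.0, min(te, p[1]) - max(ts, p[0]))
            union = (te - ts) + (p[1] - p[0]) - inter
            iou = inter / union if union > 0 else 0.0
            p[2] *= np.exp(-(iou ** 2) / sigma)
    return keep
```

A near-duplicate of the top proposal is decayed below a distant, lower-scored one, which is the behavior that distinguishes Soft-NMS from hard NMS.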
The method first extracts the slow and fast features of the video, then predicts the boundary likelihood sequences and boundary-matching confidence maps, and finally generates the behavior proposals. After an original uncut video is input, its slow and fast features, representing the spatial and temporal information of the video respectively, are extracted by the SlowFast model; these features are then input into BMNPlus to predict two boundary likelihood sequences P_S and P_E and two boundary-matching confidence maps M_CC and M_CR, from which the final behavior proposals are generated.
The invention has the following advantages: 1) the depth feature extraction network SlowFast designed by the invention is divided into a slow channel and a fast channel, both of which take raw video frames as input without computing extra optical flow information, saving a large amount of computation time and storage cost and achieving higher efficiency; 2) the method applies different preprocessing pipelines to the extracted slow and fast features and fuses them at different stages, obtaining a more reasonable preprocessed feature sequence; 3) for proposal feature generation, the invention designs a more accurate sampling and proposal feature computation scheme, the PFG layer, which makes full use of the start time region, end time region and duration region of each behavior proposal, improving the quality of the generated behavior proposals and segmenting the video more accurately.
Drawings
FIG. 1 is an overall flow chart of the present invention.
Fig. 2 is a flow chart of the depth feature extraction network SlowFast of the present invention.
Fig. 3 is a detailed view of the 3D ResNet-50 network of the present invention.
Fig. 4 is a flow chart of the proposed generation network BMNPlus of the present invention.
Fig. 5 is a detailed diagram of the BM preprocessing module of the present invention.
Detailed Description
The invention is explained in detail below with reference to the drawings, and the specific steps are as follows.
Example 1:
step 1: and constructing and designing a Slowfast neural network, designing the Slowfast neural network into a slow channel and a fast channel, and using 3D-Resnet50 as a backbone network for the two channels. The 3D-Resnet50 comprises 1 convolutional layer conv3D1, 1 pooling layer pool, 3 res1 residual blocks, 4 res2 residual blocks, 6 res3 residual blocks, 3 res4 residual blocks and 16 residual blocks in total, wherein each residual block is formed by stacking 3 three-dimensional convolutional layers by using a bottleeck design mode, and the total number of the network is 50. In the calculation of each convolution layer, a batch normalization operation is used, and each residual block calculates an activation mapping value in a pre-activation mode. The whole 3D-Resnet50 network is input as a video frame sequence 224 × 224 × T, and output as an extracted video feature sequence T × C, where T represents the number of input video frames, i.e., the time dimension, and C represents the number of feature channels finally extracted for each frame. For a slow channel, sampling an original video by using a low frame rate for an input video frame sequence, sampling one frame every 16 frames, namely if the original video has f frames, sampling to obtain f/16 frames as input, and setting the number C of characteristic channels to be 2048; for the fast channel, the input video frame sequence samples the original video with a high frame rate, one frame is sampled every 2 frames, that is, if the original video has f frames, f/2 frames are obtained by sampling as input, and the number C of the feature channels is set to 256. And carrying out convergence training on the constructed Slowfast network on a Kinetics-600 data set, wherein a Cross Engine Loss is adopted as a Loss function, and finally a depth feature extraction model is obtained.
Step 2: inputting each video sample of the activityNet data set into a depth feature extraction model, and extracting to obtain depth feature data corresponding to each sample. The resolution of the original video samples is first scaled to 224 x 224. Then, sampling one frame of the original video every 16 frames, inputting the sampled video frame sequence into a slow channel, and extracting to obtain a slow depth characteristic sequence; sampling one frame of original video every 2 frames, inputting the sampled video frame sequence into a fast channel, and extracting to obtain a fast depth feature sequence. Therefore, for each video sample in the ActivityNet dataset, after passing through the depth feature extraction model, a slow depth feature sequence and a fast depth feature sequence are respectively obtained, and finally, all samples and the corresponding feature sequences thereof form the depth feature dataset of ActivityNet.
And step 3: the loss functions for designing the BMNPlus proposed generating network and training were constructed. The whole BMNPlus network is divided into a BM preprocessing module, a TEM module and a PEM module, wherein the BM is used for preprocessing depth features, the TEM is used for generating a boundary possibility sequence, and the PEM is used for generating a boundary matching confidence map. The Loss function Loss for BMNPlus is designed to consist of three parts: l isTEMLoss of boundary likelihood sequence, LPEMLoss of boundary matching confidence maps, L2 regularization term. The composition of the Loss of Loss function is expressed as follows:
Loss=LTEM+λ1·LPEM+λ2·L2(θ)
wherein L isTEMRepresenting the loss of TEM-generated boundary likelihood sequences, constraining the likelihood of proposed start and end times, LPEMTo representLoss of the BM confidence map generated by the PEM, constrains the probability of each behavior proposal being correct, L2(θ) represents the L2 regularization term to prevent overfitting. Wherein λ1Is set to 1, lambda2Set to 0.0001. L isTEMThe constitution of (a) is as follows:
LTEM=Ls(Ps,Gs)+Ls(PE,GE)
wherein P isSPrediction value representing a starting possibility sequence, GSGroup-truth, P, representing a start probability sequenceEPredicted value representing ending probability sequence, GEGroup-truth, L indicating a sequence of end possibilitiessThe constitution of (a) is as follows:
where T represents the time dimension of the video, i.e. the number of frames, bi=sign(gi- γ) is a binary function of giFrom [0,1 ]]Convert to {0,1}, and set γ to 0.5. Suppose n is+=∑bi,n-=T-n+Then, thenLPEMThe constitution of (a) is as follows:
LpEM=Ls(Mcc,Cc)+λLR(MCR,CC)
wherein M isCCAnd MCRTwo BM confidence maps for PEM prediction, GCIs MCCAnd MCRGroup-route, LRλ is set to 10 for the L2 regression loss function.
To enable supervised training of the model, the original ground-truth labels of the ActivityNet dataset need to be converted into the three ground-truth labels required by the BMNPlus network: G_C, G_S and G_E. Assume the original ground-truth label of each video sample of the ActivityNet dataset is a set of annotated behavior proposals Ψ = {φ_k = (t_s, t_e)}, k = 1, 2, …, n, where t_s denotes the start time of a behavior proposal and t_e its end time. For each behavior proposal φ_k, first compute the behavior duration d_g = t_e − t_s, then the start time region r_s = [t_s − d_g/5, t_s + d_g/5] and the end time region r_e = [t_e − d_g/5, t_e + d_g/5], finally obtaining n start time regions r_s and n end time regions r_e.
Generation of the ground truth of P_S and P_E, i.e. G_S and G_E: for each time node t_i among the T time nodes, compute its span r_i = [t_i − d_f/2, t_i + d_f/2], where d_f = t_i − t_{i−1}. Take the maximum IoR value between r_i and the n regions r_s as G_S(i), the ground truth of the likelihood that time node t_i is a behavior proposal start time; take the maximum IoR value between r_i and the n regions r_e as G_E(i), the ground truth of the likelihood that t_i is a behavior proposal end time. Here IoR is defined as an overlap ratio: in generating G_S, IoR is the intersection of r_i and r_s divided by the length of r_i; in generating G_E, IoR is the intersection of r_i and r_e divided by the length of r_i.
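For illustration, the IoR-based generation of G_S and G_E can be sketched in numpy, assuming uniformly spaced time nodes; the function names are illustrative:

```python
import numpy as np

def ior(region, anchor):
    """IoR: length of the intersection of region and anchor, divided by
    the length of anchor (here the anchor is the time-node span r_i)."""
    inter = max(0.0, min(region[1], anchor[1]) - max(region[0], anchor[0]))
    return inter / (anchor[1] - anchor[0])

def boundary_ground_truth(nodes, gt_proposals, k=5.0):
    """G_S / G_E sketch: for each time node, the max IoR of its span
    against the start (resp. end) regions of the annotated proposals."""
    d_f = nodes[1] - nodes[0]                 # uniform node spacing
    g_s, g_e = [], []
    for t in nodes:
        r_i = (t - d_f / 2, t + d_f / 2)      # span of this time node
        s_vals, e_vals = [0.0], [0.0]
        for ts, te in gt_proposals:
            d_g = te - ts
            s_vals.append(ior((ts - d_g / k, ts + d_g / k), r_i))
            e_vals.append(ior((te - d_g / k, te + d_g / k), r_i))
        g_s.append(max(s_vals))
        g_e.append(max(e_vals))
    return np.array(g_s), np.array(g_e)
```

Nodes whose span lies entirely inside a boundary region receive label 1, nodes far from any boundary receive 0.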
Generation of the ground truth of M_CC and M_CR, i.e. G_C: M_CC and M_CR are BM confidence maps and share the same ground-truth label G_C. For the point (d, t) on G_C, i.e. the behavior proposal starting at time node t with duration d, take the maximum IoU between this proposal and the annotated proposals in Ψ as the value of point (d, t).
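For illustration, the generation of G_C can be sketched in numpy; scoring each point (d, t) by its maximum IoU against the annotated proposals is assumed here (as in the related BMN method), and the discrete duration convention (d + 1 nodes) is an illustrative choice:

```python
import numpy as np

def iou_1d(a, b):
    """Temporal IoU of two intervals a = (start, end), b = (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def bm_label_map(T, gt_proposals):
    """G_C sketch: a T x T map where entry (d, t) scores the proposal
    starting at node t with duration d + 1 nodes by its best IoU
    against the annotated proposals."""
    g_c = np.zeros((T, T))
    for d in range(T):
        for t in range(T):
            if t + d + 1 <= T:    # proposal must end within the video
                prop = (float(t), float(t + d + 1))
                g_c[d, t] = max((iou_1d(prop, g) for g in gt_proposals),
                                default=0.0)
    return g_c
```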
Based on the BMNPlus model, the loss function and the ground-truth labels above, train BMNPlus on the ActivityNet depth feature dataset of step 2 until convergence, finally obtaining the behavior proposal generation model.
Step 4: sample the original video one frame every 16 frames to obtain the low-frame-rate sampled video, and one frame every 2 frames to obtain the high-frame-rate sampled video. Assuming the original video has f frames, low-frame-rate sampling yields a video frame sequence of f/16 frames and high-frame-rate sampling a sequence of f/2 frames.
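For illustration, the dual-frame-rate sampling described above reduces to strided slicing over the frame axis:

```python
import numpy as np

def dual_rate_sample(frames, slow_stride=16, fast_stride=2):
    """Sample a frame array of shape (f, H, W, 3) at the two rates used by
    the slow and fast channels: every 16th frame and every 2nd frame."""
    return frames[::slow_stride], frames[::fast_stride]
```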
Step 5: input the low-frame-rate sampled video of step 4 into the slow channel of the depth feature extraction model to obtain the slow depth feature sequence of the original video, of dimension (f/16) × 2048; input the high-frame-rate sampled video of step 4 into the fast channel to obtain the fast depth feature sequence, of dimension (f/2) × 256. To feed the fast depth feature sequence into BMNPlus for proposal feature generation, average sampling along the time dimension is also needed to map the fast features onto the same time dimension as the slow features, so the fast feature sequence finally produced by the fast channel is (f/16) × 256. Writing T = f/16, the slow depth feature sequence has dimension T × 2048 and the fast depth feature sequence T × 256.
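For illustration, the "time dimension average sampling" that maps the fast features from f/2 to f/16 time steps can be read as average pooling over every 8 consecutive steps; that reading is an assumption of this sketch:

```python
import numpy as np

def align_fast_features(fast_feat, ratio=8):
    """Map the fast feature sequence (f/2 steps) onto the slow time axis
    (f/16 steps) by averaging every `ratio` = 8 consecutive steps.

    fast_feat: shape (T_fast, C), with T_fast divisible by ratio.
    Returns shape (T_fast // ratio, C)."""
    t, c = fast_feat.shape
    return fast_feat.reshape(t // ratio, ratio, c).mean(axis=1)
```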
Step 6: inputting the slow depth characteristic sequence and the fast depth characteristic sequence into a BM preprocessing module, respectively carrying out different three-layer convolution layer preprocessing processes on the slow depth characteristic sequence and the fast depth characteristic sequence by the BM, carrying out first characteristic fusion after the two convolution layer preprocessing to obtain a PEM fusion characteristic sequence, and carrying out second characteristic fusion by combining the PEM fusion diagram characteristic sequence after the three convolution layer preprocessing to obtain a TEM fusion characteristic sequence. The whole BM module can be represented as the following process:
Assume the slow feature sequence and fast feature sequence input to the BM module are denoted sf1 and ff1 respectively. sf1 passes through the two convolution layers conv1d11 and conv1d12 to obtain the depth feature sequence sf2, where conv1d11 uses a one-dimensional convolution kernel of size 3 with the number of feature channels set to 256, and conv1d12 uses a kernel of size 3 with the number of feature channels set to 128. The construction of sf2 is expressed as follows:
sf2=Fconv1d12(Fconv1d11(sf1))
where F denotes a convolutional layer operation and the subscript of F denotes the name of the particular convolutional layer.
ff1 passes through the two convolution layers conv1d21 and conv1d22 to obtain the depth feature sequence ff2, where conv1d21 uses a one-dimensional convolution kernel of size 3 with 256 feature channels and conv1d22 uses a one-dimensional kernel of size 3 with 128 feature channels. The construction of ff2 is expressed as follows:
ff2=Fconv1d22(Fconv1d21(ff1))
The sum of sf2 and ff2 gives the PEM fusion feature sequence, denoted pemf; the construction of pemf is shown below:
pemf=sf2+ff2
sf2, ff2 and pemf pass through the convolution layers conv1d13, conv1d23 and conv1d33 respectively to obtain the new feature sequences Fconv1d13(sf2), Fconv1d23(ff2) and Fconv1d33(pemf), where conv1d13, conv1d23 and conv1d33 all use a one-dimensional convolution kernel of size 1 with the number of feature channels set to 1. The three new feature sequences are averaged to obtain the final TEM fusion feature sequence, denoted temf, whose construction is expressed as follows:

temf = (Fconv1d13(sf2) + Fconv1d23(ff2) + Fconv1d33(pemf)) / 3

After BM preprocessing, a PEM fusion feature sequence of dimension T × 128 and a TEM fusion feature sequence of dimension T × 1 are obtained.
Step 7: for each action proposal, a PFG layer is used to sample 8 points from the start-time region of the proposal, 8 points from the end-time region, and 16 points from the duration region, 32 points in total, to generate a proposal feature for each action proposal.
First, for each behavior proposal φ = (t_s, t_e), where t_s denotes the proposal start time and t_e the proposal end time, linear interpolation is used to sample 8 points from the left time region r_s = [t_s - d_g/k, t_s + d_g/k], 8 points from the right time region r_e = [t_e - d_g/k, t_e + d_g/k], and 16 points from the middle region r_a = [t_s, t_e], where d_g = t_e - t_s and k = 5. These 32 sampling points are then used to generate a proposal feature for the proposal. Suppose the proposal feature generated for proposal φ is f_φ; the features generated for all candidate proposals together form f_p, and the input feature of the PFG layer is f_in, where f_φ has dimension N × C, f_p has dimension T × T × N × C, and f_in has dimension T × C, with C the number of feature channels. The specific proposal feature construction process is as follows:
f_φ(n, c) = w_l · f_in(t_l, c) + w_r · f_in(t_r, c)

where n denotes the n-th sampling point, located at the (generally non-integer) time position t_n; f_φ(n, c) is the value of the proposal feature at coordinate (n, c); f_in(t_l, c) and f_in(t_r, c) are the values of the input feature f_in at coordinates (t_l, c) and (t_r, c); and w_l and w_r are the weights of f_in(t_l, c) and f_in(t_r, c) respectively. t_l, w_l, t_r and w_r are constructed as follows:

t_l=⌊t_n⌋
w_l=1-(t_n-t_l)
t_r=1+t_l
w_r=1-w_l
Here N_l = N_r = 8, N_c = 16, and N = N_l + N_r + N_c = 32. Since the start time of a behavior proposal must not be later than its end time, if a proposal φ has t_s ≥ t_e, the proposal feature f_φ of that proposal is set to 0.
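A pure-Python sketch of the PFG sampling just described, for a single-channel input feature; the even spacing of the sampled points within each region and the clamping of positions to the valid index range are assumptions of this sketch:

```python
import math

def pfg_sample(feat, t_s, t_e, k=5, n_l=8, n_c=16, n_r=8):
    """PFG-style sampling: linearly interpolate `feat` (a list indexed by
    time node) at n_l points in the start region, n_c points in [t_s, t_e],
    and n_r points in the end region; returns all zeros when t_s >= t_e,
    as required above for invalid proposals."""
    if t_s >= t_e:
        return [0.0] * (n_l + n_c + n_r)

    def interp(t):
        t = min(max(t, 0.0), len(feat) - 1.0)   # clamp to valid range
        t_l = math.floor(t)
        t_r = min(t_l + 1, len(feat) - 1)
        w_r = t - t_l                            # w_l = 1 - w_r
        return (1.0 - w_r) * feat[t_l] + w_r * feat[t_r]

    d_g = t_e - t_s
    regions = [(t_s - d_g / k, t_s + d_g / k, n_l),   # start region
               (t_s, t_e, n_c),                       # duration region
               (t_e - d_g / k, t_e + d_g / k, n_r)]   # end region
    out = []
    for lo, hi, n in regions:
        out += [interp(lo + (hi - lo) * i / (n - 1)) for i in range(n)]
    return out
```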
After PFG-layer sampling, the TEM fusion feature sequence of dimension T × 1 from step 6 yields a TEM proposal feature sequence of dimension T × 32, and the PEM fusion feature sequence of dimension T × 128 yields a PEM proposal feature sequence of dimension T × 32 × 128.
And 8: inputting the PEM proposal characteristic sequence of T multiplied by 32 multiplied by 128 into a PEM module, outputting and obtaining a BM confidence map of T multiplied by 2, namely two BM confidence maps of T multiplied by T, which are marked as MCC∈RT×TAnd MCR∈RT×TThe PEM module consists of three convolutional layers, conv3d uses a three-dimensional convolutional kernel size of 1 × 1 × 32, the number of characteristic channels is set to 512, conv2d2 uses a two-dimensional convolutional kernel size of 1 × 1, the number of characteristic channels is set to 256, conv2d3 uses a two-dimensional convolutional kernel size of 1 × 1, and the number of characteristic channels is set to 2; inputting the TEM proposal characteristic sequence of T multiplied by 32 into a TEM module, outputting two boundary possibility sequences of T multiplied by 1 which are boundary possibility sequences of T multiplied by 2, and recording the boundary possibility sequences asAndthe TEM consists of one feature compression operation and two convolutional layers, the squeeze operation averages the second dimension of the TEM proposed feature sequence T × 32 to compress the feature information, conv1d1 uses a one-dimensional convolution kernel size of 1, the number of feature channels is set to 256, conv1d2 is set to a one-dimensional convolution kernel size of 3, and the number of feature channels is set to 2.
Step 9: the four outputs generated by the BMNPlus network, namely the two boundary probability sequences P_S and P_E and the two BM confidence maps M_CC and M_CR, are used to generate fusion confidences, after which the Soft-NMS algorithm screens all behavior proposals. Specifically:
9.1 From P_S, select time nodes t_n to form a new set B_S = {t_n}_{n=1}^{N_S}, where N_S is the number of time nodes finally selected. Each selected t_n must satisfy p_s^{t_n} ≥ 0.5 · max_k(p_s^k), where max denotes the maximum operation, k runs from 1 to T, p_s^{t_n} is the probability that time node t_n is the start time of a behavior proposal, and p_s^k is the probability that the k-th time node is a behavior proposal start time. Similarly, from P_E select time nodes t_m to form a new set B_E = {t_m}_{m=1}^{N_E}, where N_E is the number of time nodes finally selected; each t_m must satisfy p_e^{t_m} ≥ 0.5 · max_k(p_e^k), where p_e^{t_m} is the probability that time node t_m is the end time of a behavior proposal and p_e^k is the probability that the k-th time node is a behavior proposal end time.
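A sketch of the boundary-node selection in 9.1; the exact threshold is partly elided in the text, so keeping every node whose probability reaches a fixed fraction of the sequence maximum is used here as an assumption:

```python
def select_boundary_nodes(probs, ratio=0.5):
    """Candidate boundary selection: keep time node t when its boundary
    probability reaches `ratio` times the sequence maximum. Run once on
    P_S to build B_S and once on P_E to build B_E. The 0.5 default is an
    assumed value; local-peak selection is another common criterion."""
    thr = ratio * max(probs)
    return [t for t, p in enumerate(probs) if p >= thr]
```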
9.2 From B_S select a time node t_s as the start time and from B_E select a time node t_e as the end time, constructing a behavior proposal denoted φ = (t_s, t_e, p_s^{t_s}, p_e^{t_e}, p_cc, p_cr), where t_s and t_e must satisfy t_s < t_e, p_s^{t_s} is the probability that time node t_s is a behavior proposal start time, p_e^{t_e} is the probability that time node t_e is a behavior proposal end time, p_cc is the value of the M_CC confidence map at coordinate (t_e - t_s, t_s), and p_cr is the value of the M_CR confidence map at coordinate (t_e - t_s, t_s). Finally N_p candidate proposals are obtained.
9.3 For each candidate proposal φ, confidence fusion is performed to obtain the fusion confidence p_f, in the following manner:

p_f = p_s^{t_s} · p_e^{t_e} · sqrt(p_cc · p_cr)
9.4 The candidate proposals are screened using the Soft-NMS algorithm to generate the final action proposal set.
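Steps 9.3 and 9.4 can be sketched as follows; the product-and-square-root fusion form and the Gaussian decay parameters are assumptions of this sketch, since the patent's formula images are not reproduced in the extracted text:

```python
import math

def fuse_confidence(p_s, p_e, p_cc, p_cr):
    """Fusion confidence for 9.3 (assumed form: boundary probabilities
    times the geometric mean of the two BM confidence-map values)."""
    return p_s * p_e * math.sqrt(p_cc * p_cr)

def soft_nms(proposals, sigma=0.5, thresh=0.1):
    """Gaussian Soft-NMS over (t_s, t_e, score) proposals for 9.4;
    sigma and thresh are illustrative defaults, not values from the text."""
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    props = [list(p) for p in proposals]
    kept = []
    while props:
        best = max(props, key=lambda p: p[2])   # highest remaining score
        props.remove(best)
        kept.append(tuple(best))
        for p in props:                          # decay overlapping scores
            p[2] *= math.exp(-iou(best, p) ** 2 / sigma)
    return [p for p in kept if p[2] >= thresh]
```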
It should be noted that the above-mentioned embodiments are only preferred embodiments of the present invention and are not intended to limit the scope of the present invention; all equivalent substitutions or replacements made on the basis of the above technical solutions fall within the protection scope of the present invention.
Claims (6)
1. A method of behavior proposal generation for video behavior detection, the method comprising the steps of:
step 1: constructing and designing a SlowFast neural network, designing the SlowFast neural network into a slow channel and a fast channel, wherein each channel adopts 3DResnet-50 as a main network, and training the SlowFast network on a Kinetics-600 data set until convergence to obtain a SlowFast depth feature extraction model;
step 2: performing feature extraction on an activityNet data set by using the SlowFast depth feature extraction model trained in the step 1 to obtain an activityNet depth feature data set;
Step 3: constructing and designing a BMNPlus neural network and a specific loss function, and training the BMNPlus network on the ActivityNet depth feature data set of step 2 until convergence to obtain a behavior proposal generation model;
and 4, step 4: sampling an original uncut video by using two different frame rates to respectively obtain a low frame rate sampling video and a high frame rate sampling video;
and 5: inputting the low frame rate sampling video in the step 4 into the slow channel in the step 1 to obtain a slow depth characteristic sequence, and inputting the high frame rate sampling video in the step 4 into the fast channel in the step 1 to obtain a fast depth characteristic sequence;
step 6: respectively preprocessing the slow depth characteristic sequence and the fast depth characteristic sequence in the step 5 by using different three convolution layers, fusing the slow depth characteristic sequence and the fast depth characteristic sequence after the second convolution layer to obtain a PEM fusion characteristic sequence, and fusing the PEM fusion characteristic sequence for the second time after the third convolution layer to obtain a TEM fusion characteristic sequence;
and 7: designing a PFG layer to respectively sample a TEM fusion characteristic sequence and a PEM fusion characteristic sequence, respectively sampling 8 points in a starting time region and an ending time region, sampling 16 points in a duration region, and respectively generating a TEM proposal characteristic sequence and a PEM proposal characteristic sequence;
and 8: inputting the TEM proposed feature sequence in the step 7 into a TEM, outputting to obtain a boundary possibility sequence, inputting the PEM proposed feature sequence in the step 7 into a PEM, and outputting to obtain a boundary matching confidence map;
and step 9: and (4) combining the boundary possibility sequence and the boundary matching confidence map in the step 8 to generate a fusion confidence for each behavior proposal, and screening the candidate behavior proposals by using a Soft-NMS algorithm to generate a final behavior proposal.
2. The behavior proposal generation method for video behavior detection according to claim 1, wherein the specific process of step 3 is as follows: constructing a BMNPlus network for behavior proposal generation, the whole BMNPlus network being divided into three modules, BM, TEM and PEM; designing a loss function; and training and tuning the network on the ActivityNet depth feature data set to obtain a converged model, wherein the loss function Loss is designed as follows:
Loss = L_TEM + λ1·L_PEM + λ2·L2(θ)
wherein L_TEM represents the loss of the TEM-generated boundary probability sequences, constraining the probability of each time node being a proposal start or end time point; L_PEM represents the loss of the PEM-generated boundary matching confidence maps, constraining the confidence score of each behavior proposal; L2(θ) represents the L2 regularization term, used to prevent model overfitting; λ1 is set to 1 and λ2 is set to 0.0001. L_TEM is constructed as follows:
L_TEM = L_s(P_S, G_S) + L_s(P_E, G_E)
wherein P_S represents the predicted start probability sequence, G_S the ground-truth of the start probability sequence, P_E the predicted end probability sequence, and G_E the ground-truth of the end probability sequence. L_s is constructed as follows:
L_s(P, G) = (1/T) · Σ_{i=1}^{T} (α^+ · b_i · log(p_i) + α^- · (1 - b_i) · log(1 - p_i))

where T represents the number of time nodes of the video, p_i and g_i are the i-th elements of P and G, and b_i = sign(g_i - γ) is a binarization function converting g_i from [0,1] to {0,1}, with γ set to 0.5. Letting n^+ = Σ b_i and n^- = T - n^+, the weights are α^+ = T/n^+ and α^- = T/n^-.
L_PEM is constructed as follows:
L_PEM = L_s(M_CC, G_C) + λ·L_R(M_CR, G_C)
wherein M_CC and M_CR are the two BM confidence maps predicted by the PEM, G_C is the ground-truth of M_CC and M_CR, L_R is the L2 regression loss function, and λ is set to 10.
3. The behavior proposal generation method for video behavior detection according to claim 1, characterized in that the specific process of step 6 is as follows: inputting the slow depth characteristic sequence and the fast depth characteristic sequence into a BM preprocessing module, and respectively performing different preprocessing processes on the slow depth characteristic sequence and the fast depth characteristic sequence by the BM, wherein the preprocessing process comprises three convolution layers, performing characteristic fusion after a second convolution layer to obtain a PEM fusion characteristic sequence, and performing second characteristic fusion after a third convolution layer to obtain a TEM fusion characteristic sequence, and the whole BM module can be expressed as the following process:
assuming that the slow feature sequence and fast feature sequence input to the BM are denoted sf1 and ff1 respectively, sf1 passes through the conv1d11 and conv1d12 convolution layers to obtain a depth feature sequence sf2, and the construction of sf2 is expressed as follows:
sf2=Fconv1d12(Fconv1d11(sf1))
wherein F represents a convolutional layer operation, and the subscript of F indicates the convolutional layer name;
ff1 after passing through conv1d21 and conv1d22 two convolutional layers, a depth characteristic sequence ff2 is obtained, and the construction of ff2 is expressed as follows:
ff2=Fconv1d22(Fconv1d21(ff1))
sf2 and ff2 are summed by sum to obtain a PEM fusion signature sequence denoted pemf, and the construction of pemf is shown below:
pemf=sf2+ff2
sf2, ff2 and pemf pass through the conv1d13, conv1d23 and conv1d33 convolution layers respectively to obtain the new feature sequences Fconv1d13(sf2), Fconv1d23(ff2) and Fconv1d33(pemf); averaging the three new feature sequences gives the final TEM fusion feature sequence, denoted temf, whose construction is expressed as follows:

temf = (Fconv1d13(sf2) + Fconv1d23(ff2) + Fconv1d33(pemf)) / 3
4. The behavior proposal generation method for video behavior detection according to claim 1, wherein the specific process of step 7 is as follows: for each behavior proposal, a PFG layer sampling method is designed, which samples 8 points from the start-time region of the proposal, 8 points from the end-time region, and 16 points from the duration region, 32 points in total, to obtain a proposal feature for each behavior proposal; after PFG-layer sampling, the TEM fusion feature sequence yields the TEM proposal feature sequence and the PEM fusion feature sequence yields the PEM proposal feature sequence. The sampling process of the PFG layer is as follows: first, for each behavior proposal φ = (t_s, t_e), where t_s denotes the proposal start time and t_e the proposal end time, linear interpolation is used to sample 8 points from the left time region r_s = [t_s - d_g/k, t_s + d_g/k], 8 points from the right time region r_e = [t_e - d_g/k, t_e + d_g/k], and 16 points from the middle region r_a = [t_s, t_e], where d_g = t_e - t_s and k = 5; then a proposal feature is generated for the behavior proposal using these 32 sampling points. Suppose the proposal feature generated for proposal φ is f_φ; the features generated for all candidate proposals together form f_p, and the input feature of the PFG layer is f_in, where f_φ has dimension N × C, f_p has dimension T × T × N × C, and f_in has dimension T × C, with C the number of feature channels. The specific proposal feature construction process is as follows:
f_φ(n, c) = w_l · f_in(t_l, c) + w_r · f_in(t_r, c)

where n denotes the n-th sampling point, located at the time position t_n; f_φ(n, c) is the value of the proposal feature at coordinate (n, c); f_in(t_l, c) and f_in(t_r, c) are the values of the input feature f_in at coordinates (t_l, c) and (t_r, c); and w_l and w_r are the corresponding weights. t_l, w_l, t_r and w_r are constructed as follows:

t_l=⌊t_n⌋
w_l=1-(t_n-t_l)
t_r=1+t_l
w_r=1-w_l
5. The behavior proposal generation method for video behavior detection according to claim 1, wherein the specific process of step 8 is as follows: the PEM module takes the PEM proposal feature sequence as input and, after passing through the PEM module, produces a BM confidence map of dimension T × T × 2, i.e. two T × T BM confidence maps, denoted M_CC ∈ R^{T×T} and M_CR ∈ R^{T×T}; the TEM module takes the TEM proposal feature sequence as input and, after passing through the TEM module, produces a boundary probability sequence of dimension T × 2, i.e. two T × 1 boundary probability sequences, denoted P_S = {p_s^i}_{i=1}^{T} and P_E = {p_e^i}_{i=1}^{T}, where p_s^i represents the probability that the i-th time node is the start time of a behavior proposal and p_e^i represents the probability that the i-th time node is the end time of a behavior proposal.
6. The behavior proposal generation method for video behavior detection according to claim 1, wherein the specific process of step 9 is as follows: the four outputs generated by the BMNPlus network, namely P_S, P_E, M_CC and M_CR, undergo confidence fusion, and then the Soft-NMS algorithm screens all candidate proposals to generate the final behavior proposals, specifically:
9.1 From P_S, select time nodes t_n to form a new set B_S = {t_n}_{n=1}^{N_S}, where N_S is the number of time nodes finally selected. Each selected t_n must satisfy p_s^{t_n} ≥ 0.5 · max_k(p_s^k), where max denotes the maximum operation, k runs from 1 to T, p_s^{t_n} is the probability that time node t_n is the start time of a behavior proposal, and p_s^k is the probability that the k-th time node is a behavior proposal start time. Similarly, from P_E select time nodes t_m to form a new set B_E = {t_m}_{m=1}^{N_E}, where N_E is the number of time nodes finally selected; each t_m must satisfy p_e^{t_m} ≥ 0.5 · max_k(p_e^k), where p_e^{t_m} is the probability that time node t_m is the end time of a behavior proposal and p_e^k is the probability that the k-th time node is a behavior proposal end time.
9.2 From B_S select a time node t_s as the start time and from B_E select a time node t_e as the end time, constructing a behavior proposal denoted φ = (t_s, t_e, p_s^{t_s}, p_e^{t_e}, p_cc, p_cr), where t_s and t_e must satisfy t_s < t_e, p_s^{t_s} is the probability that time node t_s is a behavior proposal start time, p_e^{t_e} is the probability that time node t_e is a behavior proposal end time, p_cc is the value of the M_CC confidence map at coordinate (t_e - t_s, t_s), and p_cr is the value of the M_CR confidence map at coordinate (t_e - t_s, t_s). Finally N_p candidate proposals are obtained.
9.3 For each candidate proposal φ, confidence fusion is performed to obtain the fusion confidence p_f, in the following manner:

p_f = p_s^{t_s} · p_e^{t_e} · sqrt(p_cc · p_cr)
9.4 screening the candidate proposals by using the Soft-NMS algorithm to generate a final action proposal set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110647905.8A CN113298017B (en) | 2021-06-10 | 2021-06-10 | Behavior proposal generation method for video behavior detection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113298017A true CN113298017A (en) | 2021-08-24 |
CN113298017B CN113298017B (en) | 2024-04-23 |
Family
ID=77327868
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113627410A (en) * | 2021-10-14 | 2021-11-09 | 江苏奥斯汀光电科技股份有限公司 | Method for recognizing and retrieving action semantics in video |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109784269A (en) * | 2019-01-11 | 2019-05-21 | 中国石油大学(华东) | One kind is based on the united human action detection of space-time and localization method |
CN109919122A (en) * | 2019-03-18 | 2019-06-21 | 中国石油大学(华东) | A kind of timing behavioral value method based on 3D human body key point |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||