CN109409257A - Video temporal action detection method based on weakly supervised learning - Google Patents
- Publication number
- CN109409257A (application CN201811181395.4A)
- Authority
- CN
- China
- Prior art keywords
- video
- classifier
- segment
- motion detection
- detection method
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24137—Distances to cluster centroïds
- G06F18/2414—Smoothing the distance, e.g. radial basis function networks [RBFN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/254—Fusion techniques of classification results, e.g. of results related to same input data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Biology (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Human Computer Interaction (AREA)
- Image Analysis (AREA)
Abstract
The present invention relates to the field of digital image processing, and specifically to a video temporal action detection method based on weakly supervised learning. The training stage comprises the following steps. Step 1: input the video into the classifiers and obtain their respective detection confidences. Step 2: fuse the video's scores across the different classifiers. Step 3: refine the result with a conditional random field. The detection stage comprises: Step 4: feed the video to be detected into the trained classifiers and obtain their detection confidences; Step 5: fuse the different detection confidences through FC-CRF optimization. The method combines human prior knowledge with the outputs of neural networks; experiments show that the FC-CRF improves the mAP@0.5 detection performance on ActivityNet by 20.8%.
Description
Technical field
The present invention relates to the field of digital image processing, and specifically to a video temporal action detection method based on weakly supervised learning.
Background art
In the past few years, inspired by the immense success of deep learning in image-based analysis tasks, many deep learning architectures, in particular convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have been introduced into video-based action analysis. Karpathy et al. were the first to apply deep learning to action recognition in video, designing various deep models that process single frames or sequences of frames. Tran et al. built the C3D model, which performs 3D convolutions over spatio-temporal video volumes to better integrate appearance and motion cues. Wang et al. proposed temporal segment networks (TSN), which inherit the advantages of the two-stream feature extraction structure and use a sparse sampling scheme to cope with longer video clips. Qiu et al. proposed the pseudo-3D (P3D) residual network, which reuses off-the-shelf 2D networks for 3D CNNs. Beyond action recognition, other work addresses action detection and proposal generation. Shou et al. perform temporal action localization with a multi-stage CNN detection network. Escorcia et al. proposed the DAPs model, which encodes a video sequence with an RNN and retrieves action proposals in a single pass. Lin et al. skip the proposal-generation step with a single-shot action detector (SSAD). Shou et al. designed a convolution-deconvolution (CDC) network to determine precise temporal boundaries.
In the past few years, behavior analysis and understanding in video has attracted much attention, and many studies have addressed the problem with hand-crafted feature representations or deep learning architectures. A large body of existing work handles action analysis in a strongly supervised manner, in which the training data are manually annotated with action instances free of background, or are trimmed. In recent years, some strongly supervised methods have achieved satisfactory results. However, with ever larger video datasets, annotating the precise temporal position of action instances is time-consuming and labor-intensive. Moreover, unlike object boundaries, the definition of the exact temporal extent of an action is usually subjective and inconsistent across observers, which may introduce additional bias and error. To overcome these limitations of temporal action detection, a weakly supervised approach is a reasonable choice. The prior art builds deep learning models from precise temporal labels or trimmed videos, whereas the model of the invention is trained directly on untrimmed videos and requires only video-level class labels.
Summary of the invention
The object of the invention is a video temporal action detection method based on weakly supervised learning. To solve temporal action detection, the model of the invention predicts the action category and the temporal position of the action instances in a video. In the weakly supervised learning task, only video-level class labels are provided as supervisory signals, and during training the video clips, which contain action instances mixed with background, are not trimmed.
To achieve the object of the invention, the following technical solution is adopted:
A video temporal action detection method based on weakly supervised learning, whose training stage comprises the following steps:
Step 1: input the video into the classifiers and obtain their respective detection confidences;
Step 2: fuse the video's scores across the different classifiers;
Step 3: refine the result with a conditional random field.
Step 1 above proceeds in the following order:
a) The video is divided into non-overlapping segments of equal length, and features are extracted per segment.
b) Based on these segment features, the classifier outputs a detection confidence for each action category.
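Steps a) and b) can be sketched as follows. This is an illustrative sketch only: the segment length, the feature extractor, and the classifier are left unspecified in the description, so `seg_len`, `extract_feature`, and `classifier` here are hypothetical placeholders.

```python
import numpy as np

def split_into_segments(frames, seg_len=16):
    """Step a): divide a video of T frames into non-overlapping,
    equal-length segments, dropping any incomplete tail."""
    n = len(frames) // seg_len
    return [frames[k * seg_len:(k + 1) * seg_len] for k in range(n)]

def segment_confidences(segments, extract_feature, classifier):
    """Step b): per-segment detection confidences, one per action class."""
    feats = np.stack([extract_feature(s) for s in segments])
    return classifier(feats)  # shape: (num_segments, num_classes)
```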
Step 2 proceeds in the following order:
c) Given the video clips, obtain the corresponding category scores from the initial classifier (see Step 1);
d) According to the scores, erase part of the video's content to obtain new video segments. Concretely: from the category score of each video segment, compute its class probability, then, with probability given by that value, randomly erase the corresponding segment from the training set.
e) Traverse all videos of the training set once, removing segments as above, to obtain a new training set.
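A hedged sketch of the erasing step d). For illustration it uses a scalar weight `alpha` in place of the per-segment weight factor that the description defines later, and takes the maximum class probability per segment as the value driving erasure; both are assumptions, not the patent's exact formulation.

```python
import numpy as np

def erase_step(scores, alpha, rng):
    """Step d): from the per-segment category scores of one video, compute
    softmax class probabilities and erase segments at random with
    probability s = alpha * p. Returns a boolean keep-mask."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))  # stable softmax
    p = e / e.sum(axis=1, keepdims=True)
    s = alpha * p.max(axis=1)          # illustrative: max-class probability
    return rng.random(len(s)) >= s     # True = segment is kept
```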
Step 3 proceeds in the following order:
f) Train a classifier on the videos of the new training set;
g) Judge whether training has converged: if not, repeat Steps 2 and 3; if so, a series of trained classifiers is obtained.
During training, segments in which an action occurs with high confidence are gradually erased. In this way a series of classifiers with respective preferences is obtained, each suited to a different kind of action segment.
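The loop of steps f)-g) can be sketched as follows. `train_classifier` and `erase_from` are hypothetical stand-ins for the unspecified training and erasing routines, and convergence is approximated by the training set no longer shrinking; the actual convergence criterion is not given in the description.

```python
def train_classifier_series(train_set, train_classifier, erase_from,
                            max_rounds=10):
    """Repeat: train a classifier on the current training set, erase the
    most confidently detected segments, and retrain on what remains,
    collecting one classifier per round (the 'series of classifiers
    with respective preferences')."""
    classifiers, current = [], train_set
    for _ in range(max_rounds):
        clf = train_classifier(current)
        classifiers.append(clf)
        remaining = erase_from(current, clf)
        if len(remaining) == len(current):   # nothing erased: converged
            break
        current = remaining
    return classifiers
```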
At the inference stage, the segments containing action instances are selected according to the repeatedly trained classifiers, and the fused result is optimized by a fully connected conditional random field (FC-CRF). The detection stage comprises the following steps:
Step 4: feed the video to be detected into the trained classifiers and obtain their detection confidences;
Step 5: fuse the different detection confidences through FC-CRF optimization.
Step 4 above proceeds in the following order:
I) The video to be detected is divided into non-overlapping segments of equal length, and features are extracted per segment.
II) Based on these segment features, the trained classifiers output a detection confidence for each action category.
Step 5 above proceeds in the following order:
III) From the category score of each video segment, compute its class probability.
IV) A fully connected conditional random field (FC-CRF), in the form of a probabilistic graphical model, takes the class probabilities as input and, using the time-axis positions of the video segments, optimizes the fused result and outputs the final detection probabilities.
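Before the FC-CRF refinement, the fusion of Step 5 amounts to averaging the class probabilities and weight factors over the classifier series. A minimal sketch, in which the array layouts (segments x classes, and one weight per segment) are assumptions:

```python
import numpy as np

def fuse_scores(probs_per_clf, alpha_per_clf):
    """Average p and alpha over the classifier series and return the
    per-segment, per-class detection score alpha_bar * p_bar that the
    FC-CRF subsequently refines."""
    p_bar = np.mean(probs_per_clf, axis=0)   # (num_segments, num_classes)
    a_bar = np.mean(alpha_per_clf, axis=0)   # (num_segments,)
    return a_bar[:, None] * p_bar
```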
By adopting the above technical means, the invention has the following advantages and beneficial effects:
1. The invention proposes a weakly supervised model for detecting temporal actions in untrimmed video. The model obtains a series of classifiers by progressively erasing the video. At test time, the model is applied simply by collecting the detection results from the classifiers one by one.
2. To the best of our knowledge, this is the first work to introduce the fully connected conditional random field [22] (fully connected conditional random field, FC-CRF) into the temporal action detection task, combining human prior knowledge with the output of the neural network. Experiments show that the FC-CRF improves the mAP@0.5 detection performance on ActivityNet by 20.8%.
3. Extensive experiments were carried out on two challenging untrimmed video datasets, ActivityNet [11] and THUMOS'14 [20]; they show that the detection performance of the method, measured in mean average precision (mAP), exceeds all other weakly supervised temporal action detection methods and is even comparable to some strongly supervised methods.
To illustrate the concept and technical solution of the invention more clearly, the invention is further described below through specific embodiments with reference to the accompanying drawings.
Description of the drawings
Fig. 1 is the flow chart of the video temporal action detection method of the invention;
Fig. 2 is the training flow chart of the invention.
Specific embodiments
Fig. 1 is the flow chart of the video temporal action detection method of the invention. As shown in Fig. 1, a video temporal action detection method based on weakly supervised learning comprises the following steps: 1, input the video into each classifier and obtain the different detection confidences (S1); 2, fuse the video's scores across the different classifiers (S2); 3, refine the result with a conditional random field (S3).
Fig. 2 is the training flow chart of the invention. As shown in Fig. 2, training comprises the following steps: obtain the category scores of the video clips from the initial classifier (11); according to the scores, erase part of the video's content to obtain new video segments (12); train a classifier on the new videos (13); judge convergence: if not converged (14), repeat steps 12 and 13; if converged, proceed to the next step and obtain a series of trained classifiers (15).
The model training process of the method of the invention proceeds as follows:
Given a video V containing N clips, with K video-level class labels, and a classifier parameterized by θ, the invention obtains classification scores φ(V;θ) ∈ R^{N×C}, where C is the number of categories. In the t-th erasing step, the remaining segments of the training videos are denoted V_t and the classifier θ_t. For the i-th row φ_{i,:} of φ(V_t;θ_t), the original classification score of the i-th clip, the invention computes the softmax-normalized class probability p_{i,j}(V_t) of the j-th segment:
In addition, the invention defines the weight factor α_{i,j}:
where δ_τ is defined as follows:
where τ is a decay factor, a hyperparameter. The erasure probability s_{i,j} is then:
s_{i,j}(V_t) = α_{i,j}(V_t) · p_{i,j}(V_t)
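A sketch of the erasure-probability computation above. The softmax normalization is reconstructed from the text; the weight factor α (whose defining formula, together with δ_τ, is not reproduced in this text) is passed in as given rather than computed.

```python
import numpy as np

def erasure_probability(phi, alpha):
    """Softmax-normalize the raw classification scores phi along the last
    axis to get p, then return p and s = alpha * p, the probability of
    erasing each entry in round t."""
    e = np.exp(phi - phi.max(axis=-1, keepdims=True))  # stable softmax
    p = e / e.sum(axis=-1, keepdims=True)
    return p, alpha * p
```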
After obtaining the erasure probabilities s_{i,j}(V_t) of round t, the invention completes the training process as follows:
Step 2: use of the model.
From the series of classifiers obtained, the invention computes p_{i,j} and α_{i,j} and their averages. The invention builds a fully connected conditional random field whose energy function is:
where the label variables l_i and l_j denote the class labels of the i-th and j-th segments. A mean-field approximation is then used for optimization, and the averaged product α·p gives the detection confidence of each segment. From the fully connected conditional random field, the maximum a posteriori probability is computed to obtain the final score of each video segment.
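A compact mean-field sketch of FC-CRF inference over segment time positions, following the Gaussian-kernel construction of [22]. The kernel width `sigma`, the compatibility weight `compat`, and the iteration count are illustrative assumptions, not values from the patent; rewarding agreement with neighboring segments is equivalent, up to a per-segment constant, to a Potts disagreement penalty.

```python
import numpy as np

def mean_field_crf(unary_probs, positions, sigma=2.0, compat=1.0, iters=5):
    """Mean-field inference for a fully connected CRF over video segments:
    pairwise potentials use a Gaussian kernel on time-axis positions."""
    q = unary_probs.copy()                       # (num_segments, num_classes)
    d = positions[:, None] - positions[None, :]
    k = np.exp(-d ** 2 / (2 * sigma ** 2))       # Gaussian positional kernel
    np.fill_diagonal(k, 0.0)                     # exclude self-messages
    log_unary = np.log(unary_probs + 1e-12)
    for _ in range(iters):
        msg = k @ q                              # message passing step
        energy = log_unary + compat * msg        # favor label agreement
        e = np.exp(energy - energy.max(axis=1, keepdims=True))
        q = e / e.sum(axis=1, keepdims=True)     # renormalize beliefs
    return q
```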
The method of the invention was tested on ActivityNet and THUMOS'14, with the following results.
In the tables below, the compared metric is the mean average precision at different time-axis intersection-over-union thresholds, mAP (mean Average Precision), which measures the precision of the retrieved videos at different temporal IoU thresholds. Higher is better.
Strongly supervised learning means that the annotations of the training samples include both the video category and the temporal information.
Weakly supervised learning means that the annotations of the training samples include only the video category.
Single-stage, cascade, single-classification, and multi-classification refer to the different methods proposed in the respective references; the other methods of the cited references are not enumerated one by one.
Table 1: mean average precision on the ActivityNet dataset at different time-axis IoU thresholds.
Table 2: mAP@tIoU on THUMOS'14, i.e., the mean average precision on the THUMOS'14 dataset at different time-axis IoU thresholds.
In the tables, Strong/Weak Supervision denotes strongly or weakly supervised learning, and each method in the first column is the method of the corresponding reference and its authors.
According to other embodiments of the invention, in the technical solution:
1. The classifier may be based on any neural network, and may also be combined with traditional hand-crafted features.
2. The fully connected conditional random field may be replaced with a conditional random field of any kind.
References are cited by the number in square brackets, e.g., [53] denotes reference 53 and [59] reference 59:
[1]A.Karpathy,G.Toderici,S.Shetty,T.Leung,R.Sukthankar,and L.Fei-
Fei.2014.Large-scale video classification with convolutional neural
networks.In CVPR.1725–1732.
[2]P.Bojanowski,R.Lajugie,F.R.Bach,I.Laptev,J.Ponce,C.Schmid,and
J.Sivic.2014.Weakly supervised action labeling in videos under ordering
constraints.In ECCV.628–643.
[3]P.Bojanowski,R.Lajugie,E.Grave,F.Bach,I.Laptev,J.Ponce,and
C.Schmid.2015. Weakly-supervised alignment of video with text.In ICCV.4462–
4470.
[4]C.Feichtenhofer,A.Pinz,and A.Zisserman.2016.Convolutional two-stream network fusion for video action recognition.In CVPR.1933–1941.
[5]Joao Carreira and Andrew Zisserman.2017.Quo Vadis,Action
Recognition?A New Model and the Kinetics Dataset.In IEEE Conference on
Computer Vision and Pattern Recognition.4724–4733.
[6]Xiyang Dai,Bharat Singh,Guyue Zhang,Larry S.Davis,and Yan Qiu
Chen.2017.Temporal Context Network for Activity Localization in Videos.In
IEEE International Conference on Computer Vision.5727–5736.
[7]Dan Oneata,Jakob Verbeek,and Cordelia Schmid.2014.The LEAR
submission at Thumos 2014. Computer Vision and Pattern Recognition[cs.CV]
(2014).
[8]J.Donahue,L.Anne Hendricks,S.Guadarrama,M.Rohrbach,S.Venugopalan,
K.Saenko,and T. Darrell.2015.Long-term recurrent convolutional networks for
visual recognition and description.In CVPR. 2625–2634.
[9]V.Escorcia,F.C.Heilbron,J.C.Niebles,and B.Ghanem.2016.DAPs:Deep
action proposals for action understanding.In European Conference on
Computer Vision.768–784.
[10]Victor Escorcia,Fabian Caba Heilbron,Juan Carlos Niebles,and
Bernard Ghanem.2016.DAPs:Deep Action Proposals for Action Understanding.In
European Conference on Computer Vision.768–784.
[11]F.Caba Heilbron,V.Escorcia,B.Ghanem,and J.Carlos
Niebles.2015.ActivityNet:A large-scale video benchmark for human activity
understanding.In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition.961–970.
[12]C.Gan,C.Sun,L.Duan,and B.Gong.2016.Webly-supervised video
recognition by mutually voting for relevant web images and web video
frames.In ECCV.849–866.
[13]Jiyang Gao,Zhenheng Yang,Chen Sun,Kan Chen,and Ram
Nevatia.2017.TURN TAP:Temporal Unit Regression Network for Temporal Action
Proposals.arXiv:1703.06189(2017).
[14]H.Kuehne,A.Richard,and J.Gall.2016.Weakly supervised learning of
actions from transcripts.CoRR, abs/1610.02237(2016).
[15]Fabian Caba Heilbron,Wayner Barrios,Victor Escorcia,and Bernard
Ghanem.2017.SCC:Semantic Context Cascade for Efficient Action Detection.In
IEEE Conference on Computer Vision and Pattern Recognition.
[16]Fabian Caba Heilbron,Juan Carlos Niebles,and Bernard
Ghanem.2016.Fast Temporal Activity Proposals for Efficient Detection of
Human Actions in Untrimmed Videos.In Computer Vision and Pattern
Recognition.1914–1923.
[17]D.Huang,L.Fei-Fei,and J.C.Niebles.2016.Connectionist temporal
modeling for weakly supervised action labeling.In ECCV.137–153.
[18]Dinesh Jayaraman and Kristen Grauman.2016.Slow and Steady Feature
Analysis:Higher Order Temporal Coherence in Video.In Computer Vision and
Pattern Recognition.3852–3861.
[19]Yangqing Jia,Evan Shelhamer,Jeff Donahue,Sergey Karayev,Jonathan
Long,Ross Girshick,Sergio Guadarrama,and Trevor Darrell.2014.Caffe:
Convolutional Architecture for Fast Feature Embedding.arXiv preprint arXiv:
1408.5093(2014).
[20]Y.-G.Jiang,J.Liu,A.Roshan Zamir,G.Toderici,I.Laptev,M.Shah,and
R.Sukthankar.2014.THUMOS challenge:Action recognition with a large number
of classes.http://crcv.ucf.edu/THUMOS14/(2014).
[21]Svebor Karaman,Lorenzo Seidenari,and Alberto Del Bimbo.[n.d.]
.Fast saliency based pooling of Fisher encoded dense trajectories.([n.d.]).
[22]P.Krähenbühl and V.Koltun.2011.Efficient inference in fully
connected CRFs with Gaussian edge potentials.In NIPS.109–117.
[23]L.Wang,Y.Qiao,and X.Tang.2016.MoFAP:A multi-level representation
for action recognition.IJCV 119,3(2016),254–271.
[24]Ivan Laptev and Tony Lindeberg.2003.Space-time interest points.In
9th International Conference on Computer Vision.432–439.
[25]I.Laptev,M.Marszalek,C.Schmid,and B.Rozenfeld.2008.Learning
realistic human actions from movies.In CVPR.1–8.
[26]Colin Lea,Michael D.Flynn,Rene Vidal,Austin Reiter,and Gregory
D.Hager.2017.Temporal Convolutional Networks for Action Segmentation and
Detection.In IEEE Conference on Computer Vision and Pattern Recognition.1003–
1012.
[27]Tianwei Lin,Xu Zhao,and Zheng Shou.2017.Single Shot Temporal
Action Detection.In ACM on Multimedia Conference.
[28]L.Wang,Y.Xiong,D.Lin,and L.V.Gool.2017.UntrimmedNets for Weakly
Super-vised Action Recognition and Detection.arXiv:1703.03329(2017).
[29]L.Wang,Y.Xiong,Z.Wang,Y.Qiao,D.Lin,X.Tang,and L.Van
Gool.2016.Temporal segment networks: Towards good practices for deep action
recognition.In ECCV.20–36.
[30]Marcin Marszalek,Ivan Laptev,and Cordelia Schmid.2009.Actions in
context.In CVPR.2929–2936.
[31]Hossein Mobahi,Ronan Collobert,and Jason Weston.2009.Deep
learning from temporal coherence in video..In International Conference on
Machine Learning,ICML 2009,Montreal,Quebec,Canada,June.93.
[32]Nannan Li,Dan Xu,Zhenqiang Ying,Zhihao Li,and Ge
Li.2016.Searching Action Proposals via Spatial Actionness Estimation and
Temporal Path Inference and Tracking.In Asian Conference on Computer
Vision.384–399.
[33]O.Duchenne,I.Laptev,J.Sivic,F.R.Bach,and J.Ponce.2009.Automatic
annotation of human actions in video.In ICCV.1491–1498.
[34]Zhaofan Qiu,Ting Yao,and Tao Mei.2017.Learning Spatio-Temporal
Representation with Pseudo-3D Residual Networks.In ICCV.
[35]Alexander Richard and Juergen Gall.2016.Temporal Action Detection
Using a Statistical Language Model.In Computer Vision and Pattern
Recognition.
[36]Suman Saha,Gurkirt Singh,Michael Sapienza,Philip H.S.Torr,and
Fabio Cuzzolin.2016.Deep Learning for Detecting Multiple Space-Time Action
Tubes in Videos.arXiv:1608.01529(2016).
[37]Zheng Shou,Jonathan Chan,Alireza Zareian,Kazuyuki Miyazawa,and
Shih-Fu Chang.2017.CDC:Convolutional-De-Convolutional Networks for Precise
Temporal Action Localization in Untrimmed Videos.(2017).
[38]Zheng Shou,Dongang Wang,and Shih-Fu Chang.2016.Temporal action
localization in untrimmed videos via multi-stage cnns.In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition. 1049–1058.
[39]Gunnar A.Sigurdsson,Olga Russakovsky,and Abhinav Gupta.2017.What
Actions are Needed for Understanding Human Actions in Videos? CoRR abs/
1708.02696(2017).arXiv:1708.02696 http://arxiv.org/abs/1708.02696
[40]Karen Simonyan and Andrew Zisserman.2014.Two-stream convolutional
networks for action recognition in videos.In Advances in neural information
processing systems.568–576.
[41]Krishna Kumar Singh and Yong Jae Lee.2017.Hide-and-Seek:Forcing a
Network to be Meticulous for Weakly-supervised Object and Action
Localization.arXiv:1704.04232(2017).
[42]S.Satkin and M.Hebert.2010.Modeling the temporal extent of
actions.In ECCV.536–548.
[43]Chen Sun,Sanketh Shetty,Rahul Sukthankar,and Ram
Nevatia.2015.Temporal Localization of Fine-Grained Actions in Videos by
Domain Transfer from Web Images.In ACM International Conference on
Multimedia.371–380.
[44]Du Tran,Lubomir Bourdev,Rob Fergus,Lorenzo Torresani,and Manohar
Paluri.2015.Learning spatiotemporal features with 3d convolutional
networks.In Proceedings of the IEEE International Conference on Computer
Vision.4489–4497.
[45]Heng Wang and Cordelia Schmid.2013.Action recognition with
improved trajectories.In Proceedings of the IEEE International Conference on
Computer Vision.3551–3558.
[46]Limin Wang,Yu Qiao,and Xiaoou Tang.[n.d.].Action Recognition and
Detection by Combining Motion and Appearance Features.([n.d.]).
[47]L.Wang,Y.Qiao,and X.Tang.2015.Action recognition with trajectory-
pooled deep-convolutional descriptors.In CVPR.4305–4314.
[48]Limin Wang,Yuanjun Xiong,Zhe Wang,Yu Qiao,Dahua Lin,Xiaoou Tang,
and Luc Van Gool.2017. Temporal Segment Networks for Action Recognition in
Videos.CoRR abs/1705.02953(2017).arXiv:1705.02953 http://arxiv.org/abs/
1705.02953
[49]Xiaolong Wang,Ross Girshick,Abhinav Gupta,and Kaiming
He.2017.Non-local Neural Networks. arXiv preprint arXiv:1711.07971(2017).
[50]Yunchao Wei,Jiashi Feng,Xiaodan Liang,Ming-Ming Cheng,Yao Zhao,
and Shuicheng Yan.2017. Object Region Mining with Adversarial Erasing:A
Simple Classification to Semantic Segmentation Approach. arXiv:1703.08448
(2017).
[51]Yunchao Wei,Wei Xia,Junshi Huang,Bingbing Ni,Jian Dong,Yao Zhao,
and Shuicheng Yan.2014. CNN:Single-label to Multi-label.Computer Science
(2014).
[52]L Wiskott and T Sejnowski.2002.Slow feature analysis:unsupervised
learning of invariances.Neural Computation 14,4(2002),715.
[53]Yuanjun Xiong,Yue Zhao,Limin Wang,Dahua Lin,and Xiaoou
Tang.2017.A Pursuit of Temporal Accuracy in General Activity Detection.arXiv:
1703.02716(2017).
[54]Huijuan Xu,Abir Das,and Kate Saenko.2017.R-C3D:Region
Convolutional 3D Network for Temporal Activity Detection.In IEEE
International Conference on Computer Vision.5794–5803.
[55]Serena Yeung,Olga Russakovsky,Greg Mori,and Li Fei-Fei.2016.End-
to-end learning of action detection from frame glimpses in videos.In
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition.2678–2687.
[56]Jun Yuan,Bingbing Ni,Xiaokang Yang,and Ashraf
A.Kassim.2016.Temporal Action Localization with Pyramid of Score Distribution
Features.In Computer Vision and Pattern Recognition.3093–3102.
[57]Zehuan Yuan,Jonathan C.Stroud,Tong Lu,and Jia Deng.2017.Temporal
Action Localization by Structured Maximal Sums.In IEEE Conference on Computer
Vision and Pattern Recognition.3215–3223.
[58]Yimeng Zhang and Tsuhan Chen.2012.Efficient inference for fully-
connected CRFs with stationarity. 2012 IEEE Conference on Computer Vision and
Pattern Recognition(CVPR)00(2012),582–589.
[59]Yue Zhao,Yuanjun Xiong,Limin Wang,Zhirong Wu,Xiaoou Tang,and
Dahua Lin.2017.Temporal Action Detection with Structured Segment Networks.In
IEEE International Conference on Computer Vision. 2933–2942.
[60]Yi Zhu and Shawn Newsam.2016.Efficient Action Detection in
Untrimmed Videos via Multi-Task Learning.arXiv:1612.07403(2016).
Claims (9)
1. A video temporal action detection method based on weakly supervised learning, whose training stage comprises the following steps:
Step 1: input the video into the classifiers and obtain their respective detection confidences;
Step 2: fuse the video's scores across the different classifiers;
Step 3: refine the result with a conditional random field.
2. The video temporal action detection method based on weakly supervised learning according to claim 1, characterized in that Step 1 proceeds in the following order:
a) The video is divided into non-overlapping segments of equal length, and features are extracted per segment.
b) Based on these segment features, the classifier outputs a detection confidence for each action category.
3. The video temporal action detection method based on weakly supervised learning according to claim 1, characterized in that Step 2 proceeds in the following order:
c) Given the video clips, obtain the corresponding category scores from the initial classifier (see Step 1);
d) According to the scores, erase part of the video's content to obtain new video segments. Concretely: from the category score of each video segment, compute its class probability, then, with probability given by that value, randomly erase the corresponding segment from the training set.
e) Traverse all videos of the training set once, removing segments as above, to obtain a new training set.
4. The video temporal action detection method based on weakly supervised learning according to claim 1, characterized in that Step 3 proceeds in the following order:
f) Train a classifier on the videos of the new training set;
g) Judge whether training has converged: if not, repeat Steps 2 and 3; if so, a series of trained classifiers is obtained.
5. The video temporal action detection method based on weakly supervised learning according to any one of claims 1-4, further comprising a detection stage after Step 3, which comprises:
Step 4: feed the video to be detected into the trained classifiers and obtain their detection confidences;
Step 5: fuse the different detection confidences through FC-CRF optimization.
6. The video temporal action detection method based on weakly supervised learning according to claim 5, characterized in that Step 4 proceeds in the following order:
I) The video to be detected is divided into non-overlapping segments of equal length, and features are extracted per segment.
II) Based on these segment features, the trained classifiers output a detection confidence for each action category.
7. The video temporal action detection method based on weakly supervised learning according to claim 5, characterized in that Step 5 proceeds in the following order:
III) From the category score of each video segment, compute its class probability;
IV) A fully connected conditional random field (FC-CRF), in the form of a probabilistic graphical model, takes the class probabilities as input and, using the time-axis positions of the video segments, optimizes the fused result and outputs the final detection probabilities.
8. The video temporal action detection method based on weakly supervised learning according to any one of claims 1 to 4, characterized in that the training process of the trained classifiers is as follows:
Given a video V containing N clips, with K video-level class labels, and a classifier parameterized by θ, the classification scores φ(V;θ) ∈ R^{N×C} are obtained, where C is the number of categories. In the t-th erasing step, the remaining segments of the training videos are denoted V_t and the classifier θ_t. For the i-th row φ_{i,:} of φ(V_t;θ_t), the original classification score of the i-th clip, the softmax-normalized class probability p_{i,j}(V_t) is computed for the j-th segment:
In addition, the weight factor α_{i,j} is defined:
where δ_τ is defined as follows:
where τ is a decay factor, a hyperparameter. The erasure probability s_{i,j} is then:
s_{i,j}(V_t) = α_{i,j}(V_t) · p_{i,j}(V_t)
After the erasure probabilities s_{i,j}(V_t) of round t are obtained, the training process is completed as follows:
9. The video temporal action detection method based on weakly supervised learning according to any one of claims 1 to 4, characterized in that the trained classifier is used as follows:
From the p_{i,j} and α_{i,j} computed by the series of obtained classifiers, we take their average values p̄_{i,j} and ᾱ_{i,j}. We build a fully-connected conditional random field whose energy function is defined over label variables l_i and l_j, specified by p̄ and denoting the class labels of the i-th and j-th segments. Thereafter, the labels are solved near-optimally by mean-field approximation, and the product ᾱ·p̄ gives the detection confidence of each segment; according to the fully-connected conditional random field, the maximum posterior probability is computed to obtain the final score of each video segment.
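The mean-field inference of claim 9 can be sketched as below. The claim does not give the exact energy function, so a Gaussian pairwise kernel on temporal distance between segments is assumed here, and the unary energies, kernel width, and iteration count are illustrative:

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mean_field_crf(unary, times, sigma=1.0, w=1.0, iters=5):
    """Mean-field approximation for a fully-connected CRF over segments.
    unary: (N, C) unary energies (lower = more likely); times: (N,)
    time-axis positions. Pairwise term: assumed Gaussian kernel on
    temporal distance. Returns per-segment marginals Q of shape (N, C)."""
    K = np.exp(-0.5 * ((times[:, None] - times[None, :]) / sigma) ** 2)
    np.fill_diagonal(K, 0.0)       # no self-message
    Q = softmax(-unary)            # initialize from unaries alone
    for _ in range(iters):
        msg = K @ Q                # agreement messages from all other segments
        Q = softmax(-unary + w * msg)
    return Q

# 5 segments on the time axis; segment 2 disagrees with its neighbours
unary = np.array([[0.0, 2.0]] * 5)
unary[2] = [2.0, 0.0]
times = np.arange(5, dtype=float)
Q = mean_field_crf(unary, times)
```

After inference the outlier segment's marginal is pulled toward the label of its temporal neighbours, which is the smoothing effect the claim attributes to the FC-CRF; the per-segment maximum of Q then yields the final score.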
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811181395.4A CN109409257A (en) | 2018-10-11 | 2018-10-11 | A kind of video timing motion detection method based on Weakly supervised study |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109409257A true CN109409257A (en) | 2019-03-01 |
Family
ID=65467544
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811181395.4A Pending CN109409257A (en) | 2018-10-11 | 2018-10-11 | A kind of video timing motion detection method based on Weakly supervised study |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109409257A (en) |
Non-Patent Citations (1)
Title |
---|
JIA-XING ZHONG et al.: "Step-by-step Erasion, One-by-one Collection: A Weakly Supervised Temporal Action Detector", arXiv * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110189800A (en) * | 2019-05-06 | 2019-08-30 | 浙江大学 | Furnace oxygen content soft-measuring modeling method based on more granularities cascade Recognition with Recurrent Neural Network |
CN110189800B (en) * | 2019-05-06 | 2021-03-30 | 浙江大学 | Furnace oxygen content soft measurement modeling method based on multi-granularity cascade cyclic neural network |
CN110490055A (en) * | 2019-07-08 | 2019-11-22 | 中国科学院信息工程研究所 | A kind of Weakly supervised Activity recognition localization method and device recoded based on three |
CN111104855A (en) * | 2019-11-11 | 2020-05-05 | 杭州电子科技大学 | Workflow identification method based on time sequence behavior detection |
CN111104855B (en) * | 2019-11-11 | 2023-09-12 | 杭州电子科技大学 | Workflow identification method based on time sequence behavior detection |
CN111079646A (en) * | 2019-12-16 | 2020-04-28 | 中山大学 | Method and system for positioning weak surveillance video time sequence action based on deep learning |
CN111079646B (en) * | 2019-12-16 | 2023-06-06 | 中山大学 | Weak supervision video time sequence action positioning method and system based on deep learning |
CN113516032A (en) * | 2021-04-29 | 2021-10-19 | 中国科学院西安光学精密机械研究所 | Weak supervision monitoring video abnormal behavior detection method based on time domain attention |
CN113516032B (en) * | 2021-04-29 | 2023-04-18 | 中国科学院西安光学精密机械研究所 | Weak supervision monitoring video abnormal behavior detection method based on time domain attention |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhong et al. | Step-by-step erasion, one-by-one collection: a weakly supervised temporal action detector | |
Huang et al. | Foreground-action consistency network for weakly supervised temporal action localization | |
Shi et al. | Weakly-supervised action localization by generative attention modeling | |
Liu et al. | Completeness modeling and context separation for weakly supervised temporal action localization | |
Kukleva et al. | Unsupervised learning of action classes with continuous temporal embedding | |
CN109409257A (en) | A kind of video timing motion detection method based on Weakly supervised study | |
Xiong et al. | A pursuit of temporal accuracy in general activity detection | |
Xu et al. | Segregated temporal assembly recurrent networks for weakly supervised multiple action detection | |
Zhao et al. | Temporal action detection with structured segment networks | |
Richard et al. | Neuralnetwork-viterbi: A framework for weakly supervised video learning | |
Richard et al. | Weakly supervised action learning with rnn based fine-to-coarse modeling | |
Liu et al. | Multi-shot temporal event localization: a benchmark | |
Shou et al. | Autoloc: Weakly-supervised temporal action localization in untrimmed videos | |
Shou et al. | Online detection of action start in untrimmed, streaming videos | |
Wang et al. | Untrimmednets for weakly supervised action recognition and detection | |
Fayyaz et al. | Sct: Set constrained temporal transformer for set supervised action segmentation | |
CN108537119B (en) | Small sample video identification method | |
Vahdani et al. | Deep learning-based action detection in untrimmed videos: A survey | |
CN110969166A (en) | Small target identification method and system in inspection scene | |
Ji et al. | Learning temporal action proposals with fewer labels | |
CN112560827B (en) | Model training method, model training device, model prediction method, electronic device, and medium | |
Shou et al. | Online action detection in untrimmed, streaming videos-modeling and evaluation | |
Javed et al. | Replay and key-events detection for sports video summarization using confined elliptical local ternary patterns and extreme learning machine | |
CN112115996B (en) | Image data processing method, device, equipment and storage medium | |
Ge et al. | Deep snippet selective network for weakly supervised temporal action localization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20190301 |