CN109409257A - Video temporal action detection method based on weakly supervised learning - Google Patents
- Publication number
- CN109409257A (application CN201811181395.4A)
- Authority
- CN
- China
- Prior art keywords
- video
- classifier
- segment
- motion detection
- detection method
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24137—Distances to cluster centroïds
- G06F18/2414—Smoothing the distance, e.g. radial basis function networks [RBFN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/254—Fusion techniques of classification results, e.g. of results related to same input data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Biology (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Human Computer Interaction (AREA)
- Image Analysis (AREA)
Abstract
The present invention relates to the field of digital image processing, and specifically to a video temporal action detection method based on weakly supervised learning. The training stage comprises the following steps. Step 1: input the video into the classifiers and obtain their respective detection confidences. Step 2: fuse the video's scores across the different classifiers. Step 3: refine the result with a conditional random field. The detection stage comprises: Step 4: feed the video to be detected into the trained classifiers and obtain their detection confidences; Step 5: fuse the different detection confidences through FC-CRF optimization. The method combines human prior knowledge with the outputs of neural networks; experiments show that the FC-CRF improves the mAP@0.5 detection performance on ActivityNet by 20.8%.
Description
Technical field
The present invention relates to the field of digital image processing, and specifically to a video temporal action detection method based on weakly supervised learning.
Background art
In the past few years, inspired by the immense success of deep learning in image-based analysis tasks, many deep learning architectures, in particular convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have been introduced into video-based action analysis. Karpathy et al. were the first to apply deep learning to action recognition in video, designing various deep models that process single frames or sequences of frames. Tran et al. built the C3D model, which performs 3D convolutions over spatio-temporal video volumes to better integrate appearance and motion cues. Wang et al. proposed temporal segment networks (TSN), which inherit the advantages of the two-stream feature extraction structure and use a sparse sampling scheme to cope with longer video clips. Qiu et al. proposed the pseudo-3D (P3D) residual network, which reuses off-the-shelf 2D networks for 3D CNNs. Beyond action recognition, other work addresses action detection and proposal generation. Shou et al. perform temporal action localization with a multi-stage CNN detection network. Escorcia et al. proposed the DAPs model, which encodes a video sequence with an RNN and retrieves action proposals in a single pass. Lin et al. skip the proposal-generation step with a single-shot action detector (SSAD). Shou et al. designed a convolution-deconvolution (CDC) network to determine precise temporal boundaries.
In the past few years, behavior analysis and understanding in video has attracted much attention, and many studies have addressed the problem with hand-crafted feature representations or deep learning architectures. A large body of existing work handles action analysis in a strongly supervised manner, in which the training data are manually annotated with action instances free of background, or are trimmed. In recent years, some strongly supervised methods have achieved satisfactory results. However, with ever larger video datasets, annotating the precise temporal position of action instances is time-consuming and labor-intensive. Moreover, unlike object boundaries, the definition of the exact temporal extent of an action is usually subjective and inconsistent across observers, which may introduce additional bias and error. To overcome these limitations of temporal action detection, a weakly supervised approach is a reasonable choice. The prior art builds deep learning models from precise temporal labels or trimmed videos, whereas the model of the invention is trained directly on untrimmed videos and requires only video-level class labels.
Summary of the invention
The object of the invention is a video temporal action detection method based on weakly supervised learning. To solve temporal action detection, the model of the invention predicts the action category and the temporal position of the action instances in a video. In the weakly supervised learning task, only video-level class labels are provided as supervisory signals, and during training the video clips, which contain action instances mixed with background, are not trimmed.
To achieve the object of the invention, the following technical solution is adopted:
A video temporal action detection method based on weakly supervised learning, whose training stage comprises the following steps:
Step 1: input the video into the classifiers and obtain their respective detection confidences;
Step 2: fuse the video's scores across the different classifiers;
Step 3: refine the result with a conditional random field.
Step 1 above proceeds in the following order:
a) The video is divided into non-overlapping segments of equal length, and features are extracted per segment.
b) Based on these segment features, the classifier outputs a detection confidence for each action category.
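Steps a) and b) can be sketched as follows. This is an illustrative sketch only: the segment length, the feature extractor, and the classifier are left unspecified in the description, so `seg_len`, `extract_feature`, and `classifier` here are hypothetical placeholders.

```python
import numpy as np

def split_into_segments(frames, seg_len=16):
    """Step a): divide a video of T frames into non-overlapping,
    equal-length segments, dropping any incomplete tail."""
    n = len(frames) // seg_len
    return [frames[k * seg_len:(k + 1) * seg_len] for k in range(n)]

def segment_confidences(segments, extract_feature, classifier):
    """Step b): per-segment detection confidences, one per action class."""
    feats = np.stack([extract_feature(s) for s in segments])
    return classifier(feats)  # shape: (num_segments, num_classes)
```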
Step 2 proceeds in the following order:
c) Given the video clips, obtain the corresponding category scores from the initial classifier (see Step 1);
d) According to the scores, erase part of the video's content to obtain new video segments. Concretely: from the category score of each video segment, compute its class probability, then, with probability given by that value, randomly erase the corresponding segment from the training set.
e) Traverse all videos of the training set once, removing segments as above, to obtain a new training set.
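A hedged sketch of the erasing step d). For illustration it uses a scalar weight `alpha` in place of the per-segment weight factor that the description defines later, and takes the maximum class probability per segment as the value driving erasure; both are assumptions, not the patent's exact formulation.

```python
import numpy as np

def erase_step(scores, alpha, rng):
    """Step d): from the per-segment category scores of one video, compute
    softmax class probabilities and erase segments at random with
    probability s = alpha * p. Returns a boolean keep-mask."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))  # stable softmax
    p = e / e.sum(axis=1, keepdims=True)
    s = alpha * p.max(axis=1)          # illustrative: max-class probability
    return rng.random(len(s)) >= s     # True = segment is kept
```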
Step 3 proceeds in the following order:
f) Train a classifier on the videos of the new training set;
g) Judge whether training has converged: if not, repeat Steps 2 and 3; if so, a series of trained classifiers is obtained.
During training, segments in which an action occurs with high confidence are gradually erased. In this way a series of classifiers with respective preferences is obtained, each suited to a different kind of action segment.
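The loop of steps f)-g) can be sketched as follows. `train_classifier` and `erase_from` are hypothetical stand-ins for the unspecified training and erasing routines, and convergence is approximated by the training set no longer shrinking; the actual convergence criterion is not given in the description.

```python
def train_classifier_series(train_set, train_classifier, erase_from,
                            max_rounds=10):
    """Repeat: train a classifier on the current training set, erase the
    most confidently detected segments, and retrain on what remains,
    collecting one classifier per round (the 'series of classifiers
    with respective preferences')."""
    classifiers, current = [], train_set
    for _ in range(max_rounds):
        clf = train_classifier(current)
        classifiers.append(clf)
        remaining = erase_from(current, clf)
        if len(remaining) == len(current):   # nothing erased: converged
            break
        current = remaining
    return classifiers
```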
At the inference stage, the segments containing action instances are selected according to the repeatedly trained classifiers, and the fused result is optimized by a fully connected conditional random field (FC-CRF). The detection stage comprises the following steps:
Step 4: feed the video to be detected into the trained classifiers and obtain their detection confidences;
Step 5: fuse the different detection confidences through FC-CRF optimization.
Step 4 above proceeds in the following order:
I) The video to be detected is divided into non-overlapping segments of equal length, and features are extracted per segment.
II) Based on these segment features, the trained classifiers output a detection confidence for each action category.
Step 5 above proceeds in the following order:
III) From the category score of each video segment, compute its class probability.
IV) A fully connected conditional random field (FC-CRF), in the form of a probabilistic graphical model, takes the class probabilities as input and, using the time-axis positions of the video segments, optimizes the fused result and outputs the final detection probabilities.
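Before the FC-CRF refinement, the fusion of Step 5 amounts to averaging the class probabilities and weight factors over the classifier series. A minimal sketch, in which the array layouts (segments x classes, and one weight per segment) are assumptions:

```python
import numpy as np

def fuse_scores(probs_per_clf, alpha_per_clf):
    """Average p and alpha over the classifier series and return the
    per-segment, per-class detection score alpha_bar * p_bar that the
    FC-CRF subsequently refines."""
    p_bar = np.mean(probs_per_clf, axis=0)   # (num_segments, num_classes)
    a_bar = np.mean(alpha_per_clf, axis=0)   # (num_segments,)
    return a_bar[:, None] * p_bar
```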
By adopting the above technical means, the invention has the following advantages and beneficial effects:
1. The invention proposes a weakly supervised model for detecting temporal actions in untrimmed video. The model obtains a series of classifiers by progressively erasing the video. At test time, the model is applied simply by collecting the detection results from the classifiers one by one.
2. To the best of our knowledge, this is the first work to introduce the fully connected conditional random field [22] (fully connected conditional random field, FC-CRF) into the temporal action detection task, combining human prior knowledge with the output of the neural network. Experiments show that the FC-CRF improves the mAP@0.5 detection performance on ActivityNet by 20.8%.
3. Extensive experiments were carried out on two challenging untrimmed video datasets, ActivityNet [11] and THUMOS'14 [20]; they show that the detection performance of the method, measured in mean average precision (mAP), exceeds all other weakly supervised temporal action detection methods and is even comparable to some strongly supervised methods.
To illustrate the concept and technical solution of the invention more clearly, the invention is further described below through specific embodiments with reference to the accompanying drawings.
Description of the drawings
Fig. 1 is the flow chart of the video temporal action detection method of the invention;
Fig. 2 is the training flow chart of the invention.
Specific embodiments
Fig. 1 is the flow chart of the video temporal action detection method of the invention. As shown in Fig. 1, a video temporal action detection method based on weakly supervised learning comprises the following steps: 1, input the video into each classifier and obtain the different detection confidences (S1); 2, fuse the video's scores across the different classifiers (S2); 3, refine the result with a conditional random field (S3).
Fig. 2 is the training flow chart of the invention. As shown in Fig. 2, training comprises the following steps: obtain the category scores of the video clips from the initial classifier (11); according to the scores, erase part of the video's content to obtain new video segments (12); train a classifier on the new videos (13); judge convergence: if not converged (14), repeat steps 12 and 13; if converged, proceed to the next step and obtain a series of trained classifiers (15).
The model training process of the method of the invention proceeds as follows:
Given a video V containing N clips, with K video-level class labels, and a classifier parameterized by θ, the invention obtains classification scores φ(V;θ) ∈ R^{N×C}, where C is the number of categories. In the t-th erasing step, the remaining segments of the training videos are denoted V_t and the classifier θ_t. For the i-th row φ_{i,:} of φ(V_t;θ_t), the original classification score of the i-th clip, the invention computes the softmax-normalized class probability p_{i,j}(V_t) of the j-th segment:
In addition, the invention defines the weight factor α_{i,j}:
where δ_τ is defined as follows:
where τ is a decay factor, a hyperparameter. The erasure probability s_{i,j} is then:
s_{i,j}(V_t) = α_{i,j}(V_t) · p_{i,j}(V_t)
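A sketch of the erasure-probability computation above. The softmax normalization is reconstructed from the text; the weight factor α (whose defining formula, together with δ_τ, is not reproduced in this text) is passed in as given rather than computed.

```python
import numpy as np

def erasure_probability(phi, alpha):
    """Softmax-normalize the raw classification scores phi along the last
    axis to get p, then return p and s = alpha * p, the probability of
    erasing each entry in round t."""
    e = np.exp(phi - phi.max(axis=-1, keepdims=True))  # stable softmax
    p = e / e.sum(axis=-1, keepdims=True)
    return p, alpha * p
```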
After obtaining the erasure probabilities s_{i,j}(V_t) of round t, the invention completes the training process as follows:
Step 2: use of the model.
From the series of classifiers obtained, the invention computes p_{i,j} and α_{i,j} and their averages. The invention builds a fully connected conditional random field whose energy function is:
where the label variables l_i and l_j denote the class labels of the i-th and j-th segments. A mean-field approximation is then used for optimization, and the averaged product α·p gives the detection confidence of each segment. From the fully connected conditional random field, the maximum a posteriori probability is computed to obtain the final score of each video segment.
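A compact mean-field sketch of FC-CRF inference over segment time positions, following the Gaussian-kernel construction of [22]. The kernel width `sigma`, the compatibility weight `compat`, and the iteration count are illustrative assumptions, not values from the patent; rewarding agreement with neighboring segments is equivalent, up to a per-segment constant, to a Potts disagreement penalty.

```python
import numpy as np

def mean_field_crf(unary_probs, positions, sigma=2.0, compat=1.0, iters=5):
    """Mean-field inference for a fully connected CRF over video segments:
    pairwise potentials use a Gaussian kernel on time-axis positions."""
    q = unary_probs.copy()                       # (num_segments, num_classes)
    d = positions[:, None] - positions[None, :]
    k = np.exp(-d ** 2 / (2 * sigma ** 2))       # Gaussian positional kernel
    np.fill_diagonal(k, 0.0)                     # exclude self-messages
    log_unary = np.log(unary_probs + 1e-12)
    for _ in range(iters):
        msg = k @ q                              # message passing step
        energy = log_unary + compat * msg        # favor label agreement
        e = np.exp(energy - energy.max(axis=1, keepdims=True))
        q = e / e.sum(axis=1, keepdims=True)     # renormalize beliefs
    return q
```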
The method of the invention was tested on ActivityNet and THUMOS'14, with the following results.
In the tables below, the compared metric is the mean average precision at different time-axis intersection-over-union thresholds, mAP (mean Average Precision), which measures the precision of the retrieved videos at different temporal IoU thresholds. Higher is better.
Strongly supervised learning means that the annotations of the training samples include both the video category and the temporal information.
Weakly supervised learning means that the annotations of the training samples include only the video category.
Single-stage, cascade, single-classification, and multi-classification refer to the different methods proposed in the respective references; the other methods of the cited references are not enumerated one by one.
Table 1: mean average precision on the ActivityNet dataset at different time-axis IoU thresholds.
Table 2: mAP@tIoU on THUMOS'14, i.e., the mean average precision on the THUMOS'14 dataset at different time-axis IoU thresholds.
In the tables, Strong/Weak Supervision denotes strongly or weakly supervised learning, and each method in the first column is the method of the corresponding reference and its authors.
According to other embodiments of the invention, in the technical solution:
1. The classifier may be based on any neural network, and may also be combined with traditional hand-crafted features.
2. The fully connected conditional random field may be replaced with a conditional random field of any kind.
References are cited by the number in square brackets, e.g., [53] denotes reference 53 and [59] reference 59:
[1]A.Karpathy,G.Toderici,S.Shetty,T.Leung,R.Sukthankar,and L.Fei-
Fei.2014.Large-scale video classification with convolutional neural
networks.In CVPR.1725–1732.
[2]P.Bojanowski,R.Lajugie,F.R.Bach,I.Laptev,J.Ponce,C.Schmid,and
J.Sivic.2014.Weakly supervised action labeling in videos under ordering
constraints.In ECCV.628–643.
[3]P.Bojanowski,R.Lajugie,E.Grave,F.Bach,I.Laptev,J.Ponce,and
C.Schmid.2015. Weakly-supervised alignment of video with text.In ICCV.4462–
4470.
[4]C.Feichtenhofer,A.Pinz,and A.Zisserman.2016.Convolutional two-stream network fusion for video action recognition.In CVPR.1933–1941.
[5]Joao Carreira and Andrew Zisserman.2017.Quo Vadis,Action
Recognition?A New Model and the Kinetics Dataset.In IEEE Conference on
Computer Vision and Pattern Recognition.4724–4733.
[6]Xiyang Dai,Bharat Singh,Guyue Zhang,Larry S.Davis,and Yan Qiu
Chen.2017.Temporal Context Network for Activity Localization in Videos.In
IEEE International Conference on Computer Vision.5727–5736.
[7]Dan Oneata,Jakob Verbeek,and Cordelia Schmid.2014.The LEAR
submission at Thumos 2014. Computer Vision and Pattern Recognition[cs.CV]
(2014).
[8]J.Donahue,L.Anne Hendricks,S.Guadarrama,M.Rohrbach,S.Venugopalan,
K.Saenko,and T. Darrell.2015.Long-term recurrent convolutional networks for
visual recognition and description.In CVPR. 2625–2634.
[9]V.Escorcia,F.C.Heilbron,J.C.Niebles,and B.Ghanem.2016.DAPs:Deep
action proposals for action understanding.In European Conference on
Computer Vision.768–784.
[10]Victor Escorcia,Fabian Caba Heilbron,Juan Carlos Niebles,and
Bernard Ghanem.2016.DAPs:Deep Action Proposals for Action Understanding.In
European Conference on Computer Vision.768–784.
[11]F.Caba Heilbron,V.Escorcia,B.Ghanem,and J.Carlos
Niebles.2015.ActivityNet:A large-scale video benchmark for human activity
understanding.In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition.961–970.
[12]C.Gan,C.Sun,L.Duan,and B.Gong.2016.Webly-supervised video
recognition by mutually voting for relevant web images and web video
frames.In ECCV.849–866.
[13]Jiyang Gao,Zhenheng Yang,Chen Sun,Kan Chen,and Ram
Nevatia.2017.TURN TAP:Temporal Unit Regression Network for Temporal Action
Proposals.arXiv:1703.06189(2017).
[14]H.Kuehne,A.Richard,and J.Gall.2016.Weakly supervised learning of
actions from transcripts.CoRR, abs/1610.02237(2016).
[15]Fabian Caba Heilbron,Wayner Barrios,Victor Escorcia,and Bernard
Ghanem.2017.SCC:Semantic Context Cascade for Efficient Action Detection.In
IEEE Conference on Computer Vision and Pattern Recognition.
[16]Fabian Caba Heilbron,Juan Carlos Niebles,and Bernard
Ghanem.2016.Fast Temporal Activity Proposals for Efficient Detection of
Human Actions in Untrimmed Videos.In Computer Vision and Pattern
Recognition.1914–1923.
[17]D.Huang,L.Fei-Fei,and J.C.Niebles.2016.Connectionist temporal
modeling for weakly supervised action labeling.In ECCV.137–153.
[18]Dinesh Jayaraman and Kristen Grauman.2016.Slow and Steady Feature
Analysis:Higher Order Temporal Coherence in Video.In Computer Vision and
Pattern Recognition.3852–3861.
[19]Yangqing Jia,Evan Shelhamer,Jeff Donahue,Sergey Karayev,Jonathan
Long,Ross Girshick,Sergio Guadarrama,and Trevor Darrell.2014.Caffe:
Convolutional Architecture for Fast Feature Embedding.arXiv preprint arXiv:
1408.5093(2014).
[20]Y.-G.Jiang,J.Liu,A.Roshan Zamir,G.Toderici,I.Laptev,M.Shah,and
R.Sukthankar.2014.THUMOS challenge:Action recognition with a large number
of classes.http://crcv.ucf.edu/THUMOS14/(2014).
[21]Svebor Karaman,Lorenzo Seidenari,and Alberto Del Bimbo.[n.d.]
.Fast saliency based pooling of Fisher encoded dense trajectories.([n.d.]).
[22]P.Krähenbühl and V.Koltun.2011.Efficient inference in fully
connected CRFs with Gaussian edge potentials.In NIPS.109–117.
[23]L.Wang,Y.Qiao,and X.Tang.2016.MoFAP:A multi-level representation
for action recognition.IJCV 119,3(2016),254–271.
[24]Ivan Laptev and Tony Lindeberg.2003.Space-time interest points.In
9th International Conference on Computer Vision.432–439.
[25]I.Laptev,M.Marszalek,C.Schmid,and B.Rozenfeld.2008.Learning
realistic human actions from movies.In CVPR.1–8.
[26]Colin Lea,Michael D.Flynn,Rene Vidal,Austin Reiter,and Gregory
D.Hager.2017.Temporal Convolutional Networks for Action Segmentation and
Detection.In IEEE Conference on Computer Vision and Pattern Recognition.1003–
1012.
[27]Tianwei Lin,Xu Zhao,and Zheng Shou.2017.Single Shot Temporal
Action Detection.In ACM on Multimedia Conference.
[28]L.Wang,Y.Xiong,D.Lin,and L.V.Gool.2017.UntrimmedNets for Weakly
Super-vised Action Recognition and Detection.arXiv:1703.03329(2017).
[29]L.Wang,Y.Xiong,Z.Wang,Y.Qiao,D.Lin,X.Tang,and L.Van
Gool.2016.Temporal segment networks: Towards good practices for deep action
recognition.In ECCV.20–36.
[30]Marcin Marszalek,Ivan Laptev,and Cordelia Schmid.2009.Actions in
context.In CVPR.2929–2936.
[31]Hossein Mobahi,Ronan Collobert,and Jason Weston.2009.Deep
learning from temporal coherence in video..In International Conference on
Machine Learning,ICML 2009,Montreal,Quebec,Canada,June.93.
[32]Nannan Li,Dan Xu,Zhenqiang Ying,Zhihao Li,and Ge
Li.2016.Searching Action Proposals via Spatial Actionness Estimation and
Temporal Path Inference and Tracking.In Asian Conference on Computer
Vision.384–399.
[33]O.Duchenne,I.Laptev,J.Sivic,F.R.Bach,and J.Ponce.2009.Automatic
annotation of human actions in video.In ICCV.1491–1498.
[34]Zhaofan Qiu,Ting Yao,and Tao Mei.2017.Learning Spatio-Temporal
Representation with Pseudo-3D Residual Networks.In ICCV.
[35]Alexander Richard and Juergen Gall.2016.Temporal Action Detection
Using a Statistical Language Model.In Computer Vision and Pattern
Recognition.
[36]Suman Saha,Gurkirt Singh,Michael Sapienza,Philip H.S.Torr,and
Fabio Cuzzolin.2016.Deep Learning for Detecting Multiple Space-Time Action
Tubes in Videos.arXiv:1608.01529(2016).
[37]Zheng Shou,Jonathan Chan,Alireza Zareian,Kazuyuki Miyazawa,and
Shih-Fu Chang.2017.CDC:Convolutional-De-Convolutional Networks for Precise
Temporal Action Localization in Untrimmed Videos.(2017).
[38]Zheng Shou,Dongang Wang,and Shih-Fu Chang.2016.Temporal action
localization in untrimmed videos via multi-stage cnns.In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition. 1049–1058.
[39]Gunnar A.Sigurdsson,Olga Russakovsky,and Abhinav Gupta.2017.What
Actions are Needed for Understanding Human Actions in Videos? CoRR abs/
1708.02696(2017).arXiv:1708.02696 http://arxiv.org/abs/1708.02696
[40]Karen Simonyan and Andrew Zisserman.2014.Two-stream convolutional
networks for action recognition in videos.In Advances in neural information
processing systems.568–576.
[41]Krishna Kumar Singh and Yong Jae Lee.2017.Hide-and-Seek:Forcing a
Network to be Meticulous for Weakly-supervised Object and Action
Localization.arXiv:1704.04232(2017).
[42]S.Satkin and M.Hebert.2010.Modeling the temporal extent of
actions.In ECCV.536–548.
[43]Chen Sun,Sanketh Shetty,Rahul Sukthankar,and Ram
Nevatia.2015.Temporal Localization of Fine-Grained Actions in Videos by
Domain Transfer from Web Images.In ACM International Conference on
Multimedia.371–380.
[44]Du Tran,Lubomir Bourdev,Rob Fergus,Lorenzo Torresani,and Manohar
Paluri.2015.Learning spatiotemporal features with 3d convolutional
networks.In Proceedings of the IEEE International Conference on Computer
Vision.4489–4497.
[45]Heng Wang and Cordelia Schmid.2013.Action recognition with
improved trajectories.In Proceedings of the IEEE International Conference on
Computer Vision.3551–3558.
[46]Limin Wang,Yu Qiao,and Xiaoou Tang.[n.d.].Action Recognition and
Detection by Combining Motion and Appearance Features.([n.d.]).
[47]L.Wang,Y.Qiao,and X.Tang.2015.Action recognition with trajectory-
pooled deep-convolutional descriptors.In CVPR.4305–4314.
[48]Limin Wang,Yuanjun Xiong,Zhe Wang,Yu Qiao,Dahua Lin,Xiaoou Tang,
and Luc Van Gool.2017. Temporal Segment Networks for Action Recognition in
Videos.CoRR abs/1705.02953(2017).arXiv:1705.02953 http://arxiv.org/abs/
1705.02953
[49]Xiaolong Wang,Ross Girshick,Abhinav Gupta,and Kaiming
He.2017.Non-local Neural Networks. arXiv preprint arXiv:1711.07971(2017).
[50]Yunchao Wei,Jiashi Feng,Xiaodan Liang,Ming-Ming Cheng,Yao Zhao,
and Shuicheng Yan.2017. Object Region Mining with Adversarial Erasing:A
Simple Classification to Semantic Segmentation Approach. arXiv:1703.08448
(2017).
[51]Yunchao Wei,Wei Xia,Junshi Huang,Bingbing Ni,Jian Dong,Yao Zhao,
and Shuicheng Yan.2014. CNN:Single-label to Multi-label.Computer Science
(2014).
[52]L Wiskott and T Sejnowski.2002.Slow feature analysis:unsupervised
learning of invariances.Neural Computation 14,4(2002),715.
[53]Yuanjun Xiong,Yue Zhao,Limin Wang,Dahua Lin,and Xiaoou
Tang.2017.A Pursuit of Temporal Accuracy in General Activity Detection.arXiv:
1703.02716(2017).
[54]Huijuan Xu,Abir Das,and Kate Saenko.2017.R-C3D:Region
Convolutional 3D Network for Temporal Activity Detection.In IEEE
International Conference on Computer Vision.5794–5803.
[55]Serena Yeung,Olga Russakovsky,Greg Mori,and Li Fei-Fei.2016.End-
to-end learning of action detection from frame glimpses in videos.In
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition.2678–2687.
[56]Jun Yuan,Bingbing Ni,Xiaokang Yang,and Ashraf
A.Kassim.2016.Temporal Action Localization with Pyramid of Score Distribution
Features.In Computer Vision and Pattern Recognition.3093–3102.
[57]Zehuan Yuan,Jonathan C.Stroud,Tong Lu,and Jia Deng.2017.Temporal
Action Localization by Structured Maximal Sums.In IEEE Conference on Computer
Vision and Pattern Recognition.3215–3223.
[58]Yimeng Zhang and Tsuhan Chen.2012.Efficient inference for fully-
connected CRFs with stationarity. 2012 IEEE Conference on Computer Vision and
Pattern Recognition(CVPR)00(2012),582–589.
[59]Yue Zhao,Yuanjun Xiong,Limin Wang,Zhirong Wu,Xiaoou Tang,and
Dahua Lin.2017.Temporal Action Detection with Structured Segment Networks.In
IEEE International Conference on Computer Vision. 2933–2942.
[60]Yi Zhu and Shawn Newsam.2016.Efficient Action Detection in
Untrimmed Videos via Multi-Task Learning.arXiv:1612.07403(2016).
Claims (9)
1. A video temporal action detection method based on weakly supervised learning, whose training stage comprises the following steps:
Step 1: input the video into the classifiers and obtain their respective detection confidences;
Step 2: fuse the video's scores across the different classifiers;
Step 3: refine the result with a conditional random field.
2. The video temporal action detection method based on weakly supervised learning according to claim 1, characterized in that Step 1 proceeds in the following order:
a) The video is divided into non-overlapping segments of equal length, and features are extracted per segment.
b) Based on these segment features, the classifier outputs a detection confidence for each action category.
3. The video temporal action detection method based on weakly supervised learning according to claim 1, characterized in that Step 2 proceeds in the following order:
c) Given the video clips, obtain the corresponding category scores from the initial classifier (see Step 1);
d) According to the scores, erase part of the video's content to obtain new video segments. Concretely: from the category score of each video segment, compute its class probability, then, with probability given by that value, randomly erase the corresponding segment from the training set.
e) Traverse all videos of the training set once, removing segments as above, to obtain a new training set.
4. The video temporal action detection method based on weakly supervised learning according to claim 1, characterized in that Step 3 proceeds in the following order:
f) Train a classifier on the videos of the new training set;
g) Judge whether training has converged: if not, repeat Steps 2 and 3; if so, a series of trained classifiers is obtained.
5. The video temporal action detection method based on weakly supervised learning according to any one of claims 1-4, further comprising a detection stage after Step 3, which comprises:
Step 4: feed the video to be detected into the trained classifiers and obtain their detection confidences;
Step 5: fuse the different detection confidences through FC-CRF optimization.
6. The video temporal action detection method based on weakly supervised learning according to claim 5, characterized in that Step 4 proceeds in the following order:
I) The video to be detected is divided into non-overlapping segments of equal length, and features are extracted per segment.
II) Based on these segment features, the trained classifiers output a detection confidence for each action category.
7. The video temporal action detection method based on weakly supervised learning according to claim 5, characterized in that Step 5 proceeds in the following order:
III) From the category score of each video segment, compute its class probability;
IV) A fully connected conditional random field (FC-CRF), in the form of a probabilistic graphical model, takes the class probabilities as input and, using the time-axis positions of the video segments, optimizes the fused result and outputs the final detection probabilities.
8. The video temporal action detection method based on weakly supervised learning according to any one of claims 1 to 4, characterized in that the training process of the trained classifiers is as follows:
Given a video V containing N clips, with K video-level class labels, and a classifier parameterized by θ, the classification scores φ(V;θ) ∈ R^{N×C} are obtained, where C is the number of categories. In the t-th erasing step, the remaining segments of the training videos are denoted V_t and the classifier θ_t. For the i-th row φ_{i,:} of φ(V_t;θ_t), the original classification score of the i-th clip, the softmax-normalized class probability p_{i,j}(V_t) is computed for the j-th segment:
In addition, the weight factor α_{i,j} is defined:
where δ_τ is defined as follows:
where τ is a decay factor, a hyperparameter. The erasure probability s_{i,j} is then:
s_{i,j}(V_t) = α_{i,j}(V_t) · p_{i,j}(V_t)
After the erasure probabilities s_{i,j}(V_t) of round t are obtained, the training process is completed as follows:
9. The video temporal action detection method based on weakly supervised learning according to any one of claims 1 to 4, characterized in that the trained classifier is used as follows:
From the p_{i,j} and α_{i,j} computed by the series of obtained classifiers, we take their average values p̄_{i,j} and ᾱ_{i,j}. We build a fully-connected conditional random field whose energy function is defined over label variables l_i and l_j, specified by p̄ and denoting the class labels of the i-th and j-th segments. Thereafter, the labels are solved near-optimally by mean-field approximation, and the product ᾱ·p̄ gives the detection confidence of each segment; according to the fully-connected conditional random field, the maximum posterior probability is computed to obtain the final score of each video segment.
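The mean-field inference of claim 9 can be sketched as below. The claim does not give the exact energy function, so a Gaussian pairwise kernel on temporal distance between segments is assumed here, and the unary energies, kernel width, and iteration count are illustrative:

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mean_field_crf(unary, times, sigma=1.0, w=1.0, iters=5):
    """Mean-field approximation for a fully-connected CRF over segments.
    unary: (N, C) unary energies (lower = more likely); times: (N,)
    time-axis positions. Pairwise term: assumed Gaussian kernel on
    temporal distance. Returns per-segment marginals Q of shape (N, C)."""
    K = np.exp(-0.5 * ((times[:, None] - times[None, :]) / sigma) ** 2)
    np.fill_diagonal(K, 0.0)       # no self-message
    Q = softmax(-unary)            # initialize from unaries alone
    for _ in range(iters):
        msg = K @ Q                # agreement messages from all other segments
        Q = softmax(-unary + w * msg)
    return Q

# 5 segments on the time axis; segment 2 disagrees with its neighbours
unary = np.array([[0.0, 2.0]] * 5)
unary[2] = [2.0, 0.0]
times = np.arange(5, dtype=float)
Q = mean_field_crf(unary, times)
```

After inference the outlier segment's marginal is pulled toward the label of its temporal neighbours, which is the smoothing effect the claim attributes to the FC-CRF; the per-segment maximum of Q then yields the final score.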
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811181395.4A CN109409257A (en) | 2018-10-11 | 2018-10-11 | A kind of video timing motion detection method based on Weakly supervised study |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109409257A true CN109409257A (en) | 2019-03-01 |
Family
ID=65467544
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811181395.4A Pending CN109409257A (en) | 2018-10-11 | 2018-10-11 | A kind of video timing motion detection method based on Weakly supervised study |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109409257A (en) |
Non-Patent Citations (1)
Title |
---|
JIA-XING ZHONG et al.: "Step-by-step Erasion, One-by-one Collection: A Weakly Supervised Temporal Action Detector", arXiv * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110189800A (en) * | 2019-05-06 | 2019-08-30 | 浙江大学 | Furnace oxygen content soft-measuring modeling method based on more granularities cascade Recognition with Recurrent Neural Network |
CN110189800B (en) * | 2019-05-06 | 2021-03-30 | 浙江大学 | Furnace oxygen content soft measurement modeling method based on multi-granularity cascade cyclic neural network |
CN110490055A (en) * | 2019-07-08 | 2019-11-22 | 中国科学院信息工程研究所 | A kind of Weakly supervised Activity recognition localization method and device recoded based on three |
CN111104855A (en) * | 2019-11-11 | 2020-05-05 | 杭州电子科技大学 | Workflow identification method based on time sequence behavior detection |
CN111104855B (en) * | 2019-11-11 | 2023-09-12 | 杭州电子科技大学 | Workflow identification method based on time sequence behavior detection |
CN111079646A (en) * | 2019-12-16 | 2020-04-28 | 中山大学 | Method and system for positioning weak surveillance video time sequence action based on deep learning |
CN111079646B (en) * | 2019-12-16 | 2023-06-06 | 中山大学 | Weak supervision video time sequence action positioning method and system based on deep learning |
CN113516032A (en) * | 2021-04-29 | 2021-10-19 | 中国科学院西安光学精密机械研究所 | Weak supervision monitoring video abnormal behavior detection method based on time domain attention |
CN113516032B (en) * | 2021-04-29 | 2023-04-18 | 中国科学院西安光学精密机械研究所 | Weak supervision monitoring video abnormal behavior detection method based on time domain attention |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhong et al. | Step-by-step erasion, one-by-one collection: a weakly supervised temporal action detector | |
Huang et al. | Foreground-action consistency network for weakly supervised temporal action localization | |
Shi et al. | Weakly-supervised action localization by generative attention modeling | |
Liu et al. | Completeness modeling and context separation for weakly supervised temporal action localization | |
Kukleva et al. | Unsupervised learning of action classes with continuous temporal embedding | |
CN109409257A (en) | A kind of video timing motion detection method based on Weakly supervised study | |
Xiong et al. | A pursuit of temporal accuracy in general activity detection | |
Xu et al. | Segregated temporal assembly recurrent networks for weakly supervised multiple action detection | |
Zhao et al. | Temporal action detection with structured segment networks | |
Richard et al. | Neuralnetwork-viterbi: A framework for weakly supervised video learning | |
Richard et al. | Weakly supervised action learning with rnn based fine-to-coarse modeling | |
Liu et al. | Multi-shot temporal event localization: a benchmark | |
Shou et al. | Autoloc: Weakly-supervised temporal action localization in untrimmed videos | |
Shou et al. | Online detection of action start in untrimmed, streaming videos | |
Wang et al. | Untrimmednets for weakly supervised action recognition and detection | |
Fayyaz et al. | Sct: Set constrained temporal transformer for set supervised action segmentation | |
CN108537119B (en) | Small sample video identification method | |
Vahdani et al. | Deep learning-based action detection in untrimmed videos: A survey | |
CN110969166A (en) | Small target identification method and system in inspection scene | |
Ji et al. | Learning temporal action proposals with fewer labels | |
CN112560827B (en) | Model training method, model training device, model prediction method, electronic device, and medium | |
Shou et al. | Online action detection in untrimmed, streaming videos-modeling and evaluation | |
Javed et al. | Replay and key-events detection for sports video summarization using confined elliptical local ternary patterns and extreme learning machine | |
CN112115996B (en) | Image data processing method, device, equipment and storage medium | |
Ge et al. | Deep snippet selective network for weakly supervised temporal action localization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20190301 |