CN105844239A

CN105844239A - Method for detecting riot and terror videos based on CNN and LSTM

Info

Publication number: CN105844239A
Application number: CN201610168334.9A
Authority: CN
Inventors: 苏菲; 宋凡; 宋一凡; 赵志诚
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2016-03-23
Filing date: 2016-03-23
Publication date: 2016-08-10
Anticipated expiration: 2036-03-23
Also published as: CN105844239B

Abstract

The invention discloses a method for detecting riot and terror videos based on CNN and LSTM, belonging to the technical fields of pattern recognition, video detection and deep learning. The detection method comprises the following steps: first, key frames are sampled from the video to be detected and key-frame features are extracted; next, video-level representation and discrimination are performed, comprising VLAD feature representation and SVM discrimination in a CNN semantic module, VLAD feature representation and SVM discrimination in a CNN scene module, and LSTM discrimination in an LSTM time-sequence module; finally, the results are fused. The method exploits the strengths of CNN in image feature extraction and of LSTM in time-sequence representation, and takes full account of the scene characteristics of riot and terror videos; in real tests the mAP reaches 98.0%, which approaches the level of manual review. In terms of speed, using only single-machine GPU acceleration, 76.4 seconds of network video can be processed per second. The method is suitable for blocking the spread of riot and terror videos on large video websites and thereby helps maintain social stability and long-term national peace and order.

Description

A method for detecting riot and terror videos based on CNN and LSTM

Technical field

The invention belongs to the technical fields of pattern recognition, video detection and deep learning, and specifically relates to a method for detecting riot and terror videos based on CNN and LSTM.

Background art

In recent years, large numbers of violent terror videos from home and abroad have been illegally disseminated and have become a serious threat to social stability. Automated detection of riot and terror videos, however, is still under development, and most approaches reuse existing event-video detection methods. These methods fall roughly into three categories: video detection based on local image features, video detection based on semantic concepts, and video detection based on convolutional neural networks (Convolutional Neural Network, CNN for short).

Reference [1] (Sun, Chen, and Ram Nevatia. "Large-scale web video event classification by use of Fisher vectors." In Applications of Computer Vision (WACV), 2013 IEEE Workshop on, pp. 15-22. IEEE, 2013.) discloses a video detection method based on local image features. First, at the key-frame level, local image features such as Scale-Invariant Feature Transform (SIFT) features are extracted. Then, at the video level, a Fisher vector (Fisher Vector) representation yields a global representation of the video. Finally, a support vector machine (Support Vector Machine, SVM for short) classifier determines the class of the video, e.g. riot and terror video or non-riot-and-terror video. The method requires little manual annotation during training and is simple, but it has the following shortcomings: (1) detection accuracy is limited by the local features used; (2) detection is slow. Computing local features such as SIFT is expensive, so the method is ill-suited to large-scale video detection tasks and its practicality is limited.

Reference [2] (Liu, J.; Yu, Qian; Javed, O.; Ali, S.; Tamrakar, A.; Divakaran, A.; Hui Cheng; & Sawhney, H., Video event recognition using concept attributes, WACV, 2013.) discloses a video detection method based on semantic concepts. First, at the key-frame level, local feature extraction combined with SVM classifiers determines the confidence of various preset semantic concepts in the picture (for riot and terror videos, these concepts include but are not limited to guns, explosions, masked persons and terrorist organization emblems). Then, at the video level, a Fisher Vector representation generates a global video feature. Finally, an SVM classifier again determines the type of the video. Because the preset semantic concepts provide guidance, the semantic-concept-based method recognizes riot and terror videos with higher precision, but it has the following shortcomings: (1) training requires a large number of annotated image samples, so the manual cost is high; (2) when the video to be detected contains none of the preset concepts, detection accuracy is not guaranteed; (3) detection is slow.

Reference [3] (Xu, Zhongwen, Yi Yang, and Alexander G. Hauptmann. "A discriminative CNN video representation for event detection." arXiv preprint arXiv:1411.4006 (2014).) discloses a video detection method based on CNN semantic features. In the training stage, a CNN semantic model is trained with a large number of annotated images. In the test stage, the trained model extracts CNN semantic features of the key frames (such as the FC6, FC7 and SPP features); then, at the video level, the Vector of Locally Aggregated Descriptors (VLAD) method is used to represent the features and obtain a high-dimensional video feature. The method achieves good results on the Multimedia Event Detection (MED) dataset. It takes full advantage of the strength of CNN in still-image feature extraction and can achieve good results in riot and terror video detection, but the following aspects can still be improved: (1) during VLAD feature representation, the method makes insufficient use of the temporal characteristics of the video; (2) the method extracts only the CNN semantic features of key frames and pays no attention to the other distinctive characteristics of riot and terror videos. In summary, video detection based on CNN semantic features still has room for performance improvement.

Summary of the invention

To solve the problems in the prior art, the present invention proposes a method for detecting riot and terror videos based on CNN and long short-term memory (Long Short-Term Memory, LSTM for short). The method exploits the strengths of CNN in image feature extraction and of LSTM in time-sequence representation, and takes full account of the scene characteristics of riot and terror videos. In actual tests the mAP reaches 98.0%, close to the level of manual review. In terms of speed, using only single-machine GPU acceleration, the method processes 76.4 seconds of network video (average bit rate 632 kbps) per second. It is suitable for blocking the spread of riot and terror videos on large video websites and helps maintain social stability and long-term national peace and order.

Analysis of a large number of riot and terror videos shows that such videos have distinctive characteristics in two respects: temporal structure and shooting scene. Based on this finding, the present invention adds, on top of the original video detection module based on CNN semantic features (CNN semantic module for short), a video detection module based on CNN scene features (CNN scene module for short) and a time-sequence detection module based on LSTM (LSTM time-sequence module for short). For a video to be detected, the invention fuses the detection results from three aspects (semantics, scene and temporal structure) to judge more comprehensively whether the video involves terror, which reduces the false detection rate and improves the practical value of the method.

The method for detecting riot and terror videos based on CNN and LSTM provided by the present invention specifically comprises the following steps:

Step 1: perform key-frame sampling on the video to be detected and extract key-frame features;

Step 2: using the extracted key-frame features, perform video-level representation and discrimination, including the semantic VLAD feature representation and SVM discrimination of the CNN semantic module, the scene VLAD feature representation and SVM discrimination of the CNN scene module, and the LSTM discrimination of the LSTM time-sequence module;

Step 3: fuse the results. A hierarchical fusion strategy based on validation-set mAP values is adopted: for a video to be identified, the decision scores of the three modules (CNN semantic module, CNN scene module and LSTM time-sequence module) are computed separately and then fused by weighting, with each module's mAP value on the validation set as its weight.

The advantages and beneficial effects of the present invention are:

(1) Using the CNN semantic module alone, as in the prior art, ignores the temporal information of the video. To make full use of the temporal-structure characteristics of riot and terror videos, the present invention adds an LSTM time-sequence module to the original method. Test results show that introducing temporal information significantly improves recognition accuracy.

(2) Based on statistics and analysis of a large-scale sample of riot and terror videos, the present invention identifies the distinctive characteristics of such videos in terms of recording scene. Accordingly, it adds a CNN scene module to the original structure for riot and terror video detection, which ensures recognition accuracy under specific video scenes.

The method for detecting riot and terror videos based on CNN and LSTM provided by the present invention is mainly intended for government network supervision departments and large video websites, to detect whether videos uploaded by users involve violent or terrorist content. Once a video is suspected of containing such illegal content, a warning should be issued promptly and the video handed over for manual review:

(1) The present invention can be applied in government network supervision departments' campaigns to "root out" online riot and terror audio and video. In addition to the existing manual reporting, the invention can be used to sample and check online videos on major video websites; rectification notices are issued to the websites where problems are found, safeguarding the domestic Internet environment.

(2) The present invention can be applied in the content security systems of large video websites, both to filter out riot and terror content while users upload videos and to check the existing video stock, so that the website avoids unnecessary losses from crossing content-safety red lines.

Brief description of the drawings

Fig. 1 is a flow diagram of the video detection method provided by the present invention.

Fig. 2 is a schematic diagram of SPP feature extraction in the present invention.

Fig. 3 is a schematic diagram of the structure of an LSTM neural unit in the present invention.

Detailed description of the invention

The present invention is described in detail below with reference to the accompanying drawings and embodiments.

The present invention provides a method for detecting riot and terror videos based on CNN and LSTM. As shown in Fig. 1, the video detection method specifically comprises the following steps:

Step 1: perform key-frame sampling on the video to be detected and extract key-frame features.

(1) The video to be detected is first sampled at equal intervals, with a sampling interval of 1 second, to obtain key-frame images.
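The equal-interval sampling in (1) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `keyframe_indices` and its parameters are assumptions, and the frame rate and frame count are taken as given by whatever video decoder is used.

```python
from typing import List


def keyframe_indices(num_frames: int, fps: float, interval_s: float = 1.0) -> List[int]:
    """Return the indices of frames sampled every `interval_s` seconds."""
    if num_frames <= 0 or fps <= 0 or interval_s <= 0:
        return []
    indices: List[int] = []
    k = 0
    while True:
        # Frame closest to the timestamp k * interval_s.
        idx = int(round(k * interval_s * fps))
        if idx >= num_frames:
            break
        # Skip repeats that can occur when fps * interval_s is below 1.
        if not indices or idx != indices[-1]:
            indices.append(idx)
        k += 1
    return indices


# A 10-second clip at 25 fps gives one key frame per second.
print(keyframe_indices(num_frames=250, fps=25.0))
# [0, 25, 50, 75, 100, 125, 150, 175, 200, 225]
```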

(2) Each key-frame image is downsampled to 227 × 227 and input to the CNN semantic model and the CNN scene model, which extract the CNN semantic features and the CNN scene features of the key-frame image, respectively.

The CNN semantic features and the CNN scene features each comprise three kinds of features: the FC6 feature, the FC7 feature and the SPP feature. The FC6 and FC7 features are the commonly used 4096-dimensional vectors; the SPP feature extraction process is more specialized and is described in detail below.

As shown in the SPP feature extraction diagram of Fig. 2, SPP features are extracted after the Conv5 layer (convolutional layer 5, i.e. the fifth convolutional layer of the CNN model). The Conv5 layer fully preserves the spatial position information of targets, but its feature dimensionality is too high to be used directly. To avoid this problem, the Conv5 feature map is first divided into spatial regions on 1 × 1, 2 × 2 and 3 × 3 grids; max pooling is then applied within each region, yielding one 256-dimensional (256D) vector per region, 1 + 4 + 9 = 14 vectors in total. Each dimension of each vector corresponds, explicitly or implicitly, to a particular semantic concept. These vectors are the SPP features.
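The region division and max pooling described above can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions: the function name `spp_vectors`, the nested-list layout `[channel][row][col]` and the 13 × 13 spatial size of the toy Conv5 map are not given in the source, and the rule used for cell boundaries (floor for the start, ceiling for the end, so neighbouring cells may share a row or column) is one common choice rather than the patent's stated one.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]


def spp_vectors(fmap: FeatureMap, grids=(1, 2, 3)) -> List[List[float]]:
    """Max-pool every channel over 1x1, 2x2 and 3x3 spatial grids.

    Returns one C-dimensional vector per grid cell (1 + 4 + 9 = 14 vectors).
    """
    channels = len(fmap)
    height, width = len(fmap[0]), len(fmap[0][0])
    vectors = []
    for g in grids:
        for gy in range(g):
            # Cell rows: floor for the start, ceiling for the end.
            y0, y1 = (gy * height) // g, -((-(gy + 1) * height) // g)
            for gx in range(g):
                x0, x1 = (gx * width) // g, -((-(gx + 1) * width) // g)
                vectors.append([
                    max(fmap[c][y][x] for y in range(y0, y1) for x in range(x0, x1))
                    for c in range(channels)
                ])
    return vectors


# Toy Conv5 output: 256 channels on a 13 x 13 grid (grid size is an assumption).
fmap = [[[float((c + 3 * y + 5 * x) % 17) for x in range(13)]
         for y in range(13)] for c in range(256)]
vectors = spp_vectors(fmap)
print(len(vectors), len(vectors[0]))  # 14 256
```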

For each key-frame image, the present invention extracts three kinds of CNN semantic features (SPP, FC6 and FC7) and three kinds of CNN scene features (SPP, FC6 and FC7), and then feeds them as needed into the different video-level discrimination modules for further processing.

Step 2: using the extracted key-frame features, perform video-level representation and discrimination.

The video level comprises three independent feature representations and discriminations: the semantic VLAD feature representation and SVM discrimination of the CNN semantic module, the scene VLAD feature representation and SVM discrimination of the CNN scene module, and the LSTM discrimination of the LSTM time-sequence module.

For the semantic VLAD feature representation and SVM discrimination of the CNN semantic module, the input features are the three kinds of CNN semantic features (SPP, FC6, FC7). First, principal component analysis (Principal Components Analysis, PCA for short) reduces the three features to 128, 256 and 256 dimensions, respectively.

Next, the VLAD method is applied: the D-dimensional feature vectors obtained after dimensionality reduction are projected, by difference accumulation, onto a set of cluster centers $C = \{c_1, c_2, \ldots, c_K\}$ obtained in advance by K-means clustering (K-Means). Let $V = \{v_1, v_2, \ldots, v_N\}$ denote a set of N dimension-reduced feature vectors; the accumulated difference vector $\mathrm{diff}_k$ associated with cluster center $c_k$ can then be expressed as:

$$\mathrm{diff}_k = \sum_{i \,:\, NN(v_i) = c_k} (v_i - c_k) \tag{1}$$

where i = 1, 2, ..., N and k = 1, 2, ..., K. $NN(v_i)$ denotes the nearest neighbor, in Euclidean distance, of the dimension-reduced feature vector $v_i$ in the cluster center set C. Each accumulated difference vector $\mathrm{diff}_j$ (1 ≤ j ≤ K) is $\ell_2$-normalized, and the K accumulated difference vectors are then concatenated to give the final K × D-dimensional VLAD feature representation. Here the number of cluster centers K is set to 256, so the VLAD feature representations corresponding to SPP, FC6 and FC7 have 32,768, 65,536 and 65,536 dimensions, respectively.
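The encoding of formula (1), followed by per-center normalization and concatenation, can be sketched as below. This is a minimal pure-Python illustration: the function name `vlad_encode` and the toy centers and descriptors are assumptions, and leaving an empty cluster as an all-zero block is a choice made here because the source does not say how that case is handled.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def vlad_encode(descriptors: Sequence[Vector], centers: Sequence[Vector]) -> List[float]:
    """Encode D-dim descriptors against K centers as a K*D VLAD vector."""
    dim = len(centers[0])
    diffs = [[0.0] * dim for _ in centers]
    for v in descriptors:
        # NN(v): index of the center with the smallest squared Euclidean distance.
        k = min(range(len(centers)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(v, centers[j])))
        for d in range(dim):
            diffs[k][d] += v[d] - centers[k][d]  # formula (1)
    encoded: List[float] = []
    for diff in diffs:
        norm = math.sqrt(sum(x * x for x in diff))
        # L2-normalize each accumulated difference; empty clusters stay all-zero.
        encoded.extend(x / norm if norm > 0 else 0.0 for x in diff)
    return encoded


centers = [(0.0, 0.0), (10.0, 10.0)]
descriptors = [(1.0, 0.0), (2.0, 0.0), (10.0, 13.0), (10.0, 11.0)]
print(vlad_encode(descriptors, centers))  # [1.0, 0.0, 0.0, 1.0]
```

With K = 256 centers and D = 128 (the SPP case), the returned list has 256 × 128 = 32,768 entries, matching the dimensions stated above.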

Finally, a linear SVM classifier is trained to judge the terror-related confidence of the video. Let the sample set formed by the video VLAD feature representations be $X = \{x_1, x_2, \ldots, x_N\}$ and the corresponding set of video classes (riot and terror, non-riot-and-terror) be $Y = \{y_1, y_2, \ldots, y_N\}$, where $y_i \in \{+1, -1\}$. Maximizing the geometric margin is converted into solving a convex quadratic optimization problem, and the learned separating hyperplane is:

$$w \cdot x + b = 0 \tag{2}$$

where w and b are the slope (normal vector) and the bias of the separating hyperplane, respectively. Maximizing the geometric margin of the separating hyperplane can be expressed as an optimization problem with inequality constraints:

$$\max_{w,\, b} \ \gamma \tag{3}$$

$$\text{s.t.} \quad y_i \left( \frac{w}{\|w\|} \cdot x_i + \frac{b}{\|w\|} \right) \ge \gamma, \quad i = 1, 2, \ldots, N \tag{4}$$

where γ denotes the geometric margin, i.e. the smallest geometric distance from the sample points $x_i$ to the separating hyperplane. This problem can be optimized through its min-max Lagrangian dual problem and solved with the sequential minimal optimization (Sequential Minimal Optimization, SMO for short) algorithm. Solving yields the parameters $w^*$ and $b^*$ of the optimal separating hyperplane, and the riot and terror video classification decision function can then be expressed as:

$$f(x) = \operatorname{sign}(w^* \cdot x + b^*) \tag{5}$$

where sign(x) denotes the sign function. The confidence that the current VLAD feature representation is identified as riot and terror is:

$$P(y = +1) = \frac{1}{1 + e^{-(w^* \cdot x + b^*)}} \tag{6}$$
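For readers who want the intermediate step between the margin problem (3)-(4) and the solution $w^*$, $b^*$, the usual textbook route is sketched below. This is the standard hard-margin SVM derivation, not text from the patent, and the multipliers $\alpha_i$ are notation introduced here for illustration.

```latex
% Fix the functional margin to 1, so that \gamma = 1/\|w\|; maximizing \gamma
% is then equivalent to the convex quadratic program
\min_{w,b}\ \tfrac{1}{2}\|w\|^{2}
\quad \text{s.t.} \quad y_i\,(w \cdot x_i + b) \ge 1,\quad i = 1,\ldots,N .

% Lagrangian with multipliers \alpha_i \ge 0, and the min-max form of the primal
L(w,b,\alpha) = \tfrac{1}{2}\|w\|^{2}
  - \sum_{i=1}^{N} \alpha_i \bigl[\, y_i\,(w \cdot x_i + b) - 1 \,\bigr],
\qquad \min_{w,b}\ \max_{\alpha \ge 0}\ L(w,b,\alpha).

% Dual problem (the form that SMO solves)
\max_{\alpha}\ \sum_{i=1}^{N} \alpha_i
  - \tfrac{1}{2} \sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j)
\quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0,\ \ \alpha_i \ge 0 .

% Recovering the hyperplane used in formula (5)
w^{*} = \sum_{i=1}^{N} \alpha_i^{*} y_i x_i, \qquad
b^{*} = y_j - w^{*} \cdot x_j \ \ \text{for any } j \text{ with } \alpha_j^{*} > 0 .
```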

The VLAD feature representations of SPP, FC6 and FC7 are each passed through a linear SVM classifier, which finally outputs the discrimination confidences $P_s^{(fc6)}$, $P_s^{(fc7)}$ and $P_s^{(spp)}$ corresponding to the three CNN semantic features FC6, FC7 and SPP.
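Formulas (5) and (6) amount to a dot product followed by a sign test and a logistic mapping. The sketch below assumes an already trained hyperplane; the weights, bias and inputs are made-up toy values, the function names are hypothetical, and treating a score of exactly zero as the positive class is a choice made here (the mathematical sign function would return 0).

```python
import math
from typing import Sequence


def svm_score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed score w . x + b of a sample against the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Formula (5): +1 (riot and terror) or -1; a zero score maps to +1 here."""
    return 1 if svm_score(w, b, x) >= 0 else -1


def confidence(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Formula (6): logistic mapping of the score to P(y = +1)."""
    s = svm_score(w, b, x)
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    # Algebraically identical form that avoids overflow for very negative scores.
    e = math.exp(s)
    return e / (1.0 + e)


w_star, b_star = (1.0, -2.0), 0.5  # toy hyperplane, not trained parameters
x = (1.5, 1.0)                     # lies exactly on the hyperplane
print(classify(w_star, b_star, x), confidence(w_star, b_star, x))  # 1 0.5
```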

For the scene VLAD feature representation and SVM discrimination of the CNN scene module, the input features are the three kinds of CNN scene features (SPP, FC6, FC7). The processing flow of this module is essentially the same as that of the semantic VLAD feature representation and SVM discrimination module, and it finally outputs the discrimination confidences $P_p^{(fc6)}$, $P_p^{(fc7)}$ and $P_p^{(spp)}$ corresponding to the three CNN scene features FC6, FC7 and SPP.

For the LSTM discrimination of the LSTM time-sequence module, the input features are two kinds of CNN semantic features (FC6, FC7). The two kinds of features are first input separately into the LSTM discrimination model. The model contains two layers of LSTM units: the first layer has 1024 neurons and the second layer has 512 neurons. The structure of each LSTM neuron is shown in Fig. 3. The forward propagation of an LSTM neural unit can be expressed as:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \tag{7}$$

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \tag{8}$$

$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \tag{9}$$

$$c_t = f_t * c_{t-1} + i_t * \phi(W_c x_t + U_c h_{t-1} + b_c) \tag{10}$$

$$h_t = o_t * \phi(c_t) \tag{11}$$

where the two nonlinear activation functions are $\sigma(x_t) = \frac{1}{1 + e^{-x_t}}$ and $\phi(x_t) = \tanh(x_t)$. $i_t$, $f_t$, $o_t$ and $c_t$ denote the state quantities at time t of the input gate, the memory (forget) gate, the output gate and the core gate (memory cell), respectively. For each gate, $W_i$, $W_f$, $W_o$ and $W_c$ denote the weight transfer matrices of the input gate, memory gate, output gate and core gate, respectively; $U_i$, $U_f$, $U_o$ and $U_c$ denote the weight transfer matrices applied to the hidden-layer variable $h_{t-1}$ at time t-1 for the input gate, memory gate, output gate and core gate, respectively; and $b_i$, $b_f$, $b_o$, $b_c$ denote the bias vectors of the input gate, memory gate, output gate and core gate.

First, the input feature $x_t$ at time t and the hidden-layer variable $h_{t-1}$ at time t-1, under the joint action of the weight transfer matrices W and U and the bias vectors b, generate the state quantities $i_t$, $f_t$ and $o_t$ at time t; see formulas (7) to (9). Then, with the help of the core gate state quantity $c_{t-1}$ at time t-1, the core gate state quantity $c_t$ at time t is generated; see formula (10). Finally, under the action of the core gate state quantity $c_t$ and the output gate state quantity $o_t$ at time t, the hidden-layer variable $h_t$ at time t is generated, which in turn affects the internal change of the LSTM neuron at time t+1; see formula (11).
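One forward step of formulas (7) to (11) can be traced with the plain-Python sketch below. It is an illustration only: the helper names, the dictionary layout of the parameters and the toy zero-valued weights are assumptions, and a real model with 1024 and 512 units per layer would use an array library rather than nested lists.

```python
import math
from typing import Dict, List, Tuple

Vec = List[float]
Gate = Dict[str, list]  # {"W": matrix, "U": matrix, "b": vector}


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def pre_activation(gate: Gate, x_t: Vec, h_prev: Vec) -> Vec:
    """Compute W x_t + U h_{t-1} + b for one gate, row by row."""
    return [
        sum(w * x for w, x in zip(w_row, x_t))
        + sum(u * h for u, h in zip(u_row, h_prev))
        + b
        for w_row, u_row, b in zip(gate["W"], gate["U"], gate["b"])
    ]


def lstm_step(x_t: Vec, h_prev: Vec, c_prev: Vec,
              params: Dict[str, Gate]) -> Tuple[Vec, Vec]:
    """Return (h_t, c_t) for one time step."""
    i_t = [sigmoid(z) for z in pre_activation(params["i"], x_t, h_prev)]  # (7)
    f_t = [sigmoid(z) for z in pre_activation(params["f"], x_t, h_prev)]  # (8)
    o_t = [sigmoid(z) for z in pre_activation(params["o"], x_t, h_prev)]  # (9)
    cand = [math.tanh(z) for z in pre_activation(params["c"], x_t, h_prev)]
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, cand)]   # (10)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                    # (11)
    return h_t, c_t


# Toy cell: 2-D input, 1 hidden unit, all weights and biases zero,
# so every gate is 0.5 and the candidate value is tanh(0) = 0.
zero_gate = {"W": [[0.0, 0.0]], "U": [[0.0]], "b": [0.0]}
params = {name: zero_gate for name in ("i", "f", "o", "c")}
h_t, c_t = lstm_step([0.3, -0.7], [0.0], [2.0], params)
print(c_t)  # [1.0]
```

Processing a video then means calling `lstm_step` once per key frame, carrying `h_t` and `c_t` forward from one frame to the next.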

The output of the second-layer LSTM neurons is connected to a fully connected classifier, which finally outputs the temporal discrimination confidences $P_t^{(fc6)}$ and $P_t^{(fc7)}$ corresponding to the two CNN semantic features FC6 and FC7.

Step 3: fuse the results.

To ensure fusion efficiency, a hierarchical fusion (Hierarchical Fusion) strategy based on validation-set mAP values is adopted for result fusion: for a video to be identified, the decision scores of the three modules (CNN semantic module, CNN scene module and LSTM time-sequence module) are computed separately and then fused by weighting, with each module's mAP value on the validation set as its weight. In practice, score fusion is first performed within the CNN semantic module, the CNN scene module and the LSTM time-sequence module separately, followed by global score fusion:

$$P_s = \frac{\omega_s^{(fc6)} P_s^{(fc6)} + \omega_s^{(fc7)} P_s^{(fc7)} + \omega_s^{(spp)} P_s^{(spp)}}{\omega_s^{(fc6)} + \omega_s^{(fc7)} + \omega_s^{(spp)}} \tag{12}$$

$$P_p = \frac{\omega_p^{(fc6)} P_p^{(fc6)} + \omega_p^{(fc7)} P_p^{(fc7)} + \omega_p^{(spp)} P_p^{(spp)}}{\omega_p^{(fc6)} + \omega_p^{(fc7)} + \omega_p^{(spp)}} \tag{13}$$

$$P_t = \frac{\omega_t^{(fc6)} P_t^{(fc6)} + \omega_t^{(fc7)} P_t^{(fc7)}}{\omega_t^{(fc6)} + \omega_t^{(fc7)}} \tag{14}$$

$$P_o = \frac{\omega_s P_s + \omega_p P_p + \omega_t P_t}{\omega_s + \omega_p + \omega_t} \tag{15}$$

where $P_s$, $P_p$ and $P_t$ denote the decision scores of the CNN semantic module, the CNN scene module and the LSTM time-sequence module, respectively; $\omega_s$, $\omega_p$ and $\omega_t$ are the validation-set mAP values of the CNN semantic module, the CNN scene module and the LSTM time-sequence module, respectively; $P_s^{(fc6)}$, $P_s^{(fc7)}$ and $P_s^{(spp)}$ are the decision scores of the FC6, FC7 and SPP features in the CNN semantic module; $\omega_s^{(fc6)}$, $\omega_s^{(fc7)}$ and $\omega_s^{(spp)}$ are the validation-set mAP values of the FC6, FC7 and SPP features in the CNN semantic module; $P_p^{(fc6)}$, $P_p^{(fc7)}$ and $P_p^{(spp)}$ are the decision scores of the FC6, FC7 and SPP features in the CNN scene module; $\omega_p^{(fc6)}$, $\omega_p^{(fc7)}$ and $\omega_p^{(spp)}$ are the validation-set mAP values of the FC6, FC7 and SPP features in the CNN scene module; $P_t^{(fc6)}$ and $P_t^{(fc7)}$ are the decision scores of the FC6 and FC7 features in the LSTM time-sequence module; and $\omega_t^{(fc6)}$ and $\omega_t^{(fc7)}$ are the validation-set mAP values of the FC6 and FC7 features in the LSTM time-sequence module. The final riot and terror video detection result (confidence) $P_o$ is obtained by weighting the three modules by their mAP values; see formula (15).
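The two-level weighting of formulas (12) to (15) can be sketched as follows. The function names are hypothetical, and every score and mAP value in the example is an invented placeholder chosen only to show the data layout; none of them is a result reported in the source.

```python
from typing import Dict


def weighted_fusion(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of confidences, with weights looked up by the same keys."""
    total = sum(weights[k] for k in scores)
    return sum(weights[k] * scores[k] for k in scores) / total


def hierarchical_fusion(feature_scores: Dict[str, Dict[str, float]],
                        feature_maps: Dict[str, Dict[str, float]],
                        module_maps: Dict[str, float]) -> float:
    """Formulas (12)-(14) inside each module, then formula (15) across modules."""
    module_scores = {
        m: weighted_fusion(feature_scores[m], feature_maps[m])
        for m in feature_scores
    }
    return weighted_fusion(module_scores, module_maps)


# Placeholder inputs: "s" = semantic, "p" = scene, "t" = time-sequence module.
feature_scores = {
    "s": {"fc6": 0.91, "fc7": 0.88, "spp": 0.80},
    "p": {"fc6": 0.75, "fc7": 0.70, "spp": 0.65},
    "t": {"fc6": 0.85, "fc7": 0.90},
}
feature_maps = {
    "s": {"fc6": 0.95, "fc7": 0.94, "spp": 0.92},
    "p": {"fc6": 0.90, "fc7": 0.89, "spp": 0.88},
    "t": {"fc6": 0.93, "fc7": 0.94},
}
module_maps = {"s": 0.96, "p": 0.91, "t": 0.95}
p_o = hierarchical_fusion(feature_scores, feature_maps, module_maps)
print(f"P_o = {p_o:.3f}")
```

Because every level is a normalized weighted average, $P_o$ always lies between the smallest and largest individual confidence.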

Claims

1. A method for detecting riot and terror videos based on CNN and LSTM, characterized in that:

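the method specifically comprises the following steps:

Step 1: perform key-frame sampling on the video to be detected and extract key-frame features;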

Step 2: using the extracted key-frame features, perform video-level representation and discrimination, including the semantic VLAD feature representation and SVM discrimination of the CNN semantic module, the scene VLAD feature representation and SVM discrimination of the CNN scene module, and the LSTM discrimination of the LSTM time-sequence module;

Step 3: fuse the results: a hierarchical fusion strategy based on validation-set mAP values is used, i.e. for a video to be identified, the decision scores of the CNN semantic module, the CNN scene module and the LSTM time-sequence module are computed separately and then fused by weighting, with each module's mAP value on the validation set as its weight.

2. The method for detecting riot and terror videos based on CNN and LSTM according to claim 1, characterized in that: in Step 1, the key-frame sampling interval is 1 second, and the key-frame features comprise CNN semantic features and CNN scene features, each of which comprises three kinds of features: the FC6 feature, the FC7 feature and the SPP feature.

3. The method for detecting riot and terror videos based on CNN and LSTM according to claim 1 or 2, characterized in that: the SPP features are extracted from the Conv5 layer; the Conv5 feature map is first divided into spatial regions on 1 × 1, 2 × 2 and 3 × 3 grids, and max pooling within each region then yields 14 vectors of 256 dimensions; each dimension of each vector corresponds, explicitly or implicitly, to a particular semantic concept; these vectors are the SPP features.

4. The method for detecting riot and terror videos based on CNN and LSTM according to claim 1, characterized in that: for the semantic VLAD feature representation and SVM discrimination of the CNN semantic module in Step 2, the input features are the three kinds of CNN semantic features SPP, FC6 and FC7; first, principal component analysis reduces the three features to 128, 256 and 256 dimensions, respectively; next, the VLAD method is applied: the feature vectors obtained after dimensionality reduction are projected, by difference accumulation, onto a set of cluster centers $C = \{c_1, c_2, \ldots, c_K\}$ obtained in advance by K-means clustering; let $V = \{v_1, v_2, \ldots, v_N\}$ denote a set of N dimension-reduced feature vectors; the accumulated difference vector $\mathrm{diff}_k$ associated with cluster center $c_k$ is then expressed as:

$$\mathrm{diff}_k = \sum_{i \,:\, NN(v_i) = c_k} (v_i - c_k) \tag{1}$$

where i = 1, 2, ..., N and k = 1, 2, ..., K; $NN(v_i)$ denotes the nearest neighbor, in Euclidean distance, of the dimension-reduced feature vector $v_i$ in the cluster center set C; each accumulated difference vector $\mathrm{diff}_j$ (1 ≤ j ≤ K) is $\ell_2$-normalized, and the K accumulated difference vectors are then concatenated to give the final K × D-dimensional VLAD feature representation; here the number of cluster centers K is set to 256, so the VLAD feature representations corresponding to SPP, FC6 and FC7 have 32,768, 65,536 and 65,536 dimensions, respectively;

finally, a linear SVM classifier is trained to judge the terror-related confidence of the video.

5. The method for detecting riot and terror videos based on CNN and LSTM according to claim 4, characterized in that: training the linear SVM classifier to judge the terror-related confidence of the video is specifically as follows: let the sample set formed by the video VLAD feature representations be $X = \{x_1, x_2, \ldots, x_N\}$ and the corresponding set of video classes be $Y = \{y_1, y_2, \ldots, y_N\}$, where $y_i \in \{+1, -1\}$; maximizing the geometric margin is converted into solving a convex quadratic optimization problem, and the learned separating hyperplane is:

$$w \cdot x + b = 0 \tag{2}$$

where w and b are the slope and the bias of the separating hyperplane, respectively; maximizing the geometric margin of the separating hyperplane is expressed as an optimization problem with inequality constraints:

$$\max_{w,\, b} \ \gamma \tag{3}$$

$$\text{s.t.} \quad y_i \left( \frac{w}{\|w\|} \cdot x_i + \frac{b}{\|w\|} \right) \ge \gamma, \quad i = 1, 2, \ldots, N \tag{4}$$

where γ denotes the geometric distance from sample point $x_i$ to the separating hyperplane; this problem is optimized through its min-max Lagrangian dual problem and solved with the sequential minimal optimization algorithm; solving yields the parameters $w^*$ and $b^*$ of the optimal separating hyperplane, and the riot and terror video classification decision function is then expressed as:

$$f(x) = \operatorname{sign}(w^* \cdot x + b^*) \tag{5}$$

where sign(x) denotes the sign function; the confidence that the current VLAD feature representation is identified as riot and terror is:

$$P(y = +1) = \frac{1}{1 + e^{-(w^* \cdot x + b^*)}} \tag{6}$$

the VLAD feature representations of SPP, FC6 and FC7 are each passed through a linear SVM classifier, which finally outputs the discrimination confidences $P_s^{(fc6)}$, $P_s^{(fc7)}$ and $P_s^{(spp)}$ corresponding to the three CNN semantic features FC6, FC7 and SPP.

6. The method for detecting riot and terror videos based on CNN and LSTM according to claim 1, characterized in that: for the LSTM discrimination of the LSTM time-sequence module in Step 2, the input features are the two kinds of CNN semantic features FC6 and FC7; the two kinds of features are first input separately into the LSTM discrimination model, which contains two layers of LSTM units, the first layer having 1024 neurons and the second layer having 512 neurons; the forward propagation of each LSTM neural unit is expressed as:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \tag{7}$$

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \tag{8}$$

$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \tag{9}$$

$$c_t = f_t * c_{t-1} + i_t * \phi(W_c x_t + U_c h_{t-1} + b_c) \tag{10}$$

$$h_t = o_t * \phi(c_t) \tag{11}$$

where the two nonlinear activation functions are $\sigma(x_t) = \frac{1}{1 + e^{-x_t}}$ and $\phi(x_t) = \tanh(x_t)$; $i_t$, $f_t$, $o_t$ and $c_t$ denote the state quantities at time t of the input gate, the memory gate, the output gate and the core gate, respectively; for each gate, $W_i$, $W_f$, $W_o$ and $W_c$ denote the weight transfer matrices of the input gate, memory gate, output gate and core gate, respectively; $U_i$, $U_f$, $U_o$ and $U_c$ denote the weight transfer matrices applied to the hidden-layer variable $h_{t-1}$ at time t-1 for the input gate, memory gate, output gate and core gate, respectively; and $b_i$, $b_f$, $b_o$, $b_c$ denote the bias vectors of the input gate, memory gate, output gate and core gate;

the output of the second-layer LSTM neurons is connected to a fully connected classifier, which finally outputs the temporal discrimination confidences $P_t^{(fc6)}$ and $P_t^{(fc7)}$ corresponding to the two CNN semantic features FC6 and FC7.

7. The method for detecting riot and terror videos based on CNN and LSTM according to claim 1, characterized in that: in the result fusion of Step 3, score fusion is first performed within the CNN semantic module, the CNN scene module and the LSTM time-sequence module separately, followed by global score fusion:

$$P_s = \frac{\omega_s^{(fc6)} P_s^{(fc6)} + \omega_s^{(fc7)} P_s^{(fc7)} + \omega_s^{(spp)} P_s^{(spp)}}{\omega_s^{(fc6)} + \omega_s^{(fc7)} + \omega_s^{(spp)}} \tag{12}$$

$$P_p = \frac{\omega_p^{(fc6)} P_p^{(fc6)} + \omega_p^{(fc7)} P_p^{(fc7)} + \omega_p^{(spp)} P_p^{(spp)}}{\omega_p^{(fc6)} + \omega_p^{(fc7)} + \omega_p^{(spp)}} \tag{13}$$

$$P_t = \frac{\omega_t^{(fc6)} P_t^{(fc6)} + \omega_t^{(fc7)} P_t^{(fc7)}}{\omega_t^{(fc6)} + \omega_t^{(fc7)}} \tag{14}$$

$$P_o = \frac{\omega_s P_s + \omega_p P_p + \omega_t P_t}{\omega_s + \omega_p + \omega_t} \tag{15}$$

where $P_s$, $P_p$ and $P_t$ denote the decision scores of the CNN semantic module, the CNN scene module and the LSTM time-sequence module, respectively; $\omega_s$, $\omega_p$ and $\omega_t$ are the validation-set mAP values of the CNN semantic module, the CNN scene module and the LSTM time-sequence module, respectively; $P_s^{(fc6)}$, $P_s^{(fc7)}$ and $P_s^{(spp)}$ are the decision scores of the FC6, FC7 and SPP features in the CNN semantic module; $\omega_s^{(fc6)}$, $\omega_s^{(fc7)}$ and $\omega_s^{(spp)}$ are the validation-set mAP values of the FC6, FC7 and SPP features in the CNN semantic module; $P_p^{(fc6)}$, $P_p^{(fc7)}$ and $P_p^{(spp)}$ are the decision scores of the FC6, FC7 and SPP features in the CNN scene module; $\omega_p^{(fc6)}$, $\omega_p^{(fc7)}$ and $\omega_p^{(spp)}$ are the validation-set mAP values of the FC6, FC7 and SPP features in the CNN scene module; $P_t^{(fc6)}$ and $P_t^{(fc7)}$ are the decision scores of the FC6 and FC7 features in the LSTM time-sequence module; $\omega_t^{(fc6)}$ and $\omega_t^{(fc7)}$ are the validation-set mAP values of the FC6 and FC7 features in the LSTM time-sequence module; the final riot and terror video detection result $P_o$ is obtained by weighting the three modules by their mAP values.