Background technology
Human motion analysis is one of the most active research topics in computer vision. Its core task is to use computer vision techniques to detect, track, and identify people in image sequences, and to understand and describe their behavior. Human action detection is the core technology of human motion analysis: it detects the human body in the field of view and extracts parameters that reflect human actions, with the goal of understanding those actions. It has broad application prospects and great economic and social value in intelligent surveillance, smart appliances, human-computer interaction, content-based video retrieval, image compression, and related fields. In practical applications, human action detection is hampered by unfavorable factors such as illumination changes, occlusion, complex scenes, viewpoint changes, and especially individual differences in expression, posture, motion style, and clothing. See: Aggarwal J K, Ryoo M S. Human activity analysis: A review[J]. ACM Computing Surveys (CSUR), 2011, 43(3): 16; Research on human action detection and recognition methods based on computer vision[D]. Doctoral dissertation, South China University of Technology, 2010.
The main approaches to human action detection are sequential methods, space-time volume methods, and methods based on the bag-of-features (BOF) model. Human action detection comprises two main steps: action representation and action detection. Action representation encodes the human action information; its main methods fall into global feature representations, local feature representations, and representations based on human body models. Action detection methods mainly include direct classification, template matching, and three-dimensional branch-and-bound. See: Gaidon A, Harchaoui Z, Schmid C. Actom sequence models for efficient action detection[C]// Computer Vision and Pattern Recognition, IEEE Computer Society Conference on. IEEE, 2011: 3201-3208.
Global feature representation encodes the observed human action as a whole. It can be regarded as a top-down approach: the human body is first localized, a region of interest is defined by the body's bounding rectangle, and the global information in this region is then encoded to represent the action. Common global features include silhouettes, optical flow, and space-time shapes. Global features make full use of body shape and motion information; during detection they are often used as templates and compared for similarity, via sequential or space-time volume methods, against the global features extracted from the video sequence, with the most similar match taken as the detection result. Their drawback is an excessive dependence on accurate localization, background subtraction, and tracking, together with sensitivity to viewpoint changes, noise, and occlusion. See: Yilmaz A, Shah M. Actions sketch: A novel action representation[C]// Computer Vision and Pattern Recognition, IEEE Computer Society Conference on. IEEE, 2005, 1: 984-989.
Local feature representation expresses a human action as a set of independent image patches or image cuboids. It can be regarded as a bottom-up approach: an interest-point detector first finds spatio-temporal interest points; two-dimensional image patches or three-dimensional cuboids are then extracted around these points and described with local feature descriptors; finally, the information extracted from the patches or cuboids is aggregated into a representation of the human action. Common spatio-temporal interest-point detectors include Harris3D, Cuboids, and Hessian; common local descriptors computed around the interest points include HOG/HOF, HOG3D, and extended SURF. Compared with global features, local features offer good invariance to rotation, translation, and scaling, and can effectively reduce the influence of unfavorable factors such as complex backgrounds, body posture, viewpoint changes, and occlusion. Their drawbacks are the reliance on large numbers of spatio-temporal interest points and, in some cases, the need for preprocessing to compensate for errors caused by camera motion. See: Wang L M, Qiao Y, Tang X. Motionlets: Mid-Level 3D Parts for Human Motion Recognition[C]// Computer Vision and Pattern Recognition, IEEE Computer Society Conference on. IEEE, 2013.
The main idea behind representations based on human body models is that the body is supported by a skeleton, which can be regarded as a kinematic system formed by linked body parts, and the operation of this system produces the various human behaviors. Body-model methods attempt to build an information-rich action representation by learning the configuration of each body part in the video. Their latent defect is an excessive dependence on target and motion detection algorithms, which makes them unsuitable for video captured in natural scenes. Some work substitutes spatio-temporal patches for body parts, but the criteria for selecting such patches, and how many patches are needed to capture all possible variations of a human action, remain unsolved. See: Tian Y, Sukthankar R, Shah M. Spatiotemporal Deformable Part Models for Action Detection[C]// Computer Vision and Pattern Recognition, IEEE Computer Society Conference on. IEEE, 2013.
Summary of the invention
The technical problem to be solved by this invention is to provide a human action detection method with lower reconstruction error and stronger discriminative performance.
The technical scheme adopted by the present invention to solve the above technical problem is a human action detection method based on action dictionary learning, comprising the following steps:
Step 1) Collect training samples, convert the color images in the samples into grayscale images, and unify the spatial resolution and duration of the video fragments;
Step 2) Compute the local trinary pattern (LTP) feature of each video segment, obtaining a high-dimensional feature vector y0 ∈ R^n, where R^n denotes the n-dimensional feature space, n is the total dimension of the feature vector, and (·)^T denotes transposition;

Step 3) Reduce dimensionality by left-multiplying the LTP feature of each segment by a random measurement matrix A, i.e. y = A·y0, which reduces the feature from n dimensions to m dimensions; the reduced features form the feature matrix Y. Here A ∈ R^{m×n}, where R^{m×n} denotes the linear space of m×n matrices, and each element a_ij of the random measurement matrix A obeys a Gaussian distribution with mean 0 and variance 1. Furthermore m << n, where << denotes "much smaller than", meaning m is at least an order of magnitude smaller than n;
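As an illustration of Step 3, the random-projection dimensionality reduction can be sketched as follows (a minimal NumPy sketch; the dimensions n = 13824 and m = 1500 from the embodiment are used only as example values):

```python
import numpy as np

def random_projection(y0, m, seed=0):
    """Left-multiply an n-dimensional feature vector by a random
    measurement matrix A (entries a_ij ~ N(0, 1)), i.e. y = A @ y0."""
    n = y0.shape[0]
    rng = np.random.default_rng(seed)
    A = rng.normal(0.0, 1.0, size=(m, n))  # m x n Gaussian matrix, m << n
    return A, A @ y0

# toy example with the embodiment's dimensions: 13824 -> 1500
y0 = np.ones(13824)
A, y = random_projection(y0, m=1500)
print(A.shape, y.shape)  # (1500, 13824) (1500,)
```

In practice the same matrix A must be applied to every training and test feature so that all samples live in the same reduced space.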
Step 4) Action dictionary model training:

4-1) The action dictionary D is expressed as:

D = [D_1, D_2, ..., D_M] = [d_1, d_2, ..., d_K]

where the action dictionary D is composed of M sub-dictionaries corresponding to the M classes of human actions, D_k is the sub-dictionary corresponding to the k-th action class, K is the total number of dictionary atoms in D, L = K/M is the number of atoms in each sub-dictionary, and each d_j is a dictionary atom. K >> M, where >> denotes "much larger than", meaning K is at least an order of magnitude larger than M;
Set up the action dictionary learning model:

(D, W, A, X) = arg min_{D,W,A,X} ||Y − D·X||_2^2 + α·||Q − A·X||_2^2 + β·||H − W·X||_2^2,  s.t. ∀i, ||x_i||_0 ≤ T

where arg min denotes taking the parameter values at which the objective function attains its minimum; Y is the feature matrix; D is the action dictionary to be learned; W denotes the classifier parameters; A denotes the linear transformation matrix; X ∈ R^{K×N} is the sparse matrix, whose columns x_i correspond to the sparse codes of the sample features, i = 1, 2, ..., N, with R^{K×N} denoting the K×N-dimensional linear space and N the total number of training samples; α and β are weight coefficients; H is the indicator matrix, each column h_i of which is the indicator vector of the action class of the corresponding sample, H = [h_1, ..., h_N] ∈ R^{M×N}; Q is the discrimination matrix, each column of which is the ideal discriminative vector indicating which action class the corresponding training sample belongs to, Q = [q_1, ..., q_N] ∈ R^{K×N}; ||·||_2 denotes the 2-norm; s.t. denotes the constraint condition; T denotes the sparsity threshold; ∀ denotes "for all"; and ||·||_0 denotes the l_0 norm;
4-2) Solve iteratively using the K-SVD (K-singular value decomposition) algorithm. Stack the known quantities into Y_new = (Y^T, √α·Q^T, √β·H^T)^T and the unknown quantities into the intermediate quantity D' = (D^T, √α·A^T, √β·W^T)^T. After a finite number of iterations the intermediate quantity D' is obtained; substituting D' back into the action dictionary learning model yields the final optimized action dictionary D, linear transformation matrix A, and classifier parameters W;
The initial values for the K-SVD iteration are determined as follows:

Randomly draw samples from each of the M action classes and use the K-SVD algorithm to obtain an initial dictionary for each of the M classes, thereby constructing the initial value D_0 of the action dictionary;

Determine the discrimination matrix Q and the indicator matrix H from the label of each dictionary atom and the class labels of the training samples; then use the orthogonal matching pursuit (OMP) algorithm to obtain the initial sparse matrix X of the training samples;

The initial value of the linear transformation matrix is A_0 = (X·X^T + λ_2·I)^(-1)·X·Q^T;

The initial value of the classifier parameters is W_0 = (X·X^T + λ_1·I)^(-1)·X·H^T;
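The ridge-regression initial values A_0 and W_0 above can be computed directly. A minimal NumPy sketch follows; the matrix shapes and the regularizer values λ_1, λ_2 are illustrative assumptions, not values fixed by the text:

```python
import numpy as np

def init_transforms(X, Q, H, lam1=1e-3, lam2=1e-3):
    """Initial values as given in the text:
    A0 = (X X^T + lam2 I)^(-1) X Q^T,  W0 = (X X^T + lam1 I)^(-1) X H^T."""
    K = X.shape[0]
    G = X @ X.T                                    # K x K Gram matrix
    A0 = np.linalg.solve(G + lam2 * np.eye(K), X @ Q.T)
    W0 = np.linalg.solve(G + lam1 * np.eye(K), X @ H.T)
    return A0, W0

# toy shapes: K = 6 atoms, N = 8 samples, M = 3 classes
X = np.random.rand(6, 8)   # sparse matrix (dense here only for illustration)
Q = np.random.rand(6, 8)   # discrimination matrix
H = np.random.rand(3, 8)   # indicator matrix
A0, W0 = init_transforms(X, Q, H)
print(A0.shape, W0.shape)  # (6, 6) (6, 3)
```

Using `np.linalg.solve` rather than an explicit matrix inverse is a standard numerical choice for expressions of the form (G + λI)^(-1)·B.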
Step 5) Human action detection:

Slide a spatio-temporal sliding window over the video sequence under test. For the sparse code corresponding to the images inside the window, sum the responses on the dictionary atoms of each sub-dictionary in the action dictionary, and judge whether the highest sub-dictionary response is greater than or equal to a threshold. If so, take the action class corresponding to the highest, above-threshold response as the current detection result; otherwise, judge that no human action is currently present.
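The per-window decision rule of Step 5 can be sketched as follows. This is a minimal sketch: the sparse code x is assumed to have already been computed (e.g. by OMP) under the trained dictionary, and the threshold value is illustrative:

```python
import numpy as np

def classify_window(x, atom_labels, M, threshold):
    """Sum the sparse-code responses over each sub-dictionary's atoms;
    return the winning action class, or -1 if no class clears the threshold."""
    responses = np.array([np.abs(x[atom_labels == k]).sum() for k in range(M)])
    best = int(responses.argmax())
    return best if responses[best] >= threshold else -1

# toy example: K = 12 atoms, M = 3 classes, 4 atoms per sub-dictionary
atom_labels = np.repeat(np.arange(3), 4)
x = np.zeros(12)
x[[4, 5]] = [0.8, 0.6]   # strong response on class-1 atoms
print(classify_window(x, atom_labels, M=3, threshold=1.0))  # 1
```

A window whose responses are spread evenly over all sub-dictionaries fails the threshold test and is reported as containing no action.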
The invention provides a human action detection method based on action dictionary learning. In the training stage, a local feature representation is used to extract human action features from different video fragments, and a human action dictionary with strong discriminative power is obtained by training, in which different human actions correspond to different dictionary atoms. When modeling the action dictionary, not only the reconstruction error but also additional error terms are considered, making the model more optimal.

In the test stage, given a video, a spatio-temporal sliding window traverses the whole sequence; the sparse code of each window is computed using the trained action dictionary, and the responses of the sparse code on the different dictionary atoms determine whether the window contains a given human action, completing the detection task. The method requires no negative samples to train the action dictionary, its training process is simple and fast, it detects well under illumination changes, occlusion, complex backgrounds, and viewpoint changes, and it can approximately meet real-time requirements.

The beneficial effects of the invention are: there is a clear correspondence between dictionary atoms and human actions, giving good interpretability; the trained action dictionary can express and reconstruct the different classes of human actions well, with small reconstruction error; and the sparse codes obtained from the action dictionary have strong discriminative power for determining whether a spatio-temporal window contains an action of a given class.
Embodiment
For convenience in describing the embodiment, some terms are first defined.
Definition 1: Local Trinary Patterns (LTP). LTP is a local feature representation method that extends Local Binary Patterns (LBP) to the spatio-temporal domain. By encoding a motion image sequence across frames it effectively captures motion information while avoiding any complex optical-flow computation, and it can be regarded as a spatio-temporal local texture description algorithm. Extracting the LTP features of a video fragment containing a human action yields a one-dimensional feature vector. This representation has strong discriminative power and fast computation speed. See: Yeffet L, Wolf L. Local trinary patterns for human action recognition[C]// Computer Vision, International Conference on. IEEE, 2009: 492-497.
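For illustration only, a heavily simplified per-pixel trinary coding in the spirit of LTP can be sketched as below. The actual descriptor of Yeffet and Wolf compares patch SSDs at shifted locations in the previous and next frames; this toy version shows only the trinary (-1/0/+1) encoding idea and is not the descriptor used by the method:

```python
import numpy as np

def trinary_code(prev, curr, nxt, thresh):
    """Label each pixel -1 / 0 / +1 according to which neighbouring
    frame the current frame resembles more (difference beyond thresh)."""
    d_prev = np.abs(curr.astype(float) - prev)   # distance to previous frame
    d_next = np.abs(curr.astype(float) - nxt)    # distance to next frame
    code = np.zeros(curr.shape, dtype=int)
    code[(d_prev - d_next) > thresh] = 1         # closer to the next frame
    code[(d_next - d_prev) > thresh] = -1        # closer to the previous frame
    return code

# toy frames: one pixel brightens only in the next frame
prev = np.zeros((2, 2))
curr = np.zeros((2, 2))
nxt = np.zeros((2, 2))
nxt[0, 0] = 10.0
code = trinary_code(prev, curr, nxt, thresh=5.0)
print(code)
```

The trinary codes over a region are then histogrammed and concatenated to form the fragment's feature vector.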
Definition 2: Random projection. Random projection is a dimensionality reduction technique that uses a random measurement matrix to project a sparse or compressible signal (such as an image or video) from a high-dimensional space into a low-dimensional space. See: Baraniuk R G, Wakin M B. Random projections of smooth manifolds[J]. Foundations of Computational Mathematics, 2009, 9(1): 51-77.
Definition 3: Action dictionary learning. Denote the feature matrix by Y ∈ R^{m×N}; each column of Y is an m-dimensional LTP feature, denoted y_j ∈ R^m. Let D ∈ R^{m×K} (K > m) be an overcomplete dictionary, namely the action dictionary, composed of a set of normalized basis vectors (dictionary atoms). The action dictionary learning problem is, in the training stage, to design the overcomplete dictionary D so that every y_j is well reconstructed by a linear combination of only a few dictionary atoms:

min_{D,X} ||Y − D·X||_2^2  s.t. ∀i, ||x_i||_0 ≤ T

where ||·||_0 is the l_0 norm, which counts the non-zero entries of the sparse vector x_i, and T is the sparsity threshold that x_i must satisfy.
Definition 4: K-SVD algorithm. K-SVD is a classical iterative method for solving overcomplete dictionaries; it iterates quickly and yields small signal reconstruction error. See: Aharon M, Elad M, Bruckstein A. K-SVD: Design of dictionaries for sparse representation[J]. Proceedings of SPARS, 2005, 5: 9-12.
Definition 5: Sparse coding. With the action dictionary D fixed, solve for the sparse vector x_test corresponding to a test sample y_test such that y_test ≈ D·x_test holds; x_test is called the sparse code of y_test under the dictionary D:

x_test = arg min_x ||y_test − D·x||_2^2  s.t. ||x||_0 ≤ T
Definition 6: OMP algorithm. OMP (orthogonal matching pursuit) is one of the typical methods for solving the sparse coding problem; it has low computational complexity and fast convergence, and it estimates the globally optimal solution well. See: Tropp J A. Greed is good: Algorithmic results for sparse approximation[J]. Information Theory, IEEE Transactions on, 2004, 50(10): 2231-2242.
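A minimal sketch of OMP under the notation above (D with l2-normalized columns, sparsity threshold T); this is an illustrative implementation, not necessarily the one used in the embodiment:

```python
import numpy as np

def omp(D, y, T):
    """Orthogonal Matching Pursuit: greedy sparse coding y ~ D x with
    at most T non-zero coefficients (atoms of D assumed l2-normalised)."""
    K = D.shape[1]
    x = np.zeros(K)
    residual = y.copy()
    support = []
    for _ in range(T):
        k = int(np.abs(D.T @ residual).argmax())   # best-correlated atom
        support.append(k)
        # re-fit all selected atoms jointly (the "orthogonal" step)
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coeffs
    x[support] = coeffs
    return x

# toy example: y is an exact 2-sparse combination of identity atoms
D = np.eye(4)
y = np.array([3.0, 0.0, 0.0, 2.0])
x = omp(D, y, T=2)
print(x)  # [3. 0. 0. 2.]
```

The joint least-squares re-fit at each step is what distinguishes OMP from plain matching pursuit and gives its faster convergence.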
Definition 7: Spatio-temporal sliding window. In target detection, a sliding window is a fixed-size rectangular frame that traverses an image in order to localize targets. A spatio-temporal sliding window generalizes the sliding window from two to three dimensions: to detect human actions in continuous video, a fixed-size spatio-temporal window must traverse the video sequence in order to localize the actions.
Step 1: Collect training samples. The positive training samples come from video fragments taken from the Internet and from TV programs. Only single human actions are considered in the selected samples; situations with multiple human actions are not involved. Influencing factors such as illumination changes, complex scenes, viewpoint changes, and individual differences are taken into account.

Step 2: Image preprocessing. Preprocessing comprises two main steps: converting color images into grayscale images, and unifying the spatial resolution and duration of the video fragments.
Step 3: Compute the LTP feature of each video segment, obtaining a high-dimensional feature vector y0 ∈ R^n.

Step 4: Feature dimensionality reduction. Using the random projection method, left-multiply the LTP feature by a random measurement matrix A ∈ R^{m×n}, i.e. y = A·y0, reducing it from n dimensions to m dimensions (m << n), where each element a_ij of the random measurement matrix obeys a Gaussian distribution with mean 0 and variance 1, i.e. a_ij ~ N(0, 1). The reduced features form the feature matrix Y.
Step 5: Establishing and solving the action dictionary model.

Step 5-1: Establishing the action dictionary model. The action dictionary has the form:

D = [D_1, D_2, ..., D_M] = [d_1, d_2, ..., d_K]

where the action dictionary D is composed of M sub-dictionaries corresponding to the M action classes, D_k is the sub-dictionary corresponding to the k-th action class, L = K/M is the number of dictionary atoms in each sub-dictionary, and each d_j is a dictionary atom. The correspondence between human actions, sub-dictionaries, and dictionary atoms is shown in the figure. Action dictionary learning is modeled as the optimization problem:

(D, W, A, X) = arg min_{D,W,A,X} ||Y − D·X||_2^2 + α·||Q − A·X||_2^2 + β·||H − W·X||_2^2,  s.t. ∀i, ||x_i||_0 ≤ T

where the first term of the objective function is the reconstruction error, the second term is the sparse-code discrimination error, and the third term is the classification error. D is the action dictionary to be learned; the columns of the sparse matrix X ∈ R^{K×N} correspond to the sparse codes of the sample features; W denotes the classifier parameters; α and β are scalars giving the weights of the second and third terms in the objective; each column of the matrix H = [h_1, ..., h_N] ∈ R^{M×N} is the indicator vector h_i = [0, ..., 0, 1, 0, ..., 0]^T of the corresponding action class; Q = [q_1, ..., q_N] ∈ R^{K×N} is the discrimination matrix of the training samples' sparse codes: if the i-th training sample belongs to the k-th action class, its discriminative vector q_i has ones on the entries of the k-th sub-dictionary's atoms and zeros elsewhere. The linear transformation matrix A transforms the sparse matrix X toward the discrimination matrix Q.
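The indicator matrix H and discrimination matrix Q can be built directly from the sample class labels. A minimal sketch follows; the tiny values of M, L, and the label list are illustrative only:

```python
import numpy as np

def build_label_matrices(labels, M, L):
    """Build the class-indicator matrix H (M x N) and the discriminative
    sparse-code matrix Q (K x N, K = M*L): column q_i has ones on the L
    atoms of the sub-dictionary matching sample i's class, zeros elsewhere."""
    N = len(labels)
    H = np.zeros((M, N))
    Q = np.zeros((M * L, N))
    for i, c in enumerate(labels):
        H[c, i] = 1.0                     # one-hot class indicator h_i
        Q[c * L:(c + 1) * L, i] = 1.0     # ideal discriminative code q_i
    return H, Q

# toy example: 3 classes, 2 atoms per sub-dictionary, 4 samples
H, Q = build_label_matrices([0, 2, 1, 0], M=3, L=2)
print(H.shape, Q.shape)  # (3, 4) (6, 4)
```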
Step 5-2: Solving the action dictionary model. The above formula cannot be solved directly, so it is transformed into the following optimization problem:

min_{D',X} ||Y_new − D'·X||_2^2  s.t. ∀i, ||x_i||_0 ≤ T

where Y_new = (Y^T, √α·Q^T, √β·H^T)^T is a known quantity and D' = (D^T, √α·A^T, √β·W^T)^T contains the unknown parameters to be trained. The K-SVD algorithm can solve this optimization problem and obtain the globally optimal solution of all parameters. Because K-SVD is an iterative algorithm, the iteration initial values D_0, A_0, W_0 must be determined. The concrete method is as follows: randomly draw samples from each of the M action classes and use the K-SVD algorithm to obtain an initial dictionary for each class, thereby constructing the initial dictionary D_0; from the label of each dictionary atom and the class labels of the training samples, the discrimination matrix Q can be determined; then use the OMP algorithm to obtain the initial sparse matrix X of the training samples; the initial values are A_0 = (X·X^T + λ_2·I)^(-1)·X·Q^T and W_0 = (X·X^T + λ_1·I)^(-1)·X·H^T. After a finite number of iterations the dictionary D' is obtained, from which the optimized parameters D, A, W of the optimization problem in Step 5-1 follow.
Step 6: human action detects.Given one section of video, space-time sliding window slides in video sequence, if do not comprise the human action class of training set in window, the sparse coding of this window LTP feature likely responds at all dictionary Xiang Shangjun; If comprise certain anthropoid action in moving window, the sparse coding of this window LTP feature can have stronger response on dictionary item corresponding to such human action, and a little less than response on dictionary item corresponding to other human action.In the time detecting, the response sum of the sparse coding of adding up respectively space-time sliding window on dictionary item corresponding to all kinds of human actions, if the response on sub-dictionary corresponding to certain anthropoid action is maximum and be greater than a certain threshold value, be judged to be such human action occurs, thereby complete human action Detection task.
To verify the effect of the invention, simulations were carried out using Matlab and C/C++ on the hardware platform Intel Core2 E7400 + 4 GB DDR RAM, with Matlab 2012a and Visual Studio 2010 as the software platforms. The concrete implementation steps and parameter settings are as follows:

Step 1: Collect training samples. Considering influencing factors such as illumination changes, complex scenes, viewpoint changes, and individual differences, the training samples cover 6 different human actions: running, walking, clapping, jumping, standing up, and sitting down. A total of 300 video fragments were captured, with durations ranging from 5 s to 20 s and 50 fragments per action class.

Step 2: Image preprocessing. The duration of each short video is fixed at 400 ms, comprising 10 frames in total; the 300 collected video fragments are segmented into 3000 short videos. Each frame's color image is converted to grayscale and uniformly scaled to a spatial resolution of 320 pixels × 240 pixels, so that the data size of each short video is 320 × 240 × 10.

Step 3: Compute the LTP feature of each short video. The concrete parameter settings are as follows: each frame is divided into 3 × 3 regions; the LTP features of frames 3, 5, and 7 are computed from frames 1, 3, 5, 7, and 9 of the short video; the LTP threshold is chosen as 800. This generates a 13824 × 1 feature vector, and the resulting feature matrix before dimensionality reduction has size 13824 × 3000.

Step 4: Feature dimensionality reduction. Using the random projection algorithm, left-multiply by a 1500 × 13824 random measurement matrix, reducing the feature matrix to size 1500 × 3000; this matrix is the feature matrix Y.

Step 5: Establish and solve the action dictionary model. The weight coefficients of the sparse-code discrimination error and the classification error are α = 0.3 and β = 0.1 respectively, and the sparsity threshold is T = 10. The trained action dictionary has size 1500 × 3300, where the training samples of the 6 action classes correspond to 6 sub-dictionaries, i.e. M = 6; each sub-dictionary contains 550 dictionary atoms, i.e. L = 550.

Step 6: Human action detection. The test videos are two recorded sequences, one indoor and one outdoor, containing the actions of 5 people whose clothing, viewpoints, and scales differ; the action classes comprise the 6 human actions of running, walking, clapping, jumping, standing up, and sitting down, and the total duration is about 18 minutes. The spatio-temporal sliding window size is 320 pixels × 240 pixels × 400 ms. The evaluation adopts the OV20 criterion: a detection is correct if its window overlaps the ground truth by at least 20%, and is otherwise an error. At a recall rate of 90%, the detection precision is 89.2%; the final average detection precision is 86.6%, showing that the method has good detection performance.

Using the method of the invention, the algorithm was first simulated on the Matlab platform and then ported to C/C++. On image sequences with a resolution of 320 pixels × 240 pixels, the processing speed is 7 frames/s on the Matlab platform and 15 frames/s on the C/C++ platform, which approximately meets the real-time requirement.