CN102136066A - Method for recognizing human motion in video sequence - Google Patents
- Publication number
- CN102136066A (application CN 201110109440 / CN201110109440A)
- Authority
- CN
- China
- Prior art keywords
- histogram
- video sequence
- pixel
- value
- obtains
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Image Analysis (AREA)
Abstract
The invention discloses a method for recognizing human motion in a video sequence, aimed at the conflict between accuracy and real-time performance in existing methods for recognizing human motion in video images. The method comprises a feature extraction process and a feature training and recognition process. In feature extraction, a difference edge histogram of the video sequence is calculated, which greatly reduces the number of video features used, increases recognition speed, and satisfies the real-time requirement of human motion recognition; a pixel change histogram and an edge gradient histogram are calculated separately for a target region and several sub-regions, which improves the accuracy of recognizing motion details. The method thus improves recognition accuracy while meeting the real-time requirement.
Description
Technical field
The invention belongs to the technical field of computer vision, and in particular relates to a method for recognizing human actions.
Background technology
With the rapid spread of digital networks, video surveillance now participates in the management of many industries and, thanks to its intuitiveness and real-time nature, is especially popular in the security field. As the cost of cameras and other monitoring equipment falls day by day, video surveillance systems are widely deployed in banks, post and telecommunications offices, prisons, courts, large public facilities, warehouses and military bases, and play an increasingly important role in public safety. At present, however, the function of most surveillance systems still rests on staff watching the video signal manually and analyzing recordings after the fact, leaving the enormous computing power provided by the rapid development of computer technology largely unused. In fact, most surveillance systems are still analog, and the few digital ones offer only simple multi-picture display and hard-disk recording functions. No existing system provides active, real-time supervision, that is, intelligent and unattended monitoring. An intelligent surveillance system can monitor around the clock, automatically analyze the image data captured by the cameras, and send an accurate and timely alarm to security staff when an anomaly occurs, thereby preventing crime. The core of such video surveillance is the recognition of human actions.
At present there are three main approaches to recognizing human actions: (1) template matching; (2) state-space methods; (3) model-based methods.
Template matching is easy to implement and has low time overhead; it works well for behaviors that differ greatly, but poorly for behaviors with only subtle differences, and it is sensitive to variations in movement duration and to noise.
In recent years, state-space methods have been studied extensively for human action recognition, with Markov networks as the representative example; hidden Markov models (HMM, Hidden Markov Model) have been widely used for prediction, estimation, detection and behavior recognition in video and image sequences. However, state-space methods require a large number of training samples to train the state-transition probability parameters, so their accuracy is strongly affected by the amount of training data; moreover, the state-sequence recognition stage is itself a form of template matching, and given the complexity of behavior, a single behavior template is clearly insufficient.
More recently, many researchers have turned their attention to natural-language description of human behavior, and semantic description and analysis of human actions has made some progress. Natural-language description of human actions in video images typically builds 2D or 3D models: a 3D model describes the instantaneous posture of the body, constructed to resemble the target pose as closely as possible using the body's edge information in the image; alternatively, a behavior is regarded as a sequence of static 2D postures, and model-based methods reconstruct and estimate the body's 2D or 3D posture, angles, position, and changes in distance to other targets in the environment, finally generating a natural-language description of the behavior. However, model-based methods are structurally complex and hard to implement, and current feature extraction for natural-language description produces very many features: a video of about 100 frames yields hundreds to thousands of features, so the processing overhead is large and such methods are hard to apply in real-time systems. Recent work abroad combines natural-language description with various probabilistic models for recognition, but behavior recognition of this kind is still in its infancy.
None of the above methods achieves a balance between recognition accuracy and real-time performance: either the accuracy is high but the computational complexity is high and real-time performance poor, or the complexity is low and real-time performance good but the accuracy is low.
Summary of the invention
The object of the invention is to resolve the contradiction between accuracy and real-time performance in existing methods for recognizing human actions in video images, by providing a method for recognizing human actions in a video sequence.
The technical scheme of the invention is a method for recognizing human actions in a video sequence, comprising two processes, feature extraction and feature training and recognition, wherein
feature extraction comprises the following steps:
S1. calculate the pixel motion change frequency map of the video sequence;
S2. partition the pixel motion change frequency map: determine the region in which the map values exceed a given threshold, find the minimum and maximum ordinates and abscissas of the pixels in this region, use these to determine a target region, and divide the target region into several sub-regions along the vertical or horizontal axis according to a fixed ratio;
S3. compute a pixel change histogram for the target region and each sub-region, as follows:
S31. non-uniformly quantize the values of the pixel motion change frequency map within the target region into N quantization levels;
S32. for the target region and each sub-region, sum the values of the pixel motion change frequency map corresponding to each quantization level, so that each region yields an N-dimensional histogram;
S33. tile the histograms of the target region and the sub-regions into one multi-dimensional vector and normalize it to obtain the pixel change histogram;
S4. compute an edge gradient histogram for the target region and each sub-region;
S5. calculate the difference edge histogram of the video sequence as follows: compute the difference image between the current frame and the previous frame; if the maximum absolute value of its elements exceeds a preset threshold, compute the edge histogram of the difference image to obtain the difference edge histogram;
S6. compute the motion histogram: calculate the motion history image of the video sequence and compute the edge gradient histogram of the motion history image to obtain the motion histogram;
S7. tile the pixel change histogram from step S3, the edge gradient histogram from step S4, the difference edge histogram from step S5 and the motion histogram from step S6 into one feature pool; this is the final video sequence feature.
The feature training and recognition process comprises the following steps:
S8. perform online dictionary training and learning on the video sequence features to obtain a dictionary;
S9. use the dictionary to apply k-nearest-neighbour locality-constrained linear coding to the video sequence features;
S10. apply supervised distance metric learning to the coded features to obtain a Mahalanobis distance transformation matrix, use it in place of the Euclidean distance in k-means clustering to form a codebook, and then compute the statistical histogram of each coded video feature with respect to the codebook;
S11. classify the statistical histograms with a tf-idf classifier to obtain the final recognition result.
The detailed process by which step S4 computes the edge gradient histogram is as follows:
S41. for the target region, compute the gradients px and py in the x and y directions, compute the squared gradient magnitude and the gradient direction, and normalize the magnitude;
S42. quantize the gradient direction into M quantization levels and, for each region, sum the magnitudes within each quantization interval to obtain a histogram over the intervals; this is the first new feature;
S43. for each region, compute the ratio of its area to its magnitude sum, and multiply the histogram obtained in step S42 by this ratio to obtain the second new feature;
S44. for each region and each quantization level, compute the sum of the corresponding magnitudes and the number of magnitudes falling in that level; the ratio of the magnitude sum to the count for each direction is the third new feature.
The detailed process by which step S1 calculates the pixel motion change frequency map of the video sequence is as follows: accumulate the three-frame difference and difference results of the video sequence to obtain an image of the same size as a video frame, then square the value at each point and divide by the maximum value; the result is the pixel motion change frequency map.
Beneficial effects of the invention: by calculating the difference edge histogram of the video sequence, the method significantly reduces the number of video features used, increases recognition speed, and satisfies the real-time requirement of human action recognition; by computing the pixel change histogram and edge gradient histogram separately for the target region and several sub-regions, it improves the accuracy of recognizing motion details.
Description of drawings
Fig. 1 is the detailed flow chart of the invention.
Fig. 2 is a schematic diagram of the pixel change probability map for the "running" action of the embodiment of the invention.
Fig. 3 is a schematic diagram of the statistical histogram of the "running" pixel change probability map of the embodiment of the invention.
Fig. 4 is a schematic diagram of the difference image gradient magnitude of the embodiment of the invention.
Fig. 5 is a schematic diagram of the edge gradient histogram of the difference image of the embodiment of the invention.
Fig. 6 is a schematic diagram of the motion history image of the embodiment of the invention.
Fig. 7 is a schematic diagram of the edge gradient histogram of the motion history image of the embodiment of the invention.
Fig. 8 is a schematic diagram of the recognition results of the embodiment of the invention.
Embodiment
To make the technical scheme of the invention clearer, the invention is further described below with reference to the accompanying drawings and a specific embodiment.
This embodiment takes video surveillance as an example. A monitored region is first set up, and a specific number of frames is collected from the camera in that region; in this embodiment the frame count is 100, that is, 100 frames constitute one video, and the frame size is 240*320. Four human actions, "fighting, reaching out, running, walking", are used as examples.
The method for recognizing human actions in a video sequence of the invention proceeds as shown in Fig. 1 and comprises two processes, feature extraction and feature training and recognition. Feature extraction comprises the following steps:
S1. Calculate the pixel motion change frequency map of the video sequence. A classical method can be adopted here; the detailed process is as follows: accumulate the three-frame difference and difference results of the video sequence to obtain an image of the same size as a video frame, then square the value at each point and divide by the maximum value (which completes the normalization of the image); the result is the pixel motion change frequency map. Fig. 2 is a schematic diagram of the pixel change probability map for "running".
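As a rough sketch of this step (in Python with NumPy; the frames are assumed to be grayscale float arrays, and since the text does not spell out exactly how the three-frame difference and the plain difference are combined, the `np.minimum` accumulation below is an assumption):

```python
import numpy as np

def motion_change_frequency_map(frames):
    """Sketch of step S1: accumulate three-frame differences over the
    sequence, then square each point and divide by the maximum value.
    `frames` is a list of equal-sized grayscale float arrays."""
    acc = np.zeros_like(frames[0], dtype=np.float64)
    for i in range(2, len(frames)):
        # three-frame differencing: count a pixel as moving only if it
        # changes in both adjacent frame differences (assumed combination)
        d1 = np.abs(frames[i] - frames[i - 1])
        d2 = np.abs(frames[i - 1] - frames[i - 2])
        acc += np.minimum(d1, d2)
    acc = acc ** 2                       # square the value at each point
    m = acc.max()
    return acc / m if m > 0 else acc     # normalize by the maximum value
```

The result is an image of the same size as a frame with values in 0~1, matching the 0.03 threshold used in step S2.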
S2. Partition the pixel motion change frequency map: determine the region in which the map values exceed a threshold, here taken as 0.03 (for images scaled to 0~1); find the minimum and maximum ordinates and abscissas of the pixels in this region and use them to determine a target region; then divide the target region into several sub-regions along the vertical or horizontal axis according to a fixed ratio.
To match the structure of the human body image, the target region is here divided into three sub-regions along the horizontal direction at the ratio 3:4:8.
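The region selection and the 3:4:8 split might be sketched as follows (the threshold of 0.03 follows the embodiment; reading "along the horizontal direction" as cutting the bounding box into three column bands is an assumption, since the translation is ambiguous):

```python
import numpy as np

def split_target_region(freq_map, threshold=0.03, ratios=(3, 4, 8)):
    """Sketch of step S2: bound the pixels whose motion change frequency
    exceeds `threshold`, then split the box along the x direction in the
    given ratio.  Regions are (y0, x0, y1, x1) inclusive."""
    ys, xs = np.nonzero(freq_map > threshold)
    y0, y1 = int(ys.min()), int(ys.max())
    x0, x1 = int(xs.min()), int(xs.max())
    total, width = sum(ratios), x1 - x0 + 1
    subs, start = [], x0
    for r in ratios:
        # proportional cut point for this sub-region
        end = start + int(round(width * r / total))
        subs.append((y0, start, y1, min(end, x1 + 1) - 1))
        start = end
    return (y0, x0, y1, x1), subs
```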
S3. Compute a pixel change histogram for the target region and the three sub-regions, as follows:
S31. Non-uniformly quantize the values of the pixel motion change frequency map within the target region into N quantization levels, where N is a natural number; to achieve the best effect, N is taken as 8 in this embodiment;
S32. For the target region and the three sub-regions, sum the values of the pixel motion change frequency map corresponding to each quantization level, so that each region yields an 8-dimensional histogram;
S33. Tile the histograms of the target region and the three sub-regions into one multi-dimensional vector and normalize it to obtain the pixel change histogram; each video of this example finally yields one pixel change histogram, with the result shown in Fig. 3.
S4. Compute an edge gradient histogram for the target region and the three sub-regions. The detailed process is as follows:
S41. For the target region, compute the gradients px and py in the x and y directions, the squared gradient magnitude and the gradient direction, and then normalize the magnitude:
P = px^2 + py^2,
where P denotes the squared gradient magnitude, px(x, y) denotes the gradient map in the x direction, py(x, y) denotes the gradient map in the y direction, and θ denotes the gradient direction angle.
S42. Quantize the gradient direction into M quantization levels, where M is a natural number; to achieve the best effect, M is taken as 8 in this embodiment. Then, for each region, sum the magnitudes within each quantization interval to obtain a histogram over the intervals; this is the first new feature.
S43. For each region, compute the ratio of its area to its magnitude sum, and multiply the histogram obtained in step S42 by this ratio to obtain the second new feature.
S44. For each region and each quantization level, compute the sum of the corresponding magnitudes and the number of magnitudes falling in that level; the ratio of the magnitude sum to the count for each direction is the third new feature.
S45. Tile the three sub-features of steps S42, S43 and S44 for the target region and the three sub-regions into one feature pool and normalize it to obtain the edge gradient histogram; each video of this embodiment finally yields one such edge gradient histogram.
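Steps S41, S42 and S44 for a single region might be sketched as follows (the use of `np.gradient` and the uniform direction binning are assumptions; feature 2, the area-to-magnitude ratio weighting of step S43, is left to the caller):

```python
import numpy as np

def edge_gradient_histogram(patch, m_bins=8):
    """Sketch of steps S41, S42, S44 for one region: gradients px, py,
    squared magnitude P = px^2 + py^2, direction quantized to m_bins,
    then the magnitude-sum histogram (feature 1) and the
    magnitude-sum / count ratio per direction (feature 3)."""
    px = np.gradient(patch, axis=1)          # x-direction gradient map
    py = np.gradient(patch, axis=0)          # y-direction gradient map
    amp = px ** 2 + py ** 2                  # squared magnitude P
    if amp.max() > 0:
        amp = amp / amp.max()                # normalize the magnitude
    theta = np.arctan2(py, px)               # direction angle in (-pi, pi]
    bins = ((theta + np.pi) / (2 * np.pi) * m_bins).astype(int) % m_bins
    h_sum = np.zeros(m_bins)
    h_cnt = np.zeros(m_bins)
    np.add.at(h_sum, bins, amp)              # feature 1: magnitude sums
    np.add.at(h_cnt, bins, 1)
    ratio = h_sum / np.maximum(h_cnt, 1)     # feature 3: sum / count
    return h_sum, ratio
```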
S5. Calculate the difference edge histogram of the video sequence. The detailed process is as follows: compute the difference image between the current frame I_i and the previous frame I_(i-1), d(x, y) = I_i - I_(i-1). If the maximum absolute value of the elements of d(x, y) exceeds a preset threshold, here taken as 7.65 (for 0~255 grayscale images), compute the edge histogram of d(x, y) using the method described in step S4. The difference gradient magnitude image is shown in Fig. 4 and the difference edge histogram in Fig. 5, where the abscissa is the feature dimension and the ordinate is the value of each dimension; here the feature has 96 dimensions. If a video has L frames, at most L-1 such difference edge histograms are obtained.
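A minimal sketch of step S5 for one frame pair, assuming 0~255 grayscale frames. The real edge histogram is the step-S4 edge gradient histogram; the 8-bin gradient magnitude histogram used as a default below is only a stand-in:

```python
import numpy as np

def difference_edge_histogram(curr, prev, threshold=7.65, edge_hist=None):
    """Sketch of step S5: form d = I_i - I_(i-1); if the largest absolute
    element exceeds `threshold` (7.65 for 0~255 images), compute an edge
    histogram of d, otherwise skip this frame pair.  `edge_hist` is any
    per-region edge histogram function (assumed interface)."""
    d = curr.astype(np.float64) - prev.astype(np.float64)
    if np.abs(d).max() <= threshold:
        return None                      # too little motion: no histogram
    if edge_hist is None:                # stand-in 8-bin edge measure
        gy, gx = np.gradient(d)
        mags = np.hypot(gx, gy).ravel()
        h, _ = np.histogram(mags, bins=8)
        return h / max(h.sum(), 1)
    return edge_hist(d)
```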
S6. Calculate the motion history image of the video sequence and compute the edge gradient histogram of the motion history image to obtain the motion histogram. The motion history image is shown in Fig. 6; each video of this embodiment finally yields one such motion histogram, shown in Fig. 7, where the abscissa is the feature dimension and the ordinate is the value of each dimension; here the feature has 96 dimensions.
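The motion history image can be sketched with the classical update rule (moving pixels set to a timestamp tau, static pixels decayed; tau and the difference threshold below are illustrative assumptions):

```python
import numpy as np

def motion_history_image(frames, tau=100.0, diff_thresh=10.0):
    """Sketch of the motion history image used in step S6: at each frame,
    moving pixels are stamped with tau and static pixels decay by one, so
    recent motion is bright and older motion fades toward zero."""
    mhi = np.zeros_like(frames[0], dtype=np.float64)
    for i in range(1, len(frames)):
        moving = np.abs(frames[i] - frames[i - 1]) > diff_thresh
        mhi[moving] = tau                                  # stamp motion
        mhi[~moving] = np.maximum(mhi[~moving] - 1.0, 0.0)  # decay
    return mhi
```

The edge gradient histogram of step S4 is then applied to this image to obtain the motion histogram.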
The 100 frames in steps S3 and S6 can also be divided into several sub-segments to obtain several sub pixel change histograms and sub motion histograms.
S7. Tile the pixel change histogram from step S3, the edge gradient histogram from step S4, the difference edge histogram from step S5 and the motion histogram from step S6 into one feature pool; this is the final video sequence feature. During tiling, every difference edge histogram feature shares the pixel change histogram feature, edge gradient histogram feature and motion histogram feature, so each video finally yields as many features as there are difference edge histograms, at most L-1 features, which is significantly fewer than the number of features extracted by other natural-language description methods.
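The tiling of step S7 might be sketched as follows (plain concatenation of each per-frame difference edge histogram with the three shared video-level histograms is an assumption about the exact layout):

```python
def tile_features(diff_edge_hists, pixel_hist, edge_hist, motion_hist):
    """Sketch of step S7: each per-frame difference edge histogram is
    concatenated with the three video-level histograms, giving at most
    L-1 feature vectors for an L-frame video."""
    shared = list(pixel_hist) + list(edge_hist) + list(motion_hist)
    return [list(d) + shared for d in diff_edge_hists]
```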
Once the video sequence features are obtained, feature training and recognition can be carried out. To further improve recognition accuracy, the following steps are adopted:
S8. Perform online dictionary training and learning on the sample video sequence features to obtain a dictionary. For online dictionary learning, see Julien Mairal et al., "Online Learning for Matrix Factorization and Sparse Coding", Journal of Machine Learning Research 11 (2010), 19-60.
S9. Use the trained dictionary to apply k-nearest-neighbour locality-constrained linear coding to the video sequence features. For locality-constrained linear coding, see Jinjun Wang, Jianchao Yang et al., "Locality-constrained Linear Coding for Image Classification", Computer Vision and Pattern Recognition (CVPR), 2010, 3360-3367.
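The analytical k-nearest-neighbour solution of locality-constrained linear coding from Wang et al. can be sketched as follows (the regularization constant `beta` is an assumed hyperparameter):

```python
import numpy as np

def llc_encode(x, dictionary, k=5, beta=1e-4):
    """Sketch of step S9: k-NN locality-constrained linear coding.  The
    feature x is reconstructed from its k nearest dictionary atoms under
    a sum-to-one constraint; all other code entries are zero.
    `dictionary` holds one atom per row."""
    dists = np.linalg.norm(dictionary - x, axis=1)
    nn = np.argsort(dists)[:k]               # k nearest atoms
    B = dictionary[nn] - x                   # shifted neighbour atoms
    C = B @ B.T + beta * np.eye(k)           # local covariance, regularized
    w = np.linalg.solve(C, np.ones(k))
    w = w / w.sum()                          # enforce sum(w) = 1
    code = np.zeros(dictionary.shape[0])
    code[nn] = w
    return code
```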
S10. Apply supervised distance metric learning to the coded features to obtain a Mahalanobis distance transformation matrix, use it in place of the Euclidean distance in k-means clustering to form a codebook, and then compute the statistical histogram of each coded video feature with respect to the codebook. For supervised distance metric learning, see Kilian Q. Weinberger et al., "Distance Metric Learning for Large Margin Nearest Neighbor Classification", Journal of Machine Learning Research 10 (2009), 207-244.
S11. Classify the statistical histograms with a tf-idf classifier to obtain the final recognition result, shown in Fig. 8. For the tf-idf classifier, see G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval", Information Processing & Management 24 (5) (1988), 513-523.
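A minimal tf-idf nearest-class sketch of step S11 (the per-class reference histograms and the cosine-similarity decision are assumptions; the patent only names the classifier):

```python
import numpy as np

def tfidf_classify(query_hist, class_hists):
    """Sketch of step S11: weight each codeword by tf-idf (Salton &
    Buckley, 1988) and assign the query to the class whose weighted
    reference histogram is most similar under cosine similarity."""
    H = np.vstack(class_hists).astype(np.float64)
    df = (H > 0).sum(axis=0)                       # document frequency
    idf = np.log((1 + len(class_hists)) / (1 + df)) + 1.0
    def weight(h):
        tf = h / max(h.sum(), 1e-12)               # term frequency
        v = tf * idf
        return v / max(np.linalg.norm(v), 1e-12)
    q = weight(np.asarray(query_hist, dtype=np.float64))
    sims = [float(q @ weight(h)) for h in H]
    return int(np.argmax(sims))                    # index of best class
```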
The recognition results show that the method is effective for human action recognition: it can recognize not only simple normal behaviors such as running and walking, but also the suspicious reaching-out behavior (the repeated reaching typical of theft) and the complex fighting behavior. In Fig. 8 the diagonal entries are the correct recognition rates and the remaining entries are error rates; for example, the fourth row shows that walking is recognized correctly 91% of the time, with 3% misjudged as fighting and 6% misjudged as reaching out.
The method of the invention executes quickly: a simulation program developed in MATLAB 2009a, running on a PC with a dual-core 2.5 GHz CPU and 2 GB of memory, recognizes video images of 240*320 pixels at 7~9 s per 100 frames. If the program were rewritten in C under the VC environment, real-time performance could be achieved.
By calculating the difference edge histogram of the video sequence, the method of the invention significantly reduces the number of video features used, increases recognition speed, and satisfies the real-time requirement of human action recognition; by computing the pixel change histogram and edge gradient histogram separately for the target region and several sub-regions, it improves the accuracy of recognizing motion details.
Those of ordinary skill in the art will appreciate that the embodiment described here is intended to help the reader understand the principle of the invention, and that the scope of protection is not limited to this particular statement and embodiment. Those of ordinary skill in the art can, in light of the teachings disclosed by the invention, make various modifications and combinations that do not depart from the essence of the invention, and such modifications and combinations remain within the scope of protection of the invention.
Claims (7)
1. A method for recognizing human actions in a video sequence, comprising two processes, feature extraction and feature training and recognition, characterized in that feature extraction comprises the following steps:
S1. calculate the pixel motion change frequency map of the video sequence;
S2. partition the pixel motion change frequency map: determine the region in which the map values exceed a given threshold, find the minimum and maximum ordinates and abscissas of the pixels in this region, use these to determine a target region, and divide the target region into several sub-regions along the vertical or horizontal axis according to a fixed ratio;
S3. compute a pixel change histogram for the target region and each sub-region, as follows:
S31. non-uniformly quantize the values of the pixel motion change frequency map within the target region into N quantization levels;
S32. for the target region and each sub-region, sum the values of the pixel motion change frequency map corresponding to each quantization level, so that each region yields an N-dimensional histogram;
S33. tile the histograms of the target region and the sub-regions into one multi-dimensional vector and normalize it to obtain the pixel change histogram;
S4. compute an edge gradient histogram for the target region and each sub-region;
S5. calculate the difference edge histogram of the video sequence as follows: compute the difference image between the current frame and the previous frame; if the maximum absolute value of its elements exceeds a preset threshold, compute the edge histogram of the difference image to obtain the difference edge histogram;
S6. calculate the motion history image of the video sequence and compute its edge gradient histogram to obtain the motion histogram;
S7. tile the pixel change histogram from step S3, the edge gradient histogram from step S4, the difference edge histogram from step S5 and the motion histogram from step S6 into one feature pool; this is the final video sequence feature.
2. The method for recognizing human actions in a video sequence according to claim 1, characterized in that the feature training and recognition process comprises the following steps:
S8. perform online dictionary training and learning on the video sequence features to obtain a dictionary;
S9. use the dictionary to apply k-nearest-neighbour locality-constrained linear coding to the video sequence features;
S10. apply supervised distance metric learning to the coded features to obtain a Mahalanobis distance transformation matrix, use it in place of the Euclidean distance in k-means clustering to form a codebook, and then compute the statistical histogram of each coded video feature with respect to the codebook;
S11. classify the statistical histograms with a tf-idf classifier to obtain the final recognition result.
3. The method for recognizing human actions in a video sequence according to claim 1 or 2, characterized in that the detailed process by which step S4 computes the edge gradient histogram is as follows:
S41. for the target region, compute the gradients px and py in the x and y directions, compute the squared gradient magnitude and the gradient direction, and normalize the magnitude;
S42. quantize the gradient direction into M quantization levels and, for each region, sum the magnitudes within each quantization interval to obtain a histogram over the intervals; this is the first new feature;
S43. for each region, compute the ratio of its area to its magnitude sum, and multiply the histogram obtained in step S42 by this ratio to obtain the second new feature;
S44. for each region and each quantization level, compute the sum of the corresponding magnitudes and the number of magnitudes falling in that level; the ratio of the magnitude sum to the count for each direction is the third new feature.
4. The method for recognizing human actions in a video sequence according to claim 1 or 2, characterized in that the detailed process by which step S1 calculates the pixel motion change frequency map of the video sequence is as follows: accumulate the three-frame difference and difference results of the video sequence to obtain an image of the same size as a video frame, then square the value at each point and divide by the maximum value; the result is the pixel motion change frequency map.
5. The method for recognizing human actions in a video sequence according to claim 1 or 2, characterized in that the division into several sub-regions along the vertical or horizontal axis according to a fixed ratio in step S2 is a division into three sub-regions along the horizontal direction at the ratio 3:4:8.
6. The method for recognizing human actions in a video sequence according to claim 1 or 2, characterized in that N in steps S31 and S32 is 8.
7. The method for recognizing human actions in a video sequence according to claim 3, characterized in that M in step S42 is 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201110109440 CN102136066B (en) | 2011-04-29 | 2011-04-29 | Method for recognizing human motion in video sequence |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201110109440 CN102136066B (en) | 2011-04-29 | 2011-04-29 | Method for recognizing human motion in video sequence |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102136066A true CN102136066A (en) | 2011-07-27 |
CN102136066B CN102136066B (en) | 2013-04-03 |
Family
ID=44295848
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 201110109440 Expired - Fee Related CN102136066B (en) | 2011-04-29 | 2011-04-29 | Method for recognizing human motion in video sequence |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102136066B (en) |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102592112A (en) * | 2011-12-20 | 2012-07-18 | 四川长虹电器股份有限公司 | Method for determining gesture moving direction based on hidden Markov model |
Application Events
- 2011-04-29: Application CN 201110109440 filed; granted as CN102136066B. Status: not active, expired due to non-payment of annual fee.
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6226388B1 (en) * | 1999-01-05 | 2001-05-01 | Sharp Labs Of America, Inc. | Method and apparatus for object tracking for automatic controls in video devices |
US6647131B1 (en) * | 1999-08-27 | 2003-11-11 | Intel Corporation | Motion detection using normal optical flow |
CN101853388A (en) * | 2009-04-01 | 2010-10-06 | 中国科学院自动化研究所 | Unchanged view angle behavior identification method based on geometric invariable |
CN101866429A (en) * | 2010-06-01 | 2010-10-20 | 中国科学院计算技术研究所 | Training method of multi-moving object action identification and multi-moving object action identification method |
CN101894276A (en) * | 2010-06-01 | 2010-11-24 | 中国科学院计算技术研究所 | Training method of human action recognition and recognition method |
Non-Patent Citations (1)
Title |
---|
Yao-Hui Qin; Hong-Liang Li; Guang-Hui Liu; Zheng-Ning Wang, "Human action recognition using PEM histogram," IEEE 2010 International Conference on Computational Problem-Solving (ICCP), 2010-12-05, pp. 323-325; relevant to claims 1-7. * |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102592112B (en) * | 2011-12-20 | 2014-01-29 | 四川长虹电器股份有限公司 | Method for determining gesture moving direction based on hidden Markov model |
CN102592112A (en) * | 2011-12-20 | 2012-07-18 | 四川长虹电器股份有限公司 | Method for determining gesture moving direction based on hidden Markov model |
CN103077383B (en) * | 2013-01-09 | 2015-12-09 | 西安电子科技大学 | Based on the human motion identification method of the Divisional of spatio-temporal gradient feature |
CN103077383A (en) * | 2013-01-09 | 2013-05-01 | 西安电子科技大学 | Method for identifying human body movement of parts based on spatial and temporal gradient characteristics |
CN104023160A (en) * | 2013-02-28 | 2014-09-03 | 株式会社Pfu | Overhead scanner and image obtaining method |
CN104023160B (en) * | 2013-02-28 | 2017-04-12 | 株式会社Pfu | Overhead scanner and image obtaining method |
CN103473544A (en) * | 2013-04-28 | 2013-12-25 | 南京理工大学 | Robust human body feature rapid extraction method |
CN104200203A (en) * | 2014-08-30 | 2014-12-10 | 电子科技大学 | Human movement detection method based on movement dictionary learning |
CN104200203B (en) * | 2014-08-30 | 2017-07-11 | 电子科技大学 | A kind of human action detection method based on action dictionary learning |
US10339371B2 (en) | 2015-09-23 | 2019-07-02 | Goertek Inc. | Method for recognizing a human motion, method for recognizing a user action and smart terminal |
CN105184325A (en) * | 2015-09-23 | 2015-12-23 | 歌尔声学股份有限公司 | Human body action recognition method and mobile intelligent terminal |
CN105938544A (en) * | 2016-04-05 | 2016-09-14 | 大连理工大学 | Behavior identification method based on integrated linear classifier and analytic dictionary |
CN105938544B (en) * | 2016-04-05 | 2020-05-19 | 大连理工大学 | Behavior recognition method based on comprehensive linear classifier and analytic dictionary |
CN106022310A (en) * | 2016-06-14 | 2016-10-12 | 湖南大学 | HTG-HOG (histograms of temporal gradient and histograms of oriented gradient) and STG (scale of temporal gradient) feature-based human body behavior recognition method |
CN106022310B (en) * | 2016-06-14 | 2021-08-17 | 湖南大学 | Human body behavior identification method based on HTG-HOG and STG characteristics |
CN106295532B (en) * | 2016-08-01 | 2019-09-24 | 河海大学 | A kind of human motion recognition method in video image |
CN106295532A (en) * | 2016-08-01 | 2017-01-04 | 河海大学 | A kind of human motion recognition method in video image |
CN106599882A (en) * | 2017-01-07 | 2017-04-26 | 武克易 | Body sensing motion identification device |
CN108197589B (en) * | 2018-01-19 | 2019-05-31 | 北京儒博科技有限公司 | Semantic understanding method, apparatus, equipment and the storage medium of dynamic human body posture |
CN108197589A (en) * | 2018-01-19 | 2018-06-22 | 北京智能管家科技有限公司 | Semantic understanding method, apparatus, equipment and the storage medium of dynamic human body posture |
CN108597223A (en) * | 2018-04-28 | 2018-09-28 | 北京智行者科技有限公司 | A kind of data processing method and system for intelligent vehicle behavior description |
CN111488858A (en) * | 2020-04-30 | 2020-08-04 | 杨九妹 | Pedestrian behavior analysis method and system for big data financial security system and robot |
CN116704405A (en) * | 2023-05-22 | 2023-09-05 | 阿里巴巴(中国)有限公司 | Behavior recognition method, electronic device and storage medium |
CN117523664A (en) * | 2023-11-13 | 2024-02-06 | 书行科技(北京)有限公司 | Training method of human motion prediction model, related method and related product |
CN117523664B (en) * | 2023-11-13 | 2024-06-25 | 书行科技(北京)有限公司 | Training method of human motion prediction model, human-computer interaction method, and corresponding device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN102136066B (en) | 2013-04-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102136066B (en) | Method for recognizing human motion in video sequence | |
CN110147743B (en) | Real-time online pedestrian analysis and counting system and method under complex scene | |
WO2021017606A1 (en) | Video processing method and apparatus, and electronic device and storage medium | |
CN102682302B (en) | Human body posture identification method based on multi-characteristic fusion of key frame | |
CN101894276B (en) | Training method of human action recognition and recognition method | |
CN110781838A (en) | Multi-modal trajectory prediction method for pedestrian in complex scene | |
CN105574510A (en) | Gait identification method and device | |
CN111723786A (en) | Method and device for detecting wearing of safety helmet based on single model prediction | |
CN103034860A (en) | Scale-invariant feature transform (SIFT) based illegal building detection method | |
Wang et al. | Video event detection using motion relativity and feature selection | |
CN102360422A (en) | Violent behavior detecting method based on video analysis | |
CN108280421A (en) | Human bodys' response method based on multiple features Depth Motion figure | |
Li et al. | Human action recognition using improved salient dense trajectories | |
Batool et al. | Telemonitoring of daily activities based on multi-sensors data fusion | |
CN103577804A (en) | Abnormal human behavior identification method based on SIFT flow and hidden conditional random fields | |
Qin et al. | Application of video scene semantic recognition technology in smart video | |
Leyva et al. | Video anomaly detection based on wake motion descriptors and perspective grids | |
Abhishek et al. | Human Verification over Activity Analysis via Deep Data Mining | |
Li et al. | Multi-scale analysis of contextual information within spatio-temporal video volumes for anomaly detection | |
Sri Jamiya | An efficient algorithm for real-time vehicle detection using deep neural networks | |
Xia et al. | Recognition of suspicious behavior using case-based reasoning | |
Umakanthan et al. | Multiple instance dictionary learning for activity representation | |
Huang et al. | View-independent behavior analysis | |
Li et al. | Human action recognition using spatio-temoporal descriptor | |
Nandagopal et al. | Optimal Deep Convolutional Neural Network with Pose Estimation for Human Activity Recognition. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20130403; Termination date: 20160429 |