CN102136066B - Method for recognizing human motion in video sequence - Google Patents

Method for recognizing human motion in video sequence

Info

Publication number
CN102136066B
CN102136066B CN102136066A CN 201110109440 CN201110109440A
Authority
CN
China
Prior art keywords
histogram
video sequence
pixel
target area
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 201110109440
Other languages
Chinese (zh)
Other versions
CN102136066A (en)
Inventor
李宏亮 (Hong-Liang Li)
覃耀辉 (Yao-Hui Qin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN 201110109440 priority Critical patent/CN102136066B/en
Publication of CN102136066A publication Critical patent/CN102136066A/en
Application granted granted Critical
Publication of CN102136066B publication Critical patent/CN102136066B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a method for recognizing human motion in a video sequence, aimed at the conflict between accuracy and real-time performance in existing methods for recognizing human motion in video images. The method comprises a feature extraction process and a feature training and recognition process. In feature extraction, a difference edge histogram of the video sequence is calculated, which greatly reduces the number of video features used, increases the recognition speed and satisfies the real-time requirement of human motion recognition; pixel change histograms and edge gradient histograms are calculated for a target region and several sub-regions respectively, which improves the accuracy of recognizing motion details. The method not only improves the recognition accuracy but also satisfies the real-time requirement.

Description

Method for recognizing human action in a video sequence
Technical field
The invention belongs to the technical field of computer vision, and in particular relates to a method for recognizing human actions.
Background art
The pace of digital networking is steadily accelerating, and video surveillance systems are increasingly involved in the management of entire industries. With the advantages of intuitiveness and real-time operation they have become a practical reality in all trades, and are especially popular in the security field. As the cost of cameras and other monitoring equipment falls, video surveillance systems are widely deployed in banks, post and telecommunications offices, prisons, courts, large public facilities, bulk storage depots, military bases and similar places, and play an increasingly important role in public security. However, most current systems are limited to manual monitoring of the video signal by operators and after-the-fact analysis of recordings, and do not take full advantage of the enormous computing power made available by the rapid development of computer technology. In fact, most surveillance systems are analog, and the few digital systems provide only simple functions such as multi-picture display and hard-disk recording. Existing surveillance systems cannot perform active, real-time supervision, that is, intelligent and unattended monitoring. An intelligent surveillance system can monitor around the clock, automatically analyze the image data captured by the cameras, and send an accurate and timely alarm to security personnel when an abnormal event occurs, thereby preventing crime; the core of such video surveillance is the recognition of human actions.
At present there are mainly three approaches to recognizing human actions: (1) template matching; (2) state-space methods; (3) model-based methods.
The advantage of template matching is that the algorithm is easy to implement and its time overhead is small; it recognizes behaviors that differ greatly from one another fairly well, but it performs poorly on behaviors with subtle differences and is sensitive to variations in movement duration and to noise.
In recent years, state-space methods have been widely studied for human action recognition; the representative model is the Markov network. The hidden Markov model (HMM, Hidden Markov Model) has been widely applied to prediction, estimation, detection and behavior recognition in video and image sequences. However, state-space methods need a large number of training samples to train the state transition probability parameters, so their accuracy is strongly affected by the amount of training data; moreover, the state-sequence recognition process is in principle still template matching, and given the complexity of behaviors a single behavior template is clearly insufficient.
Many researchers in human action recognition now prefer to describe human actions with natural language, and the analysis of semantic descriptions of human behavior has made some progress. Natural language description of human actions in video images typically builds 2D or 3D models: a 3D model is used to describe the instantaneous posture of the human body, making the constructed model resemble the target posture, i.e. the edge information of the human body in the image, as closely as possible; alternatively the behavior is regarded as a sequence of static 2D postures. Model-based methods reconstruct and estimate the 2D or 3D posture, angles and position of the body and the changes in its distance to other targets in the environment, and finally generate a natural language text describing the human action. However, model-based methods are structurally complex and difficult to implement; moreover, current feature extraction methods for natural language description extract very many features (a video of about 100 frames yields hundreds to several thousand features), so the processing overhead is large, which makes them hard to apply in real-time systems. Recent work abroad has mainly combined natural language descriptions with various probability models for recognition, but such behavior recognition is still in its infancy.
None of the above methods achieves a balance between recognition accuracy and real-time performance: either the recognition accuracy is high but the computational complexity is high and real-time performance is poor, or the computational complexity is low and real-time performance is good but the recognition accuracy is low.
Summary of the invention
The objective of the present invention is to resolve the conflict between accuracy and real-time performance in existing methods for recognizing human actions in video images; to this end, a method for recognizing human actions in a video sequence is proposed.
The technical scheme of the present invention is: a method for recognizing human actions in a video sequence, comprising two processes, feature extraction and feature training and recognition, wherein
feature extraction comprises the following steps:
S1. Calculate the pixel motion change frequency map of the video sequence;
S2. Partition the pixel motion change frequency map into regions: determine the region in which the values of the map exceed a preset first threshold; find the minimum and maximum abscissa and ordinate of the pixels in this region; use these four coordinates to define a target region; and divide the target region into several sub-regions along the vertical or horizontal axis according to a certain ratio;
S3. Compute a pixel change histogram for the target region and for each of the sub-regions, as follows:
S31. Non-uniformly quantize the values of the target region in the pixel motion change frequency map into N quantization levels;
S32. For the target region and each sub-region, sum the values in the pixel motion change frequency map corresponding to each quantization level, so that each region yields an N-dimensional histogram;
S33. Tile the histograms of the target region and the sub-regions into one multi-dimensional vector and normalize it, obtaining the pixel change histogram;
S4. Compute an edge gradient histogram for the target region and each of the sub-regions;
S5. Calculate the difference edge histogram of the video sequence, as follows: compute the difference image between the current frame and the previous frame; if the maximum absolute value of the elements of the difference image exceeds a preset second threshold, compute the edge histogram of the difference image, obtaining the difference edge histogram;
S6. Compute the motion histogram: calculate the motion history image of the video sequence and compute the edge gradient histogram of this motion history image, which yields the motion histogram;
S7. Tile the pixel change histogram from step S3, the edge gradient histogram from step S4, the difference edge histogram from step S5 and the motion histogram from step S6 into one feature pool; this is the final video sequence feature.
The feature training and recognition specifically comprises the following steps:
S8. Perform online dictionary training and learning on the video sequence features to obtain a dictionary;
S9. Use the dictionary to apply k-nearest-neighbor locality-constrained linear coding to the video sequence features;
S10. Apply supervised distance metric learning to the coded features to obtain a Mahalanobis distance transformation matrix; use it in place of the Euclidean distance in k-means clustering to form a codebook; then compute, for the coded features of each video, the statistic histogram over the codebook;
S11. Classify the statistic histograms with the tf_idf classifier to obtain the final recognition result.
The detailed process by which step S4 computes the edge gradient histogram is as follows:
S41. For the target region, compute the gradients px and py in the x and y directions respectively, compute the squared gradient magnitude and the gradient direction, and normalize the magnitude;
S42. Quantize the gradient direction into M quantization levels, and for each region sum the magnitudes in each quantization interval to obtain a histogram over the intervals; this is the first new feature;
S43. Compute the ratio of each region's area to the sum of its magnitudes, and multiply it with the histogram obtained in step S42 to obtain the second new feature;
S44. For each quantization level of each region, sum the corresponding magnitudes and count the number of magnitude values involved; the ratio of the magnitude sum to the magnitude count in each direction is the third new feature.
The detailed process by which step S1 calculates the pixel motion change frequency map of the video sequence is as follows: accumulate the three-frame difference results over the video, obtaining an image the same size as a video frame; then square the value at each point and divide by the maximum value; the result is the pixel motion change frequency map.
Beneficial effects of the invention: by calculating the difference edge histogram of the video sequence, the method greatly reduces the number of video features used, increases the recognition speed and satisfies the real-time requirement of human action recognition; by computing pixel change histograms and edge gradient histograms for the target region and several sub-regions separately, it improves the accuracy of recognizing action details.
Brief description of the drawings
Fig. 1 is the detailed flowchart of the present invention.
Fig. 2 is a schematic diagram of the pixel change probability map for the "running" action in the embodiment of the invention.
Fig. 3 is a schematic diagram of the statistic histogram of the pixel change probability map for "running" in the embodiment of the invention.
Fig. 4 is a schematic diagram of the gradient magnitude of the difference image in the embodiment of the invention.
Fig. 5 is a schematic diagram of the edge gradient histogram of the difference image in the embodiment of the invention.
Fig. 6 is a schematic diagram of the motion history image in the embodiment of the invention.
Fig. 7 is a schematic diagram of the edge gradient histogram of the motion history image in the embodiment of the invention.
Fig. 8 is a schematic diagram of the recognition results of the embodiment of the invention.
Detailed description of the embodiments
To make the technical scheme of the present invention clearer, the invention is further described below with reference to the accompanying drawings and a specific embodiment.
This embodiment takes video surveillance as an example. A monitored region is first set up, and a specific number of frames is collected from the camera in that region; in this embodiment the frame count is 100, that is, 100 frames constitute one video, and the size of the video frames is 240*320. Four human actions, namely "fighting", "stretching out a hand", "running" and "walking", are taken as examples.
The method of the present invention for recognizing human actions in a video sequence, whose detailed flow is shown in Fig. 1, comprises two processes: feature extraction, and feature training and recognition. Feature extraction comprises the following steps:
S1. Calculate the pixel motion change frequency map of the video sequence. The detailed process is as follows: accumulate the three-frame difference results over the video, obtaining an image the same size as a video frame; then square the value at each point and divide by the maximum value (this completes the normalization of the image), yielding the pixel motion change frequency map. Fig. 2 is a schematic diagram of the pixel change probability map for "running".
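As an illustration of step S1, the following Python sketch (numpy) accumulates three-frame differences, squares the result and normalizes by the maximum. The patent does not fix the exact three-frame difference operator, so the elementwise minimum of the two adjacent absolute differences used here is an assumption, as are all function names:

```python
import numpy as np

def pixel_motion_change_frequency_map(frames):
    # Accumulate three-frame differences over the whole sequence.
    acc = np.zeros_like(frames[0], dtype=np.float64)
    for k in range(1, len(frames) - 1):
        d1 = np.abs(frames[k].astype(np.float64) - frames[k - 1])
        d2 = np.abs(frames[k + 1].astype(np.float64) - frames[k])
        # Intersection of the two adjacent differences is one common
        # form of three-frame differencing (an assumption here).
        acc += np.minimum(d1, d2)
    # Square each point, then divide by the maximum (the patent's
    # normalization), giving values in [0, 1].
    acc = acc ** 2
    return acc / (acc.max() + 1e-12)
```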
S2. Partition the pixel motion change frequency map into regions: determine the region in which the values of the map exceed the preset first threshold (here the first threshold is set to 0.03, relative to an image with values in 0 to 1); find the minimum and maximum abscissa and ordinate of the pixels in this region; use these four coordinates to define a target region; and divide it into several sub-regions along the vertical or horizontal axis according to a certain ratio.
To match the structure of the human body, the target region is here divided into three sub-regions in the ratio 3:4:8 along the horizontal axis direction, as sketched below.
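A minimal sketch of step S2 under the embodiment's values (first threshold 0.03, ratio 3:4:8). Splitting along the row axis, and the helper names, are assumptions:

```python
def target_region_and_subregions(pmap, thresh=0.03, ratios=(3, 4, 8)):
    # Pixels of the frequency map above the first threshold.
    ys, xs = np.nonzero(pmap > thresh)
    # Bounding box from the min/max row and column coordinates.
    target = pmap[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    # Split in the given ratios (3:4:8 in the embodiment); splitting
    # on rows rather than columns is an assumption.
    h = target.shape[0]
    edges = (np.cumsum((0,) + ratios) / sum(ratios) * h).round().astype(int)
    subs = [target[edges[i]:edges[i + 1], :] for i in range(len(ratios))]
    return target, subs
```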
S3. Compute pixel change histograms for the target region and the three sub-regions, as follows:
S31. Non-uniformly quantize the values of the target region in the pixel motion change frequency map into N quantization levels, where N is a natural number; to achieve the best effect, N is set to 8 in this embodiment;
S32. For the target region and the three sub-regions, sum the values in the pixel motion change frequency map corresponding to each quantization level, so that each region yields an 8-dimensional histogram;
S33. Tile the histograms of the target region and the three sub-regions into one multi-dimensional vector and normalize it to obtain the pixel change histogram. One video in this example thus finally yields one pixel change histogram; the result is shown in Fig. 3. A sketch of steps S31 to S33 follows.
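The patent fixes only N = 8 for the non-uniform quantization; the bin edges below (denser near 0, where most map values lie) are illustrative assumptions:

```python
def pixel_change_histogram(regions, n_bins=8):
    # Assumed non-uniform quantization edges for values in [0, 1].
    edges = np.array([0.05, 0.1, 0.2, 0.3, 0.45, 0.6, 0.8])
    feats = []
    for r in regions:  # target region followed by the three sub-regions
        idx = np.digitize(r.ravel(), edges)          # level index 0..7
        # Sum (not count) the map values falling in each level (S32).
        hist = np.bincount(idx, weights=r.ravel(), minlength=n_bins)
        feats.append(hist)
    v = np.concatenate(feats)                        # tile (S33)
    return v / (v.sum() + 1e-12)                     # normalize
```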
S4. Compute an edge gradient histogram for the target region and the three sub-regions. The detailed process is as follows:
S41. For the target region, compute the gradients px and py in the x and y directions respectively, compute the squared gradient magnitude and the gradient direction, and normalize the magnitude;
P = px^2 + py^2, θ = arctan(py/px). Here P denotes the squared gradient magnitude, px(x, y) the gradient map in the x direction, py(x, y) the gradient map in the y direction, and θ the gradient direction angle.
S42. Quantize the gradient direction into M quantization levels, where M is a natural number; to achieve the best effect, M is set to 8 in this embodiment. Then, for each region, sum the magnitudes in each quantization interval to obtain a histogram over the intervals; this is the first new feature;
S43. Compute the ratio of each region's area to the sum of its magnitudes, and multiply it with the histogram obtained in step S42 to obtain the second new feature;
S44. For each quantization level of each region, sum the corresponding magnitudes and at the same time count the number of magnitude values involved; the ratio of the magnitude sum to the magnitude count in each direction is the third new feature.
S45. Tile the three sub-features S42, S43 and S44 of the target region and the three sub-regions into one feature pool and normalize it to obtain the edge gradient histogram. One video in this embodiment thus finally yields one such edge gradient histogram. A sketch of steps S41 to S45 follows.
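In this sketch, using np.gradient as the derivative operator is an assumption; with the target region plus three sub-regions it yields 4 × 3 × 8 = 96 dimensions, matching the feature dimension quoted for Figs. 5 and 7:

```python
def edge_gradient_histogram(regions, m_bins=8):
    feats = []
    for r in regions:
        # S41: gradients, squared magnitude, direction.
        py, px = np.gradient(r.astype(np.float64))
        P = px ** 2 + py ** 2
        P /= (P.max() + 1e-12)                # magnitude normalization
        theta = np.arctan2(py, px)            # direction in (-pi, pi]
        b = np.minimum(((theta + np.pi) / (2 * np.pi) * m_bins).astype(int),
                       m_bins - 1)
        # S42: magnitude sum per orientation bin.
        h1 = np.bincount(b.ravel(), weights=P.ravel(), minlength=m_bins)
        # S43: scale by (region area) / (total magnitude).
        h2 = h1 * (r.size / (P.sum() + 1e-12))
        # S44: mean magnitude per bin (sum / count).
        cnt = np.bincount(b.ravel(), minlength=m_bins)
        h3 = h1 / np.maximum(cnt, 1)
        feats += [h1, h2, h3]
    v = np.concatenate(feats)                 # S45: tile into one pool
    return v / (np.linalg.norm(v) + 1e-12)    # and normalize
```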
S5. Calculate the difference edge histogram of the video sequence. The detailed process is as follows: compute the difference image d(x, y) = I_i - I_{i-1} between the current frame I_i and the previous frame I_{i-1}. If the maximum absolute value of the elements of d(x, y) exceeds the preset second threshold (here set to 7.65, relative to a 0 to 255 grayscale image), compute the edge histogram of d(x, y) using the method described in step S4. The gradient magnitude of the difference image is shown in Fig. 4 and the difference edge histogram in Fig. 5, where the abscissa is the feature dimension and the ordinate is the value of each dimension; the feature dimension here is 96. If a video has L frames, there are at most L-1 such difference edge histograms.
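A sketch of step S5 reusing the helpers above. Applying the same target/sub-region split to the difference image (to reach the 96-dimensional histogram of Fig. 5) is an assumption; note that 7.65 on a 0 to 255 image corresponds to 0.03 after scaling to 0 to 1:

```python
def difference_edge_histograms(frames, thresh=7.65):
    hists = []
    for i in range(1, len(frames)):
        # Difference image between current and previous frame.
        d = frames[i].astype(np.float64) - frames[i - 1].astype(np.float64)
        # Keep the frame pair only if the peak absolute difference
        # exceeds the second threshold (7.65 on a 0-255 image).
        if np.abs(d).max() > thresh:
            # Reuse the S4 machinery on the (scaled) difference image;
            # the region split here is an assumption.
            t, subs = target_region_and_subregions(np.abs(d) / 255.0)
            hists.append(edge_gradient_histogram([t] + subs))
    return hists  # at most L-1 histograms for an L-frame video
```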
S6. Calculate the motion history image of the video sequence and compute the edge gradient histogram of this motion history image to obtain the motion histogram. The motion history image is shown in Fig. 6. One video in this embodiment finally yields one such motion histogram, shown in Fig. 7, where the abscissa is the feature dimension and the ordinate is the value of each dimension; the feature dimension here is 96.
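A sketch of the motion history image for step S6, using the standard recurrence (moving pixels reset to tau, others decay). The difference threshold and the linear decay step are assumptions, since the patent only names the motion history image:

```python
def motion_history_image(frames, diff_thresh=30, tau=None):
    if tau is None:
        tau = len(frames) - 1
    mhi = np.zeros_like(frames[0], dtype=np.float64)
    for t in range(1, len(frames)):
        moving = np.abs(frames[t].astype(np.float64)
                        - frames[t - 1].astype(np.float64)) > diff_thresh
        # Moving pixels are stamped with tau; others decay toward 0.
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))
    return mhi / tau  # normalized to [0, 1] for the histogram step
```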
The 100 frames in steps S3 and S6 can also be divided into several sub-segments, yielding several sub pixel change histograms and sub motion histograms.
S7. Tile the pixel change histogram from step S3, the edge gradient histogram from step S4, the difference edge histogram from step S5 and the motion histogram from step S6 into one feature pool; this is the final video sequence feature. When tiling into the feature pool, every difference edge histogram shares the same pixel change histogram, edge gradient histogram and motion histogram, so one video finally yields as many features as there are difference edge histograms, i.e. at most L-1 features; this is far fewer than the number of features extracted by other natural language description methods.
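The following sketch ties the previous sketches together for step S7; splitting the motion history image with the same region helper (so the motion histogram is also 96-dimensional) is an assumption:

```python
def video_features(frames):
    # Shared per-video descriptors (steps S1-S4, S6).
    pmap = pixel_motion_change_frequency_map(frames)
    target, subs = target_region_and_subregions(pmap)
    regions = [target] + subs
    pch = pixel_change_histogram(regions)       # S3
    egh = edge_gradient_histogram(regions)      # S4
    mhi = motion_history_image(frames)          # S6
    mt, msubs = target_region_and_subregions(mhi)
    mh = edge_gradient_histogram([mt] + msubs)  # motion histogram
    # S7: every difference edge histogram shares the other three
    # descriptors, giving at most L-1 features per video.
    return [np.concatenate([pch, egh, deh, mh])
            for deh in difference_edge_histograms(frames)]
```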
After the video sequence features are obtained, feature training and recognition can be carried out. To further improve the recognition accuracy, the following steps are used for feature training and recognition:
S8. Perform online dictionary training and learning on the sample video sequence features to obtain a dictionary. For online dictionary learning see Julien Mairal et al., "Online Learning for Matrix Factorization and Sparse Coding", Journal of Machine Learning Research 11 (2010), 19-60.
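As a sketch of step S8: scikit-learn's MiniBatchDictionaryLearning implements the online algorithm of Mairal et al. (2010); the number of atoms and batch size below are assumed values:

```python
from sklearn.decomposition import MiniBatchDictionaryLearning

def train_dictionary(feature_list, n_atoms=256):
    # Stack all per-video features into one training matrix.
    X = np.vstack(feature_list)
    dl = MiniBatchDictionaryLearning(n_components=n_atoms, batch_size=64)
    dl.fit(X)
    return dl.components_  # dictionary atoms, one per row
```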
S9. Use the trained dictionary to apply k-nearest-neighbor locality-constrained linear coding to the video sequence features. For locality-constrained linear coding see Wang, Jinjun; Yang, Jianchao et al., "Locality-constrained Linear Coding for Image Classification", Computer Vision and Pattern Recognition (CVPR), 2010, 3360-3367.
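A sketch of step S9 following the approximated LLC solver of Wang et al. (2010): the code is restricted to the k nearest dictionary atoms and a small constrained least-squares problem is solved; k and beta are assumed values:

```python
def llc_encode(x, D, k=5, beta=1e-4):
    # k nearest dictionary atoms (rows of D) to the feature x.
    idx = np.argsort(np.linalg.norm(D - x, axis=1))[:k]
    z = D[idx] - x                        # shift atoms to the origin
    C = z @ z.T                           # local covariance
    C += beta * np.trace(C) * np.eye(k)   # regularization
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                          # shift-invariance constraint
    code = np.zeros(D.shape[0])
    code[idx] = w
    return code                           # sparse k-NN LLC code
```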
S10. Apply supervised distance metric learning to the coded features to obtain a Mahalanobis distance transformation matrix; use it in place of the Euclidean distance in k-means clustering to form a codebook; then compute, for the coded features of each video, the statistic histogram over the codebook. For supervised distance metric learning see Kilian Q. Weinberger, "Distance Metric Learning for Large Margin Nearest Neighbor Classification", Journal of Machine Learning Research 10 (2009), 207-244.
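A sketch of step S10, using the fact that k-means under a Mahalanobis distance defined by a transformation matrix L is equivalent to Euclidean k-means on L-transformed data. Learning L itself (e.g. by LMNN, per Weinberger's paper) is outside this sketch, and the codebook size is an assumption:

```python
from sklearn.cluster import KMeans

def codebook_histograms(codes_per_video, L, n_words=64):
    # codes_per_video: one list of LLC code vectors per video.
    all_codes = np.vstack([np.vstack(c) for c in codes_per_video])
    # Euclidean k-means in the L-transformed space = Mahalanobis k-means.
    km = KMeans(n_clusters=n_words, n_init=10).fit(all_codes @ L.T)
    hists = []
    for codes in codes_per_video:
        words = km.predict(np.vstack(codes) @ L.T)
        hists.append(np.bincount(words, minlength=n_words).astype(float))
    return km, hists  # codebook and per-video statistic histograms
```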
S11. Classify the statistic histograms with the tf_idf classifier to obtain the final recognition result, shown in Fig. 8. For the tf_idf classifier see Salton, G. and Buckley, C., "Term-weighting approaches in automatic text retrieval", Information Processing & Management 24(5) (1988), 513-523.
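A sketch of step S11: Salton-Buckley term weighting applied to the codeword statistic histograms, with cosine-similarity nearest-neighbor assignment. The patent does not spell out the exact tf_idf classifier variant, so this is one plausible reading:

```python
def tfidf_classify(train_hists, train_labels, test_hist):
    H = np.vstack(train_hists)
    # Inverse document frequency per codeword over the training videos.
    idf = np.log(len(train_hists) / ((H > 0).sum(axis=0) + 1.0))
    W = H / (H.sum(axis=1, keepdims=True) + 1e-12) * idf   # tf * idf
    q = test_hist / (test_hist.sum() + 1e-12) * idf
    # Cosine similarity to every training video; return nearest label.
    sims = (W @ q) / (np.linalg.norm(W, axis=1) * np.linalg.norm(q) + 1e-12)
    return train_labels[int(np.argmax(sims))]
```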
The recognition results show that the method is effective for human action recognition: it identifies not only simple normal behaviors such as running and walking, but also the suspicious behavior of stretching out a hand (a gesture that frequently occurs in theft) and the complex behavior of fighting. In Fig. 8 the diagonal entries are the correct recognition rates and the off-diagonal entries are the misclassification rates; for example, the fourth row shows that the correct recognition rate for walking is 91%, with 3% misclassified as fighting and 6% misclassified as stretching out a hand.
The method executes quickly. With a simulation program developed in matlab2009a, running on a PC platform with a dual-core 2.5 GHz CPU and 2 GB of memory, recognizing 240*320-pixel video takes 7 to 9 s per 100 frames. If the program were rewritten in C under the VC environment, real-time performance could be achieved.
By calculating the difference edge histogram of the video sequence, the method of the present invention greatly reduces the number of video features used, increases the recognition speed and satisfies the real-time requirement of human action recognition; by computing pixel change histograms and edge gradient histograms for the target region and several sub-regions separately, it improves the accuracy of recognizing action details.
Those of ordinary skill in the art will appreciate that the embodiment described here is intended to help the reader understand the principles of the present invention, and that the scope of protection of the invention is not limited to this particular statement and embodiment. Those of ordinary skill in the art can, based on the technical teachings disclosed herein, make various other modifications and combinations that do not depart from the essence of the invention, and such modifications and combinations remain within the scope of protection of the invention.

Claims (6)

1. A method for recognizing human action in a video sequence, comprising two processes, feature extraction and feature training and recognition, characterized in that feature extraction comprises the following steps:
S1. calculating the pixel motion change frequency map of the video sequence, the detailed process being as follows: accumulating the three-frame difference results over the video to obtain an image the same size as a video frame, then squaring the value at each point and dividing by the maximum value, the result being the pixel motion change frequency map;
S2. partitioning the pixel motion change frequency map into regions: determining the region in which the values of the map exceed a preset first threshold, finding the minimum and maximum abscissa and ordinate of the pixels in this region, using these four coordinates to define a target region, and dividing the target region into several sub-regions along the vertical or horizontal axis according to a certain ratio;
S3. computing a pixel change histogram for the target region and for each of the sub-regions, as follows:
S31. non-uniformly quantizing the values of the target region in the pixel motion change frequency map into N quantization levels;
S32. for the target region and each sub-region, summing the values in the pixel motion change frequency map corresponding to each quantization level, so that each region yields an N-dimensional histogram;
S33. tiling the histograms of the target region and the sub-regions into one multi-dimensional vector and normalizing it to obtain the pixel change histogram;
S4. computing an edge gradient histogram for the target region and each of the sub-regions;
S5. calculating the difference edge histogram of the video sequence, the detailed process being as follows: computing the difference image between the current frame and the previous frame, and, if the maximum absolute value of the elements of the difference image exceeds a preset second threshold, computing the edge histogram of the difference image to obtain the difference edge histogram;
S6. calculating the motion history image of the video sequence and computing the edge gradient histogram of this motion history image to obtain the motion histogram;
S7. tiling the pixel change histogram from step S3, the edge gradient histogram from step S4, the difference edge histogram from step S5 and the motion histogram from step S6 into one feature pool, this being the final video sequence feature.
2. The method for recognizing human action in a video sequence according to claim 1, characterized in that the feature training and recognition specifically comprises the following steps:
S8. performing online dictionary training and learning on the video sequence features to obtain a dictionary;
S9. using the dictionary to apply k-nearest-neighbor locality-constrained linear coding to the video sequence features;
S10. applying supervised distance metric learning to the coded features to obtain a Mahalanobis distance transformation matrix, using it in place of the Euclidean distance in k-means clustering to form a codebook, and then computing, for the coded features of each video, the statistic histogram over the codebook;
S11. classifying the statistic histograms with the tf_idf classifier to obtain the final recognition result.
3. The method for recognizing human action in a video sequence according to claim 1 or 2, characterized in that the detailed process by which step S4 computes the edge gradient histogram is as follows:
S41. for the target region, computing the gradients px and py in the x and y directions respectively, computing the squared gradient magnitude and the gradient direction, and normalizing the magnitude;
S42. quantizing the gradient direction into M quantization levels, and for each region summing the magnitudes in each quantization interval to obtain a histogram over the intervals, this being the first new feature;
S43. computing the ratio of each region's area to the sum of its magnitudes, and multiplying it with the histogram obtained in step S42 to obtain the second new feature;
S44. for each quantization level of each region, summing the corresponding magnitudes and counting the number of magnitude values involved, the ratio of the magnitude sum to the magnitude count in each direction being the third new feature.
4. The method for recognizing human action in a video sequence according to claim 1 or 2, characterized in that dividing several sub-regions along the vertical or horizontal axis according to a certain ratio in step S2 means dividing into three sub-regions in the ratio 3:4:8 along the horizontal axis direction.
5. The method for recognizing human action in a video sequence according to claim 1 or 2, characterized in that N in steps S31 and S32 is 8.
6. The method for recognizing human action in a video sequence according to claim 3, characterized in that M in step S42 is 8.
CN 201110109440 2011-04-29 2011-04-29 Method for recognizing human motion in video sequence Expired - Fee Related CN102136066B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110109440 CN102136066B (en) 2011-04-29 2011-04-29 Method for recognizing human motion in video sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201110109440 CN102136066B (en) 2011-04-29 2011-04-29 Method for recognizing human motion in video sequence

Publications (2)

Publication Number Publication Date
CN102136066A CN102136066A (en) 2011-07-27
CN102136066B true CN102136066B (en) 2013-04-03

Family

ID=44295848

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110109440 Expired - Fee Related CN102136066B (en) 2011-04-29 2011-04-29 Method for recognizing human motion in video sequence

Country Status (1)

Country Link
CN (1) CN102136066B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102592112B (en) * 2011-12-20 2014-01-29 四川长虹电器股份有限公司 Method for determining gesture moving direction based on hidden Markov model
CN103077383B (en) * 2013-01-09 2015-12-09 西安电子科技大学 Based on the human motion identification method of the Divisional of spatio-temporal gradient feature
JP6052997B2 (en) * 2013-02-28 2016-12-27 株式会社Pfu Overhead scanner device, image acquisition method, and program
CN103473544A (en) * 2013-04-28 2013-12-25 南京理工大学 Robust human body feature rapid extraction method
CN104200203B (en) * 2014-08-30 2017-07-11 电子科技大学 A kind of human action detection method based on action dictionary learning
CN105184325B (en) * 2015-09-23 2021-02-23 歌尔股份有限公司 Mobile intelligent terminal
US10339371B2 (en) 2015-09-23 2019-07-02 Goertek Inc. Method for recognizing a human motion, method for recognizing a user action and smart terminal
CN105938544B (en) * 2016-04-05 2020-05-19 大连理工大学 Behavior recognition method based on comprehensive linear classifier and analytic dictionary
CN106022310B (en) * 2016-06-14 2021-08-17 湖南大学 Human body behavior identification method based on HTG-HOG and STG characteristics
CN106295532B (en) * 2016-08-01 2019-09-24 河海大学 A kind of human motion recognition method in video image
CN106599882A (en) * 2017-01-07 2017-04-26 武克易 Body sensing motion identification device
CN108197589B (en) * 2018-01-19 2019-05-31 北京儒博科技有限公司 Semantic understanding method, apparatus, equipment and the storage medium of dynamic human body posture
CN108597223B (en) * 2018-04-28 2021-05-07 北京智行者科技有限公司 Data processing method and system for intelligent vehicle behavior description
CN112749658A (en) * 2020-04-30 2021-05-04 杨九妹 Pedestrian behavior analysis method and system for big data financial security system and robot

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6226388B1 (en) * 1999-01-05 2001-05-01 Sharp Labs Of America, Inc. Method and apparatus for object tracking for automatic controls in video devices
US6647131B1 (en) * 1999-08-27 2003-11-11 Intel Corporation Motion detection using normal optical flow
CN101853388A (en) * 2009-04-01 2010-10-06 中国科学院自动化研究所 Unchanged view angle behavior identification method based on geometric invariable
CN101866429A (en) * 2010-06-01 2010-10-20 中国科学院计算技术研究所 Training method of multi-moving object action identification and multi-moving object action identification method
CN101894276A (en) * 2010-06-01 2010-11-24 中国科学院计算技术研究所 Training method of human action recognition and recognition method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6226388B1 (en) * 1999-01-05 2001-05-01 Sharp Labs Of America, Inc. Method and apparatus for object tracking for automatic controls in video devices
US6647131B1 (en) * 1999-08-27 2003-11-11 Intel Corporation Motion detection using normal optical flow
CN101853388A (en) * 2009-04-01 2010-10-06 中国科学院自动化研究所 Unchanged view angle behavior identification method based on geometric invariable
CN101866429A (en) * 2010-06-01 2010-10-20 中国科学院计算技术研究所 Training method of multi-moving object action identification and multi-moving object action identification method
CN101894276A (en) * 2010-06-01 2010-11-24 中国科学院计算技术研究所 Training method of human action recognition and recognition method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yao-Hui Qin; Hong-Liang Li; Guang-Hui Liu; Zheng-Ning Wang. Human action recognition using PEM histogram. IEEE 2010 International Conference on Computational Problem-Solving (ICCP), 2010, 323-325. *

Also Published As

Publication number Publication date
CN102136066A (en) 2011-07-27

Similar Documents

Publication Publication Date Title
CN102136066B (en) Method for recognizing human motion in video sequence
CN110147743B (en) Real-time online pedestrian analysis and counting system and method under complex scene
WO2021017606A1 (en) Video processing method and apparatus, and electronic device and storage medium
CN105022835B (en) A kind of intelligent perception big data public safety recognition methods and system
CN101894276B (en) Training method of human action recognition and recognition method
CN110781838A (en) Multi-modal trajectory prediction method for pedestrian in complex scene
CN110120064B (en) Depth-related target tracking algorithm based on mutual reinforcement and multi-attention mechanism learning
CN105574510A (en) Gait identification method and device
CN105528794A (en) Moving object detection method based on Gaussian mixture model and superpixel segmentation
CN107273835A (en) Act of violence intelligent detecting method based on video analysis
Petkovic et al. Recognizing Strokes in Tennis Videos using Hidden Markov Models.
CN105654139A (en) Real-time online multi-target tracking method adopting temporal dynamic appearance model
CN103034860A (en) Scale-invariant feature transform (SIFT) based illegal building detection method
CN109902564A (en) A kind of accident detection method based on the sparse autoencoder network of structural similarity
Yang et al. A method of pedestrians counting based on deep learning
Wang et al. View-robust action recognition based on temporal self-similarities and dynamic time warping
Qin et al. Application of video scene semantic recognition technology in smart video
Leyva et al. Video anomaly detection based on wake motion descriptors and perspective grids
Abhishek et al. Human Verification over Activity Analysis via Deep Data Mining
Fan et al. Dynamic textures clustering using a hierarchical pitman-yor process mixture of dirichlet distributions
Umakanthan et al. Multiple instance dictionary learning for activity representation
Huang et al. View-independent behavior analysis
Yang et al. MediaCCNY at TRECVID 2012: Surveillance Event Detection.
Li et al. Human action recognition using spatio-temoporal descriptor
Nabi et al. Abnormal event recognition in crowd environments

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130403

Termination date: 20160429