CN103886293A - Human body behavior recognition method based on history motion graph and R transformation - Google Patents

Human body behavior recognition method based on history motion graph and R transformation

Info

Publication number
CN103886293A
CN103886293A (application CN201410106957.4A)
Authority
CN
China
Prior art keywords
pixel
depth video
bounding rectangle
depth video clip
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410106957.4A
Other languages
Chinese (zh)
Other versions
CN103886293B (en)
Inventor
肖俊
李潘
庄越挺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201410106957.4A priority Critical patent/CN103886293B/en
Publication of CN103886293A publication Critical patent/CN103886293A/en
Application granted granted Critical
Publication of CN103886293B publication Critical patent/CN103886293B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a human body behavior recognition method based on the motion history image and the R transform. The method takes depth video as its recognition basis. First, the minimum enclosing rectangle of the human motion is computed with a foreground segmentation technique; motion history images are then extracted within the depth video region bounded by that rectangle, and a motion intensity constraint is applied to them to obtain a motion energy map; finally, the R transform of the energy map yields the feature vector used for behavior recognition. A support vector machine is used for both training and recognition. Preprocessing with the minimum enclosing rectangle of the human motion accelerates behavior feature extraction; the motion history image sequences reduce the influence of noise in the depth maps; and extracting features by applying the R transform to the energy map keeps the computation fast.

Description

Human body behavior recognition method based on the motion history image and the R transform
Technical field
The present invention relates to the fields of computer vision and image processing, and in particular to a human body behavior recognition method based on the motion history image and the R transform.
Background art
Video surveillance is a focus of current computer vision research and an important practical problem. Fields such as security and human-computer interaction continuously produce enormous amounts of video data, easily measured in gigabytes, and analyzing it purely by hand would consume a huge amount of labor. Video content is rich, yet most of the time only certain parts of it are of interest, such as human behavior; if these could be recognized automatically and efficiently, a large amount of manpower could be freed. Existing research on behavior recognition concentrates mainly on RGB video.
RGB video is the most common form of video: sources are abundant and years of research results exist. Current behavior recognition methods based on RGB video fall mainly into three categories: space-time approaches, sequential approaches and hierarchical approaches. After years of development, however, the research bottlenecks of RGB-based human behavior recognition have become increasingly apparent, because background interference is hard to remove when RGB video is used as the data source. More importantly, RGB video carries only two-dimensional plane information, and describing three-dimensional human behavior with two-dimensional information obviously loses many key cues.
With the progress of technology, inexpensive cameras equipped with depth sensors, such as Microsoft's Kinect, have appeared in recent years. The Kinect can capture depth information of acceptable quality alongside the ordinary RGB image, and its built-in skeleton-learning algorithm can recover the skeleton of a normal human body in a three-dimensional scene. Feature extraction on depth maps currently still borrows heavily from earlier experience with RGB features, and several public data sets have greatly eased research on depth-map features. Zicheng Liu et al. proposed a method based on three-dimensional contours (a bag of 3D words): the depth map is treated as three-dimensional data and projected onto three directions of Cartesian space to obtain projected contours, from which a fixed number of points are down-sampled as features and fed into an Action Graph model for recognition. Bingbing Ni independently collected a depth data set called RGBD-HuDaAct and was the first to apply the idea of 3D-MHIs to feature extraction from depth map sequences. Each of these methods has its own limitation: the bag-of-3D-words method achieves high recognition accuracy, but because it requires uniform sampling on the human contour it needs very clean depth data and cannot be used for behavior recognition in real scenes; applying 3D-MHIs directly is fast enough, but its recognition accuracy is insufficient; DMM-HOG is relatively effective against complex backgrounds while maintaining accuracy, but it is too time-consuming to achieve real-time human behavior recognition.
Summary of the invention
The present invention addresses the deficiencies of the prior art by proposing a human body behavior recognition method based on the motion history image and the R transform. The method uses depth video as its recognition basis, applies the concepts of the motion history image and the R transform to the behavior feature extraction process, and uses a support vector machine for the training and recognition stages of behavior recognition.
The method comprises an off-line training stage and an on-line recognition stage; the concrete steps are as follows:
Step (1). Off-line training stage
The objective of the off-line training stage is to obtain a human behavior recognition model; its steps are as follows:
Step 1-1. The depth video S to be trained is cut into multiple depth video clips of equal time length, and each clip is given a behavior label according to its behavior class, yielding the training set T for human behavior recognition.
The training set T is the set of depth video clips with their behavior labels;
The time length is the time length of the clip to be recognized as defined in the on-line recognition stage;
Step 1-2. A foreground segmentation technique is used to obtain the minimum enclosing rectangle of the human motion in each depth video clip, and the video content bounded by the minimum enclosing rectangle in the clip is scaled to a unified size.
The foreground segmentation technique operates as follows:
a) For a given depth video clip V of the training set T, consisting of depth frames {P_1, P_2, ..., P_i}, where i indexes the i-th depth frame: the pixels of any depth frame P_i are clustered into two classes by k-means according to the depth value at each pixel position, yielding a foreground pixel set and a background pixel set; the foreground pixels are those whose mean depth value is smaller than that of the background pixels.
b) On depth frame P_i, find a rectangle R_i that contains all foreground pixels obtained in step a). R_i is defined by R_i^left, R_i^right, R_i^up and R_i^down, the pixel coordinates of its left, right, upper and lower boundaries respectively. R_i is then split horizontally into two halves of equal width. If the left half of R_i contains more foreground pixels than the right half, and if after moving R_i^right left by K pixels (K is a constant that can be tuned to the application scene) the new rectangle still contains more than η% (50 < η < 100, tunable to the application scene) of the pixels of the original rectangle R_i, then R_i^right is shifted left by K pixels; as soon as a shift would leave the new rectangle with fewer than η% of the pixels of the original rectangle R_i, the right-boundary adjustment is complete. If the right half of R_i contains more foreground pixels than the left half, and if after moving R_i^left right by K pixels the new rectangle still contains more than η% of the pixels of the original rectangle R_i, then R_i^left is shifted right by K pixels; as soon as a shift would leave the new rectangle with fewer than η% of the pixels of the original rectangle R_i, the left-boundary adjustment is complete. If the numbers of pixels in the left and right halves of R_i differ by no more than ε (a threshold parameter), check whether the rectangle obtained by moving the left and right boundaries K/2 pixels each towards the centre still contains more than η% of all pixels of the original rectangle R_i; if so, shrink R_i by K/2 pixels on each of the left and right boundaries and repeat step b), until the pixels remaining in the new rectangle drop below η% of all pixels of the original rectangle R_i. The upper and lower boundaries of R_i are adjusted in the same way.
c) The depth video clip V is a volume described by three dimensions: the horizontal coordinate x, the vertical coordinate y and the time coordinate t. After the adjustment of step b), the foreground pixels of every frame P_i of V have been separated out, and their extent is described by R_i. The four boundaries of the minimum enclosing rectangle R of the human behavior in the depth video S, namely the upper boundary R_up, the lower boundary R_down, the left boundary R_left and the right boundary R_right, are computed according to formula (1):
R_up = min_i R_i^up,   R_down = max_i R_i^down,   R_left = min_i R_i^left,   R_right = max_i R_i^right        formula (1);
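As a reading aid (not part of the patent text), the following Python sketch illustrates the per-frame binary k-means foreground segmentation of step a) and the combination of the per-frame rectangles into the clip-level rectangle of formula (1); the iterative boundary shrinking governed by K and η is omitted, and the helper names `segment_frame` and `clip_bounding_box` are our own.

```python
import numpy as np
from sklearn.cluster import KMeans

def segment_frame(depth_frame):
    """Binary k-means segmentation of one depth frame (step a).

    Pixels are clustered by depth value; the cluster with the smaller mean
    depth is taken as foreground.  Returns the tight rectangle
    (left, right, up, down) around the foreground pixels, i.e. the initial
    R_i of step b) before any boundary adjustment.
    """
    h, w = depth_frame.shape
    depths = depth_frame.reshape(-1, 1).astype(np.float64)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(depths)
    fg_label = np.argmin([depths[labels == k].mean() for k in (0, 1)])
    fg_mask = (labels == fg_label).reshape(h, w)
    ys, xs = np.nonzero(fg_mask)
    return xs.min(), xs.max(), ys.min(), ys.max()

def clip_bounding_box(frames):
    """Minimum enclosing rectangle of a whole depth clip, per formula (1)."""
    boxes = np.array([segment_frame(f) for f in frames])
    return boxes[:, 0].min(), boxes[:, 1].max(), boxes[:, 2].min(), boxes[:, 3].max()
```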
Step 1-3. For a subsequence S_j of the depth video clip V that starts at moment j and has an arbitrary time window length τ, a motion history image MHI_τ^I can be obtained, computed as follows:
MHI_τ^I(x, y, t) = τ                                  if |I(x, y, t) − I(x, y, t−1)| > δ_I^th,
MHI_τ^I(x, y, t) = max(0, MHI_τ^I(x, y, t−1) − 1)     otherwise        formula (2);
where I(x, y, t) is the depth value captured at pixel (x, y) at moment t; t ranges over [j, j+τ−1]; δ_I^th is a constant threshold; and j and τ are natural numbers.
The present invention takes three time window lengths τ_s, τ_m and τ_l and obtains the corresponding motion history images, where s, m and l are natural numbers, m = 2s, l = 4s, and s is proportional to the time length of the depth video clip V.
After the processing of step 1-3 the depth video clip has been converted into motion history image sequences: for each window length o = s, m, l, the motion history images obtained above, extended along the time dimension, form a motion history image sequence of the clip V denoted MHIs_o.
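A minimal sketch of the motion history image recurrence of formula (2), assuming the clip is available as a (T, H, W) NumPy array of depth values; the function and argument names below are ours, not the patent's.

```python
import numpy as np

def motion_history_images(clip, tau, delta_th):
    """Motion history image recurrence of formula (2).

    clip: array of shape (T, H, W) holding the depth values I(x, y, t).
    A pixel whose depth change between consecutive frames exceeds delta_th
    is set to tau; otherwise its previous MHI value decays by 1 (floored at 0).
    Returns an array of shape (T, H, W): the MHI at every frame, i.e. the
    motion history image sequence for one window length.
    """
    clip = clip.astype(np.float32)
    mhi = np.zeros_like(clip)
    for t in range(1, clip.shape[0]):
        moved = np.abs(clip[t] - clip[t - 1]) > delta_th
        mhi[t] = np.where(moved, tau, np.maximum(mhi[t - 1] - 1.0, 0.0))
    return mhi
```

Running this once per window length τ_s, τ_m and τ_l yields the three sequences MHIs_s, MHIs_m and MHIs_l used in the following steps.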
Step 1-4. For any motion history image sequence MHIs_o obtained in step 1-3, let H_o(x, y, t) denote the intensity at pixel (x, y) of the t-th frame of MHIs_o. To exclude the interference of noise in the depth maps, a further intensity constraint is applied to MHIs_o: an energy map D_o is computed from MHIs_o, where the value D_o(x, y) at each position (x, y) is given by formula (3):
D_o(x, y) = Σ_{i=1..N−1} μ( |H_o(x, y, i+1) − H_o(x, y, i)| − ε )        formula (3);
where μ(θ) is the unit step function, equal to 1 when θ ≥ 0 and to 0 when θ < 0; ε is a threshold constant that can be tuned to the application scene; and N is the time length of the depth video clip V.
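Formula (3) simply counts, per pixel, how many consecutive-frame MHI changes reach the threshold ε. A direct vectorized sketch (our naming) is:

```python
import numpy as np

def energy_map(mhi_seq, eps):
    """Energy map D_o of formula (3).

    mhi_seq: array of shape (N, H, W), the motion history image sequence MHIs_o
    with H_o(x, y, t) stored at mhi_seq[t].  For every pixel, counts how many
    consecutive-frame intensity changes are at least eps (the unit step mu
    applied to |difference| - eps).
    """
    diffs = np.abs(np.diff(mhi_seq, axis=0))      # shape (N-1, H, W)
    return (diffs >= eps).sum(axis=0).astype(np.float32)
```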
Step 1-5. For each energy map D_o, compute its R transform to obtain the behavior feature of the depth video clip V, as follows:
First compute the Radon transform of the energy map D_o according to formula (4):
p_o(ρ, θ) = ∫∫ D_o(x, y) δ(x·cosθ + y·sinθ − ρ) dx dy        formula (4);
Then integrate its square over the whole range of ρ to obtain the R transform, formula (5):
R_o(θ) = ∫ p_o²(ρ, θ) dρ        formula (5);
To prevent scale effects, R_o(θ) is normalized; the normalized R transforms obtained for the three window lengths o = s, m, l are concatenated to form the behavior feature of the depth video clip V.
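The Radon and R-transform step can be sketched with scikit-image's `radon` function; the choice of library, the number of sampled angles and the max-normalization are our assumptions, since the patent only states that the R transform is computed and then normalized against scale.

```python
import numpy as np
from skimage.transform import radon

def r_transform_feature(energy, n_angles=180):
    """Normalized R transform of one energy map.

    Formula (4): Radon transform of D_o; formula (5): integral of its square
    over rho, leaving a 1-D function of theta; finally a max-normalization
    (one common choice) to remove scale effects.
    """
    thetas = np.linspace(0.0, 180.0, n_angles, endpoint=False)
    sinogram = radon(energy, theta=thetas, circle=False)   # rows: rho, columns: theta
    r = (sinogram ** 2).sum(axis=0)                        # integrate over rho
    return r / (r.max() + 1e-12)

def clip_feature(energy_maps):
    """Concatenate the normalized R transforms of the three energy maps
    (window lengths s, m, l) into the behavior feature of the clip."""
    return np.concatenate([r_transform_feature(e) for e in energy_maps])
```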
Step 1-6. From the behavior features of the depth video clips and the behavior labels obtained in step 1-1, a recognition model M is trained with a support vector machine.
Step (2). On-line recognition stage
The objective of the on-line recognition stage is to perform behavior recognition with the recognition model M obtained in the off-line training stage; its steps are as follows:
Step 2-1. The behavior feature of the video to be recognized is extracted with the same method as steps 1-1 to 1-6 of the off-line training stage.
The recognition granularity of the on-line recognition stage is kept consistent with that used during off-line training.
Step 2-2. Based on the behavior feature of the video to be recognized, the support vector machine with the trained model M is used to recognize the behavior in that video.
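Training (step 1-6) and recognition (step 2-2) use a standard support vector machine. A hedged sketch with scikit-learn follows; the kernel and regularization constant are illustrative choices, not values from the patent.

```python
from sklearn.svm import SVC

def train_model(features, labels):
    """Off-line stage (step 1-6): fit a multi-class SVM on clip features and behavior labels."""
    model = SVC(kernel="rbf", C=1.0)   # kernel and C are illustrative, not from the patent
    model.fit(features, labels)
    return model

def recognize(model, feature):
    """On-line stage (step 2-2): predict the behavior label of one clip feature."""
    return model.predict([feature])[0]
```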
Compared with traditional human behavior recognition methods, the proposed method has the following beneficial effects:
1. The preprocessing step of computing the minimum enclosing rectangle of the human motion, used during feature extraction in both the off-line training stage and the on-line recognition stage, accelerates behavior feature extraction and at the same time excludes interference from complex backgrounds.
2. The motion history image sequences preserve the key information of the human motion. Because depth maps naturally carry three-dimensional motion information, they describe the human body better than RGB-based behavior recognition does, so the retained key information is correspondingly more descriptive; the subsequent intensity constraint along the time dimension reduces the influence of noise in the depth maps.
3. The final feature extraction step applies the R transform to the energy map. The energy map fully captures the intensity and contour information of the motion and is a good refinement and description of the original behavior, while the R transform retains the advantage of fast computation; the method can therefore perform behavior recognition in real time while maintaining recognition accuracy.
Based on these three characteristics, the invention provides a fast and effective human behavior feature and a human behavior recognition method based on it.
Brief description of the drawings
Fig. 1 is the flow chart of the behavior feature extraction process of the method, in which panel (a) shows the concrete flow and panel (b) shows the image preview corresponding to panel (a);
Fig. 2 is the overall flow chart of the method.
Detailed description of the embodiments
The present invention is further described below with reference to the drawings and a specific embodiment.
As shown in Fig. 1 and Fig. 2, the present invention comprises an off-line training stage and an on-line recognition stage.
Step (1). Off-line training stage
The objective of the off-line training stage is to obtain a human behavior recognition model; its steps are as follows:
Step 1-1. The depth video S to be trained is cut into multiple depth video clips of equal time length, and each clip is given a behavior label according to its behavior class, yielding the training set T for human behavior recognition.
The time length is the time length of the clip to be recognized as defined in the on-line recognition stage;
Step 1-2. A foreground segmentation technique is used to obtain the minimum enclosing rectangle of the human motion in each depth video clip, and the video content bounded by the minimum enclosing rectangle in the clip is scaled to a unified size of 320x240.
The foreground segmentation technique is described as follows:
a) For a given depth video clip V of the training set T, consisting of depth frames {P_1, P_2, ..., P_i}, where i is a natural number: the pixels of any depth frame P_i are clustered into two classes by k-means according to the depth value at each pixel position, yielding two sets that contain the foreground pixels and the background pixels respectively; the foreground pixels are those whose mean depth value is smaller than that of the background pixels.
b) On depth frame P_i, find a rectangle R_i that contains all foreground pixels obtained in step a). R_i is defined by R_i^left, R_i^right, R_i^up and R_i^down, the pixel coordinates of its left, right, upper and lower boundaries respectively. R_i is then split horizontally into two halves of equal width. If the left half of R_i contains more foreground pixels than the right half, and if after moving R_i^right left by K pixels (K is a constant that can be tuned to the application scene) the new rectangle still contains more than 90% (a recommended value that can be tuned to the application scene) of the pixels of the original rectangle R_i, then R_i^right is shifted left by K pixels; as soon as a shift would leave the rectangle with fewer than 90% of the pixels of the original rectangle R_i, the right-boundary adjustment is complete. If the right half of R_i contains more foreground pixels than the left half, and if after moving R_i^left right by K pixels the new rectangle still contains more than 90% of the pixels of the original rectangle R_i, then R_i^left is shifted right by K pixels; as soon as a shift would leave the rectangle with fewer than 90% of the pixels of the original rectangle R_i, the left-boundary adjustment is complete. If the numbers of pixels in the left and right halves of R_i differ by no more than ε (a threshold parameter), check whether the rectangle obtained by moving the left and right boundaries K/2 pixels each towards the centre still contains more than 90% of all pixels of the original rectangle R_i; if so, shrink R_i by K/2 pixels on each of the left and right boundaries and repeat step b), until the pixels remaining in the new rectangle drop below 90% of all pixels of the original rectangle R_i. The upper and lower boundaries of R_i are adjusted in the same way.
c) The depth video clip V is a volume described by three dimensions: the horizontal coordinate x, the vertical coordinate y and the time coordinate t. After step b), the foreground pixels of every frame P_i of V have been separated out, and their extent is described by R_i. The four boundaries of the minimum enclosing rectangle R of the human behavior in the depth video S, namely the upper boundary R_up, the lower boundary R_down, the left boundary R_left and the right boundary R_right, are computed according to formula (1):
R_up = min_i R_i^up,   R_down = max_i R_i^down,   R_left = min_i R_i^left,   R_right = max_i R_i^right        formula (1);
Step 1-3. For a subsequence S_j of the depth video clip V that starts at moment j and has an arbitrary time window length τ, a motion history image MHI_τ^I can be obtained, computed as follows:
MHI_τ^I(x, y, t) = τ                                  if |I(x, y, t) − I(x, y, t−1)| > δ_I^th,
MHI_τ^I(x, y, t) = max(0, MHI_τ^I(x, y, t−1) − 1)     otherwise        formula (2);
where I(x, y, t) is the depth value captured at pixel (x, y) at moment t; t ranges over [j, j+τ−1]; δ_I^th is a constant threshold; and j and τ are natural numbers.
Starting from any moment t, the present invention takes the consecutive time window lengths τ_s = 4, τ_m = 8 and τ_l = 16 and obtains the corresponding motion history image sequences, where s, m and l are natural numbers, m = 2s, l = 4s, and s is proportional to the time length of the depth video clip V.
After the processing of step 1-3 the depth video clip has been converted into motion history image sequences: for each window length o = s, m, l, the motion history images obtained above, extended along the time dimension, form a motion history image sequence of the clip V denoted MHIs_o.
Step 1-4. For any motion history image sequence MHIs_o obtained in step 1-3, where o = s, m, l, let H_o(x, y, t) denote the intensity at pixel (x, y) of the t-th frame of MHIs_o. To exclude the interference of noise in the depth maps, a further intensity constraint is applied to MHIs_o: an energy map D_o is computed from MHIs_o, where the value D_o(x, y) at each position (x, y) is given by formula (3):
D_o(x, y) = Σ_{i=1..N−1} μ( |H_o(x, y, i+1) − H_o(x, y, i)| − ε )        formula (3);
where μ(θ) is the unit step function, equal to 1 when θ ≥ 0 and to 0 when θ < 0; ε is a threshold constant that can be tuned to the application scene; and N is the time length of the depth video clip V.
Step 1-5. For each energy map D_o, compute its R transform to obtain the behavior feature of the depth video clip V, as follows:
First compute the Radon transform of the energy map D_o according to formula (4):
p_o(ρ, θ) = ∫∫ D_o(x, y) δ(x·cosθ + y·sinθ − ρ) dx dy        formula (4);
Then integrate its square over the whole range of ρ to obtain the R transform, formula (5):
R_o(θ) = ∫ p_o²(ρ, θ) dρ        formula (5);
To prevent scale effects, R_o(θ) is normalized; the normalized R transforms obtained for the three window lengths are concatenated to form the behavior feature of the depth video clip V.
Step 1-6. From the behavior features of the depth video clips and the behavior labels obtained in step 1-1, a recognition model M is trained with a support vector machine.
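Tying the embodiment's concrete settings together (crop to the minimum enclosing rectangle, rescale to 320x240, window lengths 4, 8 and 16), a possible end-to-end feature extractor, reusing the helper sketches above and OpenCV's `cv2.resize` as an assumed tool, could look like this:

```python
import cv2
import numpy as np

def extract_clip_feature(clip, delta_th, eps):
    """Feature extraction for one depth clip, following steps 1-2 to 1-5 of the
    embodiment: crop to the clip's minimum enclosing rectangle, rescale each
    frame to 320x240, build MHIs for window lengths 4, 8 and 16, convert them
    to energy maps, and concatenate their normalized R transforms.
    delta_th and eps are the thresholds of formulas (2) and (3); their values
    are application dependent.
    """
    left, right, up, down = clip_bounding_box(clip)                    # step 1-2
    cropped = np.stack([cv2.resize(f[up:down + 1, left:right + 1].astype(np.float32),
                                   (320, 240)) for f in clip])
    energies = [energy_map(motion_history_images(cropped, tau, delta_th), eps)
                for tau in (4, 8, 16)]                                 # steps 1-3 and 1-4
    return clip_feature(energies)                                      # step 1-5
```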
Step (2). On-line recognition stage
The objective of the on-line recognition stage is to perform behavior recognition with the recognition model M obtained in the off-line training stage; its steps are as follows:
Step 2-1. The behavior feature of the video to be recognized is extracted with the same method as steps 1-1 to 1-5 of the off-line training stage.
The recognition granularity of the on-line recognition stage is kept consistent with that used during off-line training.
Step 2-2. Based on the behavior feature of the video to be recognized, the support vector machine with the trained model M is used to recognize the behavior in that video.
The above embodiment does not limit the present invention, and the present invention is not restricted to it; anything that satisfies the requirements of the present invention falls within its protection scope.

Claims (1)

1. A human body behavior recognition method based on the motion history image and the R transform, characterized in that the method comprises an off-line training stage and an on-line recognition stage, with the following concrete steps:
Step (1). Off-line training stage:
Step 1-1. The depth video S to be trained is cut into multiple depth video clips of equal time length, and each clip is given a behavior label according to its behavior class, yielding the training set T for human behavior recognition;
The training set T is the set of depth video clips with their behavior labels;
Step 1-2. A foreground segmentation technique is used to obtain the minimum enclosing rectangle of the human motion in each depth video clip, and the video content bounded by the minimum enclosing rectangle in the clip is scaled to a unified size;
The foreground segmentation technique operates as follows:
a) For a given depth video clip V of the training set T, consisting of depth frames {P_1, P_2, ..., P_i}, where i indexes the i-th depth frame: the pixels of any depth frame P_i are clustered into two classes by k-means according to the depth value at each pixel position, yielding a foreground pixel set and a background pixel set; the foreground pixels are those whose mean depth value is smaller than that of the background pixels;
b) On depth frame P_i, find a rectangle R_i that contains all foreground pixels obtained in step a); R_i is defined by R_i^left, R_i^right, R_i^up and R_i^down, the pixel coordinates of its left, right, upper and lower boundaries respectively; R_i is then split horizontally into two halves of equal width; if the left half of R_i contains more foreground pixels than the right half, and if after moving R_i^right left by K pixels the new rectangle still contains more than η% of the pixels of the original rectangle R_i, where K is a constant and 50 < η < 100, then R_i^right is shifted left by K pixels; as soon as a shift would leave the new rectangle with fewer than η% of the pixels of the original rectangle R_i, the right-boundary adjustment is complete; if the right half of R_i contains more foreground pixels than the left half, and if after moving R_i^left right by K pixels the new rectangle still contains more than η% of the pixels of the original rectangle R_i, then R_i^left is shifted right by K pixels; as soon as a shift would leave the new rectangle with fewer than η% of the pixels of the original rectangle R_i, the left-boundary adjustment is complete; if the numbers of pixels in the left and right halves of R_i differ by no more than ε, where ε is a threshold parameter, check whether the rectangle obtained by moving the left and right boundaries K/2 pixels each towards the centre still contains more than η% of all pixels of the original rectangle R_i; if so, shrink R_i by K/2 pixels on each of the left and right boundaries and repeat step b), until the pixels remaining in the new rectangle drop below η% of all pixels of the original rectangle R_i; the upper and lower boundaries of R_i are adjusted in the same way;
c) The depth video clip V is a volume described by three dimensions: the horizontal coordinate x, the vertical coordinate y and the time coordinate t; after the adjustment of step b), the foreground pixels of every frame P_i of V have been separated out, and their extent is described by R_i; the four boundaries of the minimum enclosing rectangle R of the human behavior in the depth video S, namely the upper boundary R_up, the lower boundary R_down, the left boundary R_left and the right boundary R_right, are computed according to formula (1):
R_up = min_i R_i^up,   R_down = max_i R_i^down,   R_left = min_i R_i^left,   R_right = max_i R_i^right        formula (1);
Step 1-3. For a subsequence S_j of the depth video clip V that starts at moment j and has an arbitrary time window length τ, a motion history image MHI_τ^I can be obtained, computed as follows:
MHI_τ^I(x, y, t) = τ                                  if |I(x, y, t) − I(x, y, t−1)| > δ_I^th,
MHI_τ^I(x, y, t) = max(0, MHI_τ^I(x, y, t−1) − 1)     otherwise        formula (2);
where I(x, y, t) is the depth value captured at pixel (x, y) at moment t; t ranges over [j, j+τ−1]; δ_I^th is a constant threshold; and j and τ are natural numbers;
Three time window lengths τ_s, τ_m and τ_l are taken and the corresponding motion history images are obtained, where s, m and l are natural numbers, m = 2s, l = 4s, and s is proportional to the time length of the depth video clip V;
After the processing of step 1-3 the depth video clip has been converted into motion history image sequences: for each window length o = s, m, l, the motion history images obtained above, extended along the time dimension, form a motion history image sequence of the clip V denoted MHIs_o;
Step 1-4. For any motion history image sequence MHIs_o obtained in step 1-3, let H_o(x, y, t) denote the intensity at pixel (x, y) of the t-th frame of MHIs_o; an energy map D_o is computed from MHIs_o, where the value D_o(x, y) at each position (x, y) is given by formula (3):
D_o(x, y) = Σ_{i=1..N−1} μ( |H_o(x, y, i+1) − H_o(x, y, i)| − ε )        formula (3);
where μ(θ) is the unit step function, equal to 1 when θ ≥ 0 and to 0 when θ < 0; ε is a threshold constant; and N is the time length of the depth video clip V;
Step 1-5. For each energy map D_o, compute its R transform to obtain the behavior feature of the depth video clip V, as follows:
First compute the Radon transform of the energy map D_o according to formula (4):
p_o(ρ, θ) = ∫∫ D_o(x, y) δ(x·cosθ + y·sinθ − ρ) dx dy        formula (4);
Then integrate its square over the whole range of ρ to obtain the R transform, formula (5):
R_o(θ) = ∫ p_o²(ρ, θ) dρ        formula (5);
R_o(θ) is normalized, and the normalized R transforms obtained for the three window lengths are concatenated to form the behavior feature of the depth video clip V;
Step 1-6. From the behavior features of the depth video clips and the behavior labels obtained in step 1-1, a recognition model M is trained with a support vector machine;
Step (2). On-line recognition stage:
Step 2-1. The behavior feature of the video to be recognized is extracted with the same method as steps 1-1 to 1-6 of the off-line training stage;
The recognition granularity of the on-line recognition stage is kept consistent with that used during off-line training;
Step 2-2. Based on the behavior feature of the video to be recognized, the support vector machine with the trained model M is used to recognize the behavior in that video.
CN201410106957.4A 2014-03-21 2014-03-21 Human body behavior recognition method based on history motion graph and R transformation Active CN103886293B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410106957.4A CN103886293B (en) 2014-03-21 2014-03-21 Human body behavior recognition method based on history motion graph and R transformation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410106957.4A CN103886293B (en) 2014-03-21 2014-03-21 Human body behavior recognition method based on history motion graph and R transformation

Publications (2)

Publication Number Publication Date
CN103886293A true CN103886293A (en) 2014-06-25
CN103886293B CN103886293B (en) 2017-04-26

Family

ID=50955176

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410106957.4A Active CN103886293B (en) 2014-03-21 2014-03-21 Human body behavior recognition method based on history motion graph and R transformation

Country Status (1)

Country Link
CN (1) CN103886293B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070122009A1 (en) * 2005-11-26 2007-05-31 Hyung Keun Jee Face recognition method and apparatus
US20110087677A1 (en) * 2008-04-30 2011-04-14 Panasonic Corporation Apparatus for displaying result of analogous image retrieval and method for displaying result of analogous image retrieval
CN102043967A (en) * 2010-12-08 2011-05-04 中国科学院自动化研究所 Effective modeling and identification method of moving object behaviors
CN103544466A (en) * 2012-07-09 2014-01-29 西安秦码软件科技有限公司 Vector field model based behavior analysis method
CN103295016A (en) * 2013-06-26 2013-09-11 天津理工大学 Behavior recognition method based on depth and RGB information and multi-scale and multidirectional rank and level characteristics
CN103577841A (en) * 2013-11-11 2014-02-12 浙江大学 Human body behavior identification method adopting non-supervision multiple-view feature selection

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Yin Yong et al., "Abnormal human behavior recognition in quasi-periodic motion", Computer Engineering and Applications (《计算机工程与应用》) *
Ouyang Han et al., "Human behavior recognition based on a hierarchical model of the normalized R transform", Computer Engineering and Design (《计算机工程与设计》) *
Zhao Haiyong et al., "Moving human behavior recognition based on multi-feature fusion", Application Research of Computers (《计算机应用研究》) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106204633A (en) * 2016-06-22 2016-12-07 广州市保伦电子有限公司 A kind of student trace method and apparatus based on computer vision
CN106204633B (en) * 2016-06-22 2020-02-07 广州市保伦电子有限公司 Student tracking method and device based on computer vision
CN106778576A (en) * 2016-12-06 2017-05-31 中山大学 A kind of action identification method based on SEHM feature graphic sequences
CN106778576B (en) * 2016-12-06 2020-05-26 中山大学 Motion recognition method based on SEHM characteristic diagram sequence

Also Published As

Publication number Publication date
CN103886293B (en) 2017-04-26

Similar Documents

Publication Publication Date Title
Wu et al. Helmet detection based on improved YOLO V3 deep model
Johnson-Roberson et al. Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks?
CN107679502A (en) A kind of Population size estimation method based on the segmentation of deep learning image, semantic
CN106845415B (en) Pedestrian fine identification method and device based on deep learning
CN108245384B (en) Binocular vision apparatus for guiding blind based on enhancing study
CN103208123B (en) Image partition method and system
CN107506722A (en) One kind is based on depth sparse convolution neutral net face emotion identification method
CN110852182B (en) Depth video human body behavior recognition method based on three-dimensional space time sequence modeling
CN102880865B (en) Dynamic gesture recognition method based on complexion and morphological characteristics
CN102968643B (en) A kind of multi-modal emotion identification method based on the theory of Lie groups
CN106296653A (en) Brain CT image hemorrhagic areas dividing method based on semi-supervised learning and system
CN103310194A (en) Method for detecting head and shoulders of pedestrian in video based on overhead pixel gradient direction
CN105096311A (en) Technology for restoring depth image and combining virtual and real scenes based on GPU (Graphic Processing Unit)
CN103020614B (en) Based on the human motion identification method that space-time interest points detects
CN106056631A (en) Pedestrian detection method based on motion region
CN103198330B (en) Real-time human face attitude estimation method based on deep video stream
CN106960181A A kind of pedestrian's attribute recognition approach based on RGBD data
CN102567716A (en) Face synthetic system and implementation method
CN105095857A (en) Face data enhancement method based on key point disturbance technology
CN105956552A (en) Face black list monitoring method
CN104200505A (en) Cartoon-type animation generation method for human face video image
CN110533026A (en) The competing image digitization of electricity based on computer vision and icon information acquisition methods
CN105069745A (en) face-changing system based on common image sensor and enhanced augmented reality technology and method
CN107609509A (en) A kind of action identification method based on motion salient region detection
CN107944437A (en) A kind of Face detection method based on neutral net and integral image

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20140625

Assignee: CCI (CHINA) Co.,Ltd.

Assignor: ZHEJIANG University

Contract record no.: X2021980001760

Denomination of invention: A human behavior recognition method based on motion history map and r-transform

Granted publication date: 20170426

License type: Common License

Record date: 20210316

EE01 Entry into force of recordation of patent licensing contract