CN111310689A - Method for recognizing human body behaviors in potential information fusion home security system - Google Patents

Method for recognizing human body behaviors in potential information fusion home security system

Info

Publication number
CN111310689A
Authority
CN
China
Prior art keywords
human body
human
joint
detected
behavior
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010116795.8A
Other languages
Chinese (zh)
Other versions
CN111310689B (en)
Inventor
李颀
姜莎莎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shaanxi University of Science and Technology
Original Assignee
Shaanxi University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shaanxi University of Science and Technology filed Critical Shaanxi University of Science and Technology
Priority to CN202010116795.8A priority Critical patent/CN111310689B/en
Publication of CN111310689A publication Critical patent/CN111310689A/en
Application granted granted Critical
Publication of CN111310689B publication Critical patent/CN111310689B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G06V40/166 Detection; Localisation; Normalisation using acquisition arrangements

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

A human behavior recognition method for a home security system with potential information fusion. A human motion time sequence obtained by tracking is taken as the research object, and the correlations between posture features and behaviors, between interactive-object features and behaviors, and among behaviors themselves are treated as potential information. By introducing constraint conditions into the extraction of posture spatio-temporal features and of interactive-object features, the influence of this potential information on human behavior recognition in the home security system is fully exploited, which increases the differences between behavior classes, reduces the differences within behavior classes, and improves the accuracy and generalization of the method. The mutual information of each joint point with respect to the behavior category is used as a constraint: all mutual-information values are sorted, the joint-point group with the largest mutual information that characterizes the specific behavior is retained, and behavior recognition is performed by fusing the screened joint-point group with the interactive-object features, which improves the real-time performance and accuracy of recognition.

Description

Method for recognizing human body behaviors in potential information fusion home security system
Technical Field
The invention relates to the technical field of computer vision, in particular to a human behavior identification method in a home security system with potential information fusion.
Background
At present, many people install video monitoring systems at home to protect their property and personal safety. However, such systems only record what happens at the home and cannot prevent incidents before they occur, and traditional digital monitoring relies mainly on monitoring personnel watching and analysing the monitoring images, which is not only inefficient but also fails to meet ever higher security requirements for real-time performance and effectiveness.
Disclosure of Invention
To overcome the defects of the prior art, the invention aims to provide a human behavior recognition method for a home security system with potential information fusion, in which a computer, rather than the family members, automatically analyses human behavior in the monitoring picture, immediately prompts the family members to pay attention to the situation at the door when an abnormal phenomenon is found, and provides reliable 24/7 monitoring, greatly improving real-time performance and effectiveness.
In order to achieve the purpose, the invention adopts the technical scheme that:
the method for providing human behavior identification in the home security system with potential information fusion comprises the following steps;
step one, a camera is used for collecting images;
step two, detecting a human body target in the acquired image by using an illumination-adaptive method based on background subtraction, and then tracking the detected human body target with the Staple method to obtain a human motion time sequence;
step three, recognizing the detected face and judging whether it belongs to a family member; if so, no operation is performed on the motion time sequence obtained in step two; otherwise, human behavior recognition is performed;
step four, extracting the posture space-time characteristics of the human body from the motion time sequence obtained in the step two;
step five, extracting the characteristics of the interactive objects by adopting a clue enhanced deep convolutional neural network;
step six, fusing the global posture space-time characteristics and the local interactive object characteristics extracted in the step four and the step five;
and step seven, inputting the fused feature vectors into an SVM classifier for behavior recognition.
In the second step, a human body entering the detection range is detected with an illumination-adaptive method based on background subtraction, and the background is modelled with the ViBe algorithm. The number of pixels of the human body target detected in the previous frame is recorded and denoted Y, and the number of foreground pixels detected in the current frame is denoted L. At the instant of a sudden illumination change a large white area appears and the system falsely detects background as foreground, so L becomes larger than Y. A threshold (the pixel count of the human target range detected in the previous frame) is therefore set during foreground detection to judge the extent of the foreground: if it exceeds the threshold, a sudden illumination change has occurred, otherwise it has not. When a sudden illumination change occurs, illumination compensation is applied to the background model using the brightness change of the pixels between two adjacent frames, with the compensation formula:
Δ_t(x, y) = |V_t(x, y) - V_{t-1}(x, y)|
wherein:
V_t(x, y) denotes the brightness of image I_t at pixel (x, y) (its defining formula is given as an equation image in the original publication), n is the total number of pixels in the image, n = 1280 × 480 = 614400, and I_t(x, y)_max(R,G,B) and I_t(x, y)_min(R,G,B) denote the maximum and the minimum of the R, G, B components at pixel (x, y), respectively;
After the human body target is detected, it is tracked with the Staple method: during tracking a translation filter and a color filter locate the target, a scale filter then estimates its size, and finally the human motion time sequence is obtained.
In the fourth step, the posture space-time characteristics of the obtained human motion time sequence are extracted, and the specific process comprises the following steps:
1) calculating mutual information of each joint point, judging the response degree of each joint point to a certain specific behavior through the mutual information, and finally reserving a joint point group which can represent the specific behavior and has the maximum mutual information, wherein a formula for calculating the mutual information of each joint point is as follows:
I(f_j, Y) = H(f_j) - H(f_j | Y)
where H(f_j) is the information entropy of the j-th joint point, j = 1, 2, ..., 20, and
f_j = {f_1^j, f_2^j, ..., f_N^j}
denotes the dynamic process of the j-th joint point over time, N is the number of frames of the human motion time sequence, and Y is the human behavior category; in the home security scene the categories mainly recognized are water delivery, express delivery, takeaway, friends, cleaning personnel and others, so Y = 1, 2, 3, 4, 5, 6. The entropy is calculated as:
H(f_j) = -Σ_{i=1}^{N} p(f_i^j) log p(f_i^j)
where p(f_i^j) is the probability density function and i = 1, 2, ..., N indexes the frames of the time sequence.
2) Extracting posture space-time characteristics from the screened joint points, wherein the characteristics in space dimension are as follows:
F_spatial = [T, Θ, D, Ψ, A]
the method comprises the following steps that K represents joint points of human body postures, K is 1,2, 20, N represents the frame number of a human body motion time sequence, human body hip joint points are selected as the mass center of a human body, T represents a joint coordinate track characteristic matrix, theta represents a direction matrix of each deleted joint point relative to the mass center of the human body, D represents a space distance matrix of any two joint points, psi represents a direction matrix of a vector formed by any 2 joints relative to an upward vector of the mass center, and A represents a 3 internal angle size matrix formed by any 3 joint points;
the features in the time dimension are:
F_temporal = [ΔT, ΔΘ, ΔD, ΔΨ, ΔA]
where ΔT is the trajectory displacement matrix of the joint points, ΔΘ is the change in direction of each joint point as it is displaced, ΔD is the matrix of changes in the distance between any two joint points over time, ΔΨ is the change in direction of the vector between any two joint points relative to the upward vector at the centroid, and ΔA is the matrix of changes in the interior angles formed by any three joint points.
The extracted pose spatiotemporal features are represented as:
F_pose = F_spatial + F_temporal
in the fifth step, the detected human body is used as a clue and the effective object interacting with the human is used as a high-level clue; the features of the object interacting with the human are extracted with a convolutional neural network into which the position relation between the object and the detected human body is implicitly integrated;
a mixed loss function is used during training and the parameters are adjusted by back-propagating the loss; the mixed loss function is calculated as:
L(M, D) = L_main(M, D) + α L_hint(M, D)
where L_main(M, D) is the loss function of interactive-object feature extraction, L_hint(M, D) is the loss function of the distance-hint task, M is the network model, D = {(x_i, y_i)}, i = 1, ..., N, is the training set of N sample pictures, x_i denotes the i-th of the N images, y_i denotes the corresponding category label, and α takes a value between 0 and 1.
In the sixth step, because the posture spatio-temporal features and the interactive-object features respond to different degrees to different human behaviors, the two obtained features are fused with weights according to the formula:
F = w_1 F_pose + w_2 F_object
where w_1 is the weighting coefficient of the posture spatio-temporal features, w_2 is the weighting coefficient of the human-interaction-object features, w_1 + w_2 = 1, F_pose denotes the posture spatio-temporal features and F_object the interactive-object features.
And seventhly, inputting the fused feature vectors into an SVM classifier for classification to obtain a final recognition result.
The invention has the beneficial effects that:
the invention uses the machine vision technology to perform behavior detection on a human body in a home security environment, firstly performs face recognition before the behavior detection, judges whether the human body is a family or not, does not perform the behavior detection if the human body is the family, otherwise performs the behavior detection, integrates the interactive object detection in the behavior detection, improves the recognition accuracy, can prevent the human body from getting ill before detecting the behavior before pressing a doorbell, and improves the instantaneity and the effectiveness. The invention combines the body characteristics of the human body interaction object and the human body movement to perform behavior detection, and has important research value for processing human body behaviors identified by the interaction object under different scenes.
Drawings
Fig. 1 is a flowchart of a behavior recognition method according to an embodiment of the present invention.
Fig. 2 is a flowchart of human target detection according to an embodiment of the present invention.
FIG. 3 is a flow chart of pose spatio-temporal feature extraction according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
The system addresses the problem of intrusion by outsiders in existing home security systems and the fact that the real-time performance and effectiveness of traditional digital monitoring cannot meet safety requirements. The invention applies machine vision to the security scene so that monitoring personnel are no longer needed to analyse the monitoring images: face recognition is used for family members and human behavior recognition for everyone else, the behavior of a person is detected before the door is knocked, and the family members are informed in time, so that incidents are prevented before they occur and real-time performance and accuracy are improved.
The application principle of the invention is further explained in the following with the attached drawings:
as shown in fig. 1, which is a general flow diagram of the method of the present invention, the method for identifying human behavior in a home security system with latent information fusion according to the present invention comprises the following steps:
step one, a camera installed at a door of a house is used for collecting images in a detection range.
Step two, as shown in fig. 2: real-time human target detection is first performed on the acquired image. When the system detects human targets with the background difference method, a background model must be established and updated; because the home security system has high real-time requirements and suffers from sudden illumination changes, the ViBe background modelling method is adopted to build the model. The number of pixels of the human target range detected in the previous frame is recorded and denoted Y, and the number of foreground pixels detected in the current frame is denoted L. At the instant of a sudden illumination change a large white area appears and the system falsely detects background as foreground, so L becomes larger than Y. A threshold (the pixel count of the human target range detected in the previous frame) is therefore set during foreground detection to judge the extent of the foreground: if it exceeds the threshold, a sudden illumination change has occurred, otherwise it has not. When a sudden illumination change occurs, illumination compensation is applied to the background model using the brightness change of the pixels between two adjacent frames, with the compensation formula:
Δ_t(x, y) = |V_t(x, y) - V_{t-1}(x, y)|
wherein:
V_t(x, y) denotes the brightness of image I_t at pixel (x, y) (its defining formula is given as an equation image in the original publication), n is the total number of pixels in the image, n = 1280 × 480 = 614400, and I_t(x, y)_max(R,G,B) and I_t(x, y)_min(R,G,B) denote the maximum and the minimum of the R, G, B components at pixel (x, y), respectively.
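A minimal sketch of this illumination-adaptive step in Python/NumPy is shown below. It assumes the per-pixel brightness V_t is derived from the maximum and minimum of the R, G, B components (the exact formula is given only as an equation image above), and it returns the compensation map Δ_t rather than updating the ViBe samples, since that update is not detailed in the text.

```python
import numpy as np

def brightness(frame_bgr):
    # Per-pixel brightness approximated from the max and min of the R, G, B
    # components (assumption: the patent gives V_t only as an equation image).
    f = frame_bgr.astype(np.float32)
    return (f.max(axis=2) + f.min(axis=2)) / 2.0

def illumination_compensation(fg_mask, prev_target_pixels, cur_frame, prev_frame):
    """Return the per-pixel brightness change Δ_t used to compensate the
    background model when a sudden illumination change is detected.

    fg_mask            -- binary foreground mask of the current frame
    prev_target_pixels -- Y: pixel count of the human target in the previous frame
    """
    L = int(np.count_nonzero(fg_mask))   # foreground pixel count of the current frame
    if L <= prev_target_pixels:          # L <= Y: no sudden illumination change
        return None
    # Δ_t(x, y) = |V_t(x, y) - V_{t-1}(x, y)|; how this map is applied to the
    # ViBe background samples is not detailed in the text.
    return np.abs(brightness(cur_frame) - brightness(prev_frame))
```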
In practice, some background objects with boundary regions or strong reflection coefficients remain in the captured picture and cannot be completely removed by background differencing; they appear as point-like, small blob-like and linear noise that must be distinguished from the actual moving targets during detection. The binarized image is therefore cleaned with morphological filtering. For multiple targets, the filtered binary image generally contains several regions, and since a multi-target region usually consists of several sub-regions that are not connected to each other, the connectivity of each region must be checked; the regions are then distinguished by labels, each target is framed in the original image according to its label, and the position of each target in every frame is computed.
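A short OpenCV sketch of the morphological filtering and connected-component labelling described above; the structuring-element size and the minimum blob area are illustrative assumptions.

```python
import cv2

def extract_human_targets(fg_mask, min_area=500):
    """Suppress point/blob/line noise with morphological filtering, then label
    connected regions and return one bounding box per candidate target."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    cleaned = cv2.morphologyEx(fg_mask, cv2.MORPH_OPEN, kernel)   # remove specks
    cleaned = cv2.morphologyEx(cleaned, cv2.MORPH_CLOSE, kernel)  # fill small holes
    num, labels, stats, _ = cv2.connectedComponentsWithStats(cleaned, connectivity=8)
    boxes = []
    for i in range(1, num):               # label 0 is the background
        x, y, w, h, area = stats[i]
        if area >= min_area:              # discard residual noise blobs
            boxes.append((x, y, w, h))
    return cleaned, boxes
```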
Further, a kernel method is adopted to establish a correlation filter model and a color filter template for a first frame of image, for a new frame of image, firstly a translation filter and a color filter are used to find the position of a target, then different candidate target frames are extracted by using a scale filter with the target position as a central point, the scale corresponding to the value with the maximum response value is taken as the final target scale, the position and the size of the target are obtained, and then the correlation filter model and the color filter model are updated.
After the scores of the translation filter and the color histogram are obtained, weighted summation is carried out, and the calculation formula is as follows:
f(x) = γ_tmpl f_tmpl(x) + γ_hist f_hist(x)
where x = T(x_t, p; θ_{t-1}), T is the feature extraction function, x_t denotes the t-th frame image, p denotes a rectangular box in a frame, θ denotes the model parameters and θ_{t-1} the target model parameters built from the first t-1 frames. To combine the gradient features and the color features while satisfying the real-time requirement, the scoring function is formed linearly, where γ_tmpl is the score coefficient of the filter template, γ_hist is the score coefficient of the histogram, and γ_tmpl + γ_hist = 1.
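The score fusion can be sketched in a few lines; the weight value below is only an illustrative choice, and both responses are assumed to be maps over the same candidate positions.

```python
import numpy as np

def fuse_scores(f_tmpl, f_hist, gamma_tmpl=0.7):
    """Linear fusion f(x) = γ_tmpl·f_tmpl(x) + γ_hist·f_hist(x) with
    γ_tmpl + γ_hist = 1 (0.7 is only an illustrative weight)."""
    gamma_hist = 1.0 - gamma_tmpl
    return gamma_tmpl * f_tmpl + gamma_hist * f_hist

# Usage: both responses are maps over candidate translations; the new target
# position is the argmax of the fused map.
# response = fuse_scores(template_response, histogram_response)
# dy, dx = np.unravel_index(np.argmax(response), response.shape)
```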
Step three, face recognition is performed on the detected human body to judge whether the person is a family member. If so, the recognition result is displayed on the human-computer interaction interface; otherwise, behavior recognition is carried out.
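The patent does not name a specific face-recognition method; the sketch below assumes the open-source face_recognition library and a stored set of family-member encodings, with the display and behavior-recognition calls left as hypothetical placeholders.

```python
import face_recognition  # assumed library; the patent does not specify one

def is_family_member(frame_rgb, family_encodings, tolerance=0.6):
    """Return True if any detected face matches a stored family-member encoding."""
    encodings = face_recognition.face_encodings(frame_rgb)
    for enc in encodings:
        matches = face_recognition.compare_faces(family_encodings, enc, tolerance)
        if any(matches):
            return True
    return False

# if is_family_member(frame, family_encodings):
#     show_on_hmi("family member")        # hypothetical display call
# else:
#     run_behavior_recognition(sequence)  # continue with steps four to seven
```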
Step four, as shown in fig. 3: the human body posture space-time characteristics are extracted by taking a human body motion time sequence as a research object, and the specific process comprises the following steps:
1) In the home security scene the person moves forward, so the distance between the camera and the human body changes and the position coordinates of the human joint points can differ considerably; the joint coordinates are therefore normalized first.
Assume the original coordinates of a joint point are (x_0, y_0) and the normalized coordinates are (x, y). The normalization formula (given as an equation image in the original publication) uses d = max{w, h}, where w and h are the width and height of the image, so that x, y ∈ (-1, 1) after normalization.
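Since the normalization formula itself appears only as an equation image, the sketch below assumes a center-and-scale normalization that is consistent with the quantities stated above (d = max{w, h}, output in (-1, 1)).

```python
import numpy as np

def normalize_joints(joints_xy, width, height):
    """Map raw joint coordinates into (-1, 1).

    joints_xy -- array of shape (num_joints, 2) with (x0, y0) pixel coordinates
    Assumption: coordinates are centred on the image centre and scaled by
    d = max{w, h}; the patent gives the exact formula only as an image.
    """
    d = float(max(width, height))
    centre = np.array([width / 2.0, height / 2.0], dtype=np.float32)
    return 2.0 * (np.asarray(joints_xy, dtype=np.float32) - centre) / d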
2) After the normalized human posture coordinates are obtained, the temporal and spatial features of the posture time sequence are extracted; the spatial features describe the positions and mutual positional relations of the joint points within one frame, and the temporal features describe the changes in joint positions caused by posture changes.
Because each joint point of the human body responds to a given behavior to a different degree, treating all joint points equally would let joints with low response introduce noise and degrade recognition, so some noisy points are deliberately discarded: the mutual information of each joint point with respect to the behavior category is computed, and the joint-point group with the largest mutual information that characterizes the specific behavior is retained.
Assuming a behavior time sequence with N frames in total, the dynamic process of the j-th joint point (j = 1, 2, ..., 20) over time can be expressed as:
f_j = {f_1^j, f_2^j, ..., f_N^j}
mutual information of each joint point to the human behavior category is as follows:
I(f_j, Y) = H(f_j) - H(f_j | Y)
where H(f_j) denotes the information entropy of the j-th joint point and Y is the human behavior category; in the home security scene the categories mainly recognized are water delivery, express delivery, takeaway, friends, cleaning personnel and others, so Y = 1, 2, 3, 4, 5, 6. The formula above measures the degree to which each joint responds to a particular behavior category. The entropy is calculated as:
H(f_j) = -Σ_{i=1}^{N} p(f_i^j) log p(f_i^j)
H(f_j | Y) = -Σ_Y Σ_{i=1}^{N} p(Y, f_i^j) log p(f_i^j | Y)
where p(f_i^j) is the probability density function, p(Y, f_i^j) is the joint probability density function, p(f_i^j | Y) is the conditional probability density function, and i = 1, 2, ..., N indexes the frames of the time sequence.
After the mutual information of each joint point with respect to the behavior category has been computed, the values are sorted in descending order and the joint-point group with the largest mutual information that characterizes the specific behavior is selected.
The selection rule for the joint-point group with the largest mutual information is as follows: for human behavior recognition in the home security scene, the joint points of the arms, hands and legs are of main concern for a normal person, whereas for a person exhibiting intrusive behavior all joint points are of concern, so the constraint for selecting from the sorted mutual-information matrix is as follows:
the matrix formed by the mutual information of the joint points obtained by each action is as follows:
M = (I_nk)_{N×K}
where n = 1, 2, ..., 6 indexes the behavior class (N classes) and k = 1, 2, ..., 20 indexes the joint point (K joints). The mutual information obtained for each joint point is sorted, and since the arms, hands and legs are the joints of main concern, the joint-point group composed of these parts that characterizes the specific behavior is selected from the sorted mutual-information set.
The posture matrix after screening the joint points by their degree of response to the behaviors is:
R = (r_ij)_{N×K}
where r_ij = (x_ij, y_ij), i ∈ {1, 2, ..., N}, j ∈ {1, 2, ..., K}, N = 6, K ranges from 4 to 14, and K is the largest index of the retained joint-point group.
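A compact sketch of this screening step, assuming scikit-learn's mutual-information estimator. As a simplification of the per-frame treatment above, each joint is summarized by its total displacement per sequence before estimating I(f_j, Y), and the limb-joint indices are assumptions that depend on the pose estimator used.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Assumed indices of the arm, hand and leg joints in a 20-joint skeleton;
# the actual layout depends on the pose estimator.
LIMB_JOINTS = [4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 16, 17, 18, 19]

def select_joints(sequences, labels, k=10):
    """Keep the joint points whose motion carries the most mutual information
    about the behavior class.

    sequences -- array (num_sequences, N_frames, 20, 2) of normalized joints
    labels    -- array (num_sequences,) with behavior classes 1..6
    Returns the indices of the k retained joints (restricted to limb joints)
    and the per-joint mutual-information estimates.
    """
    seq = np.asarray(sequences, dtype=np.float32)
    disp = np.linalg.norm(np.diff(seq, axis=1), axis=-1).sum(axis=1)  # (num_seq, 20)
    mi = mutual_info_classif(disp, np.asarray(labels))                # I(f_j, Y) per joint
    order = np.argsort(mi)[::-1]                                      # descending MI
    kept = [int(j) for j in order if j in LIMB_JOINTS][:k]
    return kept, mi
```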
After the screened posture matrix is obtained, features are extracted in the spatial dimension and in the time dimension. The features in the spatial dimension are:
F_spatial = [T, Θ, D, Ψ, A]
where the human hip joint point (x_i0, y_i0) is selected as the body centroid, T = (t_ij)_{N×K} is the joint-coordinate trajectory feature matrix with t_ij = (x_ij - x_i0, y_ij - y_i0), Θ is the direction matrix of each screened joint point relative to the body centroid, D is the spatial distance matrix of any two joint points, Ψ is the direction matrix of the vector formed by any two joints relative to the upward vector at the centroid, and A is the matrix of the three interior angles formed by any three joint points (the defining formulas of Θ, D, Ψ and A are given as equation images in the original publication).
The features in the time dimension are:
F_temporal = [ΔT, ΔΘ, ΔD, ΔΨ, ΔA]
where ΔT = (x_{i+s,j} - x_ij, y_{i+s,j} - y_ij)_{(N-s)×2K} is the trajectory displacement matrix of the joint points (s is the frame offset between the compared frames), ΔΘ is the change in direction of each joint point as it is displaced, ΔD is the matrix of changes in the distance between any two joint points over time, ΔΨ is the change in direction of the vector between any two joint points relative to the upward vector at the centroid, and ΔA is the matrix of changes in the interior angles formed by any three joint points (the defining formulas of ΔΘ, ΔD, ΔΨ and ΔA are given as equation images in the original publication).
The posture spatio-temporal features are obtained from the temporal and spatial features and are expressed as:
F_pose = F_spatial + F_temporal
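A minimal sketch of extracting a subset of these spatio-temporal posture features is given below; it covers the T, Θ and D families and their temporal differences (Ψ and A follow the same pattern), and the hip-joint index and the way the spatial and temporal parts are combined are assumptions, since the exact matrix layouts are given only as equation images above.

```python
import numpy as np
from itertools import combinations

HIP = 0  # assumed index of the hip joint used as the body centroid

def pose_spatiotemporal_features(seq, step=1):
    """Sketch of F_pose for one screened joint sequence.

    seq  -- array (N_frames, K_joints, 2) of normalized joint coordinates
    step -- frame offset s used for the temporal differences
    """
    seq = np.asarray(seq, dtype=np.float32)
    centroid = seq[:, HIP:HIP + 1, :]                     # (N, 1, 2)

    rel = seq - centroid
    T = rel.reshape(len(seq), -1)                         # trajectory w.r.t. centroid
    theta = np.arctan2(rel[..., 1], rel[..., 0])          # direction of each joint
    pairs = list(combinations(range(seq.shape[1]), 2))
    D = np.stack([np.linalg.norm(seq[:, a] - seq[:, b], axis=-1)
                  for a, b in pairs], axis=1)             # pairwise joint distances

    F_spatial = np.concatenate([T, theta, D], axis=1)     # per-frame spatial features
    F_temporal = F_spatial[step:] - F_spatial[:-step]     # changes over `step` frames

    # The patent writes F_pose = F_spatial + F_temporal; here the two parts are
    # concatenated after aligning their lengths, which is one way to realise that.
    return np.concatenate([F_spatial[:-step], F_temporal], axis=1)
```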
further, the features of the object interacting with the human are extracted by using a convolutional neural network by taking the detected human body as a clue and the effective object interacting with the human as a high-level clue, and the features of the effective object interacting with the human are extracted by implicitly integrating the position relationship between the object and the human in the detected human body into the convolutional neural network.
In the invention two tasks are executed jointly: a main task of interactive-object recognition and an auxiliary task of distance-hint enhancement. The auxiliary task regularizes the network and enhances its expressive power; its influence on the main task is realized by sharing all convolutional layers before the fully connected layers. To learn the weights of these shared layers jointly, a mixed loss function combining the loss functions of both tasks is used. Specifically, the network model is denoted M, D = {(x_i, y_i)}, i = 1, ..., N, is the training set of N sample pictures, x_i denotes the i-th of the N images, y_i denotes the corresponding category label, and α takes a value between 0 and 1. The mixed loss function is:
L(M, D) = L_main(M, D) + α L_hint(M, D)
(The defining formulas of L_main(M, D) and L_hint(M, D) are given as equation images in the original publication.)
where M_main(·) and M_hint(·) denote the output of the main task and the output of the hint task, respectively. The model parameters are trained and fine-tuned by stochastic gradient descent, which is used to optimize L(M, D). After the gradient is computed, the weight ω is updated according to the following rule:
(The weight-update rule for ω is given as an equation image in the original publication.)
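A sketch of one training step with this mixed loss, written with PyTorch as an assumed framework: the convolutional trunk, the cross-entropy form of L_main and the mean-squared-error form of the distance-hint loss L_hint are all assumptions, since the patent gives the architecture and both loss formulas only as equation images.

```python
import torch
import torch.nn as nn

class HintEnhancedCNN(nn.Module):
    """Shared convolutional trunk with two heads: interactive-object
    classification (main task) and distance-hint regression (auxiliary task)."""
    def __init__(self, num_classes=6):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.main_head = nn.Linear(64, num_classes)  # M_main(.)
        self.hint_head = nn.Linear(64, 1)            # M_hint(.)

    def forward(self, x):
        h = self.trunk(x)
        return self.main_head(h), self.hint_head(h)

def train_step(model, optimizer, images, labels, distances, alpha=0.5):
    """One SGD step on the mixed loss L = L_main + alpha * L_hint (alpha in (0, 1))."""
    model.train()
    optimizer.zero_grad()
    logits, dist_pred = model(images)
    loss = nn.functional.cross_entropy(logits, labels) + \
           alpha * nn.functional.mse_loss(dist_pred.squeeze(1), distances)
    loss.backward()      # back-propagate the mixed loss
    optimizer.step()     # plain SGD update of the weights
    return loss.item()

# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
```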
Further, because the response degrees of the posture characteristic and the interactive object characteristic to different human behavior recognition are different, the obtained two characteristics are subjected to weighted fusion, and the formula is as follows:
F = w_1 F_pose + w_2 F_object
where w_1 is the weighting coefficient of the posture spatio-temporal features, w_2 is the weighting coefficient of the human-interaction-object features, w_1 + w_2 = 1, F_pose denotes the extracted posture spatio-temporal features and F_object the extracted interactive-object features.
Furthermore, the fused features are classified. The system mainly recognizes behaviors such as water delivery, express delivery, takeaway, friends, cleaning personnel and others, so a multi-class support vector machine is needed. It is built by designing a binary classification model between every pair of classes and combining the binary classifiers into a multi-class classifier, each binary classification still using the method described above. With the 6 categories in this system, one category is taken as the positive sample and another as the negative sample in each pairing, and so on, giving 15 classifiers in total. During classification the 15 classifiers each answer which of their two categories the sample belongs to, and the category with the most votes in the final tally is the recognized class.
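A brief sketch of the weighted fusion and the one-vs-one SVM classification, using scikit-learn as an assumed implementation; the fusion weight and kernel are illustrative, and the two feature vectors are assumed to have been brought to a common length before fusion.

```python
import numpy as np
from sklearn.svm import SVC

def fuse_features(f_pose, f_object, w1=0.6):
    """Weighted fusion F = w1*F_pose + w2*F_object with w1 + w2 = 1
    (w1 = 0.6 is only an illustrative weight)."""
    return w1 * np.asarray(f_pose) + (1.0 - w1) * np.asarray(f_object)

# One-vs-one multi-class SVM: with 6 categories, scikit-learn internally trains
# C(6,2) = 15 binary classifiers and takes a vote, matching the scheme above.
clf = SVC(kernel="rbf", decision_function_shape="ovo")
# clf.fit(np.stack(train_features), train_labels)     # fused training vectors
# predicted_class = clf.predict(fused_feature.reshape(1, -1))
```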
Finally, because the image processing is performed on a cloud server, users cannot see the recognition result directly, so a human-computer interaction module is needed to receive and display it. When no family member is at home and someone is moving about at the door, the recognition result is sent to the family members in the form of a text message.

Claims (6)

1. A method for human behavior recognition in a home security system with potential information fusion, characterized by comprising the following steps:
step one, a camera is used for collecting images;
step two, detecting a human body target in the acquired image by using an illumination-adaptive method based on background subtraction, and then tracking the detected human body target with the Staple method to obtain a human motion time sequence;
step three, recognizing the detected face and judging whether it belongs to a family member; if so, no operation is performed on the motion time sequence obtained in step two; otherwise, human behavior recognition is performed;
step four, extracting the posture space-time characteristics of the human body from the motion time sequence obtained in the step two;
step five, extracting the characteristics of the interactive objects by adopting a clue enhanced deep convolutional neural network;
step six, fusing the global posture space-time characteristics and the local interactive object characteristics extracted in the step four and the step five;
and step seven, inputting the fused feature vectors into an SVM classifier for behavior recognition.
2. The method for human behavior recognition in a home security system with potential information fusion according to claim 1, wherein in the second step a human body entering the detection range is detected with an illumination-adaptive method based on background subtraction and the background is modelled with the ViBe algorithm; the number of pixels of the human body target detected in the previous frame is recorded and denoted Y, and the number of foreground pixels detected in the current frame is denoted L; at the instant of a sudden illumination change a large white area appears and the system falsely detects background as foreground, so L becomes larger than Y; a threshold (the pixel count of the human target range detected in the previous frame) is therefore set during foreground detection to judge the extent of the foreground: if it exceeds the threshold, a sudden illumination change has occurred, otherwise it has not; when a sudden illumination change occurs, illumination compensation is applied to the background model using the brightness change of the pixels between two adjacent frames, with the compensation formula:
Δ_t(x, y) = |V_t(x, y) - V_{t-1}(x, y)|
wherein:
V_t(x, y) denotes the brightness of image I_t at pixel (x, y) (its defining formula is given as an equation image in the original publication), n is the total number of pixels in the image, n = 1280 × 480 = 614400, and I_t(x, y)_max(R,G,B) and I_t(x, y)_min(R,G,B) denote the maximum and the minimum of the R, G, B components at pixel (x, y), respectively;
After the human body target is detected, it is tracked with the Staple method: during tracking a translation filter and a color filter locate the target, a scale filter then estimates its size, and finally the human motion time sequence is obtained.
3. The method for providing human behavior recognition in the home security system with potential information fusion according to claim 1, wherein in the fourth step, the posture space-time feature is extracted from the obtained human motion time sequence, and the specific process comprises:
1) calculating mutual information of each joint point, judging the response degree of each joint point to a certain specific behavior through the mutual information, and finally reserving a joint point group which can represent the specific behavior and has the maximum mutual information, wherein a formula for calculating the mutual information of each joint point is as follows:
I(f_j, Y) = H(f_j) - H(f_j | Y)
where H(f_j) is the information entropy of the j-th joint point, j = 1, 2, ..., 20, and
f_j = {f_1^j, f_2^j, ..., f_N^j}
denotes the dynamic process of the j-th joint point over time, N is the number of frames of the human motion time sequence, and Y is the human behavior category; in the home security scene the categories mainly recognized are water delivery, express delivery, takeaway, friends, cleaning personnel and others, so Y = 1, 2, 3, 4, 5, 6. The entropy is calculated as:
H(f_j) = -Σ_{i=1}^{N} p(f_i^j) log p(f_i^j)
where p(f_i^j) is the probability density function and i = 1, 2, ..., N indexes the frames of the time sequence.
2) Extracting posture space-time characteristics from the screened joint points, wherein the characteristics in space dimension are as follows:
F_spatial = [T, Θ, D, Ψ, A]
the method comprises the following steps that K represents joint points of human body postures, K is 1,2, 20, N represents the frame number of a human body motion time sequence, human body hip joint points are selected as the mass center of a human body, T represents a joint coordinate track characteristic matrix, theta represents a direction matrix of each deleted joint point relative to the mass center of the human body, D represents a space distance matrix of any two joint points, psi represents a direction matrix of a vector formed by any 2 joints relative to an upward vector of the mass center, and A represents a 3 internal angle size matrix formed by any 3 joint points;
the features in the time dimension are:
F_temporal = [ΔT, ΔΘ, ΔD, ΔΨ, ΔA]
where ΔT is the trajectory displacement matrix of the joint points, ΔΘ is the change in direction of each joint point as it is displaced, ΔD is the matrix of changes in the distance between any two joint points over time, ΔΨ is the change in direction of the vector between any two joint points relative to the upward vector at the centroid, and ΔA is the matrix of changes in the interior angles formed by any three joint points.
The extracted pose spatiotemporal features are represented as:
F_pose = F_spatial + F_temporal
4. The method for human behavior recognition in a home security system with potential information fusion as claimed in claim 1, wherein in the fifth step the detected human body is used as a clue and the effective object interacting with the human is used as a high-level clue; the features of the object interacting with the human are extracted with a convolutional neural network into which the position relation between the object and the detected human body is implicitly integrated;
a loss function is used in the training process, parameters are adjusted when loss is propagated reversely, and a mixed loss function calculation formula is as follows:
L(M, D) = L_main(M, D) + α L_hint(M, D)
where L_main(M, D) is the loss function of interactive-object feature extraction, L_hint(M, D) is the loss function of the distance-hint task, M is the network model, D = {(x_i, y_i)}, i = 1, ..., N, is the training set of N sample pictures, x_i denotes the i-th of the N images, y_i denotes the corresponding category label, and α takes a value between 0 and 1.
5. The method for human behavior recognition in a home security system for providing potential information fusion as claimed in claim 1, wherein in the sixth step, since the response degree of the attitude space-time feature and the interactive object feature to different human behavior recognition is different, the two obtained features are weighted and fused, and the formula is as follows:
F = w_1 F_pose + w_2 F_object
where w_1 is the weighting coefficient of the posture spatio-temporal features, w_2 is the weighting coefficient of the human-interaction-object features, w_1 + w_2 = 1, F_pose denotes the posture spatio-temporal features and F_object the interactive-object features.
6. The method for providing human behavior recognition in the home security system with potential information fusion according to claim 1, wherein in the seventh step, the fused feature vector is input into an SVM classifier for classification, and a final recognition result is obtained.
CN202010116795.8A 2020-02-25 2020-02-25 Method for recognizing human body behaviors in potential information fusion home security system Active CN111310689B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010116795.8A CN111310689B (en) 2020-02-25 2020-02-25 Method for recognizing human body behaviors in potential information fusion home security system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010116795.8A CN111310689B (en) 2020-02-25 2020-02-25 Method for recognizing human body behaviors in potential information fusion home security system

Publications (2)

Publication Number Publication Date
CN111310689A true CN111310689A (en) 2020-06-19
CN111310689B CN111310689B (en) 2023-04-07

Family

ID=71149293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010116795.8A Active CN111310689B (en) 2020-02-25 2020-02-25 Method for recognizing human body behaviors in potential information fusion home security system

Country Status (1)

Country Link
CN (1) CN111310689B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170046567A1 (en) * 2015-04-16 2017-02-16 California Institute Of Technology Systems and Methods for Behavior Detection Using 3D Tracking and Machine Learning
WO2018095082A1 (en) * 2016-11-28 2018-05-31 江苏东大金智信息***有限公司 Rapid detection method for moving target in video monitoring
WO2018130016A1 (en) * 2017-01-10 2018-07-19 哈尔滨工业大学深圳研究生院 Parking detection method and device based on monitoring video
CN110096950A (en) * 2019-03-20 2019-08-06 西北大学 A kind of multiple features fusion Activity recognition method based on key frame
CN110378281A (en) * 2019-07-17 2019-10-25 青岛科技大学 Group Activity recognition method based on pseudo- 3D convolutional neural networks
CN110555387A (en) * 2019-08-02 2019-12-10 华侨大学 Behavior identification method based on local joint point track space-time volume in skeleton sequence
CN110826453A (en) * 2019-10-30 2020-02-21 西安工程大学 Behavior identification method by extracting coordinates of human body joint points

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李海涛: "基于DTW约束的动作行为识别", 《计算机仿真》 *
郑潇等: "基于姿态时空特征的人体行为识别方法", 《计算机辅助设计与图形学学报》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112381072A (en) * 2021-01-11 2021-02-19 西南交通大学 Human body abnormal behavior detection method based on time-space information and human-object interaction
CN112381072B (en) * 2021-01-11 2021-05-25 西南交通大学 Human body abnormal behavior detection method based on time-space information and human-object interaction
CN113487596A (en) * 2021-07-26 2021-10-08 盛景智能科技(嘉兴)有限公司 Working strength determination method and device and electronic equipment

Also Published As

Publication number Publication date
CN111310689B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
KR102462572B1 (en) Systems and methods for training object classifiers by machine learning
Torralba Context-based vision system for place and object recognition
CN106897670B (en) Express violence sorting identification method based on computer vision
EP3092619B1 (en) Information processing apparatus and information processing method
Jalal et al. The state-of-the-art in visual object tracking
CN108805900B (en) Method and device for determining tracking target
Shahzad et al. A smart surveillance system for pedestrian tracking and counting using template matching
CN109583315B (en) Multichannel rapid human body posture recognition method for intelligent video monitoring
Seow et al. Neural network based skin color model for face detection
CN103020992B (en) A kind of video image conspicuousness detection method based on motion color-associations
Zin et al. Fusion of infrared and visible images for robust person detection
Damen et al. Detecting carried objects from sequences of walking pedestrians
CN110929593A (en) Real-time significance pedestrian detection method based on detail distinguishing and distinguishing
Guo et al. Improved hand tracking system
García-Martín et al. Robust real time moving people detection in surveillance scenarios
CN111310689B (en) Method for recognizing human body behaviors in potential information fusion home security system
Nosheen et al. Efficient Vehicle Detection and Tracking using Blob Detection and Kernelized Filter
CN115116132A (en) Human behavior analysis method for deep perception in Internet of things edge service environment
Miller et al. Person tracking in UAV video
CN112347967B (en) Pedestrian detection method fusing motion information in complex scene
Panini et al. A machine learning approach for human posture detection in domotics applications
Hernández et al. People counting with re-identification using depth cameras
CN112183287A (en) People counting method of mobile robot under complex background
Kompella et al. Detection and avoidance of semi-transparent obstacles using a collective-reward based approach
CN113763418B (en) Multi-target tracking method based on head and shoulder detection

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant