CN111028271A - Multi-camera personnel three-dimensional positioning and tracking system based on human skeleton detection

Multi-camera personnel three-dimensional positioning and tracking system based on human skeleton detection

Info

Publication number
CN111028271A
CN111028271A
Authority
CN
China
Prior art keywords
camera
tracking
points
matching
human
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911244121.XA
Other languages
Chinese (zh)
Other versions
CN111028271B (en)
Inventor
王锦文
麦全深
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haoyun Technologies Co Ltd
Original Assignee
Haoyun Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Haoyun Technologies Co Ltd filed Critical Haoyun Technologies Co Ltd
Priority to CN201911244121.XA priority Critical patent/CN111028271B/en
Publication of CN111028271A publication Critical patent/CN111028271A/en
Application granted granted Critical
Publication of CN111028271B publication Critical patent/CN111028271B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/292Multi-camera tracking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/30Determination of transform parameters for the alignment of images, i.e. image registration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30004Biomedical image processing
    • G06T2207/30008Bone
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a multi-camera personnel three-dimensional positioning and tracking system based on human body skeleton detection, which comprises: a parameter calibration module for calibrating the plurality of cameras in the target area and obtaining and storing each camera's internal and external parameters; a synchronization setting module for computing the foreground change between two consecutive frames collected by each camera to obtain a foreground change value, the cameras being determined to be synchronous when every camera's foreground change value exceeds a preset foreground threshold; a detection matching module for recognizing the target person images acquired by the cameras with a mathematical model and extracting human skeleton points, mapping the skeleton points to space rays through the cameras' internal and external parameters, and matching the space rays with a matching algorithm; and a tracking display module for computing, from each target person's position in the world coordinate system, the spatial average coordinate of all points as the tracking point, tracking the target with a tracking method, and displaying the matched target persons.

Description

Multi-camera personnel three-dimensional positioning and tracking system based on human skeleton detection
Technical Field
The invention relates to the technical field of image processing, and in particular to a multi-camera personnel three-dimensional positioning and tracking system based on human body skeleton detection.
Background
Because a monocular camera has a limited field of view and monocular positioning is inaccurate, practical target positioning and tracking systems mostly use multiple cameras. A multi-camera system helps with occlusion and with tracking moving targets under sudden changes of ambient illumination, but it introduces new problems, chiefly time synchronization among the cameras and matching among multiple targets. The main idea of traditional multi-camera target positioning and tracking is to calibrate the cameras' internal and external parameters, extract targets with traditional methods such as background modeling or frame differencing, then register and fuse the data to group detections belonging to the same target, and finally position and track.
The difficulty of target detection in traditional methods lies mainly in poor robustness to illumination change, jitter, background disturbance, shadows and similar colors, which leads to poor detection results; combined with the poor matching robustness of traditional feature-fusion methods, these two factors make multi-camera tracking perform poorly.
Disclosure of Invention
The invention provides a multi-camera personnel three-dimensional positioning and tracking system based on human body skeleton detection. Detected human posture points are mapped into space according to the camera parameters to obtain corresponding rays; the closest distance between two space lines is used to build a cost matrix; a matching algorithm then computes the matching result, which is positioned and tracked. This addresses the technical problems of poor detection performance and poor robustness in traditional target detection methods, improving the detection result, increasing robustness, and thereby improving the multi-camera tracking effect.
In order to solve the above technical problem, an embodiment of the present invention provides a multi-camera personnel three-dimensional positioning and tracking system based on human skeleton detection, including:
the parameter calibration module is used for calibrating a plurality of cameras in the target area to obtain and store internal parameters and external parameters of each camera;
the synchronization setting module is used for computing the foreground change between two consecutive frames collected by each camera to obtain a foreground change value, and determining that the cameras are synchronous when every camera's foreground change value is greater than a preset foreground threshold;
the detection matching module is used for identifying a target person image acquired by the camera through a mathematical model and extracting human skeleton points after the camera synchronization is determined, mapping the human skeleton points through internal parameters and external parameters of the camera to obtain space rays, and matching the space rays through a matching algorithm;
and the tracking display module is used for respectively calculating the space average coordinates of all the points according to the position of each target person in the world coordinate system after matching is finished, tracking the target by using the space average coordinates as tracking points through a tracking method, and displaying the obtained matched target person.
As a preferred scheme, the parameter calibration module calibrates the plurality of cameras in the target area as follows:
respectively collecting a plurality of images within each camera's field of view, and inputting the collected images into a calibration tool to obtain each camera's internal parameters;
and selecting corner points in the common area captured by the plurality of cameras, pairing each corner's world-coordinate position with its coordinates in the collected images as mapping points, and inputting the mapping points into an open-source tool to obtain each camera's external parameters.
Preferably, the internal parameters include a focal length and principal point coordinates of the camera; the extrinsic parameters are parameters of the camera in a world coordinate system.
As a preferred scheme, the synchronization setting module calculates the foreground change value as follows:
converting the two consecutively collected frames into grayscale images and smoothing them with Gaussian filtering to obtain two processed images;
performing a difference operation on the two processed images, and binarizing the differenced images to obtain two binarized images;
and respectively calculating the proportion of foreground change relative to the whole image for the two binarized images, and taking the ratio as the frame image's foreground change value.
As a preferred scheme, the detection matching module obtains the space rays by mapping, specifically including:
converting the human skeleton points into points of a camera coordinate system;
and converting the principal point coordinates of the camera coordinate system and the coordinate points converted from the human body skeleton points into a world coordinate system to obtain a space ray.
As a preferred scheme, the detection matching module is configured to perform a step of matching the spatial ray through a matching algorithm, and specifically includes:
calculating the distance between the space lines of each person in the images acquired by the first camera and the second camera, taking the distance as a loss value, and doubling the loss value when it exceeds a preset loss threshold;
and taking the loss value as an input parameter, and performing calculation and prediction through a matching algorithm to obtain a matching result.
As a preferred scheme, the matching algorithm is the Hungarian maximum matching algorithm.
Preferably, the tracking method is a Euclidean distance tracking method.
As a preferred scheme, the tracking display module is configured to display the obtained matched target person, and specifically includes:
setting a displayed model matrix, a projection matrix and a viewport matrix;
connecting the space average coordinate and the detected human skeleton point as corresponding points;
setting an ID for each tracking point to display the tracking point's position;
and performing interactive operations on the tracking points so that the target person can be observed from different angles.
Preferably, the interaction comprises rotation, zooming or translation.
Compared with the prior art, the embodiment of the invention has the following beneficial effects:
according to the method, the detected human body posture points are mapped to the space according to the internal parameters of the camera to obtain corresponding rays, the value obtained by the closest distance between two straight lines in the space is used as a cost matrix, the matching result is obtained by combining the calculation of a matching algorithm, and then the result is positioned and tracked, so that the technical problems of poor detection effect and robustness in the traditional method for target detection are solved, the target detection effect is improved, the robustness is increased, and the multi-camera tracking effect is further improved.
Drawings
FIG. 1: a structural schematic diagram of the multi-camera personnel three-dimensional positioning and tracking system based on human body skeleton detection disclosed by the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, a preferred embodiment of the present invention provides a multi-camera personnel three-dimensional positioning and tracking system based on human skeleton detection, which includes: the device comprises a parameter calibration module, a synchronous setting module, a detection matching module and a tracking display module.
The parameter calibration module is used for calibrating a plurality of cameras in the target area to obtain and store internal parameters and external parameters of each camera; in this embodiment, the intrinsic parameters include a focal length and principal point coordinates of the camera; the extrinsic parameters are parameters of the camera in a world coordinate system.
In this embodiment, the parameter calibration module calibrates the plurality of cameras in the target area as follows: 1. respectively collect a plurality of images within each camera's field of view, and input the collected images into a calibration tool to obtain each camera's internal parameters; 2. select corner points in the common area captured by the cameras, pair each corner's world-coordinate position with its coordinates in the collected images as mapping points, and input the mapping points into an open-source tool to obtain each camera's external parameters.
In this system embodiment, three cameras monitor the same area. The cameras are mounted overhead on both sides of a wall, which effectively alleviates missed matches caused by persons occluding one another and enables multi-target positioning, matching and tracking. Specifically, each camera is calibrated separately to obtain its internal and external parameters, which are stored for subsequent use. The calibration process is as follows: (1) using a prefabricated 5-by-6 checkerboard, about 40 images at different distances are collected within each camera's field of view, and the OpenCV calibration tool is used to compute and store each camera's internal parameters. (2) Considering the particularity of the application environment, corner points between floor tiles in the common area of the three cameras are manually selected, pairing each corner's image coordinates with its world-coordinate position as mapping points; each camera then uses the collected points to compute and store its external parameters, completing the internal and external calibration of the cameras.
The internal parameters of a camera are the camera's own parameters, mainly comprising the focal length, the principal point coordinates and the distortion parameters; the distortion parameters need no processing here and are therefore ignored. The internal parameter matrix is:
K = | fx  0   Cx |
    | 0   fy  Cy |
    | 0   0   1  |
where fx and fy denote the focal length of the camera, and Cx and Cy denote the principal point coordinates.
The calibration tool obtains the internal parameters as follows: each camera is calibrated separately, collecting about 40 images of a 6-by-5 checkerboard at different positions and distances from the camera. Each collected image is first converted to grayscale; the OpenCV function cv2.findChessboardCorners is then called to locate the corner points on the checkerboard (the intersections of diagonally adjacent squares), and the obtained corner coordinates are passed to cv2.calibrateCamera, which returns the camera's internal parameters.
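For illustration, a minimal Python sketch of this intrinsic-calibration step follows; the image paths, the 5×4 inner-corner pattern size and the unit square size are assumptions, not values fixed by the description above.

```python
import glob

import cv2
import numpy as np

# Inner-corner grid of the checkerboard (assumed 5x4: one less per side
# than the number of squares on a 6-by-5 board).
PATTERN = (5, 4)
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for path in glob.glob("cam1/*.jpg"):  # ~40 views of the board (assumed layout)
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Returns the intrinsic matrix K (fx, fy, Cx, Cy) and the distortion
# coefficients, which the description above chooses to ignore.
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
```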
The external parameters of a camera are its parameters in the world coordinate system; they determine the camera's position and orientation in three-dimensional space and mainly comprise a rotation matrix R and a translation matrix T. The extrinsic calibration here is application-specific: it mainly uses a shared common area of the indoor environment to calibrate the external parameters.
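The description does not name the open-source tool used for the extrinsic step; a sketch under the assumption that cv2.solvePnP plays that role, with hypothetical intrinsics and tile-corner coordinates, might look like this:

```python
import cv2
import numpy as np

# Hypothetical intrinsics from the previous step, and manually picked
# floor-tile corners: world coordinates (metres) paired with pixel positions.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
world_pts = np.array([[0.0, 0.0, 0.0], [0.6, 0.0, 0.0],
                      [0.0, 0.6, 0.0], [0.6, 0.6, 0.0]], np.float32)
image_pts = np.array([[412.0, 305.0], [498.0, 311.0],
                      [405.0, 388.0], [493.0, 395.0]], np.float32)

ok, rvec, tvec = cv2.solvePnP(world_pts, image_pts, K, None)  # distortion ignored
R, _ = cv2.Rodrigues(rvec)  # 3x3 rotation matrix R; tvec is the translation T
```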
The synchronization setting module computes the foreground change between two consecutive frames collected by each camera to obtain a foreground change value; when every camera's foreground change value exceeds a preset foreground threshold, the cameras are determined to be synchronous. In this embodiment, the module calculates the foreground change value as follows: 1. convert the two consecutively collected frames into grayscale images and smooth them with Gaussian filtering to obtain two processed images; 2. perform a difference operation on the two processed images and binarize the differenced images to obtain the binarized images; 3. calculate the proportion of foreground change relative to the whole image and take this ratio as the frame image's foreground change value.
Specifically, during use the cameras may drift out of time alignment, making positioning inaccurate, so camera synchronization is essential. At present the cameras are aligned manually, chiefly by switching the indoor lights: the foreground change between two consecutive frames is computed for each camera, and when one camera's change exceeds a certain threshold, the system waits until every other camera's change also exceeds the threshold, at which point the three cameras are considered synchronized. The foreground change is computed as follows: (1) convert the two frames to grayscale and smooth them with Gaussian filtering; (2) take the difference of the two images; (3) binarize the difference image; (4) compute the proportion of foreground change in the whole image and compare it with the set threshold.
The system is intended for indoor positioning, so a simple and convenient approach is to switch the lights on and off once and judge synchronization from the resulting large change in image gray levels.
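A sketch of this synchronization test follows; the binarization level and the 5×5 Gaussian kernel are assumed values, not fixed by the description:

```python
import cv2
import numpy as np

def foreground_change(prev_bgr, curr_bgr, diff_thresh=25):
    """Ratio of changed pixels between two consecutive frames."""
    g0 = cv2.GaussianBlur(cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY), (5, 5), 0)
    g1 = cv2.GaussianBlur(cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY), (5, 5), 0)
    diff = cv2.absdiff(g0, g1)                     # step (2): frame difference
    _, binary = cv2.threshold(diff, diff_thresh, 255, cv2.THRESH_BINARY)  # step (3)
    return np.count_nonzero(binary) / binary.size  # step (4): foreground ratio

# The cameras are declared synchronous once every camera's ratio has
# exceeded the preset foreground threshold.
```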
The detection matching module is used for identifying a target person image acquired by the camera through a mathematical model and extracting human skeleton points after the camera synchronization is determined, mapping the human skeleton points through internal parameters and external parameters of the camera to obtain space rays, and matching the space rays through a matching algorithm;
in this embodiment, the specific step of the detection matching module for obtaining the spatial ray through mapping includes: 1. converting the human skeleton points into points of a camera coordinate system; 2. and converting the principal point coordinates of the camera coordinate system and the coordinate points converted from the human body skeleton points into a world coordinate system to obtain a space ray.
In this embodiment, the detection matching module matches the space rays through a matching algorithm as follows: 1. calculate the distance between the space lines of each person in the images acquired by the first camera and the second camera, take the distance as a loss value, and double the loss value when it exceeds a preset loss threshold; 2. take the loss values as input parameters and compute the prediction with the matching algorithm to obtain the matching result. In this embodiment, the matching algorithm is the Hungarian maximum matching algorithm.
Specifically, human skeleton information is extracted from the images of the three cameras with the deep-learning OpenPose model; the information is stored in a queue, and the matching algorithm takes skeleton information from the queue for matching. The skeleton points of the three cameras need to be matched; the matching principle is to solve a local optimum, with the following strategy: camera 1 is matched with camera 2, camera 2 is matched with camera 3, and the final matching results are fused and output.
A space ray corresponding to each skeleton point of each person is then calculated, and the space rays are matched person by person with the matching algorithm. The skeleton points output for each person's position in each camera are mapped to space rays through the camera's internal and external parameters. Briefly, the principle is: a point in the image is converted into a point in the camera coordinate system, denoted p; the points (0, 0, 0) and p in the camera coordinate system are then converted into the world coordinate system, yielding a space ray. The camera coordinate system is a three-dimensional rectangular coordinate system whose origin is the camera's optical centre and whose Z axis is the optical axis; its X and Y axes are parallel to the X and Y axes of the image.
The space rays of each camera are matched person by person. The cameras are numbered 1, 2 and 3, and each person in each camera is assigned an id. The matching method solves a locally optimal solution: only the matching between camera 1 and camera 2 and between camera 2 and camera 3 is solved, and the person targets are finally obtained. The matching algorithm works as follows (taking the matching of cameras 1 and 2 as an illustration):
1. and calculating the distance between each person in the camera 1 and each person in the camera 2, so that each person in the camera 1 and each person in the camera 2 have a corresponding distance, and punishing the loss with too large distance by taking the distance as a loss value.
The matching described here is spatial-distance matching in units of persons: the matching result is obtained by finding, over all of a person's points, which person in the other camera is closest in space-line distance. A person's points are mapped into space rays through the camera's internal and external parameters, as follows. Given a point P(u, v) in the image, it is converted to the camera coordinate system by:
x = (u - Cx) * Z / fx,   y = (v - Cy) * Z / fy,   z = Z
where (fx, fy, Cx, Cy) are the internal parameters obtained by calibrating the camera and Z is any fixed value greater than 0. This gives the point's coordinate Pm(x, y, z) in the camera coordinate system, whose origin is Pn(0, 0, 0). Pm and Pn are then converted into points in the world coordinate system by:
Pw = [R|T]^(-1) * Pc
where Pw is the point coordinate in the world coordinate system, Pc is the coordinate point in the camera coordinate system, and R and T are the external parameters obtained by calibrating the camera.
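The two conversions above can be combined into one helper; the following sketch assumes K, R and T in the conventions just described (X_cam = R * X_world + T):

```python
import numpy as np

def pixel_to_world_ray(u, v, K, R, T, Z=1.0):
    """Map an image point P(u, v) to a world-space ray: returns the camera
    origin Pn and a second ray point Pm, both in world coordinates."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    pm = np.array([(u - cx) * Z / fx, (v - cy) * Z / fy, Z])  # camera frame
    pn = np.zeros(3)                                          # camera origin

    def to_world(p):
        # Invert X_cam = R @ X_world + T  =>  X_world = R^T @ (X_cam - T)
        return R.T @ (p - np.ravel(T))

    return to_world(pn), to_world(pm)
```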
In this way the space line of each person in each camera is obtained; the distance between space lines is then computed pairwise per person, and the distances are accumulated as the loss value. A loss value that is too large must be penalized: a larger loss represents a higher probability that the two detections are not the same object in different cameras, so when the loss value exceeds 50 cm the match between the two persons is penalized by doubling their loss value.
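A sketch of this per-person loss, assuming each joint's ray is stored as an (origin, point) pair in metres:

```python
import numpy as np

def line_distance(o1, p1, o2, p2):
    """Closest distance between two space lines, each given by two points."""
    d1, d2 = p1 - o1, p2 - o2
    n = np.cross(d1, d2)
    if np.linalg.norm(n) < 1e-9:  # near-parallel lines
        return np.linalg.norm(np.cross(d1, o2 - o1)) / np.linalg.norm(d1)
    return abs(np.dot(n, o2 - o1)) / np.linalg.norm(n)

def person_loss(rays_a, rays_b, threshold=0.5):
    """Accumulated joint-to-joint line distance between two detections;
    doubled above the 50 cm threshold named in the text (metres assumed)."""
    loss = sum(line_distance(oa, pa, ob, pb)
               for (oa, pa), (ob, pb) in zip(rays_a, rays_b))
    return 2 * loss if loss > threshold else loss
```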
2. Suppose a person in camera 1, denoted p1, has small loss values against two persons in camera 2, denoted p2 and p3. To improve the robustness of the algorithm, additional processing is applied. Let the sum of all space-line distances between p1 and p2 be loss12, and between p1 and p3 be loss13. Assuming the height of each joint point of a person is known, the world coordinate positions (x_w, y_w, z_w) corresponding to each two-dimensional coordinate (xc, yc) at each height are found from the camera's internal and external parameters, and the average of these three-dimensional coordinates, denoted (X, Y, Z), is computed. A Gaussian weight distribution centred on (X, Y) is then constructed; the scores of p1 and p2 against this distribution are computed, and their sum is taken as a second loss for p1 and p2, denoted loss12_1. The score of p1 and p3 against the Gaussian distribution is computed in the same way. The final loss value of p1 and p2 is then formulated as:
loss = a * loss12 + b * loss12_1, where a = 0.75 and b = 0.25;
wherein the Gaussian distribution is:

f(x, y) = 1 / (2 * π * σ^2) * exp(-((x - ux)^2 + (y - uy)^2) / (2 * σ^2))
where (ux, uy) is the centre of the Gaussian function. When calculating the score, the (x, y) value of each of the person's points is substituted into the formula to obtain each point's Gaussian probability density; the densities of all the person's points are summed, denoted SUM, and each point's density is divided by SUM to give that point's Gaussian score, denoted pscore. Then, for all of the person's points, the spatial coordinates (Xp, Yp, Zp) and the corresponding points (x_w, y_w, z_w) estimated from the person's height are compared: the Euclidean distance in the x and y directions is taken as the difference between the real point and the estimated point, denoted dist, so the final distance of the point is d = dist * pscore. All points are calculated this way and the results added, giving the person's distance loss under the Gaussian distribution.
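A compact sketch of this Gaussian scoring follows; σ is an assumed spread parameter, which the description does not fix:

```python
import numpy as np

def gaussian_weighted_loss(real_xy, est_xy, centre, sigma=0.3):
    """Distance loss under the Gaussian distribution described above.
    real_xy: the (x, y) parts of the person's spatial points (Xp, Yp);
    est_xy:  the (x_w, y_w) points estimated from the person's height;
    centre:  the Gaussian centre (X, Y); sigma is an assumed spread."""
    real_xy, est_xy = np.asarray(real_xy), np.asarray(est_xy)
    ux, uy = centre
    d2 = (real_xy[:, 0] - ux) ** 2 + (real_xy[:, 1] - uy) ** 2
    dens = np.exp(-d2 / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)
    pscore = dens / dens.sum()                       # per-point Gaussian score
    dist = np.linalg.norm(real_xy - est_xy, axis=1)  # x-y Euclidean distance
    return float(np.sum(dist * pscore))              # d = dist * pscore, summed
```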
3. Using the loss values as parameters, the Hungarian maximum matching algorithm predicts the matching result, which is the pair of ID numbers of the matched persons. An ID is simply an identifier assigned to a person to distinguish them from others.
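The assignment itself can be sketched with SciPy's implementation of the Hungarian algorithm (linear_sum_assignment minimises total cost, which suits the loss values above; the matrix entries are example values only):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i, j] is the final (penalised, Gaussian-weighted) loss between person i
# in camera 1 and person j in camera 2.
cost = np.array([[0.12, 0.90],
                 [0.85, 0.15]])

rows, cols = linear_sum_assignment(cost)         # minimum-cost assignment
pairs = list(zip(rows.tolist(), cols.tolist()))  # matched pairs, e.g. [(0, 0), (1, 1)]
```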
4. The same processing is performed for cameras 2 and 3 to obtain their matching result.
Finally, the matching results of cameras 1 and 2 and of cameras 2 and 3 are fused. As an example: suppose camera 1's picture contains 2 persons with IDs 1 and 2, camera 2's picture contains 2 persons with IDs 10 and 20, and camera 3's picture contains 2 persons with IDs 100 and 200. If, after matching, the person with ID 1 in camera 1 matches the person with ID 20 in camera 2, and the person with ID 20 in camera 2 matches the person with ID 100 in camera 3, then IDs 1, 20 and 100 are the same person. For the remaining persons with IDs 2, 10 and 200: if ID 2 matches ID 10 and ID 10 matches ID 200, they belong to the same person; if ID 2 matches ID 10 but ID 10 does not match ID 200, then 2 and 10 are the same person and ID 200 is handled as a separate person in camera 3.
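A minimal sketch of this fusion rule, using the example ID numbers from the text:

```python
# Chain the pairwise matches (camera 1 -> 2, camera 2 -> 3) into global
# identities.
match_12 = {1: 20, 2: 10}   # cam1 ID -> matched cam2 ID
match_23 = {20: 100}        # cam2 ID -> matched cam3 ID (ID 10 found no match)

fused = []
for id1, id2 in match_12.items():
    person = [("cam1", id1), ("cam2", id2)]
    if id2 in match_23:
        person.append(("cam3", match_23[id2]))
    fused.append(person)

# fused -> [[('cam1', 1), ('cam2', 20), ('cam3', 100)], [('cam1', 2), ('cam2', 10)]]
# Cam3 IDs never reached through the chain (here, 200) are handled as
# separate persons in camera 3.
```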
The tracking display module, after matching is finished, computes for each target person the spatial average coordinate of all points from the person's position in the world coordinate system, uses it as the tracking point to track the target with a tracking method, and displays the matched target persons. In this embodiment, the tracking method is the Euclidean distance tracking method.
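The description names only the Euclidean-distance method; one possible sketch, with an assumed gating distance, is:

```python
import numpy as np

class EuclideanTracker:
    """Minimal sketch of Euclidean-distance tracking: a detection inherits
    the ID of the nearest existing track within max_dist (an assumed gate,
    in metres); otherwise it starts a new track."""
    def __init__(self, max_dist=0.8):
        self.tracks = {}   # track ID -> last tracking point (Xu, Yu, Zu)
        self.next_id = 0
        self.max_dist = max_dist

    def update(self, points):
        assigned, used = {}, set()
        for p in points:   # one tracking point per matched person
            p = np.asarray(p, dtype=float)
            candidates = sorted(
                ((tid, np.linalg.norm(c - p)) for tid, c in self.tracks.items()
                 if tid not in used),
                key=lambda tc: tc[1])
            if candidates and candidates[0][1] < self.max_dist:
                tid = candidates[0][0]   # continue the nearest track
            else:
                tid, self.next_id = self.next_id, self.next_id + 1
            used.add(tid)
            self.tracks[tid] = p
            assigned[tid] = p
        return assigned    # ID -> current tracking point
```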
In this embodiment, the tracking display module displays the matched target persons as follows: 1. set the displayed model matrix, projection matrix and viewport matrix; 2. connect the space average coordinates and the detected human skeleton points as corresponding points; 3. set an ID for each tracking point to display its position; 4. perform interactive operations on the tracking points so that the target person can be observed from different angles. In this embodiment, the interactive operations include rotation, zooming and translation.
Specifically, after matching is completed, the spatial average coordinate of all of each person's points in the world coordinate system is computed and denoted (Xu, Yu, Zu), and this tracking point is used to track the target with the Euclidean distance tracking method. The resulting spatial points of each matched person are displayed using OpenGL. The display steps are: (1) set OpenGL's model matrix, projection matrix and viewport matrix; (2) connect the spatial point coordinates (x, y, z) according to the connecting lines of the corresponding detected human key points, using glBegin(GL_LINES) and glEnd(); (3) set the display position of the tracking ID with glRasterPos3f and render the person's tracking ID with glutBitmapCharacter; (4) provide simple interaction: rotation with glRotatef, scaling with glScalef and translation with glTranslatef, so that the target can be viewed from different angles.
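Steps (2) and (3) can be sketched with the PyOpenGL bindings of the functions named above; the font constant and the data layout are assumptions:

```python
from OpenGL.GL import GL_LINES, glBegin, glEnd, glRasterPos3f, glVertex3f
from OpenGL.GLUT import GLUT_BITMAP_HELVETICA_18, glutBitmapCharacter

def draw_person(skeleton_edges, track_id, anchor):
    """Draw one matched person: skeleton segments as GL_LINES, then the
    tracking ID rendered at the person's tracking point (assumes a GL
    context and matrices already set up as in step (1))."""
    glBegin(GL_LINES)
    for (x0, y0, z0), (x1, y1, z1) in skeleton_edges:  # joint-to-joint pairs
        glVertex3f(x0, y0, z0)
        glVertex3f(x1, y1, z1)
    glEnd()
    glRasterPos3f(*anchor)  # position of the ID label, e.g. (Xu, Yu, Zu)
    for ch in str(track_id):
        glutBitmapCharacter(GLUT_BITMAP_HELVETICA_18, ord(ch))
```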
The invention mainly addresses the poor extraction performance of traditional algorithms under various disturbing factors, and at the same time improves on the poor matching performance of traditional feature-fusion algorithms. By combining deep learning with a matching algorithm, the multi-camera target positioning and tracking algorithm becomes more robust and accurate. Deep-learning pose detection effectively resists illumination change, jitter, background disturbance, shadows, colour similarity and the like, so target detection and extraction are greatly superior to traditional methods; meanwhile, using the space line distance as the cost matrix and the Hungarian algorithm for maximum bipartite matching yields results superior to traditional feature extraction and fusion methods.
The above-mentioned embodiments are provided to further explain the objects, technical solutions and advantages of the present invention in detail, and it should be understood that the above-mentioned embodiments are only examples of the present invention and are not intended to limit the scope of the present invention. It should be understood that any modifications, equivalents, improvements and the like, which come within the spirit and principle of the invention, may occur to those skilled in the art and are intended to be included within the scope of the invention.

Claims (10)

1. A multi-camera personnel three-dimensional positioning and tracking system based on human skeleton detection is characterized by comprising:
the parameter calibration module is used for calibrating a plurality of cameras in the target area to obtain and store internal parameters and external parameters of each camera;
the synchronization setting module is used for computing the foreground change between two consecutive frames collected by each camera to obtain a foreground change value, and determining that the cameras are synchronous when every camera's foreground change value is greater than a preset foreground threshold;
the detection matching module is used for identifying a target person image acquired by the camera through a mathematical model and extracting human skeleton points after the camera synchronization is determined, mapping the human skeleton points through internal parameters and external parameters of the camera to obtain space rays, and matching the space rays through a matching algorithm;
and the tracking display module is used for respectively calculating the space average coordinates of all the points according to the position of each target person in the world coordinate system after matching is finished, tracking the target by using the space average coordinates as tracking points through a tracking method, and displaying the obtained matched target person.
2. The human skeleton detection-based multi-camera personnel three-dimensional positioning and tracking system of claim 1, wherein the parameter calibration module is used for calibrating the plurality of cameras in the target area, and specifically comprises:
respectively collecting a plurality of images within each camera's field of view, and inputting the collected images into a calibration tool to obtain each camera's internal parameters;
and selecting corner points in the common area captured by the plurality of cameras, pairing each corner's world-coordinate position with its coordinates in the collected images as mapping points, and inputting the mapping points into an open-source tool to obtain each camera's external parameters.
3. The human skeleton detection-based multi-camera human three-dimensional positioning and tracking system of claim 1, wherein the internal parameters comprise focal lengths and principal point coordinates of the cameras; the extrinsic parameters are parameters of the camera in a world coordinate system.
4. The human skeleton detection-based multi-camera human three-dimensional positioning and tracking system of claim 1, wherein the synchronization setting module is configured to calculate a foreground variation value, and specifically comprises:
converting the two consecutively collected frames into grayscale images and smoothing them with Gaussian filtering to obtain two processed images;
performing a difference operation on the two processed images, and binarizing the differenced images to obtain two binarized images;
and respectively calculating the proportion of foreground change relative to the whole image for the two binarized images, and taking the ratio as the frame image's foreground change value.
5. The human skeleton detection-based multi-camera human three-dimensional positioning and tracking system of claim 1, wherein the detection matching module is used for mapping spatial rays, and comprises the following specific steps:
converting the human skeleton points into points of a camera coordinate system;
and converting the principal point coordinates of the camera coordinate system and the coordinate points converted from the human body skeleton points into a world coordinate system to obtain a space ray.
6. The human skeleton detection-based multi-camera personnel three-dimensional positioning and tracking system of claim 1, wherein the detection matching module is used for matching the spatial rays through a matching algorithm, and specifically comprises:
calculating the distance between the space lines of each person in the images acquired by the first camera and the second camera, taking the distance as a loss value, and doubling the loss value when it exceeds a preset loss threshold;
and taking the loss value as an input parameter, and performing calculation and prediction through a matching algorithm to obtain a matching result.
7. The multi-camera personnel three-dimensional positioning and tracking system based on human skeleton detection as claimed in claim 6, wherein said matching algorithm is the Hungarian maximum matching algorithm.
8. The human skeleton detection-based multi-camera human three-dimensional positioning and tracking system of claim 1, wherein the tracking method is a Euclidean distance tracking method.
9. The human skeleton detection-based multi-camera human three-dimensional positioning and tracking system of claim 1, wherein the tracking display module is configured to display the obtained matched target person, and specifically comprises:
setting a displayed model matrix, a projection matrix and a viewport matrix;
connecting the space average coordinate and the detected human skeleton point as corresponding points;
setting ID for the tracking point to display the position of the tracking point;
and carrying out interactive operation on the tracking points to realize that the target person is observed from different angles.
10. The human skeleton detection-based multi-camera personnel three-dimensional positioning and tracking system of claim 9, wherein the interactive operation comprises rotation, zooming or translation.
CN201911244121.XA 2019-12-06 2019-12-06 Multi-camera personnel three-dimensional positioning and tracking system based on human skeleton detection Active CN111028271B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911244121.XA CN111028271B (en) 2019-12-06 2019-12-06 Multi-camera personnel three-dimensional positioning and tracking system based on human skeleton detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911244121.XA CN111028271B (en) 2019-12-06 2019-12-06 Multi-camera personnel three-dimensional positioning and tracking system based on human skeleton detection

Publications (2)

Publication Number Publication Date
CN111028271A true CN111028271A (en) 2020-04-17
CN111028271B CN111028271B (en) 2023-04-14

Family

ID=70204589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911244121.XA Active CN111028271B (en) 2019-12-06 2019-12-06 Multi-camera personnel three-dimensional positioning and tracking system based on human skeleton detection

Country Status (1)

Country Link
CN (1) CN111028271B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101154289A (en) * 2007-07-26 2008-04-02 上海交通大学 Method for tracing three-dimensional human body movement based on multi-camera
CN102507134A (en) * 2011-10-18 2012-06-20 大连理工大学 Three-dimensional physical simulation measuring device and method for dynamic response of net cage for cultivation
CN103632381A (en) * 2013-12-08 2014-03-12 中国科学院光电技术研究所 Method for tracking extended targets by means of extracting feature points by aid of frameworks
WO2017000115A1 (en) * 2015-06-29 2017-01-05 北京旷视科技有限公司 Person re-identification method and device
CN106485735A (en) * 2015-09-01 2017-03-08 南京理工大学 Human body target recognition and tracking method based on stereovision technique
CN110458940A (en) * 2019-07-24 2019-11-15 兰州未来新影文化科技集团有限责任公司 The processing method and processing unit of motion capture

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753712A (en) * 2020-06-22 2020-10-09 中国电力科学研究院有限公司 Method, system and equipment for monitoring safety of power production personnel
CN112379773A (en) * 2020-11-12 2021-02-19 深圳市洲明科技股份有限公司 Multi-user three-dimensional motion capturing method, storage medium and electronic device
WO2022100119A1 (en) * 2020-11-12 2022-05-19 深圳市洲明科技股份有限公司 Multi-person three-dimensional motion capturing method, storage medium and electronic device
CN112379773B (en) * 2020-11-12 2024-05-24 深圳市洲明科技股份有限公司 Multi-person three-dimensional motion capturing method, storage medium and electronic equipment
CN112200157A (en) * 2020-11-30 2021-01-08 成都市谛视科技有限公司 Human body 3D posture recognition method and system for reducing image background interference
CN112633096A (en) * 2020-12-14 2021-04-09 深圳云天励飞技术股份有限公司 Passenger flow monitoring method and device, electronic equipment and storage medium
CN113077519A (en) * 2021-03-18 2021-07-06 中国电子科技集团公司第五十四研究所 Multi-phase external parameter automatic calibration method based on human skeleton extraction
CN113705388A (en) * 2021-08-13 2021-11-26 国网湖南省电力有限公司 Method and system for positioning space positions of multiple persons in real time based on camera information
CN113705388B (en) * 2021-08-13 2024-01-12 国网湖南省电力有限公司 Method and system for positioning spatial positions of multiple persons in real time based on camera shooting information
CN114758016A (en) * 2022-06-15 2022-07-15 超节点创新科技(深圳)有限公司 Camera equipment calibration method, electronic equipment and storage medium
CN114758016B (en) * 2022-06-15 2022-09-13 超节点创新科技(深圳)有限公司 Camera equipment calibration method, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111028271B (en) 2023-04-14

Similar Documents

Publication Publication Date Title
CN111028271B (en) Multi-camera personnel three-dimensional positioning and tracking system based on human skeleton detection
CN106651752B (en) Three-dimensional point cloud data registration method and splicing method
CN110458897B (en) Multi-camera automatic calibration method and system and monitoring method and system
CN106091984B (en) A kind of three dimensional point cloud acquisition methods based on line laser
CN109215063B (en) Registration method of event trigger camera and three-dimensional laser radar
US7822267B2 (en) Enhanced object reconstruction
Borrmann et al. Thermal 3D mapping of building façades
CN105898107B (en) A kind of target object grasp shoot method and system
TWI696906B (en) Method for processing a floor
Borrmann et al. Mutual calibration for 3D thermal mapping
KR101759798B1 (en) Method, device and system for generating an indoor two dimensional plan view image
CN101833762A (en) Different-source image matching method based on thick edges among objects and fit
CN111080712B (en) Multi-camera personnel positioning, tracking and displaying method based on human body skeleton detection
Shibo et al. A new approach to calibrate range image and color image from Kinect
Ikeda et al. 3D indoor environment modeling by a mobile robot with omnidirectional stereo and laser range finder
Hödlmoser et al. Multiple camera self-calibration and 3D reconstruction using pedestrians
CN111898552A (en) Method and device for distinguishing person attention target object and computer equipment
JPH05215547A (en) Method for determining corresponding points between stereo images
CN108090930A (en) Barrier vision detection system and method based on binocular solid camera
CN115937983A (en) Patient falling detection method and system based on multi-view depth image
TWI595446B (en) Method for improving occluded edge quality in augmented reality based on depth camera
CN114299153A (en) Camera array synchronous calibration method and system for ultra-large power equipment
Zhang et al. Point cloud registration with 2D and 3D fusion information on mobile robot integrated vision system
CN111489384A (en) Occlusion assessment method, device, equipment, system and medium based on mutual view
CN117726687B (en) Visual repositioning method integrating live-action three-dimension and video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant