CN113838140B - Monocular video pedestrian three-dimensional positioning method based on three-dimensional map assistance
- Publication number: CN113838140B (application CN202110936647.5A)
- Authority
- CN
- China
- Prior art keywords
- pedestrian
- dimensional
- video image
- target
- monocular
- Prior art date
- Legal status: Active (assumed; not a legal conclusion)
Classifications
- G06T7/80 — Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
- G06N3/045 — Combinations of networks
- G06N3/08 — Learning methods
- G06T7/13 — Edge detection
- G06T2207/10016 — Video; Image sequence
- G06T2207/10028 — Range image; Depth image; 3D point clouds
- G06T2207/20081 — Training; Learning
- G06T2207/30196 — Human being; Person
- Y02T10/40 — Engine management systems
Abstract
The patent discloses a monocular video pedestrian three-dimensional positioning method based on three-dimensional map assistance. First, a monocular video image containing the dynamic pedestrian to be positioned and a laser radar point cloud covering the video domain are acquired. Second, the position, attitude and interior orientation elements of the video camera are recovered by extracting and matching feature points between the monocular video image and the laser radar point cloud. Third, two-dimensional detection of the pedestrian to be positioned in the video image yields the pixel coordinates of the target feature points, while the point cloud of the target scene is processed to extract the ground plane and obtain its coordinate in the vertical direction of the laser radar coordinate system. Finally, the constraint that the target pedestrian is always perpendicular to the ground plane is introduced into the collinearity equation, a joint solution equation is constructed from the data prepared above, and the three-dimensional coordinates and height information of the video pedestrian are recovered.
Description
Technical Field
The invention belongs to the field of target tracking, and particularly relates to a monocular video pedestrian three-dimensional positioning method based on three-dimensional map assistance.
Background
Three-dimensional positioning of dynamic targets has practical value in fields such as intelligent transportation, disaster and emergency services, digital cities, and public epidemic prevention and control. Currently, three-dimensional target positioning relies mainly on the Global Positioning System (GPS), but among tall, dense building groups or in enclosed spaces (such as indoors, tunnels and underground parking lots), multipath effects, signal blocking and other factors easily cause positioning errors. In recent years, wireless positioning technologies such as Ultra-Wide Band (UWB), wireless networks (WiFi), Bluetooth and infrared have also attracted widespread attention, but their positioning processes depend heavily on external conditions, their cost is high and their universality is poor.
Over the past decades, vision-based and lidar-based target positioning methods have been proposed and have received great attention. Stereo vision positioning techniques are relatively mature, but focus primarily on three-dimensional position estimation of vehicles; pedestrians, by contrast, vary in height and shape and lack sufficient attribute information. Traditional three-dimensional pedestrian positioning methods mostly depend on known information about the scene or the pedestrian: for example, they use medically established relationships between height, body parts and stride, or use physical constraints to convert pixel measurements into human height through simple motion trajectory analysis such as jumping or running. Such methods obtain the three-dimensional position of the pedestrian indirectly through other metrics, which can propagate errors. In addition, sensors such as digital cameras, laser radars, wireless sensors and inertial gyroscopes can serve as data acquisition platforms, simultaneously collecting image data and geographic position data to construct a three-dimensional map for multi-sensor fusion positioning; however, these methods depend heavily on equipment or scene conditions and are not easily deployed in all public places. In recent years, many scholars have tried to solve the three-dimensional pedestrian positioning problem with artificial intelligence, proposing different neural network structures for three-dimensional imaging, positioning, or human pose estimation. However, these studies rely on large training data sets and often assume that pedestrians are roughly the same height, which introduces inherent positioning errors, and their accuracy is difficult to guarantee.
In summary, vision-based techniques can capture detailed posture and texture attributes, but each image point corresponds to only a single perspective projection ray, so depth information is missing and additional information is needed to convert two-dimensional coordinates into three-dimensional coordinates. In this context, we propose an effective alternative for three-dimensional localization: a three-dimensional map captured by ground lidar is used to estimate the parameters of a monocular camera. Pedestrians are dynamic targets but are always perpendicular to the ground, so their three-dimensional position can be determined. The method aims to overcome the limitations of planar positioning and extend the application of traditional photogrammetry into three-dimensional space.
The invention provides a monocular video pedestrian three-dimensional positioning method based on three-dimensional map assistance, aimed at the problem of three-dimensional positioning of dynamic targets in monocular video under general conditions. The implementation idea is as follows: first, the monocular video image is calibrated with the three-dimensional map to recover the camera parameters; then pedestrian detection is completed to obtain a two-dimensional bounding box containing the pedestrian's head and foot positions; next, the vertical coordinate of the ground plane is extracted from the point cloud corresponding to the monocular video; finally, three-dimensional positioning of pedestrians is achieved using the inherent condition that a pedestrian's body is always perpendicular to the ground. The invention does not depend on special calibration objects or training data sets, imposes no geometric restrictions on the scene, and its computation is simple and efficient; it obtains more accurate positioning results than other methods, recovers more accurate pedestrian height values, and has both theoretical and practical significance.
3. Summary of the invention
(I) Objective of the invention
The invention aims to position pedestrians in monocular video in three dimensions. Because the geometric constraints available in a real scene may be insufficient, determining the distance from a moving target to the camera from monocular video is difficult. To address this problem, the invention designs a monocular video pedestrian three-dimensional positioning method based on three-dimensional map assistance: first, a three-dimensional map is constructed from point cloud data of the video-domain scene; then the camera is calibrated by feature matching to obtain the interior and exterior orientation elements of the monocular video image; next, pedestrian detection is performed on the monocular video image containing the dynamic target to be positioned to obtain the pixel coordinates of the pedestrian bounding box, while the vertical coordinate of the ground plane is obtained from the ground point cloud of the video scene; finally, the inherent condition that pedestrians are always perpendicular to the ground is introduced into the collinearity equation, and joint adjustment of the target feature points yields their three-dimensional positions.
(II) technical scheme
In order to achieve the above purpose, the invention discloses a monocular video pedestrian three-dimensional positioning method based on three-dimensional map assistance, which specifically comprises the following steps:
step 1: respectively acquiring a monocular video image F containing a dynamic target to be positioned and LiDAR point cloud C which does not contain the target under the same scene;
step 2: recovering the internal and external azimuth elements of the camera by utilizing the 2D-3D matching relation of the feature points of the monocular video image F and the point cloud C, and taking the external azimuth elements as global transformation parameters of a target scene;
step 3: let P_i (i ∈ 1, 2, 3, …, n) denote the pedestrians to be positioned in the monocular video image F, where n is the total number of pedestrians to be positioned; obtain, with a target detection algorithm, the pixel coordinates of the midpoints of the upper and lower boundary lines of each pedestrian detection frame, denoted t_i and b_i respectively;
step 4: extracting a ground plane from the laser radar point cloud C scanned in the scene of the video image F, and obtaining the coordinate value Z_g of the ground plane in the vertical direction;
Step 5: introducing into the collinearity equation the condition that the pedestrians P_i (i ∈ 1, 2, 3, …, n) are always perpendicular to the ground, together with the vertical coordinate Z_g of the ground plane extracted in step 4, and constructing a joint solution model from the pixel coordinates (u_t, v_t), (u_b, v_b) of a given pedestrian;
step 6: performing a Taylor polynomial expansion of the model constructed in step 5 and, after multiple iterations, obtaining solutions within a set threshold range, yielding the three-dimensional coordinates of the two geometric points t_i, b_i of each pedestrian P_i (i ∈ 1, 2, 3, …, n) and thus realizing monocular video pedestrian three-dimensional positioning.
(III) beneficial effects
1. By utilizing the method and the device, three-dimensional positioning of dynamic pedestrians in monocular video can be realized even when the real size of the target is unknown and the scene has no specific geometric features.
2. The method can provide technical support for applications such as cross-camera tracking of urban dynamic pedestrians in video, trajectory analysis, and behavioral anomaly detection.
4. Description of the drawings
Fig. 1 is a flow chart of a monocular video pedestrian three-dimensional positioning method based on three-dimensional map assistance.
FIG. 2 is a schematic view of a lidar point cloud and a monocular video image containing a dynamic pedestrian to be located.
FIG. 3 is a schematic diagram of a two-dimensional detection result of a dynamic pedestrian to be positioned in a monocular video image.
Fig. 4 is a schematic diagram of three-dimensional positioning of a dynamic pedestrian to be positioned in a monocular video image.
5. Detailed description of the preferred embodiments
Taking Figs. 2, 3 and 4 as examples, the implementation of the present invention is described in detail. The specific implementation is as follows:
step 1: any pedestrian P_i (i = 1, 2, 3, …, n) to be positioned is taken as an example. As shown in Fig. 2, considering the dynamics of the pedestrian to be positioned, the laser radar point cloud C of the scene, not containing the pedestrian P_i to be positioned, is first acquired with a three-dimensional laser scanner; the monocular video image F of the target P_i is then captured with a surveillance camera deployed in the scene.
Step 2: by O respectively F -X F Y F Z F And O C -X C Y C Z C And the coordinate system of the monocular video image F and the laser radar point cloud C is represented, and the camera calibration is realized by adopting a direct linear transformation algorithm. At least 6 pairs of characteristic points are selected from the monocular video image F and the ground point cloud C, and the internal azimuth element matrix A (u) of the camera is restored according to the 2D-3D matching relation of the characteristic points 0 ,v 0 F), wherein (u 0 ,v 0 ) Representing principal point coordinates, F being focal length, and video image F acquisitionAn instantaneous matrix E of external orientation elements (including a translation vector T and a rotation matrix R), of the formula (1) (X w ,Y w ,Z w ) For the object P to be positioned i Is its image Fang Erwei coordinates:
step 3: as shown in Fig. 3, the two-dimensional detection frame of the pedestrian P_i in the monocular video image F is acquired with a YOLO object detector. The YOLO algorithm takes the whole image as the network input and predicts target regions and their categories. The midpoint t of the upper edge of the pedestrian detection frame, i.e. the position of the pedestrian's head, and the midpoint b of the lower edge, i.e. the position of the pedestrian's feet, are taken as marker points, with pixel coordinates denoted (u_t, v_t) and (u_b, v_b).
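Extracting the marker points t and b from a detection frame is a one-line geometric step. A small sketch, assuming the detector reports each box as pixel corners (x1, y1, x2, y2) — the output format is an assumption, not specified by the patent:

```python
def head_foot_points(box):
    """Return the marker points t and b of step 3: the pixel coordinates
    (u_t, v_t), (u_b, v_b) of the midpoints of the top and bottom edges of a
    pedestrian box given as (x1, y1, x2, y2) corner coordinates."""
    x1, y1, x2, y2 = box
    u = (x1 + x2) / 2.0        # both midpoints share the same column
    return (u, y1), (u, y2)    # t = head (top edge), b = feet (bottom edge)
```

For example, `head_foot_points((100, 50, 140, 250))` yields t = (120.0, 50) and b = (120.0, 250).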
Step 4: in the laser radar, a gyroscope is arranged to ensure that the Z axis in a coordinate system is always vertical to the ground, so that a ground plane is extracted from the laser radar point cloud C, and a coordinate value Z in the vertical direction is obtained g Necessary basic data are provided for three-dimensional positioning of subsequent pedestrians.
Step 5: as shown in Fig. 4, for the target pedestrian P_i to be positioned, which appears only in the monocular video image F, after the camera calibration and pose recovery of step 2, the vertical coordinate Z_g of the ground plane extracted in step 4 is introduced into an improved collinearity equation, formula (2), to obtain the three-dimensional coordinates (X_b, Y_b, Z_b) of the foot point b, where a_i, b_i, c_i (i = 1, 2, 3) are the element values of the rotation matrix R and (X_S, Y_S, Z_S) are the element values of the translation vector T:

u_b = u_0 - f · [a_1(X_b - X_S) + b_1(Y_b - Y_S) + c_1(Z_b - Z_S)] / [a_3(X_b - X_S) + b_3(Y_b - Y_S) + c_3(Z_b - Z_S)]
v_b = v_0 - f · [a_2(X_b - X_S) + b_2(Y_b - Y_S) + c_2(Z_b - Z_S)] / [a_3(X_b - X_S) + b_3(Y_b - Y_S) + c_3(Z_b - Z_S)],  with Z_b = Z_g    (2)
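With Z_b fixed to Z_g, the two collinearity equations of step 5 become linear in the remaining unknowns X_b and Y_b and can be solved directly. A sketch under explicitly stated sign conventions, which differ between photogrammetric and computer-vision formulations — treat them, and the function name, as assumptions of this illustration:

```python
import numpy as np

def foot_xy(u_b, v_b, Z_g, R, C, u0, v0, f):
    """Recover (X_b, Y_b) of the foot point b from its pixel coordinates once
    Z_b = Z_g is known: with Z fixed, the two collinearity equations are linear
    in X and Y. Conventions assumed by this sketch (not the patent's): R maps
    world offsets to camera coordinates, p = R @ (Xw - C) with C the camera
    centre, and u = u0 + f * p_x / p_z, v = v0 + f * p_y / p_z."""
    r1, r2, r3 = np.asarray(R, dtype=float)
    dz = Z_g - C[2]
    # (u - u0) * (r3 . d) = f * (r1 . d) with d = (X - C_x, Y - C_y, dz),
    # rearranged into a 2x2 linear system for (X - C_x, Y - C_y).
    A = np.array([
        [(u_b - u0) * r3[0] - f * r1[0], (u_b - u0) * r3[1] - f * r1[1]],
        [(v_b - v0) * r3[0] - f * r2[0], (v_b - v0) * r3[1] - f * r2[1]],
    ])
    b = np.array([
        f * r1[2] * dz - (u_b - u0) * r3[2] * dz,
        f * r2[2] * dz - (v_b - v0) * r3[2] * dz,
    ])
    dx, dy = np.linalg.solve(A, b)
    return float(C[0] + dx), float(C[1] + dy)
```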
step 6: considering the ill-posedness of monocular visual positioning, the condition that the pedestrian is always perpendicular to the ground is introduced as a constraint in the solving process, and the reprojection error (Δu_t, Δv_t) of the head point t of the target pedestrian P_i to be positioned is calculated:

Δu_t = u_t - û_t,  Δv_t = v_t - v̂_t    (3)
In the above, (u_t, v_t) are the pixel coordinates of the point t detected directly from the video image F, and (û_t, v̂_t) are the pixel coordinates obtained by projecting the point (X_b, Y_b, Z_t) through the collinearity equation:

û_t = u_0 - f · [a_1(X_b - X_S) + b_1(Y_b - Y_S) + c_1(Z_t - Z_S)] / [a_3(X_b - X_S) + b_3(Y_b - Y_S) + c_3(Z_t - Z_S)]
v̂_t = v_0 - f · [a_2(X_b - X_S) + b_2(Y_b - Y_S) + c_2(Z_t - Z_S)] / [a_3(X_b - X_S) + b_3(Y_b - Y_S) + c_3(Z_t - Z_S)]    (4)
accordingly, the reprojection error computed through formula (4) can be expressed as an error equation in the vertical coordinate value Z_t of the pedestrian head point t:

Δu_t = (∂û_t/∂Z_t) · ΔZ_t    (5)
Δv_t = (∂v̂_t/∂Z_t) · ΔZ_t    (6)

ΔZ_t = (J^T J)^(-1) · J^T · [Δu_t, Δv_t]^T,  with J = [∂û_t/∂Z_t, ∂v̂_t/∂Z_t]^T    (7)
The correction ΔZ_t of the unknown Z_t is calculated by the matrix operation of formula (7); after multiple iterations, the three-dimensional coordinates of the point t satisfying the accuracy requirement are obtained. Having obtained the three-dimensional coordinates of the head and feet of pedestrian P_i, its height h is calculated:

h = Z_t - Z_b    (8)
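The iteration of step 6 reduces to a one-unknown Gauss-Newton loop: the head shares (X_b, Y_b) with the foot because the pedestrian is perpendicular to the ground, so only Z_t is adjusted until the correction falls below a tolerance. A sketch under pinhole conventions stated in the code (an assumption of this illustration, not the patent's formulation); numerical differentiation stands in for the analytic derivatives:

```python
import numpy as np

def project(Xw, R, C, u0, v0, f):
    """Pinhole projection under conventions assumed by this sketch:
    p = R @ (Xw - C), u = u0 + f*p_x/p_z, v = v0 + f*p_y/p_z."""
    p = np.asarray(R, dtype=float) @ (np.asarray(Xw, dtype=float) - np.asarray(C, dtype=float))
    return np.array([u0 + f * p[0] / p[2], v0 + f * p[1] / p[2]])

def solve_head_height(uv_t, X_b, Y_b, Z_g, R, C, u0, v0, f,
                      z_init=1.7, n_iter=20, tol=1e-8):
    """Gauss-Newton iteration for the single unknown Z_t of step 6: the head
    shares (X_b, Y_b) with the foot, so only Z_t is adjusted. Returns the head
    point (X_t, Y_t, Z_t) and the pedestrian height h."""
    Z_t = Z_g + z_init                        # initial guess: an average body height
    for _ in range(n_iter):
        r = np.asarray(uv_t, dtype=float) - project((X_b, Y_b, Z_t), R, C, u0, v0, f)
        eps = 1e-6                            # numerical stand-in for d(u,v)/dZ_t
        J = (project((X_b, Y_b, Z_t + eps), R, C, u0, v0, f)
             - project((X_b, Y_b, Z_t - eps), R, C, u0, v0, f)) / (2 * eps)
        dZ = float(J @ r) / float(J @ J)      # least-squares correction ΔZ_t
        Z_t += dZ
        if abs(dZ) < tol:
            break
    return (X_b, Y_b, Z_t), Z_t - Z_g         # height h = Z_t - Z_b
```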
step 7: step 6 yields (X_t, Y_t, Z_t) and (X_b, Y_b, Z_b), i.e. the head and foot coordinates of the target pedestrian P_i to be positioned that satisfy the set threshold condition, on the basis of which the position of the video dynamic target is determined.
The above embodiment describes the implementation steps taking an arbitrary target as an example; the method applies equally when there are multiple targets to be positioned, realizing three-dimensional positioning of dynamic pedestrians in monocular video.
The foregoing describes only specific embodiments of the present invention; the description is illustrative and is not intended to limit the scope of the invention. Any modifications, equivalents or improvements made within the spirit and principles of the invention fall within its scope of protection.
Claims (1)
1. A monocular video pedestrian three-dimensional positioning method based on three-dimensional map assistance comprises the following steps:
step 1: acquiring with a three-dimensional laser scanner the laser radar point cloud C of the scene, not containing the pedestrian P_i to be positioned, and then capturing the monocular video image F of the target P_i with a surveillance camera deployed in the scene;
step 2: denoting the coordinate systems of the monocular video image F and the laser radar point cloud C by O_F-X_F Y_F Z_F and O_C-X_C Y_C Z_C respectively, and realizing camera calibration with the direct linear transformation algorithm: at least 6 pairs of feature points are selected from the monocular video image F and the ground point cloud C, and from their 2D-3D matching relation the interior orientation element matrix A(u_0, v_0, f) of the camera and the exterior orientation element matrix E at the instant video image F is acquired are recovered:

s · [u, v, 1]^T = A · E · [X_w, Y_w, Z_w, 1]^T    (1)

in formula (1), s is a projective scale factor, (u_0, v_0) are the principal point coordinates, f is the focal length, E comprises a translation vector T and a rotation matrix R, (X_w, Y_w, Z_w) are the object-space three-dimensional coordinates of the target P_i to be positioned, and (u, v) are its image-space two-dimensional coordinates;
step 3: acquiring the two-dimensional detection frame of the pedestrian P_i in the monocular video image F with a YOLO object detector, which takes the whole image as the network input and predicts target regions and their categories; taking the midpoint t of the upper edge of the pedestrian detection frame, i.e. the position of the pedestrian's head, and the midpoint b of the lower edge, i.e. the position of the pedestrian's feet, as marker points, with pixel coordinates (u_t, v_t) and (u_b, v_b);
step 4: because the gyroscope in the laser scanner keeps the Z axis of the point cloud coordinate system perpendicular to the ground, extracting the ground plane from the laser radar point cloud C and obtaining its vertical coordinate value Z_g;
Step 5: for the target pedestrian P_i to be positioned, which appears only in the monocular video image F, after the camera calibration and pose recovery of step 2, introducing the vertical coordinate Z_g of the ground plane extracted in step 4 into the collinearity equation to calculate the three-dimensional coordinates (X_b, Y_b, Z_b) of the foot point b:

u_b = u_0 - f · [a_1(X_b - X_S) + b_1(Y_b - Y_S) + c_1(Z_b - Z_S)] / [a_3(X_b - X_S) + b_3(Y_b - Y_S) + c_3(Z_b - Z_S)]
v_b = v_0 - f · [a_2(X_b - X_S) + b_2(Y_b - Y_S) + c_2(Z_b - Z_S)] / [a_3(X_b - X_S) + b_3(Y_b - Y_S) + c_3(Z_b - Z_S)],  with Z_b = Z_g    (2)

wherein a_i, b_i, c_i (i = 1, 2, 3) are the element values of the rotation matrix R and (X_S, Y_S, Z_S) are the element values of the translation vector T;
step 6: considering the ill-posedness of monocular visual positioning, introducing the condition that the pedestrian is always perpendicular to the ground as a constraint in the solving process, and calculating the reprojection error (Δu_t, Δv_t) of the head point t of the target pedestrian P_i to be positioned:

Δu_t = u_t - û_t,  Δv_t = v_t - v̂_t    (3)
in the above, (u_t, v_t) are the pixel coordinates of the point t detected directly from the video image F, and (û_t, v̂_t) are the pixel coordinates obtained by projecting the point (X_b, Y_b, Z_t) through the collinearity equation:

û_t = u_0 - f · [a_1(X_b - X_S) + b_1(Y_b - Y_S) + c_1(Z_t - Z_S)] / [a_3(X_b - X_S) + b_3(Y_b - Y_S) + c_3(Z_t - Z_S)]
v̂_t = v_0 - f · [a_2(X_b - X_S) + b_2(Y_b - Y_S) + c_2(Z_t - Z_S)] / [a_3(X_b - X_S) + b_3(Y_b - Y_S) + c_3(Z_t - Z_S)]    (4)
accordingly, the reprojection error computed through formula (4) can be expressed as an error equation in the vertical coordinate value Z_t of the pedestrian head point t:

Δu_t = (∂û_t/∂Z_t) · ΔZ_t    (5)
Δv_t = (∂v̂_t/∂Z_t) · ΔZ_t    (6)

ΔZ_t = (J^T J)^(-1) · J^T · [Δu_t, Δv_t]^T,  with J = [∂û_t/∂Z_t, ∂v̂_t/∂Z_t]^T    (7)
calculating the correction ΔZ_t of the unknown Z_t by the matrix operation of formula (7); after multiple iterations, the three-dimensional coordinates of the point t satisfying the accuracy requirement are obtained; having obtained the three-dimensional coordinates of the head and feet of pedestrian P_i, its height h is calculated:

h = Z_t - Z_b    (8)
step 7: step 6 yields (X_t, Y_t, Z_t) and (X_b, Y_b, Z_b), i.e. the head and foot coordinates of the target pedestrian P_i to be positioned that satisfy the set threshold condition, on the basis of which the position of the video dynamic target is determined.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202110936647.5A | 2021-08-16 | 2021-08-16 | Monocular video pedestrian three-dimensional positioning method based on three-dimensional map assistance

Publications (2)

Publication Number | Publication Date
---|---
CN113838140A | 2021-12-24
CN113838140B | 2023-07-18

Family ID: 78960701
Citations (7)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
JP2018081008A | 2016-11-16 | 2018-05-24 | Iwane Laboratories, Ltd. | Self-position and posture locating device using a reference video map
CN108932475A | 2018-05-31 | 2018-12-04 | Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences | Three-dimensional target recognition system and method based on laser radar and monocular vision
CN109100741A | 2018-06-11 | 2018-12-28 | Chang'an University | Object detection method based on 3D laser radar and image data
CN110906880A | 2019-12-12 | 2020-03-24 | Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences | Automatic three-dimensional laser scanning system and method for objects
CN111951305A | 2020-08-20 | 2020-11-17 | Chongqing University of Posts and Telecommunications | Target detection and motion state estimation method based on vision and laser radar
CN112396664A | 2020-11-24 | 2021-02-23 | South China University of Technology | Joint calibration and online optimization method for a monocular camera and a three-dimensional laser radar
CN113255481A | 2021-05-11 | 2021-08-13 | North China University of Technology | Crowd state detection method based on an unmanned patrol car

Family Cites Families (1)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
KR20230152815A | 2019-03-21 | 2023-11-03 | LG Electronics | Method and apparatus for encoding and decoding point cloud data

Non-Patent Citations (6)

- Jason Ku et al., "Monocular 3D Object Detection Leveraging Accurate Proposals and Shape Reconstruction", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11867-11876
- Yiru Niu et al., "Monocular Pedestrian 3D Localization for Social Distance Monitoring", Sensors, 1-16
- Kaiqi Liu et al., "Pedestrian Detection with Lidar Point Clouds Based on Single Template Matching", Electronics, 1-20
- You Xiangjun et al., "Accuracy evaluation of a new three-dimensional laser scanning point cloud coordinate positioning method for tunnel surveying", Bulletin of Surveying and Mapping, no. 4, 80-84
- Yang Xiaokui, "High-precision vehicle and pedestrian detection based on multi-line lidar point clouds", China Masters' Theses Full-text Database, Engineering Science and Technology II, no. 2, C035-438
- Zhao Hui et al., "Research on three-dimensional pedestrian positioning under multiple motion forms", Journal of Beijing Information Science and Technology University (Natural Science Edition), vol. 31, no. 5, 82-86, 96
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |