CN117315424A - Multisource fusion bird's eye view perception target detection method, device, equipment and medium - Google Patents

Multisource fusion bird's eye view perception target detection method, device, equipment and medium

Info

Publication number
CN117315424A
CN117315424A (application number CN202311294008.9A)
Authority
CN
China
Prior art keywords
cylindrical projection
feature
image
camera
bird
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311294008.9A
Other languages
Chinese (zh)
Inventor
漆昇翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Changan Automobile Co Ltd
Original Assignee
Chongqing Changan Automobile Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Changan Automobile Co Ltd filed Critical Chongqing Changan Automobile Co Ltd
Priority to CN202311294008.9A priority Critical patent/CN117315424A/en
Publication of CN117315424A publication Critical patent/CN117315424A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/10 Image acquisition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V 20/588 Recognition of the road, e.g. of lane markings; Recognition of the vehicle driving pattern in relation to the road
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/64 Three-dimensional objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Closed-Circuit Television Systems (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a multisource fusion bird's eye view perception target detection method, device, equipment and medium. In the method, images from different sources are mapped onto cylindrical projections, so that a single bird's eye view perception network model can fuse fisheye images and pinhole images with different installation layouts and orientations under one main framework and jointly complete the bird's eye view feature mapping and perception tasks; target tracking and detection are then performed based on the bird's eye view plane coding features to obtain tracking and detection results for the targets contained in the coding features of the plurality of cylindrical projection images. The method is therefore applicable not only to common conventional pinhole camera images but also to fisheye camera images; it reduces image distortion when detecting close-range target objects, optimizes the near-far proportion of the feature distribution, and thereby effectively improves the accuracy of environment perception detection.

Description

Multisource fusion bird's eye view perception target detection method, device, equipment and medium
Technical Field
The application relates to the technical field of automatic driving, in particular to a multisource fusion bird's eye view perception target detection method, device, equipment and medium.
Background
With the rapid development of automatic driving perception technology, multi-source multi-modal sensor fusion perception has gradually become the mainstream approach for autonomous, high-precision and high-accuracy vehicle perception. For example, bird's eye view perception based on spatial mapping to the bird's eye view angle fuses information from different sources and different modalities directly at the feature level of each sensor's data, so as to achieve full 360-degree spatial feature coverage around the vehicle.
In current practice, higher-end vehicle models can cover 360 degrees around the vehicle with multiple high-definition pinhole cameras, which places high demands on data bandwidth, processor resources and the like, whereas lower-end vehicle models do not have the conditions to be equipped with omnidirectional panoramic high-definition cameras. To further reduce the cost of pinhole high-definition cameras, methods have been proposed that perform bird's eye view perception with the fisheye surround-view cameras that support parking; however, because such schemes use only fisheye images, the detection distance is short, image distortion is severe, the near-far proportion of the feature distribution is severely unbalanced, data annotation is difficult, and model training depends heavily on correspondingly annotated fisheye images, so a large number of existing conventional pinhole camera images cannot be reused, the development cost is high, and the practicability is poor.
Disclosure of Invention
In view of the above drawbacks of the prior art, the present invention provides a method, apparatus, device and medium for detecting a multisource fused aerial view sensing target, so as to solve the above technical problems.
The invention provides a multisource fused aerial view perception target detection method, which comprises the following steps: acquiring cylindrical projection parameters, a plurality of frames of cylindrical projection images, an image coding feature map, and the probability distribution of the radial depth corresponding to each pixel feature of the image coding feature map, wherein the cylindrical projection images comprise a pinhole camera cylindrical projection image and a fisheye camera cylindrical projection image, and the cylindrical projection parameters comprise the cylindrical projection parameters of the fisheye camera image and the cylindrical projection parameters of the pinhole camera; determining the aerial view plane coding feature of the cylindrical projection image coding features of each frame based on the image coding feature map, the cylindrical projection parameters, the radial depth probability and a target vehicle coordinate system; recording the aerial view plane coding feature corresponding to the cylindrical projection image of the current frame as the current-frame aerial view plane coding feature, and performing feature fusion of the current-frame aerial view plane coding feature and the historical-frame aerial view plane coding features to obtain an aerial view time sequence fusion feature; decoding the aerial view time sequence fusion feature to obtain corresponding scale decoding features; connecting the scale decoding features with perception task head networks to obtain a perception network model; and performing generalization training on the perception network model, and performing target detection based on the trained perception network model.
In one embodiment of the present invention, acquiring a plurality of frames of cylindrical projection images includes: collecting a plurality of camera images, wherein the camera images comprise a pinhole camera image and a fisheye camera image; constructing a virtual cylinder according to the internal reference information of the camera image, wherein the center point of the virtual cylinder is the origin of the camera optical axis, the radius of the virtual cylinder is the focal length of the camera, the central axis of the cylindrical projection image coincides with the optical axis of the camera image, and the transverse view angle of the cylindrical projection of the virtual cylinder is the same as the transverse view angle of the camera; carrying out coordinate distortion correction on the camera image through camera image distortion model parameters to obtain camera image coordinates to be converted; and converting the image coordinates of the camera to be converted into cylindrical projection images of the camera by utilizing the virtual cylinder.
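As an illustrative sketch only (assuming OpenCV-style pinhole intrinsics K and distortion coefficients; the function names and the particular inverse-mapping formulation are assumptions, not part of the published text), the virtual-cylinder construction and remapping described above could be realized as follows:

```python
import cv2
import numpy as np

def build_cylindrical_remap(K, size, f=None):
    """Build an inverse-mapping grid that warps an undistorted pinhole image onto a
    virtual cylinder whose radius equals the focal length and whose axis coincides
    with the camera optical axis (illustrative sketch only)."""
    w, h = size
    f = f if f is not None else float(K[0, 0])   # cylinder radius = camera focal length
    cx, cy = float(K[0, 2]), float(K[1, 2])
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    theta = (xs - cx) / f                        # azimuth of each cylindrical column
    x_p = f * np.tan(theta)                      # back-project onto the pinhole image plane
    y_p = (ys - cy) / np.cos(theta)
    return (x_p + cx).astype(np.float32), (y_p + cy).astype(np.float32)

def to_cylindrical(img, K, dist):
    """Distortion-correct a pinhole image, then remap it to its cylindrical projection."""
    und = cv2.undistort(img, K, dist)            # coordinate distortion correction
    map_x, map_y = build_cylindrical_remap(K, (img.shape[1], img.shape[0]))
    return cv2.remap(und, map_x, map_y, cv2.INTER_LINEAR)
```

Because the cylinder shares the camera's focal length and transverse field angle, the remap covers the full original image width without losing content at the lateral edges.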
In an embodiment of the present invention, determining the bird's eye view plane coding feature of each frame of cylindrical projection image coding features based on the image coding feature map, the cylindrical projection parameters, the radial depth probability and a target vehicle coordinate system includes: mapping the image coding feature map, through the cylindrical projection parameters and the radial depth probability, to a three-dimensional gridded feature space centered on the origin of the target vehicle coordinate system; and performing planarization compression on the three-dimensional gridded feature space to obtain the bird's eye view plane coding feature.
In an embodiment of the present invention, before mapping the cylindrical projection image coding feature of each frame to the three-dimensional gridding feature space centered on the origin of the coordinate system of the target vehicle, the method further includes:
converting the cylindrical projection image coordinates into real object point coordinates in camera coordinates by the following formula:
wherein the pixel coordinate x_c is along the arc direction of the cylindrical surface, the coordinate y_c is along the height direction of the cylinder, the depth ρ_c is along the radial direction of the cylinder's central axis, f is the focal length, (X, Y, Z) are the real object point coordinates, and (x_c, y_c, ρ_c) are the cylindrical projection image coordinates.
In one embodiment of the present invention, mapping the cylindrical projection image coding feature of each frame to a three-dimensional gridded feature space centered on the origin of the target vehicle coordinate system includes:
mapping the cylindrical projection image coding features of each frame to a three-dimensional gridding feature space taking the origin of a target vehicle coordinate system as the center by utilizing a preset projection relation; wherein, the preset projection relation is:
wherein F_i^2D denotes the cylindrical projection image coding features, V_i^3D denotes the three-dimensional gridded feature space, C_0 is the feature dimension, P_i ⊗ F_i^2D is the spatial image feature distribution obtained by probability-weighting the D equivalent-cylinder radial depth intervals corresponding to the i-th cylindrical projection image, and Γ{·} denotes mapping the cylindrical projection image coding features, by linear interpolation, into the Cartesian coordinate system space centered on the host vehicle, using the conversion of cylindrical projection image coordinates to real object point coordinates in camera coordinates, the equivalent cylinder transformation parameters, and the camera extrinsic parameters relative to the target vehicle body.
In one embodiment of the present invention, performing planarization compression on the three-dimensional gridded feature space through a pillar pooling operation includes: summing and averaging, along the vertical-axis height direction, all voxel feature vectors of the pillar corresponding to each grid cell in the bird's eye view gridded space, to obtain a first feature dimension; stacking, in the feature dimension, all voxel feature vectors of the corresponding pillar along the vertical-axis height direction, to obtain a second feature dimension; and performing planarization compression on the three-dimensional gridded feature space based on the first feature dimension and the second feature dimension.
In an embodiment of the present invention, feature fusion is performed on a current frame aerial view plane coding feature and a historical frame aerial view plane coding feature to obtain an aerial view time sequence fusion feature, including: calculating the plane coordinate conversion relation from the historical frame aerial view plane coding feature to the current frame aerial view plane coding feature according to the target vehicle positioning information of the historical frame aerial view plane coding feature and the target vehicle positioning information of the current frame aerial view plane coding feature; based on the plane coordinate conversion relation, aligning each historical frame aerial view plane coordinate system to the same coordinate system to obtain a coordinate aligned historical frame aerial view plane coding feature; and setting weights for the plane coding features of the aerial view of the history frame with the aligned coordinates, carrying out feature fusion on the plane coding features of the aerial view of the history frame with the aligned coordinates and the plane coding features of the aerial view of the current frame based on linear interpolation of the weights in plane space, and obtaining the time sequence fusion features of the aerial view.
In an embodiment of the present invention, the generalization training of the perception network model includes: collecting basic data information corresponding to the perception tasks, wherein the basic data information comprises time stamp information; performing task data sample labeling on the basic data information for the different perception tasks to obtain corresponding labeled data, wherein the perception tasks comprise three-dimensional target detection, road passable area, and lane lines; and constructing a loss function and performing generalization training on the perception network model based on the labeled data.
In one embodiment of the present invention, constructing a loss function includes: constructing the total loss function of the perception network model from the perception task prediction loss and the cylinder radial depth prediction loss, wherein the perception task prediction loss includes a target detection loss and an image segmentation task loss; the target detection loss is obtained by weighted combination of a classification loss function and a three-dimensional box regression loss function; the image segmentation task loss is calculated by comparing the predicted segmentation mask with the ground-truth segmentation mask; and the depth prediction loss is computed either as a simple classification cross-entropy loss or as an ordinal regression loss.
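As an illustrative sketch only (the loss weights and the specific cross-entropy / smooth-L1 choices are assumptions), one way to assemble the total loss described above is:

```python
import torch.nn.functional as F

def total_loss(det_cls_logits, det_cls_targets, det_box_preds, det_box_targets,
               seg_logits, seg_masks, depth_logits, depth_bin_targets,
               w_det=1.0, w_seg=1.0, w_depth=0.5):
    # target detection loss: classification plus 3D box regression, weighted together
    det_loss = F.cross_entropy(det_cls_logits, det_cls_targets) \
             + F.smooth_l1_loss(det_box_preds, det_box_targets)
    # image segmentation loss: predicted mask versus ground-truth mask
    seg_loss = F.binary_cross_entropy_with_logits(seg_logits, seg_masks)
    # cylinder radial depth loss: simple per-bin classification cross-entropy
    depth_loss = F.cross_entropy(depth_logits, depth_bin_targets)
    return w_det * det_loss + w_seg * seg_loss + w_depth * depth_loss
```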
The invention provides a multisource fused aerial view perception target detection device, which comprises an acquisition module, a bird's eye view plane coding feature generation module, a fusion module, a decoding module and a training module. The acquisition module is used for acquiring cylindrical projection parameters, a plurality of frames of cylindrical projection images, an image coding feature map, and the probability distribution of the radial depth corresponding to each pixel feature of the image coding feature map, wherein the cylindrical projection images comprise a pinhole camera cylindrical projection image and a fisheye camera cylindrical projection image, and the cylindrical projection parameters comprise the cylindrical projection parameters of the fisheye camera image and the cylindrical projection parameters of the pinhole camera. The bird's eye view plane coding feature generation module is used for determining the bird's eye view plane coding feature of each frame's cylindrical projection image coding features based on the image coding feature map, the cylindrical projection parameters, the radial depth probability and a target vehicle coordinate system. The fusion module is used for recording the bird's eye view plane coding feature corresponding to the current-frame cylindrical projection image as the current-frame bird's eye view plane coding feature, and performing feature fusion of the current-frame bird's eye view plane coding feature and the historical-frame bird's eye view plane coding features to obtain a bird's eye view time sequence fusion feature. The decoding module is used for decoding the bird's eye view time sequence fusion feature to obtain corresponding scale decoding features. The training module is used for connecting the scale decoding features with perception task head networks to obtain a perception network model, performing generalization training on the perception network model, and performing target detection based on the trained perception network model.
The invention has the following beneficial effects: in the multisource fusion bird's eye view perception target detection method, device, equipment and medium, cylindrical mapping is performed on images from different sources, so that the bird's eye view perception network model can fuse fisheye images and pinhole images with different installation layouts and orientations under the same main framework and jointly complete the bird's eye view feature mapping and perception tasks; target tracking and detection are then performed based on the bird's eye view plane coding features to obtain tracking and detection results for the targets in the coding features of the plurality of cylindrical projection images. The method is therefore applicable not only to common conventional pinhole camera images but also to fisheye camera images; it reduces image distortion when detecting close-range target objects, optimizes the near-far proportion of the feature distribution, and thereby effectively improves the accuracy of environment perception detection.
In addition, the scheme can also be applied independently to a 360-degree all-fisheye configuration without pinhole images to complete bird's eye view perception; a large number of existing conventional pinhole camera images can be reused, the development cost is low, and the method has strong practicability.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. It is apparent that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art. In the drawings:
FIG. 1 is a flow chart of a multisource fused bird's eye view aware target detection method shown in an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram of cylinder radial depth equivalent to a cylindrical projection image, as shown in an exemplary embodiment of the present application;
FIG. 3 is a schematic view of a target vehicle orientation shown in an exemplary embodiment of the present application;
FIGS. 4 (a), 4 (b), and 4 (c) are schematic diagrams of equal-focal-length cylindrical projection for pinhole imaging, as illustrated in an exemplary embodiment of the present application;
FIGS. 5 (a) and 5 (b) are top views of cylindrical projections of fisheye imaging and pinhole imaging with the same focal length and the same field angle, according to an exemplary embodiment of the present application;
FIG. 6 is a schematic view of a camera coordinate system shown in an exemplary embodiment of the present application;
FIG. 7 is a network model schematic diagram of a multisource fused bird's eye view aware target detection method according to an exemplary embodiment of the present application;
FIG. 8 is a schematic block diagram of a multi-source fused bird's eye view sensing target detection method according to an exemplary embodiment of the present application;
FIG. 9 is a block diagram of a multisource fused bird's eye view aware target detection device according to an exemplary embodiment of the present application;
fig. 10 shows a schematic diagram of a computer system suitable for use in implementing the electronic device of the embodiments of the present application.
Detailed Description
Further advantages and effects of the present invention will become readily apparent to those skilled in the art from the disclosure herein, by referring to the accompanying drawings and the preferred embodiments. The invention may also be practiced or carried out in other embodiments with different specific details, and the details of the present description may be modified or varied without departing from the spirit and scope of the present invention. It should be understood that the preferred embodiments are presented by way of illustration only and not by way of limitation.
It should be noted that the illustrations provided in the following embodiments merely illustrate the basic concept of the present invention by way of illustration, and only the components related to the present invention are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.
In the following description, numerous details are set forth in order to provide a more thorough explanation of embodiments of the present invention. It will be apparent, however, to one skilled in the art, that embodiments of the present invention may be practiced without these specific details. In other embodiments, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the embodiments of the present invention.
It should be noted that, first, the autopilot technique refers to the ability of a vehicle to autonomously sense, make decisions, and control travel without human intervention. This technology covers a number of areas of knowledge and technology, including:
  • Sensor technology: the autopilot system acquires information about the surrounding environment using lidar, cameras, ultrasonic sensors, radar, and the like. The data collected by these sensors help the vehicle identify roads, vehicles, pedestrians, and other obstacles.
  • Perception and environmental modeling: the vehicle uses the sensor data to construct a three-dimensional model of the surrounding environment. By identifying and analyzing roads, signs, traffic signals, obstacles, etc., the vehicle can understand the scene in which it is located.
  • Decision making: based on the perceived environment, the autopilot system makes decisions, such as selecting appropriate lanes, speeds, and inter-vehicle distances, to ensure safe and efficient travel.
  • Path planning: the autonomous vehicle calculates the best path to reach the destination or perform a particular task. This requires consideration of traffic conditions, road rules, and the behavior of other vehicles.
  • Control system: the control system is responsible for actually steering the vehicle, including accelerating, braking, steering, etc. These operations are performed based on the output of the decision system.
  • Artificial intelligence and machine learning: automatic driving widely applies artificial intelligence and machine learning techniques for pattern recognition, behavior prediction, decision making, and the like. Deep learning algorithms may help the vehicle identify images, predict the behavior of other road users, etc.
  • Data fusion: the autopilot system needs to fuse the data from multiple sensors together to obtain comprehensive environmental awareness. This requires efficient data processing and sensor data fusion algorithms.
  • Safety and reliability: autopilot technology needs to operate under a variety of complex and uncertain conditions, and therefore requires extremely high safety and reliability of the system. Fault detection, redundant systems, and emergency switching are important means of ensuring safety.
The multi-source multi-modal sensor suite supporting autonomous-driving perception comprises various visual imaging sensors, such as long-focus cameras for long-distance detection, wide-angle cameras for middle- and close-range detection, and fisheye surround-view cameras for parking detection, as well as various radar point cloud sensors for spatial positioning, such as millimeter-wave radar, lidar, and 4D (Fourth Dimension) millimeter-wave radar. Among these, a representative high-reliability solution is the fused use of visual images with a lidar point cloud. On the one hand, the visual image has the richest detail and the finest granularity, sufficient to fully reflect the color, shape, texture, pose and other information of targets in the environmental scene, and is regarded as the optimal data source for object identification and scene semantic representation. The problem is that the original information in a visual image always reflects 2D (Two Dimension) planar characteristics without a depth-of-field attribute, so accurate measurement and positioning are difficult to achieve, which is its core defect in practical use. In this regard, lidar and 4D millimeter-wave radar, which reflect 3D (Three Dimension) spatial depth positions, have advantages that vision cannot match for ranging and precise positioning of spatial objects, and have better detection capability than vision for suddenly intruding or unknown obstacles. Vision-radar fusion BEV perception of visual and 3D radar point cloud information is considered the optimal combination for solving automatic-driving environment perception at the present stage; however, the high cost of lidar greatly limits the large-scale mass-production adoption of such fusion perception methods, and low-cost multi-source fusion perception, represented by pure vision, has become the cost-effective solution pursued by most relevant enterprises.
Pure-vision multi-source fusion BEV perception makes full use of the vehicle's imaging cameras in all directions, such as long-focus and wide-angle cameras, to complete 360-degree environmental visual feature-level fusion perception, extracting information only from visual images so as to reduce sensor cost. A neural network model extracts features from each image, then maps the visual features one by one into a BEV (bird's eye view) space centered on the vehicle using depth estimation and imaging geometry, and finally forms a BEV feature map covering the whole surrounding area for detection, segmentation and other perception tasks. The most commonly used multi-camera vision fusion BEV perception methods at present are mainly aimed at pinhole camera images, i.e. BEV feature projection is completed using the front-view, side-view and rear-view conventional pinhole cameras distributed around the vehicle. The pinhole camera conforms to the geometrical principle of pinhole imaging, has a simple and clear projection relation, high imaging resolution and definition, a light degree of distortion and a balanced near-far proportion in how targets are represented, and easily meets the requirements of various conventional scenarios. However, in current practice, only higher-end vehicle models can cover 360 degrees around the vehicle with multiple high-definition pinhole cameras, and many middle- and low-end vehicle models do not have the conditions to be equipped with omnidirectional panoramic high-definition cameras because of the high demands on data bandwidth, processor resources and the like. To further reduce the cost of pinhole high-definition cameras, a small number of methods that perform BEV perception with the fisheye surround-view cameras supporting parking have also been proposed in the industry; but these schemes use only fisheye images, so the detection distance is short, image distortion is severe, the near-far proportion of the feature distribution is severely unbalanced, data annotation is difficult, and model training depends heavily on correspondingly annotated fisheye images, so a large number of conventional pinhole camera images cannot be reused, the development cost is high, and the practicability is poor; most of these methods are still in a research and exploration state and lack general applicability.
Fig. 1 is a schematic diagram of a method for detecting a bird's eye view sensing target by multi-source fusion according to an exemplary embodiment of the present application, specifically including the following steps:
Step S110, obtaining cylindrical projection parameters, a plurality of frames of cylindrical projection images, an image coding feature map, and the probability distribution of the radial depth corresponding to each pixel feature of the image coding feature map, wherein the cylindrical projection images comprise a pinhole camera cylindrical projection image and a fisheye camera cylindrical projection image, and the cylindrical projection parameters comprise the cylindrical projection parameters of the fisheye camera image and the cylindrical projection parameters of the pinhole camera.
In one embodiment of the present application, the cylindrical projection images are input to a convolution backbone network for feature encoding to obtain the image coding feature map. In this embodiment, the cylindrical projection images are 2D visual cylindrical projection images I_i (i = 1, 2, ..., K), the image coding feature map consists of 2D image coding features, the convolution backbone network is a shared 2D (Two-Dimensional) backbone network, and the 2D visual cylindrical projection images are encoded by the shared 2D backbone network to obtain the 2D image coding features. In this embodiment, the 2D image coding features, the camera intrinsic and extrinsic parameters, and the projection cylinder parameters are input to a cylindrical-projection equivalent-cylinder radial depth prediction network to obtain the equivalent-cylinder radial depth probability P_i ∈ R^(D×H×W) (i = 1, 2, ..., K) corresponding to each pixel feature of the 2D images, where K is a constant, H is the height value, W is the width value, D is the depth value, and C_0 is the feature dimension.
In one embodiment of the present application, the input of the cylindrical-projection equivalent-cylinder radial depth prediction network is the 2D image coding features. The network structure can adopt an SE (Squeeze-and-Excitation) network, in which the camera intrinsic parameters, the camera extrinsic parameters and the projection cylinder parameters are fed as a one-dimensional vector into an MLP (Multilayer Perceptron) network, and the output high-dimensional vector is used as SE-network weights to re-weight the image feature channels. The final SE network outputs the equivalent-cylinder radial depth of each pixel of the cylindrical projection image; the radial depth is divided into depth intervals according to actual distance, and the probability that each pixel belongs to each depth interval is predicted, i.e. in this embodiment the probability P_i ∈ R^(D×H×W) (i = 1, 2, ..., K) that the radial depth corresponding to each pixel feature belongs to each depth interval. In this embodiment, the equivalent-cylinder radial depth of the cylindrical projection image refers to the length, in top view, of the real object point corresponding to each pixel of the cylindrical projection image along the radial direction of the projection cylinder, as shown in fig. 2; fig. 2 is a schematic view of the equivalent-cylinder radial depth of a cylindrical projection image according to an exemplary embodiment of the present application. This length is equivalent to the depth of field of a perspective image, and the equivalent-cylinder radial depth of each pixel of the cylindrical projection image is predicted from the input image coding features, the camera parameters and the projection cylinder parameters. The method helps to understand the distance relation of different objects in the scene and more accurately reflects the position of the real object point under the cylindrical projection image features.
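As an illustrative sketch only (module names and layer sizes are assumptions), the SE-style depth prediction described above, with an MLP over the flattened camera and cylinder parameters re-weighting the image feature channels and a softmax over D depth bins, could look like:

```python
import torch.nn as nn

class CylinderDepthHead(nn.Module):
    """Predicts, for each pixel of a cylindrical-projection feature map, a probability
    distribution over D equivalent-cylinder radial depth bins (illustrative sketch)."""
    def __init__(self, c_feat: int, c_cam: int, num_bins: int):
        super().__init__()
        # MLP over the flattened camera intrinsics/extrinsics and projection-cylinder parameters
        self.cam_mlp = nn.Sequential(
            nn.Linear(c_cam, c_feat), nn.ReLU(inplace=True),
            nn.Linear(c_feat, c_feat), nn.Sigmoid(),   # SE-style channel weights
        )
        self.depth_conv = nn.Conv2d(c_feat, num_bins, kernel_size=1)

    def forward(self, feat, cam_params):
        # feat: (B, C0, H, W) image coding features; cam_params: (B, c_cam) one-dimensional input
        w = self.cam_mlp(cam_params).unsqueeze(-1).unsqueeze(-1)
        return self.depth_conv(feat * w).softmax(dim=1)   # (B, D, H, W) per-bin probabilities
```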
In the above embodiment, the 2D convolution backbone network shares weights across the input cylindrical projection images I_i (i = 1, 2, ..., K), and the network can typically adopt, but is not limited to, a series of commonly used 2D convolutional network structures of various depths, such as ResNet (residual neural network), EfficientNet, Swin Transformer and VoVNetV2.
Step S120, determining the bird's eye view plane coding feature of each frame's cylindrical projection image coding features based on the image coding feature map, the cylindrical projection parameters, the radial depth probability and the target vehicle coordinate system.
In one embodiment of the present application, the cylindrical projection image features, the equivalent cylinder transformation parameters and the radial depth probability P_i ∈ R^(D×H×W) (i = 1, 2, ..., K) of each image pixel are combined with the extrinsic parameters of the camera relative to the vehicle body coordinate system, and the combined data are sampled by linear interpolation, so that each cylindrical projection image coding feature is mapped to a unified BEV-view 3D gridded feature space centered on the origin of the target vehicle coordinate system, as shown in fig. 3; fig. 3 is a schematic view of the target vehicle orientation according to an exemplary embodiment of the present application. The BEV-space gridded features are then flattened and compressed by a pillar pooling operation to obtain the BEV plane coding features. In this way, by linear interpolation sampling into a unified BEV view, information from different views can be mapped into a shared BEV space to obtain the BEV plane coding features. Such a feature representation is better suited to certain perception tasks, such as object detection and obstacle recognition.
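As an illustrative sketch only (the precomputed voxel_index map and the flattened-grid accumulation are assumptions about one possible realization), weighting the image features by the per-pixel depth-bin probabilities and pooling them into the ego-centered 3D grid could be written as:

```python
import torch

def lift_to_bev(feat, depth_prob, voxel_index, num_voxels):
    """Weight image features by per-pixel depth-bin probabilities and accumulate them
    into an ego-centered 3D voxel grid (illustrative sketch).

    feat:        (C0, H, W) cylindrical image coding features
    depth_prob:  (D, H, W)  probability of each radial depth bin per pixel
    voxel_index: (D, H, W)  precomputed flat voxel id of every (bin, pixel) sample,
                            long tensor, -1 where the sample falls outside the grid
    """
    C0 = feat.shape[0]
    weighted = depth_prob.unsqueeze(1) * feat.unsqueeze(0)     # (D, C0, H, W) outer product
    weighted = weighted.permute(0, 2, 3, 1).reshape(-1, C0)    # one feature per (bin, pixel)
    idx = voxel_index.reshape(-1)
    valid = idx >= 0
    voxels = torch.zeros(num_voxels, C0, dtype=feat.dtype)
    voxels.index_add_(0, idx[valid], weighted[valid])          # accumulate into the grid cells
    return voxels                                              # reshape to (Z, X, Y, C0) downstream
```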
And step S130, recording the aerial view plane coding feature corresponding to the cylindrical projection image of the current frame as the aerial view plane coding feature of the current frame, and carrying out feature fusion on the aerial view plane coding feature of the current frame and the aerial view plane coding feature of the historical frame to obtain an aerial view time sequence fusion feature.
And step S140, decoding the bird's eye view time sequence fusion feature to obtain corresponding scale decoding features.
And step S150, connecting the scale decoding features with a perception task head network to obtain a perception network model, performing generalization training on the perception network model, and performing target detection based on the trained perception network model.
In one embodiment of the application, a lightweight secondary backbone decoding network is used to decode the bird's eye view time sequence fusion features, a feature pyramid structure is adopted at its tail end, and scale features of a suitable size are output for each perception task. In this embodiment, the secondary backbone decoding network is a 2D convolutional secondary backbone network, which can generally adopt, but is not limited to, a series of common 2D convolutional network structures such as ResNet; the number of network layers adopted in this link is generally relatively shallow, such as ResNet-18. The secondary backbone decoding network is used for recovering useful information from the fused BEV features; because it adopts a feature pyramid structure, data information can be extracted from features of different scales to suit the requirements of different perception tasks, so that the target vehicle can more accurately perceive the surrounding environment and make accurate decisions.
In one embodiment of the application, each scale feature is connected with different sensing task head networks such as 3D target detection, road passable area segmentation, lane line extraction and the like which are required to be matched in scale, and finally sensing results such as corresponding 3D target detection, road passable area, lane line and the like are output.
In one embodiment of the present application, BEV scale features are output for each level of the feature pyramid (FPN). Usually, relatively smaller-scale features are connected with the 3D target detection head DtHead to improve network efficiency, while relatively larger-scale features are connected with the road passable area segmentation head RdHead and the lane line detection head LaHead to improve the efficiency of pixel-level segmentation. DtHead, RdHead and LaHead can flexibly use a variety of different task head networks, so the model has strong general adaptation capability.
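As an illustrative sketch only (DtHead, RdHead and LaHead stand for any compatible head networks; the coarse/fine routing is an assumption consistent with the description above), the wiring of FPN scale features to the task heads could look like:

```python
import torch.nn as nn

class BEVTaskHeads(nn.Module):
    """Routes FPN scale features to the task heads; the head modules passed in stand for
    any compatible 3D detection, road-area and lane-line head networks."""
    def __init__(self, det_head: nn.Module, road_head: nn.Module, lane_head: nn.Module):
        super().__init__()
        self.det_head, self.road_head, self.lane_head = det_head, road_head, lane_head

    def forward(self, fpn_feats):
        # fpn_feats: BEV feature maps ordered from coarse (small) to fine (large) scale
        coarse, fine = fpn_feats[0], fpn_feats[-1]
        return {
            "detection_3d": self.det_head(coarse),   # smaller scale for network efficiency
            "road_area": self.road_head(fine),       # larger scale for pixel segmentation
            "lane_lines": self.lane_head(fine),
        }
```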
In the technical scheme shown in fig. 1, by performing cylindrical mapping on images from different sources, the bird's eye view perception network model can fuse fisheye images and pinhole images with different installation layouts and orientations under the same main framework and jointly complete the bird's eye view feature mapping and perception tasks, so that target tracking and detection are performed based on the bird's eye view plane coding features, and tracking and detection results for the targets in the coding features of the plurality of cylindrical projection images are obtained. The method is therefore applicable not only to common conventional pinhole camera images but also to fisheye camera images; it reduces image distortion when detecting close-range target objects, optimizes the near-far proportion of the feature distribution, and thereby effectively improves the accuracy of environment perception detection.
In one embodiment of the present application, acquiring a number of frames of cylindrical projection images includes acquiring a number of camera images, including a pinhole camera image and a fisheye camera image; constructing, according to the intrinsic parameter information of the camera image, a virtual cylinder whose center point is the origin of the camera optical axis, whose radius is the camera focal length, whose central axis coincides with the optical axis of the camera image, and whose transverse field angle of cylindrical projection is the same as the transverse field angle of the camera; performing coordinate distortion correction on the camera image through the camera image distortion model parameters to obtain the camera image coordinates to be converted; and converting the camera image coordinates to be converted into the cylindrical projection image of the camera by using the virtual cylinder.
In one embodiment of the application, the camera is a pinhole camera, the camera image coordinates to be converted are the undistorted pinhole camera image coordinates, and the cylindrical projection image is the pinhole camera cylindrical projection image. As shown in fig. 4, fig. 4 (a), fig. 4 (b), and fig. 4 (c) are schematic diagrams of equal-focal-length cylindrical projection for pinhole imaging according to an exemplary embodiment of the present application. Firstly, the coordinates in the pinhole camera image are corrected by the image distortion model parameters to obtain the undistorted pinhole camera image; secondly, a virtual cylinder is constructed whose center point is the origin of the pinhole camera optical axis, whose radius is the pinhole camera focal length, whose central axis coincides with the optical axis of the pinhole camera image, and whose transverse field angle of cylindrical projection is the same as the transverse field angle of the pinhole camera, which ensures that no original image information is missed during projection. The longitudinal extent can be controlled as needed, and is generally set so as to cut off the black-edge area introduced by the projection transformation, ensuring that the projected image has no invalid area.
And converting the undistorted pinhole camera image into a cylindrical projection image of the pinhole camera by the following formula:
wherein (x_p, y_p) is the pinhole camera image coordinate position, (x_c, y_c) is the coordinate position in the converted cylindrical projection image, f is the focal length of the pinhole camera, and θ is the azimuth angle of the image point of the pinhole camera.
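The published formula itself appears as an image and is not reproduced in this text; as an assumed reconstruction consistent with the variables listed above, a standard equal-focal-length cylindrical projection is:

```latex
\theta = \arctan\!\left(\frac{x_p}{f}\right), \qquad
x_c = f\,\theta, \qquad
y_c = \frac{f\,y_p}{\sqrt{x_p^{2}+f^{2}}} = y_p\cos\theta .
```

Here the arc coordinate x_c is the focal length times the azimuth, and the height coordinate y_c scales y_p by cos θ so the transverse field angle is preserved.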
In one embodiment of the present application, the camera is a fisheye camera, the camera image coordinates to be converted are perspective projection plane image coordinates, and the cylindrical projection image is a fisheye camera cylindrical projection image. The equal-focal-length cylindrical projection of fisheye imaging is shown in fig. 5; fig. 5 (a) and fig. 5 (b) are top views of cylindrical projections of fisheye imaging and pinhole imaging with the same focal length and the same field angle according to an exemplary embodiment of the present application. The coordinates in the fisheye camera image are corrected through the fisheye camera's distortion polynomial model, intrinsic parameters and distortion parameters to obtain an undistorted fisheye camera image; the corresponding coordinates of the undistorted fisheye image on an equal-focal-length perspective projection plane are then calculated from the undistorted fisheye image coordinates to obtain the perspective projection plane image coordinates; a virtual cylinder is constructed whose center point is the origin of the fisheye camera optical axis, whose radius is the focal length of the fisheye camera, whose central axis coincides with the optical axis of the fisheye camera image, and whose transverse field angle of cylindrical projection is the same as the transverse field angle of the fisheye camera; and the fisheye camera image is converted into the cylindrical projection image of the fisheye camera by the following formula:
wherein (x_f, y_f) are the fisheye camera image coordinates, (x_w, y_w) are the coordinates of the converted fisheye camera cylindrical projection image, f is the focal length of the fisheye camera, and θ is the azimuth angle of the image point of the fisheye camera.
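The published formula likewise appears as an image; assuming an equidistant fisheye model and the equal-focal-length perspective plane described above (both of which are assumptions, not the published derivation), one reconstruction consistent with the listed variables is:

```latex
r_f = \sqrt{x_f^{2}+y_f^{2}}, \quad
\varphi = \frac{r_f}{f}, \quad
(x_u,\, y_u) = \frac{f\tan\varphi}{r_f}\,(x_f,\, y_f), \quad
x_w = f\arctan\!\left(\frac{x_u}{f}\right), \quad
y_w = \frac{f\,y_u}{\sqrt{x_u^{2}+f^{2}}} .
```

That is, the fisheye pixel is first lifted to the perspective plane coordinates (x_u, y_u), after which the same cylindrical mapping as in the pinhole case applies.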
In one embodiment of the present application, the cylindrical projection image of the fisheye camera is consistent or similar in field of view to the pinhole camera cylindrical projection image, which enables the generation of a transformed image suitable for a perception task or an analysis task at the wide angle view of the fisheye camera.
In one embodiment of the present application, the cylindrical projection image of the fisheye camera is kept consistent or approximately consistent in view angle with the cylindrical projection image of the pinhole camera, while the transverse and longitudinal field angles remain the same as the pinhole camera's, to ensure the integrity of the image information. Usually the FOV (field of view) of a fisheye camera covers an ultra-wide angle of about 180°, but the high-definition texture of the image is concentrated near the central axis of the fisheye camera, and information near the edge regions is severely compressed. To reduce the blind area between cameras as much as possible, two new cylindrical projection central axes need to be considered in the actual setup; they are generally set at a certain included angle to the central axis of the fisheye camera, with the same FOV as the pinhole camera, and it is generally desirable that the two new cylindrical projection images, whose centers face symmetrically opposite directions, supplement each other with information from different viewing angles so as to reduce the blind area.
In one embodiment of the application, based on the image coding feature map, the cylindrical projection parameters, the probability of the radial depth and the target vehicle coordinate system, determining the aerial view plane coding feature of each frame of cylindrical projection image coding feature, wherein the image coding feature map is mapped to a three-dimensional gridding feature space taking the origin of the target vehicle coordinate system as the center through the cylindrical projection parameters and the probability of the radial depth; and carrying out planarization compression on the three-dimensional meshing feature space to obtain the aerial view plane coding feature. And processing the gridding characteristics of the BEV space through column pooling operation, compressing 3D information of the column onto a bird's-eye view plane, namely fusing information of a plurality of data sources, and mapping the information into a unified bird's-eye view space to obtain bird's-eye view plane coding characteristics. The characteristic representation can be better suitable for certain perception tasks, such as target detection, obstacle recognition and the like, namely, the fusion result of the video camera and the depth information of each view angle can be obtained more intuitively, and the characteristic representation can be used for subsequent perception tasks.
In one embodiment of the present application, before mapping each frame of cylindrical projection image coding features to a three-dimensional gridding feature space centered on the origin of the target vehicle coordinate system, further comprising converting the cylindrical projection image coordinates to real object point coordinates under the camera coordinates by the following formula:
wherein the pixel coordinate x_c is along the arc direction of the cylindrical surface, the coordinate y_c is along the height direction of the cylinder, the depth ρ_c is along the radial direction of the cylinder's central axis, f is the focal length of the camera, (X, Y, Z) are the real object point coordinates, and (x_c, y_c, ρ_c) are the cylindrical projection image coordinates.
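The published formula appears as an image; one reconstruction consistent with the variable definitions above, taking the azimuth as x_c / f on a cylinder of radius f and the radial depth as ρ_c, is:

```latex
X = \rho_c \sin\!\left(\frac{x_c}{f}\right), \qquad
Y = \frac{\rho_c\, y_c}{f}, \qquad
Z = \rho_c \cos\!\left(\frac{x_c}{f}\right).
```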
In one embodiment of the present application, as shown in fig. 6, fig. 6 is a schematic diagram of the camera coordinate system according to an exemplary embodiment of the present application. The position of each pixel of the cylindrical projection image is represented by a coordinate along the cylindrical arc direction and a coordinate along the cylinder height direction. These coordinates are obtained from the cylindrical projection transformation; under the camera coordinate system, the coordinates of the real object point are determined by its three-dimensional position in space. The coordinates of the real object point include a depth along the radial direction of the cylinder's central axis, which is not visible in the cylindrical projection image itself; therefore, the cylindrical projection image coordinates need to be converted into real object point coordinates in the camera coordinate system, so as to estimate, from the cylindrical projection image, the position of the real object point in the camera coordinate system.
In one embodiment of the present application, mapping each frame of cylindrical projection image coding features to a three-dimensional gridding feature space centered at the origin of the target vehicle coordinate system includes the following projection relationship of the cylindrical projection image coding features to the three-dimensional gridding feature space:
wherein F_i^2D denotes the cylindrical projection image coding features, V_i^3D denotes the three-dimensional gridded feature space, C_0 is the feature dimension, P_i ⊗ F_i^2D is the spatial image feature distribution obtained by probability-weighting the D equivalent-cylinder radial depth intervals corresponding to the i-th cylindrical projection image, and Γ{·} denotes mapping the cylindrical projection image coding features, by linear interpolation, into the Cartesian rectangular coordinate system space centered on the host vehicle, using the conversion of cylindrical projection image coordinates to real object point coordinates in camera coordinates, the equivalent cylinder transformation parameters, and the camera extrinsic parameters relative to the target vehicle body.
In one embodiment of the present application, the coordinate-position conversion relation between each voxel grid cell of the 3D gridded space R^(Z×X×Y) and the pixel-wise cylinder radial depth space R^(D×H×W) of each cylindrical projection image can be pre-computed and stored in a lookup table T_F; at model training and inference time, linear interpolation is performed directly by looking up the relevant position features in T_F, so that the BEV-view 3D gridded features are obtained quickly.
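As an illustrative sketch only (the Z/X/Y ordering and the -1 sentinel for out-of-grid samples are assumptions), precomputing such a lookup table could be done as follows:

```python
import torch

def build_voxel_lookup(points_ego, grid_min, voxel_size, grid_shape):
    """Precompute the lookup table T_F: for every (depth bin, pixel) sample point already
    expressed in the ego Cartesian frame, the flat index of the BEV voxel it falls into,
    or -1 if it lies outside the grid (illustrative sketch)."""
    # points_ego: (D, H, W, 3) ego-frame (z, x, y) coordinates of each cylindrical sample
    Z, X, Y = grid_shape
    idx3 = torch.floor((points_ego - torch.tensor(grid_min)) / torch.tensor(voxel_size)).long()
    inside = ((idx3 >= 0) & (idx3 < torch.tensor([Z, X, Y]))).all(dim=-1)
    flat = idx3[..., 0] * X * Y + idx3[..., 1] * Y + idx3[..., 2]
    return torch.where(inside, flat, torch.full_like(flat, -1))
```

The resulting index map is exactly what the lift/accumulate step sketched earlier consumes, so the per-frame work reduces to a gather and a weighted accumulation.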
In one embodiment of the present application, performing planarization compression on the three-dimensional gridded feature space through a pillar pooling operation includes: summing and averaging, along the vertical-axis height direction, all voxel feature vectors of the pillar corresponding to each grid cell in the bird's eye view gridded space, to obtain a first feature dimension; stacking, in the feature dimension, all voxel feature vectors of the corresponding pillar along the vertical-axis height direction, to obtain a second feature dimension; and compressing the three-dimensional gridded feature space onto the bird's eye view plane based on the first feature dimension and the second feature dimension. The gridded features of the BEV space are processed through the pillar pooling operation, and the 3D information of the pillars is compressed onto the BEV plane; that is, the information of multiple data sources is fused and mapped into the unified bird's eye view space to obtain the bird's eye view plane coding features. Such a feature representation is better suited to certain perception tasks, such as object detection and obstacle recognition.
In one embodiment of the present application, for the 3D features, the pillar pooling operation may either sum-and-average or dimension-stack all voxel feature vectors of the pillar corresponding, along the Z-axis height direction, to each grid cell on the XY plane of the BEV space under the host vehicle coordinate system. If sum-and-average is used, the new feature dimension is C_1 = C_0; if dimension stacking is used, the new feature dimension is C_1 = C_0 × Z; the BEV plane coding features are finally obtained, where C_0 is the first feature dimension and C_1 is the second feature dimension. The gridded features of the BEV space are processed through the pillar pooling operation, and the 3D information of the pillars is compressed onto the BEV plane; that is, the information of multiple data sources is fused and mapped into the unified BEV space to obtain the BEV plane coding features. Such a feature representation is better suited to certain perception tasks, such as object detection and obstacle recognition.
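As an illustrative sketch only (the channel-last layout is an assumption), the two pillar pooling variants described above could be written as:

```python
import torch

def pillar_pool(voxels, mode="mean"):
    """Compress the ego-centered 3D grid onto the BEV plane along the height axis.
    voxels: (Z, X, Y, C0); returns (X, Y, C0) for 'mean' or (X, Y, C0*Z) for 'stack'."""
    if mode == "mean":
        return voxels.mean(dim=0)                               # C1 = C0
    Z, X, Y, C0 = voxels.shape
    return voxels.permute(1, 2, 0, 3).reshape(X, Y, Z * C0)     # C1 = C0 * Z
```

The "mean" branch corresponds to the sum-and-average case (C_1 = C_0) and the "stack" branch to the dimension-stacking case (C_1 = C_0 × Z).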
In one embodiment of the application, feature fusion is performed on the current frame aerial view plane coding feature and the historical frame aerial view plane coding feature to obtain an aerial view time sequence fusion feature, wherein the method comprises the steps of calculating a plane coordinate conversion relationship from each historical frame aerial view plane coding feature to the current frame aerial view plane coding feature according to target vehicle positioning information of the historical frame aerial view plane coding feature and target vehicle positioning information of the current frame aerial view plane coding feature; based on a plane coordinate conversion relation, aligning each historical frame aerial view plane coordinate system to the same coordinate system to obtain a coordinate aligned historical frame aerial view plane coding feature; and setting weights for the plane coding features of the aerial view of the history frame with the aligned coordinates, and carrying out feature fusion on the plane coding features of the aerial view of the history frame with the aligned coordinates and the plane coding features of the aerial view of the current frame based on linear interpolation of the weights in plane space to obtain the time sequence fusion features of the aerial view. By fusing the plane coding features of the aerial view of each historical frame with the plane coding features of the aerial view of the current frame, the more complete and accurate plane coding features of the aerial view of the current frame can be obtained, the target vehicle can more accurately sense the surrounding environment based on the plane coding features of the aerial view of the current frame, and more abundant time sequence sensing information can be obtained for subsequent tasks.
In one embodiment of the present application, each preceding historical-frame bird's-eye-view plane coding feature and the current-frame bird's-eye-view plane coding feature are taken as input. Combining the vehicle positioning information of the current-frame time and of each preceding historical-frame time, the bird's-eye-view plane coordinate conversion relationship from each preceding historical frame to the current frame is calculated. Taking the current-frame bird's-eye-view plane coordinate system as the reference coordinate system, the bird's-eye-view feature plane coordinates of each preceding historical frame are converted and aligned to this reference coordinate system, and the time-sequence fusion of the bird's-eye-view plane coding features is completed by setting weights for each historical frame and performing linear interpolation in the plane space, yielding the bird's-eye-view time-sequence fusion feature. In this embodiment, the weights may be assigned according to the time distance between the historical frame and the current frame, or may be designed as learnable variables. The bird's-eye-view time-sequence fusion feature is then decoded with a smaller secondary trunk decoding network whose tail adopts a feature pyramid structure, outputting features of a suitable scale for each perception task. By fusing each historical-frame feature with the current-frame feature, a more complete and accurate current-frame bird's-eye-view plane coding feature is obtained, the target vehicle can perceive its surroundings more accurately on that basis, and richer time-sequence perception information is provided for subsequent tasks. The secondary trunk decoding network restores useful information from the fused BEV features, and its feature pyramid structure extracts information at different scales to match the needs of different perception tasks, so that the target vehicle can perceive its surroundings more accurately and make accurate decisions.
In the above embodiment, the decoding network, combined with the feature pyramid, can output a feature map of a suitable scale for each perception task. These tasks may include object detection, semantic segmentation, instance segmentation, etc., with appropriate features provided according to task needs.
In the above embodiment, the weights may be set manually, determined by the time interval between the historical frame and the current frame; alternatively, a neural network may learn dynamic weights, in which case the time intervals and related factors need to be taken into account during network training.
In one embodiment of the present application, given each preceding historical-frame bird's-eye-view plane coding feature and the current-frame bird's-eye-view plane coding feature, the positioning information of the target vehicle at each moment corresponding to the historical frames and the current frame is acquired, and the rotation and translation matrix of each historical moment relative to the current moment is calculated. Then, taking the current-frame bird's-eye-view plane coordinate system as the reference coordinate system, the BEV feature plane coordinates of each preceding historical frame are converted and aligned to this reference coordinate system, yielding coordinate-aligned historical-frame bird's-eye-view features. The bird's-eye-view time-sequence fusion is then completed by setting a weight for each historical frame and performing linear interpolation in the plane space, yielding the bird's-eye-view time-sequence fusion feature. The weights may be assigned manually according to the time distance between the historical frame and the current frame, or obtained as learnable variables through network training. Therefore, by fusing each historical-frame bird's-eye-view plane coding feature with the current-frame bird's-eye-view plane coding feature, a more complete and accurate current-frame bird's-eye-view plane coding feature is obtained, the target vehicle can perceive its surroundings more accurately on that basis, and richer time-sequence perception information is provided for subsequent tasks, so that the target vehicle can make accurate decisions.
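The alignment and weighted fusion described in this embodiment can be sketched as follows; the use of a 2x3 affine warp with bilinear grid sampling is one plausible realization of the plane-space linear interpolation, and the weight normalization at the end is an assumption.

```python
import torch
import torch.nn.functional as F

def warp_bev(feat: torch.Tensor, rot: torch.Tensor, trans: torch.Tensor) -> torch.Tensor:
    """Warp a history BEV feature map (1, C, H, W) into the current frame's BEV
    coordinates, given a 2x2 rotation and a 2-vector translation expressed in
    normalized BEV grid coordinates (derived from the ego positioning)."""
    theta = torch.cat([rot, trans.view(2, 1)], dim=1).unsqueeze(0)   # (1, 2, 3)
    grid = F.affine_grid(theta, feat.shape, align_corners=False)
    return F.grid_sample(feat, grid, mode="bilinear", align_corners=False)

def fuse_bev(current, history, rots, trans, weights):
    """Weighted fusion of coordinate-aligned history frames with the current frame."""
    fused = current.clone()
    for feat, r, t, w in zip(history, rots, trans, weights):
        fused = fused + w * warp_bev(feat, r, t)   # align, then weight
    return fused / (1.0 + sum(weights))            # simple normalization (assumption)
```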
In one embodiment of the present application, generalization training of the perception network model includes: collecting basic data information corresponding to the perception tasks, where the basic data information includes timestamp information; labeling task data samples on the basic data information for the different perception tasks to obtain the corresponding labeled data, where the perception tasks include three-dimensional target detection, road passable area, and lane lines; and constructing a loss function and performing generalization training of the perception network model based on the labeled data. By building the perception network model in this way and performing target detection with it, the accuracy of environment perception detection can be effectively improved.
In one embodiment of the application, the perception network model covers three-dimensional target detection across the visual field space, bird's-eye-view road passable area segmentation, and bird's-eye-view lane line extraction, and the labeled data include three-dimensional point cloud space target box labeling information and high-precision map road structuring labeling information. After the collected data are fused and aligned through the timestamp information, the basic data information is labeled for the different perception tasks to obtain each item of labeled data, which provides ground truth for the supervised training of each bird's-eye-view perception network model. In this example, the labeling process involves drawing target bounding boxes, dividing regions, classifying labels, and the like, and the perception network model is trained with the labeled data.
In one embodiment of the application, the generalization training of the BEV perception network model is completed by constructing a corresponding multi-task loss function and combining it with effective random data augmentation, yielding robust, near-optimal network weights; the trained weights are then provided to the perception network model for inference, so that the perception tasks can be completed accurately. By using a multi-task loss function, multiple perception tasks can be learned and solved at the same time, and during training random data augmentation such as translation, rotation and scaling is applied to increase the generalization ability of the model.
In one embodiment of the present application, constructing the loss function includes: constructing the total loss function of the perception network model from the perception task prediction losses together with the cylinder radial depth prediction loss; the perception task prediction losses include a target detection loss and an image segmentation task loss; the target detection loss is obtained by a weighted combination of a classification loss function and a three-dimensional box regression loss function; the image segmentation task loss is calculated by comparing the predicted segmentation mask with the real segmentation mask; and the depth prediction loss is obtained by a simple-classification cross-entropy loss or an ordered regression loss. By using the multi-task loss function, multiple perception tasks can be learned and solved simultaneously; during training, applying these loss functions together with random data augmentation produces a perception network model with robustness and generalization performance that can serve the various perception tasks.
In one embodiment of the present application, the total training loss L of the model is composed of all the perception task prediction losses together with the cylinder radial depth prediction loss. The target detection loss L_det is generally obtained by a weighted combination of a Focal classification loss L_cls and a 3D-box L1 regression loss L_reg; the image segmentation loss L_seg is obtained by binary cross entropy; the depth prediction loss L_dep can be constructed either as a simple-classification cross-entropy loss or as an ordered regression loss. Random data augmentation methods used during network training to improve generalization include, but are not limited to, random scale transformation, region cropping, symmetric mirroring, color transformation, and grid occlusion. Randomly modifying the input data during training helps the model adapt to different scenes and changes and improves the generalization performance of the perception network model; using the multi-task loss function, multiple perception tasks can be learned and solved simultaneously, and applying these loss functions together with random data augmentation produces a perception network model with robustness and generalization performance for the various perception tasks.
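A hedged sketch of how the total loss L could be assembled from the components named above (Focal classification, L1 box regression, binary cross entropy for segmentation, simple-classification cross entropy for depth); the weighting coefficients are placeholders, not values from the patent.

```python
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss  # Focal classification loss

def total_loss(cls_logits, cls_targets, box_preds, box_targets,
               seg_logits, seg_masks, depth_logits, depth_bins,
               w_det=1.0, w_seg=1.0, w_dep=0.5):
    # Target detection loss L_det: Focal classification + L1 3D-box regression
    l_cls = sigmoid_focal_loss(cls_logits, cls_targets, reduction="mean")
    l_reg = F.l1_loss(box_preds, box_targets)
    l_det = l_cls + l_reg
    # Image segmentation loss L_seg: binary cross entropy on predicted vs. real masks
    l_seg = F.binary_cross_entropy_with_logits(seg_logits, seg_masks)
    # Cylinder radial depth loss L_dep: simple-classification cross entropy over bins
    l_dep = F.cross_entropy(depth_logits, depth_bins)
    return w_det * l_det + w_seg * l_seg + w_dep * l_dep
```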
In one embodiment of the application, when labeled fisheye camera image data are limited, the larger accumulated stock of pinhole camera images is first converted into cylindrical projection images for large-scale pre-training, and the converted fisheye camera cylindrical projection images are then incorporated for network fine-tuning, achieving a better prediction effect.
In one embodiment of the present application, as shown in fig. 7, which is a schematic diagram of a network model of the multi-source fusion bird's-eye-view perception target detection method according to an exemplary embodiment of the present application, the three-dimensional gridding feature space corresponds to the three-dimensional space feature, the bird's-eye-view plane coding feature to the BEV feature, and the bird's-eye-view time-sequence fusion feature to the BEV fusion feature. Pinhole camera images from several pinhole cameras and fisheye camera images from several fisheye cameras are acquired and converted by cylindrical projection with equal focal lengths and view angles, yielding several pinhole camera cylindrical projection images and several fisheye camera cylindrical projection images. The cylindrical projection images are encoded by a backbone network to obtain image coding features; the image coding features are mapped by the cylinder radial depth prediction network into the three-dimensional space feature centered on the origin of the target vehicle coordinate system; the three-dimensional space feature is flattened by the cylinder pooling operation to obtain the BEV feature; each preceding historical-frame BEV feature is aligned and fused with the current-frame BEV feature to obtain the BEV fusion feature; the BEV fusion feature is decoded by the decoding network to obtain the corresponding multi-scale decoding features; and each scale decoding feature is connected to its corresponding perception task head network, yielding perception network head 1, perception network head 2 and perception network head 3, which in this embodiment perform three-dimensional target detection across the visual field space, bird's-eye-view road passable area segmentation, and bird's-eye-view lane line extraction. By applying cylindrical mapping to images from different sources, the bird's-eye-view perception network model can, within the same main framework, support the fusion of fisheye images and pinhole images with different installation layouts and orientations and jointly complete bird's-eye-view feature mapping and perception tasks, so that target tracking detection is performed based on the bird's-eye-view plane coding features and the target tracking detection results across the several cylindrical projection image coding features are obtained; therefore, image distortion is reduced when a close-range target object is detected, the near-far proportion of the feature distribution can be optimized, and the accuracy of environment perception detection is effectively improved.
In one embodiment of the present application, as shown in fig. 8, which is a schematic block diagram of the multi-source fusion bird's-eye-view perception target detection method according to an exemplary embodiment of the present application, the image coding feature map corresponds to the 2D image coding feature, the three-dimensional gridding feature space to the BEV space, and the bird's-eye-view plane coding feature to the BEV feature. A pinhole camera image and a fisheye camera image are acquired and converted by cylindrical projection into a pinhole camera cylindrical projection image and a fisheye camera cylindrical projection image, respectively. The cylindrical projection images are encoded by a shared 2D backbone network to obtain 2D image coding features, which are mapped, based on the cylindrical projection parameters and the radial depth probabilities, into the BEV space centered on the origin of the target vehicle coordinate system. The three-dimensional gridding feature space is flattened by the cylinder pooling operation to obtain the BEV feature; each preceding historical-frame BEV feature is aligned and fused with the current-frame BEV feature to obtain the bird's-eye-view time-sequence fusion feature; the secondary backbone network decodes it into multi-scale BEV decoding features; and each scale decoding feature is connected to its corresponding perception task head network to obtain each perception network model, the perception tasks including three-dimensional target detection across the visual field space, bird's-eye-view road passable area segmentation, and bird's-eye-view lane line extraction. Three-dimensional target detection across the visual field space is trained with the three-dimensional point cloud space target box labeling information; bird's-eye-view road passable area segmentation and bird's-eye-view lane line extraction are trained with the high-precision map road structuring labeling information; and target detection is then performed based on each trained perception network model. The cylindrical-projection equivalent-cylinder radial depth prediction network is trained with three-dimensional point cloud space coordinate information, and the cylinder radial depth is predicted with the trained network. By applying cylindrical mapping to images from different sources, the bird's-eye-view perception network model can, within the same main framework, support the fusion of fisheye images and pinhole images with different installation layouts and orientations and jointly complete bird's-eye-view feature mapping and perception tasks, so that target tracking detection is performed based on the bird's-eye-view plane coding features and the target tracking detection results across the several cylindrical projection image coding features are obtained; therefore, image distortion is reduced when a close-range target object is detected, the near-far proportion of the feature distribution can be optimized, and the accuracy of environment perception detection is effectively improved.
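To make the data flow of Figs. 7 and 8 easier to follow, the sketch below traces one forward pass; every module name (to_cylindrical, backbone2d, lift_to_3d, cylinder_pool, temporal_fuse, decoder, heads) is an illustrative placeholder for the components described above, not an API defined by the patent.

```python
def forward(pinhole_inputs, fisheye_inputs, history_bev, ego_poses, model):
    # 1. Cylindrical projection conversion for both pinhole and fisheye cameras
    cyl_images = [model.to_cylindrical(img, cam)
                  for img, cam in pinhole_inputs + fisheye_inputs]
    # 2. Shared 2D backbone encodes every cylindrical projection image
    feats = [model.backbone2d(img) for img in cyl_images]
    # 3. Radial depth probabilities + cylindrical parameters lift features to 3D
    voxels = model.lift_to_3d(feats, model.cyl_params, model.extrinsics)
    # 4. Cylinder pooling compresses the 3D grid onto the BEV plane
    bev = model.cylinder_pool(voxels)
    # 5. Align and fuse preceding history BEV features with the current frame
    bev_fused = model.temporal_fuse(bev, history_bev, ego_poses)
    # 6. Secondary decoder with feature pyramid, then one head per perception task
    pyramid = model.decoder(bev_fused)
    return {name: head(pyramid[scale]) for name, (head, scale) in model.heads.items()}
```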
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout the various fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from teaching.
For example, the machine learning model may include a neural-network-based supervised model, such as a binary classification machine learning model, trained with a large number of historical trajectories; during training the model adjusts its parameters so that the adjusted parameters give comprehensive predictive performance over the full range of characteristics of the navigation object, such as moving speed, moving direction, moving habits, and dynamic and static states.
Fig. 9 is a block diagram of a multi-source fused bird's-eye-view perception target detection apparatus according to an exemplary embodiment of the present application. The apparatus may be applied in the implementation environment shown in fig. 2 and specifically configured in the smart terminal 210. The apparatus may also be adapted to other exemplary implementation environments and configured in other devices, and this embodiment does not limit the implementation environments to which the apparatus is adapted.
As shown in fig. 9, the exemplary multi-source fused bird's-eye-view perception target detection device 900 includes an acquisition module 901, a bird's-eye-view plane coding feature generation module 902, a fusion module 903, a decoding module 904 and a training module 905. The acquisition module 901 is configured to acquire the cylindrical projection parameters, several frames of cylindrical projection images, the image coding feature map, and the probabilities of the radial depths to which the pixel features of the image coding feature map belong, where the cylindrical projection images include a pinhole camera cylindrical projection image and a fisheye camera cylindrical projection image, and the cylindrical projection parameters include the cylindrical projection parameters of the fisheye camera image and the cylindrical projection parameters of the pinhole camera. The bird's-eye-view plane coding feature generation module 902 is configured to determine the bird's-eye-view plane coding feature of each frame of cylindrical projection image coding features based on the image coding feature map, the cylindrical projection parameters, the radial depth probabilities and the target vehicle coordinate system. The fusion module 903 is configured to record the bird's-eye-view plane coding feature corresponding to the current-frame cylindrical projection image as the current-frame bird's-eye-view plane coding feature, and to fuse it with the historical-frame bird's-eye-view plane coding features to obtain the bird's-eye-view time-sequence fusion feature. The decoding module 904 is configured to decode the bird's-eye-view time-sequence fusion feature to obtain the corresponding scale decoding features. The training module 905 is configured to connect the scale decoding features with the perception task head networks to obtain the perception network model, to perform generalization training on the perception network model, and to detect targets based on the trained perception network model. According to the invention, by applying cylindrical mapping to images from different sources, the bird's-eye-view perception network model can, within the same main framework, support the fusion of fisheye images and pinhole images with different installation layouts and orientations and jointly complete bird's-eye-view feature mapping and perception tasks, so that target tracking detection is performed based on the bird's-eye-view plane coding features and the tracking detection results of targets across the several cylindrical projection image coding features are obtained. The method is therefore applicable not only to common, conventional pinhole camera images but also to fisheye camera images, reduces image distortion when a close-range target object is detected, can optimize the near-far proportion of the feature distribution, and thereby effectively improves the accuracy of environment perception detection. In addition, the invention can be applied alone to 360-degree all-fisheye images without pinhole images to complete bird's-eye-view perception, can reuse a large number of existing conventional pinhole camera images, and has low development cost and strong practicability.
In an embodiment of the invention, the acquisition module is used for acquiring a plurality of camera images, the camera images including a pinhole camera image and a fisheye camera image; constructing a virtual cylinder according to the intrinsic parameter information of the camera image, where the center point of the virtual cylinder is the origin of the camera optical axis, the radius of the virtual cylinder is the focal length of the camera, the central axis of the cylindrical projection image coincides with the optical axis of the camera image, and the transverse view angle of the cylindrical projection of the virtual cylinder is the same as that of the camera; performing coordinate distortion correction on the camera image through the camera image distortion model parameters to obtain the camera image coordinates to be converted; and converting the camera image coordinates to be converted into the camera cylindrical projection image by means of the virtual cylinder.
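A sketch of remapping an already-undistorted pinhole image onto the virtual cylinder described above (radius equal to the focal length, axis on the optical axis, equal transverse view angle) is given below; the output resolution and the reuse of the pinhole principal point are assumptions, and the fisheye case would additionally apply the fisheye distortion model before this step.

```python
import cv2
import numpy as np

def pinhole_to_cylindrical(img, K):
    """Remap an undistorted pinhole image onto a virtual cylinder of radius f.
    K: 3x3 intrinsic matrix. Pixels that fall outside the source stay black."""
    h, w = img.shape[:2]
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    f = fx                                         # cylinder radius = focal length
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    theta = (u - cx) / f                           # arc coordinate -> view angle
    map_x = (fx * np.tan(theta) + cx).astype(np.float32)
    map_y = ((v - cy) / np.cos(theta) * (fy / f) + cy).astype(np.float32)
    return cv2.remap(img, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```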
In an embodiment of the present invention, the aerial view plane coding feature generation module is configured to map, by using a cylindrical projection parameter and a probability that a radial depth belongs to, a cylindrical projection image coding feature of each frame to a three-dimensional gridding feature space centered on an origin of a coordinate system of a target vehicle; and carrying out planarization compression on the three-dimensional gridding characteristic space through cylinder pooling operation to obtain the aerial view plane coding characteristic.
In an embodiment of the present invention, the obtaining module is further configured to convert the cylindrical projection image coordinates into real object point coordinates under the camera coordinates by the following formula:
wherein the pixel coordinate x_c is along the arc direction of the cylindrical surface, the coordinate y_c is along the height direction of the cylinder, the depth ρ_c is along the radial direction of the central axis of the cylinder, f is the focal length, (X, Y, Z) are the real object point coordinates, and (x_c, y_c, ρ_c) are the cylindrical projection image coordinates.
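The formula referenced above did not survive extraction into this text. Under the stated conventions (x_c along the arc, y_c along the cylinder height, ρ_c the radial depth, f the focal length acting as the cylinder radius), one plausible form of the back-projection is sketched below; it should be read as an assumption rather than the patent's exact expression.

```python
import math

def cylindrical_to_camera(x_c: float, y_c: float, rho_c: float, f: float):
    """Back-project a cylindrical projection image coordinate (x_c, y_c) with
    radial depth rho_c to a real object point (X, Y, Z) in camera coordinates
    (Z along the optical axis)."""
    theta = x_c / f                  # arc length on a radius-f cylinder -> angle
    X = rho_c * math.sin(theta)      # lateral offset
    Y = rho_c * y_c / f              # height scales with the radial depth
    Z = rho_c * math.cos(theta)      # depth along the optical axis
    return X, Y, Z
```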
In an embodiment of the present invention, the aerial view plane coding feature generation module is configured to map each frame of cylindrical projection image coding features to the three-dimensional gridding feature space centered on the origin of the target vehicle coordinate system by means of a preset projection relationship, where the preset projection relationship is:
wherein the relationship combines the cylindrical projection image coding feature, the three-dimensional gridding feature space with feature dimension C0, and the spatial image feature distribution obtained after probability weighting over the D equivalent-cylinder radial depth intervals of the i-th cylindrical projection image, and Γ{·} denotes mapping the cylindrical projection image coding feature, through linear interpolation, into the vehicle-centered Cartesian rectangular coordinate space by means of the conversion of cylindrical projection image coordinates into real object point coordinates under the camera coordinates, the equivalent cylinder transformation parameters, and the extrinsic parameters of the camera relative to the target vehicle body.
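A hedged sketch of the Γ{·} lifting step: image coding features, weighted by the radial-depth probability distribution, are scattered into a vehicle-centered voxel grid. Nearest-voxel scatter is used here instead of linear interpolation, and the grid layout, bin ordering and argument names are assumptions for illustration.

```python
import torch

def lift_to_voxels(feat, depth_prob, points, grid_shape, voxel_size, grid_origin):
    """feat:       (C0, H, W)    cylindrical projection image coding features
    depth_prob: (D, H, W)     probabilities over D equivalent-cylinder depth bins
    points:     (D, H, W, 3)  vehicle-frame 3D points for every pixel/depth bin,
                              precomputed from the cylindrical back-projection,
                              equivalent cylinder parameters and extrinsics
    Returns a (C0, Zg, Xg, Yg) voxel feature volume."""
    c0 = feat.shape[0]
    zg, xg, yg = grid_shape
    weighted = depth_prob.unsqueeze(1) * feat.unsqueeze(0)            # (D, C0, H, W)
    idx = torch.floor((points - grid_origin) / voxel_size).long()     # voxel indices
    valid = ((idx >= 0) & (idx < torch.tensor(grid_shape))).all(-1)   # inside grid
    zi, xi, yi = idx[valid].unbind(-1)
    lin = (zi * xg + xi) * yg + yi                                    # flat index
    flat = torch.zeros(c0, zg * xg * yg)
    flat.index_add_(1, lin, weighted.permute(1, 0, 2, 3)[:, valid])   # accumulate
    return flat.view(c0, zg, xg, yg)
```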
In an embodiment of the present invention, the aerial view plane coding feature generation module is further configured to add and average all voxel feature vectors of the column corresponding to each grid cell along the vertical-axis (height) direction in the aerial view grid space, to obtain the first feature dimension; to stack all voxel feature vectors of the column along the vertical-axis direction, to obtain the second feature dimension; and to perform planarization compression on the three-dimensional gridding feature space based on the first feature dimension and the second feature dimension.
In an embodiment of the invention, the fusion module is configured to calculate the plane coordinate conversion relationship from each historical frame aerial view plane coding feature to the current frame aerial view plane coding feature according to the target vehicle positioning information of the historical frames and of the current frame; to align each historical frame aerial view plane coordinate system to the same coordinate system based on the plane coordinate conversion relationship, obtaining coordinate-aligned historical frame aerial view plane coding features; and to set weights for the coordinate-aligned historical frame aerial view plane coding features and fuse them with the current frame aerial view plane coding feature by weighted linear interpolation in plane space, obtaining the aerial view time sequence fusion feature.
In an embodiment of the present invention, the training module is configured to collect basic data information corresponding to a sensing task, where the basic data information includes timestamp information; and performing task data sample labeling on the basic data information aiming at different perception tasks to obtain corresponding labeling data, wherein the perception tasks comprise: three-dimensional target detection, a road passable area and lane lines; and constructing a loss function, and performing generalization training on the perception network model based on the labeling data.
In an embodiment of the present invention, the training module is configured to construct a total loss function of the perceptual network model according to the predicted loss of the perceptual task and the predicted loss of the cylinder radial depth; the perceived task prediction loss comprises a target detection loss and an image segmentation task loss; the target detection loss is obtained by weighting calculation of a classification loss function and a three-dimensional frame regression loss function; the image segmentation task penalty is calculated by comparing the predicted segmentation mask with the true segmentation mask. Depth prediction loss is obtained by cross entropy loss of simple classification or ordered regression loss calculation.
It should be noted that the multi-source fused bird's-eye-view perception target detection device provided in the foregoing embodiment and the multi-source fused bird's-eye-view perception target detection method provided in the foregoing embodiments belong to the same concept, and the specific manner in which each module and unit performs its operations has been described in detail in the method embodiments and is not repeated here. In practical applications, the multi-source fused bird's-eye-view perception target detection device provided in the above embodiment may distribute the functions to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above, which is not limited here.
An embodiment of the application also provides a device comprising one or more processors and a storage means for storing one or more programs which, when executed by the one or more processors, cause the electronic device to implement the multi-source fusion bird's-eye-view perception target detection method provided in the various embodiments described above.
Embodiments of the present application also provide a medium having stored thereon a computer program which, when executed by a processor of a computer, causes the computer to perform the multi-source fusion bird's-eye-view perception target detection method provided in the above embodiments.
Fig. 10 shows a schematic diagram of a computer system suitable for use in implementing the electronic device of the embodiments of the present application. It should be noted that, the computer system 1000 of the electronic device shown in fig. 10 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in fig. 10, the computer system 1000 includes a central processing unit (Central Processing Unit, CPU) 1001 that can perform various appropriate actions and processes, such as performing the method described in the above embodiments, according to a program stored in a Read-Only Memory (ROM) 1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data required for system operation are also stored. The CPU 1001, ROM 1002, and RAM 1003 are connected to each other by a bus 1004. An Input/Output (I/O) interface 1005 is also connected to the bus 1004.
The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output portion 1007 including a Cathode Ray Tube (CRT), a liquid crystal display (Liquid Crystal Display, LCD), and a speaker; a storage portion 1008 including a hard disk or the like; and a communication section 1009 including a network interface card such as a LAN (Local Area Network ) card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the internet. The drive 1010 is also connected to the I/O interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is installed on the drive 1010 as needed, so that a computer program read out therefrom is installed into the storage section 1008 as needed.
In particular, according to embodiments of the present application, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising a computer program for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 1009, and/or installed from the removable medium 1011. When executed by a Central Processing Unit (CPU) 1001, the computer program performs various functions defined in the system of the present application.
It should be noted that, the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium may be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-Only Memory (ROM), an erasable programmable read-Only Memory (Erasable Programmable Read Only Memory, EPROM), flash Memory, an optical fiber, a portable compact disc read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with a computer-readable computer program embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. A computer program embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Where each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by means of software, or may be implemented by means of hardware, and the described units may also be provided in a processor. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
Another aspect of the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor of a computer, causes the computer to perform a method of bird's eye view aware target detection for multi-source fusion as described above. The computer-readable storage medium may be included in the electronic device described in the above embodiment or may exist alone without being incorporated in the electronic device.
Another aspect of the present application also provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the multisource fused bird's eye view sensing target detection method provided in the above embodiments.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Modifications and variations may be made to the above-described embodiments by those skilled in the art without departing from the spirit and scope of the invention. It is therefore intended that all equivalent modifications and changes made by those skilled in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the appended claims.

Claims (12)

1. The multi-source fused aerial view perception target detection method is characterized by comprising the following steps of:
acquiring cylindrical projection parameters, a plurality of frames of cylindrical projection images, an image coding feature map and probability of radial depth belonging to each pixel feature of the image coding feature map, wherein the cylindrical projection images comprise a pinhole camera cylindrical projection image and a fisheye camera cylindrical projection image, and the cylindrical projection parameters comprise the cylindrical projection parameters of the fisheye camera image and the cylindrical projection parameters of a pinhole camera;
determining aerial view plane coding features of cylindrical projection image coding features of each frame based on the image coding feature map, the cylindrical projection parameters, the probability of the radial depth and a target vehicle coordinate system;
recording the aerial view plane coding feature corresponding to the cylindrical projection image of the current frame as the aerial view plane coding feature of the current frame, and carrying out feature fusion on the aerial view plane coding feature of the current frame and the aerial view plane coding feature of the historical frame to obtain an aerial view time sequence fusion feature;
decoding the bird's eye view time sequence fusion characteristics to obtain corresponding scale decoding characteristics;
Connecting the scale decoding features with a perception task head network to obtain a perception network model; and generalizing and training the perception network model, and performing target detection based on the trained perception network model.
2. The multi-source fused aerial view perception target detection method of claim 1, wherein obtaining a plurality of frames of cylindrical projection images comprises:
collecting a plurality of camera images, wherein the camera images comprise a pinhole camera image and a fisheye camera image;
constructing a virtual cylinder according to the internal reference information of the camera image, wherein the center point of the virtual cylinder is the origin of the camera optical axis, the radius of the virtual cylinder is the focal length of the camera, the central axis of the cylindrical projection image coincides with the optical axis of the camera image, and the transverse view angle of the cylindrical projection of the virtual cylinder is the same as the transverse view angle of the camera;
carrying out coordinate distortion correction on the camera image through camera image distortion model parameters to obtain camera image coordinates to be converted;
and converting the image coordinates of the camera to be converted into cylindrical projection images of the camera by utilizing the virtual cylinder.
3. The multisource fusion bird's eye view sensing target detection method according to claim 1, wherein determining bird's eye view plane coding features of each frame of cylindrical projection image coding features based on the image coding feature map, the cylindrical projection parameters, the probability of the radial depth belonging to and a target vehicle coordinate system comprises:
Mapping the image coding feature map to a three-dimensional gridding feature space taking the origin of a coordinate system of a target vehicle as a center through the cylindrical projection parameters and the probability of the radial depth;
and carrying out planarization compression on the three-dimensional meshing feature space to obtain the aerial view plane coding feature.
4. The multi-source fused bird's eye view sensing target detection method according to claim 3, wherein before mapping the cylindrical projection image coding feature of each frame to a three-dimensional gridding feature space centered on the origin of the target vehicle coordinate system, further comprising:
converting the cylindrical projection image coordinates into real object point coordinates in camera coordinates by the following formula:
wherein x_c is along the arc direction of the cylindrical surface, y_c is along the height direction of the cylindrical surface, ρ_c is along the radial direction of the central axis of the cylinder, f is the focal length, (X, Y, Z) are the real object point coordinates, and (x_c, y_c, ρ_c) are the cylindrical projection image coordinates.
5. The multi-source fused bird's eye view sensing target detection method according to claim 4, wherein mapping the cylindrical projection image coding feature of each frame to a three-dimensional gridded feature space centered on the origin of the target vehicle coordinate system comprises:
Mapping the cylindrical projection image coding features of each frame to a three-dimensional gridding feature space taking the origin of a target vehicle coordinate system as the center by utilizing a preset projection relation; wherein, the preset projection relation is:
wherein the relationship combines the cylindrical projection image coding feature, the three-dimensional gridding feature space with feature dimension C0, and the probability-weighted spatial image feature distribution over the D equivalent-cylinder radial depth intervals corresponding to the i-th cylindrical projection image, and Γ{·} denotes mapping the cylindrical projection image coding feature by converting the cylindrical projection image coordinates into real object point coordinates under the camera coordinates.
6. A multisource fusion bird's eye view perception target detection method according to claim 3, wherein the flattening compression of the three-dimensional gridding feature space by cylinder pooling operation comprises:
adding and averaging all voxel characteristic vectors corresponding to the column body along the vertical axis height direction in the aerial view space meshing space to obtain a first characteristic dimension;
carrying out dimension stacking on all voxel feature vectors corresponding to the column body along the height direction of the longitudinal axis to obtain a second feature dimension;
and carrying out planarization compression on the three-dimensional meshing feature space based on the first feature dimension and the second feature dimension.
7. The multisource fusion bird's-eye view sensing target detection method according to claim 1, wherein feature fusion is performed on current frame bird's-eye view plane coding features and historical frame bird's-eye view plane coding features to obtain bird's-eye view time sequence fusion features, and the method comprises the steps of:
calculating the plane coordinate conversion relation from the historical frame aerial view plane coding feature to the current frame aerial view plane coding feature according to the target vehicle positioning information of the historical frame aerial view plane coding feature and the target vehicle positioning information of the current frame aerial view plane coding feature;
based on the plane coordinate conversion relation, aligning each historical frame aerial view plane coordinate system to the same coordinate system to obtain a coordinate aligned historical frame aerial view plane coding feature;
and setting weights for the plane coding features of the aerial view of the history frame with the aligned coordinates, carrying out feature fusion on the plane coding features of the aerial view of the history frame with the aligned coordinates and the plane coding features of the aerial view of the current frame based on linear interpolation of the weights in plane space, and obtaining the time sequence fusion features of the aerial view.
8. The multi-source fused bird's eye view sensing target detection method according to any one of claims 1-7, wherein generalizing the sensing network model comprises:
Basic data information corresponding to a perception task is collected, wherein the basic data information comprises time stamp information;
and performing task data sample labeling on the basic data information aiming at different perception tasks to obtain corresponding labeling data, wherein the perception tasks comprise: three-dimensional target detection, a road passable area and lane lines;
and constructing a loss function, and performing generalization training on the perception network model based on the labeling data.
9. The multi-source fused bird's eye view sensing target detection method of claim 8, wherein constructing the loss function comprises:
constructing a total loss function of the perception network model according to the perception task prediction loss and the cylinder radial depth prediction loss; wherein the perceived task prediction loss includes a target detection loss and an image segmentation task loss; the target detection loss is obtained by weighting calculation of a classification loss function formula and a three-dimensional frame regression loss function; the image segmentation task loss is calculated by comparing a predicted segmentation mask and a real segmentation mask;
the depth prediction loss is obtained by cross entropy loss of simple classification or ordered regression loss calculation.
10. The utility model provides a multisource fused bird's eye view perception target detection device which characterized in that, multisource fused bird's eye view perception target detection device includes:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring cylindrical projection parameters, a plurality of frames of cylindrical projection images, an image coding feature map and the probability that the radial depth corresponding to each pixel feature of the image coding feature map belongs to, the cylindrical projection images comprise a pinhole camera cylindrical projection image and a fisheye camera cylindrical projection image, and the cylindrical projection parameters comprise the cylindrical projection parameters of the fisheye camera image and the cylindrical projection parameters of a pinhole camera;
the aerial view plane coding feature generation module is used for obtaining the aerial view plane coding feature of each frame of cylindrical projection image based on the image coding feature map, the cylindrical projection parameters and the probability of the radial depth, combined with the extrinsic parameters of the camera relative to the target vehicle coordinate system;
the merging module is used for fusing each historical frame aerial view plane coding feature with the current frame aerial view plane coding feature to obtain the merged aerial view time sequence feature;
the training module is used for decoding the combined aerial view time sequence features to obtain corresponding scale decoding features, connecting each scale decoding feature with a corresponding perception task head network to obtain each perception network model, performing generalization training on each aerial view perception network model corresponding to each perception task, and performing target detection based on each trained aerial view perception network model.
11. An apparatus, the apparatus comprising:
one or more processors;
storage means for storing one or more programs that, when executed by the one or more processors, cause the electronic device to implement the multisource fusion bird's eye view aware target detection method of any of claims 1 to 9.
12. A medium having stored thereon a computer program which, when executed by a processor of a computer, causes the computer to perform the multisource fused bird's eye view aware target detection method according to any of claims 1 to 9.
CN202311294008.9A 2023-10-08 2023-10-08 Multisource fusion bird's eye view perception target detection method, device, equipment and medium Pending CN117315424A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311294008.9A CN117315424A (en) 2023-10-08 2023-10-08 Multisource fusion bird's eye view perception target detection method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311294008.9A CN117315424A (en) 2023-10-08 2023-10-08 Multisource fusion bird's eye view perception target detection method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN117315424A true CN117315424A (en) 2023-12-29

Family

ID=89273421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311294008.9A Pending CN117315424A (en) 2023-10-08 2023-10-08 Multisource fusion bird's eye view perception target detection method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN117315424A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117998328A (en) * 2024-04-07 2024-05-07 北京航空航天大学 Blind area sensing method and system based on active optimization of computing network resources
CN117998328B (en) * 2024-04-07 2024-06-11 北京航空航天大学 Blind area sensing method and system based on active optimization of computing network resources
CN118154688A (en) * 2024-05-11 2024-06-07 擎翌(上海)智能科技有限公司 Pose correction method and device based on multi-source data matching and electronic equipment

Similar Documents

Publication Publication Date Title
CN114723955B (en) Image processing method, apparatus, device and computer readable storage medium
CN113819890B (en) Distance measuring method, distance measuring device, electronic equipment and storage medium
CN112912920A (en) Point cloud data conversion method and system for 2D convolutional neural network
EP3822852A2 (en) Method, apparatus, computer storage medium and program for training a trajectory planning model
CN111563415A (en) Binocular vision-based three-dimensional target detection system and method
CN117315424A (en) Multisource fusion bird's eye view perception target detection method, device, equipment and medium
US20230213643A1 (en) Camera-radar sensor fusion using local attention mechanism
Kim et al. Crn: Camera radar net for accurate, robust, efficient 3d perception
Premachandra et al. Detection and tracking of moving objects at road intersections using a 360-degree camera for driver assistance and automated driving
CN114821507A (en) Multi-sensor fusion vehicle-road cooperative sensing method for automatic driving
CN112257668A (en) Main and auxiliary road judging method and device, electronic equipment and storage medium
Shi et al. Grid-centric traffic scenario perception for autonomous driving: A comprehensive review
Ouyang et al. A cgans-based scene reconstruction model using lidar point cloud
CN116229224A (en) Fusion perception method and device, electronic equipment and storage medium
CN117237919A (en) Intelligent driving sensing method for truck through multi-sensor fusion detection under cross-mode supervised learning
Chen et al. Multitarget vehicle tracking and motion state estimation using a novel driving environment perception system of intelligent vehicles
Luo et al. Dynamic multitarget detection algorithm of voxel point cloud fusion based on pointrcnn
Bi et al. Machine vision
CN117372991A (en) Automatic driving method and system based on multi-view multi-mode fusion
CN116259040A (en) Method and device for identifying traffic sign and electronic equipment
US20230105331A1 (en) Methods and systems for semantic scene completion for sparse 3d data
CN115272450A (en) Target positioning method based on panoramic segmentation
Qiao et al. CoBEVFusion: Cooperative Perception with LiDAR-Camera Bird's-Eye View Fusion
Wang et al. A research on advanced technology of target detection in unmanned driving
Akın et al. Challenges in determining the depth in 2-d images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination