CN117423077A - BEV perception model, construction method, device, equipment, vehicle and storage medium - Google Patents


Info

Publication number
CN117423077A
Authority
CN
China
Prior art keywords
bev
model
image
perception
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311315964.0A
Other languages
Chinese (zh)
Inventor
王泓清
刘勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Geely Holding Group Co Ltd
Zhejiang Shikong Daoyu Technology Co Ltd
Original Assignee
Zhejiang Geely Holding Group Co Ltd
Zhejiang Shikong Daoyu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Geely Holding Group Co Ltd, Zhejiang Shikong Daoyu Technology Co Ltd filed Critical Zhejiang Geely Holding Group Co Ltd
Priority to CN202311315964.0A priority Critical patent/CN117423077A/en
Publication of CN117423077A publication Critical patent/CN117423077A/en
Pending legal-status Critical Current

Classifications

    • G06V20/56 — Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06N3/045 — Combinations of networks
    • G06V10/454 — Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/7715 — Feature extraction, e.g. by transforming the feature space; Mappings, e.g. subspace methods
    • G06V10/806 — Fusion, i.e. combining data from various sources at the feature extraction level
    • G06V10/82 — Arrangements for image or video recognition or understanding using neural networks
    • G06V20/64 — Three-dimensional objects
    • G06V2201/07 — Target detection
    • Y02T10/40 — Engine management systems


Abstract

The invention provides a BEV perception model, a construction method, a device, equipment, a vehicle and a storage medium, belonging to the technical field of automatic driving, and solves the problem that existing BEV perception models have a poor perception effect. The BEV perception model comprises an image view encoder, a mapping module, a three-dimensional backbone network, a three-dimensional target detection head, a BEV encoder for extracting bird's-eye-view features from a fused image formed by fusing a remote sensing image with the bird's-eye view, and a fusion module for fusing radar BEV features with camera BEV features. The BEV perception model construction method comprises the steps of: A. training data preparation and loading; B. dividing the data set; C. constructing the BEV network model; D. training the model; E. evaluating the model; F. model prediction. The invention also provides a device, equipment, a vehicle and a storage medium based on the BEV perception model. The invention uses an instant remote sensing constellation to observe the ground and obtain a natural BEV view angle, which is integrated into the BEV model to improve the perception effect.

Description

BEV perception model, construction method, device, equipment, vehicle and storage medium
Technical Field
The invention belongs to the technical field of automatic driving, and relates to a BEV perception model, a construction method, a device, equipment, a vehicle and a storage medium.
Background
In recent years, automatic driving technology has gradually become a research hotspot in the automotive field, and perception is a key problem in automatic driving: based on the environment information around the vehicle acquired by sensors and combined with preset algorithms, it helps the vehicle understand its surroundings and make decisions, thereby completing path planning and vehicle behavior control for automatic driving. BEV (Bird's Eye View) perception models based on feature fusion of surround-view multi-camera images are the mainstream of current vehicle perception technology.
Currently, BEV perception models mainly convert forward data obtained by two kinds of sensors, cameras and lidars, into the BEV mode. However, conversion from either sensor alone is insufficient. A camera acquires two-dimensional images, so image feature distortion and deformation easily occur during view-angle conversion, and additional depth estimation is required, which increases both computation and error. The point cloud data acquired by a lidar is three-dimensional, but it is sparse, and reconstruction of the missing three-dimensional information must consider the data set, reference, algorithm and so on, which affects the final conversion effect.
For this reason, Chinese patent application (application number 202210114639.7) discloses a method for generating bird's-eye-view semantic segmentation labels based on multi-frame semantic point cloud stitching. After synchronizing the data collected by the cameras and the lidar at the same moment, joint labeling is performed on the original images collected by each camera and the point cloud images collected by the lidar at the same moment: road surface information is marked on each frame of image collected by the cameras, and 3D bounding boxes of moving objects are marked on the synchronized point cloud images collected by the lidar. The method includes projecting the labeled point cloud onto each camera plane, coloring the point cloud with the semantic information of the images to generate a semantic point cloud, stitching consecutive multi-frame semantic point clouds into a unified vehicle body coordinate system referenced to a certain frame, and projecting them onto a BEV canvas. The method obtains BEV labels directly from the semantic point cloud and multi-frame stitching, and fuses the data acquired separately by the cameras and the lidar.
However, the above method has the following drawbacks: 1. when converting camera or lidar data into the BEV mode, the problems of view-angle transformation and three-dimensional information reconstruction remain, so conversion accuracy and robustness are poor; 2. the sensing range of the camera and the lidar is limited, so the prior information available for automatic driving is limited and the perception effect is poor.
Disclosure of Invention
The invention aims to solve the above problems in the prior art and provides a BEV perception model, a construction method, a device, equipment, a vehicle and a storage medium. The technical problem to be solved by the invention is how to improve the perception effect of vehicle autonomous-driving technology on the environment around the vehicle.
The aim of the invention can be achieved by the following technical scheme: the BEV perception model comprises an image view encoder, which is used for extracting features from original image data input by an image perception module of a vehicle to output image multi-view features with semantic information, and further comprises:
the mapping module is used for mapping the image multi-view features from two dimensions to three dimensions and converting the obtained three-dimensional image multi-view features into a bird's eye view taking the vehicle as a center;
the BEV encoder is used for acquiring the remote sensing image sent by the instant remote sensing constellation, the bird's-eye view and a fusion image formed by fusing the remote sensing image and the bird's-eye view, fusing the three image data and extracting the bird's-eye view characteristics to form the BEV characteristics of the camera.
The image perception module of the vehicle detects and obtains original image data, and the instant remote sensing constellation detects and obtains a remote sensing image. First, the image view encoder receives the original image data and performs feature extraction on it to obtain image multi-view features with semantic information. Next, the mapping module maps the image multi-view features from two dimensions to three dimensions and obtains a vehicle-centered bird's-eye view based on a three-dimensional ego-vehicle coordinate system. Unifying the multiple views into the bird's-eye view makes it easier to identify obstacles or vehicles in cross traffic and facilitates the development and deployment of subsequent modules. Finally, the BEV encoder fuses, in BEV space, the effective features of the remote sensing image, the bird's-eye view and the fused image obtained by fusing the two, and extracts bird's-eye-view features to obtain the camera BEV features. The resulting camera BEV features present the objects and terrain in the environment from a top-down view, so tasks such as object detection and map segmentation can be performed, helping the vehicle better understand its surroundings during automatic driving and improving the accuracy of perception and decision making.
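As a minimal illustration of this camera branch, the following PyTorch-style sketch wires the three modules together; the module names, tensor shapes, layer choices and the simplified 2D-to-3D projection are assumptions made for illustration and do not reproduce the patented network.

```python
import torch
import torch.nn as nn

class CameraBEVBranch(nn.Module):
    """Sketch of the camera branch: image view encoder -> 2D-to-3D mapping -> BEV encoder.
    Layer sizes and the BEV grid resolution are illustrative assumptions."""
    def __init__(self, bev_channels=64, bev_size=128):
        super().__init__()
        self.image_view_encoder = nn.Sequential(          # stands in for 2D backbone + neck
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, bev_channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.bev_size = bev_size
        self.bev_encoder = nn.Sequential(                 # fuses projected camera features with the remote sensing image
            nn.Conv2d(bev_channels + 3, bev_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(bev_channels, bev_channels, 3, padding=1), nn.ReLU(),
        )

    def project_to_bev(self, feats):
        # Placeholder for the 2D->3D mapping module: a real system would use camera
        # intrinsics/extrinsics and depth to splat features onto the ego-centered BEV grid.
        return nn.functional.adaptive_avg_pool2d(feats, (self.bev_size, self.bev_size))

    def forward(self, multi_view_images, remote_sensing_bev):
        # multi_view_images: (B, N_cams, 3, H, W); remote_sensing_bev: (B, 3, bev_size, bev_size)
        b, n, c, h, w = multi_view_images.shape
        feats = self.image_view_encoder(multi_view_images.flatten(0, 1))
        bev = self.project_to_bev(feats).view(b, n, -1, self.bev_size, self.bev_size).mean(1)
        return self.bev_encoder(torch.cat([bev, remote_sensing_bev], dim=1))  # camera BEV features
```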
The instant remote sensing constellation is a remote sensing satellite constellation capable of revisiting any target in the world within minutes and acquiring remote sensing data of any region within hours. It carries various types of payloads such as visible light, SAR, thermal infrared, hyperspectral and low-light sensors, is generally combined with cutting-edge technologies such as on-board intelligent processing and inter-satellite link communication, has characteristics such as near real time, multi-payload, multi-spectral-band and high intelligence, and has great application value in areas such as daily life, disaster emergency response, ecological and environmental protection, transportation, national security and global change monitoring.
In this solution, the remote sensing image is fused with the bird's-eye view before the bird's-eye view enters the BEV encoder to obtain a fused image, which addresses the view-angle transformation problem; at the same time, the remote sensing image itself is fed into the BEV encoder and fused with the bird's-eye view and the fused image that also enter the BEV encoder, so more comprehensive and accurate data are obtained. Moreover, the remote sensing image has high frequency, high resolution and wide coverage, provides prior information beyond the sensing range of the image perception module, and enriches the information carried by the extracted features. In addition, fusing the features in BEV space not only reduces data loss but also reduces computational cost, which greatly improves the perception effect of vehicle autonomous-driving technology on the environment around the vehicle.
In the above BEV perception model, the remote sensing image and the bird's-eye view are fused by PCA transformation before entering the BEV encoder to form the fused image. PCA, i.e. principal component analysis, can transform the remote sensing image and an ordinary image into a unified spatial or frequency domain, fuse the transformed coefficients, and obtain the fused image through inverse transformation. PCA transformation reduces the data dimensionality while preserving the information of the original data to the greatest extent, which helps improve both the processing efficiency and the processing effect of the images.
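A minimal sketch of one possible PCA-based fusion of the bird's-eye view with a co-registered remote sensing image is shown below; it follows the classic PCA pan-sharpening recipe and is an assumption about the fusion details, not the patented procedure. It assumes both images are already aligned and the same size.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_fuse(bev_img: np.ndarray, rs_img: np.ndarray) -> np.ndarray:
    """Fuse a bird's-eye-view image with a remote sensing image via PCA.
    Both inputs are (H, W, C) arrays of the same shape; replacing the first
    principal component is an illustrative assumption."""
    h, w, c = bev_img.shape
    pixels = bev_img.reshape(-1, c).astype(np.float64)
    pca = PCA(n_components=c)
    components = pca.fit_transform(pixels)             # project BEV pixels onto principal axes

    # Replace the first (highest-variance) component with the remote sensing
    # intensity, rescaled to match the component's statistics.
    rs_gray = rs_img.astype(np.float64).mean(axis=2).reshape(-1)
    rs_gray = (rs_gray - rs_gray.mean()) / (rs_gray.std() + 1e-8)
    components[:, 0] = rs_gray * components[:, 0].std() + components[:, 0].mean()

    fused = pca.inverse_transform(components).reshape(h, w, c)
    return np.clip(fused, 0, 255).astype(bev_img.dtype)
```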
In the above BEV perception model, the remote sensing image, the bird's-eye view and the fused image in the BEV encoder are fused by superimposing their image data matrices. Fusing the three types of image information into one map by matrix superposition integrates the information of multiple data sources, reveals the correlations and mutual influence among them, and allows the data to be understood and analyzed more comprehensively, while also making subsequent modules, such as feature extraction after fusion, easier to develop and deploy, which improves both efficiency and accuracy.
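The matrix superposition itself can be as simple as the following sketch, which stacks the three aligned image arrays along the channel axis; whether the data are concatenated or added element-wise is an assumption, since the text only specifies superposition of the image data matrices.

```python
import numpy as np

def superimpose(rs_img: np.ndarray, bev_img: np.ndarray, fused_img: np.ndarray) -> np.ndarray:
    """Combine three aligned (H, W, C) images into one array for the BEV encoder.
    Channel-wise concatenation is used here; element-wise addition would be the
    other natural reading of 'matrix superposition'."""
    return np.concatenate([rs_img, bev_img, fused_img], axis=2)
```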
In the BEV perception model described above, the BEV perception model further includes:
The three-dimensional backbone network is used for extracting characteristics of original radar point cloud data input by the laser perception module of the vehicle to obtain radar BEV characteristics;
and the fusion module is used for acquiring radar BEV characteristics and camera BEV characteristics, and fusing the radar BEV characteristics and the camera BEV characteristics to obtain multi-mode BEV characteristics.
The laser perception module of the vehicle detects and obtains original radar point cloud data. First, the three-dimensional backbone network receives the original radar point cloud data and performs feature extraction on it to obtain radar BEV features. The three-dimensional backbone network can inherently learn three-dimensional features from nearly raw input without compressing the point cloud into multiple two-dimensional images. Then, the fusion module receives the radar BEV features and the camera BEV features and fuses them in the PACF manner to finally obtain multi-modal BEV features. Adding the laser perception module, with its high measurement accuracy, fast response and strong anti-interference capability, provides more comprehensive data for the model. BEV features simplify complex three-dimensional scenes into two-dimensional representations, which facilitates subsequent tasks, and multi-modal BEV features that integrate data detected by multiple sensors reflect the real conditions around the vehicle more comprehensively and accurately.
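A simplified stand-in for the fusion module is sketched below: it concatenates the radar and camera BEV feature maps and mixes them with a small convolution. The PACF fusion described in the second embodiment operates on three-dimensional points with attention, so this is only an illustrative assumption.

```python
import torch
import torch.nn as nn

class SimpleBEVFusion(nn.Module):
    """Illustrative fusion of radar and camera BEV features (not the PACF module itself)."""
    def __init__(self, radar_ch: int, cam_ch: int, out_ch: int):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(radar_ch + cam_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, radar_bev: torch.Tensor, cam_bev: torch.Tensor) -> torch.Tensor:
        # Both inputs are (B, C, H, W) maps on the same BEV grid.
        return self.mix(torch.cat([radar_bev, cam_bev], dim=1))  # multi-modal BEV features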
In the BEV perception model described above, the BEV perception model further includes: a three-dimensional target detection head, used for acquiring and outputting the multi-modal BEV features sent by the fusion module. The three-dimensional target detection head receives the multi-modal BEV features, performs calculation and prediction, and outputs a bird's-eye view containing the multi-modal fusion result. The output can be used to support automatic driving functions such as path planning, obstacle avoidance and traffic signal recognition.
In the above BEV perception model, the image view encoder includes a two-dimensional backbone network and a neck module, and the original image data input by the image perception module of the vehicle passes through the two-dimensional backbone network and the neck module and is then output as image multi-view features with semantic information. When passing through the two-dimensional backbone network, the original image data are converted into a series of feature maps of different scales and depths; the neck module then reduces the dimensionality of, or adjusts, the feature maps from the two-dimensional backbone network, and finally the image multi-view features with semantic information are output. The two-dimensional backbone network extracts feature information of the image data for subsequent processing and analysis, while the neck module enhances the importance and relevance of the feature maps to better capture the semantic relationships and context information between objects.
A BEV perception model construction method comprises the following steps:
A. training data preparation and loading: collecting training data, wherein the training data comprises data obtained by an image sensing module and instant remote sensing constellation detection, and cleaning and processing the collected data;
B. dividing the data set: dividing the data after data cleaning into a training set, a verification set and a test set which are mutually independent;
C. building a BEV network model: constructing a BEV perception model comprising an image view encoder, a mapping module, and a BEV encoder network structure;
D. training a model: substituting the training set into the BEV perception model to train, and carrying out model parameter adjustment after training;
E. model evaluation: substituting the data of the test set into the BEV perception model after model parameter adjustment to obtain a predicted value, measuring the difference between the predicted value and a true value by adopting a loss function, evaluating the current model by an evaluation index, judging whether the model is qualified or not according to the measurement of the loss function and the evaluation of the evaluation index, returning to the step D if the model is unqualified, and storing the model if the model is qualified;
F. model prediction: substituting the verification set data into the model with qualified evaluation to obtain an output result, evaluating the result by adopting an evaluation index, and completing the construction of the BEV perception model combined with instant remote sensing after the evaluation is completed.
In the method, after the training data are prepared and loaded, the cleaned data are divided into a training set, a validation set and a test set that are independent of each other. The training set is substituted into the BEV perception model for training, and model parameters are adjusted after training. The test set data are then substituted into the parameter-adjusted BEV perception model to obtain predicted values; a loss function measures the difference between the predicted values and the true values, and evaluation indexes assess the current model. Whether the model is qualified is judged from the loss measurement and the index evaluation: if it is unqualified the model is trained again, and if it is qualified the model is saved. Finally, the validation set data are substituted into the qualified model, the output is assessed with the evaluation indexes, and after this assessment the construction of the BEV perception model combined with instant remote sensing is completed.
Learning from a large amount of diversified data helps the model learn and recognize various patterns, and data cleaning improves data utilization efficiency and the model's ability to understand and process the data. The data in the training, validation and test sets come from the same data source, which avoids training and evaluation errors caused by differences between data sources. BEV perception models are mainly used to process image data viewed from a bird's-eye perspective and can provide a global view that helps identify and understand objects and structures in a scene. Through parameter adjustment and optimization, the model can better learn the patterns in the data and its prediction performance can be improved. Model evaluation reveals the performance of the model on unseen data and ensures good generalization ability. The remote sensing image provided by the instant remote sensing constellation is a natural BEV view; it adapts well to the BEV perception model while improving its perception effect, achieving a complementary effect.
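The six steps can be organized into a conventional training loop; the sketch below outlines one such loop with hypothetical dataset, model and metric objects (`BEVPerceptionModel`, `evaluate`, `compute_loss` and the threshold value are all illustrative assumptions, not names from the patent).

```python
import torch
from torch.utils.data import DataLoader, random_split

def build_and_train(dataset, epochs=50, diff_threshold=0.1):
    # B. split the cleaned data into independent training / validation / test sets
    n = len(dataset)
    train_set, val_set, test_set = random_split(dataset, [int(0.7 * n), int(0.15 * n), n - int(0.85 * n)])

    # C. build the BEV network model (image view encoder, mapping module, BEV encoder, ...)
    model = BEVPerceptionModel()                       # hypothetical model class
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    while True:
        # D. train the model and adjust parameters
        for _ in range(epochs):
            for batch in DataLoader(train_set, batch_size=4, shuffle=True):
                optimizer.zero_grad()
                loss = model.compute_loss(batch)       # total loss L = a*L_TSL + b*L_VMI + c*L_IG
                loss.backward()
                optimizer.step()

        # E. evaluate on the test set; retrain if the prediction error is too large
        diff, metrics_ok = evaluate(model, DataLoader(test_set, batch_size=4))
        if diff < diff_threshold and metrics_ok:
            torch.save(model.state_dict(), "bev_model.pt")
            break

    # F. model prediction on the validation set
    return evaluate(model, DataLoader(val_set, batch_size=4))
```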
In the above BEV perception model construction method, the training data in step A further include data detected by the laser perception module, and in step C the BEV perception model further includes a three-dimensional backbone network, a fusion module and a three-dimensional target detection head. The point cloud images detected by the laser perception module are passed through the three-dimensional backbone network for feature extraction to obtain radar BEV features, which can be fused in the fusion module with the camera BEV features output by the BEV encoder, and the multi-modal result is output through the three-dimensional target detection head. The laser perception module has high measurement accuracy, fast response and strong anti-interference capability, which helps provide more comprehensive data for model training.
In the above BEV perception model construction method, in step E, when the difference between the predicted value and the true value is greater than a preset difference threshold, it is determined that iterative training is required and step D is entered; when the difference is smaller than the preset difference threshold, the model is assessed with the evaluation indexes. If the assessment fails, it is determined that iterative training is required and step D is entered; if the assessment passes, the model is saved and step F is entered. In model evaluation, the difference threshold serves as the criterion for judging whether the difference between the true values and the predicted values of the trained model meets the requirement: if not, the procedure restarts from model training; otherwise, the model is saved and model prediction is performed. Presetting the difference threshold improves the efficiency of judging whether a model is qualified, unifies the judgment criterion, and facilitates standardized management.
In the above BEV perception model construction method, the total loss function is $L = \alpha L_{TSL} + \beta L_{VMI} + \gamma L_{IG}$, where $\alpha$, $\beta$ and $\gamma$ are three hyper-parameters that respectively represent the weights of $L_{TSL}$, $L_{VMI}$ and $L_{IG}$; $L_{TSL}$ is the BEV view-angle alignment loss function, $L_{VMI}$ is the BEV feature mutual information loss function, and $L_{IG}$ is the BEV prior information gain loss function.

$L_{TSL}$ is computed as $L_{TSL} = \frac{1}{N}\sum_{i=1}^{N}\left(1 - S(y_i, \hat{y}_i)\right)$, where $N$ is the total number of pixels in the image, $y_i$ is the value of the $i$-th pixel in the BEV-view image obtained after the sensor data are converted into the BEV mode, $\hat{y}_i$ is the value of the $i$-th pixel in the BEV-view image acquired by the instant remote sensing constellation, and $S(y_i, \hat{y}_i)$ is the similarity between the two pixels under different transformations, computed as $S(y_i, \hat{y}_i) = \frac{1}{M}\sum_{j=1}^{M} \mathrm{SSIM}\left(T_j(y_i), T_j(\hat{y}_i)\right)$, where $M$ is the total number of transformations, $T_j$ is the $j$-th transformation function, and $\mathrm{SSIM}(T_j(y_i), T_j(\hat{y}_i))$ is the structural similarity between the two transformed pixels, given by $\mathrm{SSIM}\left(T_j(y_i), T_j(\hat{y}_i)\right) = \frac{(2\mu_y\mu_{\hat{y}} + c_1)(2\sigma_{y\hat{y}} + c_2)}{(\mu_y^2 + \mu_{\hat{y}}^2 + c_1)(\sigma_y^2 + \sigma_{\hat{y}}^2 + c_2)}$, where $\mu_y$ and $\mu_{\hat{y}}$ are the means of $T_j(y_i)$ and $T_j(\hat{y}_i)$ within a small window, $\sigma_y^2$ and $\sigma_{\hat{y}}^2$ are their variances within the window, $\sigma_{y\hat{y}}$ is their covariance within the window, and $c_1$ and $c_2$ are two constants.

$L_{VMI}$ is computed as $L_{VMI} = \sum_{x,y} q(x,y)\log\frac{q(x,y)}{q(x)\,q(y)}$, where $x$ and $y$ respectively denote the BEV features obtained by using a sensor alone and the BEV features obtained by fusing the BEV view acquired by the instant remote sensing constellation with the forward data acquired by the sensor, $q(x,y)$ is the probability distribution of the two features occurring simultaneously, and $q(x)$ and $q(y)$ are the probability distributions of each feature occurring on its own.

$L_{IG}$ is computed as $L_{IG} = \sum_{i=1}^{C} y_i \log\frac{\hat{y}_i}{\tilde{y}_i}$, where $C$ is the total number of categories, $y_i$ is the probability of the $i$-th category in the true annotation, $\hat{y}_i$ is the probability of the $i$-th category predicted by the model, and $\tilde{y}_i$ is the probability of the $i$-th category when no prior information is used.
The BEV view-angle alignment loss function measures the degree of alignment between the BEV-view image acquired by the instant remote sensing constellation and the BEV-view image obtained after the conventional sensor data are converted into the BEV mode. The feature mutual information loss function measures the amount of mutual information between the BEV features obtained after the BEV view acquired by the instant remote sensing constellation is fused with the forward data acquired by the conventional sensors and the BEV features obtained from the conventional sensors alone. The prior information gain loss function measures the gain that the prior information provided by the instant remote sensing constellation brings to the effect of converting sensor data into the BEV mode. The three hyper-parameters alpha, beta and gamma represent the weights of the BEV view-angle alignment loss function, the feature mutual information loss function and the prior information gain loss function respectively, and the final loss function is the weighted sum of the three. By adjusting the weights, different aspects of the model's performance can be considered and balanced to obtain an optimal trade-off in the final loss. Assigning a separate hyper-parameter to each loss function makes the model easier to optimize and manage.
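As a compact illustration, the sketch below assembles the total loss from three already-computed terms and shows one possible reading of the information gain term; the weight values and the epsilon smoothing are illustrative assumptions, not tuned hyper-parameters from the patent.

```python
import torch

def total_loss(l_tsl: torch.Tensor, l_vmi: torch.Tensor, l_ig: torch.Tensor,
               alpha: float = 1.0, beta: float = 0.5, gamma: float = 0.5) -> torch.Tensor:
    """Weighted combination L = alpha*L_TSL + beta*L_VMI + gamma*L_IG."""
    return alpha * l_tsl + beta * l_vmi + gamma * l_ig

def information_gain_loss(y_true: torch.Tensor, y_pred: torch.Tensor, y_noprior: torch.Tensor,
                          eps: float = 1e-8) -> torch.Tensor:
    """One reading of L_IG: sum_i y_i * log(yhat_i / ytilde_i) over category probabilities."""
    return (y_true * torch.log((y_pred + eps) / (y_noprior + eps))).sum()
```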
The device of the BEV perception model comprises an image perception module for inputting original image data and a laser perception module for inputting original radar point cloud data, and is characterized by further comprising a position module for sending the position signal of the vehicle to the instant remote sensing constellation, and a vehicle-mounted computer provided with the BEV perception model. The image perception module and the laser perception module are connected to the vehicle-mounted computer, and the vehicle-mounted computer is wirelessly connected to the instant remote sensing constellation to receive the remote sensing images it sends, so that the BEV perception model can operate.
While the vehicle is running, the image perception module detects the environment around the running vehicle through the vehicle-mounted cameras and sends the collected original image data to the vehicle-mounted computer. The laser perception module detects the environment around the running vehicle through the vehicle-mounted lidar and sends the collected original radar point cloud data to the vehicle-mounted computer.

Meanwhile, the position module sends the real-time longitude and latitude information of the running vehicle to the instant remote sensing constellation; the instant remote sensing constellation detects the environment around the running vehicle from the air with its satellite group and wirelessly sends the collected remote sensing image data to the vehicle-mounted computer.

The vehicle-mounted computer receives the original image data, the original radar point cloud data and the remote sensing image data, and performs perception work through the configured BEV perception model.
An apparatus of the BEV perception model, comprising one or more processors and storage means for storing one or more programs which, when executed by the one or more processors, cause the apparatus to implement the operation of any of the BEV perception models described above.
The device may also be connected to one or more external devices, such as a keyboard, router, or consumer electronic mobile device, etc.
A non-transitory computer-readable storage medium storing computer-executable instructions that, when executed by a processor of a device, enable the BEV perception model operation of any one of the above.
The non-transitory computer-readable storage medium stores a program product of the method described herein. The readable storage medium may be, but is not limited to, an electronic, magnetic, optical or infrared medium, or any combination of the above, and may specifically be embodied as a flash memory card, a read-only memory, a hard disk, or the like.
A vehicle, characterized in that it is provided with the BEV perception model described above.
The vehicle fully combines and utilizes the BEV perception model, the construction method, the device, the equipment and the storage medium to realize the perception of automatic driving.
Compared with the prior art, the invention has the following advantages:
1. The invention uses the instant remote sensing constellation to obtain a top-down view of the ground, i.e. a natural BEV view angle, which solves the problems of view-angle transformation and three-dimensional information reconstruction when camera and lidar data are converted into the BEV mode.
2. The invention fuses the BEV view acquired by the instant remote sensing constellation with the forward data acquired by a conventional camera or lidar, thereby improving the accuracy and robustness of converting sensor data into the BEV mode.
3. The images from the instant remote sensing constellation have high frequency, high resolution and wide coverage, provide prior information beyond the sensing range of the image perception module, and help enrich the information carried by the extracted bird's-eye-view features.
Drawings
FIG. 1 is a schematic diagram of a BEV model network architecture in accordance with a first embodiment of the present invention.
FIG. 2 is a schematic diagram of the operation of a BEV model in accordance with a first embodiment of the present invention.
FIG. 3 is a diagram of a BEV model network structure in accordance with a second embodiment of the present invention.
FIG. 4 is a schematic diagram of the operation of the BEV model of the second embodiment of the present invention.
FIG. 5 is a schematic diagram of the BEV model principle in accordance with the second embodiment of the present invention.
FIG. 6 is a flow chart of a BEV model building method in accordance with the third embodiment.
In the figure, 1, an image view encoder; 1a, a two-dimensional backbone network; 2a, neck module; 2. a mapping module; 3. BEV encoder; 4. a three-dimensional backbone network; 5. a fusion module; 6. three-dimensional target detection head.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings. The following description of at least one exemplary embodiment is merely exemplary in nature and is in no way intended to limit the invention, its application, or uses. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Embodiment one:
as shown in fig. 1 and 2, a BEV perception model includes an Image View Encoder 1 for feature extraction of the original image data input from the image perception module of the vehicle, outputting image multi-view features with semantic information. A BEV perception model is a technique for generating a bird's-eye view from the data of different sensors. A bird's-eye view is a perspective looking down at a scene from above, showing the layout and features of objects and space. The basic idea of the BEV perception model is to transform sensor data into a unified BEV representation space and then perform tasks such as feature extraction and object detection in this space. The image perception module in this embodiment comprises vehicle-mounted multi-view cameras, and the image view encoder comprises a two-dimensional backbone network 1a (2D Backbone) and a neck module 2a (Neck).
When the BEV model is actually applied, the image perception module of the vehicle detects the original input images, and after the instant remote sensing constellation receives the longitude and latitude position information sent by the position module of the vehicle, it locates the vehicle and detects a remote sensing image at the corresponding position.
The image view encoder 1 receives the original image data, first performs feature extraction on it through the two-dimensional backbone network 1a, converting it into feature maps of different scales and depths, and sends the feature maps to the neck module 2a; the neck module 2a then reduces the dimensionality of, or adjusts, the feature maps, and finally the image multi-view features with semantic information are output. The neck module 2a includes an FPN (Feature Pyramid Network), a method for fusing features of different levels, and ADP (Adaptive Feature Pooling), a module that adds adaptive feature pooling on top of the FPN. The two-dimensional backbone network 1a extracts feature information of the image data for subsequent processing and analysis, while the neck module 2a enhances the importance and relevance of the feature maps to better capture the semantic relationships and context information between objects.
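For reference, the sketch below builds a neck of this kind with torchvision's off-the-shelf feature pyramid network; the channel counts and the simple pooling step standing in for adaptive feature pooling are illustrative assumptions rather than the patented configuration.

```python
import torch
from torchvision.ops import FeaturePyramidNetwork

# Multi-scale feature maps as produced by a 2D backbone (channel counts assumed).
feats = {"c2": torch.randn(1, 256, 64, 64),
         "c3": torch.randn(1, 512, 32, 32),
         "c4": torch.randn(1, 1024, 16, 16)}

fpn = FeaturePyramidNetwork(in_channels_list=[256, 512, 1024], out_channels=256)
pyramid = fpn(feats)                                   # same keys, 256-channel maps per level

# A simple stand-in for adaptive feature pooling: bring every level to one grid and sum.
pooled = sum(torch.nn.functional.adaptive_avg_pool2d(p, (32, 32)) for p in pyramid.values())
print(pooled.shape)  # torch.Size([1, 256, 32, 32])
```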
Before entering the image view encoder 1, preprocessing operations such as clipping, scaling, normalizing and the like are performed on the original image data, noise and irrelevant information are removed, and the interested area around the automobile is reserved.
The BEV perception model further includes:
and a mapping module 2 (2D- > 3D Projector) for mapping the image multi-view feature from two dimensions to three dimensions and converting the obtained three-dimensional image multi-view feature into a bird's-eye view centered on the vehicle. The mapping module 2 maps the image multi-view feature with semantic information from two dimensions to three dimensions and obtains a vehicle-centric bird's eye view based on a three-dimensional autopilot car coordinate system (3 Dego-car coordinate). The multiple visual angles are unified to the aerial view, so that the obstacle or the vehicles in cross traffic can be identified, and the development and deployment of subsequent modules are facilitated.
And the BEV encoder 3 (BEV encoder) is used for acquiring the remote sensing image sent by the instant remote sensing constellation, the bird's-eye view and a fused image formed by fusing the remote sensing image and the bird's-eye view, fusing the three image data and extracting the bird's-eye view characteristic to form the BEV characteristic of the camera.
Before entering the BEV encoder 3, the remote sensing image and the bird's-eye view are transformed into a unified spatial or frequency domain by PCA (Principal Component Analysis), the transformed coefficients are fused, and the fused image is obtained through inverse transformation. PCA transformation reduces the data dimensionality while preserving the information of the original data to the greatest extent, which helps improve both the processing efficiency and the processing effect of the images.
Alternatively, the fusion of the remote sensing image and the bird's-eye view may also use IHS (Intensity, Hue, Saturation), i.e. a color coordinate system, or DWT (Discrete Wavelet Transformation), i.e. the discrete wavelet transform.
The BEV encoder 3 (BEV Encoder) is a module for extracting bird's-eye-view features; it can convert the frustum features of multiple cameras into a unified BEV view, thereby enabling 3D object detection. The remote sensing image, the bird's-eye view and the fused image in the BEV encoder 3 are fused by superimposing their image data matrices. The BEV encoder 3 fuses, in BEV space, the effective features of the three types of images, namely the remote sensing image, the bird's-eye view and the fused image, and extracts bird's-eye-view features to obtain the camera BEV features. Fusing the three types of image information into one map by matrix superposition integrates the information of multiple data sources, reveals the correlations and mutual influence among them, allows the data to be understood and analyzed more comprehensively, and makes subsequent modules, such as feature extraction after fusion, easier to develop and deploy, which improves both efficiency and accuracy. The resulting camera BEV features present the objects and terrain in the environment from a top-down view, so tasks such as object detection and map segmentation can be performed, helping the vehicle better understand its surroundings during automatic driving and improving the accuracy of perception and decision making.
The instant remote sensing constellation is a remote sensing satellite constellation capable of revisiting any target in the world within minutes and acquiring remote sensing data of any region within hours. It carries various types of payloads such as visible light, SAR, thermal infrared, hyperspectral and low-light sensors, is generally combined with cutting-edge technologies such as on-board intelligent processing and inter-satellite link communication, has characteristics such as near real time, multi-payload, multi-spectral-band and high intelligence, and has great application value in areas such as daily life, disaster emergency response, ecological and environmental protection, transportation, national security and global change monitoring.
In this solution, the remote sensing image is fused with the bird's-eye view before the bird's-eye view enters the BEV encoder 3 to obtain a fused image, which addresses the view-angle transformation problem; at the same time, the remote sensing image itself is fed into the BEV encoder 3 and fused with the bird's-eye view and the fused image that also enter the BEV encoder 3, so more comprehensive and accurate data are obtained. Moreover, the remote sensing image has high frequency, high resolution and wide coverage, provides prior information beyond the sensing range of the image perception module, and enriches the information carried by the extracted features. In addition, fusing the features in BEV space not only reduces data loss but also reduces computational cost, which greatly improves the perception effect of vehicle autonomous-driving technology on the environment around the vehicle.
Embodiment two:
as shown in fig. 3, 4 and 5, the technical solution in this embodiment is basically the same as that in the first embodiment, except that, based on the BEV perception model in the first embodiment, the method further includes:
and the three-dimensional backbone network 4 is used for extracting the characteristics of the original radar point cloud data input by the laser perception module of the vehicle to obtain radar BEV characteristics. The laser sensing module in this embodiment includes a lidar.
When the BEV model is actually applied, the laser perception module of the vehicle detects and obtains original radar point cloud data. The three-dimensional backbone network 4 (3D Backbone) receives the original radar point cloud data and performs feature extraction on it to obtain radar BEV features. The three-dimensional backbone network 4 does not need to compress the point cloud into multiple two-dimensional images and can inherently learn three-dimensional features from nearly raw input. Adding the laser perception module, with its high measurement accuracy, fast response and strong anti-interference capability, provides more comprehensive data for the model.
And a Fusion Module 5 for acquiring the radar BEV features and the camera BEV features and fusing the two to obtain multi-modal BEV features. The fusion module 5 receives the radar BEV features and the camera BEV features and fuses them directly on three-dimensional points in the PACF (Point-based Attentive Cont-conv Fusion) manner, enhancing the expressive power of the fused features with continuous convolution, pooling and attention aggregation, and finally obtains the multi-modal BEV features. BEV features simplify complex three-dimensional scenes into two-dimensional representations, which facilitates subsequent tasks, and multi-modal BEV features that integrate data detected by multiple sensors reflect the real conditions around the vehicle more comprehensively and accurately.
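A toy rendering of this idea is sketched below: for each lidar point, a few neighbouring camera features are gathered and combined with learned attention weights. It is a loose illustration of point-based attentive fusion, not the PACF module itself; all shapes, the neighbour count and the omitted camera-feature sampling step are assumptions.

```python
import torch
import torch.nn as nn

class PointAttentiveFusion(nn.Module):
    """Toy point-wise attentive fusion of per-point lidar features with sampled camera features."""
    def __init__(self, lidar_ch: int, cam_ch: int, out_ch: int, k: int = 4):
        super().__init__()
        self.k = k
        self.score = nn.Linear(lidar_ch + cam_ch, 1)      # attention score per neighbour
        self.out = nn.Linear(lidar_ch + cam_ch, out_ch)

    def forward(self, lidar_feats: torch.Tensor, cam_feats: torch.Tensor) -> torch.Tensor:
        # lidar_feats: (P, lidar_ch); cam_feats: (P, k, cam_ch) camera features sampled
        # around each point's image projection (the sampling itself is omitted here).
        expanded = lidar_feats.unsqueeze(1).expand(-1, self.k, -1)
        pairs = torch.cat([expanded, cam_feats], dim=-1)            # (P, k, lidar_ch + cam_ch)
        attn = torch.softmax(self.score(pairs), dim=1)              # weights over the k neighbours
        fused = (attn * pairs).sum(dim=1)                           # attention-weighted aggregation
        return self.out(fused)                                      # (P, out_ch) fused point features
```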
Alternatively, the fusion of the radar BEV features and the camera BEV features may also use DFM (Dynamic Fusion Module), i.e. dynamic fusion, or FusionPainting, i.e. fusion rendering.
The BEV perception model further includes: a three-dimensional object detection head 6 (3D Object Detection Head) for acquiring and outputting the multi-modal BEV features sent by the fusion module 5. The three-dimensional object detection head 6 receives the multi-modal BEV features and calculates a predicted output of a bird's eye view comprising the multi-modal fusion result. The output result can be used for supporting functions of an automatic driving system, such as path planning, obstacle avoidance, traffic signal recognition and the like.
The laser sensing module with high measurement precision, high response speed and strong anti-interference capability is added, so that more comprehensive data can be provided for a model, and powerful guarantee is provided for the safety of automatic driving.
Embodiment III: in this embodiment, a method for constructing the BEV perception model in the first embodiment and the second embodiment is given. As shown in fig. 6, a BEV perception model construction method includes the following steps:
A. Training data preparation and loading: collect training data, which includes a large amount of diversified data obtained by the image perception module and by instant remote sensing constellation detection, and clean and process the collected data. In this embodiment the amount of data is not strictly limited, but it is generally on the order of thousands of samples or more.
A large and diversified set of camera images detected by the image perception module and BEV-view remote sensing images detected by the instant remote sensing constellation is collected; the collected image data are cleaned and processed, noise and irrelevant information are removed, and the regions of interest are retained. Learning from a large amount of diversified data helps the model learn and recognize various patterns, and data cleaning improves data utilization efficiency and the model's ability to understand and process the data.
In step A, the image perception module includes vehicle-mounted multi-view cameras. After receiving the longitude and latitude position information sent by the vehicle's position module, the instant remote sensing constellation locates the vehicle and detects a remote sensing image from the natural BEV view angle. The remote sensing image has high frequency, high resolution and wide coverage, provides prior information beyond the sensing range of the image perception module, and enriches the information carried by the extracted features.
In this embodiment, the training data further includes data detected by a laser sensing module, where the laser sensing module includes a laser radar, and the laser sensing module has high measurement accuracy, fast response speed, and strong anti-interference capability, and is favorable to providing more comprehensive data for model training.
B. Dividing the data set: the cleaned data are divided into a training set, a validation set and a test set that are independent of each other. The cleaned data are checked for dirty data; if any is found, data cleaning is performed again, otherwise the data are divided into mutually independent training, validation and test sets based on cross-validation or the hold-out method. The data in the training, validation and test sets come from the same data source, which avoids training and evaluation errors caused by differences between data sources.
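A hold-out split of the cleaned samples can be done as below; the dirty-data check, the 70/15/15 ratio and the `samples`/`labels` names are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def clean_and_split(samples: np.ndarray, labels: np.ndarray, seed: int = 42):
    """Drop samples with missing values, then hold-out split 70/15/15 (ratios assumed)."""
    keep = ~np.isnan(samples.reshape(len(samples), -1)).any(axis=1)   # simple dirty-data check
    samples, labels = samples[keep], labels[keep]
    x_train, x_rest, y_train, y_rest = train_test_split(samples, labels, test_size=0.30, random_state=seed)
    x_val, x_test, y_val, y_test = train_test_split(x_rest, y_rest, test_size=0.50, random_state=seed)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)
```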
C. Building a BEV network model: a BEV perception model is constructed comprising the image view encoder 1, the mapping module 2 and the network structure of the BEV encoder 3. BEV perception models are a technique for generating a bird's eye view from data of different sensors. A bird's eye view is a perspective from above looking at a scene, showing the layout and features of objects and space. The basic idea of the BEV perception model is to transform sensor data into a unified BEV representation space, and then perform tasks such as feature extraction and object detection on this space. BEV perception models are primarily used to process image data viewed from a bird's eye view, and can provide a global perspective that helps identify and understand objects and structures in a scene.
In this embodiment, in step C, the BEV perception model further includes a three-dimensional backbone network 4, a fusion module 5, and a three-dimensional target detection head 6. The point cloud image detected by the laser perception module is subjected to feature extraction through the three-dimensional backbone network 4 to obtain radar BEV features, the radar BEV features can be fused with the camera BEV features output by the BEV encoder 3 in the fusion module 5, and the multi-mode BEV features are output through the three-dimensional target detection head 6.
D. Training a model: substituting the training set into the BEV perception model to train, and carrying out model parameter adjustment after training. Through parameter adjustment and optimization, the model can learn the mode in the data better, and the prediction performance of the model can be improved.
E. Model evaluation: substituting the data of the test set into the BEV perception model after model parameter adjustment to obtain a predicted value, measuring the difference between the predicted value and a true value by adopting a loss function, evaluating the current model by an evaluation index, judging whether the model is qualified or not according to the measurement of the loss function and the evaluation of the evaluation index, returning to the step D if the model is unqualified, and storing the model if the model is qualified; model evaluation is beneficial to show the performance of the model on unseen data to ensure that the model has good generalization ability. In this embodiment, the obtaining of the true value is the prior art, and will not be described in detail herein.
In this embodiment, when the difference between the predicted value and the true value is greater than a preset difference threshold, it is determined that iterative training is required and step D is entered; when the difference is smaller than the preset difference threshold, the model is assessed with the evaluation indexes. If the assessment fails, it is determined that iterative training is required and step D is entered; after the assessment passes, the model is saved and step F is entered. The setting of the difference threshold follows known practice in the art and is not described in detail here. An evaluation index is an index used to evaluate the performance of a deep learning model on a given task; it is usually related to the objective or application scenario of interest and can help select the most suitable model or compare the performance of different models. The setting of evaluation indexes is prior art and is not described in detail here.
In model evaluation, a difference threshold is used as a standard, whether the difference between a true value and a predicted value obtained by a model which is completed with training meets the requirement is judged, if not, the next step is sequentially executed from the beginning of model training, otherwise, the model is saved, and the next step of model prediction is executed. The preset of the difference threshold improves the efficiency of judging whether the model is qualified or not, unifies the judgment standard, and is beneficial to standardized management.
In this embodiment, the total loss function is $L = \alpha L_{TSL} + \beta L_{VMI} + \gamma L_{IG}$, where $\alpha$, $\beta$ and $\gamma$ are three hyper-parameters that respectively represent the weights of $L_{TSL}$, $L_{VMI}$ and $L_{IG}$; $L_{TSL}$ is the BEV view-angle alignment loss function, $L_{VMI}$ is the BEV feature mutual information loss function, and $L_{IG}$ is the BEV prior information gain loss function.

$L_{TSL}$ is computed as $L_{TSL} = \frac{1}{N}\sum_{i=1}^{N}\left(1 - S(y_i, \hat{y}_i)\right)$, where $N$ is the total number of pixels in the image, $y_i$ is the value of the $i$-th pixel in the BEV-view image obtained after the sensor data are converted into the BEV mode, $\hat{y}_i$ is the value of the $i$-th pixel in the BEV-view image acquired by the instant remote sensing constellation, and $S(y_i, \hat{y}_i)$ is the similarity between the two pixels under different transformations, computed as $S(y_i, \hat{y}_i) = \frac{1}{M}\sum_{j=1}^{M} \mathrm{SSIM}\left(T_j(y_i), T_j(\hat{y}_i)\right)$, where $M$ is the total number of transformations, $T_j$ is the $j$-th transformation function, and $\mathrm{SSIM}(T_j(y_i), T_j(\hat{y}_i))$ is the structural similarity between the two transformed pixels, given by $\mathrm{SSIM}\left(T_j(y_i), T_j(\hat{y}_i)\right) = \frac{(2\mu_y\mu_{\hat{y}} + c_1)(2\sigma_{y\hat{y}} + c_2)}{(\mu_y^2 + \mu_{\hat{y}}^2 + c_1)(\sigma_y^2 + \sigma_{\hat{y}}^2 + c_2)}$, where $\mu_y$ and $\mu_{\hat{y}}$ are the means of $T_j(y_i)$ and $T_j(\hat{y}_i)$ within a small window, $\sigma_y^2$ and $\sigma_{\hat{y}}^2$ are their variances within the window, $\sigma_{y\hat{y}}$ is their covariance within the window, and $c_1$ and $c_2$ are two constants.

$L_{VMI}$ is computed as $L_{VMI} = \sum_{x,y} q(x,y)\log\frac{q(x,y)}{q(x)\,q(y)}$, where $x$ and $y$ respectively denote the BEV features obtained by using a sensor alone and the BEV features obtained by fusing the BEV view acquired by the instant remote sensing constellation with the forward data acquired by the sensor, $q(x,y)$ is the probability distribution of the two features occurring simultaneously, and $q(x)$ and $q(y)$ are the probability distributions of each feature occurring on its own.

$L_{IG}$ is computed as $L_{IG} = \sum_{i=1}^{C} y_i \log\frac{\hat{y}_i}{\tilde{y}_i}$, where $C$ is the total number of categories, $y_i$ is the probability of the $i$-th category in the true annotation, $\hat{y}_i$ is the probability of the $i$-th category predicted by the model, and $\tilde{y}_i$ is the probability of the $i$-th category when no prior information is used.
The BEV view-angle alignment loss function uses TSL (Transformation Sensitivity Loss), which calculates the similarity between two images under different transformations. It measures the degree of alignment between the BEV-view image acquired by the instant remote sensing constellation and the BEV-view image obtained after the conventional sensor data are converted into the BEV mode. SSIM is an index for measuring image quality that reflects the similarity of two images in brightness, contrast and structure: $\mu_y$ and $\mu_{\hat{y}}$ capture brightness, $\sigma_y$ and $\sigma_{\hat{y}}$ capture contrast, $\sigma_{y\hat{y}}$ captures structure, and $c_1$ and $c_2$ prevent the denominator from being zero. The closer the SSIM value is to 1, the more similar the two transformed pixels are. The smaller the BEV view-angle alignment loss function, the more consistent the two images.
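The similarity term can be evaluated with an off-the-shelf SSIM, as in the sketch below; averaging over a small set of transforms (here, flips and a 90° rotation) is an assumption about how the transformation set might be chosen.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def tsl_loss(bev_from_sensor: np.ndarray, bev_from_satellite: np.ndarray) -> float:
    """Transformation-sensitivity alignment loss between two grayscale BEV images."""
    transforms = [lambda a: a, np.fliplr, np.flipud, lambda a: np.rot90(a, 1)]
    sims = [ssim(t(bev_from_sensor), t(bev_from_satellite),
                 data_range=bev_from_satellite.max() - bev_from_satellite.min())
            for t in transforms]
    return 1.0 - float(np.mean(sims))   # smaller loss -> better alignment
```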
The feature mutual information loss function uses VMI (Variational Mutual Information), which estimates the amount of information shared between two features. It measures the mutual information between the BEV features obtained by fusing the BEV view acquired by the instant remote sensing constellation with the forward data acquired by the conventional sensor, and the BEV features obtained with the conventional sensor alone. The larger the feature mutual information loss function, the more complementary the two features are.
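A minimal sketch of a mutual-information estimate between the two BEV feature sets; a simple histogram approximation stands in for the variational estimator of the patent, and the function name and bin count are illustrative assumptions.

```python
import numpy as np

def mutual_information(x: np.ndarray, y: np.ndarray, bins: int = 32) -> float:
    """Histogram estimate of I(X; Y) between two flattened feature maps."""
    joint, _, _ = np.histogram2d(x.ravel(), y.ravel(), bins=bins)
    q_xy = joint / joint.sum()
    q_x = q_xy.sum(axis=1, keepdims=True)   # marginal q(x), shape (bins, 1)
    q_y = q_xy.sum(axis=0, keepdims=True)   # marginal q(y), shape (1, bins)
    prod = q_x @ q_y                        # outer product q(x)q(y)
    nz = q_xy > 0                           # avoid log(0)
    return float(np.sum(q_xy[nz] * np.log(q_xy[nz] / prod[nz])))
```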
The prior information gain loss function uses IG (Information Gain). It measures how much the prior information provided by the instant remote sensing constellation improves the conversion of the sensor data into BEV mode. The larger the prior information gain loss function, the more accurate the model prediction.
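A minimal sketch of the information-gain term, following the reconstructed formula above (gain in log-likelihood of the true classes with versus without the satellite prior); the argument names and the epsilon guard are illustrative assumptions.

```python
import numpy as np

def information_gain(y_true: np.ndarray, p_with_prior: np.ndarray,
                     p_without_prior: np.ndarray, eps: float = 1e-12) -> float:
    """Gain in log-likelihood of the true classes when the satellite prior is used.

    All three arguments are per-class probability vectors of length C; eps guards log(0).
    """
    return float(np.sum(y_true * (np.log(p_with_prior + eps) -
                                  np.log(p_without_prior + eps))))
```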
The three hyperparameters $\alpha$, $\beta$ and $\gamma$ respectively weight the BEV view-angle alignment loss function, the feature mutual information loss function and the prior information gain loss function, and the weighted sum of the three gives the final loss function. By adjusting the weights, different aspects of the model's performance can be balanced to reach an optimal trade-off in the final loss. Assigning a separate hyperparameter to each loss term also makes the model easier to optimize and manage.
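A minimal sketch of the weighted combination; the default weight values below are placeholders, since the patent does not give concrete values for α, β and γ.

```python
def total_loss(l_tsl: float, l_vmi: float, l_ig: float,
               alpha: float = 1.0, beta: float = 0.1, gamma: float = 0.1) -> float:
    """Weighted sum L = alpha*L_TSL + beta*L_VMI + gamma*L_IG (weights are placeholders)."""
    return alpha * l_tsl + beta * l_vmi + gamma * l_ig
```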
F. Model prediction: the validation set data are fed into the model that passed evaluation to obtain output results, and the results are assessed with an evaluation index; once this assessment is completed, construction of the BEV perception model combined with instant remote sensing is finished. Because the remote sensing image provided by the instant remote sensing constellation is a natural BEV view, it adapts well to the BEV perception model and improves its perception effect, so the two are complementary.
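The patent does not name the evaluation index used in this step; as a purely illustrative sketch under that assumption, step F could be prototyped as follows, with simple per-frame accuracy standing in for the index and `model` assumed to map a frame's inputs to predicted labels.

```python
import numpy as np

def model_prediction_step(model, validation_set) -> float:
    """Run the qualified model on the validation set and report an evaluation index."""
    scores = []
    for inputs, true_labels in validation_set:
        pred_labels = model(inputs)
        scores.append(float(np.mean(np.asarray(pred_labels) == np.asarray(true_labels))))
    return float(np.mean(scores))
```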
According to this embodiment, adding the remote sensing image during construction makes the conversion of sensor data into BEV mode more accurate and more robust, provides a larger perception range in practical applications, and improves the safety and reliability of automatic driving.
Embodiment four:
this embodiment provides a device for realizing the operation of the BEV perception model of the first and second embodiments. The device comprises an image perception module for inputting original image data, a laser perception module for inputting original radar point cloud data, a position module for transmitting the position signal of the vehicle to the instant remote sensing constellation, and a vehicle-mounted computer on which the BEV perception model of the first and second embodiments is installed. The image perception module and the laser perception module are connected to the vehicle-mounted computer, and the vehicle-mounted computer is wirelessly connected to the instant remote sensing constellation to receive the remote sensing images it transmits, so that the BEV perception model of the first and second embodiments can operate through these connections.
While the vehicle is running, the image perception module detects the environment around the vehicle through the vehicle-mounted camera and sends the collected original image data to the vehicle-mounted computer. The laser perception module detects the environment around the vehicle through the vehicle-mounted laser radar and sends the collected original radar point cloud data to the vehicle-mounted computer.
Meanwhile, the position module sends the real-time longitude and latitude of the running vehicle to the instant remote sensing constellation; the constellation detects the environment around the running vehicle from the air using its satellite group and wirelessly sends the collected remote sensing image data to the vehicle-mounted computer.
The vehicle-mounted computer receives the original image data, the original radar point cloud data and the remote sensing image data, and then controls the operation of the BEV perception model.
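As a purely illustrative sketch of the data flow described above, the vehicle-mounted computer could gather the three input streams as shown below; the class and function names are assumptions for illustration, not part of the patent.

```python
from dataclasses import dataclass
from typing import Callable, List
import numpy as np

@dataclass
class FrameInputs:
    """One synchronized set of inputs for the BEV perception model (illustrative names)."""
    camera_images: List[np.ndarray]  # original image data from the image perception module
    lidar_points: np.ndarray         # original radar point cloud from the laser perception module
    satellite_bev: np.ndarray        # remote sensing image from the instant remote sensing constellation

def run_perception(model: Callable[[List[np.ndarray], np.ndarray, np.ndarray], np.ndarray],
                   frame: FrameInputs) -> np.ndarray:
    """Forward one frame of the three data streams through the BEV perception model."""
    return model(frame.camera_images, frame.lidar_points, frame.satellite_bev)
```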
Embodiment five:
in this embodiment, an apparatus for realizing the operation of the BEV perception model of the first and second embodiments is provided. The apparatus comprises one or more processors and storage means for storing one or more programs that, when executed by the one or more processors, enable the apparatus to implement the BEV perception model operations described in the first and second embodiments.
In particular, the device is preferably an electronic device, but may also be connected to one or more external devices, such as a keyboard, a router, or a consumer electronic mobile device.
Embodiment six:
in this embodiment, a non-transitory computer-readable storage medium for realizing the operation of the BEV perception model of the first and second embodiments is provided. The storage medium stores computer-executable instructions that, when executed by a processor of a device, enable the operation of the BEV perception model of the first and second embodiments to be implemented.
The non-transitory computer-readable storage medium has stored thereon a program product of the method described herein. The readable storage medium may be, but is not limited to, an electronic, magnetic, optical or infrared medium, or any combination of the above. Specific examples include flash cards (Flash Card), read-only memories (ROM) and hard disks (HDD).
Embodiment seven:
in this embodiment, a vehicle equipped with the electronic device of embodiment five is provided. The vehicle combines and makes full use of the BEV perception model, construction method, device, apparatus and storage medium of embodiments one to six to realize automatic driving.
According to the invention, the instant remote sensing constellation is used to acquire the overhead view of the ground and thus a natural BEV view angle, which avoids the view-angle conversion and three-dimensional information reconstruction problems that arise when camera and laser radar data are converted into BEV mode; in addition, the BEV view acquired by the instant remote sensing constellation is fused with the forward data acquired by the conventional camera or laser radar, improving the accuracy and robustness of the conversion of sensor data into BEV mode. Furthermore, the images of the instant remote sensing constellation have high frequency, high resolution and wide coverage, providing prior information beyond the perception range of the image perception module and enhancing the features extracted from the bird's-eye view.
The specific embodiments described herein are offered by way of example only to illustrate the spirit of the invention. Those skilled in the art may make various modifications or additions to the described embodiments or substitutions thereof without departing from the spirit of the invention or exceeding the scope of the invention as defined in the accompanying claims.
Although terms such as the image view encoder 1, the two-dimensional backbone network 1a, the neck module 2a, the mapping module 2, the BEV encoder 3, the three-dimensional backbone network 4, the fusion module 5 and the three-dimensional object detection head 6 are used frequently herein, the possibility of using other terms is not excluded. These terms are used merely for convenience in describing and explaining the nature of the invention and shall not be construed as imposing any additional limitation inconsistent with its spirit.

Claims (14)

1. A BEV perception model comprising an image view encoder (1) for feature extraction of raw image data input by an image perception module of a vehicle to output image multi-view features with semantic information, characterized in that the BEV perception model in combination with instant remote sensing further comprises:
the mapping module (2) is used for mapping the image multi-view characteristics from two dimensions to three dimensions and converting the obtained three-dimensional image multi-view characteristics into a bird's eye view taking the vehicle as a center;
And the BEV encoder (3) is used for acquiring the remote sensing image sent by the instant remote sensing constellation, the bird's-eye view and a fused image formed by fusing the remote sensing image and the bird's-eye view, fusing the three image data and extracting the bird's-eye view characteristics to form the BEV characteristics of the camera.
2. BEV perception model according to claim 1, characterized in that the remote sensing image and the bird's eye view are fused to form a fused image by means of PCA transformation before entering the BEV encoder (3).
3. The BEV perception model according to claim 1 or 2, characterized in that the remote sensing image, the bird's eye view and the fused image within the BEV encoder (3) are data fused in a manner of image data matrix superposition.
4. The BEV perception model according to claim 1, characterized in that the BEV perception model in combination with instant remote sensing further comprises:
the three-dimensional backbone network (4) is used for extracting characteristics of original radar point cloud data input by the laser perception module of the vehicle to obtain radar BEV characteristics;
and the fusion module (5) is used for acquiring radar BEV characteristics and camera BEV characteristics, and fusing the radar BEV characteristics and the camera BEV characteristics to obtain multi-mode BEV characteristics.
5. The BEV perception model according to claim 4, wherein the BEV perception model in combination with instant remote sensing further comprises: the three-dimensional target detection head (6) is used for acquiring and outputting the multi-mode BEV characteristics sent by the fusion module (5).
6. The BEV perception model according to claim 4 or 5, characterized in that the image view encoder (1) comprises a two-dimensional backbone network (1a) and a neck module (2a), and the original image data input by the image perception module of the vehicle are passed through the two-dimensional backbone network (1a) and the neck module (2a) to output image multi-view features with semantic information.
7. A BEV perception model construction method, characterized in that the method comprises the steps of:
A. training data preparation and loading: collecting training data, wherein the training data comprises data obtained by an image sensing module and instant remote sensing constellation detection, and cleaning and processing the collected data;
B. dividing the data set: dividing the data after data cleaning into a training set, a verification set and a test set which are mutually independent;
C. building a BEV network model: constructing a BEV perception model comprising an image view encoder (1), a mapping module (2) and a network structure of the BEV encoder (3);
D. training a model: substituting the training set into the BEV perception model to train, and carrying out model parameter adjustment after training;
E. model evaluation: substituting the data of the test set into the BEV perception model after model parameter adjustment to obtain a predicted value, measuring the difference between the predicted value and a true value by adopting a loss function, evaluating the current model by an evaluation index, judging whether the model is qualified or not according to the measurement of the loss function and the evaluation of the evaluation index, returning to the step D if the model is unqualified, and storing the model if the model is qualified;
F. Model prediction: substituting the verification set data into the model with qualified evaluation to obtain an output result, evaluating the result by adopting an evaluation index, and completing the construction of the BEV perception model combined with instant remote sensing after the evaluation is completed.
8. The BEV perception model construction method according to claim 7, characterized in that in step a the training data further comprises data detected by a laser perception module, and in step C the BEV perception model further comprises a three-dimensional backbone network (4), a fusion module (5) and a three-dimensional object detection head (6).
9. The BEV perception model construction method according to claim 8, wherein in step E, when the difference between the predicted value and the true value is greater than a preset difference threshold, it is determined that iterative training is required and step D is entered; when the difference is less than the preset difference threshold, the model is evaluated by the evaluation index; step D is entered if the evaluation fails, and after the evaluation passes the model is saved and step F is entered.
10. The BEV perception model construction method according to claim 8 or 9, characterized in that the total loss function is $L = \alpha L_{TSL} + \beta L_{VMI} + \gamma L_{IG}$, wherein $\alpha$, $\beta$ and $\gamma$ are three hyperparameters that respectively weight $L_{TSL}$, $L_{VMI}$ and $L_{IG}$; $L_{TSL}$ is the BEV view-angle alignment loss function, $L_{VMI}$ is the BEV feature mutual information loss function, and $L_{IG}$ is the BEV prior-information gain loss function; $L_{TSL}$ is given by $L_{TSL} = \frac{1}{N}\sum_{i=1}^{N}\bigl[1 - S(y_i, \hat{y}_i)\bigr]$, where $N$ is the total number of pixels in the image, $y_i$ is the value of the $i$-th pixel in the BEV view image obtained after the sensor data are converted into BEV mode, $\hat{y}_i$ is the value of the $i$-th pixel in the BEV view image acquired by the instant remote sensing constellation, and $S(y_i, \hat{y}_i)$ is the similarity between the two pixels under different transformations, calculated as $S(y_i, \hat{y}_i) = \frac{1}{M}\sum_{j=1}^{M}\mathrm{SSIM}\bigl(T_j(y_i), T_j(\hat{y}_i)\bigr)$, where $M$ is the total number of transformations, $T_j$ is the $j$-th transformation function, and $\mathrm{SSIM}(T_j(y_i), T_j(\hat{y}_i))$ is the structural similarity between the two transformed pixels, given by
$\mathrm{SSIM}\bigl(T_j(y_i), T_j(\hat{y}_i)\bigr) = \frac{(2\mu_y\mu_{\hat{y}} + c_1)(2\sigma_{y\hat{y}} + c_2)}{(\mu_y^2 + \mu_{\hat{y}}^2 + c_1)(\sigma_y^2 + \sigma_{\hat{y}}^2 + c_2)}$, where $\mu_y$ and $\mu_{\hat{y}}$ are the means of $T_j(y_i)$ and $T_j(\hat{y}_i)$ within a small window, $\sigma_y^2$ and $\sigma_{\hat{y}}^2$ are their variances within the window, $\sigma_{y\hat{y}}$ is their covariance within the window, and $c_1$ and $c_2$ are two constants; $L_{VMI}$ is given by $L_{VMI} = \sum_{x,y} q(x, y)\,\log\frac{q(x, y)}{q(x)\,q(y)}$, where $x$ and $y$ respectively denote the BEV features obtained with the sensor alone and the BEV features obtained by fusing the BEV view acquired by the instant remote sensing constellation with the forward data acquired by the sensor, $q(x, y)$ is the probability distribution of the two features occurring together, and $q(x)$ and $q(y)$ are the probability distributions of each feature on its own; $L_{IG}$ is given by $L_{IG} = \sum_{i=1}^{C} y_i \log\frac{\hat{y}_i}{\tilde{y}_i}$, where $C$ is the total number of categories, $y_i$ is the probability of the $i$-th category in the true annotation, $\hat{y}_i$ is the probability of the $i$-th category predicted by the model, and $\tilde{y}_i$ is the probability of the $i$-th category when no prior information is used.
11. A device of the BEV perception model, comprising an image perception module for inputting original image data and a laser perception module for inputting original radar point cloud data, characterized in that the device further comprises a position module for transmitting a position signal of the vehicle to an instant remote sensing constellation and a vehicle-mounted computer provided with the BEV perception model according to any one of claims 1 to 6, wherein the image perception module and the laser perception module are connected to the vehicle-mounted computer, and the vehicle-mounted computer is wirelessly connected to the instant remote sensing constellation to receive the remote sensing images it transmits, so that the operation of the BEV perception model according to any one of claims 1 to 6 is realized through these connections.
12. An apparatus of a BEV perception model, comprising one or more processors; storage means for storing one or more programs that, when executed by the one or more processors, cause the apparatus to implement the BEV perception model operation of any of claims 1-6.
13. A vehicle of the BEV perception model, characterized in that it is provided with the device of claim 12.
14. A non-transitory computer-readable storage medium storing computer-executable instructions that, when executed by a processor of a device, enable the BEV perception model operation of any one of claims 1-6.
CN202311315964.0A 2023-10-11 2023-10-11 BEV perception model, construction method, device, equipment, vehicle and storage medium Pending CN117423077A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311315964.0A CN117423077A (en) 2023-10-11 2023-10-11 BEV perception model, construction method, device, equipment, vehicle and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311315964.0A CN117423077A (en) 2023-10-11 2023-10-11 BEV perception model, construction method, device, equipment, vehicle and storage medium

Publications (1)

Publication Number Publication Date
CN117423077A true CN117423077A (en) 2024-01-19

Family

ID=89529280

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311315964.0A Pending CN117423077A (en) 2023-10-11 2023-10-11 BEV perception model, construction method, device, equipment, vehicle and storage medium

Country Status (1)

Country Link
CN (1) CN117423077A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118097721A (en) * 2024-04-29 2024-05-28 江西师范大学 Wetland bird recognition method and system based on multi-source remote sensing observation and deep learning


Similar Documents

Publication Publication Date Title
US11532151B2 (en) Vision-LiDAR fusion method and system based on deep canonical correlation analysis
CN110988912B (en) Road target and distance detection method, system and device for automatic driving vehicle
US10817731B2 (en) Image-based pedestrian detection
CN112233097B (en) Road scene other vehicle detection system and method based on space-time domain multi-dimensional fusion
CN112396650A (en) Target ranging system and method based on fusion of image and laser radar
CN114359181B (en) Intelligent traffic target fusion detection method and system based on image and point cloud
CN103902976A (en) Pedestrian detection method based on infrared image
CN115943439A (en) Multi-target vehicle detection and re-identification method based on radar vision fusion
CN114639115B (en) Human body key point and laser radar fused 3D pedestrian detection method
CN114495064A (en) Monocular depth estimation-based vehicle surrounding obstacle early warning method
CN117423077A (en) BEV perception model, construction method, device, equipment, vehicle and storage medium
CN114295139A (en) Cooperative sensing positioning method and system
CN117111085A (en) Automatic driving automobile road cloud fusion sensing method
CN113688738A (en) Target identification system and method based on laser radar point cloud data
CN117808689A (en) Depth complement method based on fusion of millimeter wave radar and camera
CN114966696A (en) Transformer-based cross-modal fusion target detection method
CN110909656B (en) Pedestrian detection method and system integrating radar and camera
CN117475355A (en) Security early warning method and device based on monitoring video, equipment and storage medium
CN113743163A (en) Traffic target recognition model training method, traffic target positioning method and device
CN113298781B (en) Mars surface three-dimensional terrain detection method based on image and point cloud fusion
CN113624223B (en) Indoor parking lot map construction method and device
CN114550016B (en) Unmanned aerial vehicle positioning method and system based on context information perception
Zheng et al. Research on environmental feature recognition algorithm of emergency braking system for autonomous vehicles
CN112395956A (en) Method and system for detecting passable area facing complex environment
Rasyidy et al. A Framework for Road Boundary Detection based on Camera-LIDAR Fusion in World Coordinate System and Its Performance Evaluation Using Carla Simulator

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination