CN116450761A - Map generation method, map generation device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN116450761A
Authority
CN
China
Prior art keywords
map
feature
features
sample
fusion
Prior art date
Legal status
Pending
Application number
CN202310301074.8A
Other languages
Chinese (zh)
Inventor
赵行 (Hang Zhao)
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202310301074.8A
Publication of CN116450761A
Status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/29 Geographical information databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Remote Sensing (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a map generation method, a map generation device, electronic equipment and a storage medium, and relates to the technical field of map generation. The method comprises the following steps: acquiring a target image, and processing the target image through an encoder to obtain target image features; determining corresponding target prior features from the neural map prior according to the position information corresponding to the target image; inputting the target image features and the target prior features into a trained map feature generation model for feature fusion to obtain target fusion features; and inputting the target fusion features to a decoder to obtain a semantic map corresponding to the target image. With the method provided by the embodiment of the invention, the inference performance of the map can be improved by fusing the current target image features with the neural map prior, thereby improving the prediction quality of the online semantic map.

Description

Map generation method, map generation device, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of map generation, in particular to a map generation method, a map generation device, electronic equipment and a storage medium.
Background
Most high-definition semantic maps are offline maps built through expensive manual annotation. When road conditions change, such maps do not support timely updating, so vehicles using outdated maps face potential safety hazards.
Based on this, online prediction of semantic maps has been proposed in the related art. These schemes usually use a deep learning method to infer the semantic map in real time, effectively addressing the problem that the map cannot be updated in time. However, the quality of the map inferred online is usually far lower than that of a pre-built offline map; that is, how to predict a high-quality semantic map in real time is the technical problem to be solved by the present invention.
Disclosure of Invention
Based on the technical problems, the embodiment of the invention provides a map generation method, a map generation device, electronic equipment and a storage medium, so as to improve the prediction quality of a semantic map.
The embodiment of the invention provides a map generation method, which comprises the following steps:
acquiring a target image, and processing the target image through an encoder to obtain target image features;
determining corresponding target prior features from the neural map prior according to the position information corresponding to the target image;
inputting the target image features and the target prior features into a trained map feature generation model for feature fusion to obtain target fusion features;
and inputting the target fusion features to a decoder to obtain a semantic map corresponding to the target image.
A second aspect of an embodiment of the present invention provides a map generating apparatus, including:
the image feature determining module is used for acquiring a target image and processing the target image through the encoder to obtain target image features;
the prior feature determining module is used for determining corresponding target prior features from the neural map prior according to the position information corresponding to the target image;
the fusion feature determining module is used for inputting the target image features and the target prior features into a trained map feature generation model to perform feature fusion to obtain target fusion features;
and the map determining module is used for inputting the target fusion features to a decoder to obtain a semantic map corresponding to the target image.
A third aspect of an embodiment of the present invention provides an electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the map generation method of the first aspect of the embodiments of the invention.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the map generation method of the first aspect of the embodiments of the present invention.
According to the map generation method, a target image is obtained, and the target image is processed through an encoder to obtain target image features; corresponding target prior features are determined from the neural map prior according to the position information corresponding to the target image; the target image features and the target prior features are input into a trained map feature generation model for feature fusion to obtain target fusion features; and the target fusion features are input to a decoder to obtain a semantic map corresponding to the target image. In this method, a neural map prior is provided, so that when map inference is performed, the trained map feature generation model fuses the current features (i.e., the target image features) with the previous features (the corresponding target prior features in the neural map prior). By fusing the current target image features with the neural map prior, the inference performance of the map is improved, thereby improving the prediction quality of the online semantic map.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a map generation method according to an embodiment of the present invention;
FIG. 2 is a visualization of the attention maps after GRU fusion in accordance with an embodiment of the present invention;
FIG. 3 is a flow chart of a map generation method according to an embodiment of the present invention;
fig. 4 is a block diagram of a map generating apparatus according to an embodiment of the present invention;
fig. 5 is a schematic diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In fact, high-definition semantic maps have very wide application and are of great significance for pedestrian and vehicle navigation. For example, High-Definition (HD) semantic maps are a key component of urban-street autonomous driving: autonomous vehicles rely on them to predict future trajectories and navigate across urban streets. Most autonomous cars use pre-annotated offline HD semantic maps. These offline semantic maps are built using complex construction pipelines, including survey vehicles with multiple LiDAR scans, global point cloud alignment, and manual annotation of map elements. While these offline mapping solutions achieve high accuracy, they are cumbersome and expensive, limiting their scalability.
As mentioned above, some solutions have been proposed to learn a high-definition semantic map from on-vehicle sensor observations. These solutions generally use a deep learning method to infer map elements in real time, effectively alleviating the map-update problem. However, the high-definition semantic map learning of these methods depends on the sensing range of the sensor and is easily affected by occlusion; in cases of bad weather and occlusion, the inferred map quality may deteriorate further. As a result, the inferred map quality is generally lower than that of a global offline map built in advance and cannot meet the accuracy requirements of a semantic map.
Based on the above, the invention provides a map generation method that introduces a neural map prior, i.e., the neural network features of the map obtained by previous inference. A pre-trained map feature generation model fuses the current image features (i.e., the target image features) with the corresponding prior features in the neural map prior (i.e., the target prior features) to obtain refined features (i.e., the target fusion features), from which the semantic map corresponding to the target image is obtained, thereby improving the online prediction quality of the semantic map. That is, the present invention provides a hybrid mapping method that combines the best features of both timelines, further improving high-definition map learning performance through a neural map prior that is constructed and maintained in advance.
Referring to fig. 1, fig. 1 is a flowchart illustrating a map generation method according to an embodiment of the present invention.
As shown in fig. 1, the method comprises the steps of:
step S11: and obtaining a target image, and processing the target image through an encoder to obtain target image characteristics.
The present embodiment may first acquire the target image so as to estimate the semantic map corresponding to it. The target image may be a road environment image or a street environment image, and may be a single-frame panoramic image, multiple frames of surround-view images, or an image stitched from multiple frames of surround-view images. Specifically, the target image may be acquired by cameras on the vehicle that capture the surrounding environment; for example, images captured by six surround-view cameras on the vehicle may be processed frame by frame. The vehicle may be an autonomous vehicle, an intelligent vehicle, or an ordinary vehicle. Further, in this embodiment, the target image may be captured in real time and its semantic map estimated in real time by the method of this embodiment, or the target image may be captured and saved first, with the saved image later retrieved and its semantic map estimated by the method of this embodiment; this embodiment imposes no limitation here.
The embodiment can process the obtained target image through the encoder to obtain the target image features corresponding to the target image. The encoder and decoder in this embodiment may come from any encoder-decoder architecture that performs map inference, for example an HDMapNet model (i.e., an online high-definition map construction and evaluation framework), an LSS model (i.e., Lift-Splat-Shoot, which encodes images from arbitrary camera rigs by implicitly unprojecting them to three dimensions), a BEVFormer model (i.e., a model that learns bird's-eye-view representations from multi-camera images via spatiotemporal transformers), a VectorMapNet model (i.e., an end-to-end vectorized high-definition map learning model), etc. These are all models that perform online semantic map inference in the related art, and the map generation method provided in this embodiment can further improve the inference quality of the semantic map on top of such models. In one embodiment, the target image feature produced by the encoder may be a BEV (bird's-eye-view) feature.
Step S12: determining corresponding target prior features from the neural map prior according to the position information corresponding to the target image.
In this embodiment, a Neural Map Prior (NMP) is pre-stored. It is a neural representation of the global map, i.e., the neural network features of the map obtained through inference by the method of this embodiment, and it may be defined as sparse map tiles initialized to be empty. That is, the sparse-map-tile storage is created first (initially empty); target fusion features are then obtained by inference according to the method of this embodiment and gradually stored into the neural map prior, yielding the pre-stored neural map prior.
It can be understood that the neural map prior of this embodiment is a global neural map prior, the target fusion feature is a local map feature, and the semantic map corresponding to the target image is a local semantic map. Here, "global" corresponds to "local", meaning that the neural map prior of this embodiment (i.e., the global neural map prior) comprises a plurality of "local" map features.
In this embodiment, the position information corresponding to the target image may be obtained, and the prior feature corresponding to that position information (i.e., a fusion feature previously inferred by this method) is then determined from the neural map prior as the target prior feature. The neural map prior is stored as sparse map tiles: each previously inferred fusion feature is stored into the corresponding map tile, geographically indexed by the position information of the image from which it was inferred. That is, each map tile corresponds to a location on the physical map. The image used for inference carries position information; after the fusion feature is inferred from the image by this method, it is stored as a map tile associated with that position information, thereby forming a neural map prior composed of a plurality of sparse map tiles.
In this embodiment, map tiles are used as the storage format of the neural map prior. This is because, in cities, buildings occupy a large part of the area, while road-related areas occupy only a small part. To avoid the map storage growing with the physical scale of the city, this embodiment designs a storage structure that divides the city into sparse map tiles indexed by their physical coordinates. For example, in the nuScenes dataset, the Boston area has an upper-left corner of (298 m, 328 m) and a lower-right corner of (2527 m, 1896 m); Boston in the nuScenes dataset is thus a city area roughly 2 km tall and 1.5 km wide. If the feature dimension of the neural map prior is defined as 256 channels and the resolution of the map prior features is 0.3 m, then 38 GB would be needed to store the Boston data of the nuScenes dataset. Based on this, instead of storing a neural map prior at every location of the city, this embodiment divides the city into 32x32 map tiles, each of size 69 m x 49 m. Typically, the map tiles of this embodiment are only slightly larger than the Bird's-Eye-View (BEV) range, which is set to 60 m x 30 m. During on-board map inference, only the relevant map tiles, especially those that overlap the current perception range, need to be extracted from the global map (i.e., the neural map prior), so map tiles containing no road-related information need not be stored. After these map tiles are removed, this embodiment can store the road information of an entire city, such as the Boston area, in only 12 GB.
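For illustration, a sparse tile store of this kind could be organized as in the following Python sketch. The class and method names (SparseTileStore, tile_index, read_tile, write_tile), the NumPy backing, and the lazy zero-initialization are assumptions of this sketch rather than the exact implementation of the embodiment; the tile size, resolution, and channel count reuse the figures quoted above.

import numpy as np

class SparseTileStore:
    """Sketch of a sparse neural-map-prior tile store.

    The city is covered by a grid of tiles indexed by physical
    coordinates; tiles are created lazily, so areas without road
    information consume no storage.
    """

    def __init__(self, tile_h_m=69.0, tile_w_m=49.0,
                 resolution_m=0.3, channels=256):
        self.tile_h_m = tile_h_m
        self.tile_w_m = tile_w_m
        self.res = resolution_m
        self.channels = channels
        self.tiles = {}  # (row, col) -> feature array, created on demand

    def tile_index(self, x_m, y_m):
        # Physical city coordinates (meters) -> tile grid index.
        return (int(y_m // self.tile_h_m), int(x_m // self.tile_w_m))

    def read_tile(self, idx):
        # Missing tiles are served as empty (all-zero) priors.
        h = int(round(self.tile_h_m / self.res))
        w = int(round(self.tile_w_m / self.res))
        return self.tiles.get(idx, np.zeros((h, w, self.channels),
                                            dtype=np.float32))

    def write_tile(self, idx, features):
        # Update by direct substitution: fused features replace the prior.
        self.tiles[idx] = features

During inference, only the tile indices covered by the current 60 m x 30 m BEV range would be read, matching the on-demand extraction described above.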
Therefore, the sparse map tile structure saves storage: each car can adopt the map tile structure of this embodiment with only a small amount of disk space. A vehicle need not store a map of the entire city; it can download map tiles as needed. These map tiles are updated, integrated, and asynchronously uploaded to the cloud, while the trained model remains fixed. Over time, more and more trip data can be collected, yielding a wider-coverage and better-quality map.
The location information corresponding to the target image may be acquired by a positioning system that provides accurate positioning in the vehicle, for example, the vehicle is equipped with an on-board sensor, and the on-board sensor includes a camera that captures the surrounding environment and a positioning system (such as a GPS/IMU system) that provides accurate positioning, so that the target image may be acquired by the camera, and the location information corresponding to the target image may be acquired by the positioning system.
Step S13: inputting the target image features and the target prior features into a trained map feature generation model to perform feature fusion, so as to obtain target fusion features.
In this embodiment, after the target image features and the target prior features are obtained, they may be input into the pre-trained map feature generation model, which performs feature fusion on them and outputs the fused and refined target fusion features. The map feature generation model trained in advance in this embodiment fuses the target image features with the corresponding neural map prior when performing local map prediction, further refining the target image features and thus inferring a high-precision local semantic map.
Step S14: inputting the target fusion features to a decoder to obtain a semantic map corresponding to the target image.
In this embodiment, after obtaining the target fusion feature output by the map feature generation model, the target fusion feature may be input to the decoder to obtain the local semantic map output by the decoder, i.e. the semantic map corresponding to the target image, so as to complete the prediction of the online local map.
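To make the flow of steps S11 to S14 concrete, the following minimal Python sketch wires the four stages together; the callables encoder, lookup_prior, fusion_model, and decoder are placeholders assumed for illustration, not components named by the embodiment.

def generate_semantic_map(image, pose, encoder, lookup_prior,
                          fusion_model, decoder):
    """Sketch of the online loop of steps S11-S14.

    image: the target image(s); pose: its position information.
    """
    # S11: encode the target image into target image (BEV) features.
    bev_features = encoder(image)
    # S12: query the neural map prior at the current location for the
    # target prior features.
    prior_features = lookup_prior(pose)
    # S13: fuse current and prior features with the trained model.
    fused_features = fusion_model(bev_features, prior_features)
    # S14: decode the target fusion features into the local semantic map.
    semantic_map = decoder(fused_features)
    # The fused features also refresh the prior by direct substitution
    # (see the update step described below).
    return semantic_map, fused_features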
In this embodiment, when online semantic map prediction is performed, a neural map prior is provided: not only the current features (i.e., the target image features) but also the previous features (the corresponding target prior features in the neural map prior) are considered. The current and previous features are fused through a simple and efficient map feature generation model, refining the current features. Because the neural map prior is collected from different vehicles at different points in time, it is compatible with the encoder-produced target image features of related map-inference techniques, improving the inference performance of the map and thus the prediction quality of the online semantic map. In addition, the reliable information in the neural map prior allows this embodiment to mitigate the influence of severe weather on online local map inference, so map inference results under severe weather are significantly improved.
In combination with the above embodiment, in an implementation manner, the present invention further provides a map generation method, where the method may further include the step of: replacing the target prior feature in the neural map prior with the target fusion feature.
In this embodiment, after the target fusion feature is obtained, the target fusion feature not only can be processed by the decoder to obtain the semantic map corresponding to the target image, but also can be used for updating the neural map prior. The size of the target image features is the same as the size of the target prior features, the size of the fused target fusion features is the same as the size of the target prior features, and the neural map prior is updated after the target fusion features are obtained each time. Specifically, after the target fusion feature is obtained, the target prior feature in the neural map prior is replaced by the target fusion feature, so that the neural map prior is updated.
In this embodiment, the target fusion feature output by the map feature generation model can be used both for updating the global neural map prior and for local map inference, so that the quality of local map inference and the representation quality of the global neural map prior improve each other. As the vehicle passes through more scenes, the predicted local semantic map becomes better and the global neural map prior becomes more complete and more timely.
In combination with the above embodiment, in an implementation manner, the present invention further provides a map generating method, where the method further includes a training method of a map feature generating model, and specifically, the training step of the map feature generating model may include the following steps:
Step A: acquiring a sample image, and processing the sample image through the encoder to obtain sample image features.
In this embodiment, the sample image may be obtained from a sample library or database. For example, the database may be the nuScenes dataset, a large autonomous-driving dataset covering various weather conditions, traffic conditions, and different periods of the day, including multiple traversals, with accurate positioning and annotated high-definition map semantic labels. The dataset may include, for example, camera extrinsics and the transformation from the autonomous vehicle to the global coordinate system.
After the sample image is obtained, it may be processed by the encoder, for example by BEV feature extraction, to obtain the sample image features corresponding to the sample image; the sample image features may be BEV features. The encoder-decoder structure used in training in this embodiment is the same as that used in application: any encoder-decoder architecture that performs map inference, for example the HDMapNet, LSS, BEVFormer, or VectorMapNet model.
Step B: determining corresponding sample prior features from the neural map prior according to the position information corresponding to the sample image.
In this embodiment, according to the position information corresponding to the sample image, the prior feature corresponding to that position information may be determined from the neural map prior and used as the sample prior feature. The method for generating the neural map prior during training is the same as during application. It can be appreciated that, during training, the neural map prior is built from scratch and aids the training of the fusion module (i.e., the initial model); during application, the neural map prior likewise starts from nothing and is gradually filled in and updated through the fusion module, so as to better predict the online map.
Step C: inputting the sample image features and the sample prior features into an initial model for feature fusion to obtain sample fusion features; the sample fusion features are used for determining a semantic map corresponding to the sample image and for updating the neural map prior.
This embodiment constructs an initial model for feature fusion of the sample image features and the sample prior features. The sample image features and the sample prior features are then input into the initial model for feature fusion, obtaining the sample fusion features output by the initial model. The sample fusion features can be used to determine the semantic map corresponding to the sample image and to update the neural map prior, i.e., to perform local map inference and global neural map prior updating. Specifically, the local map inference process can run directly on a test vehicle, fusing on-vehicle sensor observations with the global map prior, or it can be performed on sample data acquired from the dataset. Local map inference in turn updates the map prior through attention operations. These two processes form a cycle and can be improved by collecting large amounts of data (e.g., data collected every day from a large number of vehicles on the road).
Step D: training the initial model based on the sample fusion features, and determining the trained initial model as the map feature generation model.
In this embodiment, after the initial model outputs the sample fusion feature, the initial model may be trained based on the sample fusion feature, so that the trained initial model is determined to be the map feature generation model.
Illustratively, in one embodiment, a BEV encoder-decoder architecture is employed, with the BEV encoder denoted $f_E$ and the decoder $f_D$. The neural map prior (i.e., the global neural map prior) is denoted $P_g \in \mathbb{R}^{H_G \times W_G \times C}$, where $H_G$ and $W_G$ represent the height and width of the city, respectively. A set of observations consists of a sample image $I$ and its corresponding position information $Pos_{ego}$ (i.e., the vehicle position in the global coordinate system); in this embodiment, one set of observations is one set of training data. A transform $G_{ego}$ maps the local coordinates of each BEV pixel (over an $H \times W$ grid, where $H$ and $W$ denote the dimensions of the BEV feature) into the fixed global coordinate system. First, the sample image features (i.e., the online BEV features) $O \in \mathbb{R}^{H \times W \times C}$ are acquired, where $C$ represents the hidden embedding size of the network; the vehicle position $Pos_{ego}$ is then used to query the global prior $P_g$ to obtain the sample prior features (i.e., the local prior BEV features) $P_l \in \mathbb{R}^{H \times W \times C}$. Subsequently, a fusion function (i.e., the initial model) is applied to obtain the sample fusion features (i.e., the refined BEV features), denoted $F_{refine} = f_{Fusion}(O, P_l)$, where $F_{refine} \in \mathbb{R}^{H \times W \times C}$. Finally, the refined BEV features are decoded by the decoder $f_D$ into the final map output, while $F_{refine}$ is used to update the global map prior $P_g$. The whole process runs continuously, integrating diverse and complementary observations from different periods over time, thereby completing the training of the map feature generation model.
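The pose-based query of $P_g$ can be pictured with the short sketch below. The affine layout of $G_{ego}$ (a 2D rotation by the ego yaw plus a translation by the ego position), the ego-centered BEV grid, and the helper name bev_pixel_to_global are assumptions of this illustration; the real transform of the embodiment may differ, e.g., in axis conventions.

import numpy as np

def bev_pixel_to_global(u, v, pose_xy, yaw,
                        resolution_m=0.3, bev_hw=(200, 100)):
    """Sketch of G_ego: map BEV pixel (u, v) to global coordinates.

    Assumes a BEV grid centered on the ego vehicle; pose_xy and yaw
    come from the localization system.
    """
    h, w = bev_hw
    # Pixel offset from the BEV center, converted to meters (ego frame).
    x_ego = (u - w / 2.0) * resolution_m
    y_ego = (v - h / 2.0) * resolution_m
    # Rotate by the ego yaw and translate by the ego position.
    c, s = np.cos(yaw), np.sin(yaw)
    x_g = c * x_ego - s * y_ego + pose_xy[0]
    y_g = s * x_ego + c * y_ego + pose_xy[1]
    return x_g, y_g

Sampling $P_l$ from $P_g$ then amounts to gathering (e.g., bilinearly) the prior features at the global coordinates of every BEV pixel.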
In this embodiment, for training of the map feature generation model, a neural map prior is provided to combine maintenance of an offline global neural map prior and reasoning of an online local map, and meanwhile, the calculation amount and the memory consumed by local reasoning through the method of this embodiment are similar to those of a single frame system in a related technology, but better technical effects can be obtained: and the quality of local map reasoning is improved.
In combination with the above embodiment, the present invention further provides a map generation method, in which the initial model includes an attention mechanism module and a gated recurrent unit; and step C may specifically further comprise the following steps:
Step C1: inputting the sample image features and the sample prior features to the attention mechanism module to obtain intermediate fusion features.
In this embodiment, the prior features in the proposed Neural Map Prior (NMP) provide powerful prior information for online map inference, and a fusion function is needed to aggregate the prior features and the current features. In general, concatenation operations are widely used for feature aggregation; concatenation is a symmetric function in which prior features and current features are considered equally important. However, in the map problem of this embodiment, road conditions may vary greatly between traversals, which means that the current features and the prior features may have different importance. Thus, this embodiment uses an asymmetric fusion function, consisting of the attention mechanism module and a gated-recurrent-unit variant, to construct the initial model.
The attention mechanism module in the initial model of this embodiment is mainly used to dynamically capture the correlation between the current features and the previous features (i.e., the prior features) for feature fusion. The attention mechanism module in this embodiment may be a current-to-prior cross attention module (C2P Attention). The sample image features and the sample prior features are input to the attention mechanism module, which performs feature-fusion processing on the current and previous features, obtaining the intermediate fusion features output by the attention mechanism module.
Step C2: performing feature fusion on the intermediate fusion features and the sample prior features through the gated recurrent unit to obtain the sample fusion features.
Regarding the update of the neural map prior: if the update rate is too fast, the neural map prior is easily corrupted by poor local observations; if the update rate is too slow, the neural map prior may fail to capture changes in road conditions in time. Based on this, to control the update rate of the neural map prior, this embodiment fuses the intermediate fusion features output by the attention mechanism module with the sample prior features (i.e., the old map prior) through a gated recurrent unit (GRU), balancing the proportion between the newly generated intermediate fusion features and the sample prior features to obtain the final sample fusion features.
By way of example, a gated recurrent unit of a 2D-convolution variant may be used to balance the proportion of updating and forgetting. Let the intermediate fusion feature output by the attention mechanism module be $O'$, and let the local map prior feature updated at time $t-1$ (i.e., the sample prior feature or target prior feature), queried from the global neural map prior $P_g$, be $P_l^{t-1}$. The gated recurrent unit fuses $O'$ with $P_l^{t-1}$ to obtain the new prior feature at time $t$, $P_l^{t}$ (i.e., the sample fusion feature or target fusion feature), and the local semantic map is predicted by the decoder. Then, the global neural map prior $P_g$ at the corresponding location is updated by direct substitution, i.e., the sample prior feature or target prior feature in the neural map prior is replaced with the newly generated sample fusion feature or target fusion feature.
Specifically, the gated recurrent unit may use the following operations to fuse $O'$ with the previous feature $P_l^{t-1}$:
$z_t = \sigma(W_z \ast [P_l^{t-1}, O'])$, $\quad r_t = \sigma(W_r \ast [P_l^{t-1}, O'])$, $\quad \tilde{P}_l^{t} = \tanh(W_h \ast [r_t \odot P_l^{t-1}, O'])$   (1)
where $z_t$ denotes the update gate, $r_t$ denotes the reset gate (forget gate), $\sigma$ denotes the Sigmoid function, $W$ denotes the weights of the 2D convolutions ($W_z$, $W_r$, $W_h$ in equation (1) are all weights), $\ast$ denotes convolution, $[\cdot,\cdot]$ denotes concatenation, and $\odot$ denotes the Hadamard product. That is, the update gate $z_t$ and the forget gate $r_t$ in the GRU determine how much information from the previous traversal (i.e., the previous feature $P_l^{t-1}$) is fused into the current BEV feature $O'$ and into the global map prior feature. As a data-driven method, the GRU in this embodiment acts as a selective attention mechanism in place of hand-crafted linear update rules, achieving a better effect.
Further, the last step of the GRU fusion process is:
$P_l^{t} = (1 - z_t) \odot P_l^{t-1} + z_t \odot \tilde{P}_l^{t}$   (2)
where $z_t \in \mathbb{R}^{H \times W}$ can be understood as a learnable quantity, with $H$ and $W$ representing the height and width of the BEV feature; the local map prior feature is denoted $P_l^{t-1}$ and the current feature $O'$. It can be observed that when the prediction quality of the current frame is good, the network tends to learn a larger $z_t$, thereby giving the current feature more weight; when the prediction quality of the current frame is poor, typically at an intersection or farther away from the car, the network tends to learn a larger $1 - z_t$, giving the prior feature more weight. In this way, the gated recurrent unit in the initial model of this embodiment learns to selectively combine the features of the current and previous frames, thereby better controlling the update rate of the neural map prior.
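A minimal convolutional GRU fusion cell matching equations (1) and (2) could look like the following PyTorch sketch; the kernel size and channel count are assumptions of this illustration rather than the embodiment's exact configuration.

import torch
import torch.nn as nn

class ConvGRUFusion(nn.Module):
    """Sketch of the 2D-convolution GRU that fuses O' with P_l^{t-1}."""

    def __init__(self, channels=256, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # Each gate sees the concatenation of prior and current features.
        self.conv_z = nn.Conv2d(2 * channels, channels, kernel_size, padding=pad)
        self.conv_r = nn.Conv2d(2 * channels, channels, kernel_size, padding=pad)
        self.conv_h = nn.Conv2d(2 * channels, channels, kernel_size, padding=pad)

    def forward(self, o_cur, p_prev):
        # o_cur:  intermediate fusion feature O'   (B, C, H, W)
        # p_prev: local prior feature P_l^{t-1}    (B, C, H, W)
        x = torch.cat([p_prev, o_cur], dim=1)
        z = torch.sigmoid(self.conv_z(x))          # update gate z_t
        r = torch.sigmoid(self.conv_r(x))          # reset/forget gate r_t
        h_tilde = torch.tanh(
            self.conv_h(torch.cat([r * p_prev, o_cur], dim=1)))
        # Equation (2): P_l^t = (1 - z_t) * P_l^{t-1} + z_t * P~_l^t
        return (1 - z) * p_prev + z * h_tilde

Because the gates are produced by convolutions over the concatenated features, the update rate is decided per BEV cell, which matches the per-location weighting behavior described above.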
FIG. 2 shows the visualized attention maps after GRU fusion according to an embodiment of the present invention. As shown in fig. 2, the rows from first to fifth are: the GT (ground-truth) map; the map inferred by the HDMapNet model; the map inferred by the BEVFormer model; the map inferred by the BEVFormer model using the NMP neural map prior proposed in this embodiment (i.e., the map generation method using NMP proposed in this embodiment); and the GRU weights. As can be seen from fig. 2, compared with the baseline methods, the map generation method proposed in this embodiment can generate more accurate and consistent semantic maps using the neural map prior NMP.
In combination with the above embodiment, in an alternative embodiment, the step C1 may specifically include the following steps:
Step C1-1: dividing the sample image features and the sample prior features into a plurality of blocks respectively to obtain a plurality of sample image sub-features and a plurality of sample prior sub-features.
In this embodiment, after the sample image feature and the sample prior feature are obtained, the sample image feature and the sample prior feature may be divided into a plurality of small blocks, so as to obtain a plurality of sample image sub-features and a plurality of sample prior sub-features. For example, a 10x10 sized block may be used to represent a 3m x 3m region in the BEV, thereby preserving local spatial information while conserving parameters (i.e., conserving computational resources).
Step C1-2: after the plurality of sample image sub-features and the plurality of sample prior sub-features enter the first linear layer, taking each sample image sub-feature as a sample image sub-feature token and each sample prior sub-feature as a sample prior sub-feature token.
In this embodiment, after the plurality of sample image sub-features and the plurality of sample prior sub-features are obtained, they enter the first linear layer of the attention mechanism module, where the first linear layer may be a fully connected layer. The first linear layer treats each block (i.e., each sub-feature) as a token: each sample image sub-feature is used as a sample image sub-feature token, and each sample prior sub-feature is used as a sample prior sub-feature token.
Step C1-3: taking the sample image sub-feature tokens as queries and the sample prior sub-feature tokens as keys and values, and performing attention computation over the queries, keys, and values to obtain the attention result.
In this embodiment, the sample image sub-feature tokens may be used as the queries Q, and the sample prior sub-feature tokens as the keys and values (key-value pairs); the attention operation is then performed over the queries, keys, and values. Specifically, attention weights may be computed from the queries Q and the keys and then applied to the values, yielding the final attention result.
Step C1-4: inputting the attention result to a second linear layer to obtain the intermediate fusion features output by the second linear layer.
In this embodiment, after the attention result (i.e., the attention-processed features) is obtained, it is input to the second linear layer, obtaining the intermediate fusion features output by the second linear layer, which are also the output of the whole attention mechanism module. The output features have the same size as the input features; that is, the intermediate fusion features have the same size as the sample image features and the sample prior features, and the quality of the intermediate fusion features (i.e., the refined BEV features) is superior to that of both the prior features and the current features.
The second linear layer may also be a fully connected layer. In an alternative embodiment, all linear layers in the attention mechanism module use a 256-dimensional feature embedding, i.e., the first linear layer and the second linear layer are both fully connected layers with 256 units.
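The patch-tokenize, attend, and project flow of steps C1-1 to C1-4 can be sketched as follows; the 10x10 patch size and 256-dimensional embedding follow the figures quoted above, while the use of torch.nn.MultiheadAttention and the head count are assumptions of this illustration, not the embodiment's exact attention implementation.

import torch
import torch.nn as nn

class C2PAttention(nn.Module):
    """Sketch of current-to-prior cross attention over BEV patches."""

    def __init__(self, channels=256, patch=10, dim=256, heads=8):
        super().__init__()
        in_dim = channels * patch * patch
        self.patch = patch
        self.embed_cur = nn.Linear(in_dim, dim)   # first linear layer (queries)
        self.embed_pri = nn.Linear(in_dim, dim)   # first linear layer (keys/values)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, in_dim)        # second linear layer

    def _to_tokens(self, x):
        # (B, C, H, W) -> (B, N, C*patch*patch): each patch becomes one token.
        b, c, h, w = x.shape
        p = self.patch
        x = x.reshape(b, c, h // p, p, w // p, p)
        return x.permute(0, 2, 4, 1, 3, 5).reshape(b, -1, c * p * p)

    def forward(self, cur, prior):
        # cur:   current BEV features with PE_c added   (B, C, H, W)
        # prior: prior BEV features with PE_p added     (B, C, H, W)
        b, c, h, w = cur.shape
        q = self.embed_cur(self._to_tokens(cur))     # queries: current tokens
        kv = self.embed_pri(self._to_tokens(prior))  # keys/values: prior tokens
        out, _ = self.attn(q, kv, kv)
        out = self.proj(out)                         # back to token size
        # Fold tokens back onto the BEV grid: output size equals input size.
        p = self.patch
        out = out.reshape(b, h // p, w // p, c, p, p)
        return out.permute(0, 3, 1, 4, 2, 5).reshape(b, c, h, w)

Tokenizing per 10x10 patch (a 3 m x 3 m region at 0.3 m resolution) keeps local spatial structure within each token while keeping the token count, and hence the attention cost, low.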
In combination with the above embodiment, the present invention further provides a map generating method, in which, before the step C1, the training step of the map feature generating model may further include:
Step C0: adding corresponding position encodings to the sample image features and the sample prior features respectively, to obtain intermediate sample image features and intermediate sample prior features.
Considering that the accuracy of the predicted map decreases as a location gets farther from the autonomous vehicle, and in order for the initial model to be aware of this location factor, i.e., to learn to trust the current feature at locations close to the vehicle while trusting the previous feature at locations far from the vehicle, this embodiment preprocesses the sample image features and the sample prior features with position encodings before inputting them into the initial model (e.g., $f_{Fusion}$).
Specifically, corresponding position encodings are added to the sample image features and the sample prior features respectively, obtaining the intermediate sample image features and the intermediate sample prior features. The position encoding is a learnable, grid-shaped variable that enables the initial model to trust the current feature at spatial locations near the vehicle and the previous feature at spatial locations far from the vehicle. In particular, a set of position encodings $PE_p \in \mathbb{R}^{H \times W \times C}$ (grid-like learnable parameters) may be added to the sample prior features, and a set of position encodings $PE_c \in \mathbb{R}^{H \times W \times C}$ to the sample image features, where $H$ and $W$ represent the height and width of the BEV feature, respectively.
In the method, the step C1 may specifically include: and inputting the intermediate sample image features and the intermediate sample prior features to the attention mechanism module to obtain the intermediate fusion features.
In this embodiment, after the intermediate sample image feature and the intermediate sample prior feature are obtained, the intermediate sample image feature and the intermediate sample prior feature may be input to the attention mechanism module for processing, so as to obtain the intermediate fusion feature output by the attention mechanism module.
In one embodiment, after the position codes are respectively added to the sample image feature and the sample prior feature to obtain the intermediate sample image feature and the intermediate sample prior feature, if the attention mechanism module needs to divide the previous feature and the current feature into a plurality of blocks, the intermediate sample image feature and the intermediate sample prior feature are respectively divided into a plurality of blocks, so as to obtain a plurality of sample image sub-features and a plurality of sample prior sub-features.
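As a minimal sketch of the learnable, grid-shaped position encodings of step C0 (the BEV grid size, module name, and zero initialization are assumptions of this illustration):

import torch
import torch.nn as nn

class BEVPositionEncoding(nn.Module):
    """Sketch: learnable position codes PE_c / PE_p added to BEV features."""

    def __init__(self, channels=256, h=200, w=100):
        super().__init__()
        # One learnable code per BEV cell, for the current and prior branches.
        self.pe_current = nn.Parameter(torch.zeros(1, channels, h, w))
        self.pe_prior = nn.Parameter(torch.zeros(1, channels, h, w))

    def forward(self, cur, prior):
        # Step C0: produce the intermediate sample image / prior features.
        return cur + self.pe_current, prior + self.pe_prior

Because the codes depend only on the BEV cell location, the fusion model can learn a location-dependent preference, e.g., weighting the current feature near the vehicle and the prior feature far from it.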
To illustrate the effectiveness of the position encoding, the attention mechanism module, and the gated recurrent unit in this embodiment, a simple fusion baseline, Moving Average (MA), is also constructed for comparison. For MA, in one set of experiments, the moving average is used as the fusion function (i.e., the initial model) in place of the attention mechanism module and the gated recurrent unit, where the update rule of MA may be:
$P_l^{t} = \alpha \cdot O + (1 - \alpha) \cdot P_l^{t-1}$   (3)
where $\alpha$ in equation (3) is a manually searched ratio, $P_l^{t}$ is the new local map prior feature at time $t$, and $P_l^{t-1}$ is the local map prior feature updated at time $t-1$. As shown in Table 1 below, mIoU is the mean intersection-over-union, Divider is the lane separation line, Crossing is the pedestrian crossing, Boundary is the road boundary, and ALL is the total. The C2P Attention mechanism module, the position encoding PE, and the gated recurrent unit GRU proposed in this embodiment are all key to improving online map prediction performance. In particular, GRU and MA both act as update modules and achieve similar performance improvements; this embodiment selects the GRU, which avoids the manual parameter search required by MA. Comparing rows C to E and F to G of Table 1 below, the local PE increases the mIoU of the pedestrian crossing by 2.67 and 2.72, respectively, indicating that local PE is beneficial for feature fusion, especially for the crossing, which is also the most challenging class for single-frame model predictions. Adding local PE allows the model to extract more powerful information from the neural map prior to supplement the information missing from the current observation. Comparing rows C to F and E to G, the mIoU of the lane separation line increases by 1.83 and 2.05 with CA (i.e., C2P Attention), respectively, indicating that CA better handles lane structures. These studies show that the three components proposed in this embodiment (position encoding, attention mechanism module, and gated recurrent unit) are effective for feature fusion and updating.
TABLE 1
In another set of experiments, the method provided by this embodiment was applied to the HDMapNet, LSS, BEVFormer, and VectorMapNet models to evaluate the effectiveness of the proposed map generation method. During training, all modules before the online BEV features can be frozen, and only the C2P Attention module, local PE, GRU, and decoder are trained. During testing, all samples were ordered chronologically. Experiments were performed on 8 NVIDIA 3090 GPUs with an input image size of 1600 x 900 and a batch size of 1. The results in Tables 2 and 3 below indicate that the NMP proposed in this embodiment (i.e., the proposed map generation method) consistently improves map segmentation and detection performance across all baseline models. In Tables 2 and 3, mIoU is the mean intersection-over-union, Divider is the lane separation line, Crossing is the crosswalk, Boundary is the road boundary, and ALL is the total. These results indicate that the NMP provided by this embodiment is a general method, potentially applicable to other map learning frameworks.
TABLE 2
TABLE 3 Table 3
In combination with the above embodiment, in an optional embodiment, the present invention further proposes a map generating method, in which the training step may further include:
Step E: dividing the dataset of the target area to obtain a training set and a test set.
In order to train a map feature generation model that still achieves good inference results when the training set and the test set are distributed over different cities, this embodiment obtains the training set and the test set by re-dividing the dataset of a target area. The target area may be any area, such as any city or region; for example, a subset of the nuScenes dataset (such as the data of the Boston area) may be re-divided to obtain the training set and the test set.
The improvement of the neural map prior over online map inference comes from neural priors generated on other trips, which provide closer observations and complementary perspectives that enable the current observation to "see farther" or bypass obstacles. Data lacking historical trip observations cannot benefit from the neural map prior; therefore, this embodiment obtains the training and test sets by re-dividing the dataset of the target region (e.g., the dataset of the Boston region) such that every training and test sample has a past trip. That is, the divided training set includes training sample images with corresponding neural map priors, the test set includes test sample images with corresponding neural map priors, and the acquisition locations of the training set and of the test set are geographically disjoint.
Step F: testing the map feature generation model trained on the training set according to the test set to obtain a test result.
In this embodiment, the initial model is trained on the sample data in the training set to obtain a trained map feature generation model; the model is then tested on the data in the test set to obtain a test result; the map feature generation model is retrained according to the test result and tested again; and these steps are repeated until a satisfactory map feature generation model is trained. The trained model can thus achieve good results in other cities, avoiding regional limitations as far as possible and alleviating, to a certain extent, the poor generalization of map learning.
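One way to realize a geographically disjoint split in which every sample has a past trip is sketched below; the sample record fields (tile, trip_id) and the tile-level granularity are assumptions of this illustration, not the actual nuScenes schema.

from collections import defaultdict

def split_by_location(samples, test_fraction=0.25):
    """Sketch: geographically disjoint split where every kept sample's
    location has been traversed by more than one trip.

    samples: iterable of dicts with 'tile' and 'trip_id' keys.
    """
    by_tile = defaultdict(list)
    for s in samples:
        by_tile[s['tile']].append(s)

    # Keep only tiles visited on more than one trip, so every retained
    # sample can draw on a neural map prior from a past traversal.
    multi_trip = {t: ss for t, ss in by_tile.items()
                  if len({s['trip_id'] for s in ss}) > 1}

    tiles = sorted(multi_trip)
    n_test = max(1, int(len(tiles) * test_fraction))
    test_tiles = set(tiles[:n_test])  # held-out geographic regions

    train = [s for t, ss in multi_trip.items()
             if t not in test_tiles for s in ss]
    test = [s for t in test_tiles for s in multi_trip[t]]
    return train, test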
In this embodiment, by re-dividing the dataset of the target area and by precisely computing the overlap between historical frames and the current frame, the historical trips corresponding to the current frame can be found more accurately. This embodiment uses the mean intersection-over-union (mIoU) to evaluate the quality of high-definition semantic map learning for the following three static map elements: road boundaries, lane separation lines, and crosswalks. In Table 4 below, Divider is the lane separation line, Crossing is the crosswalk, Boundary is the road boundary, ALL is the total, Boston Split is the proposed division of the target-area dataset (i.e., the division of the Boston-area dataset), Original Split is the ordinary division of the dataset in the original method, and NMP denotes applying the neural map prior of this embodiment. It can be seen that the baseline results under the proposed target-area split are lower than under the original (ordinary) split, and that the improvement from the neural map prior is greater under the target-area split than under the original split.
TABLE 4 Table 4
In an alternative implementation, for the NMP hyperparameters of this embodiment, the rasterized neural map prior may default to a 0.3 m quantization size. In Table 5 below, mIoU is the mean intersection-over-union, Divider is the lane separation line, Crossing is the pedestrian crosswalk, Boundary is the road boundary, ALL is the total, NMP Grid Resolution is the NMP resolution, and Baseline is the baseline. The quantization size of the neural map prior, whose feature dimension is 256, was studied: Table 5 examines the impact of global (i.e., neural) map priors of different resolutions on the effectiveness of online map learning. Intuitively, road information can be considered a small object, so the prior information should be stored at a fine resolution that explicitly indicates whether a cell is road or not; smaller spatial quantization sizes are therefore preferred. However, an extremely small resolution means that the storage required by the neural map prior grows quadratically, and it becomes susceptible to random localization errors. The choice of quantization size is thus a trade-off between smaller storage and higher accuracy, and the studies show that a suitable quantization size (0.3 m) achieves the best performance.
TABLE 5
In an alternative embodiment, please refer to fig. 3, which is a flowchart of a map generation method according to an embodiment of the present invention. Fig. 3 may represent either the training process or the application process of the map feature generation model.
In the training process, the sample image is first processed by the encoder $f_E$ to obtain the current BEV feature $O$ (i.e., the sample image feature); meanwhile, the neural map prior updated at time $t-1$ is pulled from the map tile storage (the selected map tiles). Then, based on the position information of the sample image, i.e., the current vehicle position $Pos_{ego}$, map tiles are sampled from this prior to form the corresponding sample prior BEV feature $P_l^{t-1}$. The sample image feature $O$ and the sample prior feature $P_l^{t-1}$ are then input into the initial model (i.e., the fusion function in the figure) for training. Specifically, position encodings are added to the two features respectively: $PE_c$ is added to the sample image feature $O$, and $PE_p$ is added to the sample prior feature $P_l^{t-1}$. The feature $O$ with $PE_c$ added is divided into a plurality of small blocks to obtain a plurality of sample image sub-features, and the sample prior feature $P_l^{t-1}$ with $PE_p$ added is divided into a plurality of small blocks to obtain a plurality of sample prior sub-features; each sub-feature is converted into a token. Taking the sample image sub-feature tokens as queries and the sample prior sub-feature tokens as keys and values, standard cross attention (C2P Attention) processing is performed to obtain the intermediate fusion feature, and the intermediate fusion feature and the sample prior feature $P_l^{t-1}$ are then processed by the GRU to obtain the final output of the initial model, $F_{refine}$, i.e., $P_l^{t}$. The semantic map is thereby inferred from $F_{refine}$ by the decoder $f_D$, and the corresponding location of the neural map prior is replaced according to $P_l^{t}$, i.e., the map tiles are updated with the feature $P_l^{t}$, thereby achieving the update of the neural map prior.
In the application process, the target image is first processed by the encoder $f_E$ to obtain the current BEV feature $O$ (i.e., the target image feature); meanwhile, the neural map prior updated at time $t-1$ is pulled from the map tile storage (the selected map tiles). Then, based on the position information of the target image, i.e., the current vehicle position $Pos_{ego}$, map tiles are sampled from this prior to form the corresponding target prior BEV feature $P_l^{t-1}$. The target image feature $O$ and the target prior feature $P_l^{t-1}$ are then input into the map feature generation model (i.e., the fusion function in the figure) for processing. Specifically, position encodings are added to the two features respectively: $PE_c$ is added to the target image feature $O$, and $PE_p$ is added to the target prior feature $P_l^{t-1}$. The feature $O$ with $PE_c$ added is divided into a plurality of small blocks to obtain a plurality of target image sub-features, and the target prior feature $P_l^{t-1}$ with $PE_p$ added is divided into a plurality of small blocks to obtain a plurality of target prior sub-features; each sub-feature is converted into a token. Taking the target image sub-feature tokens as queries and the target prior sub-feature tokens as keys and values, standard cross attention (C2P Attention) processing is performed to obtain the target intermediate fusion feature, and the target intermediate fusion feature and the target prior feature $P_l^{t-1}$ are then processed by the GRU to obtain the final output of the map feature generation model, $F_{refine}$, i.e., $P_l^{t}$. The semantic map is thereby inferred from $F_{refine}$ by the decoder $f_D$, and the corresponding location of the neural map prior is replaced according to $P_l^{t}$, i.e., the map tiles are updated with the feature $P_l^{t}$, thereby achieving the update of the neural map prior.
Specifically, in fig. 3, the top row shows online high-definition semantic map learning, with images as input and map segmentation results as output, realized by the BEV encoder and decoder. To better utilize the neural map prior, a customized fusion module, i.e., the map feature generation model or initial model of this embodiment (C2P Attention and GRU), is added between the encoder and the decoder, and the fusion output is decoded by the decoder into the final map output. In the bottom row, the neural prior map overlapping the current BEV features is extracted from the storage disk, and the prior features at the same location as the current frame are cropped. After updating, the updated neural prior features are written back into the extracted prior map.
TABLE 6
In relation to the above embodiments, one embodiment shows that the proposed neural map prior can help the model see farther. In particular, one of the traditional functions of a map is to provide road information beyond the line of sight (beyond the horizon), which is critical to downstream navigation and planning and helps make informed decisions. The neural map prior in the embodiments also supports this basic purpose by allowing the on-board map to be inferred farther away. As shown in Table 6 above, the proposed neural map prior method consistently improves the map segmentation results of the baseline method when the BEV range of the original baseline is 60 m x 30 m, 100 m x 100 m, and 160 m x 100 m.
In Table 6, mIoU is the mean intersection-over-union, Divider is the lane separation line, Crossing is the crosswalk, Boundary is the road boundary, ALL is the total, and BEV Range is the BEV range. Camera-based map segmentation and detection of the map portions farthest from the host vehicle are generally considered challenging, because those portions occupy only a few pixels in the image. Thus, taking the historical priors of the scene into account is critical to improving map segmentation and detection performance. As shown in the table, the method of the above embodiment enhances the long-distance sensing results, which is difficult for single-frame methods to achieve. Experimental results show that performance tends to decrease as distance increases, but the proposed method still significantly improves the results.
For the above embodiments, a study in one embodiment showed that inter-trip (multi-trip) fusion is better than intra-trip (single-trip) fusion. As shown in table 7 below, the importance of intra-trip information (intra-trip fusion) and inter-trip information (inter-trip fusion) was analyzed. In Table 7, mIoU is the mean Intersection over Union, Divider is the lane divider line, Crossing is the pedestrian crosswalk, Boundary is the road boundary, ALL is the overall result, Intra-trip fusion is fusion within a single trip, and Inter-trip fusion is fusion across multiple trips. Specifically, intra-trip information means that the available neural map prior is limited to the current trip only. In contrast, the inter-trip model uses map priors generated from arbitrary trips through the same location. The results show that prior information from multiple trips matters more for map construction, because the intra-trip model performs much worse than the inter-trip model.
TABLE 7
Moreover, the map generation method involving the neural map prior (NMP) provided by the above embodiments is particularly helpful for map estimation under severe weather conditions. Driving a car (whether autonomous, intelligent, or ordinary driving) inevitably faces challenges in severe conditions such as rain or nighttime, which can make it difficult for the vehicle to accurately recognize road information. A neural map prior built under better weather and lighting conditions, however, can provide more reliable information, enabling the vehicle to perceive road information more accurately and travel safely under severe weather conditions.
As shown in table 8 below, using the neural map prior in rainy and nighttime scenes yields a larger improvement than in normal weather, which suggests that the model of the above embodiments can effectively extract the necessary information from the NMP to cope with bad-weather scenarios. However, because the map prior information is limited and the number of such samples is small, the improvement in rainy nighttime scenes is smaller, as can be seen in the table.
TABLE 8
TABLE 9
In one embodiment, the attention mechanism module selected for this embodiment is a C2P Attention (Current-to-Prior) module, which uses cross attention, and the window size of the cross attention can be freely set. In the design of C2P Attention, the road structure should remain spatially coherent, as is also demonstrated in table 9 above:
As can be seen from table 9, the performance of C2P Attention on the divider and boundary classes improves as the window size increases. However, an excessively large window may introduce irrelevant information from adjacent lanes, and the number of parameters also increases significantly. The window size is therefore chosen as a balance between capturing relevant road structure and excluding irrelevant spatial information; on this basis, the present embodiment selects 3 m x 3 m as the optimal window size according to experimental performance.
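For intuition about this trade-off, a hedged sketch of window-restricted cross attention follows: both BEV feature maps are partitioned into non-overlapping windows so each current-feature window attends only to the spatially aligned prior window. The 0.5 m-per-cell BEV resolution (making a 3 m window 6 x 6 cells) and the tensor shapes are assumptions for the example.

```python
import torch

def window_partition(x, win):
    """Split a BEV feature map (b, c, h, w) into non-overlapping windows
    of win x win cells; returns (b * n_windows, win * win, c)."""
    b, c, h, w = x.shape
    x = x.reshape(b, c, h // win, win, w // win, win)
    x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, win * win, c)
    return x

# A 3 m x 3 m window at an assumed BEV resolution of 0.5 m/cell is 6 x 6 cells.
win = 6
cur = torch.randn(1, 256, 96, 48)    # current BEV feature (shape assumed)
prior = torch.randn(1, 256, 96, 48)  # prior BEV feature, spatially aligned

q = window_partition(cur, win)       # queries: current-feature window tokens
kv = window_partition(prior, win)    # keys/values: prior tokens from the SAME window
attn = torch.nn.MultiheadAttention(256, 8, batch_first=True)
out, _ = attn(q, kv, kv)             # each window only sees its aligned prior window
```

A larger `win` lets each query see more of the prior road structure but mixes in features from neighboring lanes and grows the token count per window, which mirrors the trade-off observed in table 9.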
In another implementation, a runtime study of the map generation method proposed in this embodiment shows that it adds only 23 ms per frame over the original model of the related art (including extracting and storing features from the neural map prior and integrating the other modules), while significantly improving model performance.
In combination with the above embodiments, an embodiment proposes a new neural map prior system to assist online high-definition semantic map learning. The key idea is to combine local map inference and global map prior update in frame-by-frame increments through C2P Attention and the GRU. This design enables the system to output accurate and consistent global map priors and facilitates online semantic map learning. The neural map prior is compatible with the latest map segmentation/detection architectures, improves map prediction performance in severe weather, and extends map prediction farther from the current position. The global map prior reconstructed by the fusion module can be used directly by downstream tasks such as planning and control. On this basis, through end-to-end joint training with downstream tasks, the neural map prior can open new possibilities for learning-based autonomous driving perception and recognition systems.
It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the embodiments of the invention.
Based on the same inventive concept, an embodiment of the present invention provides a map generating apparatus 400. Referring to fig. 4, fig. 4 is a block diagram illustrating a map generating apparatus according to an embodiment of the present invention. As shown in fig. 4, the apparatus 400 includes:
the image feature determining module 401 is configured to obtain a target image, and process the target image through an encoder to obtain a target image feature;
the prior feature determining module 402 is configured to determine corresponding target prior features from the neural map prior according to the location information corresponding to the target image;
the fusion feature determining module 403 is configured to input the target image feature and the target prior feature to a trained map feature generating model for feature fusion, so as to obtain a target fusion feature;
And the map determining module 404 is configured to input the target fusion feature to a decoder, and obtain a semantic map corresponding to the target image.
Optionally, the size of the target fusion feature is the same as the size of the target prior feature, and the neural map prior is updated each time the target fusion feature is obtained; the apparatus 400 further comprises:
and the updating module is used for replacing the target prior features in the neural map prior with the target fusion features.
Optionally, the map feature generating model is trained by a model generating module, and the model generating module includes:
the sample image feature determining module is used for acquiring a sample image, and processing the sample image through the encoder to obtain sample image features;
the sample prior feature determining module is used for determining corresponding sample prior features from the nerve map prior according to the position information corresponding to the sample image;
the sample fusion feature determining module is used for inputting the sample image features and the sample prior features into an initial model for feature fusion to obtain sample fusion features; the sample fusion features are used for determining a semantic map corresponding to the sample image and updating the nerve map prior;
And the model training module is used for training the initial model based on the sample fusion characteristics and determining the trained initial model as the map characteristic generation model.
Optionally, the initial model includes: an attention mechanism module and a gated recurrent unit; the sample fusion feature determination module comprises:
the first fusion module is used for inputting the sample image features and the sample prior features to the attention mechanism module to obtain intermediate fusion features;
and the second fusion module is used for carrying out feature fusion on the intermediate fusion feature and the sample prior feature through the gated recurrent unit to obtain the sample fusion feature.
Optionally, the model generating module further includes:
the position coding module is used for respectively adding corresponding position codes to the sample image features and the sample prior features before the sample image features and the sample prior features are input to the attention mechanism module to obtain intermediate fusion features so as to obtain intermediate sample image features and intermediate sample prior features;
the first fusion module comprises:
and the first fusion sub-module is used for inputting the intermediate sample image characteristic and the intermediate sample prior characteristic to the attention mechanism module to obtain the intermediate fusion characteristic.
Optionally, the first fusion module includes:
the characteristic dividing module is used for dividing the sample image characteristic and the sample prior characteristic into a plurality of blocks respectively to obtain a plurality of sample image sub-characteristics and a plurality of sample prior sub-characteristics;
the conversion module is used for taking each sample image sub-feature as a sample image sub-feature mark and taking each sample prior sub-feature as a sample prior sub-feature mark after the plurality of sample image sub-features and the plurality of sample prior sub-features enter the first linear layer;
the operation module is used for taking the sample image sub-feature marks as queries, taking the sample priori sub-feature marks as keys and values, and carrying out operation according to the queries, the keys and the values to obtain operation results;
and the second fusion sub-module is used for inputting the operation result to a second linear layer to obtain the intermediate fusion characteristic output by the second linear layer.
Optionally, the model generating module further includes:
the data set dividing module is used for dividing the data set of the target area to obtain a training set and a testing set; the training set comprises training sample images, the test set comprises test sample images, and the acquisition positions of the training set are disjoint with the acquisition positions of the test set in geographic positions;
And the testing module is used for testing the map feature generation model obtained through training of the training set according to the testing set to obtain a testing result.
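As an illustration of such a geographically disjoint split, the following sketch groups samples by a coarse grid cell of their acquisition position and assigns whole cells to one split only, so train and test never share a place. The cell size, the (sample_id, x, y) record format, and the deterministic cell ordering are assumptions for the example.

```python
import math
from collections import defaultdict

def geo_disjoint_split(samples, cell_m=500.0, test_ratio=0.25):
    """samples: iterable of (sample_id, x, y) acquisition records.
    Returns (train_ids, test_ids) such that no grid cell contributes
    to both splits, making the two sets geographically disjoint."""
    cells = defaultdict(list)
    for sid, x, y in samples:
        cells[(math.floor(x / cell_m), math.floor(y / cell_m))].append(sid)
    keys = sorted(cells)
    n_test = max(1, int(len(keys) * test_ratio))
    test_keys = set(keys[:n_test])  # deterministic here; shuffle cells in practice
    train, test = [], []
    for k, ids in cells.items():
        (test if k in test_keys else train).extend(ids)
    return train, test
```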
Based on the same inventive concept, another embodiment of the present invention provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the map generating method according to any of the above embodiments of the present invention.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device 500, as shown in fig. 5. Fig. 5 is a schematic diagram of an electronic device according to an embodiment of the present invention. The electronic device comprises a memory 502, a processor 501 and a computer program stored on the memory and executable on the processor; when the computer program is executed by the processor, the steps of the map generation method according to any of the above embodiments of the present invention are implemented.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described by differences from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other.
It will be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the invention may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or terminal device that comprises the element.
The map generation method, the map generation device, the electronic equipment and the storage medium provided by the invention have been described above in detail, and specific examples have been used herein to explain their principles and implementations; the description of the above embodiments is only intended to help understand the method of the invention and its core idea. Meanwhile, those skilled in the art may make changes to the specific embodiments and the application scope in accordance with the ideas of the invention. In view of the above, the contents of this specification should not be construed as limiting the invention.

Claims (10)

1. A map generation method, the method comprising:
acquiring a target image, and processing the target image through an encoder to obtain target image characteristics;
determining corresponding target prior features from the neural map prior according to the position information corresponding to the target image;
inputting the target image features and the target prior features into a trained map feature generation model for feature fusion to obtain target fusion features;
and inputting the target fusion characteristics to a decoder to obtain a semantic map corresponding to the target image.
2. The map generation method according to claim 1, wherein the size of the target fusion feature is the same as the size of the target prior feature, the neural map prior being updated after each acquisition of the target fusion feature; the method further comprises the steps of:
and replacing the target prior feature in the neural map prior with the target fusion feature.
3. The map generation method according to claim 1 or 2, characterized in that the training step of the map feature generation model includes:
acquiring a sample image, and processing the sample image through the encoder to obtain sample image characteristics;
determining corresponding sample prior features from the neural map prior according to the position information corresponding to the sample image;
inputting the sample image features and the sample prior features into an initial model for feature fusion to obtain sample fusion features; the sample fusion features are used for determining a semantic map corresponding to the sample image and updating the neural map prior;
training the initial model based on the sample fusion characteristics, and determining the trained initial model as the map characteristic generation model.
4. A map generation method according to claim 3, wherein the initial model comprises: an attention mechanism module and a gated recurrent unit; inputting the sample image features and the sample prior features into an initial model for feature fusion to obtain sample fusion features, wherein the method comprises the following steps:
inputting the sample image features and the sample prior features to the attention mechanism module to obtain intermediate fusion features;
and carrying out feature fusion on the intermediate fusion feature and the sample prior feature through the gated recurrent unit to obtain the sample fusion feature.
5. The map generation method of claim 4, wherein before said inputting the sample image features and the sample prior features to the attention mechanism module results in intermediate fusion features, the training step further comprises:
respectively adding corresponding position codes to the sample image features and the sample prior features to obtain intermediate sample image features and intermediate sample prior features;
inputting the sample image features and the sample prior features to the attention mechanism module to obtain intermediate fusion features, including:
and inputting the intermediate sample image features and the intermediate sample prior features to the attention mechanism module to obtain the intermediate fusion features.
6. The map generation method according to claim 4, wherein the inputting the sample image feature and the sample prior feature to the attention mechanism module, to obtain an intermediate fusion feature, includes:
dividing the sample image features and the sample prior features into a plurality of blocks respectively to obtain a plurality of sample image sub-features and a plurality of sample prior sub-features;
after the plurality of sample image sub-features and the plurality of sample prior sub-features enter the first linear layer, taking each sample image sub-feature as a sample image sub-feature mark, and taking each sample prior sub-feature as a sample prior sub-feature mark;
taking the sample image sub-feature marks as queries, taking the sample prior sub-feature marks as keys and values, and carrying out operation according to the queries, the keys and the values to obtain operation results;
and inputting the operation result to a second linear layer to obtain the intermediate fusion characteristic output by the second linear layer.
7. A map generation method according to claim 3, wherein the training step further comprises:
dividing a data set of a target area to obtain a training set and a test set; the training set comprises training sample images, the test set comprises test sample images, and the acquisition positions of the training set are geographically disjoint from the acquisition positions of the test set;
and testing the map feature generation model trained by the training set according to the testing set to obtain a testing result.
8. A map generation apparatus, the apparatus comprising:
the image feature determining module is used for acquiring a target image, and processing the target image through the encoder to obtain target image features;
the prior feature determining module is used for determining corresponding prior features of the target from the prior of the neural map according to the position information corresponding to the target image;
The fusion feature determining module is used for inputting the target image features and the target priori features into a trained map feature generation model to perform feature fusion to obtain target fusion features;
and the map determining module is used for inputting the target fusion characteristics to a decoder to obtain a semantic map corresponding to the target image.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the computer program when executed by the processor implements the map generation method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the map generation method according to any one of claims 1 to 7.
CN202310301074.8A 2023-03-24 2023-03-24 Map generation method, map generation device, electronic equipment and storage medium Pending CN116450761A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310301074.8A CN116450761A (en) 2023-03-24 2023-03-24 Map generation method, map generation device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310301074.8A CN116450761A (en) 2023-03-24 2023-03-24 Map generation method, map generation device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116450761A true CN116450761A (en) 2023-07-18

Family

ID=87119420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310301074.8A Pending CN116450761A (en) 2023-03-24 2023-03-24 Map generation method, map generation device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116450761A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118096800A (en) * 2024-04-29 2024-05-28 合肥市正茂科技有限公司 Training method, device, equipment and medium for small sample semantic segmentation model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination