CN115965970A - Method and system for realizing bird's-eye view semantic segmentation based on implicit set prediction

Info

Publication number
CN115965970A
Authority
CN
China
Prior art keywords
polygon
prediction
semantic
map
implicit
Prior art date
Legal status
Pending
Application number
CN202310051370.7A
Other languages
Chinese (zh)
Inventor
江昆
杨殿阁
施以宁
周韬华
杨蒙蒙
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202310051370.7A
Publication of CN115965970A

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 - Road transport of goods or passengers
    • Y02T 10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 - Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to a method and a system for realizing bird's-eye view semantic segmentation based on implicit set prediction. The method comprises the following steps: preprocessing the annotation data in a high-precision map and converting the semantic layers of the map into annotated polygon layers; extracting semantic features from the annotated polygon layers, analyzing them with a prediction head, stitching and fusing the multi-view camera images, and predicting the final polygon segmentation result by implicit set prediction; and converting between the image-plane and top-view perspectives to complete the transformer-based implicit prediction. The invention enables a single network to perceive dynamic objects and static roads simultaneously while reducing the computational load of the perception network, and can be applied in the field of automatic driving.

Description

Method and system for realizing bird's-eye view semantic segmentation based on implicit set prediction
Technical Field
The invention relates to the field of automatic driving, and in particular to a method and a system for realizing bird's-eye view semantic segmentation using computer vision and implicit set prediction with the surround-view cameras of an autonomous vehicle.
Background
Autonomous vehicles require an accurate understanding of the objects and static elements in a traffic scene, and environmental perception is typically achieved with a variety of sensors, including lidar, cameras, and millimeter-wave radar. Environment perception with surround-view multi-camera rigs covers tasks such as pose estimation, target detection, and bird's-eye view semantic segmentation. Because cameras lack depth information, target detection in a three-dimensional environment is difficult and its accuracy is low, falling short of the perception requirements of high-level automatic driving. Since vehicles generally travel on the ground, a bird's-eye view scene representation can feed directly into the vehicle's cognition and decision systems, and the transformation of an image from the camera plane to a ground reference plane is comparatively well defined. Bird's-eye view semantic segmentation is therefore an important task for a surround-view camera system and an important research basis for the latest work on local map construction.
The key stages of bird's-eye view semantic segmentation are image feature extraction, view transformation from the image perspective to the bird's-eye perspective, stitching and fusion of multi-camera images, feature extraction on the bird's-eye view, and the prediction head. In the prior art, the backbone network of an advanced segmentation or detection model can be used for image feature extraction. View transformation is closely related to depth estimation and has several implementations. The basic method uses only the camera's intrinsic model via inverse perspective mapping, but its generality is limited by a strong planar assumption. A popular alternative is the pseudo-lidar method: after estimating a depth for each image pixel, the pixels are projected to a point cloud, and the three-dimensional point cloud is projected directly onto the bird's-eye view. Its difficulty is that per-pixel depth is hard to obtain and train; a corresponding improvement replaces the deterministic depth prediction with a distribution over depths, but the method's performance remains vulnerable to erroneous depth estimates. A third method uses a convolutional network to predict the correspondence between each pixel and the corresponding grid cell of the bird's-eye view; it still suffers from the ambiguity and scene specificity of the per-cell correspondence, and its generality across scenes has not been systematically studied.
Multi-camera target detection tasks use similar view-transformation units with different techniques, including predicting three-dimensional bounding boxes directly from images, converting images to pseudo point clouds and outputting boxes from a point-cloud prediction head, and multi-task learning of image depth estimation and box prediction. An implicit, transformer-based prediction method has been proposed for surround-view cameras: it requires neither exact pixel-by-pixel matching nor highly accurate intrinsic and extrinsic correspondences, generates bounding-box queries end to end through a transformer structure, and recasts the localization problem in detection as a set-prediction problem. The design is simple and effective, relying on output consistency and the design of the set.
Disclosure of Invention
In view of the above problems, an object of the present invention is to provide a method and a system for realizing bird's-eye view semantic segmentation based on implicit set prediction, which enable a single network to perceive dynamic objects and static roads simultaneously and reduce the computational load of the perception network.
In order to achieve the above object, in a first aspect, the present invention adopts the following technical solution: a method for realizing bird's-eye view semantic segmentation based on implicit set prediction, comprising the following steps: preprocessing the annotation data in a high-precision map and converting the semantic layers of the map into annotated polygon layers; extracting semantic features from the annotated polygon layers, analyzing them with a prediction head, stitching and fusing the multi-view camera images, and predicting the final polygon segmentation result by implicit set prediction; and converting between the image-plane and top-view perspectives to complete the transformer-based implicit prediction.
Further, preprocessing the annotation data in the high-precision map includes dynamic target preprocessing and static road element preprocessing;
the dynamic target preprocessing comprises: squeezing each target along the height axis to form a ground occupancy mask, forming a polygon on the ground plane, and giving the mask the corresponding semantic label;
the static road element preprocessing comprises: the driving area need not be fully covered by the high-precision map; only map data for a partial area is required as training labels, and, relying on training against the semantic map, an effective semantic segmentation capability generalizes to any structured road.
Further, the static road element preprocessing further comprises: polyline map information is allowed a predetermined error, and each polyline is given a predetermined width so that it expands into a long, narrow polygon for training.
Further, performing prediction head analysis on the semantic features comprises:
regressing, by the prediction head, the image features produced by the preceding network into the physical quantities that are ultimately required, comparing them with the ground-truth labels to obtain a loss function, and back-propagating the loss;
the back-propagated loss function comprises three parts: a bipartite matching loss, a classification loss, and an intersection-over-union loss.
Further, the prediction head follows the single-stage, anchor-free FCOS3D design: each pixel is classified, a two-dimensional polygon is obtained through a center-ness operation, and the two-dimensional polygon directly guides the generation of the three-dimensional polygon region.
Further, converting between the image-plane and top-view perspectives comprises:
performing the implicit mapping of the view transformation with a transformer, whose input is a dimension-reduced feature map and whose output is the center-point information of the polygons in the bird's-eye view;
compressing the image feature map from (batch size, height, width, channels) to (batch size, height × width, channels), feeding it into the transformer's encoder unit with a trigonometric-function-based positional encoding added, and taking the output set of polygon center points as the polygon queries;
back-projecting the two-dimensional reference points associated with the top-view polygons onto the corresponding positions of the image plane according to the camera intrinsics and extrinsics, and locating those positions and the neighboring feature maps in the original image;
sampling the feature-map features by bilinear interpolation and introducing them into the polygon queries, describing the interaction between each polygon center point and the corresponding feature-map position with a multi-head attention mechanism, and then refining the queries with these feature-mapping components layer by layer through the transformer layers;
for the queries refined at the transformer output, introducing two multilayer-perceptron branches that output, respectively, the category information and the vertex information of each polygon, fitting complex semantic area elements with fixed multi-vertex polygons; the polygons of the prediction set are compared with the polygons of the ground-truth set to form the loss function and back-propagate the gradient.
Further, the stitching of the multi-view cameras comprises: for the queries over multiple cameras, the bird's-eye view reference points only need to be inversely mapped onto each camera's feature map, the feature maps of the several cameras are used simultaneously for the query refinement process, and the result is fed to the encoding side of the transformer; and if, after back-projection through the intrinsics and extrinsics, a bird's-eye view reference point does not fall within the image plane of some camera, that camera's feature map is ignored when adjusting the query.
In a second aspect, the invention adopts the following technical solution: a system for realizing bird's-eye view semantic segmentation based on implicit set prediction, comprising: a preprocessing module for preprocessing the annotation data in a high-precision map and converting the semantic layers of the map into annotated polygon layers; a prediction head module for extracting semantic features from the annotated polygon layers, performing prediction head analysis on the semantic features, realizing the stitching and fusion of the multi-view cameras, and predicting the final polygon segmentation result by implicit set prediction; and a transformation module for converting between the image-plane and top-view perspectives to complete the transformer-based implicit prediction.
In a third aspect, the invention adopts the following technical solution: a computer-readable storage medium storing one or more programs, the one or more programs comprising instructions which, when executed by a computing device, cause the computing device to perform any of the above methods.
In a fourth aspect, the invention adopts the following technical solution: a computing device, comprising: one or more processors, a memory, and one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods described above.
Due to the adoption of the above technical solutions, the invention has the following advantages:
1. The method models the transformation from the image plane to the bird's-eye view and the regions of interest through an implicit multi-head attention mechanism, without requiring depth estimation or a pixel-to-pixel transformation correspondence.
2. The invention trains directly on the original polygon annotations of the semantic map; the network needs only a single regression, greatly reducing the computational load of data preprocessing and post-processing.
3. The invention can share the backbone and neck networks with a target detection network, differing only in the prediction heads, so that a single network can perceive dynamic objects and static roads simultaneously, reducing the computational load of the perception network.
Drawings
FIG. 1 is a flow chart illustrating implicit set prediction in an embodiment of the present invention;
FIG. 2 is a flow chart of a network design and iteration method according to an embodiment of the invention;
FIG. 3 is a diagram illustrating perspective transformation and map element generation in an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the drawings of the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the invention, are within the scope of the invention.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise; and it should be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
In order to solve the problem of bird's-eye view semantic segmentation in the prior art, the invention provides a method and system based on multi-camera implicit set prediction. Vehicle-mounted multi-camera images serve as the information source. A novel view-transformation module transfers deep image features directly into polygonal semantic regions under the bird's-eye view; these regions are regressed against the polygon annotations of the original semantic map, forming a polygon set-prediction problem, and a two-stage loss function of matching and intersection-over-union is designed. Distinct prediction heads sharing a backbone network compatible with target detection are designed to perceive dynamic objects and static roads simultaneously, forming dynamic-and-static perception with a single multi-camera network.
In one embodiment of the invention, a method for realizing bird's-eye view semantic segmentation based on implicit set prediction is provided. In this embodiment, as shown in FIG. 1, the method includes the following steps:
1) Preprocessing the annotation data in a high-precision map and converting the semantic layers of the map into annotated polygon layers;
2) Extracting semantic features from the annotated polygon layers, performing prediction head analysis on the semantic features, realizing the stitching and fusion of the multi-view cameras, and predicting the final polygon segmentation result by implicit set prediction;
3) Converting between the image-plane and top-view perspectives to complete the transformer-based implicit prediction.
In step 1), preprocessing the annotation data in the high-precision map includes dynamic target preprocessing and static road element preprocessing, wherein:
the dynamic target preprocessing comprises: squeezing each target along the height axis to form a ground occupancy mask, forming a polygon on the ground plane, and giving the mask the corresponding semantic label.
Specifically, dynamic targets are usually the annotated objects of datasets oriented toward target detection, typically covering some ten subdivided classes such as pedestrians and vehicles. The original label is generally a three-dimensional bounding box whose attributes include length, width, height, the three-dimensional position (x, y, z), the yaw angle, and semantic attributes. In the bird's-eye view data processing of this embodiment, each target is squeezed along the height axis to form a ground occupancy mask, a polygon (generally a quadrilateral) is formed on the ground plane, and the mask is given the corresponding semantic label, as sketched below.
The static road element preprocessing comprises: the driving area need not be fully covered by the high-precision map; only map data for a partial area is required as training labels, and, relying on training against the semantic map, an effective semantic segmentation capability generalizes to any structured road.
In the above embodiment, the static road element preprocessing further includes: polyline map information is allowed a predetermined error, and each polyline is given a predetermined width so that it expands into a long, narrow polygon for training. Preferably, the predetermined width may be 0.3 m.
Specifically, static road elements are generally provided by the high-precision map of the area in question. Note that the method does not require full high-precision map coverage while the vehicle drives: only map data for a partial area is needed as training labels, and at test time, relying on the semantic-map training, an effective semantic segmentation capability generalizes to any structured road. The road elements comprise areal information represented by polygons, including drivable areas, sidewalks, pedestrian crossings, lanes, parking areas, and road blocks; and linear information represented by polylines, including road dividing lines, lane dividing lines, and stop lines. Since linear map information has some width in the real world, the invention allows a small error and gives each polyline a small width (0.3 m), expanding it into a long, narrow polygon that is added to training, as in the sketch below.
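A minimal sketch of the polyline expansion, assuming the shapely library (not named in the patent); a 0.3 m total width corresponds to a 0.15 m buffer on each side of the line.

```python
from shapely.geometry import LineString

# A hypothetical lane-divider polyline in ground-plane coordinates (meters).
lane_divider = LineString([(0.0, 0.0), (5.0, 0.2), (10.0, 0.0)])

# Buffer by half the target width with flat end caps: a long, narrow polygon.
narrow_polygon = lane_divider.buffer(0.15, cap_style=2)
print(list(narrow_polygon.exterior.coords))
```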
In step 2), the semantic feature extraction is as follows. Representation learning of image semantic features is an area of ongoing research and is generally divided into a backbone network and a neck network. The backbone can be designed around convolutional networks such as VGG, ResNet, or VoVNet, or around transformer-based extractors such as the Swin Transformer; the final feature map has dimensions (width, height, embedding dimension), with the embedding dimension generally set to 256 to make maximal use of GPU performance. The neck is generally designed as a multi-level feature pyramid network (FPN) or an improved variant, with the feature channels typically distributed over four levels (256, 512, 1024, 2048); introducing the feature pyramid lets the network attend to large and small targets at the same time, balancing its focus across scales (see the sketch below).
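A minimal sketch of such a backbone-plus-neck arrangement, assuming torchvision's ResNet-50 + FPN helper; the input resolution is illustrative.

```python
import torch
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# ResNet-50 backbone with an FPN neck; each pyramid level has 256 channels.
backbone = resnet_fpn_backbone(backbone_name="resnet50", weights=None)

images = torch.randn(2, 3, 448, 800)   # (batch, channels, height, width)
features = backbone(images)            # dict of pyramid levels
for name, fmap in features.items():
    print(name, fmap.shape)            # e.g. '0': (2, 256, 112, 200), ...
```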
In step 2), performing prediction head analysis on the semantic features is specifically: the prediction head regresses the image features produced by the preceding network into the physical quantities that are ultimately required, compares them with the ground-truth labels to obtain a loss function, and back-propagates the loss.
In this embodiment, the prediction head adopts the single-stage, anchor-free FCOS3D design: each pixel is classified, a two-dimensional polygon is obtained through a center-ness operation, and the two-dimensional polygon directly guides the region generation of the three-dimensional polygon.
The back-propagated loss function comprises three parts: a bipartite matching loss, a classification loss, and an intersection-over-union loss.
Specifically, the design of the loss function is an important factor in making the neural network regression converge. The loss function in the present invention is divided into three parts:
The first part is the bipartite matching loss. Because both the prediction results and the ground-truth labels are semantic polygons in the top view, and following the set-prediction theory of the transformer, the bipartite matching between predictions and ground truth is computed with the Hungarian algorithm; the matching minimizes the following cost:

$$\hat{\sigma} = \underset{\sigma \in P}{\arg\min} \sum_{j=1}^{M} \Big[ -\mathbb{1}_{\{c_j \neq \varnothing\}}\, \hat{p}_{\sigma(j)}(c_j) + \mathbb{1}_{\{c_j \neq \varnothing\}}\, L_{\mathrm{polygon\_iou}}\big(b_j, \hat{b}_{\sigma(j)}\big) \Big]$$

where σ denotes a permutation matching ground truth to predictions, P denotes the set of all possible permutations, M denotes the number of elements in the prediction set, c_j denotes the class of the j-th element, ∅ indicates that a prediction is the null element, p̂_σ(j)(c_j) denotes the probability that the σ(j)-th prediction belongs to class c_j, L_polygon_iou denotes the intersection-over-union loss between the predicted and ground-truth polygons, b_j denotes a ground-truth polygon, and b̂_σ(j) denotes the matched predicted polygon.
Each ground-truth element is forcibly matched to a prediction, and the bipartite loss is computed according to the weights of the Hungarian matching; the set-to-set loss function is:

$$\mathrm{Loss} = \sum_{j=1}^{N} \Big[ -\log \hat{p}_{\hat{\sigma}(j)}(c_j) + \mathbb{1}_{\{c_j \neq \varnothing\}}\, L_{\mathrm{polygon\_iou}}\big(b_j, \hat{b}_{\hat{\sigma}(j)}\big) \Big]$$

where Loss denotes the loss function of the geometric prediction and N denotes the number of matched ground-truth/prediction pairs.
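The matching step can be illustrated with a small numeric example. The sketch below (library choice and cost values are illustrative, not from the patent) builds a cost matrix mixing classification probability and polygon IoU loss and solves it with scipy's Hungarian solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical costs for 3 predictions vs 2 ground-truth polygons:
# cost[i, j] = -p_i(c_j) + iou_loss(b_j, b_hat_i)
cost = np.array([[0.9, 0.2],
                 [0.1, 0.8],
                 [0.5, 0.6]])

pred_idx, gt_idx = linear_sum_assignment(cost)  # minimal-cost one-to-one match
print(list(zip(pred_idx, gt_idx)))              # [(0, 1), (1, 0)]: total cost 0.3
```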
The second part is the classification loss. Generally, the content to be segmented is divided into several semantic categories according to the requirements of local map construction, such as the semantic labels for lanes, sidewalks, and pedestrian crossings, and the loss is then computed with a biased focal cross-entropy loss function (Focal Loss), as sketched below.
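A minimal sketch of the classification term, using torchvision's sigmoid focal loss as an assumed stand-in for the biased focal cross-entropy loss named above; the class count and labels are illustrative.

```python
import torch
from torchvision.ops import sigmoid_focal_loss

logits = torch.randn(4, 5)     # 4 matched polygons, 5 semantic classes
targets = torch.zeros(4, 5)
targets[torch.arange(4), torch.tensor([0, 2, 2, 4])] = 1.0   # one-hot labels

loss = sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0, reduction="mean")
```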
The third part is the intersection-over-union loss, which characterizes the similarity between a matched predicted polygon and its ground-truth polygon. The two-dimensional intersection-over-union is computed as:
IoU = (A ∩ B) / (A ∪ B).
In the formula, IoU denotes the intersection-over-union, A denotes the area covered by shape A, B denotes the area covered by shape B, A ∩ B denotes the area lying within both A and B, and A ∪ B denotes the total area covered by A or B.
The intersection-over-union ranges over [0, 1]: the higher the overlap, the more reasonable the prediction. The IoU loss regresses the four boundaries of a candidate region as a whole, achieving accurate and efficient localization, and it has good scale invariance. To handle the cases where the IoU measure is not differentiable, a negative natural-logarithm form is introduced to compute the IoU loss indirectly. A sketch of the two-dimensional polygon IoU follows.
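A minimal sketch of the two-dimensional polygon intersection-over-union, assuming the shapely library (not named in the patent); the two unit squares overlap in a 1 × 1 region.

```python
from shapely.geometry import Polygon

pred = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])    # predicted polygon
truth = Polygon([(1, 1), (3, 1), (3, 3), (1, 3)])   # ground-truth polygon

iou = pred.intersection(truth).area / pred.union(truth).area
print(iou)   # 1 / 7 ≈ 0.143
```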
In step 3), the conversion between the image-plane and top-view perspectives, as shown in FIG. 2 and FIG. 3, includes the following steps.
In this embodiment, a transformer performs the implicit mapping of the view transformation; its input is a dimension-reduced feature map and its output is the center-point information of the polygons in the bird's-eye view.
Because a camera lacks explicit depth information, the two-dimensional-to-three-dimensional view transformation is genuinely difficult; in particular, estimating the height direction depends heavily on the implicit regression of a neural network, with no direct perceptual evidence, so considering only the top view resolves the uncertainty in the height direction. The transformation between the image plane and the top view could be realized by inverse projection, but the accuracy of that method depends heavily on the intrinsic parameters, and the distortion of a real camera is difficult to express completely with a simple intrinsic model. An implicit estimation method is therefore needed, in which only roughly accurate intrinsics are used for a coarse transformation, and the implicit mapping of the view transformation is realized with a transformer.
The iterative process from the two-dimensional feature maps of a single camera to the bird's-eye view polygon predictions, with the loss back-propagated, cycles through the following four steps 3.1) to 3.4):
3.1) The image feature map is compressed from (batch size, height, width, channels) to (batch size, height × width, channels) and fed into the transformer's encoder unit, with the transformer's standard trigonometric-function-based positional encoding added; the decoder outputs a set of polygon center points that serve as the polygon queries.
Each query generates two-dimensional reference points in the top view through a simple convolutional network; the number of reference points is proportional to the number of transformer layers and the number of queries. The flattening and positional encoding are sketched below.
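A minimal sketch of step 3.1's reshaping, with a standard sinusoidal positional encoding; the shapes are illustrative, assuming the 256-dimensional embedding mentioned earlier.

```python
import torch

def sinusoidal_encoding(num_pos, dim):
    """Standard trigonometric positional encoding of shape (num_pos, dim)."""
    pos = torch.arange(num_pos, dtype=torch.float32).unsqueeze(1)
    idx = torch.arange(0, dim, 2, dtype=torch.float32)
    angles = pos / torch.pow(10000.0, idx / dim)
    enc = torch.zeros(num_pos, dim)
    enc[:, 0::2] = torch.sin(angles)
    enc[:, 1::2] = torch.cos(angles)
    return enc

feat = torch.randn(2, 28, 50, 256)                   # (batch, H, W, C)
tokens = feat.flatten(1, 2)                          # (batch, H*W, C)
tokens = tokens + sinusoidal_encoding(28 * 50, 256)  # broadcast over batch
```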
3.2) The two-dimensional reference points associated with the top-view polygons are back-projected onto the corresponding positions of the image plane according to the camera intrinsics and extrinsics, and those positions and the neighboring feature maps are located in the original image, as in the sketch below.
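A minimal sketch of step 3.2, with hypothetical intrinsic and extrinsic matrices: a reference point on the ground plane (z = 0) is projected into a camera image.

```python
import numpy as np

def bev_point_to_pixel(xy, T_world_to_cam, K):
    """Project a BEV ground-plane point to pixel coordinates, or None."""
    p_world = np.array([xy[0], xy[1], 0.0, 1.0])   # homogeneous, z = 0
    p_cam = T_world_to_cam @ p_world               # 4x4 extrinsic transform
    if p_cam[2] <= 0:
        return None                                # point is behind this camera
    uvw = K @ p_cam[:3]                            # 3x3 intrinsic matrix
    return uvw[:2] / uvw[2]                        # pixel coordinates (u, v)
```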
3.3) The feature-map features are sampled by bilinear interpolation and introduced into the polygon queries; the interaction between each polygon center point and the corresponding feature-map position is described by a multi-head attention mechanism, and the queries are then refined with these feature-mapping components layer by layer through the transformer layers.
The multi-head attention module is implemented as a standard multi-head attention module. The bilinear sampling step is sketched below.
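A minimal sketch of the bilinear sampling in step 3.3 using torch.nn.functional.grid_sample; the image and feature-map sizes are illustrative, and coordinates must be normalized to [-1, 1].

```python
import torch
import torch.nn.functional as F

feat = torch.randn(1, 256, 28, 50)            # (batch, C, H, W) feature map
uv = torch.tensor([[120.0, 40.0]])            # hypothetical pixel locations

grid = uv.clone()
grid[:, 0] = uv[:, 0] / (800 - 1) * 2 - 1     # u -> [-1, 1] over image width
grid[:, 1] = uv[:, 1] / (448 - 1) * 2 - 1     # v -> [-1, 1] over image height
grid = grid.view(1, 1, -1, 2)                 # (batch, 1, num_points, 2)

sampled = F.grid_sample(feat, grid, mode="bilinear", align_corners=True)
print(sampled.shape)                          # (1, 256, 1, 1): one feature per point
```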
3.4) For the queries refined at the transformer output, two multilayer-perceptron branches are introduced, outputting respectively the category information and the vertex information of each polygon; complex semantic area elements are fitted with fixed multi-vertex polygons. The polygons of the prediction set are compared with the polygons of the ground-truth set to form the loss function and back-propagate the gradient. A sketch of the two output branches follows.
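A minimal sketch of step 3.4's two output branches; the vertex count and class count are assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn

NUM_CLASSES, NUM_VERTICES, DIM = 5, 8, 256

class_head = nn.Sequential(nn.Linear(DIM, DIM), nn.ReLU(),
                           nn.Linear(DIM, NUM_CLASSES))
vertex_head = nn.Sequential(nn.Linear(DIM, DIM), nn.ReLU(),
                            nn.Linear(DIM, NUM_VERTICES * 2))

queries = torch.randn(900, DIM)        # refined queries from the decoder
logits = class_head(queries)           # (900, NUM_CLASSES) category scores
vertices = vertex_head(queries).view(-1, NUM_VERTICES, 2)   # (900, vertices, xy)
```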
In the above embodiment, the stitching of the multi-view cameras is specifically as follows. The implicit, transformer-attention-based view-transformation method needs no explicit reconstruction of the three-dimensional scene; therefore, for queries over multiple cameras, the bird's-eye view reference points only need to be inversely mapped onto each camera's feature map, the feature maps of the several cameras are used simultaneously for the query refinement process, and the result is fed to the encoding side of the transformer. If, after back-projection through the intrinsics and extrinsics, a bird's-eye view reference point does not fall within the image plane of some camera, that camera's feature map is ignored when adjusting the query.
On a real vehicle, such as the vehicles used for the nuScenes dataset or a Baidu Apollo vehicle, surround vision is provided by six cameras operating simultaneously, and after back-projection, M × L × NUM_CAM center points are obtained. One compromise between network performance and efficiency is a query number M = 900, a number of transformer layers L = 6, and a number of cameras NUM_CAM = 6. The visibility rule is sketched below.
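A minimal sketch of this visibility rule, reusing the bev_point_to_pixel helper from the earlier projection sketch; the image dimensions are illustrative (nuScenes cameras are 1600 × 900).

```python
def visible_cameras(xy, cams, img_w=1600, img_h=900):
    """cams: list of (T_world_to_cam, K) pairs; returns camera indices that
    see the BEV point. Masked-out cameras skip the query refinement."""
    visible = []
    for i, (T, K) in enumerate(cams):
        uv = bev_point_to_pixel(xy, T, K)   # helper from the earlier sketch
        if uv is not None and 0 <= uv[0] < img_w and 0 <= uv[1] < img_h:
            visible.append(i)
    return visible
```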
In one embodiment of the present invention, a system for implementing bird's-eye view semantic segmentation based on implicit set prediction is provided, which includes:
the preprocessing module is used for preprocessing the labeled data in the high-precision map and converting the semantic map layer of the high-precision map into a labeled polygonal map layer;
the prediction head module is used for extracting semantic features in the labeled polygon map layer, performing prediction head analysis on the semantic features, realizing splicing and fusion of the multi-view camera and predicting a final polygon segmentation frame by an implicit set;
and the conversion module is used for converting the visual angles of the image plane and the top view to complete the implicit prediction based on the converter.
In the preprocessing module, preprocessing the annotation data in the high-precision map includes dynamic target preprocessing and static road element preprocessing;
the dynamic target preprocessing comprises: squeezing each target along the height axis to form a ground occupancy mask, forming a polygon on the ground plane, and giving the mask the corresponding semantic label;
the static road element preprocessing comprises: the driving area need not be fully covered by the high-precision map; only map data for a partial area is required as training labels, and, relying on training against the semantic map, an effective semantic segmentation capability generalizes to any structured road.
In this embodiment, the static road element preprocessing further includes: polyline map information is allowed a predetermined error, and each polyline is given a predetermined width so that it expands into a long, narrow polygon for training.
In the prediction head module, performing prediction head analysis on the semantic features includes:
regressing, by the prediction head, the image features produced by the preceding network into the physical quantities that are ultimately required, comparing them with the ground-truth labels to obtain a loss function, and back-propagating the loss;
the back-propagated loss function comprises three parts: a bipartite matching loss, a classification loss, and an intersection-over-union loss.
In the prediction head module, the prediction head adopts the single-stage, anchor-free FCOS3D design: each pixel is classified, a two-dimensional polygon is obtained through a center-ness operation, and the two-dimensional polygon directly guides the region generation of the three-dimensional polygon.
In the above transformation module, converting between the image-plane and top-view perspectives includes:
performing the implicit mapping of the view transformation with a transformer, whose input is a dimension-reduced feature map and whose output is the center-point information of the polygons in the bird's-eye view;
compressing the image feature map from (batch size, height, width, channels) to (batch size, height × width, channels), feeding it into the transformer's encoder unit with a trigonometric-function-based positional encoding added, and taking the output set of polygon center points as the polygon queries;
back-projecting the two-dimensional reference points associated with the top-view polygons onto the corresponding positions of the image plane according to the camera intrinsics and extrinsics, and locating those positions and the neighboring feature maps in the original image;
sampling the feature-map features by bilinear interpolation and introducing them into the polygon queries, describing the interaction between each polygon center point and the corresponding feature-map position with a multi-head attention mechanism, and then refining the queries with these feature-mapping components layer by layer through the transformer layers;
for the queries refined at the transformer output, introducing two multilayer-perceptron branches that output, respectively, the category information and the vertex information of each polygon, fitting complex semantic area elements with fixed multi-vertex polygons; the polygons of the prediction set are compared with the polygons of the ground-truth set to form the loss function and back-propagate the gradient.
In the above embodiment, the stitching of the multi-view cameras is specifically: for the queries over multiple cameras, the bird's-eye view reference points only need to be inversely mapped onto each camera's feature map, the feature maps of the several cameras are used simultaneously for the query refinement process, and the result is fed to the encoding side of the transformer; if, after back-projection through the intrinsics and extrinsics, a bird's-eye view reference point does not fall within the image plane of some camera, that camera's feature map is ignored when adjusting the query.
The system provided in this embodiment is used to execute the above method embodiments; for the process and details, reference is made to the above embodiments, which are not repeated here.
In the computing device structure provided in an embodiment of the present invention, the computing device may be a terminal and may include: a processor, a communications interface, a memory, a display screen, and an input device. The processor, the communication interface, and the memory communicate with one another through a communication bus. The processor provides computing and control capabilities. The memory comprises a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program which, when executed by the processor, implements a method for realizing bird's-eye view semantic segmentation based on implicit set prediction, and the internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The communication interface performs wired or wireless communication with external terminals; wireless communication can be realized through WIFI, a carrier network, NFC (near field communication), or other technologies. The display screen may be a liquid-crystal or electronic-ink display, and the input device may be a touch layer covering the display screen, keys, a trackball, or a touchpad arranged on the housing of the computing device, or an external keyboard, touchpad, or mouse. The processor may call the logic instructions in the memory.
In addition, the logic instructions in the memory may be implemented in the form of software functional units, and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention, or the part thereof which substantially contributes over the prior art, may be embodied in the form of a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In one embodiment of the invention, a computer program product is provided, the computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method provided by the above-described method embodiments.
In one embodiment of the invention, a non-transitory computer-readable storage medium is provided that stores server instructions that cause a computer to perform the methods provided by the embodiments described above.
The implementation principle and technical effect of the computer-readable storage medium provided by the above embodiments are similar to those of the above method embodiments, and are not described herein again.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for realizing bird's-eye view semantic segmentation based on implicit set prediction, characterized by comprising the following steps:
preprocessing the annotation data in a high-precision map and converting the semantic layers of the map into annotated polygon layers;
extracting semantic features from the annotated polygon layers, performing prediction head analysis on the semantic features, realizing the stitching and fusion of the multi-view cameras, and predicting the final polygon segmentation result by implicit set prediction;
and converting between the image-plane and top-view perspectives to complete the transformer-based implicit prediction.
2. The method for realizing bird's-eye view semantic segmentation based on implicit set prediction according to claim 1, characterized in that preprocessing the annotation data in the high-precision map comprises dynamic target preprocessing and static road element preprocessing;
the dynamic target preprocessing comprises: squeezing each target along the height axis to form a ground occupancy mask, forming a polygon on the ground plane, and giving the mask the corresponding semantic label;
the static road element preprocessing comprises: the driving area need not be fully covered by the high-precision map; only map data for a partial area is required as training labels, and, relying on training against the semantic map, an effective semantic segmentation capability generalizes to any structured road.
3. The method for realizing bird's-eye view semantic segmentation based on implicit set prediction according to claim 2, characterized in that the static road element preprocessing further comprises: polyline map information is allowed a predetermined error, and each polyline is given a predetermined width so that it expands into a long, narrow polygon for training.
4. The method for realizing bird's-eye view semantic segmentation based on implicit set prediction according to claim 1, characterized in that performing prediction head analysis on the semantic features comprises:
regressing, by the prediction head, the image features produced by the preceding network into the physical quantities that are ultimately required, comparing them with the ground-truth labels to obtain a loss function, and back-propagating the loss;
the back-propagated loss function comprising three parts: a bipartite matching loss, a classification loss, and an intersection-over-union loss.
5. The method for realizing bird's-eye view semantic segmentation based on implicit set prediction according to claim 1, characterized in that the prediction head follows the single-stage, anchor-free FCOS3D design: each pixel is classified, a two-dimensional polygon is obtained through a center-ness operation, and the two-dimensional polygon directly guides the generation of the three-dimensional polygon region.
6. The method for realizing bird's-eye view semantic segmentation based on implicit set prediction according to claim 1, characterized in that converting between the image-plane and top-view perspectives comprises:
performing the implicit mapping of the view transformation with a transformer, whose input is a dimension-reduced feature map and whose output is the center-point information of the polygons in the bird's-eye view;
compressing the image feature map from (batch size, height, width, channels) to (batch size, height × width, channels), feeding it into the transformer's encoder unit with a trigonometric-function-based positional encoding added, and taking the output set of polygon center points as the polygon queries;
back-projecting the two-dimensional reference points associated with the top-view polygons onto the corresponding positions of the image plane according to the camera intrinsics and extrinsics, and locating those positions and the neighboring feature maps in the original image;
sampling the feature-map features by bilinear interpolation and introducing them into the polygon queries, describing the interaction between each polygon center point and the corresponding feature-map position with a multi-head attention mechanism, and then refining the queries with these feature-mapping components layer by layer through the transformer layers;
for the queries refined at the transformer output, introducing two multilayer-perceptron branches that output, respectively, the category information and the vertex information of each polygon, fitting complex semantic area elements with fixed multi-vertex polygons; the polygons of the prediction set are compared with the polygons of the ground-truth set to form the loss function and back-propagate the gradient.
7. The method for realizing bird's-eye view semantic segmentation based on implicit set prediction according to claim 6, characterized in that the stitching of the multi-view cameras comprises: for the queries over multiple cameras, the bird's-eye view reference points only need to be inversely mapped onto each camera's feature map, the feature maps of the several cameras are used simultaneously for the query refinement process, and the result is fed to the encoding side of the transformer; and if, after back-projection through the intrinsics and extrinsics, a bird's-eye view reference point does not fall within the image plane of some camera, that camera's feature map is ignored when adjusting the query.
8. A system for implementing bird's-eye view semantic segmentation based on implicit set prediction, comprising:
a preprocessing module for preprocessing the annotation data in a high-precision map and converting the semantic layers of the map into annotated polygon layers;
a prediction head module for extracting semantic features from the annotated polygon layers, performing prediction head analysis on the semantic features, realizing the stitching and fusion of the multi-view cameras, and predicting the final polygon segmentation result by implicit set prediction;
and a transformation module for converting between the image-plane and top-view perspectives to complete the transformer-based implicit prediction.
9. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods of claims 1-7.
10. A computing device, comprising: one or more processors, memory, and one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods of claims 1-7.
CN202310051370.7A 2023-02-02 2023-02-02 Method and system for realizing bird's-eye view semantic segmentation based on implicit set prediction Pending CN115965970A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310051370.7A CN115965970A (en) 2023-02-02 2023-02-02 Method and system for realizing bird's-eye view semantic segmentation based on implicit set prediction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310051370.7A CN115965970A (en) 2023-02-02 2023-02-02 Method and system for realizing bird's-eye view semantic segmentation based on implicit set prediction

Publications (1)

Publication Number Publication Date
CN115965970A true CN115965970A (en) 2023-04-14

Family

ID=87358138

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310051370.7A Pending CN115965970A (en) 2023-02-02 2023-02-02 Method and system for realizing bird's-eye view semantic segmentation based on implicit set prediction

Country Status (1)

Country Link
CN (1) CN115965970A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116704472A (en) * 2023-05-15 2023-09-05 Xiaomi Automobile Technology Co., Ltd. Image processing method, device, apparatus, medium, and program product
CN116704472B (en) * 2023-05-15 2024-04-02 Xiaomi Automobile Technology Co., Ltd. Image processing method, device, apparatus, medium, and program product
CN116358573A (en) * 2023-05-31 2023-06-30 Xiaomi Automobile Technology Co., Ltd. Map building method, map building device, storage medium and vehicle
CN116358573B (en) * 2023-05-31 2023-08-29 Xiaomi Automobile Technology Co., Ltd. Map building method, map building device, storage medium and vehicle


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination