CN112862839B - Method and system for enhancing robustness of semantic segmentation of map elements - Google Patents


Info

Publication number
CN112862839B
CN112862839B
Authority
CN
China
Prior art keywords
frame image
semantic segmentation
optical flow
frame
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110203999.XA
Other languages
Chinese (zh)
Other versions
CN112862839A (en)
Inventor
杨蒙蒙 (Yang Mengmeng)
唐雪薇 (Tang Xuewei)
江昆 (Jiang Kun)
杨殿阁 (Yang Diange)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202110203999.XA
Publication of CN112862839A
Application granted
Publication of CN112862839B
Legal status: Active

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 — Image analysis
    • G06T 7/10 — Segmentation; Edge detection
    • G06T 7/11 — Region-based segmentation
    • G06T 7/187 — Segmentation involving region growing, region merging, or connected component labelling
    • G06T 7/20 — Analysis of motion
    • G06T 7/269 — Analysis of motion using gradient-based methods
    • G06T 2207/00 — Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 — Image acquisition modality
    • G06T 2207/10016 — Video; Image sequence
    • G06T 2207/30 — Subject of image; Context of image processing
    • G06T 2207/30248 — Vehicle exterior or interior
    • G06T 2207/30252 — Vehicle exterior; Vicinity of vehicle

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to a method and a system for enhancing the robustness of semantic segmentation of map elements, comprising the following steps: 1) dividing the driving-scene video acquired by a vehicle-mounted camera sensor into independent video frames in time order; 2) performing semantic segmentation on each independent video frame from step 1) with a preset semantic segmentation network to obtain masks corresponding to the semantic segmentation results of the various map elements in each frame image, and introducing optical flow information between adjacent frame images to enhance the stability of video semantic segmentation. The method uses only the continuous video information of a camera sensor and connects the per-frame semantic segmentation results through optical flow information, so that robust map elements can be identified accurately and at low cost. The invention can therefore be widely applied in the field of automatic driving.

Description

Method and system for enhancing robustness of semantic segmentation of map elements
Technical Field
The invention belongs to the field of automatic driving, and particularly relates to a video map element semantic segmentation robustness enhancing method and system based on computer vision and optical flow information fusion.
Background
As an indispensable perception carrier for high-level automatic driving, the high-precision map is a key basis for realizing automatic driving: it not only provides lane-level navigation and driving-environment information for the automatic-driving vehicle, but also enriches the vehicle's prior environment information to assist subsequent decision-making. The two major tasks in building high-precision maps are acquisition and updating, and accomplishing them faster and at lower cost is a practical challenge for high-precision maps and a topic of intense research in the current field of automatic driving. Meanwhile, with the continuous deepening of research in computer vision, perceiving the different elements of a high-precision map from images has also become an important way to solve the perception problem.
At present, the mainstream scheme for high-precision map construction and updating is to use video data from a camera sensor and perceive real-world lane information with vision-based methods. Semantic segmentation of the map elements in the camera image is a high-precision, low-cost way to extract map-element information, and it can effectively supply this information to a subsequent three-dimensional map-element modeling module. Semantic segmentation is the task of assigning a semantic class to every pixel in an image. Research on semantic segmentation models based on deep convolutional neural networks is now very common: convolutional neural networks have a very strong feature-learning capability, and a stable output can be obtained after sufficient training on a large amount of data. A deep-learning semantic segmentation network is generally divided into an encoding part and a decoding part: the encoding part is a feature-extraction network that extracts deep features of the input image through multiple convolution layers, and the decoding part is an up-sampling network that produces an output consistent with the input size.
Existing semantic segmentation networks can obtain high-precision segmentation results on existing open-source datasets, but they fall short when applied directly to actual scene data in a real high-precision map-element perception and modeling task. Current single-frame semantic segmentation discards the temporal information captured by the camera sensor; because adjacent frames are processed independently, the segmentation jumps between frames, so directly applying existing methods often yields unstable, jumping results that are difficult to use in real engineering.
Disclosure of Invention
In view of the above problems, an object of the present invention is to provide a method and a system for enhancing the robustness of semantic segmentation of map elements based on optical-flow information fusion, in which only the continuous video information of a camera sensor is used and the per-frame semantic segmentation results are connected through optical flow information, so that robust map elements can be identified accurately at low cost.
In order to achieve this purpose, the invention adopts the following technical scheme:
the first aspect of the invention provides a robustness enhancing method for semantic segmentation of map elements, which comprises the following steps: 1) And dividing the driving scene video acquired by the vehicle-mounted camera sensor into independent video frame images according to the time sequence. 2) Semantic segmentation is carried out on each independent video frame image data in the step 1) based on a preset semantic segmentation network to obtain masks corresponding to semantic segmentation results of various map elements in each frame image, and optical flow information is introduced between adjacent frame images to enhance the stability of the semantic segmentation video images.
Further, in step 2), the method for enhancing video semantic segmentation stability includes the following steps:
2.1) reading the i-th frame image and inputting it into the preset semantic segmentation network to obtain masks corresponding to the semantic segmentation results of the various map elements in the i-th frame image; the preset semantic segmentation network targets three map elements: lane lines, lamp posts and road signboards;
2.2) reading the (i+1)-th frame image and inputting it into the preset semantic segmentation network to obtain masks corresponding to the semantic segmentation results of the various map elements in the (i+1)-th frame image;
2.3) calculating the optical flow information between the i-th frame image and the (i+1)-th frame image to obtain an inter-frame optical flow map;
2.4) based on the obtained inter-frame optical flow map, propagating the masks corresponding to the semantics of the various map elements in the i-th frame image to the (i+1)-th frame image, and performing an enhancement operation on the semantic segmentation result of the (i+1)-th frame image within a preset restricted region, so that incompletely segmented regions in the (i+1)-th frame image are supplemented;
2.5) iterative enhancement: repeating steps 2.2) to 2.4) until all the independent video frame images from step 1) have been processed.
Further, the optical flow information fusion in step 2.4) includes the following steps:
2.4.1) according to the optical flow map calculated between the two frame images, propagating the map-element semantic segmentation result of the earlier frame image along the displacement vectors of the optical flow map, following the pixel correspondences in the optical flow map, to the corresponding positions in the later frame image, obtaining a corrected semantic-element region;
2.4.2) comparing the corrected semantic-element region obtained in step 2.4.1) with the semantic-element segmentation result of the later frame image obtained in step 2.2), and correcting the incompletely segmented map-element parts of the later frame image based on the comparison, thereby realizing the enhanced supplement.
Further, in step 2.4.1), the map-element semantic segmentation result of the earlier frame image is propagated along the displacement vectors of the optical flow map, following the pixel correspondences in the optical flow map, to the corresponding positions in the later frame image according to:

x^(i+1) = x^(i) + u_(x,y) · dt
y^(i+1) = y^(i) + v_(x,y) · dt

where x^(i) and y^(i) are the horizontal and vertical coordinates of a pixel in the i-th frame image; x^(i+1) and y^(i+1) are the coordinates of the corresponding pixel in the (i+1)-th frame image; dt is the elapsed time between the frames; and (u_(x,y), v_(x,y)) is the velocity vector stored at each coordinate of the optical flow map, describing how the pixel at that position propagates to the next frame.
Further, in step 2.4.2), the corrected semantic segmentation result of the later frame image is:

I_(i+1|i+1) = I_(i+1) ∪ (I_(i+1|i) ∩ I_(restricted area))

where I_(i+1) is the semantic segmentation result of the (i+1)-th frame image; I_(i+1|i) is the corrected segmentation result of the i-th frame image propagated by optical flow; and I_(i+1|i+1), the fusion of I_(i+1) with the propagated result I_(i+1|i), is the final semantic segmentation result of the (i+1)-th frame image.
In a second aspect of the present invention, there is provided a robustness enhancing system for semantic segmentation of map elements, comprising:
a video frame image acquisition module, used for dividing the driving-scene video acquired by the vehicle-mounted camera sensor into independent video frames in time order;
and a semantic enhancement module, used for performing semantic segmentation on each independent video frame with a preset semantic segmentation network to obtain masks corresponding to the semantic segmentation results of the various map elements in each frame image, and introducing optical flow information between adjacent frame images to enhance video semantic segmentation stability.
Further, the semantic enhancement module includes:
a front-frame image processing module, used for reading the earlier frame image and inputting it into the semantic segmentation network for the three map elements (lane lines, lamp posts and road signboards) to obtain masks corresponding to the semantic segmentation results of the various map elements in the earlier frame image;
a rear-frame image processing module, used for reading the later frame image and inputting it into the same semantic segmentation network to obtain masks corresponding to the semantic segmentation results of the various map elements in the later frame image;
an optical flow map acquisition module, used for calculating the optical flow information between the two frame images to obtain an inter-frame optical flow map;
an optical flow information fusion module, used for propagating the masks of the earlier frame image to the later frame image through the obtained inter-frame optical flow map, and performing an enhancement operation on the result of the later frame image within a preset restricted region to supplement its incompletely segmented semantic regions;
and an iterative enhancement module, used for advancing the frame index by 1 and returning to the front-frame image processing module until all frame images have been processed.
Further, the optical flow information fusion module includes a correction module and a growth-region restriction module. The correction module propagates the map-element semantic segmentation result of the earlier frame image along the displacement vectors of the calculated optical flow map, following the pixel correspondences in the optical flow map, to the corresponding positions in the later frame image, obtaining a corrected semantic-element region. The growth-region restriction module compares the corrected semantic-element region with the semantic-element segmentation result of the later frame image, and corrects the incompletely segmented map-element parts of the later frame image based on the comparison, thereby realizing the enhanced supplement.
Due to the adoption of the above technical scheme, the invention has the following advantages: 1) it uses only the continuous video information of a camera sensor and connects the per-frame semantic segmentation results through optical flow information, so robust map elements can be identified accurately at low cost; 2) by propagating the information of the earlier frame to the later frame through optical flow, it enhances the segmentation of the later frame, reduces unstable segmentation of map elements, and enhances the robustness of map-element segmentation. The invention can therefore be widely applied in the field of automatic driving.
Drawings
FIG. 1 is a flow chart of the optical flow fusion algorithm of the present invention;
FIG. 2 is a block diagram of the iterative enhancement algorithm of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and examples.
As shown in FIG. 1, the method for enhancing the robustness of map-element semantic segmentation provided by the invention takes as input a video shot by a vehicle-mounted camera sensor in an actual automatic-driving engineering task and uses a basic semantic segmentation network for the preliminary extraction of map elements. On this basis, by fusing optical flow information between consecutive video frames, the segmentation mask of the earlier frame is propagated to the later frame through optical flow, and with a certain fault-tolerance mechanism the result is iteratively optimized, enhancing map elements whose segmentation jumps. Specifically, the method includes the following steps:
1) Dividing the driving-scene video acquired by the vehicle-mounted camera sensor into independent video frame images in time order.
2) Performing semantic segmentation on each independent video frame image from step 1) with a preset semantic segmentation network to obtain masks corresponding to the semantic segmentation results of the various map elements in each frame image, and introducing optical flow information between adjacent frame images to enhance the stability of the semantic segmentation of the video images.
Specifically, the method comprises the following steps:
2.1) Processing the i-th frame image: read the i-th frame image and input it into the preset semantic segmentation network to obtain masks corresponding to the semantic segmentation results of the various map elements in the i-th frame image; the preset semantic segmentation network targets three map elements: lane lines, lamp posts and road signboards.
2.2) Processing the (i+1)-th frame image: read the (i+1)-th frame image and input it into the preset semantic segmentation network to obtain masks corresponding to the semantic segmentation results of the various map elements in the (i+1)-th frame image.
2.3) Obtaining the optical flow map: calculate the optical flow information between the i-th frame image and the (i+1)-th frame image to obtain an inter-frame optical flow map.
2.4) Optical flow information fusion: based on the obtained inter-frame optical flow map, propagate the masks corresponding to the semantics of the various map elements in the i-th frame image to the (i+1)-th frame image, and perform an enhancement operation on the semantic segmentation result of the (i+1)-th frame image within a preset restricted region to supplement incompletely segmented regions in the (i+1)-th frame image. The restricted region is a neighborhood of the detected regions in the (i+1)-th frame image: for lamp-post map elements it can be a longitudinal (vertical) neighborhood, for road-signboard map elements it can be both a horizontal and a longitudinal neighborhood, and the neighborhood size can be adjusted to the actual problem.
2.5) Iterative enhancement: repeat steps 2.2) to 2.4) until all the independent video frame images from step 1) have been processed.
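The iteration in steps 2.1)–2.5) can be sketched as a small driver loop. This is an illustrative skeleton, not the patent's implementation: `enhance_video`, `segment`, `compute_flow`, `propagate` and `restrict` are hypothetical names, and the segmentation network, flow method, mask propagation and restricted-region construction are all injected as callables, since the text leaves them open.

```python
import numpy as np

def enhance_video(frames, segment, compute_flow, propagate, restrict):
    """Steps 2.1)-2.5): segment every frame, then fuse in the previous
    frame's mask after propagating it along the inter-frame optical flow."""
    prev_frame = frames[0]
    prev_mask = segment(prev_frame)                 # step 2.1)
    results = [prev_mask]
    for frame in frames[1:]:
        cur_mask = segment(frame)                   # step 2.2)
        flow = compute_flow(prev_frame, frame)      # step 2.3)
        warped = propagate(prev_mask, flow)         # step 2.4.1)
        # step 2.4.2): only grow inside the restricted (feasible) region
        fused = cur_mask | (warped & restrict(cur_mask))
        results.append(fused)
        # step 2.5): the fused result is carried forward as the
        # "previous frame" mask, as the text requires
        prev_frame, prev_mask = frame, fused
    return results
```

Note that the fused result, not the raw network output, becomes the previous-frame information for the next iteration.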
Further, as shown in fig. 2, the semantic segmentation networks preset in steps 2.1) and 2.2) generally consist of an encoder and a decoder. The encoder is usually a deep convolutional network with a complex structure whose purpose is to extract deep feature information; different encoder models have different feature-characterization capabilities and thus different segmentation effects. The decoder is usually an up-sampling network whose purpose is to convert the deep feature information extracted by the encoder into a segmentation result consistent with the size of the input image.
In the actual high-precision map modeling task, three map elements need to be extracted: lane lines, lamp posts and road signboards. The semantic segmentation network therefore segments only these three semantics in the video image. Before the network is trained, the labels fed into it are processed so that only the semantic labels of these three map elements are retained; after sufficient training, the semantic segmentation network applied in the invention directly outputs the three segmentation results. This step only provides a preliminary single-frame semantic segmentation of the map elements, so the invention does not prescribe a specific method for realizing semantic segmentation and places no other special requirements on the structure of the semantic segmentation network.
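The label reduction described above (keeping only the lane-line, lamp-post and signboard classes before training) might look as follows. The source label ids here are invented for illustration, since the patent does not name a dataset, and `reduce_labels` is an illustrative name.

```python
import numpy as np

# Hypothetical raw label ids; the patent only states that all labels except
# lane line, lamp post and road signboard are discarded before training.
KEEP = {7: 1, 17: 2, 20: 3}   # assumed source id -> target id (0 = background)

def reduce_labels(label_map):
    """Collapse a full semantic label image to the three map-element
    classes, mapping every other class to background."""
    out = np.zeros_like(label_map)
    for src, dst in KEEP.items():
        out[label_map == src] = dst
    return out
```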
Further, in step 2.3), after the map elements of each frame are extracted, optical flow information is introduced between adjacent frames to enhance video semantic segmentation stability. The invention is not concerned with, and is therefore not limited to, any specific method of computing the optical flow.
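Since the patent deliberately leaves the optical-flow method open, here is one minimal, self-contained stand-in: a pure-NumPy phase-correlation estimator that recovers a single global translation between two frames and broadcasts it into the dense H×W×2 (u, v) map used by the following steps. A real system would use a dense method (e.g. Farneback or a learned flow network) instead; `global_translation_flow` is an illustrative name, and the single-translation assumption is of course far weaker than true per-pixel flow.

```python
import numpy as np

def global_translation_flow(prev, nxt):
    """Estimate one global (u, v) translation between two grayscale frames
    by phase correlation, then broadcast it to a dense flow map."""
    f1 = np.fft.fft2(prev.astype(float))
    f2 = np.fft.fft2(nxt.astype(float))
    cross = f2 * np.conj(f1)
    cross /= np.abs(cross) + 1e-12      # normalized cross-power spectrum
    corr = np.fft.ifft2(cross).real     # delta peak at the displacement
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = prev.shape
    if dy > h // 2:                     # unwrap circular shifts
        dy -= h
    if dx > w // 2:
        dx -= w
    flow = np.zeros((h, w, 2))
    flow[..., 0] = dx                   # u component
    flow[..., 1] = dy                   # v component
    return flow
```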
Further, the optical flow information fusion in step 2.4) includes the following steps:
2.4.1) Iterative enhancement: according to the optical flow map calculated between the two frame images, propagate the map-element semantic segmentation result of the earlier frame image along the displacement vectors of the optical flow map, following the pixel correspondences in the optical flow map, to the corresponding positions in the later frame image, obtaining a corrected semantic-element region.
So that the map-element regions corrected by the optical flow map are used effectively, the invention takes the corrected segmentation result as the final semantic segmentation result of the frame image and carries it into the processing of the next frame as the previous-frame information.
The optical flow map is inter-frame information calculated from the pixel relationship between the two frames; each coordinate stores the velocity vector (u_(x,y), v_(x,y)) with which the pixel at that position propagates to the next frame. Let x^(i) and y^(i) be the horizontal and vertical coordinates of a pixel in the i-th frame image (whose segmentation result already fuses the optical-flow-propagated result of frame i−1), and let dt be the elapsed time between frames. Then the pixel (x^(i), y^(i)) propagates along the optical flow map to the following position in the next frame:

x^(i+1) = x^(i) + u_(x,y) · dt
y^(i+1) = y^(i) + v_(x,y) · dt
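A direct reading of this propagation rule, applied to a whole binary mask: every foreground pixel is moved along its stored (u, v) vector, rounded to the nearest pixel, and dropped if it leaves the image. This is a sketch under the assumption of a NumPy H×W×2 flow array; `propagate_mask` is an illustrative name.

```python
import numpy as np

def propagate_mask(mask, flow, dt=1.0):
    """Forward-propagate the foreground pixels of `mask` using
    x_{i+1} = x_i + u*dt and y_{i+1} = y_i + v*dt, where (u, v) is
    stored in flow[..., 0] and flow[..., 1]."""
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    nx = np.rint(xs + flow[ys, xs, 0] * dt).astype(int)
    ny = np.rint(ys + flow[ys, xs, 1] * dt).astype(int)
    keep = (nx >= 0) & (nx < w) & (ny >= 0) & (ny < h)
    out = np.zeros_like(mask)
    out[ny[keep], nx[keep]] = 1
    return out
```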
2.4.2) Growth-region restriction: compare the corrected semantic-element region obtained in step 2.4.1) with the semantic-element segmentation result of the later frame image obtained in step 2.2), and correct the incompletely segmented map-element parts of the later frame image based on the comparison, thereby realizing the enhanced supplement.
Specifically, the region given by the semantic segmentation result of the later frame image (from the network of step 2.2)) is compared with the corrected semantic-element region at the corresponding position obtained in step 2.4.1) to judge whether the later frame image may be modified, i.e., whether the semantic pixels propagated from the earlier frame may modify the later frame's segmentation result. If the later frame image already has semantic segmentation information within the semantic-element region, that region is considered feasible, so the optical flow information can only grow the result within the action region of the original result.
To limit the influence of erroneous optical-flow propagation on later frame images, to ensure that the optical-flow-enhanced region does not gradually expand with iteration, and to avoid erroneously propagated regions, the invention restricts optical-flow enhancement to a certain feasible region, avoiding error propagation and error iteration. Specifically, let I_(i+1) be the semantic segmentation result of the (i+1)-th frame image, I_(i+1|i) the corrected segmentation result of the i-th frame image propagated by optical flow, and I_(i+1|i+1) the final semantic segmentation result of the (i+1)-th frame image, obtained by fusing I_(i+1) with the propagated result I_(i+1|i). Then:

I_(i+1|i+1) = I_(i+1) ∪ (I_(i+1|i) ∩ I_(restricted area))
Because the precision of optical flow computation is limited, a complete one-to-one correspondence between the pixels of the two frames cannot be achieved, so the algorithm is effective only for map elements whose segmentation region is obviously insufficient in a frame where the segmentation jumps. If the segmentation accuracy is already high enough, the algorithm may instead introduce new errors due to the introduction of optical flow.
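The restricted-region fusion above can be sketched as follows. The directional dilation builds I_(restricted area) as a neighborhood of the current frame's detections — vertical only for lamp posts, both directions for signboards, with tunable sizes, as the text describes. `directional_dilate` and `fuse` are illustrative names, and shift-and-OR dilation is one simple choice of neighborhood construction.

```python
import numpy as np

def directional_dilate(mask, up=0, down=0, left=0, right=0):
    """Grow a binary mask by OR-ing shifted copies of itself, giving a
    directional neighborhood (the feasible region for optical-flow growth)."""
    out = mask.copy()
    for k in range(1, up + 1):
        out[:-k] |= mask[k:]        # rows above a detection become feasible
    for k in range(1, down + 1):
        out[k:] |= mask[:-k]        # rows below
    for k in range(1, left + 1):
        out[:, :-k] |= mask[:, k:]  # columns to the left
    for k in range(1, right + 1):
        out[:, k:] |= mask[:, :-k]  # columns to the right
    return out

def fuse(cur_mask, propagated, restricted):
    """I_(i+1|i+1) = I_(i+1) ∪ (I_(i+1|i) ∩ I_(restricted area))."""
    return cur_mask | (propagated & restricted)
```

Propagated pixels outside the feasible neighborhood are discarded, which is exactly how the formula prevents the enhanced region from expanding across iterations.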
Based on the above map-element semantic segmentation enhancement method, the invention also provides a map-element semantic segmentation enhancement system, comprising: a video frame image acquisition module, used for dividing the driving-scene video acquired by the vehicle-mounted camera sensor into independent video frames in time order; and a semantic enhancement module, used for performing semantic segmentation on each independent video frame with a preset semantic segmentation network to obtain masks corresponding to the semantic segmentation results of the various map elements in each frame image, and introducing optical flow information between adjacent frame images to enhance video semantic segmentation stability.
Further, the semantic enhancement module comprises:
a front-frame image processing module, used for reading the earlier frame image and inputting it into the semantic segmentation network for the three map elements (lane lines, lamp posts and road signboards) to obtain masks corresponding to the semantic segmentation results of the various map elements in the earlier frame image;
a rear-frame image processing module, used for reading the later frame image and inputting it into the same semantic segmentation network to obtain masks corresponding to the semantic segmentation results of the various map elements in the later frame image;
an optical flow map acquisition module, used for calculating the optical flow information between the two frame images to obtain an inter-frame optical flow map;
an optical flow information fusion module, used for propagating the masks of the earlier frame image to the later frame image through the obtained inter-frame optical flow map, and performing an enhancement operation on the result of the later frame image within a preset restricted region to supplement its incompletely segmented semantic regions;
and an iterative enhancement module, used for advancing the frame index by 1 and returning to the front-frame image processing module until all frame images have been processed.
Further, the optical flow information fusion module includes a correction module and a growth-region restriction module. The correction module propagates the map-element semantic segmentation result of the earlier frame image along the displacement vectors of the calculated optical flow map, following the pixel correspondences in the optical flow map, to the corresponding positions in the later frame image, obtaining a corrected semantic-element region. The growth-region restriction module compares the corrected semantic-element region with the semantic-element segmentation result of the later frame image, and corrects the incompletely segmented map-element parts of the later frame image based on the comparison, realizing the enhanced supplement.
The above embodiments are only used to illustrate the present invention; the structure, connection mode, manufacturing process, etc. of the components may be changed, and all equivalent changes and modifications made on the basis of the technical solution of the present invention should not be excluded from the protection scope of the present invention.

Claims (3)

1. A method for enhancing robustness of semantic segmentation of map elements is characterized by comprising the following steps:
1) dividing the driving-scene video acquired by a vehicle-mounted camera sensor into independent video frame images in time order;
2) performing semantic segmentation on each independent video frame image from step 1) with a preset semantic segmentation network to obtain masks corresponding to the semantic segmentation results of the various map elements in each frame image, and introducing optical flow information between adjacent frame images to enhance the stability of the semantic segmentation of the video images;
in step 2), the method for enhancing video semantic segmentation stability comprises the following steps:
2.1) reading the i-th frame image and inputting it into a preset semantic segmentation network to obtain masks corresponding to the semantic segmentation results of the various map elements in the i-th frame image, wherein the preset semantic segmentation network targets three map elements: lane lines, lamp posts and road signboards;
2.2) reading the (i+1)-th frame image and inputting it into the preset semantic segmentation network to obtain masks corresponding to the semantic segmentation results of the various map elements in the (i+1)-th frame image;
2.3) calculating the optical flow information between the i-th frame image and the (i+1)-th frame image to obtain an inter-frame optical flow map;
2.4) based on the obtained inter-frame optical flow map, propagating the masks corresponding to the semantics of the various map elements in the i-th frame image to the (i+1)-th frame image, and performing an enhancement operation on the semantic segmentation result of the (i+1)-th frame image within a preset restricted region to supplement incompletely segmented regions in the (i+1)-th frame image;
the optical flow information fusion method comprises the following steps:
2.4.1 ) According to the calculated optical flow map between the two frame images, propagating the map element semantic segmentation result of the front frame image along the optical-flow displacement vectors to the corresponding positions of the rear frame image, according to the pixel correspondences in the optical flow map, to obtain a corrected semantic element region;
2.4.2 Comparing the corrected semantic element region obtained in the step 2.4.1) with the semantic element segmentation result of the later frame image obtained in the step 2.2), and correcting the part of the later frame image with incomplete map element segmentation based on the comparison result to realize enhanced supplement;
the corrected semantic segmentation result of the (i + 1) th frame image is:
I(i+1|i+1) = I(i+1) ∪ (I(i+1|i) ∩ I(restricted area))
wherein I(i+1) is the semantic segmentation result of the (i + 1) th frame image; I(i+1|i) is the corrected semantic segmentation result of the ith frame image propagated by optical flow; and I(i+1|i+1), the fusion of the semantic segmentation result I(i+1) of the (i + 1) th frame image with the optical-flow-propagated corrected result I(i+1|i) of the ith frame image, is the final semantic segmentation result of the (i + 1) th frame image;
2.5 Iterative enhancement: and repeating the steps 2.2) to 2.4) until all the independent video frame images in the step 1) are processed.
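Steps 2.1)–2.5) amount to a frame-by-frame loop. A minimal NumPy sketch (the segmentation network and optical-flow estimator are replaced by precomputed stand-in arrays; the function and variable names are hypothetical, not from the patent):

```python
import numpy as np

def enhance_sequence(seg_masks, flows, restricted, dt=1.0):
    """seg_masks[i]: boolean mask of frame i from the segmentation network.
    flows[i]: (h, w, 2) optical-flow map from frame i to frame i+1, as (U, V).
    Returns the per-frame masks after optical-flow enhancement."""
    h, w = seg_masks[0].shape
    enhanced = [seg_masks[0]]                      # step 2.1: first frame as-is
    for i in range(len(seg_masks) - 1):            # steps 2.2-2.5: iterate
        # step 2.4.1: propagate frame i's enhanced mask along the flow
        corrected = np.zeros((h, w), dtype=bool)
        ys, xs = np.nonzero(enhanced[i])
        nx = np.clip(np.round(xs + flows[i][ys, xs, 0] * dt).astype(int), 0, w - 1)
        ny = np.clip(np.round(ys + flows[i][ys, xs, 1] * dt).astype(int), 0, h - 1)
        corrected[ny, nx] = True
        # step 2.4.2: I(i+1|i+1) = I(i+1) ∪ (I(i+1|i) ∩ I(restricted area))
        enhanced.append(seg_masks[i + 1] | (corrected & restricted))
    return enhanced

# Two frames: the element vanishes from frame 1's raw segmentation but is
# recovered by propagating frame 0's mask with zero flow (static scene).
m0 = np.zeros((3, 3), dtype=bool); m0[0, 0] = True
m1 = np.zeros((3, 3), dtype=bool)
out = enhance_sequence([m0, m1], [np.zeros((3, 3, 2))], np.ones((3, 3), dtype=bool))
```

Because the loop propagates the already-enhanced mask of frame i rather than the raw one, a supplemented element can survive across several consecutive frames of segmentation failure.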
2. The method for enhancing robustness of semantic segmentation of map elements as claimed in claim 1, characterized in that: in step 2.4.1), the formula by which the map element semantic segmentation result of the front frame image is propagated along the optical-flow displacement vectors, according to the pixel correspondences in the optical flow map, to the corresponding positions of the rear frame image is as follows:
x(i+1) = x(i) + U(x,y) · dt
y(i+1) = y(i) + V(x,y) · dt
wherein x(i) and y(i) are the horizontal and vertical coordinates of a pixel in the ith frame image; x(i+1) and y(i+1) are the horizontal and vertical coordinates of the corresponding pixel in the (i + 1) th frame image; dt is the elapsed time; and (U(x,y), V(x,y)) is the velocity vector, stored at each coordinate of the optical flow map, with which the pixel at that position propagates to the next frame.
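The displacement relation in claim 2 can be evaluated directly over coordinate grids. A hedged NumPy sketch (the grid size, dt, and the constant flow field are illustrative assumptions):

```python
import numpy as np

h, w, dt = 3, 3, 1.0
ys, xs = np.mgrid[0:h, 0:w]          # y = row index, x = column index
U = np.full((h, w), 2.0)             # horizontal velocity stored in the flow map
V = np.full((h, w), -1.0)            # vertical velocity stored in the flow map
# x(i+1) = x(i) + U(x,y) * dt ;  y(i+1) = y(i) + V(x,y) * dt
x_next = xs + U * dt
y_next = ys + V * dt
```

Each pixel's new position is its old position shifted by the per-pixel velocity scaled by the inter-frame time dt.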
3. A robustness enhancement system for semantic segmentation of map elements, comprising:
the video frame image acquisition module is used for dividing the driving scene video acquired by the vehicle-mounted camera sensor into independent video frames according to time sequence;
the semantic enhancement module is used for performing semantic segmentation on each independent video frame data based on a preset semantic segmentation network to obtain masks corresponding to semantic segmentation results of various map elements in each frame image, and introducing optical flow information between adjacent frame images to enhance the video semantic segmentation stability;
the semantic enhancement module comprises:
the front frame image processing module is used for reading the front frame image and inputting it into a semantic segmentation network targeting three map elements (lane lines, lamp posts and road signboards) for processing, to obtain masks corresponding to the semantic segmentation results of various map elements in the front frame image;
the rear frame image processing module is used for reading the rear frame image and inputting it into the same semantic segmentation network for processing, to obtain masks corresponding to the semantic segmentation results of various map elements in the rear frame image;
the optical flow graph acquisition module is used for calculating optical flow information between the images of the front frame and the rear frame to obtain an inter-frame optical flow graph;
the optical flow information fusion module is used for propagating the masks corresponding to the front frame image to the rear frame image through the obtained inter-frame optical flow map, and performing an enhancement operation on the result of the rear frame image within a preset restricted region, so as to supplement incompletely segmented semantic regions of the rear frame image;
the iteration enhancement module is used for incrementing the frame index by 1 and returning to the front frame image processing module until all frame images are processed;
the optical flow information fusion module comprises a correction module and a growing area limiting module, wherein the correction module is used for transmitting the map element semantic segmentation result of the front frame image to the corresponding position of the rear frame image along the optical flow image displacement vector according to the corresponding relation of pixels in the optical flow map through the optical flow map corresponding to the optical flow information according to the optical flow information between the two calculated frame images to obtain a corrected semantic element area; the growth region limiting module is used for comparing the obtained corrected semantic element region with the semantic element segmentation result of the later frame image, and correcting the part of the later frame image with incomplete map element segmentation based on the comparison result to realize enhanced supplement;
the corrected semantic segmentation result of the rear frame image is:
I(i+1|i+1) = I(i+1) ∪ (I(i+1|i) ∩ I(restricted area))
wherein I(i+1) is the semantic segmentation result of the (i + 1) th frame image; I(i+1|i) is the corrected semantic segmentation result of the ith frame image propagated by optical flow; and I(i+1|i+1), the fusion of I(i+1) with the optical-flow-propagated corrected result I(i+1|i), is the final semantic segmentation result of the (i + 1) th frame image.
CN202110203999.XA 2021-02-24 2021-02-24 Method and system for enhancing robustness of semantic segmentation of map elements Active CN112862839B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110203999.XA CN112862839B (en) 2021-02-24 2021-02-24 Method and system for enhancing robustness of semantic segmentation of map elements


Publications (2)

Publication Number Publication Date
CN112862839A (en) 2021-05-28
CN112862839B (en) 2022-12-23

Family

ID=75990495

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110203999.XA Active CN112862839B (en) 2021-02-24 2021-02-24 Method and system for enhancing robustness of semantic segmentation of map elements

Country Status (1)

Country Link
CN (1) CN112862839B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113780067A (en) * 2021-07-30 2021-12-10 武汉中海庭数据技术有限公司 Lane linear marker detection method and system based on semantic segmentation
CN114529719A (en) * 2022-01-25 2022-05-24 清华大学 Method, system, medium and device for semantic segmentation of ground map elements
CN116168173B (en) * 2023-04-24 2023-07-18 之江实验室 Lane line map generation method, device, electronic device and storage medium
CN117763064A (en) * 2023-11-01 2024-03-26 武汉中海庭数据技术有限公司 Map updating method, system, equipment and storage medium

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
CN106875406B (en) * 2017-01-24 2020-04-14 北京航空航天大学 Image-guided video semantic object segmentation method and device
CN109753913B (en) * 2018-12-28 2023-05-23 东南大学 Multi-mode video semantic segmentation method with high calculation efficiency
CN110147763B (en) * 2019-05-20 2023-02-24 哈尔滨工业大学 Video semantic segmentation method based on convolutional neural network
CN111062395B (en) * 2019-11-27 2020-12-18 北京理工大学 Real-time video semantic segmentation method
CN111652081B (en) * 2020-05-13 2022-08-05 电子科技大学 Video semantic segmentation method based on optical flow feature fusion

Non-Patent Citations (2)

Title
Semantic Segmentation for Urban Planning Maps Based on U-Net; Zhiling Guo et al.; IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium; 2018-11-04; entire document *
Map segmentation algorithm based on weighted K-means clustering and road-network undirected graph; Xiao Shanghua et al.; Modern Computer; 2018-03-31; entire document *


Similar Documents

Publication Publication Date Title
CN112862839B (en) Method and system for enhancing robustness of semantic segmentation of map elements
CN111126359B (en) High-definition image small target detection method based on self-encoder and YOLO algorithm
CN109753913B (en) Multi-mode video semantic segmentation method with high calculation efficiency
CN114782691A (en) Robot target identification and motion detection method based on deep learning, storage medium and equipment
WO2020097840A1 (en) Systems and methods for correcting a high-definition map based on detection of obstructing objects
CN110738121A (en) front vehicle detection method and detection system
CN109974743B (en) Visual odometer based on GMS feature matching and sliding window pose graph optimization
CN112633220B (en) Human body posture estimation method based on bidirectional serialization modeling
CN103049909B (en) A kind of be focus with car plate exposure method
CN111553945B (en) Vehicle positioning method
CN112070049B (en) Semantic segmentation method under automatic driving scene based on BiSeNet
CN104408757A (en) Method and system for adding haze effect to driving scene video
CN103578083A (en) Single image defogging method based on joint mean shift
CN112949633A (en) Improved YOLOv 3-based infrared target detection method
CN115830265A (en) Automatic driving movement obstacle segmentation method based on laser radar
CN116503709A (en) Vehicle detection method based on improved YOLOv5 in haze weather
CN112801021B (en) Method and system for detecting lane line based on multi-level semantic information
CN111160282B (en) Traffic light detection method based on binary Yolov3 network
CN114882328B (en) Target detection method combining visible light image and infrared image
CN116385994A (en) Three-dimensional road route extraction method and related equipment
CN115937449A (en) High-precision map generation method and device, electronic equipment and storage medium
CN114111817A (en) Vehicle positioning method and system based on SLAM map and high-precision map matching
CN115393822A (en) Method and equipment for detecting obstacle in driving in foggy weather
CN115578246B (en) Non-aligned visible light and infrared mode fusion target detection method based on style migration
Du et al. An Urban Road Semantic Segmentation Method Based on Bilateral Segmentation Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant