CN115482280A - Visual positioning method based on adaptive histogram equalization - Google Patents

Visual positioning method based on adaptive histogram equalization

Info

Publication number
CN115482280A
Authority
CN
China
Prior art keywords
image
depth
network
pose
histogram equalization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211106319.3A
Other languages
Chinese (zh)
Inventor
张会清
杨永建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN202211106319.3A
Publication of CN115482280A
Legal status: Pending

Classifications

    • G06T 7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06T 5/40: Image enhancement or restoration using histogram techniques
    • G06T 5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T 7/13: Edge detection
    • G06T 7/44: Analysis of texture based on statistical description of texture using image operators, e.g. filters, edge density metrics or local histograms
    • G06T 2207/10016: Video; Image sequence
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2207/20192: Edge enhancement; Edge preservation
    • G06T 2207/20221: Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a visual positioning method based on adaptive histogram equalization. The input images are first enhanced with contrast-limited adaptive histogram equalization to restore their texture and edge detail. A binary mask is then designed from the image depth information to eliminate the interference of dynamic targets and occlusions in the scene. A two-branch encoder-decoder network consisting of a pose estimation network and a depth estimation network then extracts the depth information of each image and the pose information between adjacent image frames, and the pose estimates of the network are supervised during training by a photometric consistency loss, an edge smoothness loss and a depth consistency loss. The invention extracts the spatio-temporal features of adjacent image frames, uses the image depth information to keep the network's pose estimates scale-consistent, and accurately estimates position in real time.

Description

Visual positioning method based on adaptive histogram equalization
Technical Field
The invention belongs to the field of visual positioning and relates to a visual positioning algorithm that estimates the motion trajectory of a monocular camera with a two-branch network consisting of a pose estimation network and a depth estimation network. For indoor scenes in which the images are overexposed or underexposed, the algorithm introduces a contrast enhancement step based on contrast-limited adaptive histogram equalization (CLAHE) to mitigate the exposure problem and effectively improve the visual positioning accuracy.
Background
With the continuous development of robot technology, accurate positioning of the robot has a crucial influence on downstream tasks such as motion planning and navigation. The service scenarios of intelligent robots can currently be divided into indoor and outdoor scenes. In outdoor environments, a Global Navigation Satellite System (GNSS) can provide accurate position services through satellite signals; in indoor environments, interference from buildings makes the wireless satellite signals unstable and GNSS-based positioning accuracy unsatisfactory. Visual odometry, by contrast, determines the relative position of the platform purely by estimating the motion between adjacent image frames, without any external signal.
Visual odometers can currently be classified into monocular and binocular systems according to the type of sensor. A monocular visual odometer offers high real-time performance and a simple system structure but lacks absolute scale and therefore suffers from scale ambiguity, whereas a binocular visual odometer can recover the absolute scale by triangulation over its fixed baseline. In recent years, deep learning has gradually become mainstream in computer vision: data-driven visual odometry methods replace the hand-crafted feature extraction module of traditional methods with convolutional neural networks, obtain higher-level visual features, and rely on the strong learning capacity of neural networks to produce the final pose estimate.
Although deep learning has brought a large performance improvement to visual odometry, several problems in the field still need to be addressed. Image frames are the input of a visual odometer, and the texture and edge features they contain are strongly affected by lighting: in outdoor scenes an image may appear too bright because of overexposure, while in indoor scenes it may appear too dark because of insufficient illumination. Over-bright or over-dark images lack texture and edge information, which hampers the subsequent neural network's learning of the visual features and reduces the inference accuracy of the model.
To solve these problems, the invention restores and enhances the texture and edge details of overexposed or underexposed images with a contrast-limited adaptive histogram equalization (CLAHE) algorithm and provides a visual positioning algorithm based on this histogram equalization theory.
Disclosure of Invention
To cope with the influence of scene illumination conditions and weather, the method first recovers the texture details of overexposed or underexposed images with a histogram-equalization-based image enhancement algorithm, and then computes the camera pose of the input video frames with a two-branch prediction network built on deep residual neural networks.
In order to achieve the purpose, the invention adopts the following technical scheme:
First, in the off-line stage, for the input time-ordered sequence of image frames, the texture and edge structure information of each input image is restored and enhanced by the image enhancement module; the enhanced images then pass through the mask module to remove dynamic targets and occlusions; the preprocessed enhanced images are fed into a ResNet encoder to extract image motion and depth information; finally, a decoder restores the image size and produces the pose prediction, and the network weights are updated by back-propagating the gradients of the loss function, yielding a robust pose estimation network model. In the on-line stage, the input sequence of image frames is fed into the trained network model to obtain the corresponding pose estimates. A schematic diagram of the positioning method of the invention is shown in Fig. 1.
A visual positioning method based on adaptive histogram equalization, whose algorithm flow chart is shown in Fig. 2, mainly comprises two stages: off-line model construction and on-line real-time positioning.
the off-line model training stage mainly comprises the following modules:
(1) An image enhancement module: for the input image frame sequence, the image is first partitioned into padded blocks; a mapping function is then computed for each block based on the histogram equalization strategy and the contrast is limited on the basis of this mapping; finally the enhanced image is obtained by bilinear interpolation between blocks. The contrast-limited histogram equalization enhances the edge and texture information of the image and improves the visibility of target objects in over-bright or over-dark regions.
(2) A ResNet-based encoder: the temporal and spatial features of the enhanced images are extracted with the classical feature extraction network ResNet, as follows:
The positioning model consists of a two-branch network: a depth prediction network and a pose estimation network. The depth prediction network generates the depth information of the image, which alleviates the scale drift of long-sequence pose estimation. For the depth prediction network, the encoder uses ResNet-50 to extract image spatio-temporal features; for the pose estimation network, ResNet-18 is used as the encoder, and the first-layer input of the ResNet is modified to 6 channels so that two images can be fed simultaneously.
The number of hidden layers, the number of neurons per layer and the learning rate are determined, and network parameters such as the number of pre-training and fine-tuning iterations per layer are set; the activation function of the output layer is sigmoid and the remaining nonlinear activation layers use ELU.
The two networks are jointly trained with a shared loss function composed of three parts: photometric consistency loss, edge smoothness loss and depth consistency loss.
(3) A decoder: for the depth prediction network, DispNet is used as the decoder to recover the spatio-temporal features of the image by layer-by-layer upsampling, as shown in Fig. 3. For the pose estimation network, as shown in Fig. 4, four convolutional layers are designed following PoseResNet to obtain the predicted 6-degree-of-freedom pose.
(4) An occlusion-based binary mask: dynamic objects and occlusions in the scene violate the static-scene assumption; because of them, the predicted image depth and the true image depth show an obvious inconsistency in dynamic target regions, which disturbs the learning of the depth-network and pose-network parameters. As shown in Fig. 5, based on the predicted single-view depth information, the target view is warped to the source view through the view synthesis principle to obtain a synthesized target depth map; the target-view depth map and the synthesized depth map are then subtracted to obtain a depth difference map, from which a binary mask is computed. The difference is large in dynamic target regions and nearly zero in static regions, and the binary mask is used to eliminate the interference of dynamic targets and occlusions before the loss is computed.
Drawings
Fig. 1 is a flow chart of the present visual positioning method.
Figure 2 is a flow chart of the CLAHE combined Resnet positioning algorithm of the present invention at the off-line stage.
Fig. 3 is a basic configuration diagram of a depth estimation network.
Fig. 4 is a basic configuration diagram of a pose estimation network.
Fig. 5 is a basic configuration diagram of a mask generation network.
Detailed Description
The method of the invention is illustrated in flow-diagram form in Fig. 1. In the off-line training stage, the final pose estimate is obtained from the collected video data through the image enhancement module, the ResNet-based image feature encoding module, the PoseResNet decoder module, the DispNet decoder module and the mask generation module. During training, a photometric consistency loss, an edge smoothness loss and a depth consistency loss supervise the network's learning of the camera pose, yielding scale-consistent poses. In the on-line stage, the trained network model is loaded, and the camera's pose and position relative to the scene are estimated in real time from the image frames it captures.
The specific implementation steps are as follows:
(1) The image enhancement module: changes in illumination caused by weather and scene strongly affect the image frames captured by the camera. To reduce the negative effect of insufficient illumination or overexposure on image texture, a histogram-equalization-based image enhancement algorithm is used to enhance the edge and texture information of the original input images.
For an input image frame, exploiting the locality of human vision, the image is first partitioned into blocks, and the local image blocks are used as the basic units of histogram equalization; this prevents the equalization from amplifying noise in over-dark regions.
After the image is partitioned, an over-dark region concentrates the probability density of its block's histogram on a few pixel values; once the mapping function is applied, these large probability densities are amplified and introduce noise into the image, so the contrast is limited. The image contrast is calculated as
C = (I_sc_max − I_sc_min) / (I_sc_max + I_sc_min)
where I_sc_max is the maximum pixel value of the image and I_sc_min is the minimum pixel value.
After histogram equalization has been applied to each block, splicing the blocks together directly would produce blocking artifacts in the synthesized image. To avoid this, the blocks are blended into the enhanced image by bilinear interpolation: for a pixel point P(i, j), the value of the target pixel is obtained by bilinear interpolation as
p(x, y) = f(p_11)·w_11 + f(p_21)·w_21 + f(p_12)·w_12 + f(p_22)·w_22
where p_11, p_12, p_21 and p_22 are the four pixel coordinates neighbouring P(i, j), and w_11, w_21, w_12 and w_22 are the corresponding interpolation weights.
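As an illustration, the block-wise equalization, contrast clipping and bilinear blending described above can be sketched with OpenCV's CLAHE implementation; the clip limit of 2.0, the 8x8 tile grid and the file name in the example are illustrative assumptions rather than values fixed by this description.

```python
import cv2

def enhance_frame(bgr_image, clip_limit=2.0, grid=(8, 8)):
    # Equalize only the luminance channel so colours are not distorted.
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=grid)
    l_eq = clahe.apply(l)  # per-tile histograms, clipped, blended by bilinear interpolation
    return cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)

# Example: enhanced = enhance_frame(cv2.imread("frame_000.png"))
```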
(2) The ResNet-based encoder: the model is a two-branch network in which the depth estimation network mainly provides scale information with long-sequence consistency for the pose network. For the depth estimation network, ResNet-50 is adopted as the encoder to extract image spatio-temporal features; for the pose estimation network, ResNet-18 is used as the encoder, and the first-layer input of the ResNet is modified to 6 channels so that two images can be fed simultaneously.
After the structure of the ResNet is determined, the number of hidden layers, the number of neurons per layer and the learning rate are chosen according to the network structure, and network parameters such as the number of pre-training and fine-tuning iterations per layer are set; the activation function of the output layer is sigmoid and the remaining nonlinear activation layers use ELU.
During training, the two networks are jointly trained with a shared loss function composed of three parts: photometric consistency loss, edge smoothness loss and depth consistency loss.
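A minimal sketch of the 6-channel modification of the pose-branch encoder, using a torchvision ResNet-18; duplicating the 3-channel first-layer kernels across both frames and the 256x320 input resolution are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

def make_pose_encoder():
    net = models.resnet18()                    # ResNet-18 backbone for the pose branch
    old_weight = net.conv1.weight.data         # original first layer: Conv2d(3, 64, 7, 2, 3)
    net.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)
    with torch.no_grad():
        # Reuse the 3-channel kernels for both stacked frames, halved to keep the
        # expected activation magnitude comparable.
        net.conv1.weight.copy_(old_weight.repeat(1, 2, 1, 1) / 2.0)
    # Drop the average-pooling and classification layers to keep the feature map.
    return nn.Sequential(*list(net.children())[:-2])

pair = torch.randn(1, 6, 256, 320)             # target and source frames stacked channel-wise
features = make_pose_encoder()(pair)           # shape: (1, 512, 8, 10)
```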
(3) The decoder: for the depth prediction network, DispNet is used as the decoder, recovering the scale information of the image by layer-by-layer upsampling with transposed convolutions based on 3x3 kernels. For the pose estimation network, as shown in Fig. 4, four convolutional layers are designed following PoseResNet; in the feature decoding stage, the image features are decoded with 1x1 and 3x3 two-dimensional convolution kernels to obtain the final 6-degree-of-freedom pose.
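A minimal sketch of a pose decoder in the spirit of the description above, with four convolutional layers using 1x1 and 3x3 kernels and ELU activations; the channel widths and the global average at the end are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PoseDecoder(nn.Module):
    """Four convolutional layers (1x1 and 3x3 kernels) ending in a 6-DoF pose."""
    def __init__(self, in_channels=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=1), nn.ELU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ELU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ELU(inplace=True),
            nn.Conv2d(256, 6, kernel_size=1),   # 3 rotation + 3 translation channels
        )

    def forward(self, feat):
        # Average the 6-channel map over the spatial dimensions -> (B, 6) pose vector.
        return self.net(feat).mean(dim=(2, 3))

feat = torch.randn(1, 512, 8, 10)               # encoder feature map (see the sketch above)
pose_6dof = PoseDecoder()(feat)                 # shape: (1, 6)
```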
(4) The occlusion-based binary mask: dynamic objects and occlusions in the scene violate the static-scene assumption; because of them, the predicted image depth and the true image depth show an obvious inconsistency in dynamic target regions, which disturbs the learning of the depth-network and pose-network parameters. Based on the predicted single-view depth information, the target view is warped to the source view through the view synthesis principle to obtain a synthesized target depth map. The view synthesis principle can be characterized by the following formula:
υ_ij(x) = | I_i(x) − Î_j(x) |
where υ_ij denotes the pixel-value difference between the target image I_i(x) and the warped source image Î_j(x). The warped source image is obtained from the pose transformation and depth information between the target image and the source image; the pose transformation maps pixels according to
x_j ~ K · T_i→j · D_i(x_i) · K⁻¹ · x_i
where K is the camera intrinsic matrix, T_i→j is the pose transformation matrix from the target image to the source image, and D_i is the depth of the target image obtained from the depth prediction network. From the target-image depth map D_i and the synthesized target depth map D'_i, the scale-consistency mask is calculated as
M(x) = 1 if D_diff,i→j(x) < thre, otherwise M(x) = 0,
where thre is empirically set to 0.25 and D_diff,i→j(x) denotes the depth-information difference between the two images, calculated as
D_diff,i→j(x) = | D_i(x) − D'_i(x) | / ( D_i(x) + D'_i(x) ).
the binary mask based on the depth consistency is used for eliminating the interference of the dynamic target and the shielding before calculating the loss, and the difference value of the dynamic target area is large, and the difference of the static area is nearly zero.
(5) The loss function: to improve the prediction accuracy of the depth estimation network and the pose estimation network while taking the spatio-temporal characteristics of the images into account, the loss function consists of three parts, and the two-branch network is trained jointly in the training stage. The overall loss function is
L_c = α·L_photo + β·L_smooth + γ·L_depth
the first part is the luminosity consistency loss, which is used to constrain the luminosity loss between adjacent image frames:
Figure BDA0003841781360000067
wherein I i (x) And
Figure BDA0003841781360000068
representing gray values of the reference image and the inverted source image.
The second part is the edge smoothness loss, which compensates the prediction accuracy in low-texture or single-plane regions of the scene:
L_smooth = Σ_x |∇D_i(x)| · e^(−|∇I_i(x)|)
where ∇ denotes the first derivative along the spatial directions.
The third part is the scale consistency loss, which uses the image depth information to constrain the network so that the pose keeps a consistent scale in long-sequence pose estimation:
[equation shown as an image in the original document]
where SSIM(I_i, I_j) denotes the structural-similarity difference between the two images.
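A minimal sketch of the three-part loss; the exact photometric, edge-smoothness and depth-consistency terms (and the weights alpha, beta, gamma) are common self-supervised formulations used here as assumptions rather than the precise forms fixed by this description, and the structural-similarity term of the third part is omitted for brevity.

```python
import torch

def total_loss(tgt, src_warped, depth_tgt, depth_synth, mask,
               alpha=1.0, beta=0.1, gamma=0.5):
    """tgt, src_warped: (B, 3, H, W) images; depth_tgt, depth_synth, mask: (B, 1, H, W)."""
    # Photometric consistency: masked mean absolute difference between the
    # reference frame and the warped source frame.
    photo = (tgt - src_warped).abs().mean(dim=1, keepdim=True)
    l_photo = (photo * mask).sum() / (mask.sum() + 1e-7)

    # Edge-aware smoothness: penalize depth gradients except across image edges.
    d_dx = (depth_tgt[..., :, 1:] - depth_tgt[..., :, :-1]).abs()
    d_dy = (depth_tgt[..., 1:, :] - depth_tgt[..., :-1, :]).abs()
    i_dx = (tgt[..., :, 1:] - tgt[..., :, :-1]).abs().mean(dim=1, keepdim=True)
    i_dy = (tgt[..., 1:, :] - tgt[..., :-1, :]).abs().mean(dim=1, keepdim=True)
    l_smooth = (d_dx * torch.exp(-i_dx)).mean() + (d_dy * torch.exp(-i_dy)).mean()

    # Depth (scale) consistency: relative difference between the predicted target
    # depth and the synthesized target depth.
    l_depth = ((depth_tgt - depth_synth).abs()
               / (depth_tgt + depth_synth + 1e-7)).mean()

    return alpha * l_photo + beta * l_smooth + gamma * l_depth
```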
(6) The on-line stage: video data of the test scene is collected and the trained model is loaded; the inference model encodes the edge and texture features of the test scene, and the 6-degree-of-freedom pose is finally obtained through the four transposed-convolution decoding layers.
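A minimal sketch of the on-line stage, assuming the trained network has been exported as a TorchScript module; the file names are hypothetical, and enhance_frame refers to the CLAHE helper sketched earlier.

```python
import cv2
import torch

model = torch.jit.load("pose_model.pt").eval()         # assumed TorchScript export of the trained model
cap = cv2.VideoCapture("test_scene.mp4")
prev = None
with torch.no_grad():
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = enhance_frame(frame)                    # restore texture/edge detail with CLAHE
        cur = torch.from_numpy(frame).permute(2, 0, 1).float().unsqueeze(0) / 255.0
        if prev is not None:
            pose_6dof = model(torch.cat([cur, prev], dim=1))   # (1, 6) relative pose
            print(pose_6dof.squeeze().tolist())
        prev = cur
cap.release()
```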
The method of the invention makes full use of the feature-extraction capability and fusion mechanism of the deep residual network to achieve high-precision positioning.

Claims (6)

1. A visual positioning method based on adaptive histogram equalization is characterized by comprising the following steps:
in the off-line positioning stage, feeding the collected scene video into the positioning model to be trained, and performing contrast-limited histogram equalization on the original input images to enhance their edge and texture information;
extracting motion information and depth information of consecutive video frames through a ResNet-based encoder network;
obtaining the depth-information difference of adjacent image frames from their depth information based on the view synthesis principle, and generating a binary mask from this depth-information difference;
based on the visual features obtained by the encoder, obtaining the 6-degree-of-freedom pose and the depth map through layer-by-layer upsampling with four transposed-convolution layers, and updating the model weights under the supervision of the photometric consistency loss, the edge smoothness loss and the depth consistency loss;
and in the on-line positioning stage, acquiring video data of a test scene, loading the trained model, encoding the scene information to extract edge and texture features, and obtaining real-time 6-degree-of-freedom pose estimation and depth-map estimation through the respective decoders.
2. The visual positioning method based on adaptive histogram equalization according to claim 1, wherein the image enhancement method based on histogram equalization comprises:
partitioning the image into padded blocks, calculating a mapping relation for each block based on the histogram equalization strategy, limiting the contrast on the basis of the obtained mapping relation, and finally obtaining the enhanced image through bilinear interpolation; the contrast-limited histogram equalization enhances the edge and texture information of the image and improves the display of target objects in over-bright or over-dark regions of the image.
3. The visual positioning method based on adaptive histogram equalization according to claim 1, wherein the visual feature extraction method comprises:
for an input image, extracting high-level visual features through an encoder-decoder network; in the encoder part, the classical ResNet based on 3x3 convolution kernels extracts the visual features through four down-sampling convolution stages; the pose estimation network uses ResNet-18 as its encoder and a PoseResNet-style decoder of four convolutional layers to obtain the 6-degree-of-freedom predicted pose; the depth estimation network uses ResNet-50 as its encoder and DispNet as its decoder, recovering the image spatio-temporal features into a depth map by layer-by-layer upsampling.
4. The visual positioning method based on adaptive histogram equalization as claimed in claim 1, wherein said occlusion mask based binary mask comprises:
warping the target view to the source view based on the view synthesis principle to obtain a synthesized target depth map, the view synthesis principle being characterized by the following formula:
υ_ij(x) = | I_i(x) − Î_j(x) |
where υ_ij denotes the pixel-value difference between the target image I_i(x) and the warped source image Î_j(x); the warped source image is obtained from the pose transformation information and depth information between the target image and the source image, the pose transformation being applied according to:
x_j ~ K · T_i→j · D_i(x_i) · K⁻¹ · x_i
where K is the camera intrinsic matrix, T_i→j is the pose transformation matrix from the target image to the source image, and D_i is the depth of the target image obtained from the depth prediction network; based on the target-image depth map D_i and the synthesized target depth map D'_i, the scale-consistency mask is calculated by:
M(x) = 1 if D_diff,i→j(x) < thre, otherwise M(x) = 0,
where thre is empirically set to 0.25 and D_diff,i→j(x) denotes the depth-information difference between the two images, calculated by:
D_diff,i→j(x) = | D_i(x) − D'_i(x) | / ( D_i(x) + D'_i(x) ).
5. the visual positioning method based on adaptive histogram equalization as claimed in claim 1, wherein said deep network loss function design comprises:
the loss function is composed of three parts, the two-branch network is trained jointly in the training stage, and the overall loss function is:
L_c = α·L_photo + β·L_smooth + γ·L_depth
the first part is the photometric consistency loss, which constrains the photometric difference between adjacent image frames:
L_photo = (1/N) Σ_x | I_i(x) − Î_j(x) |
where the sum runs over the N valid pixels, and I_i(x) and Î_j(x) denote the gray values of the reference image and the warped source image;
the second part is the edge smoothness loss, which compensates the prediction accuracy in low-texture or single-plane regions:
L_smooth = Σ_x |∇D_i(x)| · e^(−|∇I_i(x)|)
where ∇ denotes the first derivative along the spatial directions;
the third part is the scale consistency loss, which uses the image depth information to constrain the network so that the pose keeps a consistent scale in long-sequence pose estimation:
[equation shown as an image in the original document]
where SSIM(I_i, I_j) denotes the structural-similarity difference between the two images.
6. The visual positioning method based on adaptive histogram equalization as claimed in claim 1, wherein the on-line inference stage comprises:
performing edge enhancement on the image based on histogram equalization, extracting the image visual features with the ResNet encoder, computing the image depth map with the DispNet decoder, and obtaining the 6-degree-of-freedom pose from the image pose and depth information with the PoseNet decoder.
CN202211106319.3A 2022-09-11 2022-09-11 Visual positioning method based on adaptive histogram equalization Pending CN115482280A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211106319.3A CN115482280A (en) 2022-09-11 2022-09-11 Visual positioning method based on adaptive histogram equalization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211106319.3A CN115482280A (en) 2022-09-11 2022-09-11 Visual positioning method based on adaptive histogram equalization

Publications (1)

Publication Number Publication Date
CN115482280A true CN115482280A (en) 2022-12-16

Family

ID=84424099

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211106319.3A Pending CN115482280A (en) 2022-09-11 2022-09-11 Visual positioning method based on adaptive histogram equalization

Country Status (1)

Country Link
CN (1) CN115482280A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117974721A * 2024-04-01 2024-05-03 Hefei University of Technology Vehicle motion estimation method and system based on monocular continuous frame images

Similar Documents

Publication Publication Date Title
CN110335290B (en) Twin candidate region generation network target tracking method based on attention mechanism
WO2018000752A1 (en) Monocular image depth estimation method based on multi-scale cnn and continuous crf
CN110782490A (en) Video depth map estimation method and device with space-time consistency
CN113012172A (en) AS-UNet-based medical image segmentation method and system
CN113572962B (en) Outdoor natural scene illumination estimation method and device
CN114066831B (en) Remote sensing image mosaic quality non-reference evaluation method based on two-stage training
CN116402942A (en) Large-scale building three-dimensional reconstruction method integrating multi-scale image features
US20230281913A1 (en) Radiance Fields for Three-Dimensional Reconstruction and Novel View Synthesis in Large-Scale Environments
CN115187638B (en) Unsupervised monocular depth estimation method based on optical flow mask
CN113963117B (en) Multi-view three-dimensional reconstruction method and device based on variable convolution depth network
CN115018888A (en) Optical flow unsupervised estimation method based on Transformer
CN112270691A (en) Monocular video structure and motion prediction method based on dynamic filter network
CN112561996A (en) Target detection method in autonomous underwater robot recovery docking
CN115035171A (en) Self-supervision monocular depth estimation method based on self-attention-guidance feature fusion
CN115511759A (en) Point cloud image depth completion method based on cascade feature interaction
CN114972748A (en) Infrared semantic segmentation method capable of explaining edge attention and gray level quantization network
CN115482280A (en) Visual positioning method based on adaptive histogram equalization
CN116402851A (en) Infrared dim target tracking method under complex background
CN115147709A (en) Underwater target three-dimensional reconstruction method based on deep learning
CN113538527A (en) Efficient lightweight optical flow estimation method
CN113989612A (en) Remote sensing image target detection method based on attention and generation countermeasure network
CN116152442B (en) Three-dimensional point cloud model generation method and device
CN117372764A (en) Non-cooperative target detection method in low-light environment
CN112785629A (en) Aurora motion characterization method based on unsupervised deep optical flow network
CN110544216A (en) Video defogging system based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination