CN111553193B

CN111553193B - Visual SLAM closed-loop detection method based on lightweight deep neural network

Info

Publication number: CN111553193B
Application number: CN202010249172.8A
Authority: CN
Inventors: 金世俊; 刘泽
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2020-04-01
Filing date: 2020-04-01
Publication date: 2022-11-11
Anticipated expiration: 2040-04-01
Also published as: CN111553193A

Abstract

The invention discloses a visual SLAM closed-loop detection method based on a lightweight deep neural network. The image recognition model is a lightweight deep neural network, which is trained on an image data set of similar scenes and optimized until it reaches a required accuracy. The goal is for the trained model to learn, from the training samples, the probability distribution underlying the image samples, so that closed loops can be detected by extracting scene features and measuring image similarity, in preparation for subsequent SLAM map optimization. The method achieves a better detection result under complex illumination, runs faster in practical deployment, and substantially improves the accuracy of the algorithm at a lower computational cost. It has important application value in closed-loop detection and related tasks.

Description

Visual SLAM closed-loop detection method based on lightweight deep neural network

Technical Field

The invention belongs to the field of computer vision and robot motion closed-loop detection, and relates to a closed-loop detection method based on a lightweight deep neural network.

Background

Closed-loop detection is the problem of determining whether a mobile robot has returned to a previously visited location. It is a key module in SLAM: it reduces the error accumulated while the environment map is built, corrects the drift of the position estimate over time, and is essential for constructing a consistent environment map. A popular and successful approach is to match the robot's current view against the views stored in its map, in the same way that the human eye recognizes two similar places. Under this approach, closed-loop detection is essentially an image matching problem.

Image matching generally comprises two steps, image description and similarity measurement. The image descriptor compresses the image into a one-dimensional vector that is more compact and more discriminative than the original image, and computing it is the most critical step in visual closed-loop detection. Many image description techniques have been applied to visual closed-loop detection with great success. However, most conventional appearance-based methods rely on hand-crafted features, that is, features designed through feature engineering, in which human expertise and insight guide the design toward the desired properties. Descriptors based on hand-crafted features share common weaknesses, including poor robustness to illumination changes and high computational cost.

With the growth of computing power and the rapid rise of GPUs in recent years, computer vision has developed greatly, and deep learning offers a new approach to image description. Deep learning methods learn features automatically from raw data and adapt better to complex environmental changes. Deep neural network models can learn and extract increasingly abstract features from visual data and have produced good results in fields such as image classification and image denoising. Closed-loop detection that uses deep learning to improve recognition is likewise developing rapidly, but neural-network-based closed-loop detection must still overcome the large number of model parameters and poor real-time performance. Consequently, if a general-purpose deep learning method is adopted directly, the algorithm does not adapt well to the variety of real scenes.

Disclosure of Invention

The invention aims to solve the above problems and provides a stable and reliable robot-motion closed-loop detection method based on a lightweight deep neural network. For a limited image data set, an image feature discrimination model based on a convolutional neural network is designed and its parameters are optimized. Once the network reaches its optimized state, it can convert any scene picture into a feature vector, and the normalized feature vectors are used to construct a similarity matrix from which closed loops are judged.

To achieve this purpose, the invention adopts a visual SLAM closed-loop detection method based on a lightweight deep neural network, comprising the following steps:

Step 1: Select a closed-loop detection data set. Training a CNN model is a supervised learning process; without label information the model cannot be trained. Therefore the lightweight deep neural network model is trained on a large-scale labeled scene data set, the trained model is used as a feature extractor for scene images, and the extracted features are finally applied to closed-loop detection;

Step 2: Construct the lightweight deep neural network. The training and test data to be fed to the model are preprocessed, and the images are uniformly resized to 224×224 (other sizes can be used as needed). Convolution kernels search for particular characteristics of the image; these features are passed through the model, related to the output, and classified; the output of the final fully connected layer is taken as the feature vector of the image;

Step 3: Optimize the network model. The image samples of the data set are loaded, and the weights of the neural network model are initialized with the MSRA method. Real image samples are fed into the lightweight deep neural network model, which is trained with the defined forward-propagation process, while back-propagation is used to iteratively optimize the model parameters. After training, the model is saved so that it can be reused directly. The saved model is then tested, and training continues until the network reaches the required discrimination accuracy;

Step 4: Perform closed-loop detection with the network model. The training in step 3 yields a deep neural network model that can be used to obtain image features. The network computes a feature descriptor for each query image (the robot's current view). The raw CNN features are post-processed in an additional enhancement step, principal component analysis (PCA) and whitening, which markedly improves their ability to represent the image and also improves computational efficiency. The resulting descriptors are used to detect loops. After normalization, a similarity matrix between images is computed from the Euclidean distance, and the rank of the matrix is reduced to suppress noise. The similarity is then measured to determine whether a loop closure has occurred, and after all images in the data set have been considered, precision and recall pairs are obtained. In this way the similarity relations between images are obtained and the closed-loop detection problem for a moving robot is solved.
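
The PCA and whitening enhancement named in step 4 can be illustrated with a minimal NumPy sketch. The function names, the target dimension of 64, the small constant `eps` and the random stand-in descriptors are all assumptions made for illustration; the patent does not specify them.

```python
import numpy as np

def fit_pca_whitening(features, n_components, eps=1e-5):
    """Fit PCA + whitening on row-wise descriptors of shape (n_samples, dim)."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Covariance matrix of the centered descriptors.
    cov = centered.T @ centered / (features.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]  # keep the largest ones
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Dividing each principal axis by sqrt(eigenvalue) gives unit variance.
    projection = eigvecs / np.sqrt(eigvals + eps)
    return mean, projection

def apply_pca_whitening(features, mean, projection):
    """Project, whiten and L2-normalize descriptors."""
    reduced = (features - mean) @ projection
    norms = np.linalg.norm(reduced, axis=1, keepdims=True)
    return reduced / np.maximum(norms, 1e-12)

rng = np.random.default_rng(0)
raw = rng.normal(size=(200, 365))   # stand-in for 365-dimensional CNN outputs
mean, projection = fit_pca_whitening(raw, n_components=64)
descriptors = apply_pca_whitening(raw, mean, projection)
```

The projection is fitted once on a set of descriptors and then reused for every query image, so the per-frame cost is a single matrix product.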

As an improvement of the invention, in step 1 the image sample data set is built from Places365-Standard. The image samples are 224×224. Because the method uses supervised learning, a training set, a validation set and a test set are required, and model training is carried out on a large labeled scene data set. The training data are stored in several TFRecord files to improve processing efficiency. A list of the data files is specified, samples are read from the TFRecord files and parsed, the gray values and the mean are preprocessed, and the samples are combined and organized into batches that serve as the neural network input. To ensure that the training samples are sufficiently representative, sample collection must cover a variety of scenes, such as different terrain, distances, illumination and camera angles.
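
The preprocessing and batching described above can be sketched as follows. This is only an illustration: TFRecord reading is omitted and the images are assumed to be already decoded into an in-memory array, the scaling to [0, 1] and per-channel mean subtraction are one plausible reading of "preprocessing of gray values and mean values", and the shuffling and all names are hypothetical.

```python
import numpy as np

def preprocess(images, channel_mean):
    """Scale uint8 gray values to [0, 1] and subtract a per-channel mean."""
    scaled = images.astype(np.float32) / 255.0
    return scaled - channel_mean

def iterate_batches(images, labels, batch_size, rng):
    """Yield (images, labels) mini-batches in a random order (an assumption)."""
    order = rng.permutation(len(images))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield images[idx], labels[idx]

rng = np.random.default_rng(0)
# Random stand-ins for decoded 224x224 RGB samples and their 365 scene labels.
images = rng.integers(0, 256, size=(10, 224, 224, 3), dtype=np.uint8)
labels = rng.integers(0, 365, size=10)
channel_mean = (images.astype(np.float32) / 255.0).mean(axis=(0, 1, 2))
inputs = preprocess(images, channel_mean)
batches = list(iterate_batches(inputs, labels, batch_size=4, rng=rng))
```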

As an improvement of the invention, the lightweight deep neural network model in step 2 follows the structure of a conventional neural network and consists of an input layer, two convolutional layers, two max-pooling layers, two blocks and a fully connected layer. The input layer is a 224×224 three-channel image. The convolutional layers Conv1 and Conv2 extract features from the input data, and the max-pooling layers Pool1 and Pool2 filter the information effectively. The convolutional and pooling layers are activated with the concatenated rectified linear unit C.ReLU, which concatenates the negated response with the original response and thereby reduces the number of parameters the convolution kernels require. In addition, batch normalization is applied after each layer to accelerate convergence.

These layers are followed by two custom modules, block1 and block2, inspired by the residual network; both use skip connections to address the difficulty of optimizing deep neural networks. To control the dimensionality of the feature maps, reduce the number of parameters and increase the operating efficiency of the network, a bottleneck structure is applied to the main branch of the residual module. Because neural models tend to have many parameters and run slowly, the network uses pointwise group convolution together with channel shuffling, which prevents information from being blocked between groups. The feature maps from the previous layer are split into two groups. Inspired by the Inception structure, convolution kernels of several different sizes are applied to the feature maps of the same layer to obtain features at different scales, which are then combined; the result is usually better than using a single convolution kernel. Depthwise convolution replaces standard convolution, which removes a large number of parameters while giving better results: each channel is learned with its own filter rather than all channels sharing the same filter, so that for the same number of weight parameters the computation is smaller and the network runs faster. The residual structure lets the gradient flow into the shallow layers more easily and avoids the vanishing-gradient problem caused by increasing depth. To improve generalization, a channel shuffle is performed after each split operation; it fuses features across groups, and after one group convolution the data enter the next group-convolution layer, and so on.

To avoid overfitting, appropriate regularization is required. Finally, the fully connected layer Fc fuses and classifies the feature maps passed forward, and the number of neurons in the output layer equals the number of categories in the data set. Network training can reach the zero-sum game solution only if the number of real samples is far larger than the number of parameters of the generated model. In addition, to ensure that the discriminant model has good adaptability and discriminative power, Dropout and L2 regularization are used to assist training.
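
Two of the building blocks named above, the C.ReLU activation and the channel shuffle, are small enough to sketch in NumPy. The channels-first layout `(batch, channels, height, width)` and the function names are assumptions for illustration; the patent does not give layer dimensions.

```python
import numpy as np

def crelu(x, axis=1):
    """Concatenate ReLU(x) and ReLU(-x) along the channel axis."""
    return np.concatenate([np.maximum(x, 0), np.maximum(-x, 0)], axis=axis)

def channel_shuffle(x, groups):
    """Interleave the channels of `groups` groups so features mix across groups."""
    n, c, h, w = x.shape
    assert c % groups == 0, "channel count must be divisible by groups"
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)   # swap the group and per-group axes
    return x.reshape(n, c, h, w)

# Six channels labelled 0..5, split into two groups of three.
x = np.arange(6, dtype=np.float32).reshape(1, 6, 1, 1)
shuffled = channel_shuffle(x, groups=2)

# C.ReLU doubles the channel count: one half for positive, one for negative responses.
activated = crelu(np.array([[-1.0, 2.0]]), axis=1)
```

After the shuffle the channel order is 0, 3, 1, 4, 2, 5, so each group of the next group convolution receives channels from both input groups.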

As an improvement of the invention, in step 3 the training of the deep neural network can be described as optimizing the model parameters according to the model output. In this method, optimization is driven by the true labels of the samples and the model output. The cross-entropy loss function L corresponding to the Softmax classifier is taken as the objective function of the training process and is expressed as follows:

where m is the number of samples in each training batch; θ is the parameter matrix of the network model to be optimized; x(i) is the i-th picture sample; y(i) is the true label of the i-th sample; and k is the number of classes. To avoid overfitting, a regularization term is added to the loss function for each weight parameter w. This introduces a measure of model complexity, which suppresses model noise and reduces overfitting. During training, all parameters of the neural network are updated continually, and the stochastic gradient descent (SGD) algorithm steadily reduces the loss function, yielding a neural network model with higher accuracy.
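
For reference, a standard softmax cross-entropy loss with an L2 penalty that is consistent with the symbols defined above can be written as below. This is a sketch, not necessarily the patent's exact expression: θ_j denotes the column of θ for class j, the classifier is written as linear in its input for brevity, and the regularization coefficient α is a hypothetical symbol introduced here.

```latex
L(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{k}
  \mathbf{1}\{\, y^{(i)} = j \,\}\,
  \log \frac{\exp\!\left(\theta_j^{\top} x^{(i)}\right)}
            {\sum_{l=1}^{k} \exp\!\left(\theta_l^{\top} x^{(i)}\right)}
  \;+\; \frac{\alpha}{2} \sum_{w} w^{2}
```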

To suppress the oscillation of SGD, inertia is added to the gradient descent. The Adam algorithm is introduced on the basis of SGD:

Here m(t) is the exponential moving average of the gradient, and v(t) is the uncentered second moment of the gradient. The Adam algorithm, or adaptive moment estimation, computes an adaptive learning rate for each parameter. Like AdaDelta, it stores an exponentially decaying average of past squared gradients, and it additionally keeps an exponentially decaying average of past gradients m(t), which is similar to momentum. The parameters β₁ and β₂, which control the exponential decay, take the empirical values 0.9 and 0.999 respectively. The momentum term updates only the parameters relevant to the samples and reduces unnecessary parameter updates, which gives faster and more stable convergence and reduces oscillation.
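
The moment estimates described above take the following form in the standard Adam algorithm. The symbol g_t for the gradient at iteration t is introduced here for illustration, and the lower-case v_t denotes the second-moment estimate.

```latex
\begin{aligned}
g_t &= \nabla_{\theta} L(\theta_{t-1}) \\
m_t &= \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^{2} \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^{\,t}}, \qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^{\,t}}
\end{aligned}
```

The division by 1 − β^t corrects the bias toward zero that results from initializing both averages at zero.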

The specific procedure of step 3 is as follows:

Step 301: Because scene recognition is a multi-class image classification problem, the network uses a Softmax classifier at its output to classify the input image. Softmax maps the outputs of several neurons into the interval (0, 1) so that they can be read as class probabilities. The input images are preprocessed by mean centering and passed to the neural network in batches; the forward computation is carried out in the network model, and the prediction result s is obtained through the discriminant formula:

where θ is the parameter matrix of the network model and k is the number of classes;

Step 302: The model parameters are optimized with stochastic gradient descent and an adaptive-learning-rate gradient descent optimization algorithm. After the prediction result has been obtained from the discriminant formula, the parameters are updated using the training samples and the expected values so that the loss is minimized. Each update learns from a sample drawn at random from the training set, so each learning step is fast and the model can be updated online. Parameters such as the weights and biases of the network are updated by the following equation:

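$$\theta_t = \theta_{t-1} - V_t$$

where $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected values of the moment estimates $m_t$ and $v_t$, $V_t$ is the parameter update value of the t-th iteration, λ is the learning rate of the negative gradient, $\nabla_{\theta} L(\theta; x^{(i)}, y^{(i)})$ is the partial derivative of the loss function with respect to the parameters, $x^{(i)}, y^{(i)}$ are the training samples, and $\theta_t$ is the parameter value at the t-th iteration.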
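
Steps 301 and 302 can be sketched together on a toy problem. The example below trains a linear softmax classifier with mini-batch Adam updates; the linear model, the synthetic three-class data, the learning rate of 0.1 and the explicit form of the update V_t (the standard Adam update) are all assumptions for illustration and stand in for the patent's network and missing formulas.

```python
import numpy as np

def softmax(logits):
    """Map each row of scores to class probabilities in (0, 1)."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def loss_and_grad(theta, x, y):
    """Mean cross-entropy of a linear softmax classifier and its gradient."""
    n, k = x.shape[0], theta.shape[1]
    probs = softmax(x @ theta)                      # step 301: prediction s
    loss = -np.log(probs[np.arange(n), y]).mean()
    grad = x.T @ (probs - np.eye(k)[y]) / n
    return loss, grad

def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `lr` plays the role of the learning rate lambda."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)                    # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                    # bias-corrected second moment
    update = lr * m_hat / (np.sqrt(v_hat) + eps)    # assumed form of V_t
    return theta - update, m, v

# Synthetic data: three well-separated classes in two dimensions.
rng = np.random.default_rng(0)
centers = np.array([[4.0, 0.0], [-4.0, 0.0], [0.0, 4.0]])
y = rng.integers(0, 3, size=300)
x = centers[y] + 0.5 * rng.normal(size=(300, 2))

theta = np.zeros((2, 3))
m, v = np.zeros_like(theta), np.zeros_like(theta)
initial_loss, _ = loss_and_grad(theta, x, y)
for t in range(1, 201):                             # step 302: iterative updates
    batch = rng.choice(len(x), size=32, replace=False)
    _, grad = loss_and_grad(theta, x[batch], y[batch])
    theta, m, v = adam_step(theta, grad, m, v, t)
final_loss, _ = loss_and_grad(theta, x, y)
accuracy = (softmax(x @ theta).argmax(axis=1) == y).mean()
```

With all-zero initial parameters every class has probability 1/3, so the initial loss equals log 3; the loss then falls as the updates proceed.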

As an improvement of the invention, in step 4 the image feature has 365 dimensions. A feature threshold is applied to obtain a key frame library, the Euclidean distance between feature vectors is computed for each key frame image, a similarity matrix is obtained, and the closed-loop frame is found. The nearest neighbor is located and a distance threshold is applied to determine whether a loop closure has occurred: if the similarity is greater than the set threshold, a closed loop is declared. A precision-recall curve is obtained by varying the distance threshold, and similar loop candidates are retrieved by searching the key frames. The closed-loop detection precision-recall curve and the detected closed loops are output for use in subsequent SLAM map optimization. Different training hyperparameter settings can be tried during actual model training, and the best-performing model is selected.

(1) Select key frames. To avoid key frames that lie too close together, which would make two key frames excessively similar, the frames used for loop detection must be sparse, must differ from one another, and must cover the whole environment. A new key frame is taken and stored each time the camera has moved a certain interval. In addition, loop-closure images are determined by limiting the range of images matched against the current position, and this range is set with a threshold S. Specifically, if the current image is the N-th and S images are excluded, a loop closure can occur only with images other than the S frames immediately preceding the current image.

(2) Obtain a candidate key frame library from the key frames. The system does not match the current key frame directly against every possible closed-loop frame. Instead it first collects the nearby key frames that share at least W non-zero categories with the current key frame and groups them into a set. W must be chosen sensibly: if W is too small, too many key frames are collected and the computation grows; if W is too large, it may exceed the number of categories and no closed-loop frame can be obtained.

(3) Compute the similarity score between the current key frame and each frame in the key frame library. The vectors are first normalized, and the similarity between images is measured by the Euclidean distance between their feature vectors. Because distance and similarity are negatively correlated, a decreasing function is chosen to compute the score, so that a smaller distance gives a higher similarity score. A distance threshold τ_i is then applied to determine whether a loop closure has occurred;

In the above formula, dis(i, j) is the distance between images I_i and I_j, G is the similarity score, and k₁ and k₂ are process parameters with k₁ < 0. The similarity score is normalized to [0, 1] before loop closure is detected; for measurement purposes, normalized distances are used so that the score lies in [0, 1].
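
Items (1) to (3) can be sketched as a small pipeline. Several points are assumptions rather than the patent's definitions: a category counts as "non-zero" when its output exceeds a hypothetical `feature_threshold`, the score is taken to be the linear function G = k₁·dis + k₂ with k₁ = −0.5 and k₂ = 1 (which maps the distance between unit vectors, at most 2, into [0, 1]), and the four-dimensional toy features stand in for the 365-dimensional network output.

```python
import numpy as np

def loop_candidates(current_index, excluded_recent):
    """Item (1): earlier key frames, excluding the S most recent ones."""
    return np.arange(max(current_index - excluded_recent, 0))

def candidate_library(query, keyframes, candidates, w, feature_threshold=0.05):
    """Item (2): keep candidates sharing at least `w` active categories."""
    active_query = query > feature_threshold
    active = keyframes[candidates] > feature_threshold
    shared = (active & active_query).sum(axis=1)
    return candidates[shared >= w]

def similarity_scores(query, keyframes, library, k1=-0.5, k2=1.0):
    """Item (3): decreasing linear score of the distance between unit vectors."""
    unit_q = query / np.linalg.norm(query)
    rows = keyframes[library]
    unit_k = rows / np.linalg.norm(rows, axis=1, keepdims=True)
    dist = np.linalg.norm(unit_k - unit_q, axis=1)   # lies in [0, 2]
    return k1 * dist + k2

keyframes = np.array([
    [0.7, 0.3, 0.0, 0.0],   # frame 0
    [0.0, 0.0, 0.5, 0.5],   # frame 1
    [0.6, 0.4, 0.0, 0.0],   # frame 2
    [0.0, 0.1, 0.9, 0.0],   # frame 3
    [0.7, 0.3, 0.0, 0.0],   # frame 4: too recent, excluded by S = 1
])
query = np.array([0.7, 0.3, 0.0, 0.0])               # current frame, index 5

candidates = loop_candidates(current_index=5, excluded_recent=1)
library = candidate_library(query, keyframes, candidates, w=2)
scores = similarity_scores(query, keyframes, library)
tau = 0.95                                           # hypothetical threshold
loops = library[scores > tau]
```

Here frames 0 and 2 share two active categories with the query and enter the library, and only frame 0, whose feature vector matches the query, exceeds the threshold.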

(4) Reduce the rank of the similarity matrix to suppress noise. The similarity scores of all pairs of key frames form a matrix M that describes the relations between them. M is a real symmetric n × n matrix, so there exist an orthogonal matrix V and a diagonal matrix D such that M satisfies the following formula, where v_i is the i-th eigenvector and d_i is the corresponding eigenvalue on the diagonal of D:

$$M = V D V^{\top} = \sum_{i=1}^{n} d_i\, v_i v_i^{\top}$$

The dominant eigenvectors of M correspond to themes that pervade a particular environment. They are detrimental to loop-closure detection, because the repetitive appearance of different scenes creates ambiguity and leads to false-positive detections. Removing the largest eigenvalues by means of a rank-reduced matrix lowers the noise, preserves the true loops and reduces detection ambiguity.

The above formula computes the share of λ_i in the sum of λ_r through λ_n. The entropy H(M, r) measures the complexity of the decomposition of M; outer products are removed from M one at a time, and the r that maximizes H(M, r) is obtained;

This yields a similarity matrix in which no single theme dominates, and the rank-reduced matrix is used in place of M. By decomposing the similarity matrix into a series of outer products, the effect of common similarity can be removed without removing the images themselves, and the discriminability of closed-loop detection is enhanced. The best candidate loop frame for the current frame is obtained by examining the high-score regions of the matrix.
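
The rank reduction can be sketched with NumPy's symmetric eigendecomposition. The entropy is assumed here to be H(M, r) = −(1/log n) Σ ρ(i, r) log ρ(i, r) with ρ(i, r) = λ_i / (λ_r + … + λ_n), which matches the verbal description above; the 1/log n normalization, the clipping of non-positive eigenvalues and the toy matrix are assumptions. The code counts r from zero, so the returned value is the number of leading outer products removed.

```python
import numpy as np

def rank_reduce(M, eps=1e-12):
    """Remove the leading outer products of a symmetric similarity matrix."""
    M = 0.5 * (M + M.T)                      # enforce exact symmetry
    lam, V = np.linalg.eigh(M)               # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]            # re-sort with the largest first
    lam, V = lam[order], V[:, order]
    n = lam.size
    weights = np.clip(lam, eps, None)        # assumption: ignore non-positive values
    best_r, best_h = 0, -np.inf
    for r in range(n - 1):                   # always keep at least two terms
        rho = weights[r:] / weights[r:].sum()
        h = -(rho * np.log(rho)).sum() / np.log(n)
        if h > best_h:
            best_r, best_h = r, h
    kept = V[:, best_r:]
    reduced = (kept * lam[best_r:]) @ kept.T  # sum of the remaining outer products
    return reduced, best_r

# Toy matrix: a strong similarity shared by every pair plus small individual terms.
M = 5.0 * np.ones((6, 6)) + np.diag([1.0, 1.1, 0.9, 1.05, 0.95, 1.0])
reduced, removed = rank_reduce(M)
```

In this toy case the shared component produces one dominant eigenvalue; removing it leaves eigenvalues of similar size, which is where the entropy peaks.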

(5) Verify the candidate. The loop frames detected for the i frames preceding the current frame are checked for a direct connection with the best candidate loop frame, and the best candidate that passes this spatial-continuity check is accepted as the loop frame. After all images in the data set have been considered, precision and recall pairs are obtained. Once a loop is found, the spanning tree of the adjacent frames is computed and the entire trajectory is optimized.
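
The precision and recall pairs mentioned above can be computed by sweeping the decision threshold over the similarity scores. In this sketch the scores, the ground-truth flags and the thresholds are made-up values, and the convention of reporting precision 1.0 when nothing is predicted is an assumption.

```python
import numpy as np

def precision_recall(scores, is_true_loop, thresholds):
    """Return one (precision, recall) pair per similarity threshold."""
    pairs = []
    for tau in thresholds:
        predicted = scores > tau                 # a loop is declared above tau
        tp = np.sum(predicted & is_true_loop)
        fp = np.sum(predicted & ~is_true_loop)
        fn = np.sum(~predicted & is_true_loop)
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        pairs.append((float(precision), float(recall)))
    return pairs

scores = np.array([0.9, 0.8, 0.6, 0.3])          # similarity of four candidate pairs
truth = np.array([True, False, True, False])     # hypothetical ground truth
curve = precision_recall(scores, truth, thresholds=[0.85, 0.5, 0.1])
```

Lowering the threshold raises recall at the expense of precision, which is the trade-off the precision-recall curve displays.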

Compared with the prior art, the invention has the following advantages:

(1) The method is based on a lightweight deep convolutional neural network and can express semantic information that hand-crafted features struggle to capture. Obtaining image features with the neural network model strengthens perception and allows the texture and distribution characteristics of the image samples to be learned effectively. In addition, the lightweight design makes the model faster in practical deployment, so detection speed is improved while high accuracy is maintained.

(2) The invention adds an enhancement step for the raw features produced by the network model, from which the final loop-closure decision is obtained. The raw features are preprocessed, a key frame library is obtained algorithmically, and the pairwise distances of the candidate key frames are computed to judge loops. The rank-reduction operation markedly improves the ability of the features to represent images and improves computational efficiency, and the verification performed after the decision improves the accuracy of the detection result.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a simplified diagram of a lightweight deep neural network architecture of the present invention;

FIG. 3 is a structure diagram of block1 of the network of the present invention;

FIG. 4 is a structure diagram of block2 of the network of the present invention;

FIG. 5 is a schematic diagram of loop detection in the present invention.

Detailed Description

The invention is further described with reference to the following figures and specific embodiments.

A robot-motion closed-loop detection method based on a lightweight deep neural network, as shown in the flow chart of FIG. 1, comprises the following steps:

Step 1: Select a closed-loop detection data set. Training a CNN model is a supervised learning process; without label information the model cannot be trained. Therefore the lightweight deep neural network model is trained on a large labeled scene data set, the trained model is used as a feature extractor for scene images, and the extracted features are finally applied to closed-loop detection.

The image sample data set is built from Places365-Standard, and the image samples are 224×224. Because the method uses supervised learning, a training set, a validation set and a test set are required, and model training is carried out on a large labeled scene data set. The training data are stored in several TFRecord files to improve processing efficiency. A list of the data files is specified, samples are read from the TFRecord files and parsed, the gray values and the mean are preprocessed, and the samples are combined and organized into batches that serve as the neural network input. To ensure that the training samples are sufficiently representative, sample collection must cover a variety of scenes, such as different terrain, distances, illumination and camera angles.

Step 2: Construct the lightweight deep neural network. The training and test data to be fed to the model are preprocessed, and the images are uniformly resized to 224×224 (other sizes can be used as needed). The model is composed of convolutional layers, pooling layers, blocks and a fully connected layer. Convolution kernels search for particular characteristics of the image; these features are passed through the model, related to the output, and classified; the output of the last fully connected layer is taken as the feature vector of the image.

The lightweight deep neural network model follows the structure of a conventional neural network. A simplified diagram of the network is shown in FIG. 2: it comprises an input layer, two convolutional layers, two max-pooling layers, two blocks and a fully connected layer. Simplified structures of block1 and block2 are shown in FIG. 3 and FIG. 4 respectively. The input layer is a 224×224 three-channel image. The convolutional layers Conv1 and Conv2 extract features from the input data, and the max-pooling layers Pool1 and Pool2 filter the information effectively. The convolutional and pooling layers are activated with the concatenated rectified linear unit C.ReLU, which concatenates the negated response with the original response and thereby reduces the number of parameters the convolution kernels require. In addition, batch normalization is applied after each layer to accelerate convergence.

These layers are followed by two custom modules, block1 and block2, inspired by the residual network; both use skip connections to address the difficulty of optimizing deep neural networks. To control the dimensionality of the feature maps, reduce the number of parameters and increase the operating efficiency of the network, a bottleneck structure is applied to the main branch of the residual module. Because neural models tend to have many parameters and run slowly, the network uses pointwise group convolution together with channel shuffling, which prevents information from being blocked between groups. The feature maps from the previous layer are split into two groups. Inspired by the Inception structure, convolution kernels of several different sizes are applied to the feature maps of the same layer to obtain features at different scales, which are then combined; the result is usually better than using a single convolution kernel. The residual structure lets the gradient flow into the shallow layers more easily and avoids the vanishing-gradient problem caused by increasing depth. To improve generalization, a channel shuffle is performed after each split operation; it fuses features across groups, and after one group convolution the data enter the next group-convolution layer, and so on.

To avoid overfitting, appropriate regularization is required. Finally, the fully connected layer Fc fuses and classifies the feature maps passed forward, and the number of neurons in the output layer equals the number of categories in the data set. Network training can reach the zero-sum game solution only if the number of real samples is far larger than the number of parameters of the generated model. In addition, to ensure that the discriminant model has good adaptability and discriminative power, Dropout and L2 regularization are used to assist training.

Step 3: Optimize the network model. The image samples of the data set are loaded, and the weights of the neural network model are initialized with the MSRA method. Real image samples are fed into the lightweight deep neural network model, which is trained with the defined forward-propagation process, while back-propagation is used to iteratively optimize the model parameters. After training, the model is saved so that it can be reused directly. The saved model is then tested, and training continues until the network reaches the required discrimination accuracy.
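
MSRA initialization (also known as He initialization) draws each weight from a zero-mean Gaussian whose standard deviation is sqrt(2 / fan_in). A minimal sketch follows; the kernel layout `(height, width, in_channels, out_channels)` and the example shape are assumptions, not values from the patent.

```python
import numpy as np

def msra_init(shape, rng):
    """Gaussian weights with std sqrt(2 / fan_in) for a convolution kernel.

    `shape` is assumed to be (kernel_h, kernel_w, in_channels, out_channels).
    """
    fan_in = shape[0] * shape[1] * shape[2]
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=shape)

rng = np.random.default_rng(0)
weights = msra_init((3, 3, 64, 128), rng)   # hypothetical 3x3 layer, 64 -> 128 channels
```

The factor 2 compensates for ReLU-type activations discarding roughly half of the signal, which keeps the activation variance stable from layer to layer.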

In step 3, the training of the deep neural network can be described as optimizing the model parameters according to the model output. In this method, optimization is driven by the true labels of the samples and the model output. The cross-entropy loss function L corresponding to the Softmax classifier is taken as the objective function of the training process and is expressed as follows:

To suppress the oscillation of SGD, inertia is added to the gradient descent. The Adam algorithm is introduced on the basis of SGD:

Here m(t) is the exponential moving average of the gradient, and v(t) is the uncentered second moment of the gradient. The Adam algorithm, or adaptive moment estimation, computes an adaptive learning rate for each parameter. Like AdaDelta, it stores an exponentially decaying average of past squared gradients, and it additionally keeps an exponentially decaying average of past gradients m(t), which is similar to momentum. The parameters β₁ and β₂, which control the exponential decay, take the empirical values 0.9 and 0.999 respectively. The momentum term updates only the parameters relevant to the samples and reduces unnecessary parameter updates, which gives faster and more stable convergence and reduces oscillation.

The specific procedure of step 3 is as follows:

$$\theta_t = \theta_{t-1} - V_t$$

where $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected values of the moment estimates $m_t$ and $v_t$, $V_t$ is the parameter update value of the t-th iteration, λ is the learning rate of the negative gradient, $\nabla_{\theta} L(\theta; x^{(i)}, y^{(i)})$ is the partial derivative of the loss function with respect to the parameters, $x^{(i)}, y^{(i)}$ are the training samples, and $\theta_t$ is the parameter value at the t-th iteration.

Step 4: Perform closed-loop detection with the network model, as shown in FIG. 5. The training in step 3 yields a deep neural network model that can be used to obtain image features. The network computes a feature descriptor for each query image (the robot's current view). The raw CNN features are post-processed in an additional enhancement step, principal component analysis (PCA) and whitening, which markedly improves their ability to represent the image and also improves computational efficiency. The resulting descriptors are used to detect loops. After normalization, a similarity matrix between images is computed from the Euclidean distance, and the rank of the matrix is reduced to suppress noise. The similarity is then measured to determine whether a loop closure has occurred, and after all images in the data set have been considered, precision and recall pairs are obtained. In this way the similarity relations between images are obtained and the closed-loop detection problem for a moving robot is solved.

In step 4, the image feature in the method has 365 dimensions. A feature threshold is applied to obtain a key frame library; for each key frame image, the Euclidean distance between feature vectors is calculated to obtain a similarity matrix, and the closed-loop frame is found. The nearest neighbor is found and a distance threshold is applied to determine whether a loop closure has occurred: if the similarity is greater than the set threshold, the match is judged to be a closed loop. A precision-recall curve is obtained by varying the distance threshold, and similar loops are retrieved through the key frames. The closed-loop detection precision-recall curve and the detected closed loops are output for use in subsequent SLAM mapping optimization. Different training hyperparameter settings can be tested during actual model training, and the best-performing model is selected.
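How a precision-recall curve arises from varying the distance threshold can be sketched as below. The function names, the ground-truth flag list and the convention that an empty prediction set has precision 1.0 are assumptions made for this illustration, not details from the patent.

```python
def precision_recall(distances, is_true_loop, threshold):
    """Precision and recall when every pair with distance <= threshold is declared a loop."""
    tp = fp = fn = 0
    for d, truth in zip(distances, is_true_loop):
        predicted = d <= threshold
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif truth:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention: no predictions
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def pr_curve(distances, is_true_loop, thresholds):
    """One (precision, recall) pair per threshold value."""
    return [precision_recall(distances, is_true_loop, t) for t in thresholds]
```

A loose threshold accepts more candidate pairs, which raises recall but admits false positives; a strict threshold does the opposite.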

(1) Key frame judgment. To avoid key frames that are too close to each other, which would make the similarity between two key frames too high, the frames used for loop detection need to be sparse, differ from one another, and cover the whole environment. Each time the camera moves a certain interval, a new key frame is taken and stored. In addition, loop-closure images are determined by limiting the range of images matched against the current position, and the range of images to be checked is set with a threshold S. Specifically, if the current image number is N and the number of excluded images is S, then loop closure can occur only with images other than the S frames preceding the current image.
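The key frame spacing and the exclusion of the S most recent frames can be illustrated with a small sketch. It assumes 0-based frame indices and treats the "interval" as a fixed number of frames, which is only one possible reading of the text; both function names are hypothetical.

```python
def select_keyframes(num_frames, interval):
    """Keep every `interval`-th frame as a key frame (assumed frame-count interval)."""
    return list(range(0, num_frames, interval))

def candidate_indices(current_index, excluded_s):
    """Indices that may close a loop with the current frame.

    The S frames immediately preceding the current frame are excluded,
    so only indices 0 .. current_index - excluded_s - 1 remain.
    """
    return list(range(max(current_index - excluded_s, 0)))
```

For example, with the current image at index 10 and S = 3, frames 7, 8 and 9 are skipped and only frames 0 to 6 are compared.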

(3) Calculating a similarity score between the key frame and each frame in the key frame library. First the vectors are normalized, and the similarity score between images is measured with the Euclidean distance of the feature vectors. Because distance and similarity are negatively correlated, a negatively correlated function is chosen to record the score, so that the smaller the distance, the higher the similarity score. A distance threshold τ_i is then applied to determine whether a loop closure has occurred;

In the above formula, dis(i, j) is the distance between images I_i and I_j, G is the similarity score, and k₁, k₂ are process parameters, where k₁ < 0. The similarity score is normalized to [0, 1] before loop closure is detected; for measurement purposes, the normalized distance is used so that the score value lies in [0, 1].
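The scoring formula itself is not legible in this text, so the sketch below only shows one possible negatively correlated mapping, a linear function G = k₁·dis + k₂ clipped to [0, 1]. The linear form, the default values k₁ = -0.5 and k₂ = 1 (chosen because the distance between two unit vectors lies in [0, 2]), and both function names are assumptions, not the patent's formula.

```python
def similarity_score(distance, k1=-0.5, k2=1.0):
    """Assumed linear score G = k1 * distance + k2 with k1 < 0, clipped to [0, 1]."""
    score = k1 * distance + k2
    return min(1.0, max(0.0, score))

def is_loop_closure(distance, tau):
    """Declare a loop closure when the distance does not exceed the threshold tau."""
    return distance <= tau
```

With these defaults, identical descriptors (distance 0) score 1.0 and opposite unit descriptors (distance 2) score 0.0.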

(4) Performing rank reduction on the similarity matrix to suppress noise. The similarity scores of each pair of key frames form a matrix M that describes the relationship between them. M is a real symmetric n × n matrix, so there exist an orthogonal matrix V and a diagonal matrix D such that M satisfies the following formula, where v_i is an eigenvector and d_i is the corresponding eigenvalue on the diagonal:

$M = V D V^{T} = \sum_{i=1}^{n} d_i v_i v_i^{T}$

The dominant eigenvectors of M correspond to themes that pervade the particular environment. They are harmful to loop-closure detection: because different scenes share repetitive properties, they create ambiguity and lead to false-positive detections. Using a rank-reduced matrix that removes the largest eigenvalues reduces the noise, preserves the true loops, and helps reduce detection ambiguity.

A similarity matrix in which no single theme dominates is thus obtained, and the rank-reduced matrix is used in place of M. By decomposing the similarity matrix into a series of outer products, the influence of common-mode similarity can be removed without removing the images themselves, and the discrimination in closed-loop detection is also enhanced. The optimal candidate loop frame for the current frame is obtained by examining the high-scoring regions of the matrix.
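Removing dominant outer products from a symmetric similarity matrix can be sketched with power iteration and deflation, as below. This is an illustrative stand-in rather than the patent's procedure: it assumes a non-negative symmetric matrix whose largest-magnitude eigenvalue is the positive dominant one, the number r of removed components is passed in directly instead of being selected automatically, and the function names are hypothetical.

```python
import math

def _matvec(matrix, vec):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vec)) for row in matrix]

def dominant_eigenpair(matrix, iterations=500):
    """Largest-magnitude eigenvalue and unit eigenvector by power iteration."""
    n = len(matrix)
    vec = [1.0 / math.sqrt(n)] * n  # all-ones start suits non-negative matrices
    for _ in range(iterations):
        product = _matvec(matrix, vec)
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break
        vec = [x / norm for x in product]
    # The Rayleigh quotient gives the eigenvalue for the converged unit vector.
    eigenvalue = sum(x * y for x, y in zip(vec, _matvec(matrix, vec)))
    return eigenvalue, vec

def remove_dominant_components(matrix, r=1):
    """Subtract the r dominant outer products d_i * v_i * v_i^T from a copy of the matrix."""
    reduced = [row[:] for row in matrix]
    n = len(reduced)
    for _ in range(r):
        d, v = dominant_eigenpair(reduced)
        for i in range(n):
            for j in range(n):
                reduced[i][j] -= d * v[i] * v[j]
    return reduced
```

For the 2 × 2 matrix [[2, 1], [1, 2]], whose eigenvalues are 3 and 1, removing one component leaves approximately [[0.5, -0.5], [-0.5, 0.5]], that is, only the weaker component. A production implementation would more likely use a library eigendecomposition.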

(5) It is necessary to verify whether the loop frames detected for the i frames preceding the current frame are directly connected to the optimal candidate loop frame; the optimal candidate loop frame that passes this spatial continuity check is determined to be the loop frame. After all the images in the data set have been considered, precision and recall pairs are obtained; once a loop is found, the spanning tree of the adjacent frames is computed and the entire trajectory is optimized.

Claims

1. A robot walking closed-loop detection method based on a lightweight deep neural network is characterized by comprising the following steps:

step 1, selecting a closed-loop detection data set, finishing training of a lightweight deep neural network model on a large-scale labeled scene data set, using the trained model as a feature extractor of a scene image, and finally applying the extracted features to closed-loop detection;

step 2: building a lightweight deep neural network and preprocessing the training data and test data to be input to the model, wherein the model consists of convolutional layers, pooling layers, blocks and a fully connected layer; the convolution kernels search for features of certain aspects of an image, the features are input into the model, related to the result and classified, and the output of the final fully connected layer is taken as the feature vector of the image;

step 3: optimizing the network model: loading image samples of the data set, initializing the weights of the neural network model with MSRA initialization, inputting real image samples into the lightweight deep neural network model, training the neural network model with the defined forward propagation process, alternately training and optimizing the model parameters with back propagation, saving the trained model after training is finished so that it can be used directly next time, testing with the model saved after the previous training, and training the network model until it reaches a certain discrimination accuracy;

step 4: performing closed-loop detection with the network model: a deep neural network model that can be used to obtain image features is obtained through the training in step 3; a feature descriptor of each query image is calculated with the neural network; the original CNN features are preprocessed, an enhancement step is added, principal component analysis and whitening are performed, and the result is finally used for loop-closure detection; after normalization, a similarity matrix between the images is obtained according to the Euclidean distance, rank reduction is performed on the matrix to reduce noise, whether a loop closure occurs is determined by measuring the similarity, and precision and recall pairs are obtained after all images in the data set have been considered.

2. The robot walking closed-loop detection method based on the lightweight deep neural network as claimed in claim 1, wherein: in step 1, Places365-Standard is used to establish the image sample data set, and the image sample size is 224 × 224; the training data are stored in a TFRecord file; samples are then read from the TFRecord file and parsed: a file list of the original data is specified, the data are read from the files, and after gray-value and mean-subtraction preprocessing the data are combined and arranged into batches that serve as the neural network input.

3. The robot walking closed-loop detection method based on the lightweight deep neural network as claimed in claim 1, wherein: the lightweight deep neural network model in step 2 consists of an input layer, two convolutional layers, two max pooling layers, two blocks and a fully connected layer; the input layer is a 224 × 224 three-channel image; the convolutional layers Conv1 and Conv2 perform feature extraction on the input data, the max pooling layers Pool1 and Pool2 filter out the effective information, and the convolutional layers and pooling layers are activated with the rectified linear unit (ReLU).

4. The robot walking closed-loop detection method based on the lightweight deep neural network as claimed in claim 1, wherein: in step 3, optimization is performed according to the true labels in the samples and the results generated by the model, and the cross-entropy loss function L corresponding to the Softmax classifier is taken as the objective function of the training process, expressed by the following formula:

wherein m is the number of samples of each training batch; theta is a parameter matrix to be optimized of the network model; x (i) is the ith picture sample; y (i) is the ith sample true label; k is the classification number;

a weighted term for each parameter w is added to the loss function, introducing a model complexity index (regularization), thereby suppressing model noise and reducing overfitting; during training, all parameters of the neural network change continuously, the loss function is reduced continuously by the stochastic gradient descent (SGD) algorithm, and a neural network model with higher accuracy is trained;

momentum (inertia) is added to the gradient descent process to suppress the oscillation of SGD, and the Adam algorithm is introduced on the basis of SGD:

m(t) is the exponential moving average of the gradient, and V(t) is the uncentered variance (second moment) of the gradient;

the specific deployment flow in step 3 is as follows:

step 301, classifying an input image by using a Softmax classifier, mapping output of a plurality of neurons into a (0, 1) interval by Softmax, calculating the probability of the class, performing mean centering preprocessing on the input image, transmitting the processed images into a neural network in batches, performing forward calculation in a network model, and obtaining a prediction result s through a discrimination formula:

step 302, optimizing the model parameters with the stochastic gradient descent algorithm and an adaptive-learning-rate gradient descent optimization algorithm: after the data pass through the discrimination formula to obtain a prediction result, the parameters are updated with the training samples and expected values; the optimal parameters are solved with the stochastic gradient descent (SGD) iteration, one sample being selected at random from the training set for learning each time, and the weights, biases and other parameters of the network are updated with the following formulas:

$\theta_t = \theta_{t-1} - V_t$

where $\hat{m}_t$ and $\hat{V}_t$ are the bias-corrected values of $m_t$ and $V_t$, $V_t$ in the update formula is the parameter update value of the t-th iteration, and λ is the learning rate of the negative gradient.

5. The robot walking closed-loop detection method based on the lightweight deep neural network according to claim 1, characterized in that: the specific flow of the step 4 is as follows:

(1) Key frame judgment: each time the camera moves a certain interval, a new key frame is taken and stored; loop-closure images are determined by limiting the range of images matched against the current position, and the range of images to be checked is set with a threshold S; if the current image number is N and the number of excluded images is S, loop closure can occur only with images other than the S frames preceding the current image;

(2) Acquiring a candidate key frame library from the key frames: first, key frames that are near the current key frame and share at least W non-zero categories with it are obtained and collected; the value of W is chosen appropriately;

(3) Calculating a similarity score between the key frame and each frame in the key frame library: first the vectors are normalized, and the similarity score between images is measured with the Euclidean distance of the feature vectors; a negatively correlated function is chosen to record the score, so that the smaller the distance, the higher the similarity score; a distance threshold τ_i is then applied to determine whether a loop closure has occurred;

in the above formula, dis(i, j) is the distance between images I_i and I_j, G is the similarity score, and k₁, k₂ are process parameters, where k₁ < 0; the similarity score is normalized to [0, 1] before loop closure is detected, and the normalized distance is used so that the score value lies in [0, 1];

(4) Performing rank reduction on the similarity matrix to suppress noise: the similarity scores of each pair of key frames form a matrix M that describes the relationship between them; M is a real symmetric n × n matrix, so there exist an orthogonal matrix V and a diagonal matrix D such that M satisfies the following formula, where v_i is an eigenvector and d_i is the corresponding eigenvalue on the diagonal:

$M = V D V^{T} = \sum_{i=1}^{n} d_i v_i v_i^{T}$

the dominant eigenvectors of M correspond to themes that pervade the particular environment and are harmful to loop-closure detection, because the repetitive properties of different scenes create ambiguity and lead to false-positive detections; a rank-reduced matrix that removes the largest eigenvalues is used to reduce the noise, preserve the true loops and reduce detection ambiguity;

the above formula computes the proportion that λ_i occupies among λ_r to λ_n; the entropy H(M, r) measures the complexity of the decomposition of M; outer products are removed sequentially from M, and the value of r that maximizes H(M, r) is obtained;

M is replaced with the rank-reduced matrix; by decomposing the similarity matrix into a series of outer products, the influence of common-mode similarity is removed without removing the images themselves, and the discrimination in closed-loop detection is enhanced; the optimal candidate loop frame of the current frame is obtained by examining the high-scoring regions of the matrix;

(5) It is verified whether the loop frames detected for the i frames preceding the current frame are directly connected to the optimal candidate loop frame, and the optimal candidate loop frame that passes this spatial continuity check is determined to be the loop frame; after all images in the data set have been considered, precision and recall pairs are obtained, and once a loop is found, the spanning tree of the adjacent frames is computed and the entire trajectory is optimized.