CN113793472A - Image type fire detector pose estimation method based on feature depth aggregation network - Google Patents

Image type fire detector pose estimation method based on feature depth aggregation network

Info

Publication number
CN113793472A
Authority
CN
China
Prior art keywords
feature
image
pose
fire detector
aggregation network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111078643.4A
Other languages
Chinese (zh)
Other versions
CN113793472B (en)
Inventor
钟晨
王珂
戴崑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Fire Research Institute of MEM
Original Assignee
Shenyang Fire Research Institute of MEM
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Fire Research Institute of MEM filed Critical Shenyang Fire Research Institute of MEM
Priority to CN202111078643.4A priority Critical patent/CN113793472B/en
Publication of CN113793472A publication Critical patent/CN113793472A/en
Application granted granted Critical
Publication of CN113793472B publication Critical patent/CN113793472B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G08: SIGNALLING
    • G08B: SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B 17/00: Fire alarms; Alarms responsive to explosion
    • G08B 17/12: Actuation by presence of radiation or particles, e.g. of infrared radiation or of ions
    • G08B 17/125: Actuation by presence of radiation or particles, e.g. of infrared radiation or of ions, by using a video camera to detect fire or smoke
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/70: Determining position or orientation of objects or cameras
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/90: Determination of colour characteristics

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Emergency Management (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Fire-Detection Mechanisms (AREA)

Abstract

The invention relates to a method for estimating the pose of an image type fire detector based on a feature depth aggregation network, and belongs to the technical field of video image shooting pose estimation. The method comprises the following steps: S1, collecting data in different building environments, the collected data comprising RGB images, depth maps, and the camera pose recorded at the moment each frame of image is shot; S2, preprocessing the data acquired in S1; S3, building a feature depth aggregation network; S4, training the feature depth aggregation network to obtain an optimal network model; and S5, inputting the normalized RGB images of the test set into the optimal network model obtained in S4 and calculating the pose of the image type fire detector. The method improves the learning ability of the network, prevents overfitting, and improves the positioning accuracy of the image type fire detector.

Description

Image type fire detector pose estimation method based on feature depth aggregation network
Technical Field
The invention relates to the technical field of video image shooting pose estimation, and in particular to an image type fire detector pose estimation method based on a feature depth aggregation network.
Background
In recent years, with the wide deployment of video data acquisition and the development of video image pattern recognition technology, research on video image fire detection methods has been advancing steadily. At present, image type fire detectors are widely applied to indoor and outdoor areas of large-space buildings and cultural buildings, as well as to fire prevention and control in environments such as forests and grasslands. The effective fire monitoring range can be determined from parameters such as the field of view, resolution, and focal length of the camera in the image type fire detector, but there is no effective means of estimating the pose at which the camera is mounted, which makes accurate modeling of the detector's effective monitoring area difficult. In actual installation and use, monitoring blind spots or overlapping areas may exist, leading to under-protection or over-protection. To support tasks such as registering the effective monitoring area of an image type fire detector in a three-dimensional scene and full-scene visual modeling, estimating the shooting pose of the detector camera in a specific scene is a problem that must be solved.
Traditional camera localization requires detecting key points and computing and matching image descriptors, or collecting point cloud data with a depth camera and performing point cloud registration; this consumes a large amount of computation time and makes high-precision localization difficult. To address these problems, Alex Kendall et al. proposed using a neural network for camera localization: with GoogLeNet as the backbone network and transfer learning from a model trained on the Places classification task, the camera pose is predicted directly, achieving high-precision camera localization. Eric Brachmann et al. proposed DSAC, which uses two neural networks to predict spatial point coordinates in an indoor scene and to score poses, respectively, and achieved state-of-the-art results in the camera localization task. On this basis, Eric Brachmann et al. further proposed the DSAC++ network, which improves the way the feature map is extracted and realizes high-precision camera pose estimation with a neural network comprising 11 convolutional layers and 3 downsampling layers. After the network predicts the scene coordinates corresponding to each pixel, a pose pool is generated using the RANSAC and PnP algorithms, and each pose is then scored according to the reprojection error to obtain the optimal pose. The DSAC series was the first to propose predicting scene point coordinates with a neural network; this greatly improves camera localization accuracy and provides a new approach to camera localization. The following problems nevertheless remain:
1. the DSAC series only connects convolutional layers in series in the network structure and does not fuse features extracted from different receptive fields;
2. in scenes with repetitive or weak texture, the similarity between image patches degrades the network's predictions.
Disclosure of Invention
In view of the above disadvantages and shortcomings of the prior art, the present invention provides a method for estimating the pose of an image type fire detector based on a feature depth aggregation network. The invention designs a feature depth aggregation module that improves the way low-level and high-level feature maps are fused, and uses it to localize an image type fire detector effectively in a building scene.
In order to achieve the above purpose, the invention adopts the following main technical scheme:
The invention provides an image type fire detector pose estimation method based on a feature depth aggregation network, which comprises the following steps:
S1, collecting data in different building environments, wherein the collected data comprise RGB images, depth maps, and the camera pose recorded at the moment each frame of image is shot;
S2, performing data preprocessing on the data collected in S1, specifically:
S21, dividing the data collected in S1 into a training set and a testing set;
S22, calculating the real scene coordinates corresponding to the pixel points according to the depth maps and camera poses in the training set;
S23, performing data enhancement processing on the images in the training set;
S24, performing normalization processing on the RGB images in the training set after data enhancement and on the RGB images in the testing set;
S3, building a feature depth aggregation network;
S4, training the feature depth aggregation network, specifically:
S41, inputting the normalized RGB images of the training set from S24 into the feature depth aggregation network built in S3 and performing one round of sample training to obtain a trained network model;
S42, inputting the normalized RGB images of the testing set from S24 into the network model trained in S41 to obtain the predicted scene coordinates and uncertainty corresponding to each pixel point;
S43, obtaining a predicted camera pose from the predicted scene coordinates and uncertainty of S42;
S44, comparing the predicted camera pose of S43 with the camera pose collected in S1 to obtain a pose error, comparing this pose error with the pose error obtained in the previous test, and retaining the network model with the smaller pose error;
S45, repeating steps S41-S44 until the optimal network model is obtained;
and S5, inputting the normalized RGB images of the testing set into the optimal network model obtained in S45 and calculating the pose of the image type fire detector.
Further, the method for calculating the real scene coordinates corresponding to the pixel points in S22 is specifically: according to the collected depth map, the camera coordinates P_c corresponding to each pixel point are obtained and, combined with the camera pose T_cw, the real scene coordinates corresponding to the pixel points are calculated as P_w = T_cw · P_c.
Further, the data enhancement processing in S23 is specifically: for the RGB images and depth maps in the training set, the images are randomly translated by -20 to 20 pixels in the horizontal or vertical direction, the image size is randomly scaled by 0.7 to 1.5 times, and the images are randomly rotated by -30° to 30°, so as to increase the number of samples in the building database.
Further, the feature depth aggregation network built in S3 comprises three parts: the first part is a feature extraction layer, which extracts low-level and high-level features of the image in the building scene and encodes, respectively, the geometric spatial information and the semantic information they contain; the second part is a feature fusion layer, which uses a channel attention mechanism to fuse the extracted feature maps at different scales and achieve a finer encoding of the environmental information; the third part is a regression layer for predicting scene coordinates and uncertainty.
Further, the feature extraction layer comprises a series of convolutional layers for encoding the features in the building image to obtain a first feature map, a second feature map, and a third feature map.
Further, the feature fusion layer takes the first, second, and third feature maps as input; the third feature map passes through a channel attention module to generate a fourth feature map, and the fourth feature map is added pixel by pixel to the second feature map to obtain a fifth feature map; the fifth feature map passes through a convolutional layer to obtain a sixth feature map, and the sixth feature map passes through the attention module to obtain a seventh feature map; the seventh feature map is added pixel by pixel to the first feature map to obtain an eighth feature map, and the eighth feature map passes through a convolutional layer to obtain a ninth feature map; the third, sixth, and ninth feature maps are concatenated in the channel dimension to obtain a tenth feature map.
Further, the regression layer comprises a series of convolutional layers for predicting the scene coordinates and uncertainty in the building.
Further, a deep supervision technique is used when training the feature depth aggregation network: when the network is trained on the training-set data in S41, the obtained predicted scene coordinates and uncertainty are combined with the real scene coordinates to compute a loss function, and the loss function is then back-propagated to correct the network parameters of the feature depth aggregation network.
Further, the specific steps of S43 are: setting a threshold on the uncertainty and eliminating the predicted scene coordinates whose uncertainty is greater than the threshold; and, among the predicted scene coordinates whose uncertainty is smaller than the threshold, calculating the predicted camera pose of the image type fire detector using the RANSAC and PnP algorithms.
Further, the specific steps of S45 are: defining the number of sample training rounds according to the size of the data; performing a test on the data in the testing set with the model parameters of the current network model after each round of sample training; if the pose error of the test result is better than that of the stored optimal network model, storing the model parameters of the current network model as the optimal parameters; and when the number of training rounds reaches the set value, stopping training to obtain the trained optimal network model.
The invention has the following beneficial effects. The invention provides a method for estimating the pose of an image type fire detector in a building scene using a feature depth aggregation network. RGB images, depth maps, and camera poses are collected in different building scenes, and data preprocessing is applied to the collected image data to improve the learning ability of the network and prevent overfitting; the data are then used to train and test the feature depth aggregation network provided by the invention. Experimental results show that, while balancing accuracy and memory consumption, the network model achieves high localization accuracy of the image type fire detector on the test set: a position deviation of 0.018 m and an angular deviation of 0.640°.
To verify robustness, the method applies Gaussian blur and motion blur processing to the images and then tests the processed images as input to the feature depth aggregation network. Experimental results show that the method has a certain ability to suppress Gaussian blur and motion blur.
The invention designs a novel and effective feature depth aggregation module, which has a beneficial effect on improving the positioning accuracy of the image type fire detector.
Drawings
FIG. 1 is a schematic flow chart provided by an embodiment of the present invention;
FIG. 2 shows data collected in a building scene according to an embodiment of the present invention;
FIG. 3 is a network diagram of the feature depth aggregation network according to an embodiment of the present invention;
FIG. 4 shows the image blurring operations, in sequence: the original image, Gaussian blur, slight motion blur, and severe motion blur, according to an embodiment of the present invention;
FIGS. 5A and 5B are schematic diagrams of the position error and angle error predicted by the network before and after Gaussian blur processing, respectively, in which the solid line indicates no Gaussian blur processing and the dotted line indicates Gaussian blur processing;
FIGS. 6A and 6B show the position error and angle error predicted by the network before and after slight motion blur processing, respectively, in which the solid line indicates no slight motion blur processing and the dotted line indicates slight motion blur processing;
FIGS. 7A and 7B show the position error and angle error predicted by the network before and after severe motion blur processing, respectively, in which the solid line indicates no severe motion blur processing and the dotted line indicates severe motion blur processing.
Detailed Description
For the purpose of better explaining the present invention and to facilitate understanding, the present invention will be described in detail by way of specific embodiments with reference to the accompanying drawings.
Embodiment 1: Referring to FIG. 1, the image type fire detector pose estimation method based on the feature depth aggregation network provided by this embodiment is applied to the localization of an image type fire detector in a building scene. The method comprises the following steps:
step one, in different building environments, a depth camera is used for collecting data. The acquired data includes RGB images, depth maps and camera poses. The acquired building scene image needs to include all representative objects in the acquired scene, and at least 3 groups of data are acquired in each scene, and each group of data includes a certain amount of pictures.
Step two: perform data preprocessing on the data acquired in step one. The data acquired in step one are randomly divided into a training set and a testing set. For the training set, the real scene coordinates corresponding to each pixel of the depth map are calculated from the depth map and the camera pose; the real scene coordinates and the predicted scene coordinates output by the neural network are later used to compute the loss function. Data enhancement operations are applied to the images in the training set. The RGB images of the training set after data enhancement are normalized, and the RGB images of the testing set are normalized as well.
Step three: build the feature depth aggregation network.
Step four: train the feature depth aggregation network. The normalized RGB images of the training set are input into the constructed feature depth aggregation network for training. The number of sample training rounds is defined according to the size of the data. After each round of sample training, a test is carried out with the current model parameters on the data in the testing set. If the error and accuracy of the test result are better than those of the stored optimal model, the current model parameters are stored as the optimal parameters. When the number of training rounds reaches the set value, training is stopped and the trained feature depth aggregation network is obtained.
Step five: input the normalized RGB images of the testing set into the trained feature depth aggregation network to obtain the predicted scene coordinates and uncertainty corresponding to each pixel point. The predicted scene coordinates with poor prediction quality are eliminated according to the obtained uncertainty. From the remaining, more accurately predicted scene coordinates, 256 sets of scene coordinates and corresponding pixel coordinates are randomly selected with the RANSAC and PnP algorithms, and the predicted pose of the image type fire detector is calculated.
Embodiment 2: This embodiment further specifies the image type fire detector pose estimation method based on the feature depth aggregation network. The specific process of step one in this embodiment is as follows:
A depth camera is used to collect, in a building environment, a large number of RGB images, depth maps, and the camera pose recorded at the moment each frame of image is shot. The camera used in different scenes is the same device, and the camera parameters must be consistent. The collected data are shown in FIG. 2.
Embodiment 3: This embodiment further specifies the method of Embodiment 2. In step two of this embodiment, data preprocessing is performed on the data acquired in step one. The specific process comprises the following steps:
The collected data are randomly divided into a training set and a testing set, with the ratio of the training set to the testing set approximately 2:1.
Scene coordinates are calculated for the pixels in the training set. Specifically, from the acquired depth map, the camera coordinates P_c corresponding to each pixel point are obtained and, combined with the camera pose T_cw, the real scene coordinates corresponding to the pixel points are calculated as P_w = T_cw · P_c.
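As an illustration of this step, the following is a minimal NumPy sketch of back-projecting a depth map and applying the camera pose to obtain per-pixel scene coordinates. It is a sketch only: the function name and the intrinsics fx, fy, cx, cy are illustrative assumptions, and T_cw is taken to be the 4 × 4 camera-to-world pose recorded with each frame.

```python
import numpy as np

def scene_coordinates(depth, T_cw, fx, fy, cx, cy):
    """Back-project a depth map to per-pixel scene coordinates P_w = T_cw * P_c."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))       # pixel coordinates
    z = depth.astype(np.float64)                          # depth values
    x = (u - cx) * z / fx                                 # camera-frame X
    y = (v - cy) * z / fy                                 # camera-frame Y
    P_c = np.stack([x, y, z, np.ones_like(z)], axis=-1)   # homogeneous camera coords, H x W x 4
    P_w = P_c.reshape(-1, 4) @ T_cw.T                     # apply the camera-to-world pose
    return P_w[:, :3].reshape(h, w, 3)                    # real scene coordinates per pixel
```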
Data enhancement operations are applied to the images in the training set, specifically: the image is randomly translated by -20 to 20 pixels in the horizontal or vertical direction, the image size is randomly scaled by 0.7 to 1.5 times, and the image is randomly rotated by -30° to 30°. This increases the number of samples in the building database and effectively prevents overfitting while improving the learning ability of the neural network.
The RGB images of the training set and the testing set are normalized. Specifically, the method adopted is v'_i = (v_i / 255) × 2 - 1, where v_i is the initial pixel value and v'_i is the normalized pixel value; this limits the RGB values to the range [-1, 1].
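A short sketch of the enhancement and normalization described above is given below. It is illustrative only: it applies translation, scaling, and rotation in a single OpenCV warp, while the value ranges are those stated in the text.

```python
import random
import numpy as np
import cv2

def augment(rgb, depth):
    """Random translation (-20..20 px), scaling (0.7..1.5x) and rotation (-30..30 deg)."""
    h, w = rgb.shape[:2]
    tx, ty = random.uniform(-20, 20), random.uniform(-20, 20)
    scale = random.uniform(0.7, 1.5)
    angle = random.uniform(-30, 30)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)   # rotation + scale about the centre
    M[:, 2] += (tx, ty)                                         # add the translation
    rgb_aug = cv2.warpAffine(rgb, M, (w, h))
    depth_aug = cv2.warpAffine(depth, M, (w, h), flags=cv2.INTER_NEAREST)
    return rgb_aug, depth_aug

def normalize(rgb):
    """v'_i = (v_i / 255) * 2 - 1, limiting RGB values to [-1, 1]."""
    return (rgb.astype(np.float32) / 255.0) * 2.0 - 1.0
```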
Embodiment 4: This embodiment further specifies the method of Embodiment 3. The specific process of step three in this embodiment is as follows: a feature depth aggregation network is built; the network structure is shown in FIG. 3.
The feature depth aggregation network comprises three parts: the first part is a feature extraction layer, which extracts low-level and high-level features of the image in the building scene and encodes the geometric spatial information and semantic information in the environment; the second part is a feature fusion layer, which uses a channel attention mechanism to fuse the extracted feature maps at different scales and achieve a finer encoding of the environmental information; the third part is a regression layer for predicting scene coordinates and uncertainty.
Embodiment 5: This embodiment further specifies the method of Embodiment 4. The structure of the feature extraction layer is as follows:
The input tensor of the feature extraction layer has dimensions 5 × 480 × 640, where 5 corresponds to the color values (R, G, B) and pixel coordinates (u, v) of each pixel, and 480 and 640 are the height and width of the image, respectively.
The feature extraction layer is composed of a series of convolutional layers and is used to encode the features in the building image. Considering the dimensions of the input tensor, the feature extraction layer uses a modified ResNet18 as the backbone network. The modification consists of replacing the first-layer 7 × 7 convolution with two 3 × 3 convolutional layers, and removing the final average pooling layer and the fully connected layer.
The encoding of features in the building image is divided into 7 stages, defined as C1, C2, M1, B1, B2, B3, and B4 (see Table 1), corresponding to the first convolutional layer, the second convolutional layer, the first max pooling layer, and the first, second, third, and fourth residual blocks. During feature extraction, the first convolutional layer is applied first, with a 3 × 3 kernel and stride 1; then the second convolutional layer, with a 3 × 3 kernel and stride 2; then the first max pooling layer, with a 3 × 3 kernel and stride 2. Features are then further extracted by the first, second, third, and fourth residual blocks to obtain the first, second, and third feature maps. The benefits of using residual blocks are: low-dimensional features are added to high-dimensional features through the shortcut structure within the residual block, preventing loss of information and degradation of the network structure; at the same time, the shortcut structure further avoids vanishing and exploding gradients. Residual blocks do not add many network parameters but can improve the training effect of the network.
Table 1 feature extraction layer network module architecture
(Table 1 is provided as an image in the original publication and is not reproduced here.)
After each convolution, a batch normalization (BN) layer and a ReLU activation function are applied. The BN layer smooths the loss surface and helps improve the training speed of the network; the activation function increases the degree of non-linearity of the network.
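For illustration, a rough PyTorch sketch of such a modified ResNet18 feature extractor is shown below. It reflects only what the text describes (two 3 × 3 convolutions replacing the 7 × 7 stem, a max pooling layer, and the four residual stages, with the average pooling and fully connected layers removed). The channel widths, strides, and the mapping from residual stages to the first, second, and third feature maps follow standard ResNet18 and are assumptions here, since Table 1 is available only as an image; the stage outputs would need to be adjusted to match the 512 × 60 × 80 shapes reported later in the text.

```python
import torch.nn as nn
from torchvision.models import resnet18

class FeatureExtractor(nn.Module):
    """Modified ResNet18 backbone: 7x7 stem replaced by two 3x3 convolutions, no avgpool/fc."""
    def __init__(self, in_channels=5):
        super().__init__()
        base = resnet18(weights=None)
        self.stem = nn.Sequential(                        # C1: 3x3, stride 1; C2: 3x3, stride 2
            nn.Conv2d(in_channels, 64, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(3, stride=2, padding=1)  # M1
        self.b1, self.b2 = base.layer1, base.layer2       # B1, B2
        self.b3, self.b4 = base.layer3, base.layer4       # B3, B4

    def forward(self, x):
        x = self.pool(self.stem(x))
        f1 = self.b2(self.b1(x))      # first feature map (lower level)
        f2 = self.b3(f1)              # second feature map
        f3 = self.b4(f2)              # third feature map (higher level)
        return f1, f2, f3
```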
Embodiment 6: This embodiment further specifies the method of Embodiment 5. The structure of the feature fusion layer is as follows:
In the feature fusion layer, considering that different feature maps describe different building scene features and that the traditional pixel-by-pixel addition can cause information confusion, the invention provides a feature depth aggregation module that uses a channel attention module together with concatenation in the channel dimension to effectively fuse the extracted feature maps, which contain different contextual information of the building image.
The feature fusion layer takes the first, second, and third feature maps as input. The third feature map passes through the channel attention module to generate a fourth feature map, and the fourth feature map is added pixel by pixel to the second feature map to obtain a fifth feature map. The fifth feature map passes through a convolutional layer to obtain a sixth feature map, and the sixth feature map passes through the attention module to obtain a seventh feature map. The seventh feature map is added pixel by pixel to the first feature map to obtain an eighth feature map, and the eighth feature map passes through a convolutional layer to obtain a ninth feature map. The third, sixth, and ninth feature maps are concatenated in the channel dimension to obtain a tenth feature map. Except for the tenth feature map, whose dimensions are 1536 × 60 × 80, the remaining feature maps have dimensions 512 × 60 × 80.
In the channel attention module, the output feature map can be expressed as
m_o = m_i ⊗ δ(C_1×1(P_global(m_i)))
where m_i is the input feature map, P_global is global average pooling, C_1×1 is a convolutional layer with a 1 × 1 kernel followed by batch normalization, δ is the sigmoid activation function, ⊗ denotes pixel-by-pixel multiplication, and m_o is the output feature map.
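A hedged PyTorch sketch of the channel attention module and the fusion flow described above follows; the 3 × 3 kernel size of the two intermediate convolutions is an assumption, while the 512-channel width matches the feature-map dimensions stated above.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """m_o = m_i * sigmoid(BN(Conv1x1(GlobalAvgPool(m_i)))), as in the formula above."""
    def __init__(self, channels=512):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # P_global
        self.conv = nn.Conv2d(channels, channels, 1)   # C_1x1
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, m_i):
        w = torch.sigmoid(self.bn(self.conv(self.pool(m_i))))   # per-channel weights
        return m_i * w                                           # broadcast (pixel-by-pixel) multiplication

class FeatureFusion(nn.Module):
    """Feature depth aggregation: attention, pixel-wise addition and channel concatenation."""
    def __init__(self, channels=512):
        super().__init__()
        self.att1, self.att2 = ChannelAttention(channels), ChannelAttention(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, f1, f2, f3):
        f4 = self.att1(f3)                 # fourth feature map
        f5 = f4 + f2                       # fifth: pixel-by-pixel addition
        f6 = self.conv1(f5)                # sixth
        f7 = self.att2(f6)                 # seventh
        f8 = f7 + f1                       # eighth
        f9 = self.conv2(f8)                # ninth
        return torch.cat([f3, f6, f9], 1)  # tenth feature map, 1536 channels
```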
Embodiment 7: This embodiment further specifies the method of Embodiment 6. The structure of the regression layer is as follows:
The regression layer is composed of a series of convolutional layers and is used to predict the scene coordinates and uncertainty in the building. The regression layer is divided into 6 stages, corresponding to the first through sixth convolutional layers, as shown in Table 2; the fifth and sixth convolutional layers are parallel branches. All convolutional kernels have size 3 × 3 and stride 1. The first through fourth convolutional layers produce an eleventh feature map; the eleventh feature map then passes through the fifth and sixth convolutional layers, respectively, to obtain the predicted scene coordinates and the uncertainty. The first through fourth convolutional layers comprise a convolution operation, batch normalization, and ReLU activation, while the fifth and sixth convolutional layers comprise only a convolution operation.
Table 2 regression layer network module architecture
(Table 2 is provided as an image in the original publication and is not reproduced here.)
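A sketch of this regression layer is shown below for illustration; the intermediate channel width and the 3-channel / 1-channel output heads are assumptions, since Table 2 is only available as an image.

```python
import torch.nn as nn

class RegressionLayer(nn.Module):
    """Four 3x3 conv stages followed by two parallel 3x3 heads for coordinates and uncertainty."""
    def __init__(self, in_channels=1536, mid_channels=512):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                 nn.BatchNorm2d(cout), nn.ReLU(inplace=True))
        self.trunk = nn.Sequential(block(in_channels, mid_channels),   # first to fourth conv layers
                                   block(mid_channels, mid_channels),
                                   block(mid_channels, mid_channels),
                                   block(mid_channels, mid_channels))
        self.coord_head = nn.Conv2d(mid_channels, 3, 3, padding=1)     # fifth conv layer: scene coordinates
        self.unc_head = nn.Conv2d(mid_channels, 1, 3, padding=1)       # sixth conv layer: uncertainty

    def forward(self, f10):
        f11 = self.trunk(f10)                                          # eleventh feature map
        return self.coord_head(f11), self.unc_head(f11)
```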
Embodiment 8: This embodiment further specifies the method of Embodiment 7. In step four, the normalized RGB images of the training set are input into the built feature depth aggregation network for training, specifically:
The input is a single RGB image of a building scene, and the output is the predicted scene coordinates and uncertainty corresponding to the pixel points of the RGB image. When inputting the training set images into the network, the mini-batch size is set to 4. An ADAM optimizer is used with the hyper-parameters set to β1 = 0.9 and β2 = 0.999. A learning-rate decay strategy is adopted: the current learning rate l_n is computed from the initial learning rate l_i (set to 0.0002 in the present invention) and the current iteration number iter (the decay formula is provided as an image in the original publication).
In the specific training, the number of sample training rounds is set to 500. After each round of sample training, the data in the testing set are input into the current feature depth aggregation network for a test. If the test result of the current network is better than that of the stored network, the current feature depth aggregation network is stored as the optimal network. When the number of sample training rounds reaches 500, training ends and the optimal feature depth aggregation network is obtained.
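The sketch below illustrates this train-test-keep-best loop; the loss function and test routine are placeholders, and the learning-rate decay (whose exact formula is only given as an image) is omitted.

```python
import torch

def train(net, loss_fn, train_loader, evaluate, epochs=500, lr=2e-4):
    """Adam training with per-epoch testing; the model with the smallest pose error is kept."""
    opt = torch.optim.Adam(net.parameters(), lr=lr, betas=(0.9, 0.999))
    best_err = float("inf")
    for epoch in range(epochs):
        net.train()
        for rgb, gt_scene in train_loader:       # mini-batches of 4 normalized 5 x 480 x 640 tensors
            coords, unc = net(rgb)
            loss = loss_fn(coords, unc, gt_scene)
            opt.zero_grad()
            loss.backward()                      # back-propagation corrects the network parameters
            opt.step()
        pos_err, ang_err = evaluate(net)         # position / angle error on the test set
        if pos_err < best_err:                   # keep the parameters with the smaller pose error
            best_err = pos_err
            torch.save(net.state_dict(), "best_model.pth")
```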
Embodiment 9: This embodiment further specifies the method of Embodiment 8. For the data in the training set, the feature depth aggregation network uses a deep supervision technique, which improves both the learning ability and the optimization speed of the network. The deep supervision technique is implemented as follows:
The regression layer predicts scene coordinates and uncertainty from the third feature map and from the tenth feature map, respectively, and the auxiliary loss L_1 and the main loss L_2 are then computed. The auxiliary loss and the main loss are combined to obtain the total loss L_reg as L_reg = L_2 + 0.4·L_1. Referring to FIG. 3(a), the auxiliary loss is on the left and the main loss is on the right.
Embodiment 10: This embodiment further specifies the method of Embodiment 9. The loss function L in this embodiment considers the Euclidean distance between the predicted scene coordinates and the real scene coordinates together with the uncertainty (the original formula is provided as an image; the expression below is assembled from the quantities it defines):
L = (1/N) · Σ_{i=1..N} [ 3·log(v_i) + ||P_wi - P*_wi||_2 / v_i ]
where N is the number of pixels of the input image, P_wi is the predicted scene coordinate of the ith pixel, P*_wi is the real scene coordinate of the ith pixel, and v_i is the uncertainty of the ith pixel. The first term, 3·log(v_i), is a penalty term, and the second term weights the Euclidean distance between the predicted and real scene coordinates by the uncertainty. If the predicted scene coordinates are inaccurate, the uncertainty is high, so the penalty term grows, the loss function increases, and the network parameters are corrected more strongly during back-propagation.
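A hedged PyTorch sketch of this loss and of the deep-supervision combination L_reg = L_2 + 0.4·L_1 follows; treating the network's uncertainty output as a log-value and exponentiating it is an assumption made here only to keep v_i positive.

```python
import torch

def scene_coordinate_loss(pred, log_unc, gt):
    """L = mean_i( 3*log(v_i) + ||P_wi - P*_wi||_2 / v_i ) over the pixels of one image batch."""
    v = torch.exp(log_unc.squeeze(1))            # assumed positivity handling for the uncertainty
    dist = torch.norm(pred - gt, dim=1)          # per-pixel Euclidean distance, inputs are B x 3 x H x W
    return (3.0 * torch.log(v) + dist / v).mean()

def total_loss(pred_aux, unc_aux, pred_main, unc_main, gt):
    """Deep supervision: L_reg = L_2 + 0.4 * L_1 (main loss plus weighted auxiliary loss)."""
    l1 = scene_coordinate_loss(pred_aux, unc_aux, gt)     # auxiliary loss from the third feature map
    l2 = scene_coordinate_loss(pred_main, unc_main, gt)   # main loss from the tenth feature map
    return l2 + 0.4 * l1
```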
Embodiment 11: This embodiment further specifies the method of Embodiment 10. The specific form of the ReLU activation function in this embodiment is
relu(x) = max(0, x)
where x is the input and relu(x) is the output.
Embodiment 12: This embodiment further specifies the method of Embodiment 11. In step four of this embodiment, the data in the testing set are used to carry out a test, and in step five the data in the testing set are input into the trained feature depth aggregation network for testing, so that the predicted scene coordinates and uncertainty are obtained. The specific process of a test using the testing-set data is as follows:
1. input the data in the testing set into the feature depth aggregation network;
2. output the predicted scene coordinates and uncertainty of each pixel point;
3. calculate the predicted camera pose from the predicted scene coordinates and uncertainty;
4. calculate the position error and angle error from the predicted camera pose and the real camera pose in the testing set.
In step 3, the predicted camera pose is calculated as follows:
(1) eliminate the poorly predicted scene coordinates according to the uncertainty, and randomly select 256 points from the remaining well-predicted scene coordinates; this random selection is the RANSAC algorithm;
(2) calculate the reprojection error of each predicted scene coordinate and judge whether it is an outlier, i.e. a point with a large error;
(3) if none of the 256 points is an outlier, optimize the reprojection error with the Gauss-Newton method to obtain the predicted camera pose T;
(4) step (3) is an iterative process; after the optimization is finished, the optimal predicted camera pose T is obtained.
The RGB images of the testing set are input into the trained feature depth aggregation network. During input, the mini-batch size is set to 1, and the predicted scene coordinates and uncertainty corresponding to each pixel point in the RGB image are obtained.
The uncertainty is used to evaluate the quality of the scene coordinate prediction. By setting a threshold on the uncertainty, scene point coordinates with uncertainty greater than the threshold are eliminated. Among the predicted scene point coordinates with uncertainty smaller than the threshold, 256 groups of pixel coordinates and predicted scene coordinates are selected using the RANSAC algorithm; the PnP algorithm is realized by optimizing the reprojection error, and the Gauss-Newton method is used to optimize the reprojection error to obtain the predicted camera pose T*:
T* = argmin_T Σ_{i=1..N} || P_ui - (1 / P_ci) · K · T^(-1) · P_wi ||^2
where N is the number of pixels of the input image, P_ui is the pixel coordinate of the ith pixel, P_ci is the depth value of the camera coordinate of the ith pixel, K is the camera intrinsic matrix, T is the pose of the camera in the world coordinate system, and P_wi is the predicted scene coordinate of the ith pixel (the original formula is provided as an image; the expression above is the reprojection error assembled from these quantities). The reprojection error of each pixel point is calculated according to the predicted camera pose, and whether the pixel point is an outlier is judged; if outliers exist, this step is repeated. After the optimization is finished, the optimal predicted camera pose of the image type fire detector is obtained.
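For illustration, the sketch below uses OpenCV's built-in RANSAC PnP solver in place of the patent's explicit 256-point sampling and Gauss-Newton loop; the uncertainty threshold and reprojection threshold values are assumptions.

```python
import numpy as np
import cv2

def estimate_pose(pred_coords, uncertainty, pixel_coords, K, unc_thresh=0.1):
    """Discard high-uncertainty predictions, then recover the camera pose with RANSAC + PnP."""
    mask = uncertainty.ravel() < unc_thresh
    obj = pred_coords.reshape(-1, 3)[mask].astype(np.float64)    # predicted scene coordinates
    img = pixel_coords.reshape(-1, 2)[mask].astype(np.float64)   # corresponding pixel coordinates
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        obj, img, K, None, iterationsCount=256, reprojectionError=10.0,
        flags=cv2.SOLVEPNP_ITERATIVE)
    R, _ = cv2.Rodrigues(rvec)                  # rotation, world -> camera
    T_wc = np.eye(4)
    T_wc[:3, :3], T_wc[:3, 3] = R, tvec.ravel()
    return np.linalg.inv(T_wc)                  # camera pose in the world frame (camera-to-world)
```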
Considering the positioning accuracy required of an image type fire detector in a building environment, the performance evaluation mainly includes two indexes: position deviation and angle deviation. The smaller the deviation, the higher the positioning accuracy of the image type fire detector. Table 3 compares the estimation method of the present invention with other methods; the present invention achieves a position deviation of 0.018 m and an angular deviation of 0.640°, which is superior to the other methods.
TABLE 3 comparative experimental data of different networks
(Table 3 is provided as an image in the original publication and is not reproduced here.)
Examples
The invention aims to localize an image type fire detector used in a building scene. To verify the effectiveness of the feature depth aggregation network, relevant data were collected in a variety of building environments.
RGB images, depth maps, and camera poses are collected at a frame rate of 30 FPS. In each scene, 3 groups of data are collected, each group containing a certain number of pictures and camera poses; 2 groups are used as the training set and 1 group as the testing set.
For the data in the training set, the real scene coordinates corresponding to the pixel points in the depth map and the RGB image are calculated from the collected camera pose and depth map.
During training, overfitting may occur because of insufficient data or an overly large network model. The invention therefore applies data enhancement operations to the images in the training set, specifically: random translation by -20 to 20 pixels in the horizontal or vertical direction, random scaling of the image by 0.7 to 1.5 times, and random rotation of the image by -30° to 30°. In consideration of the running speed of the network, the RGB images in the training set and the testing set are also normalized.
Considering that the image size is only 640 × 480, the invention designs a lightweight feature depth aggregation network. The model parameters of the network occupy only 97 MB, each picture is processed in 0.04 s, and the requirements on the processor are low. During training with the data in the training set, an Adam optimizer is used with the mini-batch size set to 4; during testing with the data in the testing set, the mini-batch size is set to 1.
For each RGB image, the output of the feature depth aggregation network contains 4800 groups of scene coordinates and uncertainties. The uncertainty threshold is first set to 0.1, and scene point coordinates with uncertainty greater than 0.1 are eliminated. Among the scene point coordinates with uncertainty less than 0.1, 256 groups of scene coordinates and corresponding pixel coordinates are randomly selected using the RANSAC algorithm. Finally, the reprojection error is optimized with the PnP algorithm to obtain the camera pose.
The model of the invention converges quickly: it converges within 4 hours on a single NVIDIA TITAN RTX GPU with an Intel Core i7-9700K @ 3.60 GHz CPU. Finally, on the test set, a position deviation of 0.018 m and an angular deviation of 0.640° are achieved.
Considering that actually acquired pictures may be blurred, and in order to verify the robustness of the feature depth aggregation network, Gaussian blur and motion blur are applied to the clear pictures in the test data. FIG. 4 shows the original image and its appearance after Gaussian blur, slight motion blur, and severe motion blur.
Gaussian blur is applied as
I'_i = I_i + N(μ, σ)
where I_i is the pixel value of the ith pixel, μ is the mean of the normal distribution (set to 0 here), σ is the variance of the normal distribution (set to 25 here), and I'_i is the pixel value of the ith pixel after Gaussian blur processing. The images without Gaussian blur and the Gaussian-blurred images are input into the feature depth aggregation network for testing; the test results are shown in FIGS. 5A and 5B. It can be seen that the feature depth aggregation network has a certain robustness to Gaussian blur.
The original images are subjected to slight and severe motion blur using motion blur kernels of size 20 and 30, respectively; the images without motion blur and the motion-blurred images are then input into the feature depth aggregation network for testing, with the results shown in FIGS. 6A, 6B, 7A, and 7B. As the motion blur increases, the prediction error of the feature depth aggregation network becomes larger but remains within an acceptable range. The feature depth aggregation network therefore has a certain robustness to motion blur.
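A small sketch of these two perturbations is given below for reference; the horizontal direction of the motion-blur kernel is an assumption.

```python
import numpy as np
import cv2

def gaussian_noise(img, mu=0.0, sigma=25.0):
    """I'_i = I_i + N(mu, sigma), as used for the Gaussian-blur robustness test."""
    noisy = img.astype(np.float32) + np.random.normal(mu, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def motion_blur(img, ksize=20):
    """Motion blur with kernel size 20 (slight) or 30 (severe)."""
    kernel = np.zeros((ksize, ksize), np.float32)
    kernel[ksize // 2, :] = 1.0 / ksize          # averaging along one row of the kernel
    return cv2.filter2D(img, -1, kernel)
```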
The image type fire detector plays an important role in building fire prevention and control. The invention provides a feature depth aggregation network for high-precision localization of the image type fire detector. A single RGB image is concatenated with the pixel coordinates to obtain a 5 × 480 × 640 tensor as the network input, from which the scene coordinates and uncertainty corresponding to each pixel are obtained. The uncertainty is then used to eliminate poorly predicted scene coordinates, and the RANSAC and PnP algorithms are finally used to achieve high-precision localization of the image type fire detector. Experimental results show that the feature depth aggregation network achieves a position deviation of 0.018 m and an angular deviation of 0.640°.
Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are illustrative and not restrictive, and that those skilled in the art may make changes, modifications, substitutions and alterations to the above embodiments without departing from the scope of the present invention.

Claims (10)

1. The image type fire detector pose estimation method based on the feature depth aggregation network is characterized by comprising the following steps of:
s1, collecting data in different building environments, wherein the collected data comprise RGB images, depth maps and camera poses recorded at the same time when each frame of image is shot;
s2, data preprocessing is carried out on the data collected in the S1, and the data preprocessing specifically comprises the following steps:
s21, dividing the data collected in S1 into a training set and a testing set;
s22, calculating real scene coordinates corresponding to the pixel points according to the depth map and the camera pose in the training set;
s23, performing data enhancement processing on the images in the training set;
s24, performing normalization processing on the RGB images in the training set after the data enhancement processing and the RGB images in the testing set;
s3, building a feature depth aggregation network;
s4, training the feature depth aggregation network, specifically:
s41, inputting the normalized RGB images of the training set from S24 into the feature depth aggregation network built in S3 and performing one round of sample training to obtain a trained network model;
s42, inputting the normalized RGB images of the testing set from S24 into the network model trained in S41 to obtain the predicted scene coordinates and uncertainty corresponding to each pixel point;
s43, obtaining a predicted camera pose from the predicted scene coordinates and uncertainty of S42;
s44, comparing the predicted camera pose of S43 with the camera pose collected in S1 to obtain a pose error, comparing this pose error with the pose error obtained in the previous test, and retaining the network model with the smaller pose error;
s45, repeating steps S41-S44 until the optimal network model is obtained;
and S5, inputting the normalized RGB images of the testing set into the optimal network model obtained in S45 and calculating the pose of the image type fire detector.
2. The image-based fire detector pose estimation method based on the feature depth aggregation network according to claim 1, wherein the method for calculating the real scene coordinates corresponding to the pixel points in S22 is specifically: according to the collected depth map, the camera coordinates P_c corresponding to the pixel points are obtained and, combined with the camera pose T_cw, the real scene coordinates corresponding to the pixel points are calculated as P_w = T_cw · P_c.
3. The image-based fire detector pose estimation method based on the feature depth aggregation network according to claim 1, wherein the data enhancement processing in S23 is specifically: for the RGB images and depth maps in the training set, the images are translated by -20 to 20 pixels in the horizontal or vertical direction, the image size is scaled by 0.7 to 1.5 times, and the images are rotated by -30° to 30° to increase the number of samples of the building database.
4. The image-based fire detector pose estimation method based on the feature depth aggregation network according to claim 1, wherein the feature depth aggregation network constructed in S3 comprises three parts: the first part is a feature extraction layer which is used for extracting low-level and high-level features of an image in a building scene and respectively coding geometric spatial information and semantic information in the low-level and high-level features; the second part is a feature fusion layer, and a channel attention mechanism is used for fusing extracted feature graphs on different scales to realize more fine coding of environmental information; the third part is a regression layer for predicting scene coordinates and uncertainty.
5. The method for estimating the pose of an image-based fire detector based on the feature depth aggregation network according to claim 4, wherein the feature extraction layer comprises a series of convolution layers for coding features in the architectural image to obtain the first feature map, the second feature map and the third feature map.
6. The image-based fire detector pose estimation method based on the feature depth aggregation network according to claim 5, wherein the feature fusion layer takes the first feature map, the second feature map and the third feature map as input; the third feature map passes through a channel attention module to generate a fourth feature map, and the fourth feature map is added pixel by pixel to the second feature map to obtain a fifth feature map; the fifth feature map passes through a convolutional layer to obtain a sixth feature map, and the sixth feature map passes through the attention module to obtain a seventh feature map; the seventh feature map is added pixel by pixel to the first feature map to obtain an eighth feature map, and the eighth feature map passes through a convolutional layer to obtain a ninth feature map; and the third, sixth and ninth feature maps are concatenated in the channel dimension to obtain a tenth feature map.
7. The feature depth aggregation network-based image-based fire detector pose estimation method according to claim 4, wherein the regression layer comprises a series of convolution layers for predicting predicted scene coordinates and uncertainty in the building.
8. The method for estimating the pose of the image-type fire detector based on the deep feature aggregation network as claimed in claim 1, wherein a deep supervision technique is used during training of the deep feature aggregation network, and when sample training is performed on the deep feature aggregation network by using data in a training set in S41, the obtained predicted scene coordinates and uncertainty are combined with real scene coordinates to obtain a loss function, and then the loss function is used for back propagation to correct network parameters of the deep feature aggregation network.
9. The image-based fire detector pose estimation method based on the feature depth aggregation network according to claim 1, wherein the specific steps of S43 are: setting a threshold on the uncertainty and eliminating the predicted scene coordinates whose uncertainty is greater than the threshold; and, among the predicted scene coordinates whose uncertainty is smaller than the threshold, calculating the predicted camera pose of the image-based fire detector using the RANSAC algorithm and the PnP algorithm.
10. The image-based fire detector pose estimation method based on the feature deep aggregation network according to claim 1, wherein the specific steps of S45 are as follows: defining the training times of the sample according to the size of the data; performing a test on the data in the test set by using the model parameters of the current network model every time the sample training is finished; if the pose error of the test result is better than the stored optimal network model, storing the model parameters of the current network model as the optimal parameters; and when the training times of the network reach the set value, stopping training to obtain the trained optimal network model.
CN202111078643.4A 2021-09-15 2021-09-15 Image type fire detector pose estimation method based on feature depth aggregation network Active CN113793472B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111078643.4A CN113793472B (en) 2021-09-15 2021-09-15 Image type fire detector pose estimation method based on feature depth aggregation network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111078643.4A CN113793472B (en) 2021-09-15 2021-09-15 Image type fire detector pose estimation method based on feature depth aggregation network

Publications (2)

Publication Number Publication Date
CN113793472A 2021-12-14
CN113793472B (en) 2023-01-20

Family

ID=79183414

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111078643.4A Active CN113793472B (en) 2021-09-15 2021-09-15 Image type fire detector pose estimation method based on feature depth aggregation network

Country Status (1)

Country Link
CN (1) CN113793472B (en)


Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108109207A (en) * 2016-11-24 2018-06-01 中安消物联传感(深圳)有限公司 A kind of visualization solid modelling method and system
CN109063728A (en) * 2018-06-20 2018-12-21 燕山大学 A kind of fire image deep learning mode identification method
CN109816725A (en) * 2019-01-17 2019-05-28 哈工大机器人(合肥)国际创新研究院 A kind of monocular camera object pose estimation method and device based on deep learning
CN110246181A (en) * 2019-05-24 2019-09-17 华中科技大学 Attitude estimation model training method, Attitude estimation method and system based on anchor point
CN110322510A (en) * 2019-06-27 2019-10-11 电子科技大学 A kind of 6D position and orientation estimation method using profile information
CN110910452A (en) * 2019-11-26 2020-03-24 上海交通大学 Low-texture industrial part pose estimation method based on deep learning
CN111179217A (en) * 2019-12-04 2020-05-19 天津大学 Attention mechanism-based remote sensing image multi-scale target detection method
CN111402310A (en) * 2020-02-29 2020-07-10 同济大学 Monocular image depth estimation method and system based on depth estimation network
CN111325797A (en) * 2020-03-03 2020-06-23 华东理工大学 Pose estimation method based on self-supervision learning
CN111652921A (en) * 2020-04-21 2020-09-11 深圳大学 Generation method of monocular depth prediction model and monocular depth prediction method
CN111862126A (en) * 2020-07-09 2020-10-30 北京航空航天大学 Non-cooperative target relative pose estimation method combining deep learning and geometric algorithm
CN112270280A (en) * 2020-11-02 2021-01-26 重庆邮电大学 Open-pit mine detection method in remote sensing image based on deep learning
CN112418329A (en) * 2020-11-25 2021-02-26 武汉大学 Cervical OCT image classification method and system based on multi-scale textural feature fusion
CN113269831A (en) * 2021-05-19 2021-08-17 北京能创科技有限公司 Visual repositioning method, system and device based on scene coordinate regression network
CN113299035A (en) * 2021-05-21 2021-08-24 上海电机学院 Fire identification method and system based on artificial intelligence and binocular vision
CN113393522A (en) * 2021-05-27 2021-09-14 湖南大学 6D pose estimation method based on monocular RGB camera regression depth information
CN113313706A (en) * 2021-06-28 2021-08-27 安徽南瑞继远电网技术有限公司 Power equipment defect image detection method based on detection reference point offset analysis

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
吴华运 (WU Huayun): "Empty-bottle surface defect detection based on an improved SSD algorithm", 《计算机与现代化》 (Computer and Modernization) *
李晨旻 (LI Chenmin): "Research on indoor visual localization based on scene coordinate regression", 《中国优秀硕士学位论文全文数据库信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116977634A (en) * 2023-07-17 2023-10-31 应急管理部沈阳消防研究所 Fire smoke detection method based on laser radar point cloud background subtraction
CN116977634B (en) * 2023-07-17 2024-01-23 应急管理部沈阳消防研究所 Fire smoke detection method based on laser radar point cloud background subtraction

Also Published As

Publication number Publication date
CN113793472B (en) 2023-01-20

Similar Documents

Publication Publication Date Title
CN110570371B (en) Image defogging method based on multi-scale residual error learning
CN108960211B (en) Multi-target human body posture detection method and system
CN111126359B (en) High-definition image small target detection method based on self-encoder and YOLO algorithm
CN111626128A (en) Improved YOLOv 3-based pedestrian detection method in orchard environment
CN110276768B (en) Image segmentation method, image segmentation device, image segmentation apparatus, and medium
CN110705344B (en) Crowd counting model based on deep learning and implementation method thereof
CN111291768B (en) Image feature matching method and device, equipment and storage medium
CN110443883B (en) Plane three-dimensional reconstruction method for single color picture based on droplock
CN111931686B (en) Video satellite target tracking method based on background knowledge enhancement
CN113313810B (en) 6D attitude parameter calculation method for transparent object
CN113077505B (en) Monocular depth estimation network optimization method based on contrast learning
CN112634163A (en) Method for removing image motion blur based on improved cycle generation countermeasure network
CN112084952B (en) Video point location tracking method based on self-supervision training
CN114140623A (en) Image feature point extraction method and system
CN113793472B (en) Image type fire detector pose estimation method based on feature depth aggregation network
CN116030498A (en) Virtual garment running and showing oriented three-dimensional human body posture estimation method
CN110503002B (en) Face detection method and storage medium
CN112417991B (en) Double-attention face alignment method based on hourglass capsule network
EP4024343A1 (en) Viewpoint image processing method and related device
CN117058235A (en) Visual positioning method crossing various indoor scenes
CN110120009B (en) Background blurring implementation method based on salient object detection and depth estimation algorithm
CN111612827A (en) Target position determining method and device based on multiple cameras and computer equipment
CN112115786A (en) Monocular vision odometer method based on attention U-net
CN113538523A (en) Parking space detection tracking method, electronic equipment and vehicle
CN111626942A (en) Method for recovering dynamic video background based on space-time joint matrix

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant