CN114937153A

CN114937153A - Neural network-based visual feature processing system and method under weak texture environment

Info

Publication number: CN114937153A
Application number: CN202210663043.2A
Authority: CN
Inventors: 方浩; 胡家瑞; 王奥博; 陈杰
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2022-06-07
Filing date: 2022-06-07
Publication date: 2022-08-23
Anticipated expiration: 2042-06-07
Also published as: CN114937153B

Abstract

The invention discloses a system and a method for processing visual characteristics based on a neural network under a weak texture environment, wherein the processing system comprises: the device comprises a trunk network, a detector branch and a descriptor branch, wherein the detector branch and the descriptor branch are symmetrical subbranches of a twin network; the main network carries out convolution processing on an original image and outputs a deep characteristic map of the original image; the output of the first spatial module is fused with the output of the first convolution layer in the detector branch, and the detector branch outputs a corner probability map which is used for representing the probability that each point in the original image is a corner; the output of the second spatial module is fused with the output of the first convolution layer in the descriptor branch, and the descriptor branch outputs a descriptor graph which is used for representing the descriptor form of each point in the original image.

Description

Neural network-based visual feature processing system and method under weak texture environment

Technical Field

The invention relates to the field of computer vision, in particular to a neural network-based visual feature processing system and method in a weak texture environment.

Background

In recent years, the development of artificial intelligence is vigorous, global automation patterns are gradually formed, computer vision is taken as one of core perception technologies, considerable social, economic and academic values are created, the social application depth and the social application range are continuously enhanced, the industries of security, medicine, agriculture and forestry, manufacturing and the like gradually enter the visual intelligence era, and the computer vision is still an indispensable pioneer technology in intelligent innovation. The characteristic information is important in the technical process of realizing visual enabling and is a key mark for a computing system to understand and recognize images. Researchers provide rich characteristic design schemes based on graphics, and provide good operation elements for visual tasks such as image retrieval, image splicing, VSLAM, three-dimensional reconstruction and the like by considering characteristic information of discrimination and repeatability, wherein the VSLAM scheme endows intelligent bodies such as unmanned aerial vehicles, unmanned vehicles and the like with self-positioning and environment perception capabilities, and is an important technical drive for promoting intelligent unmanned construction. However, the traditional image features based on geometry excessively depend on image quality, are naturally very sensitive to imaging environment changes, and when the traditional image features are oriented to a common weak texture severe scene as shown in fig. 1, feature quality degradation is caused, so that a feature algorithm fails, and tasks such as VSLAM are crashed. The feature processing technology still has significant defects in the aspects of resisting environmental interference, coping with equipment noise and adapting to motion change, and scientific innovation and product incubation work provide more and more urgent needs for robust and accurate feature extraction and description algorithms. For the visual positioning and mapping task under the weak texture environment, the existing solutions are as follows:

scheme 1: yi K M, LIFT: Learned Invariant Feature Transform [ J ]. The scheme utilizes a motion structure recovery method to construct a supervision signal to make up for data loss, and realizes interactive connection of three subtask networks (a detector, a direction estimator and a descriptor) and End-to-End synchronous learning under a unified framework. However, a calculation sharing relation cannot be formed among sub-networks in the LIFT model, so that the LIFT features are difficult to meet the real-time application requirements.

Scheme 2: detone D et al, SuperPoint: Self-Supervised Interest Point Detection and Description [ J ]. The scheme adopts a twin neural network design, basically realizes calculation sharing between a detection network and a description network, has excellent performance in the aspect of real-time performance, adopts a self-labeling method to obtain a training sample in the SuperPoint scheme, utilizes a labeling device and homography transformation to complete false-true value labeling on an original image, and benefits from a homography transformation mechanism, the SuperPoint network can output more dense and more repeatable image characteristics. However, this work does not work well with implicit methods to model spatial characteristics.

Scheme 3: dusmanu M et al, A train able cnn for joint description and detection of local features [ C ]. The scheme provides a concept of synchronous detection and description, breaks through the traditional mode of 'detection before description' in a time dimension, the network output of the scheme simultaneously comprises characteristic position scores and descriptor information, the D2-Net realizes the complete integration of the detector and the descriptor in the true sense, and the excellent effect is achieved in the network efficiency level. However, D2-Net performs poorly in feature accuracy.

Disclosure of Invention

In view of the above, the present invention provides a system and a method for processing visual features based on a neural network in a weak texture environment, which can solve the technical problem of how to reduce the interference of a weak texture scene on the feature extraction and description process in the existing weak texture environment.

In order to solve the above-mentioned technical problems, the present invention has been accomplished as described above.

A neural network-based visual feature processing system in a weak texture environment, comprising:

the device comprises a trunk network, a detector branch and a descriptor branch, wherein the detector branch and the descriptor branch are symmetrical subbranches of a twin network;

the main network is used for receiving an input original image, performing convolution processing on the original image and outputting a deep characteristic map of the original image; the trunk network comprises a plurality of cascaded convolution layers, wherein a shallow characteristic diagram obtained after shallow convolution of the trunk network is simultaneously input into the first space module and the second space module; the first space module and the second space module are respectively used for space invariance reduction;

the detector branch comprises a plurality of concatenated convolutional layers, the output of the first spatial module is fused with the output of the first convolutional layer in the detector branch, the detector branch outputs a corner probability map, and the corner probability map is used for representing the probability that each point in the original image is a corner;

the descriptor branch comprises a plurality of concatenated convolutional layers, the output of the second spatial module is fused with the output of the first convolutional layer in the descriptor branch, and the descriptor branch outputs a descriptor graph which is used for representing the descriptor morphology of each point in the original image.

Preferably, the detector branch is configured to receive a deep feature map of the original image output by the backbone network, the detector branch includes a plurality of concatenated convolutional layers, wherein an output of the first spatial module is merged with an output of a first convolutional layer in the detector branch, the detector branch outputs a corner probability map, and the corner probability map is used for characterizing a probability that each point in the original image is a corner;

the descriptor branch is used for receiving a deep feature map of the original image output by the main network, the descriptor branch comprises a plurality of cascaded convolutional layers, wherein the output of the second spatial module is fused with the output of a first convolutional layer in the descriptor branch, and the descriptor branch outputs a descriptor graph which is used for representing descriptor morphology of each point in the original image.

Preferably, the detector branches adopt an information quantity loss function in the training process, divide the original image by taking an 8 × 8 neighborhood as a basic unit to obtain a basic unit grid, and assume that the total H is in the grid _C ×W _C A basic unit, each basic unit is represented as x _hw The true value data label set in the real scene data set is marked as Y, and the detector is provided with branch toolsThe bulk loss function is:

wherein H _C Is the total number of rows, W, of the basic cell grid _C Is the total column number of the basic unit grid, h is the row index of the basic unit grid, w is the column index of the basic unit grid, y is the pixel position of the corner point in one basic unit, l _p Normalizing the network predicted value at the pixel position of the corner point and taking the negative logarithm, x _hwy Is a network prediction value, x, at the pixel position of the corner in a base unit _hwk K is the channel number for the net prediction value at any pixel position in one base unit.

Preferably, the descriptor branch adopts a change-loss function in the training process, and the specific form is as follows:

the description subgraph corresponding to the original image: d, homography transformation: h, obtaining a description subgraph corresponding to the deformed image after the original image is subjected to homographic transformation: d'

Descriptor corresponding to original image: d is a radical of _hw And a descriptor corresponding to the deformed image: d' _h′w′

Coordinates of 8 × 8 neighborhood center pixels in the original image: p is a radical of _hw

Coordinates of 8 multiplied by 8 neighborhood center pixels in the deformed image: p' _h′w′

Judging the corresponding relation:

l _d (d _hw ，d′ _h′w′ ，s)

＝λ _d *s*max(0，m _p -d _hw ^T d′ _h′w′ )+(1-s)*max(0，d _hw ^T d′ _h′w′ -m _n )

wherein,

to describe the loss function of the device, Hp _hw Is the homographic transformation of the central pixel of 8 x 8 neighborhood in the original image _d Is a weight parameter, s is a correspondence relation determination parameter, m _p As a positive edge parameter, d _hw ^T Is d _hw Transposing; h is the row index of the basic unit grid corresponding to the original image, w is the column index of the basic unit grid corresponding to the original image, h 'is the row index of the basic unit grid corresponding to the deformed image, w' is the column index of the basic unit grid corresponding to the deformed image, m _n Is a negative edge parameter.

Further, the first space module and the second space module each include a plurality of convolutional networks, a grid generator, a sampling network, and a sampler; the space module receives a shallow feature map in the main network as input, a six-degree-of-freedom affine transformation matrix is obtained through convolution operation of the plurality of convolution networks, the obtained six-degree-of-freedom affine transformation matrix is input into a grid generator, the grid generator performs grid generation to obtain a sampling grid, and the sampler performs pixel sampling on the shallow feature map in the main network according to the sampling grid to obtain a space conversion feature map.

Further, the training of the processing system is five-stage training, in the first stage, data enhancement operation is carried out on a training sample data set, and then the training sample data set is used for training the detector branches independently; in the second stage, the detector branches obtained by the training in the first stage are utilized to label the real scene data set to obtain a characteristic labeling data set in the real scene; in the third stage, the weight parameters of the detector branches obtained by training in the first stage are completely emptied, and the detector branches are independently retrained by using the feature labeling data set obtained in the second stage; a fourth stage, re-labeling the real scene data set by using the detector branch obtained in the third stage to obtain a secondary labeled data set; and in the fifth stage, the weights of the detector branch and the descriptor branch are cleared, and the secondary labeling data set is utilized to carry out joint training on the detector branch and the descriptor branch.

The invention provides a neural network-based visual feature processing method in a weak texture environment, which comprises the following steps:

step S1: acquiring an original image;

step S2: inputting the raw image into the processing system;

step S3: the processing system performs feature detection and description on an original image to obtain an angular point and a corresponding descriptor of the original image;

step S4: based on the angular points and the descriptors of the original image, image splicing, visual positioning and scene recognition can be completed in a weak texture environment.

Has the advantages that:

the invention gives full play to the advantages of the deep learning method, guides the network to concern the scene area with rich texture information in a data driving mode, and enhances the overall spatial stability and sensitivity of the network by adding the spatial processing module in a targeted manner. In the invention, the space module is connected into the twin part in a layer jump connection mode, and the space adaptive capacity of the network model is expanded on the premise of ensuring the authenticity of deep features of the image as far as possible.

The method has the following technical effects:

(1) the invention provides a visual feature processing system based on a neural network, which is used for reducing the adverse effect of weak texture scenes on the visual feature processing work, breaking through the constraint of geometric rules on the traditional feature algorithm by adopting a data driving method, and further improving the utilization rate of image information while ensuring the real-time property.

(2) The invention introduces a space converter module, and carries out cascade superposition on the converted space conversion characteristic diagram and the original characteristic diagram, and the processing method explicitly models the space characteristic of the image, and has more excellent performance compared with the implicit modeling method in the prior work.

(3) The invention adopts the self-supervision marking method to complete the training, solves the problems of subjective error and sample loss of manual marking, further improves the data utilization rate, fully develops the network structure potential, furthest reduces the damage of scene limitation to the characteristic network, and has significant significance for enhancing the practical application value of the deep learning technology in the characteristic extraction and description problem.

(4) According to the method, strong constraints brought by geometric rules in the feature extraction process are eliminated through a twin neural network architecture and a self-supervision annotation training strategy, so that the network has excellent robustness, flexibility and scene adaptability, and therefore external environment interference and algorithm complexity are reduced.

(5) The processing system is a twin framework, adopts a feature processing algorithm, establishes a standard system integrating feature extraction and description, and performs explicit modeling on a space module by using a structure shown in figure 3, thereby ensuring the specificity and the space quality of the extracted scene features.

Drawings

FIG. 1 is a schematic diagram of a weak texture scene;

FIG. 2 is a schematic diagram of a neural network-based visual feature processing system architecture in a weak texture environment according to the present invention;

FIG. 3 is a schematic diagram of a space module architecture according to the present invention;

FIGS. 4(A) -4 (B) are schematic diagrams of a synthetic data set provided by the present invention;

FIG. 5 is a schematic diagram of a real scene data set used in the present invention;

fig. 6(a) -6 (B) are schematic diagrams of the output results of the detector provided by the present invention.

Detailed Description

The invention is described in detail below with reference to the figures and examples.

As shown in fig. 2-3, the present invention provides a neural network-based visual feature processing system in a weak texture environment, the processing system comprising:

the main network is used for receiving an input original image, performing convolution processing on the original image and outputting a deep characteristic map of the original image; the backbone network comprises a plurality of cascaded convolution layers, wherein a shallow feature map obtained after shallow convolution of the backbone network is simultaneously input into the first space module and the second space module; the first space module and the second space module are respectively used for space invariance reduction;

the descriptor branch comprises a plurality of concatenated convolutional layers, the output of the second spatial module is fused with the output of the first convolutional layer in the descriptor branch, and the descriptor branch outputs a descriptor graph which is used for representing the descriptor form of each point in the original image.

The shallow convolution is that the image is processed by only partial convolution layer and does not reach the deep level.

Further, the detector branch is configured to receive a deep feature map of the original image output by the backbone network, the detector branch includes a plurality of concatenated convolutional layers, wherein an output of the first spatial module is fused with an output of a first convolutional layer in the detector branch, the detector branch outputs a corner probability map, and the corner probability map is used for characterizing a probability that each point in the original image is a corner;

The image features are basic operation units of multiple computer vision tasks, are key information for calculating mechanism solution image content, and can get rid of geometric rule constraint and weaken external environment interference by means of strong extraction capability of a convolutional neural network on deep image features in a wide sensing domain, and a feature processing algorithm based on deep learning can get rid of geometric rule constraint and weaken external environment interference, so that after a scene image is read in, firstly, an original image is subjected to convolutional coding through a multiple convolutional layer arranged in a main network to complete deep feature extraction, and meanwhile, a shallow feature map in the main network is input into a space module, namely a space converter to be subjected to space information coding to obtain a space conversion feature map out _{Spatial-Transformer} The main purpose of deep feature extraction and spatial coding is to provide a data base for subsequent feature detection and description tasks.

Deep layer characteristic extraction:

out ₁₁ ＝ReLu(conv_11(raw_image))

out ₁₂ ＝Maxpool(ReLu(conv_12(out ₁₁ )))

out ₂₁ ＝ReLu(conv_21(out ₁₂ ))

out ₂₂ ＝Maxpool(ReLu(conv_22(out ₂₁ )))

out ₃₁ ＝ReLu(conv_31(out ₂₂ ))

out ₃₂ ＝Maxpool(ReLu(conv_32(out ₃₁ )))

out ₄₁ ＝ReLu(conv_41(out ₃₂ ))

out ₄₂ ＝ReLu(conv_42(out ₄₁ ))

the structure of the backbone network is shown in table 1 below.

TABLE 1

The detector branch is composed of a plurality of convolutional layers, the deep layer feature map is received at a first convolutional layer of the detector branch, the spatial conversion feature map is received at a first convolutional layer output position, and fusion of first convolutional layer output and the spatial conversion feature map is carried out, wherein the fusion comprises cascade connection of the first convolutional layer output feature map and the spatial conversion feature map in the detector branch along a channel dimension. The spatial transformation profile is a profile describing a spatial transformation. The first convolutional layer of the detector branch is the first convolutional layer that processes the input of the detector branch.

In the invention, an 8-neighborhood method is adopted for feature detection, non-maximum suppression is adopted in an 8 x 8 neighborhood to ensure the uniqueness of feature information, the detector branches adopt a convolution method to compress a cascade feature map obtained after fusion to 65 channels, then normalization processing is carried out on data in the 65 channels, the detector branches output a corner probability map, and the corner probability map is used for representing the probability that each point in an original image is a corner. Specifically, of the 65 channels, the values in 64 channels represent the probability of feature points at 64 pixel positions in an 8 × 8 neighborhood, and the values in the other 1 channel represent the probability of no feature existing in the 8 × 8 neighborhood. The output results of the detector branches are shown in fig. 6(a) -6 (B).

The detector branches adopt an information quantity loss function in the training process, the original image is divided by taking an 8 multiplied by 8 neighborhood as a basic unit to obtain a basic unit grid, and the total H in the grid is assumed _C ×W _C A basic unit, each basic unit is represented as x _hw And recording a true value data label set in the real scene data set as Y, wherein the specific loss function of the detector branch is as follows:

wherein H _C Is the total number of rows, W, of the basic cell grid _C Is the total column number of the basic unit grid, h is the row index of the basic unit grid, w is the column index of the basic unit grid, y is the pixel position of the corner point in one basic unit, l _p Normalizing the network predicted value at the pixel position of the diagonal point and taking a negative logarithm x _hwy For a network prediction value, x, at the pixel position of a corner in a basic unit _hwk K is the channel number for the net prediction value at any pixel position in one base unit.

The output of each convolutional layer of the detector branches is:

out _{dect_1} ＝ReLu(conv_dect_1(out ₄₂ ))

cascade superposition:

out _{dect_3} ＝ReLu(conv_dect_2(out _{dect_2} ))

out _{dect_final} ＝Softmax(out _{dect_3} )

the detector finger network structure is shown in table 2 below.

TABLE 2

The descriptor branch is composed of a plurality of convolutional layers, the deep layer feature map is received at a first layer convolutional layer of the descriptor branch, the spatial conversion feature map is received at a first convolutional layer output position, and fusion of first convolutional layer output and the spatial conversion feature map is carried out, wherein the fusion comprises cascade connection of the first convolutional layer output feature map and the spatial feature map in the descriptor branch along a channel dimension. The first convolutional layer of the descriptor branch is the first convolutional layer to process the input of the descriptor branch.

For real-time considerations, the descriptor branch is also denoted by H _C ×W _C The basic units are used as operation elements to carry out feature description work unit by unit, 256-bit descriptors are adopted to characterize feature points, and in order to enhance the fineness and the consistency of the feature description work, the central points in 8 multiplied by 8 neighborhood are used as position reference to carry out pixel-level descriptor interpolation calculation on the detected feature points, so that the feature description precision is further improved.

As the identification mark of the feature point, the most important attribute of the descriptor is the individual specificity of the descriptor, the clearly identifiable feature descriptor has important significance for feature matching and identification work and is an important guarantee for accurately completing computer vision tasks such as vision positioning, image splicing, scene reconstruction and the like, and therefore, the descriptor branch adopts a change-loss function in the training process, and the specific form is as follows:

the description subgraph corresponding to the original image: d, homographic transformation: h, obtaining a description subgraph corresponding to the deformed image after homographic transformation of the original image: d'

Descriptor corresponding to original image: d is a radical of _hw And the descriptor corresponding to the deformed image: d' _h′w′

Coordinates of central pixels of 8 multiplied by 8 neighborhoods in the deformed image: p' _h′w′

Judging the corresponding relation:

l _d (d _hw ，d′ _h′w′ ，s)

wherein λ is _d ，m _p And m _n For empirical thresholds in the loss function, λ _d Is designed to balance the loss term size of positive pair (s is 1 point pair) and negative pair (s is 0 point pair), ensuring that the network parameters fall in the right direction in harmony, and m is _p And m _n The establishment aims at controlling the network learning process, preventing the overfitting phenomenon caused by the network overfitting and ensuring that the network parameters are converged to a proper value range. s is a correspondence relation determination parameter, m _p As a positive edge parameter, d _hw ^T Is d _hw Transposing; h is the row index of the basic unit grid corresponding to the original image, w is the column index of the basic unit grid corresponding to the original image, h 'is the row index of the basic unit grid corresponding to the deformed image, w' is the column index of the basic unit grid corresponding to the deformed image, m _n Is a negative edge parameter.

The descriptor branch output is:

out _{descriptor_1} ＝ReLu(conv_descriptor_1(out ₄₂ ))

cascade superposition:

out _{descriptor_3} ＝ReLu(conv_descriptor-2(out _{descriptor_2} ))

out _{descriptor_final} ＝Normalize(out _{descriptor_3} )

the descriptor branch network structure is shown in table 3.

TABLE 3

In the invention, in the detector branch, the cascade characteristic diagram is compressed along the channel and then is subjected to characteristic position scoring, so that the characteristic position in an 8 x 8 neighborhood is determined, in the descriptor branch, the cascade characteristic diagram is compressed into 256 descriptors along the dimension of the channel, and then descriptor interpolation is carried out by taking the central pixel coordinate of the 8 x 8 neighborhood as a position reference, so that the accuracy and the specificity of the descriptors are improved.

Further, as shown in fig. 3, the space module includes a plurality of convolution networks, a grid generator, a sampling network and a sampler; the space module receives a shallow layer characteristic diagram in the main network as input, a six-degree-of-freedom affine transformation matrix is obtained through convolution operation of the plurality of convolution networks, the obtained six-degree-of-freedom affine transformation matrix is input into a grid generator, grid generation is carried out by the grid generator, a sampling grid is obtained, and a sampler carries out pixel sampling on the shallow layer characteristic diagram in the main network according to the sampling grid to obtain a space conversion characteristic diagram.

Furthermore, the training of the processing system is five-stage training, a synthetic data set containing basic geometric patterns is used as a training sample, in the first stage, data enhancement operation is carried out on the training sample data set, and then the training sample data set is used for training the detector branches independently; in the second stage, the detector branches obtained by the training in the first stage are utilized to label the real scene data set to obtain a characteristic labeling data set in the real scene; in the third stage, the weight parameters of the detector branches obtained by training in the first stage are completely emptied, and the detector branches are independently retrained by using the feature labeling data set obtained in the second stage; a fourth stage, re-labeling the real scene data set by using the detector branch obtained in the third stage to obtain a secondary labeled data set; and a fifth stage, emptying the weights of the detector branch and the descriptor branch, and performing joint training on the detector branch and the descriptor branch by using the secondary labeling data set to finally obtain a stable processing system.

The whole processing system adopts twin architecture design, and is structurally and hierarchically divided into a front-end main network, a rear-end detector branch and a descriptor branch, an original image is input into a main network (Backbone) part, the main network performs image convolution processing on the original input image and outputs a deep characteristic diagram, and the deep characteristic diagram is used as shared information and submitted to the detector branch and the descriptor branch to perform different tasks. Meanwhile, the backbone network is externally connected with a space processing module, the shallow feature map of the backbone network is separated independently and transmitted to the space processing module, the space processing module plays a role of a space Transformer (Spatial Transformer), space information is obtained after the space processing module processes the space information, and then the space information is coded into the feature information to obtain a space conversion feature map. The detector branch and the descriptor branch are twin modules and are respectively used for feature position detection and feature description tasks, in a twin framework, deep feature maps output by a main network are respectively input into the detector branch and the descriptor branch and are cascaded with the space conversion feature maps, under the guidance of a loss function, weight parameters in a processing system are continuously updated and iterated to obtain more accurate feature positions and descriptor information, and the detector branch and the descriptor branch respectively output a corner probability map of a 65 channel and a descriptor map of a 256 channel at the output end of the processing system.

In order to prevent subjective interference caused by manual labeling of data, the method adopts an automatic supervision mode to complete training, 4 stages are set in a training link, a program is utilized to automatically synthesize a synthetic data set containing basic geometric patterns (polygons, lines and stars … …) in a first stage, as shown in fig. 4(A) -4 (B), data enhancement operations such as contrast adjustment, noise addition, motion blur, brightness adjustment and the like are carried out on the data set, and then the data set is utilized to carry out preliminary training on network detector branches independently; in the second stage, the detector branches obtained by the primary training in the first stage are utilized to label the real scene data set (figure 5) to obtain a feature labeled data set in the real scene; in the third stage, all the weight parameters obtained by training in the first stage are emptied, and the real scene data set obtained in the second stage is used for independently carrying out preliminary training on the network detector branches; in the fourth stage, the detector obtained in the third stage is used for re-labeling the real scene data set, and the secondary labeling aims at further refining the quality of the data set and providing a basis for the training of the final stage; in the fifth stage, the weight is cleared again, and the high-quality data set marked in the fourth stage is used for carrying out combined training on the whole network structure (detector + descriptor), so that a complete and reliable feature processing network is finally obtained.

The invention provides a twin feature processing network based on a deep learning technology, aiming at the problems of unstable feature monitoring and poor repeatability of a Visual feature extraction and description system (VSLAM) under weak textures. The main network is provided with a multilayer convolutional neural network for extracting deep features of the image, a space converter module is externally connected to the middle of the main skeleton for explicitly coding spatial information, the spatial stability and the sensitivity of the feature information are enhanced, and the deep feature map of the image and the space conversion feature map are cascaded and superposed in a rear-end branch to provide rich data for a network output layer. In the feature detector and the descriptor branch, the feature map is divided by taking an 8 × 8 neighborhood as a basic unit, the cascade feature map is respectively compressed to 65 channels and 256 channels, a probability scoring strategy is adopted in the detector branch to determine the feature position in the 8 × 8 neighborhood (the 65 th channel numerical value represents the featureless probability in the neighborhood), and 256 descriptors are adopted in the descriptor branch to mark feature information. In order to overcome subjective errors caused by manual marking characteristics, the data label is constructed in an automatic supervision marking mode, and meanwhile, the training process is finely divided into five stages to improve the data quality and enhance the network precision, so that the data dilemma caused by rare samples is effectively solved. By the visual feature processing system provided by the invention, multiple computer visual tasks such as visual positioning, scene reconstruction, image splicing and the like can be continuously and stably carried out in a weak texture environment, and the original defects such as feature loss, algorithm collapse and the like are alleviated. The invention ensures the real-time performance of the system to the greatest extent while realizing function enhancement, effectively controls the feature detection quantity at the detector level by setting the 8 multiplied by 8 neighborhood, branches at the descriptor, and carries out descriptor interpolation calculation by taking the central pixel coordinate of the 8 multiplied by 8 neighborhood as the position reference in order to further improve the specificity of feature description without damaging the real-time performance of the algorithm.

The invention also provides a visual feature processing method based on the neural network in the weak texture environment, wherein the method is based on the processing system, and the processing method comprises the following steps:

step S1: acquiring an original image;

step S2: inputting the raw image into the processing system;

The above embodiments only describe the design principle of the present invention, and the shapes and names of the components in the description may be different without limitation. Therefore, a person skilled in the art of the present invention can modify or substitute the technical solutions described in the foregoing embodiments; such modifications and substitutions do not depart from the spirit and scope of the present invention.

Claims

1. A neural network-based visual feature processing system in a weak texture environment, the processing system comprising:

2. The system of claim 1, wherein the detector branch is configured to receive a deep feature map of the original image output from the backbone network, the detector branch comprising a plurality of concatenated convolutional layers, wherein an output of the first spatial module is fused with an output of a first convolutional layer in the detector branch, the detector branch outputting a corner probability map, the corner probability map being configured to characterize a probability that each point in the original image is a corner;

3. The system of any of claims 1-2, wherein the detector branches use a traffic loss function during training to partition the original image into a grid of base cells by using an 8 x 8 neighborhood as a base cell, assuming a total of H within the grid _C ×W _C A basic unit, each basic unit is represented as x _hw And recording a true value data label set in the real scene data set as Y, wherein the specific loss function of the detector branch is as follows:

wherein H _C Is the total number of rows, W, of the basic cell grid _C Is the total column number of the basic unit grid, h is the row index of the basic unit grid, w is the column index of the basic unit grid, y is the pixel position of the corner point in one basic unit, l _p Normalizing the network predicted value at the pixel position of the corner point and taking the negative logarithm, x _hwy For a network prediction value, x, at the pixel position of a corner in a basic unit _hwk K is the channel number for the net prediction value at any pixel position in one base unit.

4. A system as claimed in any one of claims 1 to 2, wherein said descriptor branch employs a change-loss function during training in the form of:

the description subgraph corresponding to the original image: d, homography transformation: h, obtaining a description subgraph corresponding to the deformed image after homographic transformation of the original image: d'

Descriptor corresponding to original image: d _hw And a descriptor corresponding to the deformed image: d' _h′w′

Judging the corresponding relation:

l _d (d _hw ,d′ _h′w′ ,s)

＝λ _d *s*max(0,m _p -d _hw ^T d′ _h′w′ )+(1-s)*max(0,d _hw ^T d′ _h′w′ -m _n )

wherein,

to describe the loss function of the device, Hp _hw Is the coordinate of 8 multiplied by 8 neighborhood center pixel in the original image after homographic transformation, lambda _d Is a weight parameter, s is a correspondence relation determination parameter, m _p As a positive edge parameter, d _hw ^T Is d _hw Transposing; h is the row index of the basic unit grid corresponding to the original image, w is the column index of the basic unit grid corresponding to the original image, h 'is the row index of the basic unit grid corresponding to the deformed image, w' is the column index of the basic unit grid corresponding to the deformed image, m _n Is a negative edge parameter.

5. The system of any of claims 1-2, wherein the first spatial module and the second spatial module each comprise a plurality of convolutional networks, a trellis generator, a sampling network, and a sampler; the space module receives a shallow feature map in the main network as input, a six-degree-of-freedom affine transformation matrix is obtained through convolution operation of the plurality of convolution networks, the obtained six-degree-of-freedom affine transformation matrix is input into a grid generator, the grid generator performs grid generation to obtain a sampling grid, and the sampler performs pixel sampling on the shallow feature map in the main network according to the sampling grid to obtain a space conversion feature map.

6. The system of any of claims 1-2, wherein the training of the processing system is a five-stage training, a first stage, in which a data enhancement operation is performed on a training sample data set, followed by training of the detector branches separately using the training sample data set; in the second stage, the detector branches obtained by the training in the first stage are utilized to label the real scene data set to obtain a characteristic labeling data set in the real scene; in the third stage, the weight parameters of the detector branches obtained by training in the first stage are completely emptied, and the detector branches are independently retrained by using the feature labeling data set obtained in the second stage; a fourth stage, re-labeling the real scene data set by using the detector branch obtained in the third stage to obtain a secondary labeled data set; and in the fifth stage, the weights of the detector branches and the descriptor branches are cleared, and the secondary labeling data sets are utilized to carry out joint training on the detector branches and the descriptor branches.

7. A neural network-based visual feature processing method in a weak texture environment, which is performed by the processing system according to any one of claims 1-6, and comprises the following steps:

step S1: acquiring an original image;

step S2: inputting the raw image into the processing system;