CN114937153A - Neural network-based visual feature processing system and method under weak texture environment - Google Patents
- Publication number
- CN114937153A (application CN202210663043.2A)
- Authority
- CN
- China
- Prior art keywords
- descriptor
- branch
- original image
- detector
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a neural network-based visual feature processing system and method for weak texture environments. The processing system comprises a backbone network, a detector branch and a descriptor branch, wherein the detector branch and the descriptor branch are symmetrical sub-branches of a twin network. The backbone network convolves an original image and outputs a deep feature map of the original image, while a shallow feature map of the backbone network is fed to a first spatial module and a second spatial module. The output of the first spatial module is fused with the output of the first convolutional layer in the detector branch, and the detector branch outputs a corner probability map which represents the probability that each point in the original image is a corner. The output of the second spatial module is fused with the output of the first convolutional layer in the descriptor branch, and the descriptor branch outputs a descriptor map which represents the descriptor form of each point in the original image.
Description
Technical Field
The invention relates to the field of computer vision, in particular to a neural network-based visual feature processing system and method in a weak texture environment.
Background
In recent years, artificial intelligence has developed vigorously and a global pattern of automation has gradually taken shape. Computer vision, as one of the core perception technologies, has created considerable social, economic and academic value, and its depth and range of application keep growing: industries such as security, medicine, agriculture and forestry, and manufacturing have gradually entered the era of visual intelligence, and computer vision remains an indispensable pioneering technology in intelligent innovation. Feature information is central to visual enablement and is the key cue by which a computing system understands and recognizes images. Researchers have proposed rich feature design schemes based on graphics; feature information with discriminability and repeatability provides good operating elements for visual tasks such as image retrieval, image stitching, VSLAM and three-dimensional reconstruction. Among these, VSLAM endows agents such as unmanned aerial vehicles and unmanned ground vehicles with self-localization and environment perception capabilities and is an important technical driver of intelligent unmanned systems. However, traditional geometry-based image features depend excessively on image quality and are inherently sensitive to changes in the imaging environment; when facing a common weak-texture adverse scene such as that shown in fig. 1, feature quality degrades, the feature algorithm fails, and tasks such as VSLAM crash.
Feature processing technology still has significant shortcomings in resisting environmental interference, coping with device noise and adapting to motion changes, and scientific innovation and product incubation place increasingly urgent demands on robust and accurate feature extraction and description algorithms. For visual localization and mapping tasks in weak texture environments, the existing solutions are as follows:
scheme 1: yi K M, LIFT: Learned Invariant Feature Transform [ J ]. The scheme utilizes a motion structure recovery method to construct a supervision signal to make up for data loss, and realizes interactive connection of three subtask networks (a detector, a direction estimator and a descriptor) and End-to-End synchronous learning under a unified framework. However, a calculation sharing relation cannot be formed among sub-networks in the LIFT model, so that the LIFT features are difficult to meet the real-time application requirements.
Scheme 2: detone D et al, SuperPoint: Self-Supervised Interest Point Detection and Description [ J ]. The scheme adopts a twin neural network design, basically realizes calculation sharing between a detection network and a description network, has excellent performance in the aspect of real-time performance, adopts a self-labeling method to obtain a training sample in the SuperPoint scheme, utilizes a labeling device and homography transformation to complete false-true value labeling on an original image, and benefits from a homography transformation mechanism, the SuperPoint network can output more dense and more repeatable image characteristics. However, this work does not work well with implicit methods to model spatial characteristics.
Scheme 3: dusmanu M et al, A train able cnn for joint description and detection of local features [ C ]. The scheme provides a concept of synchronous detection and description, breaks through the traditional mode of 'detection before description' in a time dimension, the network output of the scheme simultaneously comprises characteristic position scores and descriptor information, the D2-Net realizes the complete integration of the detector and the descriptor in the true sense, and the excellent effect is achieved in the network efficiency level. However, D2-Net performs poorly in feature accuracy.
Disclosure of Invention
In view of the above, the present invention provides a neural network-based visual feature processing system and method for weak texture environments, which can solve the technical problem of reducing the interference of weak-texture scenes with the feature extraction and description process.
To solve the above technical problems, the present invention is realized as follows.
A neural network-based visual feature processing system in a weak texture environment, comprising:
the system comprises a backbone network, a detector branch and a descriptor branch, wherein the detector branch and the descriptor branch are symmetrical sub-branches of a twin network;
the backbone network is used for receiving an input original image, convolving the original image and outputting a deep feature map of the original image; the backbone network comprises a plurality of cascaded convolutional layers, wherein the shallow feature map obtained after shallow convolution in the backbone network is simultaneously input into a first spatial module and a second spatial module; the first spatial module and the second spatial module are each used to restore spatial invariance;
the detector branch comprises a plurality of cascaded convolutional layers; the output of the first spatial module is fused with the output of the first convolutional layer in the detector branch, and the detector branch outputs a corner probability map which represents the probability that each point in the original image is a corner;
the descriptor branch comprises a plurality of cascaded convolutional layers; the output of the second spatial module is fused with the output of the first convolutional layer in the descriptor branch, and the descriptor branch outputs a descriptor map which represents the descriptor form of each point in the original image.
Preferably, the detector branch is configured to receive the deep feature map of the original image output by the backbone network; the detector branch comprises a plurality of cascaded convolutional layers, wherein the output of the first spatial module is fused with the output of the first convolutional layer in the detector branch, and the detector branch outputs a corner probability map which represents the probability that each point in the original image is a corner;
the descriptor branch is configured to receive the deep feature map of the original image output by the backbone network; the descriptor branch comprises a plurality of cascaded convolutional layers, wherein the output of the second spatial module is fused with the output of the first convolutional layer in the descriptor branch, and the descriptor branch outputs a descriptor map which represents the descriptor form of each point in the original image.
Preferably, the detector branch adopts an information-quantity loss function during training. The original image is divided with the 8 × 8 neighborhood as the basic unit, obtaining a basic-unit grid; suppose the grid contains H_C × W_C basic units in total, each denoted x_{hw}, and denote the set of ground-truth data labels of the real-scene data set as Y. The specific loss function of the detector branch is:

$$L_p(X,Y)=\frac{1}{H_C W_C}\sum_{h=1}^{H_C}\sum_{w=1}^{W_C} l_p(x_{hw};y_{hw}),\qquad l_p(x_{hw};y)=-\log\frac{\exp(x_{hwy})}{\sum_{k=1}^{65}\exp(x_{hwk})}$$

where H_C is the total number of rows of the basic-unit grid, W_C is the total number of columns of the basic-unit grid, h is the row index of the basic-unit grid, w is the column index of the basic-unit grid, y is the pixel position of the corner within a basic unit, l_p is the negative logarithm of the normalized network prediction at the corner pixel position, x_{hwy} is the network prediction at the corner pixel position within a basic unit, x_{hwk} is the network prediction at an arbitrary pixel position within a basic unit, and k is the channel index.
Preferably, the descriptor branch adopts a hinge loss function during training, in the following specific form:

Descriptor map corresponding to the original image: D; homographic transformation: H; descriptor map corresponding to the warped image obtained by applying the homographic transformation to the original image: D′.

Descriptor corresponding to the original image: d_{hw}; descriptor corresponding to the warped image: d′_{h′w′}.

Coordinates of the 8 × 8 neighborhood center pixel in the original image: p_{hw}.

Coordinates of the 8 × 8 neighborhood center pixel in the warped image: p′_{h′w′}.

Correspondence decision:

$$s=\begin{cases}1, & \lVert Hp_{hw}-p'_{h'w'}\rVert \le 8\\ 0, & \text{otherwise}\end{cases}$$

$$l_d(d_{hw},d'_{h'w'},s)=\lambda_d\cdot s\cdot\max(0,\,m_p-d_{hw}^{\top}d'_{h'w'})+(1-s)\cdot\max(0,\,d_{hw}^{\top}d'_{h'w'}-m_n)$$

where l_d is the loss function of the descriptor, Hp_{hw} is the homographic transformation of the 8 × 8 neighborhood center pixel of the original image, λ_d is a weight parameter, s is the correspondence decision parameter, m_p is the positive margin parameter, and d_{hw}^{\top} is the transpose of d_{hw}; h is the row index of the basic-unit grid corresponding to the original image, w is the column index of that grid, h′ is the row index of the basic-unit grid corresponding to the warped image, w′ is the column index of that grid, and m_n is the negative margin parameter.
Further, the first spatial module and the second spatial module each comprise a plurality of convolutional layers, a grid generator, a sampling grid and a sampler. The spatial module receives the shallow feature map of the backbone network as input; a six-degree-of-freedom affine transformation matrix is obtained through the convolution operations of the convolutional layers and fed into the grid generator, which generates the sampling grid; the sampler then samples pixels of the shallow feature map of the backbone network according to the sampling grid to obtain the spatial conversion feature map.
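As a concrete illustration, the spatial module above can be sketched in PyTorch with the standard spatial-transformer primitives. The localization layer sizes below are assumptions for illustration, not the patent's actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialModule(nn.Module):
    """Spatial transformer sketch: convolutions regress a six-degree-of-freedom
    affine matrix, a grid generator builds the sampling grid, and a sampler
    resamples the shallow feature map accordingly."""
    def __init__(self, channels):
        super().__init__()
        # Localization convolutions (illustrative widths)
        self.loc = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 4 * 4, 6),
        )
        # Start at the identity transform so early training is stable
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(
            torch.tensor([1.0, 0.0, 0.0, 0.0, 1.0, 0.0]))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)                  # 6-DoF affine matrix
        grid = F.affine_grid(theta, list(x.size()), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)  # pixel sampling
```

With the identity initialization the module initially passes its input through unchanged; training then lets the affine parameters deviate wherever spatial correction helps.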
Further, the processing system is trained in five stages. In the first stage, data enhancement is applied to the training sample data set, which is then used to train the detector branch alone. In the second stage, the detector branch obtained in the first stage is used to label the real-scene data set, obtaining a feature-labeled data set of real scenes. In the third stage, the weight parameters of the detector branch obtained in the first stage are fully reset, and the detector branch is retrained alone with the feature-labeled data set obtained in the second stage. In the fourth stage, the real-scene data set is re-labeled with the detector branch obtained in the third stage, obtaining a secondary labeled data set. In the fifth stage, the weights of the detector branch and the descriptor branch are reset, and the detector branch and the descriptor branch are jointly trained with the secondary labeled data set.
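The five stages can be summarized as a driver routine. The sketch below is an illustrative outline; `train_detector`, `label_dataset`, `reset_weights` and `train_joint` are hypothetical stand-ins for the actual training routines, not names from the patent:

```python
def five_stage_training(ops):
    """Drive the five-stage self-supervised schedule described above.
    `ops` supplies stand-in callables for the actual training routines."""
    # Stage 1: data enhancement on the training samples, detector alone
    ops.train_detector(dataset="synthetic+augmented")
    # Stage 2: self-label the real-scene data set with the stage-1 detector
    labels = ops.label_dataset(pass_no=1)
    # Stage 3: reset detector weights, retrain alone on the pseudo labels
    ops.reset_weights("detector")
    ops.train_detector(dataset=labels)
    # Stage 4: re-label the real scenes with the stronger stage-3 detector
    labels2 = ops.label_dataset(pass_no=2)
    # Stage 5: reset both branches, train detector + descriptor jointly
    ops.reset_weights("detector", "descriptor")
    ops.train_joint(dataset=labels2)

class Recorder:
    """Minimal stand-in that records the order of the stage operations."""
    def __init__(self):
        self.calls = []
    def train_detector(self, dataset):
        self.calls.append("train_detector")
    def label_dataset(self, pass_no):
        self.calls.append("label")
        return f"labels{pass_no}"
    def reset_weights(self, *parts):
        self.calls.append("reset")
    def train_joint(self, dataset):
        self.calls.append("train_joint")
```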
The invention provides a neural network-based visual feature processing method in a weak texture environment, which comprises the following steps:
step S1: acquiring an original image;
step S2: inputting the raw image into the processing system;
step S3: the processing system performs feature detection and description on the original image to obtain the corner points of the original image and the corresponding descriptors;
step S4: based on the corner points and descriptors of the original image, image stitching, visual localization and scene recognition can be completed in a weak texture environment.
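Viewed from the caller's side, steps S1-S4 amount to the following sketch; the `system` callable and the confidence threshold are illustrative assumptions:

```python
import numpy as np

def extract_features(image, system, conf_thresh=0.015):
    """Steps S1-S4 sketch: run the processing system on an image, keep points
    whose corner probability clears the threshold, and return point/descriptor
    pairs ready for stitching, visual localization or scene recognition."""
    prob_map, desc_map = system(image)            # S2-S3: corner probability map
                                                  #        + descriptor map
    ys, xs = np.nonzero(prob_map > conf_thresh)   # select confident corners
    points = np.stack([xs, ys], axis=1)           # (N, 2) pixel coordinates
    descriptors = desc_map[:, ys, xs].T           # (N, 256) descriptors
    return points, descriptors
```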
Advantageous effects:
The invention gives full play to the strengths of deep learning, guiding the network in a data-driven manner to attend to scene regions rich in texture information, and enhances the overall spatial stability and sensitivity of the network by purposefully adding spatial processing modules. In the invention, the spatial modules are connected into the twin part through skip connections, extending the spatial adaptability of the network model while preserving the authenticity of deep image features as far as possible.
The invention has the following technical effects:
(1) The invention provides a neural-network-based visual feature processing system that reduces the adverse effect of weak-texture scenes on visual feature processing; a data-driven method breaks the constraints that geometric rules impose on traditional feature algorithms, further improving the utilization of image information while preserving real-time performance.
(2) The invention introduces a spatial transformer module and cascades the spatial conversion feature map with the original feature map. This processing explicitly models the spatial characteristics of the image and outperforms the implicit modeling of prior work.
(3) The invention completes training with a self-supervised labeling method, which removes the subjective error and sample loss of manual labeling, further improves data utilization, fully exploits the potential of the network structure, and minimizes the damage that scene limitations inflict on the feature network; this is significant for the practical value of deep learning in feature extraction and description.
(4) Through the twin neural network architecture and the self-supervised labeling training strategy, the invention removes the strong constraints that geometric rules impose on the feature extraction process, so that the network has excellent robustness, flexibility and scene adaptability, thereby reducing external environmental interference and algorithm complexity.
(5) The processing system is a twin framework that adopts a feature processing algorithm, establishes a standard pipeline integrating feature extraction and description, and explicitly models the spatial module with the structure shown in figure 3, ensuring the specificity and spatial quality of the extracted scene features.
Drawings
FIG. 1 is a schematic diagram of a weak texture scene;
FIG. 2 is a schematic diagram of a neural network-based visual feature processing system architecture in a weak texture environment according to the present invention;
FIG. 3 is a schematic diagram of a space module architecture according to the present invention;
FIGS. 4(A)-4(B) are schematic diagrams of the synthetic data set provided by the present invention;
FIG. 5 is a schematic diagram of the real scene data set used in the present invention;
FIGS. 6(A)-6(B) are schematic diagrams of the output results of the detector provided by the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and examples.
As shown in fig. 2-3, the present invention provides a neural network-based visual feature processing system in a weak texture environment, the processing system comprising:
the system comprises a backbone network, a detector branch and a descriptor branch, wherein the detector branch and the descriptor branch are symmetrical sub-branches of a twin network;
the backbone network is used for receiving an input original image, convolving the original image and outputting a deep feature map of the original image; the backbone network comprises a plurality of cascaded convolutional layers, wherein the shallow feature map obtained after shallow convolution in the backbone network is simultaneously input into a first spatial module and a second spatial module; the first spatial module and the second spatial module are each used to restore spatial invariance;
the detector branch comprises a plurality of cascaded convolutional layers; the output of the first spatial module is fused with the output of the first convolutional layer in the detector branch, and the detector branch outputs a corner probability map which represents the probability that each point in the original image is a corner;
the descriptor branch comprises a plurality of cascaded convolutional layers; the output of the second spatial module is fused with the output of the first convolutional layer in the descriptor branch, and the descriptor branch outputs a descriptor map which represents the descriptor form of each point in the original image.
Shallow convolution means that the image has passed through only part of the convolutional layers and has not yet reached the deep layers.
Further, the detector branch is configured to receive the deep feature map of the original image output by the backbone network; the detector branch comprises a plurality of cascaded convolutional layers, wherein the output of the first spatial module is fused with the output of the first convolutional layer in the detector branch, and the detector branch outputs a corner probability map which represents the probability that each point in the original image is a corner;
the descriptor branch is configured to receive the deep feature map of the original image output by the backbone network; the descriptor branch comprises a plurality of cascaded convolutional layers, wherein the output of the second spatial module is fused with the output of the first convolutional layer in the descriptor branch, and the descriptor branch outputs a descriptor map which represents the descriptor form of each point in the original image.
Image features are the basic operating units of many computer vision tasks and the key information by which a computing system parses image content. By virtue of the strong capability of convolutional neural networks to extract deep image features over a wide receptive field, a deep-learning-based feature processing algorithm can break free of geometric-rule constraints and weaken external environmental interference. Therefore, after a scene image is read in, the original image is first convolutionally encoded by the multiple convolutional layers of the backbone network to complete deep feature extraction; at the same time, the shallow feature map of the backbone network is input into the spatial module, i.e. the spatial transformer, for spatial information encoding, yielding the spatial conversion feature map out_{Spatial-Transformer}. The main purpose of deep feature extraction and spatial encoding is to provide a data basis for the subsequent feature detection and description tasks.
Deep feature extraction:
out_11 = ReLU(conv_11(raw_image))
out_12 = MaxPool(ReLU(conv_12(out_11)))
out_21 = ReLU(conv_21(out_12))
out_22 = MaxPool(ReLU(conv_22(out_21)))
out_31 = ReLU(conv_31(out_22))
out_32 = MaxPool(ReLU(conv_32(out_31)))
out_41 = ReLU(conv_41(out_32))
out_42 = ReLU(conv_42(out_41))
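The eight equations above transcribe directly into PyTorch. Since Table 1 did not survive extraction, the channel widths here (single-channel input, 64/64/128/128) are assumptions in the spirit of similar VGG-style detector backbones, not the patent's stated values:

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """Transcription of out_11 ... out_42: eight 3x3 conv layers with ReLU,
    max-pooling after conv_12, conv_22 and conv_32, so the deep feature map
    is 1/8 the input resolution."""
    def __init__(self):
        super().__init__()
        widths = [1, 64, 64, 64, 64, 128, 128, 128, 128]  # assumed channels
        self.convs = nn.ModuleList(
            nn.Conv2d(widths[i], widths[i + 1], 3, padding=1)
            for i in range(8))
        self.pool = nn.MaxPool2d(2)

    def forward(self, raw_image):
        x = raw_image
        for i, conv in enumerate(self.convs):
            x = torch.relu(conv(x))
            if i in (1, 3, 5):      # after conv_12, conv_22, conv_32
                x = self.pool(x)
        return x                    # deep feature map out_42
```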
the structure of the backbone network is shown in table 1 below.
TABLE 1
The detector branch is composed of a plurality of convolutional layers. The deep feature map is received at the first convolutional layer of the detector branch, the spatial conversion feature map is received at the output of that layer, and the first-layer output is fused with the spatial conversion feature map by cascading the two feature maps along the channel dimension. The spatial conversion feature map is the feature map output by the spatial module after spatial transformation. The first convolutional layer of the detector branch is the first convolutional layer that processes the input of the detector branch.
In the invention, feature detection adopts an 8-neighborhood method, and non-maximum suppression is applied within each 8 × 8 neighborhood to ensure the uniqueness of the feature information. The detector branch compresses the cascaded feature map obtained after fusion to 65 channels by convolution and then normalizes the data across the 65 channels; the detector branch outputs a corner probability map representing the probability that each point in the original image is a corner. Specifically, among the 65 channels, the values in 64 channels represent the probability of a feature point at each of the 64 pixel positions of an 8 × 8 neighborhood, and the value in the remaining channel represents the probability that no feature exists in the 8 × 8 neighborhood. The output results of the detector branch are shown in figs. 6(A)-6(B).
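Concretely, the 65-channel cell output can be decoded into a full-resolution corner probability map as follows. This NumPy sketch of the decoding just described uses a softmax for the normalization step and omits non-maximum suppression:

```python
import numpy as np

def decode_corner_map(logits):
    """Turn (65, Hc, Wc) detector logits into an (8*Hc, 8*Wc) corner
    probability map: softmax over the 65 channels, drop the 'no feature'
    channel, and unfold each remaining 64-vector into its 8x8 cell."""
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    probs = e / e.sum(axis=0, keepdims=True)       # per-cell normalization
    cells = probs[:-1]                             # drop the 65th channel
    _, hc, wc = cells.shape
    cells = cells.reshape(8, 8, hc, wc)            # channel r*8+c -> pixel (r, c)
    return cells.transpose(2, 0, 3, 1).reshape(hc * 8, wc * 8)
```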
The detector branch adopts an information-quantity loss function during training. The original image is divided with the 8 × 8 neighborhood as the basic unit, obtaining a basic-unit grid; suppose the grid contains H_C × W_C basic units in total, each denoted x_{hw}, and denote the set of ground-truth data labels of the real-scene data set as Y. The specific loss function of the detector branch is:

$$L_p(X,Y)=\frac{1}{H_C W_C}\sum_{h=1}^{H_C}\sum_{w=1}^{W_C} l_p(x_{hw};y_{hw}),\qquad l_p(x_{hw};y)=-\log\frac{\exp(x_{hwy})}{\sum_{k=1}^{65}\exp(x_{hwk})}$$

where H_C is the total number of rows of the basic-unit grid, W_C is the total number of columns of the basic-unit grid, h is the row index of the basic-unit grid, w is the column index of the basic-unit grid, y is the pixel position of the corner within a basic unit, l_p is the negative logarithm of the normalized network prediction at the corner pixel position, x_{hwy} is the network prediction at the corner pixel position within a basic unit, x_{hwk} is the network prediction at an arbitrary pixel position within a basic unit, and k is the channel index.
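A NumPy sketch of this information-quantity loss, reconstructed to be consistent with the definitions above (65 channels per the 8 × 8-cell encoding):

```python
import numpy as np

def detector_loss(x, y):
    """x: (Hc, Wc, 65) raw detector outputs per basic unit.
    y: (Hc, Wc) integer labels, the corner pixel index 0..63 within each
    8x8 cell (or 64 for the no-corner channel). Returns the mean over the
    basic-unit grid of -log softmax at the labeled position."""
    hc, wc, _ = x.shape
    e = np.exp(x - x.max(axis=-1, keepdims=True))    # stable softmax over k
    p = e / e.sum(axis=-1, keepdims=True)
    h, w = np.indices((hc, wc))
    lp = -np.log(p[h, w, y])                         # l_p at each cell
    return lp.sum() / (hc * wc)                      # average over H_C x W_C
```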
The output of each convolutional layer of the detector branch is:
out_dect_1 = ReLU(conv_dect_1(out_42))
out_dect_2 = Concat(out_dect_1, out_Spatial-Transformer)
out_dect_3 = ReLU(conv_dect_2(out_dect_2))
out_dect_final = Softmax(out_dect_3)
the detector finger network structure is shown in table 2 below.
TABLE 2
The descriptor branch is composed of a plurality of convolutional layers. The deep feature map is received at the first convolutional layer of the descriptor branch, the spatial conversion feature map is received at the output of that layer, and the first-layer output is fused with the spatial conversion feature map by cascading the two feature maps along the channel dimension. The first convolutional layer of the descriptor branch is the first convolutional layer that processes the input of the descriptor branch.
For real-time reasons, the descriptor branch likewise takes the H_C × W_C basic units as its operating elements and performs feature description unit by unit, characterizing feature points with 256-bit descriptors. To enhance the fineness and consistency of the feature description, pixel-level descriptor interpolation is computed for the detected feature points with the 8 × 8 neighborhood center point as the position reference, further improving the precision of the feature description.
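The pixel-level descriptor interpolation can be sketched as plain bilinear sampling of the descriptor map at the refined feature position, followed by L2 normalization. This is an illustrative implementation, not the patent's exact code:

```python
import numpy as np

def sample_descriptors(desc_map, points):
    """Bilinearly interpolate a (C, H, W) descriptor map at float (x, y)
    feature positions and L2-normalize each sampled descriptor."""
    c, h, w = desc_map.shape
    out = np.empty((len(points), c))
    for i, (x, y) in enumerate(points):
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
        ax, ay = x - x0, y - y0                     # bilinear weights
        d = ((1 - ax) * (1 - ay) * desc_map[:, y0, x0]
             + ax * (1 - ay) * desc_map[:, y0, x1]
             + (1 - ax) * ay * desc_map[:, y1, x0]
             + ax * ay * desc_map[:, y1, x1])
        out[i] = d / max(np.linalg.norm(d), 1e-12)  # L2 normalization
    return out
```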
As the identifying signature of a feature point, the most important attribute of a descriptor is its specificity: clearly distinguishable feature descriptors are essential for feature matching and recognition, and are an important guarantee for accurately completing computer vision tasks such as visual localization, image stitching and scene reconstruction. The descriptor branch therefore adopts a hinge-loss function during training, of the following specific form:
The description map corresponding to the original image: D; the homography transformation: H; the description map corresponding to the deformed image, obtained after homography transformation of the original image: D′.
The descriptor corresponding to the original image: d_hw; the descriptor corresponding to the deformed image: d′_h′w′.
The coordinates of the 8 × 8 neighborhood center pixel in the original image: p_hw.
The coordinates of the 8 × 8 neighborhood center pixel in the deformed image: p′_h′w′.
Correspondence judgment: s = 1 when the homography-warped center point Hp_hw falls within a preset distance threshold of p′_h′w′, and s = 0 otherwise. The descriptor loss is:
l_d(d_hw, d′_h′w′, s) = λ_d · s · max(0, m_p − d_hw^T d′_h′w′) + (1 − s) · max(0, d_hw^T d′_h′w′ − m_n)
where λ_d, m_p and m_n are empirical thresholds in the loss function. λ_d is designed to balance the magnitudes of the positive-pair (s = 1) and negative-pair (s = 0) loss terms, ensuring that the network parameters are updated in a coordinated, correct direction, while m_p and m_n are set to control the learning process, prevent overfitting, and ensure that the network parameters converge to an appropriate range. s is the correspondence judgment parameter, m_p is the positive margin parameter, m_n is the negative margin parameter, and d_hw^T is the transpose of d_hw; h and w are the row and column indices of the basic-cell grid corresponding to the original image, and h′ and w′ are the row and column indices of the basic-cell grid corresponding to the deformed image.
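Under the assumption that s is decided by whether the homography-warped cell center Hp_hw lands near p′_h′w′, the correspondence judgment and the hinge loss above can be sketched as follows. The threshold tau and the default values of λ_d, m_p and m_n are assumptions, not values taken from the patent:

```python
import numpy as np

def correspondence(H, p_hw, p_prime, tau=8.0):
    """s = 1 if the warped centre Hp_hw lies within tau pixels of p'_h'w'.

    H: 3x3 homography; p_hw, p_prime: (x, y) pixel coordinates.
    The threshold tau is an assumed value.
    """
    warped = H @ np.append(p_hw, 1.0)
    warped = warped[:2] / warped[2]          # back from homogeneous coords
    return 1 if np.linalg.norm(warped - p_prime) <= tau else 0

def descriptor_hinge_loss(d, d_prime, s, lam_d=250.0, m_p=1.0, m_n=0.2):
    """Hinge loss l_d for one descriptor pair, mirroring the formula above.

    d, d_prime: 256-d descriptors d_hw and d'_h'w'
    s: 1 for a corresponding (positive) pair, 0 otherwise
    lam_d, m_p, m_n: assumed weight and margin values.
    """
    dot = float(np.dot(d, d_prime))
    positive = lam_d * s * max(0.0, m_p - dot)       # pull matches together
    negative = (1 - s) * max(0.0, dot - m_n)         # push non-matches apart
    return positive + negative
```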
The descriptor branch output is:
out_descriptor_1 = ReLU(conv_descriptor_1(out_42))
out_descriptor_2 = ReLU(conv_descriptor_2(out_descriptor_1))
out_descriptor_3 = ReLU(conv_descriptor_3(out_descriptor_2))
out_descriptor_final = Normalize(out_descriptor_3)
the descriptor branch network structure is shown in table 3.
TABLE 3
In the invention, in the detector branch, the concatenated feature map is compressed along the channel dimension and then scored for feature position, determining the feature position within each 8 × 8 neighborhood. In the descriptor branch, the concatenated feature map is compressed along the channel dimension into 256-dimensional descriptors, and descriptor interpolation is then performed using the center pixel coordinates of the 8 × 8 neighborhood as the position reference, improving the accuracy and specificity of the descriptors.
Further, as shown in fig. 3, the spatial module includes a plurality of convolutional networks, a grid generator, a sampling network and a sampler. The spatial module receives a shallow feature map from the backbone network as input; a six-degree-of-freedom affine transformation matrix is obtained through the convolution operations of the plurality of convolutional networks and input into the grid generator, which performs grid generation to obtain a sampling grid; the sampler then performs pixel sampling on the shallow feature map of the backbone network according to the sampling grid to obtain the spatial-transform feature map.
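The grid-generator-plus-sampler pipeline can be sketched for a single-channel map as follows: the 2 × 3 affine matrix builds a sampling grid in normalized coordinates, and the sampler reads the shallow feature map bilinearly. This is a minimal sketch; the normalized-coordinate convention and zero padding outside the map are assumptions:

```python
import numpy as np

def spatial_transform(feat, theta):
    """Warp a feature map with a six-parameter affine matrix theta (2x3).

    feat: (H, W) single-channel shallow feature map
    theta: 2x3 affine matrix predicted by the localisation conv-net.
    Mirrors the grid-generator + sampler stages: build a sampling grid in
    normalised [-1, 1] coordinates, then sample bilinearly (zero padding).
    """
    H, W = feat.shape
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W),
                         indexing="ij")
    grid = np.stack([xs, ys, np.ones_like(xs)], axis=-1)   # (H, W, 3)
    src = grid @ theta.T                                   # source coords
    # back to pixel coordinates
    px = (src[..., 0] + 1) * (W - 1) / 2
    py = (src[..., 1] + 1) * (H - 1) / 2
    x0, y0 = np.floor(px).astype(int), np.floor(py).astype(int)
    out = np.zeros_like(feat, dtype=float)
    # accumulate the four bilinear neighbours, zero outside the map
    for dy in (0, 1):
        for dx in (0, 1):
            xi, yi = x0 + dx, y0 + dy
            wgt = (1 - np.abs(px - xi)) * (1 - np.abs(py - yi))
            valid = (xi >= 0) & (xi < W) & (yi >= 0) & (yi < H)
            vals = feat[np.clip(yi, 0, H - 1), np.clip(xi, 0, W - 1)]
            out += np.where(valid, wgt * vals, 0.0)
    return out
```

An identity theta reproduces the input map unchanged, which is a convenient correctness check for the grid and sampler.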
Furthermore, the training of the processing system is a five-stage training that uses a synthetic data set containing basic geometric patterns as training samples. In the first stage, a data enhancement operation is performed on the training sample data set, and the enhanced data set is then used to train the detector branch alone; in the second stage, the detector branch obtained by the first-stage training is used to label the real scene data set, obtaining a feature-labeled data set in the real scene; in the third stage, the weight parameters of the detector branch obtained by the first-stage training are completely emptied, and the detector branch is retrained alone using the feature-labeled data set obtained in the second stage; in the fourth stage, the real scene data set is re-labeled using the detector branch obtained in the third stage to obtain a secondary labeled data set; and in the fifth stage, the weights of the detector branch and the descriptor branch are emptied, and the secondary labeled data set is used to jointly train the detector branch and the descriptor branch, finally obtaining a stable processing system.
The whole processing system adopts a twin architecture design and is structurally divided into a front-end backbone network and rear-end detector and descriptor branches. An original image is input into the backbone (Backbone) part, which performs image convolution processing on the original input image and outputs a deep feature map; this deep feature map serves as shared information delivered to the detector branch and the descriptor branch for their different tasks. Meanwhile, the backbone network is externally connected to a spatial processing module: the shallow feature map of the backbone network is separated out and passed to the spatial processing module, which plays the role of a Spatial Transformer; the spatial information obtained after its processing is encoded into the feature information to obtain a spatial-transform feature map. The detector branch and the descriptor branch are twin modules used for the feature position detection and feature description tasks respectively. In the twin framework, the deep feature map output by the backbone network is input into the detector branch and the descriptor branch respectively and concatenated with the spatial-transform feature map; under the guidance of the loss functions, the weight parameters of the processing system are iteratively updated to obtain more accurate feature positions and descriptor information. At the output end of the processing system, the detector branch outputs a 65-channel corner probability map and the descriptor branch outputs a 256-channel descriptor map.
In order to prevent subjective interference caused by manual labeling of data, the method completes training in a self-supervised manner, and the training procedure is divided into five stages. In the first stage, a program automatically synthesizes a data set containing basic geometric patterns (polygons, lines, stars and so on), as shown in figs. 4(A)-4(B); data enhancement operations such as contrast adjustment, noise addition, motion blur and brightness adjustment are performed on this data set, which is then used to preliminarily train the network's detector branch alone. In the second stage, the detector branch obtained by the preliminary training in the first stage is used to label the real scene data set (fig. 5), obtaining a feature-labeled data set in the real scene. In the third stage, all weight parameters obtained by the first-stage training are emptied, and the real scene data set obtained in the second stage is used to preliminarily train the detector branch alone again. In the fourth stage, the detector obtained in the third stage is used to re-label the real scene data set; this secondary labeling further refines the quality of the data set and provides the basis for the final stage of training. In the fifth stage, the weights are emptied again, and the high-quality data set labeled in the fourth stage is used to jointly train the whole network structure (detector + descriptor), finally obtaining a complete and reliable feature processing network.
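The first-stage data enhancement operations (contrast adjustment, brightness adjustment, noise addition, motion blur) can be sketched as follows; all parameter ranges and the blur kernel size are assumptions chosen for illustration:

```python
import numpy as np

def augment(img, rng):
    """Photometric augmentations on a float image in [0, 1].

    Applies contrast and brightness adjustment, additive sensor noise,
    and a crude horizontal motion blur; all ranges are assumed values.
    """
    out = img.copy()
    out = 0.5 + rng.uniform(0.5, 1.5) * (out - 0.5)        # contrast
    out = out + rng.uniform(-0.2, 0.2)                      # brightness
    out = out + rng.normal(0.0, 0.02, size=out.shape)       # sensor noise
    kernel = np.ones(3) / 3                                 # motion-blur taps
    out = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="same"), 1, out)
    return np.clip(out, 0.0, 1.0)

img = np.full((8, 8), 0.5)          # stand-in synthetic-shape image
aug = augment(img, np.random.default_rng(0))
```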
Aiming at the problems of unstable feature detection and poor repeatability that visual feature extraction and description suffer from under weak texture in visual SLAM (VSLAM) systems, the invention provides a twin feature processing network based on deep learning technology. The backbone is a multilayer convolutional neural network that extracts deep features of the image; a spatial transformer module is externally connected to the middle of the backbone to explicitly encode spatial information, enhancing the spatial stability and sensitivity of the feature information, and the deep feature map of the image and the spatial-transform feature map are concatenated in the rear-end branches to provide rich data for the network output layers. In the feature detector and descriptor branches, the feature map is divided using an 8 × 8 neighborhood as the basic unit, and the concatenated feature map is compressed to 65 channels and 256 channels respectively; the detector branch adopts a probability scoring strategy to determine the feature position within each 8 × 8 neighborhood (the 65th channel value represents the probability that the neighborhood contains no feature), and the descriptor branch adopts 256-dimensional descriptors to mark feature information. To overcome subjective errors caused by manual feature labeling, the data labels are constructed in a self-supervised manner, and the training process is finely divided into five stages to improve data quality and enhance network precision, effectively resolving the data dilemma caused by scarce samples. With the visual feature processing system provided by the invention, multiple computer vision tasks such as visual localization, scene reconstruction and image stitching can proceed continuously and stably in a weak texture environment, alleviating the original defects such as feature loss and algorithm collapse.
While realizing these functional enhancements, the invention preserves the real-time performance of the system to the greatest extent: at the detector level, the 8 × 8 neighborhood setting effectively controls the number of detected features, and at the descriptor branch, to further improve the specificity of the feature description without harming the real-time performance of the algorithm, descriptor interpolation is computed using the center pixel coordinates of the 8 × 8 neighborhood as the position reference.
The invention also provides a visual feature processing method based on the neural network in the weak texture environment, wherein the method is based on the processing system, and the processing method comprises the following steps:
step S1: acquiring an original image;
step S2: inputting the raw image into the processing system;
step S3: the processing system performs feature detection and description on the original image to obtain the corner points of the original image and their corresponding descriptors;
step S4: based on the corner points and descriptors of the original image, image stitching, visual localization and scene recognition can be completed in a weak texture environment.
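Steps S1-S4 can be sketched as a small driver; the StubProcessingSystem class and its detect_and_describe method are illustrative assumptions standing in for the trained processing system:

```python
import numpy as np

class StubProcessingSystem:
    """Stand-in for the trained twin network (an assumption for illustration)."""

    def detect_and_describe(self, image):
        # Stand-in behaviour: report the brightest pixel as the sole corner
        # and attach a fixed unit 256-d descriptor to it.
        y, x = np.unravel_index(np.argmax(image), image.shape)
        desc = np.zeros(256)
        desc[0] = 1.0
        return [(int(x), int(y))], [desc]

def run_pipeline(image, system):
    # S2 + S3: feed the image to the system, get corners and descriptors
    corners, descriptors = system.detect_and_describe(image)
    # S4: corners/descriptors would feed stitching, localisation, recognition
    return corners, descriptors

image = np.zeros((16, 16))          # S1: acquire an original image
image[5, 7] = 1.0
corners, descs = run_pipeline(image, StubProcessingSystem())
```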
The above embodiments merely illustrate the design principle of the present invention; the shapes and names of the components in the description may differ without limitation. A person skilled in the art may modify or substitute the technical solutions described in the foregoing embodiments; such modifications and substitutions do not depart from the spirit and scope of the present invention.
Claims (7)
1. A neural network-based visual feature processing system in a weak texture environment, the processing system comprising:
a backbone network, a detector branch and a descriptor branch, wherein the detector branch and the descriptor branch are symmetrical sub-branches of a twin network;
the backbone network is used for receiving an input original image, performing convolution processing on the original image and outputting a deep feature map of the original image; the backbone network comprises a plurality of cascaded convolutional layers, wherein a shallow feature map obtained after the shallow convolutions of the backbone network is simultaneously input into a first spatial module and a second spatial module; the first spatial module and the second spatial module are respectively used for spatial-invariance reduction;
the detector branch comprises a plurality of concatenated convolutional layers, the output of the first spatial module is fused with the output of the first convolutional layer in the detector branch, the detector branch outputs a corner probability map, and the corner probability map is used for representing the probability that each point in the original image is a corner;
the descriptor branch comprises a plurality of concatenated convolutional layers, the output of the second spatial module is fused with the output of the first convolutional layer in the descriptor branch, and the descriptor branch outputs a descriptor graph which is used for representing the descriptor form of each point in the original image.
2. The system of claim 1, wherein the detector branch is configured to receive a deep feature map of the original image output from the backbone network, the detector branch comprising a plurality of concatenated convolutional layers, wherein an output of the first spatial module is fused with an output of a first convolutional layer in the detector branch, the detector branch outputting a corner probability map, the corner probability map being configured to characterize a probability that each point in the original image is a corner;
the descriptor branch is used for receiving a deep feature map of the original image output by the main network, the descriptor branch comprises a plurality of cascaded convolutional layers, wherein the output of the second spatial module is fused with the output of a first convolutional layer in the descriptor branch, and the descriptor branch outputs a descriptor graph which is used for representing descriptor morphology of each point in the original image.
3. The system of any of claims 1-2, wherein the detector branch uses a cross-entropy loss function during training; the original image is partitioned into a basic-cell grid by taking an 8 × 8 neighborhood as a basic cell, and assuming a total of H_C × W_C basic cells within the grid, each basic cell is denoted x_hw and the ground-truth data label set of the real scene data set is denoted Y; the specific loss function of the detector branch is:
wherein H_C is the total number of rows of the basic-cell grid, W_C is the total number of columns of the basic-cell grid, h is the row index of the basic-cell grid, w is the column index of the basic-cell grid, y is the pixel position of the corner point within a basic cell, l_p is the negative logarithm of the normalized network prediction at the corner pixel position, x_hwy is the network prediction at the corner pixel position within a basic cell, x_hwk is the network prediction at an arbitrary pixel position within a basic cell, and k is the channel index.
4. The system of any of claims 1-2, wherein the descriptor branch employs a hinge-loss function during training, in the following form:
The description map corresponding to the original image: D; the homography transformation: H; the description map corresponding to the deformed image, obtained after homography transformation of the original image: D′.
The descriptor corresponding to the original image: d_hw; the descriptor corresponding to the deformed image: d′_h′w′.
The coordinates of the 8 × 8 neighborhood center pixel in the original image: p_hw.
The coordinates of the 8 × 8 neighborhood center pixel in the deformed image: p′_h′w′.
Correspondence judgment:
l_d(d_hw, d′_h′w′, s) = λ_d · s · max(0, m_p − d_hw^T d′_h′w′) + (1 − s) · max(0, d_hw^T d′_h′w′ − m_n)
wherein l_d is the loss function of the descriptor, Hp_hw is the coordinate of the 8 × 8 neighborhood center pixel in the original image after homography transformation, λ_d is a weight parameter, s is the correspondence judgment parameter, m_p is the positive margin parameter, and d_hw^T is the transpose of d_hw; h is the row index of the basic-cell grid corresponding to the original image, w is the column index of the basic-cell grid corresponding to the original image, h′ is the row index of the basic-cell grid corresponding to the deformed image, w′ is the column index of the basic-cell grid corresponding to the deformed image, and m_n is the negative margin parameter.
5. The system of any of claims 1-2, wherein the first spatial module and the second spatial module each comprise a plurality of convolutional networks, a grid generator, a sampling network and a sampler; the spatial module receives a shallow feature map from the backbone network as input, a six-degree-of-freedom affine transformation matrix is obtained through the convolution operations of the plurality of convolutional networks and input into the grid generator, the grid generator performs grid generation to obtain a sampling grid, and the sampler performs pixel sampling on the shallow feature map of the backbone network according to the sampling grid to obtain a spatial-transform feature map.
6. The system of any of claims 1-2, wherein the training of the processing system is a five-stage training; in the first stage, a data enhancement operation is performed on a training sample data set, and the data set is then used to train the detector branch alone; in the second stage, the detector branch obtained by the first-stage training is used to label the real scene data set, obtaining a feature-labeled data set in the real scene; in the third stage, the weight parameters of the detector branch obtained by the first-stage training are completely emptied, and the detector branch is retrained alone using the feature-labeled data set obtained in the second stage; in the fourth stage, the real scene data set is re-labeled using the detector branch obtained in the third stage to obtain a secondary labeled data set; and in the fifth stage, the weights of the detector branch and the descriptor branch are emptied, and the secondary labeled data set is used to jointly train the detector branch and the descriptor branch.
7. A neural network-based visual feature processing method in a weak texture environment, which is performed by the processing system according to any one of claims 1-6, and comprises the following steps:
step S1: acquiring an original image;
step S2: inputting the raw image into the processing system;
step S3: the processing system performs feature detection and description on the original image to obtain the corner points of the original image and their corresponding descriptors;
step S4: based on the corner points and descriptors of the original image, image stitching, visual localization and scene recognition can be completed in a weak texture environment.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210663043.2A CN114937153B (en) | 2022-06-07 | 2022-06-07 | Visual characteristic processing system and method based on neural network in weak texture environment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114937153A true CN114937153A (en) | 2022-08-23 |
CN114937153B CN114937153B (en) | 2023-06-30 |
Family
ID=82867108
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210663043.2A Active CN114937153B (en) | 2022-06-07 | 2022-06-07 | Visual characteristic processing system and method based on neural network in weak texture environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114937153B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112861988A (en) * | 2021-03-04 | 2021-05-28 | 西南科技大学 | Feature matching method based on attention-seeking neural network |
CN113066129A (en) * | 2021-04-12 | 2021-07-02 | 北京理工大学 | Visual positioning and mapping system based on target detection in dynamic environment |
CN113610905A (en) * | 2021-08-02 | 2021-11-05 | 北京航空航天大学 | Deep learning remote sensing image registration method based on subimage matching and application |
WO2022000426A1 (en) * | 2020-06-30 | 2022-01-06 | 中国科学院自动化研究所 | Method and system for segmenting moving target on basis of twin deep neural network |
CN113988269A (en) * | 2021-11-05 | 2022-01-28 | 南通大学 | Loop detection and optimization method based on improved twin network |
Non-Patent Citations (1)
Title |
---|
黎式南 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117710467A (en) * | 2024-02-06 | 2024-03-15 | 天津云圣智能科技有限责任公司 | Unmanned plane positioning method, unmanned plane positioning equipment and aircraft |
CN117710467B (en) * | 2024-02-06 | 2024-05-28 | 天津云圣智能科技有限责任公司 | Unmanned plane positioning method, unmanned plane positioning equipment and aircraft |
Also Published As
Publication number | Publication date |
---|---|
CN114937153B (en) | 2023-06-30 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||