CN114937153A - Neural network-based visual feature processing system and method under weak texture environment - Google Patents

Neural network-based visual feature processing system and method under weak texture environment

Info

Publication number
CN114937153A
CN114937153A (application CN202210663043.2A)
Authority
CN
China
Prior art keywords
descriptor
branch
original image
detector
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210663043.2A
Other languages
Chinese (zh)
Other versions
CN114937153B (en)
Inventor
方浩
胡家瑞
王奥博
陈杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202210663043.2A priority Critical patent/CN114937153B/en
Publication of CN114937153A publication Critical patent/CN114937153A/en
Application granted granted Critical
Publication of CN114937153B publication Critical patent/CN114937153B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)
  • Image Generation (AREA)

Abstract

The invention discloses a neural-network-based visual feature processing system and method for weak texture environments. The processing system comprises a backbone network, a detector branch and a descriptor branch, the detector branch and the descriptor branch being symmetrical sub-branches of a twin network. The backbone network convolves an original image and outputs a deep feature map of the original image. The output of a first spatial module is fused with the output of the first convolutional layer in the detector branch, and the detector branch outputs a corner probability map representing the probability that each point in the original image is a corner. The output of a second spatial module is fused with the output of the first convolutional layer in the descriptor branch, and the descriptor branch outputs a descriptor map representing the descriptor of each point in the original image.

Description

Neural network-based visual feature processing system and method under weak texture environment
Technical Field
The invention relates to the field of computer vision, in particular to a neural network-based visual feature processing system and method in a weak texture environment.
Background
In recent years artificial intelligence has developed vigorously and a global pattern of automation has gradually taken shape. Computer vision, as one of the core perception technologies, has created considerable social, economic and academic value, and the depth and breadth of its application keep growing: industries such as security, medicine, agriculture and forestry, and manufacturing are gradually entering the era of visual intelligence, and computer vision remains an indispensable pioneer technology of intelligent innovation. Feature information plays a vital role in visual enabling and is the key cue by which a computing system understands and recognizes images. Researchers have proposed rich, graphics-based feature design schemes, and feature information with discriminability and repeatability provides good operating elements for visual tasks such as image retrieval, image stitching, VSLAM and three-dimensional reconstruction; in particular, VSLAM endows agents such as unmanned aerial vehicles and unmanned vehicles with self-localization and environment-perception capabilities and is an important technical driver of intelligent unmanned systems. However, traditional geometry-based image features depend excessively on image quality and are inherently sensitive to changes in the imaging environment; when facing a common severe weak-texture scene, as shown in fig. 1, feature quality degrades, the feature algorithm fails, and tasks such as VSLAM collapse. Feature processing technology still has significant shortcomings in resisting environmental interference, coping with device noise and adapting to motion changes, and scientific innovation and product incubation place increasingly urgent demands on robust and accurate feature extraction and description algorithms. For visual localization and mapping in weak texture environments, the existing solutions are as follows:
scheme 1: yi K M, LIFT: Learned Invariant Feature Transform [ J ]. The scheme utilizes a motion structure recovery method to construct a supervision signal to make up for data loss, and realizes interactive connection of three subtask networks (a detector, a direction estimator and a descriptor) and End-to-End synchronous learning under a unified framework. However, a calculation sharing relation cannot be formed among sub-networks in the LIFT model, so that the LIFT features are difficult to meet the real-time application requirements.
Scheme 2: detone D et al, SuperPoint: Self-Supervised Interest Point Detection and Description [ J ]. The scheme adopts a twin neural network design, basically realizes calculation sharing between a detection network and a description network, has excellent performance in the aspect of real-time performance, adopts a self-labeling method to obtain a training sample in the SuperPoint scheme, utilizes a labeling device and homography transformation to complete false-true value labeling on an original image, and benefits from a homography transformation mechanism, the SuperPoint network can output more dense and more repeatable image characteristics. However, this work does not work well with implicit methods to model spatial characteristics.
Scheme 3: dusmanu M et al, A train able cnn for joint description and detection of local features [ C ]. The scheme provides a concept of synchronous detection and description, breaks through the traditional mode of 'detection before description' in a time dimension, the network output of the scheme simultaneously comprises characteristic position scores and descriptor information, the D2-Net realizes the complete integration of the detector and the descriptor in the true sense, and the excellent effect is achieved in the network efficiency level. However, D2-Net performs poorly in feature accuracy.
Disclosure of Invention
In view of the above, the present invention provides a neural-network-based visual feature processing system and method for weak texture environments, which address the technical problem of reducing the interference of weak texture scenes with the feature extraction and description process.
In order to solve the above technical problems, the present invention is realized as follows.
A neural network-based visual feature processing system in a weak texture environment, comprising:
the device comprises a trunk network, a detector branch and a descriptor branch, wherein the detector branch and the descriptor branch are symmetrical subbranches of a twin network;
the main network is used for receiving an input original image, performing convolution processing on the original image and outputting a deep characteristic map of the original image; the trunk network comprises a plurality of cascaded convolution layers, wherein a shallow characteristic diagram obtained after shallow convolution of the trunk network is simultaneously input into the first space module and the second space module; the first space module and the second space module are respectively used for space invariance reduction;
the detector branch comprises a plurality of concatenated convolutional layers, the output of the first spatial module is fused with the output of the first convolutional layer in the detector branch, the detector branch outputs a corner probability map, and the corner probability map is used for representing the probability that each point in the original image is a corner;
the descriptor branch comprises a plurality of concatenated convolutional layers, the output of the second spatial module is fused with the output of the first convolutional layer in the descriptor branch, and the descriptor branch outputs a descriptor graph which is used for representing the descriptor morphology of each point in the original image.
Preferably, the detector branch is configured to receive a deep feature map of the original image output by the backbone network, the detector branch includes a plurality of concatenated convolutional layers, wherein an output of the first spatial module is merged with an output of a first convolutional layer in the detector branch, the detector branch outputs a corner probability map, and the corner probability map is used for characterizing a probability that each point in the original image is a corner;
the descriptor branch is used for receiving a deep feature map of the original image output by the main network, the descriptor branch comprises a plurality of cascaded convolutional layers, wherein the output of the second spatial module is fused with the output of a first convolutional layer in the descriptor branch, and the descriptor branch outputs a descriptor graph which is used for representing descriptor morphology of each point in the original image.
Preferably, the detector branch adopts an information-quantity loss function during training. The original image is divided into a grid of basic units, each basic unit being an 8 × 8 neighborhood, so that the grid contains H_C × W_C basic units in total; each basic unit is denoted x_hw, and the ground-truth label set of the real scene data set is denoted Y. The specific loss function of the detector branch is:
L_p(X, Y) = (1 / (H_C · W_C)) Σ_{h=1..H_C} Σ_{w=1..W_C} l_p(x_hw; y_hw)
l_p(x_hw; y) = −log( exp(x_hwy) / Σ_{k=1..65} exp(x_hwk) )
wherein H_C is the total number of rows of the basic unit grid, W_C is the total number of columns of the basic unit grid, h is the row index of the basic unit grid, w is the column index of the basic unit grid, y is the pixel position of the corner point within one basic unit, l_p is the negative logarithm of the normalized network prediction at the corner pixel position, x_hwy is the network prediction at the corner pixel position in one basic unit, x_hwk is the network prediction at an arbitrary pixel position in one basic unit, and k is the channel index.
Preferably, the descriptor branch adopts a hinge loss function during training, in the following form:
D is the descriptor map corresponding to the original image, H is the homography transformation, and D' is the descriptor map corresponding to the deformed image obtained after applying the homography transformation to the original image;
d_hw is a descriptor corresponding to the original image and d'_h'w' is a descriptor corresponding to the deformed image;
p_hw is the coordinate of an 8 × 8 neighborhood center pixel in the original image;
p'_h'w' is the coordinate of an 8 × 8 neighborhood center pixel in the deformed image;
the correspondence judgment is:
s = 1 if ||Hp_hw − p'_h'w'|| ≤ 8, and s = 0 otherwise
L_d(D, D') = (1 / (H_C · W_C)²) Σ_{h,w} Σ_{h',w'} l_d(d_hw, d'_h'w', s)
l_d(d_hw, d'_h'w', s) = λ_d · s · max(0, m_p − d_hw^T d'_h'w') + (1 − s) · max(0, d_hw^T d'_h'w' − m_n)
wherein L_d is the loss function of the descriptor, Hp_hw is the coordinate of the 8 × 8 neighborhood center pixel of the original image after the homography transformation, λ_d is a weight parameter, s is the correspondence judgment parameter, m_p is the positive margin parameter, and d_hw^T is the transpose of d_hw; h is the row index of the basic unit grid corresponding to the original image, w is the column index of the basic unit grid corresponding to the original image, h' is the row index of the basic unit grid corresponding to the deformed image, w' is the column index of the basic unit grid corresponding to the deformed image, and m_n is the negative margin parameter.
Further, the first space module and the second space module each include a plurality of convolutional networks, a grid generator, a sampling network, and a sampler; the space module receives a shallow feature map in the main network as input, a six-degree-of-freedom affine transformation matrix is obtained through convolution operation of the plurality of convolution networks, the obtained six-degree-of-freedom affine transformation matrix is input into a grid generator, the grid generator performs grid generation to obtain a sampling grid, and the sampler performs pixel sampling on the shallow feature map in the main network according to the sampling grid to obtain a space conversion feature map.
Further, the training of the processing system is five-stage training, in the first stage, data enhancement operation is carried out on a training sample data set, and then the training sample data set is used for training the detector branches independently; in the second stage, the detector branches obtained by the training in the first stage are utilized to label the real scene data set to obtain a characteristic labeling data set in the real scene; in the third stage, the weight parameters of the detector branches obtained by training in the first stage are completely emptied, and the detector branches are independently retrained by using the feature labeling data set obtained in the second stage; a fourth stage, re-labeling the real scene data set by using the detector branch obtained in the third stage to obtain a secondary labeled data set; and in the fifth stage, the weights of the detector branch and the descriptor branch are cleared, and the secondary labeling data set is utilized to carry out joint training on the detector branch and the descriptor branch.
The invention provides a neural network-based visual feature processing method in a weak texture environment, which comprises the following steps:
step S1: acquiring an original image;
step S2: inputting the raw image into the processing system;
step S3: the processing system performs feature detection and description on an original image to obtain an angular point and a corresponding descriptor of the original image;
step S4: based on the angular points and the descriptors of the original image, image splicing, visual positioning and scene recognition can be completed in a weak texture environment.
Advantageous effects:
the invention gives full play to the advantages of the deep learning method, guides the network to concern the scene area with rich texture information in a data driving mode, and enhances the overall spatial stability and sensitivity of the network by adding the spatial processing module in a targeted manner. In the invention, the space module is connected into the twin part in a layer jump connection mode, and the space adaptive capacity of the network model is expanded on the premise of ensuring the authenticity of deep features of the image as far as possible.
The method has the following technical effects:
(1) the invention provides a visual feature processing system based on a neural network, which is used for reducing the adverse effect of weak texture scenes on the visual feature processing work, breaking through the constraint of geometric rules on the traditional feature algorithm by adopting a data driving method, and further improving the utilization rate of image information while ensuring the real-time property.
(2) The invention introduces a space converter module, and carries out cascade superposition on the converted space conversion characteristic diagram and the original characteristic diagram, and the processing method explicitly models the space characteristic of the image, and has more excellent performance compared with the implicit modeling method in the prior work.
(3) The invention completes training with a self-supervised labelling method, which solves the problems of subjective error and sample scarcity in manual labelling, further improves data utilization, fully exploits the potential of the network structure, and minimizes the damage that scene limitations cause to the feature network; this is of great significance for enhancing the practical application value of deep learning in feature extraction and description.
(4) According to the method, strong constraints brought by geometric rules in the feature extraction process are eliminated through a twin neural network architecture and a self-supervision annotation training strategy, so that the network has excellent robustness, flexibility and scene adaptability, and therefore external environment interference and algorithm complexity are reduced.
(5) The processing system is a twin framework, adopts a feature processing algorithm, establishes a standard system integrating feature extraction and description, and performs explicit modeling on a space module by using a structure shown in figure 3, thereby ensuring the specificity and the space quality of the extracted scene features.
Drawings
FIG. 1 is a schematic diagram of a weak texture scene;
FIG. 2 is a schematic diagram of a neural network-based visual feature processing system architecture in a weak texture environment according to the present invention;
FIG. 3 is a schematic diagram of a space module architecture according to the present invention;
FIGS. 4(A)-4(B) are schematic diagrams of the synthetic data set provided by the present invention;
FIG. 5 is a schematic diagram of a real scene data set used in the present invention;
FIGS. 6(A)-6(B) are schematic diagrams of the output results of the detector provided by the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and examples.
As shown in fig. 2-3, the present invention provides a neural network-based visual feature processing system in a weak texture environment, the processing system comprising:
the device comprises a trunk network, a detector branch and a descriptor branch, wherein the detector branch and the descriptor branch are symmetrical subbranches of a twin network;
the main network is used for receiving an input original image, performing convolution processing on the original image and outputting a deep characteristic map of the original image; the backbone network comprises a plurality of cascaded convolution layers, wherein a shallow feature map obtained after shallow convolution of the backbone network is simultaneously input into the first space module and the second space module; the first space module and the second space module are respectively used for space invariance reduction;
the detector branch comprises a plurality of concatenated convolutional layers, the output of the first spatial module is fused with the output of the first convolutional layer in the detector branch, the detector branch outputs a corner probability map, and the corner probability map is used for representing the probability that each point in the original image is a corner;
the descriptor branch comprises a plurality of concatenated convolutional layers, the output of the second spatial module is fused with the output of the first convolutional layer in the descriptor branch, and the descriptor branch outputs a descriptor graph which is used for representing the descriptor form of each point in the original image.
The shallow convolution is that the image is processed by only partial convolution layer and does not reach the deep level.
Further, the detector branch is configured to receive a deep feature map of the original image output by the backbone network, the detector branch includes a plurality of concatenated convolutional layers, wherein an output of the first spatial module is fused with an output of a first convolutional layer in the detector branch, the detector branch outputs a corner probability map, and the corner probability map is used for characterizing a probability that each point in the original image is a corner;
the descriptor branch is used for receiving a deep feature map of the original image output by the main network, the descriptor branch comprises a plurality of cascaded convolutional layers, wherein the output of the second spatial module is fused with the output of a first convolutional layer in the descriptor branch, and the descriptor branch outputs a descriptor graph which is used for representing descriptor morphology of each point in the original image.
Image features are the basic operating units of many computer vision tasks and the key information with which a computing system interprets image content. By virtue of the strong ability of convolutional neural networks to extract deep image features over a wide receptive field, a feature processing algorithm based on deep learning can escape the constraints of geometric rules and weaken external environmental interference. Therefore, after a scene image is read in, the original image is first convolutionally encoded by the multiple convolutional layers of the backbone network to complete deep feature extraction; at the same time, the shallow feature map of the backbone network is input into the spatial modules, i.e. the spatial transformers, for spatial information encoding, yielding the spatial conversion feature map out_Spatial-Transformer. The main purpose of deep feature extraction and spatial encoding is to provide a data basis for the subsequent feature detection and description tasks.
Deep layer characteristic extraction:
out_11 = ReLU(conv_11(raw_image))
out_12 = Maxpool(ReLU(conv_12(out_11)))
out_21 = ReLU(conv_21(out_12))
out_22 = Maxpool(ReLU(conv_22(out_21)))
out_31 = ReLU(conv_31(out_22))
out_32 = Maxpool(ReLU(conv_32(out_31)))
out_41 = ReLU(conv_41(out_32))
out_42 = ReLU(conv_42(out_41))
the structure of the backbone network is shown in table 1 below.
TABLE 1
[Table 1: backbone network structure; rendered as an image in the original publication and not reproduced here]
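For illustration, a minimal PyTorch sketch of a backbone of this form is given below. The channel widths, the grayscale input and the choice of out_12 as the shallow feature map fed to the spatial modules are assumptions made for the sketch, since Table 1 is only available as an image in the original publication; the layer names follow the equations above.

    import torch
    import torch.nn as nn

    class Backbone(nn.Module):
        """Cascaded convolutional backbone: four stages, three max-pool downsamplings."""
        def __init__(self):
            super().__init__()
            c1, c2, c3, c4 = 64, 64, 128, 128                # assumed channel widths
            self.conv_11 = nn.Conv2d(1,  c1, 3, padding=1)   # assumed single-channel (grayscale) input
            self.conv_12 = nn.Conv2d(c1, c1, 3, padding=1)
            self.conv_21 = nn.Conv2d(c1, c2, 3, padding=1)
            self.conv_22 = nn.Conv2d(c2, c2, 3, padding=1)
            self.conv_31 = nn.Conv2d(c2, c3, 3, padding=1)
            self.conv_32 = nn.Conv2d(c3, c3, 3, padding=1)
            self.conv_41 = nn.Conv2d(c3, c4, 3, padding=1)
            self.conv_42 = nn.Conv2d(c4, c4, 3, padding=1)
            self.pool = nn.MaxPool2d(2, 2)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, raw_image):
            out_11 = self.relu(self.conv_11(raw_image))
            out_12 = self.pool(self.relu(self.conv_12(out_11)))   # assumed shallow tap for the spatial modules
            out_21 = self.relu(self.conv_21(out_12))
            out_22 = self.pool(self.relu(self.conv_22(out_21)))
            out_31 = self.relu(self.conv_31(out_22))
            out_32 = self.pool(self.relu(self.conv_32(out_31)))
            out_41 = self.relu(self.conv_41(out_32))
            out_42 = self.relu(self.conv_42(out_41))              # deep feature map at 1/8 resolution
            return out_12, out_42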
The detector branch is composed of a plurality of convolutional layers. The deep feature map is received at the first convolutional layer of the detector branch, the spatial conversion feature map is received at the output of that first convolutional layer, and the first convolutional layer output is fused with the spatial conversion feature map by concatenating the two along the channel dimension. The spatial conversion feature map is the feature map produced by the spatial module after spatial transformation. The first convolutional layer of the detector branch is the first convolutional layer that processes the input of the detector branch.
In the invention, feature detection takes the 8 × 8 neighborhood as its basic unit, and non-maximum suppression is applied within each 8 × 8 neighborhood to ensure the uniqueness of the feature information. The detector branch compresses the cascaded feature map obtained after fusion to 65 channels by convolution and then normalizes the data across the 65 channels; the detector branch outputs a corner probability map representing the probability that each point in the original image is a corner. Specifically, of the 65 channels, the values in 64 channels represent the probabilities of a feature point at the 64 pixel positions of an 8 × 8 neighborhood, and the value in the remaining channel represents the probability that no feature exists in that 8 × 8 neighborhood. The output results of the detector branch are shown in FIGS. 6(A)-6(B).
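As an illustration of how the 65-channel detector output can be decoded into the per-pixel corner probability map described above, the following sketch assumes the first 64 channels enumerate the 64 pixel positions of each 8 × 8 neighborhood row by row and the 65th channel is the "no feature" channel:

    import torch
    import torch.nn.functional as F

    def decode_corner_probabilities(detector_logits):
        """detector_logits: (B, 65, H/8, W/8) raw detector-branch output.
        Returns a (B, H, W) map of the probability that each pixel is a corner."""
        probs = F.softmax(detector_logits, dim=1)   # normalize over the 65 channels
        probs = probs[:, :-1, :, :]                 # drop the 'no feature in this cell' channel
        heatmap = F.pixel_shuffle(probs, 8)         # scatter the 64 values back into each 8x8 cell
        return heatmap.squeeze(1)

    # Corner points are then obtained by thresholding this map and applying
    # non-maximum suppression within each 8x8 neighborhood, as described above.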
The detector branch adopts an information-quantity loss function during training. The original image is divided into a grid of basic units, each basic unit being an 8 × 8 neighborhood, so that the grid contains H_C × W_C basic units in total; each basic unit is denoted x_hw, and the ground-truth label set of the real scene data set is denoted Y. The specific loss function of the detector branch is:
L_p(X, Y) = (1 / (H_C · W_C)) Σ_{h=1..H_C} Σ_{w=1..W_C} l_p(x_hw; y_hw)
l_p(x_hw; y) = −log( exp(x_hwy) / Σ_{k=1..65} exp(x_hwk) )
wherein H_C is the total number of rows of the basic unit grid, W_C is the total number of columns of the basic unit grid, h is the row index of the basic unit grid, w is the column index of the basic unit grid, y is the pixel position of the corner point within one basic unit, l_p is the negative logarithm of the normalized network prediction at the corner pixel position, x_hwy is the network prediction at the corner pixel position in one basic unit, x_hwk is the network prediction at an arbitrary pixel position in one basic unit, and k is the channel index.
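A minimal sketch of this information-quantity loss, assuming the ground-truth label of each basic unit is the index of its corner pixel (0 to 63) or 64 when the unit contains no corner:

    import torch
    import torch.nn.functional as F

    def detector_loss(x, y):
        """x: (B, 65, Hc, Wc) detector-branch predictions over the Hc x Wc grid of basic units.
        y: (B, Hc, Wc) integer labels; y[h, w] is the corner pixel index of unit (h, w), or 64 if none.
        Computes l_p = -log(exp(x_hwy) / sum_k exp(x_hwk)) averaged over all basic units."""
        return F.cross_entropy(x, y.long())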
The output of each convolutional layer of the detector branch is:
out_dect_1 = ReLU(conv_dect_1(out_42))
Cascade superposition:
out_dect_2 = Concat(out_dect_1, out_Spatial-Transformer)
out_dect_3 = ReLU(conv_dect_2(out_dect_2))
out_dect_final = Softmax(out_dect_3)
the detector finger network structure is shown in table 2 below.
TABLE 2
Figure BDA0003681107870000111
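The detector branch itself can be sketched as follows; the channel widths are assumptions, since Table 2 is only available as an image in the original publication, and the sketch assumes the spatial conversion feature map has already been brought to the same resolution as the deep feature map:

    import torch
    import torch.nn as nn

    class DetectorBranch(nn.Module):
        def __init__(self, deep_ch=128, st_ch=64, mid_ch=256):   # assumed channel widths
            super().__init__()
            self.conv_dect_1 = nn.Conv2d(deep_ch, mid_ch, 3, padding=1)
            self.conv_dect_2 = nn.Conv2d(mid_ch + st_ch, 65, 1)  # compress the cascade map to 65 channels
            self.relu = nn.ReLU(inplace=True)

        def forward(self, out_42, out_spatial_transformer):
            out_dect_1 = self.relu(self.conv_dect_1(out_42))
            # cascade superposition: concatenation along the channel dimension
            out_dect_2 = torch.cat([out_dect_1, out_spatial_transformer], dim=1)
            out_dect_3 = self.relu(self.conv_dect_2(out_dect_2))
            return torch.softmax(out_dect_3, dim=1)              # 65-channel corner probability map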
The descriptor branch is composed of a plurality of convolutional layers. The deep feature map is received at the first convolutional layer of the descriptor branch, the spatial conversion feature map is received at the output of that first convolutional layer, and the first convolutional layer output is fused with the spatial conversion feature map by concatenating the two along the channel dimension. The first convolutional layer of the descriptor branch is the first convolutional layer that processes the input of the descriptor branch.
For real-time considerations, the descriptor branch also takes the H_C × W_C basic units as its operating elements and performs feature description unit by unit, characterizing each feature point with a 256-dimensional descriptor. To enhance the fineness and consistency of the feature description, pixel-level descriptor interpolation is carried out for the detected feature points using the center point of each 8 × 8 neighborhood as the position reference, further improving the feature description precision.
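One common way to realise the pixel-level descriptor interpolation mentioned here is bilinear sampling of the coarse descriptor map at the detected corner coordinates; the following sketch assumes one 256-dimensional descriptor per 8 × 8 neighborhood and keypoints given in image pixel coordinates:

    import torch
    import torch.nn.functional as F

    def sample_descriptors(desc_map, keypoints_xy, image_size):
        """desc_map: (1, 256, H/8, W/8) coarse descriptor map.
        keypoints_xy: (N, 2) detected corner coordinates (x, y) in pixels.
        image_size: (H, W) of the original image.
        Returns (N, 256) interpolated, unit-norm descriptors."""
        H, W = image_size
        grid = keypoints_xy.clone().float()
        grid[:, 0] = grid[:, 0] / (W - 1) * 2 - 1     # map x to [-1, 1] for grid_sample
        grid[:, 1] = grid[:, 1] / (H - 1) * 2 - 1     # map y to [-1, 1]
        grid = grid.view(1, 1, -1, 2)                 # (1, 1, N, 2)
        desc = F.grid_sample(desc_map, grid, mode='bilinear', align_corners=True)
        desc = desc.reshape(256, -1).t()              # (N, 256)
        return F.normalize(desc, p=2, dim=1)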
As the identification mark of a feature point, the most important attribute of a descriptor is its individual specificity. Clearly distinguishable feature descriptors are of great significance for feature matching and recognition, and are an important guarantee for accurately completing computer vision tasks such as visual localization, image stitching and scene reconstruction. The descriptor branch therefore adopts a hinge loss function during training, in the following form:
D is the descriptor map corresponding to the original image, H is the homography transformation, and D' is the descriptor map corresponding to the deformed image obtained after applying the homography transformation to the original image;
d_hw is a descriptor corresponding to the original image and d'_h'w' is a descriptor corresponding to the deformed image;
p_hw is the coordinate of an 8 × 8 neighborhood center pixel in the original image;
p'_h'w' is the coordinate of an 8 × 8 neighborhood center pixel in the deformed image;
the correspondence judgment is:
s = 1 if ||Hp_hw − p'_h'w'|| ≤ 8, and s = 0 otherwise
L_d(D, D') = (1 / (H_C · W_C)²) Σ_{h,w} Σ_{h',w'} l_d(d_hw, d'_h'w', s)
l_d(d_hw, d'_h'w', s) = λ_d · s · max(0, m_p − d_hw^T d'_h'w') + (1 − s) · max(0, d_hw^T d'_h'w' − m_n)
wherein λ_d, m_p and m_n are empirical thresholds in the loss function: λ_d is designed to balance the loss terms of positive pairs (point pairs with s = 1) and negative pairs (point pairs with s = 0) so that the network parameters descend in a coordinated direction, while m_p and m_n are set to control the learning process, prevent over-fitting, and ensure that the network parameters converge to an appropriate range. s is the correspondence judgment parameter, m_p is the positive margin parameter, and d_hw^T is the transpose of d_hw; h is the row index of the basic unit grid corresponding to the original image, w is the column index of the basic unit grid corresponding to the original image, h' is the row index of the basic unit grid corresponding to the deformed image, w' is the column index of the basic unit grid corresponding to the deformed image, and m_n is the negative margin parameter.
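A minimal sketch of this hinge loss for one pair of descriptors; the concrete values of λ_d, m_p and m_n are assumptions for illustration, since the text only characterises them as empirical thresholds:

    import torch

    def descriptor_hinge_loss(d, d_prime, s, lambda_d=250.0, m_p=1.0, m_n=0.2):
        """d, d_prime: (256,) descriptors from the original and the deformed image.
        s: 1 if the two basic units correspond under the homography, otherwise 0.
        Computes lambda_d*s*max(0, m_p - d^T d') + (1 - s)*max(0, d^T d' - m_n)."""
        dot = torch.dot(d, d_prime)
        positive = lambda_d * s * torch.clamp(m_p - dot, min=0)
        negative = (1 - s) * torch.clamp(dot - m_n, min=0)
        return positive + negative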
The descriptor branch output is:
out_descriptor_1 = ReLU(conv_descriptor_1(out_42))
Cascade superposition:
out_descriptor_2 = Concat(out_descriptor_1, out_Spatial-Transformer)
out_descriptor_3 = ReLU(conv_descriptor_2(out_descriptor_2))
out_descriptor_final = Normalize(out_descriptor_3)
the descriptor branch network structure is shown in table 3.
TABLE 3
[Table 3: descriptor branch network structure; rendered as an image in the original publication and not reproduced here]
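Mirroring the detector branch, the descriptor branch can be sketched as below; channel widths are assumptions, since Table 3 is only available as an image in the original publication:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DescriptorBranch(nn.Module):
        def __init__(self, deep_ch=128, st_ch=64, mid_ch=256):          # assumed channel widths
            super().__init__()
            self.conv_descriptor_1 = nn.Conv2d(deep_ch, mid_ch, 3, padding=1)
            self.conv_descriptor_2 = nn.Conv2d(mid_ch + st_ch, 256, 1)  # compress to 256 descriptor channels
            self.relu = nn.ReLU(inplace=True)

        def forward(self, out_42, out_spatial_transformer):
            out_descriptor_1 = self.relu(self.conv_descriptor_1(out_42))
            # cascade superposition along the channel dimension (resolutions assumed already aligned)
            out_descriptor_2 = torch.cat([out_descriptor_1, out_spatial_transformer], dim=1)
            out_descriptor_3 = self.relu(self.conv_descriptor_2(out_descriptor_2))
            return F.normalize(out_descriptor_3, p=2, dim=1)             # one unit-norm 256-d descriptor per 8x8 unit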
In the invention, in the detector branch, the cascaded feature map is compressed along the channel dimension and feature positions are then scored, thereby determining the feature position within each 8 × 8 neighborhood; in the descriptor branch, the cascaded feature map is compressed along the channel dimension into 256-dimensional descriptors, and descriptor interpolation is then performed with the center pixel coordinate of each 8 × 8 neighborhood as the position reference, improving the accuracy and specificity of the descriptors.
Further, as shown in fig. 3, the space module includes a plurality of convolution networks, a grid generator, a sampling network and a sampler; the space module receives a shallow layer characteristic diagram in the main network as input, a six-degree-of-freedom affine transformation matrix is obtained through convolution operation of the plurality of convolution networks, the obtained six-degree-of-freedom affine transformation matrix is input into a grid generator, grid generation is carried out by the grid generator, a sampling grid is obtained, and a sampler carries out pixel sampling on the shallow layer characteristic diagram in the main network according to the sampling grid to obtain a space conversion characteristic diagram.
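A compact sketch of such a spatial module, following the standard spatial transformer layout of localisation convolutions, a six-parameter affine matrix, a grid generator and a sampler; the localisation network sizes are assumptions:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpatialModule(nn.Module):
        def __init__(self, in_ch=64):                 # assumed channel width of the shallow feature map
            super().__init__()
            self.loc_conv = nn.Sequential(            # convolution networks estimating the affine parameters
                nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            )
            self.fc_theta = nn.Linear(32 * 4 * 4, 6)  # six-degree-of-freedom affine transformation
            self.fc_theta.weight.data.zero_()         # initialise to the identity transform
            self.fc_theta.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

        def forward(self, shallow_feat):
            theta = self.fc_theta(self.loc_conv(shallow_feat)).view(-1, 2, 3)
            grid = F.affine_grid(theta, shallow_feat.size(), align_corners=False)   # grid generator
            return F.grid_sample(shallow_feat, grid, align_corners=False)           # sampler -> spatial conversion feature map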
Furthermore, the training of the processing system is five-stage training, a synthetic data set containing basic geometric patterns is used as a training sample, in the first stage, data enhancement operation is carried out on the training sample data set, and then the training sample data set is used for training the detector branches independently; in the second stage, the detector branches obtained by the training in the first stage are utilized to label the real scene data set to obtain a characteristic labeling data set in the real scene; in the third stage, the weight parameters of the detector branches obtained by training in the first stage are completely emptied, and the detector branches are independently retrained by using the feature labeling data set obtained in the second stage; a fourth stage, re-labeling the real scene data set by using the detector branch obtained in the third stage to obtain a secondary labeled data set; and a fifth stage, emptying the weights of the detector branch and the descriptor branch, and performing joint training on the detector branch and the descriptor branch by using the secondary labeling data set to finally obtain a stable processing system.
The whole processing system adopts twin architecture design, and is structurally and hierarchically divided into a front-end main network, a rear-end detector branch and a descriptor branch, an original image is input into a main network (Backbone) part, the main network performs image convolution processing on the original input image and outputs a deep characteristic diagram, and the deep characteristic diagram is used as shared information and submitted to the detector branch and the descriptor branch to perform different tasks. Meanwhile, the backbone network is externally connected with a space processing module, the shallow feature map of the backbone network is separated independently and transmitted to the space processing module, the space processing module plays a role of a space Transformer (Spatial Transformer), space information is obtained after the space processing module processes the space information, and then the space information is coded into the feature information to obtain a space conversion feature map. The detector branch and the descriptor branch are twin modules and are respectively used for feature position detection and feature description tasks, in a twin framework, deep feature maps output by a main network are respectively input into the detector branch and the descriptor branch and are cascaded with the space conversion feature maps, under the guidance of a loss function, weight parameters in a processing system are continuously updated and iterated to obtain more accurate feature positions and descriptor information, and the detector branch and the descriptor branch respectively output a corner probability map of a 65 channel and a descriptor map of a 256 channel at the output end of the processing system.
In order to prevent subjective interference caused by manual data labelling, the invention completes training in a self-supervised manner, with five stages in the training procedure. In the first stage, a program automatically synthesizes a synthetic data set containing basic geometric patterns (polygons, lines, stars, etc.), as shown in FIGS. 4(A)-4(B); data enhancement operations such as contrast adjustment, noise addition, motion blur and brightness adjustment are applied to this data set, which is then used to preliminarily train the detector branch alone. In the second stage, the detector branch obtained from the first-stage training is used to label the real scene data set (FIG. 5), yielding a feature-labelled data set in the real scene. In the third stage, all weight parameters obtained in the first-stage training are cleared, and the detector branch is preliminarily trained alone using the real scene data set obtained in the second stage. In the fourth stage, the detector obtained in the third stage re-labels the real scene data set; this secondary labelling further refines the quality of the data set and provides the basis for the final stage of training. In the fifth stage, the weights are cleared again and the high-quality data set labelled in the fourth stage is used to jointly train the whole network structure (detector + descriptor), finally yielding a complete and reliable feature processing network.
Aiming at the problems of unstable feature detection and poor repeatability of visual feature extraction and description in visual simultaneous localization and mapping (VSLAM) under weak textures, the invention provides a twin feature processing network based on deep learning. The backbone network is provided with a multi-layer convolutional neural network to extract deep image features; a spatial transformer module is externally connected to the middle of the backbone to explicitly encode spatial information and enhance the spatial stability and sensitivity of the feature information, and the deep feature map of the image and the spatial conversion feature map are cascaded and superposed in the back-end branches to provide rich data for the network output layer. In the feature detector and descriptor branches, the feature map is divided with the 8 × 8 neighborhood as the basic unit and the cascaded feature map is compressed to 65 and 256 channels, respectively; a probability scoring strategy is adopted in the detector branch to determine the feature position within each 8 × 8 neighborhood (the value of the 65th channel represents the probability that the neighborhood contains no feature), and 256-dimensional descriptors are adopted in the descriptor branch to mark the feature information. To overcome the subjective errors caused by manually labelled features, data labels are constructed by self-supervised labelling, and the training process is finely divided into five stages to improve data quality and enhance network accuracy, effectively resolving the data dilemma caused by scarce samples. With the visual feature processing system provided by the invention, multiple computer vision tasks such as visual localization, scene reconstruction and image stitching can be carried out continuously and stably in weak texture environments, alleviating defects such as feature loss and algorithm collapse. While achieving these functional enhancements, the invention preserves the real-time performance of the system to the greatest extent: the 8 × 8 neighborhood setting effectively controls the number of detected features at the detector level, and in the descriptor branch, descriptor interpolation is computed with the center pixel coordinate of the 8 × 8 neighborhood as the position reference, further improving the specificity of the feature description without damaging the real-time performance of the algorithm.
The invention also provides a visual feature processing method based on the neural network in the weak texture environment, wherein the method is based on the processing system, and the processing method comprises the following steps:
step S1: acquiring an original image;
step S2: inputting the raw image into the processing system;
step S3: the processing system performs feature detection and description on an original image to obtain an angular point and a corresponding descriptor of the original image;
step S4: based on the angular points and the descriptors of the original image, image splicing, visual positioning and scene recognition can be completed in a weak texture environment.
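As a usage illustration of step S4, the corner points and descriptors produced by the processing system can be matched between two images, for example by mutual nearest neighbour search, and the matches then fed to a stitching, localization or recognition pipeline; the matching strategy below is an assumption for illustration, not part of the claimed method:

    import torch

    def mutual_nearest_neighbour_matches(desc_a, desc_b):
        """desc_a: (Na, 256), desc_b: (Nb, 256) unit-norm descriptors of two images.
        Returns (M, 2) index pairs that are each other's nearest neighbour."""
        sim = desc_a @ desc_b.t()                  # cosine similarity matrix
        nn_ab = sim.argmax(dim=1)                  # best match in B for each descriptor of A
        nn_ba = sim.argmax(dim=0)                  # best match in A for each descriptor of B
        idx_a = torch.arange(desc_a.shape[0])
        mutual = nn_ba[nn_ab] == idx_a             # keep mutually consistent pairs only
        return torch.stack([idx_a[mutual], nn_ab[mutual]], dim=1)

    # The matched corner coordinates can then be passed to a homography or pose
    # estimator for image stitching, visual positioning or scene recognition.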
The above embodiments only describe the design principle of the present invention, and the shapes and names of the components in the description may be different without limitation. Therefore, a person skilled in the art of the present invention can modify or substitute the technical solutions described in the foregoing embodiments; such modifications and substitutions do not depart from the spirit and scope of the present invention.

Claims (7)

1. A neural network-based visual feature processing system in a weak texture environment, the processing system comprising:
the device comprises a trunk network, a detector branch and a descriptor branch, wherein the detector branch and the descriptor branch are symmetrical subbranches of a twin network;
the main network is used for receiving an input original image, performing convolution processing on the original image and outputting a deep characteristic map of the original image; the backbone network comprises a plurality of cascaded convolution layers, wherein a shallow feature map obtained after shallow convolution of the backbone network is simultaneously input into the first space module and the second space module; the first space module and the second space module are respectively used for space invariance reduction;
the detector branch comprises a plurality of concatenated convolutional layers, the output of the first spatial module is fused with the output of the first convolutional layer in the detector branch, the detector branch outputs a corner probability map, and the corner probability map is used for representing the probability that each point in the original image is a corner;
the descriptor branch comprises a plurality of concatenated convolutional layers, the output of the second spatial module is fused with the output of the first convolutional layer in the descriptor branch, and the descriptor branch outputs a descriptor graph which is used for representing the descriptor form of each point in the original image.
2. The system of claim 1, wherein the detector branch is configured to receive a deep feature map of the original image output from the backbone network, the detector branch comprising a plurality of concatenated convolutional layers, wherein an output of the first spatial module is fused with an output of a first convolutional layer in the detector branch, the detector branch outputting a corner probability map, the corner probability map being configured to characterize a probability that each point in the original image is a corner;
the descriptor branch is used for receiving a deep feature map of the original image output by the main network, the descriptor branch comprises a plurality of cascaded convolutional layers, wherein the output of the second spatial module is fused with the output of a first convolutional layer in the descriptor branch, and the descriptor branch outputs a descriptor graph which is used for representing descriptor morphology of each point in the original image.
3. The system of any of claims 1-2, wherein the detector branch uses an information-quantity loss function during training; the original image is divided into a grid of basic units, each basic unit being an 8 × 8 neighborhood, so that the grid contains H_C × W_C basic units in total, each basic unit is denoted x_hw, and the ground-truth label set of the real scene data set is denoted Y; the specific loss function of the detector branch is:
L_p(X, Y) = (1 / (H_C · W_C)) Σ_{h=1..H_C} Σ_{w=1..W_C} l_p(x_hw; y_hw)
l_p(x_hw; y) = −log( exp(x_hwy) / Σ_{k=1..65} exp(x_hwk) )
wherein H_C is the total number of rows of the basic unit grid, W_C is the total number of columns of the basic unit grid, h is the row index of the basic unit grid, w is the column index of the basic unit grid, y is the pixel position of the corner point within one basic unit, l_p is the negative logarithm of the normalized network prediction at the corner pixel position, x_hwy is the network prediction at the corner pixel position in one basic unit, x_hwk is the network prediction at an arbitrary pixel position in one basic unit, and k is the channel index.
4. The system as claimed in any one of claims 1 to 2, wherein said descriptor branch employs a hinge loss function during training in the following form:
D is the descriptor map corresponding to the original image, H is the homography transformation, and D' is the descriptor map corresponding to the deformed image obtained after applying the homography transformation to the original image;
d_hw is a descriptor corresponding to the original image and d'_h'w' is a descriptor corresponding to the deformed image;
p_hw is the coordinate of an 8 × 8 neighborhood center pixel in the original image;
p'_h'w' is the coordinate of an 8 × 8 neighborhood center pixel in the deformed image;
the correspondence judgment is:
s = 1 if ||Hp_hw − p'_h'w'|| ≤ 8, and s = 0 otherwise
L_d(D, D') = (1 / (H_C · W_C)²) Σ_{h,w} Σ_{h',w'} l_d(d_hw, d'_h'w', s)
l_d(d_hw, d'_h'w', s) = λ_d · s · max(0, m_p − d_hw^T d'_h'w') + (1 − s) · max(0, d_hw^T d'_h'w' − m_n)
wherein L_d is the loss function of the descriptor, Hp_hw is the coordinate of the 8 × 8 neighborhood center pixel of the original image after the homography transformation, λ_d is a weight parameter, s is the correspondence judgment parameter, m_p is the positive margin parameter, and d_hw^T is the transpose of d_hw; h is the row index of the basic unit grid corresponding to the original image, w is the column index of the basic unit grid corresponding to the original image, h' is the row index of the basic unit grid corresponding to the deformed image, w' is the column index of the basic unit grid corresponding to the deformed image, and m_n is the negative margin parameter.
5. The system of any of claims 1-2, wherein the first spatial module and the second spatial module each comprise a plurality of convolutional networks, a trellis generator, a sampling network, and a sampler; the space module receives a shallow feature map in the main network as input, a six-degree-of-freedom affine transformation matrix is obtained through convolution operation of the plurality of convolution networks, the obtained six-degree-of-freedom affine transformation matrix is input into a grid generator, the grid generator performs grid generation to obtain a sampling grid, and the sampler performs pixel sampling on the shallow feature map in the main network according to the sampling grid to obtain a space conversion feature map.
6. The system of any of claims 1-2, wherein the training of the processing system is a five-stage training, a first stage, in which a data enhancement operation is performed on a training sample data set, followed by training of the detector branches separately using the training sample data set; in the second stage, the detector branches obtained by the training in the first stage are utilized to label the real scene data set to obtain a characteristic labeling data set in the real scene; in the third stage, the weight parameters of the detector branches obtained by training in the first stage are completely emptied, and the detector branches are independently retrained by using the feature labeling data set obtained in the second stage; a fourth stage, re-labeling the real scene data set by using the detector branch obtained in the third stage to obtain a secondary labeled data set; and in the fifth stage, the weights of the detector branches and the descriptor branches are cleared, and the secondary labeling data sets are utilized to carry out joint training on the detector branches and the descriptor branches.
7. A neural network-based visual feature processing method in a weak texture environment, which is performed by the processing system according to any one of claims 1-6, and comprises the following steps:
step S1: acquiring an original image;
step S2: inputting the raw image into the processing system;
step S3: the processing system performs feature detection and description on an original image to obtain an angular point and a corresponding descriptor of the original image;
step S4: based on the angular points and the descriptors of the original image, image splicing, visual positioning and scene recognition can be completed in a weak texture environment.
CN202210663043.2A 2022-06-07 2022-06-07 Visual characteristic processing system and method based on neural network in weak texture environment Active CN114937153B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210663043.2A CN114937153B (en) 2022-06-07 2022-06-07 Visual characteristic processing system and method based on neural network in weak texture environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210663043.2A CN114937153B (en) 2022-06-07 2022-06-07 Visual characteristic processing system and method based on neural network in weak texture environment

Publications (2)

Publication Number Publication Date
CN114937153A true CN114937153A (en) 2022-08-23
CN114937153B CN114937153B (en) 2023-06-30

Family

ID=82867108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210663043.2A Active CN114937153B (en) 2022-06-07 2022-06-07 Visual characteristic processing system and method based on neural network in weak texture environment

Country Status (1)

Country Link
CN (1) CN114937153B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117710467A (en) * 2024-02-06 2024-03-15 天津云圣智能科技有限责任公司 Unmanned plane positioning method, unmanned plane positioning equipment and aircraft

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861988A (en) * 2021-03-04 2021-05-28 西南科技大学 Feature matching method based on attention-seeking neural network
CN113066129A (en) * 2021-04-12 2021-07-02 北京理工大学 Visual positioning and mapping system based on target detection in dynamic environment
CN113610905A (en) * 2021-08-02 2021-11-05 北京航空航天大学 Deep learning remote sensing image registration method based on subimage matching and application
WO2022000426A1 (en) * 2020-06-30 2022-01-06 中国科学院自动化研究所 Method and system for segmenting moving target on basis of twin deep neural network
CN113988269A (en) * 2021-11-05 2022-01-28 南通大学 Loop detection and optimization method based on improved twin network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022000426A1 (en) * 2020-06-30 2022-01-06 中国科学院自动化研究所 Method and system for segmenting moving target on basis of twin deep neural network
CN112861988A (en) * 2021-03-04 2021-05-28 西南科技大学 Feature matching method based on attention-seeking neural network
CN113066129A (en) * 2021-04-12 2021-07-02 北京理工大学 Visual positioning and mapping system based on target detection in dynamic environment
CN113610905A (en) * 2021-08-02 2021-11-05 北京航空航天大学 Deep learning remote sensing image registration method based on subimage matching and application
CN113988269A (en) * 2021-11-05 2022-01-28 南通大学 Loop detection and optimization method based on improved twin network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
黎式南 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117710467A (en) * 2024-02-06 2024-03-15 天津云圣智能科技有限责任公司 Unmanned plane positioning method, unmanned plane positioning equipment and aircraft
CN117710467B (en) * 2024-02-06 2024-05-28 天津云圣智能科技有限责任公司 Unmanned plane positioning method, unmanned plane positioning equipment and aircraft

Also Published As

Publication number Publication date
CN114937153B (en) 2023-06-30

Similar Documents

Publication Publication Date Title
CN109522966B (en) Target detection method based on dense connection convolutional neural network
CN114255238A (en) Three-dimensional point cloud scene segmentation method and system fusing image features
Shen et al. Vehicle detection in aerial images based on lightweight deep convolutional network and generative adversarial network
CN113177560A (en) Universal lightweight deep learning vehicle detection method
CN112818969A (en) Knowledge distillation-based face pose estimation method and system
CN113449691A (en) Human shape recognition system and method based on non-local attention mechanism
CN112101262B (en) Multi-feature fusion sign language recognition method and network model
CN111680739A (en) Multi-task parallel method and system for target detection and semantic segmentation
CN115937552A (en) Image matching method based on fusion of manual features and depth features
CN114821408A (en) Method, device, equipment and medium for detecting parcel position in real time based on rotating target detection
CN116612468A (en) Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism
CN113901928A (en) Target detection method based on dynamic super-resolution, and power transmission line component detection method and system
Sun et al. IRDCLNet: Instance segmentation of ship images based on interference reduction and dynamic contour learning in foggy scenes
CN114937153B (en) Visual characteristic processing system and method based on neural network in weak texture environment
Fu et al. Complementarity-aware Local-global Feature Fusion Network for Building Extraction in Remote Sensing Images
CN114529821A (en) Offshore wind power safety monitoring and early warning method based on machine vision
Xiao et al. Road extraction from point clouds of open-pit mine using LPFE-Net
CN113505808A (en) Detection and identification algorithm for power distribution facility switch based on deep learning
Wang et al. Summary of object detection based on convolutional neural network
CN117152630A (en) Optical remote sensing image change detection method based on deep learning
CN111860361A (en) Green channel cargo scanning image entrainment automatic identifier and identification method
CN114782827B (en) Object capture point acquisition method and device based on image
CN114863103A (en) Unmanned underwater vehicle identification method, equipment and storage medium
CN112733934B (en) Multi-mode feature fusion road scene semantic segmentation method in complex environment
CN114596488A (en) Lightweight remote sensing target detection method based on dense feature fusion network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant