CN115731441A - Target detection and attitude estimation method based on data cross-modal transfer learning


Info

Publication number: CN115731441A
Application number: CN202211512260.8A
Authority: CN (China)
Prior art keywords: domain, target, data, loss, network
Other languages: Chinese (zh)
Inventors: 刘勇, 诸丰彦, 王蒙蒙
Current Assignee: Zhejiang University (ZJU)
Original Assignee: Zhejiang University (ZJU)
Application filed by: Zhejiang University (ZJU)
Priority to: CN202211512260.8A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of computer vision and discloses a target detection and attitude estimation method based on data cross-modal transfer learning. The attitude angle estimation task is converted into joint classification and regression tasks and trained in a multi-task manner, which improves the performance of the network.

Description

Target detection and attitude estimation method based on data cross-modal transfer learning
Technical Field
The invention relates to the technical field of computer vision, in particular to a target detection and attitude estimation method based on data cross-modal transfer learning.
Background
Object detection and attitude angle estimation are common tasks in the field of computer vision. The mainstream target detection methods include one-stage methods such as YOLO and SSD, and two-stage detection algorithms such as Faster-RCNN. The main indicators of a target detection model are detection accuracy and forward inference speed. A one-stage algorithm does not need region proposals; it generates the category probability and position coordinates of the target directly through regression, so the final detection result is obtained in a single pass. Its detection speed is therefore higher, but its accuracy is lower than that of two-stage target detection algorithms. A two-stage algorithm first generates region proposals, i.e. preselected boxes that may contain the object to be detected, through a region generation module, and then classifies the object inside each preselected box with a convolutional neural network. Attitude angle estimation algorithms based on deep learning are often used to estimate hand postures, head poses and the like. Common methods include template-based algorithms, which operate on a template of the target: key points of the template are located and the attitude angles of the target, including pitch, roll and yaw, are solved from these key points. The main disadvantage of this kind of method is that the template has to be defined in advance, and different key points have to be designed for different problems. Another common class of attitude angle estimation algorithms does not require key points, such as the FSA-Net algorithm, which estimates the angles directly based on regression and feature aggregation. Target detection and attitude angle estimation are usually trained and tested in the same data domain with ample training samples. In a real scene, however, real data and data labels may be difficult to obtain, and the model is difficult to train with limited training samples.
For the YOLO method, the input picture is first divided into N × N grid cells; if the center point of a ground-truth detection box falls in a cell, that cell is responsible for predicting the detection box, including its category, center point coordinates, and width and height. The idea of anchor boxes is also introduced: anchor boxes are preset for each cell through clustering, and the offsets from the anchor boxes are predicted when the detection box is actually estimated, which simplifies the problem and makes the network easier to learn. For feature extraction, outputs of different resolutions are concatenated, so the network learns well on targets of different scales; however, the method generates many redundant detection boxes, which are then filtered by non-maximum suppression, reducing the efficiency of the network. For the FSA-Net method, soft stagewise regression is realized, and fine-grained structure mapping is proposed for feature aggregation. In the design of the network structure, features are first extracted through several branches and then aggregated over the spatial dimension by an aggregation module, reducing the number of feature channels. Finally, the regression problem is converted into a classification problem based on the SSR-Net method, and the angle is predicted with the following soft stagewise regression formula:
$$\hat{y} = \sum_{k=1}^{K} \mathbf{p}^{(k)} \cdot \mathbf{U}^{(k)}$$
where K is the number of stages, p is the probability distribution at stage k, and U is the vector of angle bins at stage k. Meanwhile, to reduce the generalization error, an offset vector and a scaling vector are introduced, which adjust the center of each bin and the width of the bins at stage k, respectively. The method can be applied to any regression problem, but it completes the estimation through several stages in series, the features obtained from the feature extraction network must be processed by feature aggregation and the like, so the forward computation cost is high, and it is difficult to fuse with a target detection task at the network-structure level to form one system. The CDAN approach introduces multilinear conditioning and entropy conditioning: it improves the recognition rate of the classifier by capturing the cross-covariance between the feature representation and the classifier prediction, and ensures the transferability of the classifier by controlling the uncertainty of the classifier prediction. In the prior art, target detection and attitude angle estimation therefore each have model structures designed for their own field, so it is difficult to obtain a simple and efficient model structure when designing a vision system that includes both target detection and attitude estimation. Meanwhile, many attitude angle estimation algorithms require detection boxes as input, and vision tasks based on deep learning models place requirements on the amount of training data and on the consistency between training data and test data, which limits their application in real scenes, because it is often difficult to obtain enough data and data annotation in real scenes.
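As a purely illustrative sketch of the soft stagewise regression idea described above (not code from the patent; variable names are assumptions), the angle prediction is the sum over stages of the expectation of each stage's bin values under the predicted probabilities:

```python
import numpy as np

def ssr_predict(stage_probs, stage_bins):
    """Soft stagewise regression: sum over stages of the expected bin value.

    stage_probs: list of K arrays, each a probability distribution over that stage's bins.
    stage_bins:  list of K arrays, the angle value represented by each bin of that stage.
    """
    return sum(float(np.dot(p, u)) for p, u in zip(stage_probs, stage_bins))

# Example with two stages: a coarse stage and a fine refinement stage.
coarse_p, coarse_u = np.array([0.1, 0.7, 0.2]), np.array([-60.0, 0.0, 60.0])
fine_p, fine_u = np.array([0.2, 0.5, 0.3]), np.array([-10.0, 0.0, 10.0])
print(ssr_predict([coarse_p, fine_p], [coarse_u, fine_u]))  # 7.0 degrees
```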
Disclosure of Invention
To address these problems, the invention provides a target detection and attitude estimation method based on data cross-modal transfer learning. A domain classification module and a negative-gradient (gradient reversal) optimization module for the network model are designed, and a complete vision system is built which realizes the functions of target detection and target attitude angle estimation.
In order to achieve the above object, the present invention provides a target detection and attitude estimation method based on data cross-modal transfer learning, comprising the following steps:
S1, collecting real scene data and labeling them, the labels comprising detection box coordinates (x, y, w, h) and attitude angles (yaw, pitch, roll), and outputting real data pictures with their corresponding labels;
S2, simulating in simulation software according to the real data pictures to obtain simulated images and labels, comprising detection box coordinates (x, y, w, h) and attitude angles (yaw, pitch, roll), and outputting simulated data pictures with their corresponding labels;
S3, inputting the simulated data pictures and their labels into the shared feature extraction network and the target detection module for training, and computing the loss using the input labels;
S4, cropping the features output by the shared feature extraction network according to the detection box information output by the target detection module, inputting the cropped features into the attitude angle estimation module for training, and computing the loss with the labels input in step S3;
S5, adding the losses computed in steps S3 and S4 and optimizing the network by back propagation until the error on the test data set no longer decreases;
S6, randomly mixing the simulation data and the real data to generate domain labels, and outputting mixed images with domain labels, target detection box labels and attitude angle labels;
S7, performing adversarial training on the shared feature extraction network through the transfer learning module using the data generated in step S6, to complete domain migration;
and S8, forward calculation: test or actual data are input, and the attitude angle is output after passing through the shared feature extraction network, the target detection module and the attitude angle estimation module.
Preferably, the shared feature extraction network is a multi-scale fusion feature extraction network that extracts features from the input image with a series of basic units composed of convolution and pooling. The first convolution module contains one 3 × 3 × 3 convolution layer with stride 1 and padding 1, a batch normalization layer and a LeakyReLU. The features then pass through 1, 2, 8 and 4 convolution unit groups respectively; each group begins with a 3 × 3 × n convolution layer with stride 2 and padding 1, n being the input feature dimension, which downsamples the features, and each group is wrapped in a residual connection so that the network keeps good learning capability at larger depths. Each convolution unit group consists of a 1 × 1 × n convolution with stride 1 and no padding, a batch normalization layer and a LeakyReLU, followed by a 3 × 3 × n convolution with stride 1 and padding 1, a batch normalization layer and a LeakyReLU. The image input is x ∈ X, where X denotes the image input space, and the feature extraction network f = G_f(x; θ_f) converts the input image x into a D-dimensional feature vector.
Preferably, the input to the target detection module is 3 features of different scales output by the shared feature extraction network, and multi-scale detection is adopted for targets of different sizes. For the loss computation, an ignore parameter is defined: if the maximum intersection-over-union of a prediction box with all ground-truth boxes is smaller than the ignore parameter, the prediction box is a negative sample; if the center point of a ground-truth box falls in a region, that region is responsible for detecting the object, and the prediction box with the largest intersection-over-union with the object is taken as the positive sample. A single grid cell holds B target bounding boxes, each consisting of five prediction parameters: the center point coordinates (x, y), the width and height (w, h) of the bounding box, and the confidence score s_i. The confidence score s_i is calculated as follows:
$$s_i = \Pr(\text{Object}) \times \mathrm{IoU}_{\text{pred}}^{\text{truth}}$$
where Object denotes the object and Pr(Object) denotes the probability that an object exists in the current grid cell's bounding box; $\mathrm{IoU}_{\text{pred}}^{\text{truth}}$ denotes the intersection over union (IoU) between the predicted bounding box and the ground truth, reflecting how accurately the current model predicts the position of the target bounding box. Given a predicted bounding box $box_{pred}$ and a ground-truth box $box_{truth}$, $\mathrm{IoU}_{\text{pred}}^{\text{truth}}$ can be expressed as:

$$\mathrm{IoU}_{\text{pred}}^{\text{truth}} = \frac{\mathrm{area}(box_{pred} \cap box_{truth})}{\mathrm{area}(box_{pred} \cup box_{truth})}$$
The class probability Pr(C_i | O) represents the posterior probability that the target belongs to object class i given that a target exists in the bounding box. If the target detection task has K object classes in total, each grid cell predicts the conditional probability Pr(C_i | O), i = 1, 2, …, K, for the i-th class C_i.
The final training loss is divided into 3 parts: (1) the error from the x, y, w, h terms, i.e. the detection-box loss; (2) the confidence loss; and (3) the classification loss. The calculation formulas are as follows:
$$l_{box} = \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2+(w_i-\hat{w}_i)^2+(h_i-\hat{h}_i)^2\right]$$

$$l_{obj} = \sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2+\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2$$

$$l_{cls} = \sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\left(p_i(c)-\hat{p}_i(c)\right)^2$$

$$loss = l_{box}+l_{obj}+l_{cls}$$
For the object detection module, the input image is divided into S × S grid cells, each responsible for detecting objects whose center points fall inside it. At test time, the class-specific confidence that an object is present in a target bounding box, which is used with non-maximum suppression to retain the required detection boxes, can be expressed as:

$$\Pr(C_i \mid \text{Object})\times\Pr(\text{Object})\times\mathrm{IoU}_{\text{pred}}^{\text{truth}} = \Pr(C_i)\times\mathrm{IoU}_{\text{pred}}^{\text{truth}}$$
preferably, the attitude angle estimation module includes a feature decoupling module and a cross-class center loss module.
Preferably, the feature decoupling module is implemented using three channel attention blocks.
Preferably, the cross-category central loss module defines the central loss part of each angle branch as follows:
$$\mathcal{L}_{c}^{j} = \frac{1}{2}\sum_{i=1}^{m}\left\| z_j^{(i)} - c_j^{(y_i)}\right\|_2^2$$

where $z_j^{(i)}$ is the i-th embedded depth feature, $c_j^{(y_i)}$ is the embedding center of the $y_i$-th class, which is updated during training, m is the mini-batch size, and j denotes each angular branch.
the distance between the hidden variables in the same discrete angle truth value is reduced according to different angle categories in the above section, so as to ensure intra-class consistency, while the hidden variables in different angle categories should be distributed in a decoupling subspace, but this is not reflected in the above central loss, and in order to alleviate this deficiency, the decoupling section of the cross-category central loss is further defined as follows:
$$\mathcal{L}_{d}^{j} = \sum_{i=1}^{m}\frac{\left\| z_j^{(i)} - c_j^{(y_i)}\right\|_2^2}{D_{j}^{(i)} + 1}$$

where j, j′, j″ ∈ {yaw; pitch; roll}, j ≠ j′ ≠ j″, and

$$D_{j}^{(i)} = \left\| z_j^{(i)} - c_{j'}^{(y_i)}\right\|_2^2 + \left\| z_j^{(i)} - c_{j''}^{(y_i)}\right\|_2^2$$

represents the cross-angle-category correlation distance; the 1 added to the denominator prevents the computation from overflowing.
the proposed CCC loss contains the above two parts, which can be written as follows:
$$\mathcal{L}_{CCC}^{j} = \mathcal{L}_{c}^{j} + \alpha\,\mathcal{L}_{d}^{j}$$

where α is a hyper-parameter used to trade off the two parts of the loss.
Preferably, the step S7 specifically includes the following steps:
S71, generating transfer learning labels: a domain-adversarial transfer learning algorithm is adopted. For most vision tasks, the problem can be cast as the network learning a posterior probability P(C, B | I), where I denotes the image features, B the detection box and C the object class. An input picture can be regarded as a sample of the joint probability distribution P(C, B, I) of B, C and I, and from a simple Bayesian equation the following formula is obtained:
P(C,B,I)=P(C,B|I)P(I)
In most transfer learning problems there is a basic assumption that the distribution of features in the target domain and the source domain is the same; because the simulation data and the real infrared data are very close in overall data distribution, this assumption holds here. Expressed as a formula, P(C, B | I) is the same for the two domains and the difference lies in P(I). The detection head of the detection network should therefore be the same, i.e. the detection effect should be consistent for both the target domain and the source domain. Consequently, for the whole network to achieve the source-domain detection effect on the target domain, the gap between P(I) on the source domain and on the target domain must be reduced as much as possible, i.e. the feature-extracting backbone is migrated to reduce the difference in P(I), and a domain classifier is introduced for this purpose. In the overall structure of the network, the domain classifier and the target task head are connected as two parallel sub-branches after the feature extraction backbone network (backbone); the domain classifier is a binary classifier whose classification label is source domain/target domain; the goal of the target task head is to minimize the target task loss, but it will obviously overfit the source-domain data set (i.e. the simulation data); the goal of the domain classifier is to minimize the binary classification error, i.e. to distinguish the two domains as well as possible; and the main task of the feature extraction network is to extract a feature shared by the target task head and the domain classifier, with two objectives: minimizing the target task loss function (target task head) and maximizing the binary classification error (confusing the domain classifier). The second objective is achieved by gradient reversal, which essentially multiplies the gradient of the domain classifier by a negative coefficient when it is back-propagated to the feature extraction backbone network, so that the backbone is optimized toward confusing the domain classifier. The source domain data and the target domain data are distributed over the input space X with distributions S(x) and T(x), respectively.
thus, the functions performed by the classifier include having input samples { x) from the source domain and the target domain 1 ,x 2 ,…,x N Is characterized by defining d i Is a domain label of the ith sample, wherein d i E.g., {0,1}, if d i =0, then
Figure BDA0003969746260000065
Otherwise if d i =1, then
Figure BDA0003969746260000066
First, the input image passes through the feature extraction backbone network. Let the feature extraction network be f = G_f(x; θ_f), converting the input image x into a D-dimensional feature vector f ∈ R^D. In the learning stage the goal is to minimize the label prediction loss on the labelled part of the training set (i.e. the source-domain data set part), so the parameters of the feature extraction network and of the target task head are optimized to minimize the loss of the source-domain samples; this ensures the discriminability of the feature f and the good prediction performance of the feature extractor and label predictor combined on the source domain. At the same time, the feature f should be domain-invariant, i.e. the goal of optimizing the network during training is to make the distributions S(f) = {G_f(x; θ_f) | x ∼ S(x)} and T(f) = {G_f(x; θ_f) | x ∼ T(x)} similar, which, under the covariate shift assumption, makes the target-domain label prediction accuracy the same as that of the source domain (Shimodaira, 2000). However, since f is high-dimensional and the distributions themselves change as learning progresses, measuring the dissimilarity of S(f) and T(f) is not straightforward. One way to estimate the dissimilarity is to look at the loss of the domain classifier, provided its parameters have been trained to distinguish the two feature distributions in an optimal way. Therefore, in training, to obtain domain-invariant features, we seek the parameters θ_f of the feature map that maximize the loss of the domain classifier (making the two feature distributions as similar as possible), while seeking the parameters θ_d that minimize the loss of the domain classifier; furthermore, we also seek to minimize the loss of the target task head, so the task can be written as:

$$E(\theta_f,\theta_y,\theta_d) = \sum_{\substack{i=1 \\ d_i=0}}^{N} L_y\!\left(G_y\!\left(G_f(x_i;\theta_f);\theta_y\right),\,y_i\right) \;-\; \lambda \sum_{i=1}^{N} L_d\!\left(G_d\!\left(G_f(x_i;\theta_f);\theta_d\right),\,d_i\right)$$
where L_y is the target task loss function and L_d is the domain classifier loss function, and $L_y^i$ and $L_d^i$ denote the corresponding losses of the i-th sample; so, based on the above equation, the sought parameters are:
$$(\hat\theta_f,\hat\theta_y) = \arg\min_{\theta_f,\theta_y} E(\theta_f,\theta_y,\hat\theta_d)$$

$$\hat\theta_d = \arg\max_{\theta_d} E(\hat\theta_f,\hat\theta_y,\theta_d)$$
The forward calculation and error back-propagation for minimizing the target task loss are consistent with the conventional deep learning procedure, while the maximization of the domain classification loss is solved by introducing a gradient reversal module: the gradient obtained by normal back-propagation through the domain classifier is multiplied by a negative coefficient −λ before entering the feature extraction network, so as to optimize the feature backbone network; that is, the network parameters are updated as follows:
$$\theta_f \leftarrow \theta_f - \mu\left(\frac{\partial L_y^i}{\partial \theta_f} - \lambda\frac{\partial L_d^i}{\partial \theta_f}\right)$$

$$\theta_y \leftarrow \theta_y - \mu\,\frac{\partial L_y^i}{\partial \theta_y}$$

$$\theta_d \leftarrow \theta_d - \mu\,\frac{\partial L_d^i}{\partial \theta_d}$$
Based on this, the data labeling must mark the data domain in addition to the labels of the target task itself. Since there are 2 data domains, real infrared data and simulation data, 0 is used to represent the real data domain and 1 the simulation data domain. The data format for each input image is therefore [input image (width, height, c), target task label (taking target detection as an example, (x, y, w, h)), domain label (0 or 1)];
and S72, network training is carried out.
Preferably, the step S72 specifically includes the following steps:
S721, the adversarial learning module is not enabled at the start of training; only the feature extraction network and the target task head are enabled, and the input data is simulation data only, so that the network performs the target task well on the simulation data set; meanwhile, part of the simulation data is randomly split off as a test data set, and training continues until the model's indicators no longer improve on this test data set;
S722, domain adversarial training: on the data input side, each batch of data in the training process is changed to contain both simulation data and real data, and the input labels include target task labels and domain labels; training then continues from the model obtained in step S721, with the feature extraction network, the target task head and the domain classifier all enabled, using a small learning rate (for example 1e-4 for the target detection task) for 5-10 rounds of iteration;
and S723, performing Finetune training on the real infrared training data set without enabling the domain classifier module.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention designs a domain classification module and a negative-gradient (gradient reversal) optimization module for the network model, and arrives at the domain migration method with the best effect, namely an adversarial domain-classifier transfer learning unit.
2. The invention builds a complete vision system that realizes the functions of target detection and target attitude angle estimation. A shared feature extraction network (backbone) is designed for the two vision tasks; through residual connections and multi-scale feature fusion, the feature extraction network characterizes well the features of different scales required by the different tasks. By improving the attitude-angle task head, the network module performs attitude angle estimation on the targets inside the detection boxes on the basis of the shared backbone feature extraction network, so the systematic functions of target detection and motion attitude estimation are completed while saving computational overhead; meanwhile, the attitude estimation module, built on top of the detection module, can filter out more unnecessary background features, improving its accuracy. The attitude angle estimation task is converted into joint classification and regression tasks and trained in a multi-task manner, improving the performance of the network.
Drawings
FIG. 1 is a schematic diagram of the overall network architecture provided by the present invention;
FIG. 2 is a schematic diagram of the overall structure of a feature extraction network according to the present invention;
FIG. 3 is a schematic diagram of the overall structure of an attitude angle estimation module network according to the present invention;
FIG. 4 is a flow chart of the present invention for constructing a hybrid loss;
fig. 5 is a schematic diagram of the overall structure of the transfer learning network provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention mainly focuses on target detection and attitude angle estimation. The calculation flow in training and forward inference is as follows: first, an input image passes through the shared feature extraction network to obtain image features D; the features D are then fed into the target detection head, whose computation yields detection boxes; next, the detection box coordinates are used to crop the image features D output by the shared feature extraction network, giving cropped features d. The features d are input to the attitude angle estimation module to compute the attitude angle output. The overall network structure is shown in Fig. 1.
The invention provides a target detection and attitude estimation method based on data cross-modal transfer learning, which comprises the following steps:
S1, collecting real scene data and labeling them, the labels comprising detection box coordinates (x, y, w, h) and attitude angles (yaw, pitch, roll), and outputting real data pictures with their corresponding labels;
S2, simulating in simulation software according to the real data pictures to obtain simulated images and labels, comprising detection box coordinates (x, y, w, h) and attitude angles (yaw, pitch, roll), and outputting simulated data pictures with their corresponding labels;
S3, inputting the simulated data pictures and their labels into the shared feature extraction network and the target detection module for training, and computing the loss using the input labels;
S4, cropping the features output by the shared feature extraction network according to the detection box information output by the target detection module, inputting the cropped features into the attitude angle estimation module for training, and computing the loss with the labels input in step S3;
S5, adding the losses computed in steps S3 and S4 and optimizing the network by back propagation until the error on the test data set no longer decreases;
S6, randomly mixing the simulation data and the real data to generate domain labels, and outputting mixed images with domain labels, target detection box labels and attitude angle labels;
S7, performing adversarial training on the shared feature extraction network through the transfer learning module using the data generated in step S6, to complete domain migration;
and S8, forward calculation: test or actual data are input, and the attitude angle is output after passing through the shared feature extraction network, the target detection module and the attitude angle estimation module (a schematic training-and-inference sketch of these steps is given below).
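The following Python sketch (using PyTorch-style modules) shows one way steps S3-S8 could be organized as training and inference loops; the module objects and the crop_features helper are hypothetical placeholders introduced for illustration, not the patent's implementation.

```python
import torch

def train_and_infer(backbone, det_head, pose_head, domain_clf,
                    sim_loader, mixed_loader, test_image, crop_features):
    """Hypothetical organization of steps S3-S8; all module interfaces are assumptions."""
    params = (list(backbone.parameters()) + list(det_head.parameters())
              + list(pose_head.parameters()) + list(domain_clf.parameters()))
    opt = torch.optim.Adam(params, lr=1e-3)

    # S3-S5: supervised training on simulation data only
    for images, boxes_gt, angles_gt in sim_loader:
        feats = backbone(images)                                        # shared features D
        det_loss, boxes = det_head(feats, boxes_gt)                     # S3: detection loss
        pose_loss = pose_head(crop_features(feats, boxes), angles_gt)   # S4: pose loss on crops
        opt.zero_grad(); (det_loss + pose_loss).backward(); opt.step()  # S5: joint optimization

    # S6-S7: domain-adversarial training on mixed simulation + real batches
    for images, boxes_gt, angles_gt, domain_gt in mixed_loader:
        feats = backbone(images)
        det_loss, boxes = det_head(feats, boxes_gt)
        pose_loss = pose_head(crop_features(feats, boxes), angles_gt)
        dom_loss = domain_clf(feats, domain_gt)       # gradient is reversed inside domain_clf
        opt.zero_grad(); (det_loss + pose_loss + dom_loss).backward(); opt.step()

    # S8: forward calculation on test or actual data
    with torch.no_grad():
        feats = backbone(test_image)
        _, boxes = det_head(feats, None)
        return pose_head(crop_features(feats, boxes))
```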
In a real scene, real data and data labels may be difficult to obtain and models are difficult to train with limited training samples, so similarly distributed data, such as simulation data, are often used together with the real data for training. However, compared with the test data of the real scene, such data still differ somewhat in data distribution, so the invention adopts a transfer learning method to align the feature domains. In the following description, the simulation data set is used as the source-domain data set and the real data set as the target-domain data set. Transfer learning mainly comprises (1) sample-based transfer learning in the data domain, in which the labels of the source-domain data set are adjusted, the source-domain data is used as an aid to adjust the weights of the target-domain labels, and the target model is obtained through collaborative training; this method places certain requirements on the amount of target-domain data; (2) transfer learning based on input data features, in which aligned features are obtained by adjusting the parameters of the feature extraction network, reducing the gap between source and target domains and thus the errors of vision tasks such as classification and regression; (3) transfer learning based on model parameters, which completes the migration by finding shared parameters or a prior relation between the source domain and the target domain; and (4) relation-based transfer learning, which establishes relational knowledge between the source domain and the target domain and completes the migration of samples through knowledge mapping.
The feature extraction network is composed of feature extraction units. The invention realizes a multi-scale fusion feature extraction network, which extracts features from the input image with a series of basic units composed of convolution and pooling. The first convolution module contains one 3 × 3 × 3 convolution layer with stride 1 and padding 1, one batch normalization layer and one LeakyReLU. The features are then downsampled by passing through 1, 2, 8 and 4 convolution unit groups respectively, where each group begins with a 3 × 3 × n convolution layer with stride 2 and padding 1, n being the input feature dimension. Meanwhile, each unit is wrapped in a residual connection so that the network keeps good learning capability at larger depths. Each convolution unit group is composed of a 1 × 1 × n convolution with stride 1 and no padding, one batch normalization layer and one LeakyReLU, followed by a 3 × 3 × n convolution with stride 1 and padding 1, one batch normalization layer and one LeakyReLU. The image input is x ∈ X, where X denotes the image input space, and the feature extraction network f = G_f(x; θ_f) converts an input image x into a D-dimensional feature vector. A schematic diagram of the overall structure of the feature extraction network is shown in Fig. 2.
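A minimal PyTorch-style sketch of one such convolution unit group with its residual connection is given below; the Darknet-style halving of channels in the 1 × 1 layer and the LeakyReLU slope are assumptions, not values stated in the patent.

```python
import torch.nn as nn

class ConvUnitGroup(nn.Module):
    """1x1 conv -> BN -> LeakyReLU -> 3x3 conv -> BN -> LeakyReLU, wrapped in a residual add."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1, stride=1, padding=0, bias=False),
            nn.BatchNorm2d(channels // 2),
            nn.LeakyReLU(0.1),
            nn.Conv2d(channels // 2, channels, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return x + self.block(x)  # residual connection preserves learning capability at depth
```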
As shown in Fig. 1, this set of D-dimensional feature vectors must be input into the target detection module and the attitude angle estimation module at the same time. Within this set of feature vectors, the semantic information required by one of the vision tasks may be useless noise for the other task and may even hinder the training of the network. The invention therefore also introduces a channel self-attention module so that this set of features adapts better to both vision tasks. The module mainly comprises 3 steps: (1) global pooling, which aggregates the features; (2) activation of specific features, using the structure fully-connected layer 1 -> ReLU -> fully-connected layer 2 -> Sigmoid, where fully-connected layer 1 reduces the dimension relative to the original number of channels to lower complexity and achieve better generalization, and the connection from fully-connected layer 1 to fully-connected layer 2 raises the dimension back to the original number of channels; (3) a scaling operation, which applies the weights obtained after the activation of specific features to the original feature map.
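A minimal PyTorch-style sketch of the channel self-attention module described above (global pooling, FC1 -> ReLU -> FC2 -> Sigmoid, then scaling); the reduction ratio of 16 is an assumed value.

```python
import torch.nn as nn

class ChannelSelfAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # (1) global pooling / feature aggregation
        self.fc = nn.Sequential(                          # (2) specific-feature activation
            nn.Linear(channels, channels // reduction),   # FC1: dimension reduction
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),   # FC2: restore the original channel count
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                      # (3) scaling of the original feature map
```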
The inputs of the target detection module and of the attitude angle estimation module are both the features output by the shared feature extraction network; as shown in the overall system diagram, the two modules are two branches of the shared feature extraction network. The input of the target detection module is 3 features of different scales, drawing on the feature pyramid, and targets of different sizes are detected at multiple scales. For the loss computation, the invention defines an ignore parameter: if the maximum intersection-over-union of a prediction box with all ground-truth boxes is smaller than the ignore parameter, the prediction box is a negative sample; if the center point of a ground-truth box falls within a region, that region is responsible for detecting the object, and the prediction box with the largest intersection-over-union with the object is taken as the positive sample. A single grid cell holds B target bounding boxes, each consisting of five prediction parameters: the center point coordinates (x, y), the width and height (w, h) of the bounding box, and the confidence score s_i.
The confidence score s_i is calculated as follows:
$$s_i = \Pr(\text{Object}) \times \mathrm{IoU}_{\text{pred}}^{\text{truth}}$$
where Object denotes the object and Pr(Object) denotes the probability that an object exists in the current grid cell's bounding box.
$\mathrm{IoU}_{\text{pred}}^{\text{truth}}$ denotes the intersection over union (IoU) between the predicted bounding box and the ground truth, reflecting how accurately the current model predicts the position of the target bounding box.
Given a predicted bounding box $box_{pred}$ and a ground-truth box $box_{truth}$, $\mathrm{IoU}_{\text{pred}}^{\text{truth}}$ can be expressed as:

$$\mathrm{IoU}_{\text{pred}}^{\text{truth}} = \frac{\mathrm{area}(box_{pred} \cap box_{truth})}{\mathrm{area}(box_{pred} \cup box_{truth})}$$
The class probability Pr(C_i | O) represents the posterior probability that the target belongs to object class i given that a target exists in the bounding box. Assuming the target detection task has K object classes in total, each grid cell predicts the conditional probability Pr(C_i | O), i = 1, 2, …, K, for the i-th class C_i.
The final training loss is divided into 3 parts: (1) the error from the x, y, w, h terms, i.e. the detection-box loss; (2) the confidence loss; and (3) the classification loss. The calculation formulas are as follows:
$$l_{box} = \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2+(w_i-\hat{w}_i)^2+(h_i-\hat{h}_i)^2\right]$$

$$l_{obj} = \sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2+\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2$$

$$l_{cls} = \sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\left(p_i(c)-\hat{p}_i(c)\right)^2$$

$$loss = l_{box}+l_{obj}+l_{cls}$$
For the target detection module, the input image is divided into S × S grid cells, each of which is responsible for detecting target objects whose center points fall inside it.
At test time, the class-specific confidence that an object is present in a target bounding box, which is used with non-maximum suppression to retain the required detection boxes, can be expressed as:

$$\Pr(C_i \mid \text{Object})\times\Pr(\text{Object})\times\mathrm{IoU}_{\text{pred}}^{\text{truth}} = \Pr(C_i)\times\mathrm{IoU}_{\text{pred}}^{\text{truth}}$$
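For illustration, a simple NumPy sketch of the IoU computation and the greedy non-maximum suppression used to keep the required detection boxes follows; the 0.5 IoU threshold is an assumed default, not a value from the patent.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression over class-specific confidence scores."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order[0]
        keep.append(best)
        order = [j for j in order[1:] if iou(boxes[best], boxes[j]) < iou_thresh]
    return keep
```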
The above flow describes a target detection system, and the original image is cropped based on the detection box obtained by target detection. Owing to the translation invariance of convolution and the design of the shared feature extraction network in the invention, the position of the detection box in the original image can be cropped from the feature map through the same scaling and translation transformation. For example, if the original image has size (w₁, h₁), the detection box has center coordinates and width/height (x, y, w₂, h₂), and the feature map obtained from the backbone feature extraction network has size (w₃, h₃, C), then cropping on the feature map in the same way gives a cropping box whose center coordinates and width/height are

$$\left(\frac{x\,w_3}{w_1},\ \frac{y\,h_3}{h_1},\ \frac{w_2\,w_3}{w_1},\ \frac{h_2\,h_3}{h_1}\right)$$

while the channel dimension is still C.
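A small Python sketch of this coordinate mapping from the original image to the feature map is shown below; the clamping to the feature-map bounds is an added assumption.

```python
def crop_feature_map(feat, box, img_size):
    """Crop a backbone feature map using a detection box given in original-image coordinates.

    feat:     feature map of shape (C, h3, w3)
    box:      (x, y, w2, h2) center coordinates and size in the original image
    img_size: (w1, h1) original image size
    """
    c, h3, w3 = feat.shape
    w1, h1 = img_size
    x, y, w2, h2 = box
    # scale the box by the same factor the backbone downsampled the image with
    cx, cy = x * w3 / w1, y * h3 / h1
    cw, ch = w2 * w3 / w1, h2 * h3 / h1
    x0, x1 = int(max(0, cx - cw / 2)), int(min(w3, cx + cw / 2))
    y0, y1 = int(max(0, cy - ch / 2)), int(min(h3, cy + ch / 2))
    return feat[:, y0:y1, x0:x1]    # channel dimension C is preserved
```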
The attitude angle estimation module comprises a feature decoupling module and a cross-class center loss module, and the schematic diagram of the overall structure of the attitude angle estimation module network is shown in FIG. 3.
The feature decoupling module takes the feature map output by the feature extraction network as input and decouples the hidden-variable spaces of the different angles, as shown in the figure. The invention uses three channel attention blocks to implement the feature decoupling module. It is commonly accepted that different levels of information are encoded in different layers of a CNN: pose information is typically encoded in higher-level features of the network, while lower-level features tend to encode more detailed image information such as edge and texture features. By adding a feature decoupling module to the output feature map of the backbone convolutional neural network, the proposed feature decoupling network can reduce the adverse effects of background clutter, semantic ambiguity and the like appearing in the low-level features of the image. In the feature decoupling module, a parameterized channel attention mechanism adaptively readjusts the channel-wise responses of each angle branch; concretely, it is implemented with a bottleneck layer containing two fully-connected layers and a nonlinear activation function. Since the correlation between channels is implicitly encoded in the learned filters, the feature decoupling module selects the channel features containing more information by performing angle-dependent feature readjustment, thereby explicitly capturing more discriminative features for each angle while suppressing less useful ones; the module parameters are updated with the angle-dependent losses.
The attitude angle estimation itself can be viewed as a natural regression problem. Previous work showed that joint supervision of classification and regression could bring about an improvement in model performance. Therefore, the present invention also uses two losses along this line to construct a mixed Loss (Mix Loss), as shown in FIG. 4.
The angle-dependent mixed loss (Mix Loss) takes the decoupled features as input; they are passed to a fully-connected layer, the loss includes both classification and regression terms, and a Softmax produces a probabilistic prediction for each bin. Because in the attitude angle estimation problem the ordered discrete labels have an obvious semantic relation, the classification label of the method is a one-dimensional Gaussian distribution with the ground-truth class as its mean and a small variance. The classification loss is obtained by computing the Kullback-Leibler (KL) divergence between the label distribution and the predicted distribution. Then, by computing the expectation of the bin outputs, a prediction of the angle value is obtained, and the mean squared error (MSE) is used as the regression loss. The final Mix Loss of each angular branch is as follows:
$$\mathcal{L}_{\text{Mix}}^{j} = \mathrm{KL}\!\left(G(y_j)\,\big\|\,q_j\right) + \lambda\left(\hat{y}_j - y_j\right)^2$$

where $q_j$ is the output classification probability, $y_j$ is the ground-truth angle, G(·) denotes the result of the one-dimensional Gaussian filtering applied to the original one-hot class code, $\hat{y}_j$ is the final angle prediction (obtained as the expectation over the bin outputs), b is the bin index, j ∈ {yaw; pitch; roll} denotes each angular branch, and λ is a hyper-parameter used to balance the classification loss and the regression mean-squared-error loss.
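A PyTorch-style sketch of such a Mix Loss (Gaussian-smoothed classification labels, KL divergence, and MSE on the expected angle) is given below; the Gaussian width sigma and the default lambda are assumed values.

```python
import torch
import torch.nn.functional as F

def mix_loss(logits, y_true, bin_centers, sigma=3.0, lam=1.0):
    """Hypothetical Mix Loss: KL(classification) + lambda * MSE(regression).

    logits:      (B, num_bins) raw branch outputs
    y_true:      (B,) ground-truth angles in degrees
    bin_centers: (num_bins,) angle represented by each bin
    """
    # Gaussian-smoothed label distribution centred on the ground-truth angle
    target = torch.exp(-(bin_centers[None, :] - y_true[:, None]) ** 2 / (2 * sigma ** 2))
    target = target / target.sum(dim=1, keepdim=True)

    log_q = F.log_softmax(logits, dim=1)
    kl = F.kl_div(log_q, target, reduction="batchmean")      # classification term

    q = log_q.exp()
    y_pred = (q * bin_centers[None, :]).sum(dim=1)           # expectation over bins
    mse = F.mse_loss(y_pred, y_true)                         # regression term
    return kl + lam * mse
```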
Cross-category center loss: a Cross-Category Center (CCC) loss is further proposed to achieve both intra-class compactness and inter-class separability of the hidden-variable subspaces. Similar to Wen et al., the center-loss part of the proposed cross-category center loss for each angular branch is defined as follows:
$$\mathcal{L}_{c}^{j} = \frac{1}{2}\sum_{i=1}^{m}\left\| z_j^{(i)} - c_j^{(y_i)}\right\|_2^2$$

where $z_j^{(i)}$ is the i-th embedded depth feature, $c_j^{(y_i)}$ is the embedding center of the $y_i$-th class, which is updated during training, m is the mini-batch size, and j denotes each angular branch.
The term above reduces, for each angle category, the distance between hidden variables that share the same discrete angle truth value, ensuring intra-class compactness. The hidden variables of different angle categories, however, should be distributed in decoupled subspaces, which the central loss above does not reflect. To alleviate this deficiency, the decoupling part of the cross-category center loss is further defined as follows:
$$\mathcal{L}_{d}^{j} = \sum_{i=1}^{m}\frac{\left\| z_j^{(i)} - c_j^{(y_i)}\right\|_2^2}{D_{j}^{(i)} + 1}$$

where j, j′, j″ ∈ {yaw; pitch; roll}, j ≠ j′ ≠ j″, and

$$D_{j}^{(i)} = \left\| z_j^{(i)} - c_{j'}^{(y_i)}\right\|_2^2 + \left\| z_j^{(i)} - c_{j''}^{(y_i)}\right\|_2^2$$

represents the cross-angle-category correlation distance; the 1 added to the denominator prevents the computation from overflowing.
The proposed CCC loss contains the above two parts, which can be written as follows:
$$\mathcal{L}_{CCC}^{j} = \mathcal{L}_{c}^{j} + \alpha\,\mathcal{L}_{d}^{j}$$

where α is a hyper-parameter used to trade off the two parts of the loss.
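A PyTorch-style sketch of the CCC loss under the above definitions follows; the exact form of the cross-angle-category distance in the denominator is an assumption consistent with the description, not a formula confirmed by the source.

```python
def ccc_loss(z, centers, labels, branch, alpha=0.5):
    """Hypothetical cross-category center loss for one angular branch.

    z:       dict mapping branch name -> (B, D) embedded features
    centers: dict mapping branch name -> (num_classes, D) class centers
    labels:  (B,) discrete angle-class indices for this branch
    branch:  one of "yaw", "pitch", "roll"
    """
    others = [b for b in ("yaw", "pitch", "roll") if b != branch]
    own_center = centers[branch][labels]                       # (B, D)
    intra = ((z[branch] - own_center) ** 2).sum(dim=1)         # intra-class distance

    # cross-angle-category distance to the other branches' centers (assumed form)
    cross = sum(((z[branch] - centers[b][labels]) ** 2).sum(dim=1) for b in others)
    decouple = (intra / (cross + 1.0)).mean()                  # +1 prevents overflow
    center_term = 0.5 * intra.mean()
    return center_term + alpha * decouple
```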
Network structure and optimization: taking both the feature extraction capacity and the number of model parameters into account, the backbone network is constructed based on the Inverted Residual Block proposed by Sandler et al. The input/output and internal structure of the inverted residual block are listed in Table 1; compared with the original residual block, the channels of the input feature map are first expanded and then projected back down, and no activation function is added after the last convolution. Concretely, the backbone neural network of the invention has 9 inverted residual blocks, besides an initial convolutional layer with 32 filters and a final convolutional layer with 640 filters; its design details are listed in Table 2.
Finally the loss per angular branch can be written as follows:
$$\mathcal{L}_{y} = \mathcal{L}_{\text{Mix}}^{y} + \mathcal{L}_{CCC}^{y}$$
TABLE 1 Input-output and internal structure of the inverted residual block (presented as an image in the original document)
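A minimal PyTorch-style sketch of a MobileNetV2-style inverted residual block consistent with the description (expand, depthwise 3 × 3, project with no final activation); the depthwise convolution follows Sandler et al.'s design, and the exact layer widths are assumptions since Table 1 is only available as an image in the source.

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Expand -> depthwise 3x3 -> project (no activation after the last convolution)."""
    def __init__(self, in_ch, out_ch, stride, expansion):
        super().__init__()
        hidden = in_ch * expansion
        self.use_skip = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),  # no activation here
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out
```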
TABLE 2 Backbone neural network architecture

| Input size | Expansion factor | Number of channels | Number of repetitions | Stride |
| --- | --- | --- | --- | --- |
| 112 × 112 × 32 | 1 | 16 | 1 | 1 |
| 112 × 112 × 16 | 6 | 24 | 2 | 2 |
| 56 × 56 × 24 | 6 | 32 | 2 | 2 |
| 28 × 28 × 32 | 6 | 96 | 2 | 2 |
| 14 × 14 × 96 | 6 | 240 | 2 | 2 |
Where y ∈ { yaw, pitch, roll } represents each angular branch. The embedding center is updated with the following formula,
$$\Delta c_j^{(k)} = \frac{\sum_{i=1}^{m}\delta(y_i = k)\,\bigl(c_j^{(k)} - z_j^{(i)}\bigr)}{1 + \sum_{i=1}^{m}\delta(y_i = k)}$$

where $c_j^{(k)}$ denotes the k-th center of the j-th angular branch, and δ(condition) = 1 if the condition is satisfied, otherwise δ(condition) = 0. The following algorithm elaborates the training procedure of the proposed method (given as an image in the original document).
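Because the algorithm listing is only available as an image in the source, the following Python sketch gives one plausible reading of the per-batch training procedure for the three angular branches, reusing the mix_loss and ccc_loss sketches above; the model interface, the update_centers helper and the loop order are assumptions.

```python
import torch

def train_pose_branches(model, centers, loader, optimizer, alpha=0.5):
    """Hypothetical mini-batch training loop for the yaw/pitch/roll branches."""
    for images, angles, angle_classes in loader:      # continuous angles and discretized classes
        feats = model.backbone(images)
        total = 0.0
        for branch in ("yaw", "pitch", "roll"):
            z, logits = model.branch(branch, feats)   # decoupled feature and bin logits
            total = total + mix_loss(logits, angles[branch], model.bin_centers) \
                          + ccc_loss({branch: z}, centers, angle_classes[branch], branch, alpha)
        optimizer.zero_grad()
        total.backward()
        optimizer.step()
        with torch.no_grad():                         # embedding-center update (delta rule above)
            for branch in ("yaw", "pitch", "roll"):
                update_centers(centers[branch], angle_classes[branch])
```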
In an actual scene, data acquisition for many training tasks is difficult, so the amount of real data is limited and a deep learning network is hard to train. Under this condition, training can be assisted with simulation data or other available data from similar environments, whose amount is relatively large. The two data domains deviate from each other somewhat in distribution, so if the model is trained on simulation data and tested directly on real data, its performance drops greatly. The invention therefore introduces a transfer learning algorithm combined with Finetune, making full use of both the simulation data and the real data so that the model performs better in the real scene.
Data preparation: according to the data and background of the real scene, related similar data sets are simulated in simulation software or collected. Different labels are generated for the different downstream subtasks, and the label format must be consistent with that of the real data; taking the detection box as an example, the label is (x, y, w, h), representing the coordinates of the target center point and its width and height.
Transfer learning label generation: the invention adopts a domain-adversarial transfer learning algorithm. For most vision tasks, the problem can be cast as the network learning a posterior probability P(C, B | I), where I denotes the image features, B the detection box and C the object class. An input picture can be regarded as a sample of the joint probability distribution P(C, B, I) of B, C and I. Then, according to a simple Bayesian equation, the following formula is obtained:
P(C,B,I)=P(C,B|I)P(I)
In most transfer learning problems there is a basic assumption that the distribution of features in the target domain and the source domain is the same. In the invention this assumption holds, because the simulation data and the real infrared data have very similar distributions. Expressed as a formula, P(C, B | I) is the same for both domains, and the difference lies in P(I). Thus, the detection head of the detection network should be the same, i.e. the detection effect should be consistent for both the target domain and the source domain. Therefore, for the whole network to retain the source-domain detection performance on the target domain, the gap between P(I) on the source domain and on the target domain must be reduced as much as possible, i.e. the feature-extracting backbone is migrated to reduce the difference in P(I). A domain classifier is introduced for this purpose, and the overall structural diagram of the transfer learning network is shown in Fig. 5:
The domain classifier and the target task head are connected as two parallel sub-branches after the feature extraction backbone network (backbone). The domain classifier is a binary classifier whose classification label is source domain/target domain. The goal of the target task head is to minimize the target task loss function, but it will clearly overfit the source-domain data set (i.e. the simulation data); the goal of the domain classifier is to minimize the binary classification error, i.e. to distinguish the two domains as well as possible; the main task of the feature extraction network is to extract a feature shared by the target task head and the domain classifier, with two objectives: minimizing the target task loss function (for the target task head) and maximizing the binary classification error (confusing the domain classifier). The second objective is achieved with gradient reversal: when the gradient from the domain classifier is back-propagated to the feature extraction backbone network, it is multiplied by a negative coefficient, so that the feature extraction backbone network is optimized toward confusing the domain classifier.
In this step, the source domain data and the target domain data are distributed over the input space X with distributions S(x) and T(x), respectively. Thus, given input samples {x₁, x₂, …, x_N} from the source domain and the target domain, a domain label d_i ∈ {0, 1} is defined for the i-th sample: if d_i = 0 then x_i ∼ S(x), otherwise if d_i = 1 then x_i ∼ T(x).
First, the input image passes through the feature extraction backbone network. Let the feature extraction network be f = G_f(x; θ_f), converting the input image x into a D-dimensional feature vector f ∈ R^D. In the learning phase, the goal is to minimize the label prediction loss on the labelled part of the training set (i.e. the source-domain data set part), so the parameters of both the feature extraction network and the target task head are optimized to minimize the loss of the source-domain samples. This ensures the discriminability of the feature f and the good overall prediction performance of the feature extractor and label predictor combined on the source domain. At the same time, it is also desired that the feature f be domain-invariant, i.e. the goal of the network optimization during training is to make the distributions S(f) = {G_f(x; θ_f) | x ∼ S(x)} and T(f) = {G_f(x; θ_f) | x ∼ T(x)} similar. Under the covariate shift assumption, this makes the label prediction accuracy on the target domain the same as on the source domain (Shimodaira, 2000). However, since f is high-dimensional and the distributions themselves change as learning progresses, measuring the dissimilarity of S(f) and T(f) is not straightforward. One way to estimate the dissimilarity is to look at the loss of the domain classifier, provided its parameters have been trained to distinguish the two feature distributions in an optimal way. Thus, in training, to obtain domain-invariant features, we seek the parameters θ_f of the feature map that maximize the loss of the domain classifier (making the two feature distributions as similar as possible), while seeking the parameters θ_d that minimize the loss of the domain classifier. In addition, the loss of the target task head is minimized. Thus, the task can be written as:

$$E(\theta_f,\theta_y,\theta_d) = \sum_{\substack{i=1 \\ d_i=0}}^{N} L_y\!\left(G_y\!\left(G_f(x_i;\theta_f);\theta_y\right),\,y_i\right) \;-\; \lambda \sum_{i=1}^{N} L_d\!\left(G_d\!\left(G_f(x_i;\theta_f);\theta_d\right),\,d_i\right)$$
where L_y is the target task loss function and L_d is the domain classifier loss function, and $L_y^i$ and $L_d^i$ denote the corresponding losses of the i-th sample. Based on the above formula, the sought parameters are:

$$(\hat\theta_f,\hat\theta_y) = \arg\min_{\theta_f,\theta_y} E(\theta_f,\theta_y,\hat\theta_d)$$

$$\hat\theta_d = \arg\max_{\theta_d} E(\hat\theta_f,\hat\theta_y,\theta_d)$$
The forward calculation and error back-propagation for minimizing the target task loss are consistent with the conventional deep learning procedure, while the maximization of the domain classification loss is solved by introducing a gradient reversal module: the gradient obtained by normal back-propagation through the domain classifier is multiplied by a negative coefficient −λ before it enters the feature extraction network, so as to optimize the feature backbone network accordingly. The network parameters are updated as follows:

$$\theta_f \leftarrow \theta_f - \mu\left(\frac{\partial L_y^i}{\partial \theta_f} - \lambda\frac{\partial L_d^i}{\partial \theta_f}\right)$$

$$\theta_y \leftarrow \theta_y - \mu\,\frac{\partial L_y^i}{\partial \theta_y}$$

$$\theta_d \leftarrow \theta_d - \mu\,\frac{\partial L_d^i}{\partial \theta_d}$$
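A minimal PyTorch-style sketch of the gradient reversal module described above (a standard DANN-style implementation, not code from the patent); the hidden width of the classifier and the default lambda coefficient are assumptions.

```python
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class DomainClassifier(nn.Module):
    """Binary source/target classifier attached after the shared backbone."""
    def __init__(self, feat_dim, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, features):
        reversed_feats = GradientReversal.apply(features, self.lambd)
        return self.net(reversed_feats)    # trained with cross-entropy on the domain label
```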
Based on this, the data annotation of the invention must label the data domain in addition to the labels of the target task itself. Since there are 2 data domains, real infrared data and simulation data, 0 is used to represent the real data domain and 1 the simulation data domain. Thus, for each input image, the data format is [input image (width, height, c), target task label (taking target detection as an example, (x, y, w, h)), domain label (0 or 1)].
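For illustration only, one such annotated sample could be represented as a Python dictionary; the field names below are hypothetical and not a format defined by the patent.

```python
sample_annotation = {
    "image": "frame_000123.png",          # input image of shape (width, height, c)
    "boxes": [(0.42, 0.55, 0.10, 0.08)],  # target task label: (x, y, w, h), normalized
    "angles": [(12.0, -3.5, 1.2)],        # attitude label: (yaw, pitch, roll) in degrees
    "domain": 0,                          # 0 = real infrared data, 1 = simulation data
}
```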
And (5) network training. The network training is mainly divided into the following steps:
(1) The adversarial learning module is not enabled at the start of training; only the feature extraction network and the target task head are enabled, and the input data is simulation data only, so that the network performs the target task well on the simulation data set. Meanwhile, part of the simulation data is randomly split off as a test data set, and training continues until the model's indicators no longer improve on this test data set.
(2) Domain adversarial training. On the data input side, each batch of data in the training process is changed to contain both simulation data and real data, and the input labels include target task labels and domain labels. Training then continues from the model obtained in step (1), with the feature extraction network, the target task head and the domain classifier all enabled, using a small learning rate (for example 1e-4 for the target detection task). A good effect can be achieved after 5-10 rounds of iteration; during training, the target-task metrics on the real-data test set should be monitored to avoid over-fitting.
(3) Finetune training. Without enabling the domain classifier module, Finetune training is performed on the real infrared training data set, further improving the effect of the network on the real data set.
In the network forward process, a domain classifier is not required to participate in calculation, and a forward result is obtained only through the calculation of the feature extraction network and the target task head.
The main effects of the invention are reflected in two aspects: (1) by means of transfer learning, the model accuracy on the target data set is significantly improved while only a small amount of real data is used; (2) by means of the shared feature extraction network and the end-to-end overall architecture, the model parameters and the computation time needed to complete target detection and attitude angle estimation are significantly reduced.
Using a separate object detection network (YOLOv3) and pose estimation network (FDN), the forward computation time is 213 ms on a GTX 1080; with the network structure herein, the forward time at the same image size is reduced to 97 ms.
The following table compares the transfer learning method herein with direct testing and with Finetune on a real data set, where the Cityscapes data set is selected as the source-domain data set and the Foggy-Cityscapes data set as the target-domain data set in order to simulate the source domain and the target domain.
(The comparison table is provided as an image in the original document.)
Although the invention herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims. It should be understood that various dependent claims and the features described herein may be combined in ways different from those described in the original claims. It is also to be understood that features described in connection with individual embodiments may be used in other described embodiments.

Claims (8)

1. The target detection and attitude estimation method based on data cross-modal transfer learning is characterized by comprising the following steps of:
S1, collecting real scene data and labeling them, the labels comprising detection box coordinates (x, y, w, h) and attitude angles (yaw, pitch, roll), and outputting real data pictures with their corresponding labels;
S2, simulating in simulation software according to the real data pictures to obtain simulated images and labels, comprising detection box coordinates (x, y, w, h) and attitude angles (yaw, pitch, roll), and outputting simulated data pictures with their corresponding labels;
S3, inputting the simulated data pictures and their labels into the shared feature extraction network and the target detection module for training, and computing the loss using the input labels;
S4, cropping the features output by the shared feature extraction network according to the detection box information output by the target detection module, inputting the cropped features into the attitude angle estimation module for training, and computing the loss with the labels input in step S3;
S5, adding the losses computed in steps S3 and S4 and optimizing the network by back propagation until the error on the test data set no longer decreases;
S6, randomly mixing the simulation data and the real data to generate domain labels, and outputting mixed images with domain labels, target detection box labels and attitude angle labels;
S7, performing adversarial training on the shared feature extraction network through the transfer learning module using the data generated in step S6, to complete domain migration;
and S8, forward calculation: test or actual data are input, and the attitude angle is output after passing through the shared feature extraction network, the target detection module and the attitude angle estimation module.
2. The method of claim 1, wherein the shared feature extraction network is a multi-scale fusion feature extraction network that extracts features from the input image with a plurality of convolution and pooling basic units in series; the first convolution module comprises one 3×3 convolution layer with a stride of 1 and padding of 1, a batch normalization layer and a LeakyReLU; the features then pass through 1, 2, 8 and 4 convolution unit groups respectively, each unit group including a 3×3×n convolution layer with a stride of 2 and padding of 1 (n being the input feature dimension) that down-samples the features; the units are connected by residual connections so that the network keeps a good learning capacity at large depth; each convolution unit group sequentially comprises a 1×1×n convolution with a stride of 1 and no padding, one batch normalization layer and one LeakyReLU, followed by a 3×3×n convolution with a stride of 1 and padding of 1, one batch normalization layer and one LeakyReLU; the input image x ∈ X, where X denotes the image input space, and the feature extraction network f = G_f(x; θ_f) converts the input image x into a D-dimensional feature vector.
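As an illustration of one convolution unit described in claim 2, a minimal PyTorch sketch follows. The claim fixes the 1×1 and 3×3 kernels, strides, padding, batch normalization, LeakyReLU and the residual connection; halving the channel count in the 1×1 layer is an assumption here (as in Darknet-style backbones), not something the claim specifies.

```python
import torch
import torch.nn as nn

class ConvUnit(nn.Module):
    """One residual unit: 1x1 conv -> BN -> LeakyReLU, 3x3 conv -> BN -> LeakyReLU, plus skip."""
    def __init__(self, channels: int):
        super().__init__()
        hidden = channels // 2  # channel halving is an assumption, not fixed by the claim
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, stride=1, padding=0, bias=False),
            nn.BatchNorm2d(hidden),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.block(x)  # residual connection keeps the deep network trainable
```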
3. The method according to claim 2, wherein the input of the object detection module is the output features of the shared feature extraction network are 3 different-scale features, multiple scales are used to detect objects with different sizes, and an ignoring parameter is defined on the calculation of loss to indicate that if a prediction box is the largest cross-over ratio of all truth values<When the parameter is ignored, the prediction box is a negative sample; if the center point of the truth value falls in a region, the region is responsible for detecting the object, and the prediction box with the maximum intersection ratio with the object is taken as a positive sample, wherein B target boundary boxes exist in a single grid, and each target boundary box consists of five-dimensional prediction parameters including the center point coordinates (x, y), the width and the height (w, h) of the boundary box and the confidence coefficientScore s i Confidence score s i The following can be calculated:
Figure FDA0003969746250000021
wherein Object represents the Object, pr (Object) represents the probability of the Object existing in the current mesh Object bounding box,
Figure FDA0003969746250000022
represents the Intersection and parallel ratio (IoU) of the predicted value and the true value of the boundary box, shows the accuracy of the position of the target boundary box predicted by the current model,
given a target bounding box prediction value box pred And true box truth Then, then
Figure FDA0003969746250000023
Can be expressed as:
Figure FDA0003969746250000024
class probability Pr (C) of object i I O) represents the posterior probability that the target belongs to a certain class of objects i under the condition that the target exists in the bounding box, and if the target detection task has K types of objects in total, each grid predicts the i-th class of objects C i Has a conditional probability of Pr (C) i |O),i=1,2,…,K,
The final training loss function calculation is divided into 3 parts: (1) the error caused by x, y, w, h, i.e. the error caused by the loss (2) of the detection frame and the error caused by the confidence (3) of the classification, i.e. the loss caused by the classification, is calculated as follows:
Figure FDA0003969746250000025
Figure FDA0003969746250000031
Figure FDA0003969746250000032
loss=lbox+lobj+lcls
for the object detection module, the input image is divided into S × S grids, each grid being responsible for detecting the object in which the center point falls,
the confidence that an object exists in the target bounding box during testing can be expressed as the detection box required for retention by non-maximum suppression:
Figure FDA0003969746250000033
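As an illustration of the test-time computation described in claim 3, the sketch below (PyTorch / torchvision) assumes the network outputs have already been decoded into corner-format boxes, a per-box objectness score that stands in for Pr(Object)·IoU, and per-class conditional probabilities; the function name and thresholds are illustrative assumptions, not part of the claims.

```python
import torch
from torchvision.ops import nms

def detect(pred_boxes: torch.Tensor,   # (N, 4) decoded boxes in (x1, y1, x2, y2) format
           objectness: torch.Tensor,   # (N,)  predicted s_i, standing in for Pr(Object)*IoU
           class_probs: torch.Tensor,  # (N, K) predicted Pr(C_i | Object)
           iou_thresh: float = 0.5,
           score_thresh: float = 0.25):
    # class-specific confidence: Pr(C_i | Object) * Pr(Object) * IoU
    scores, labels = (class_probs * objectness.unsqueeze(1)).max(dim=1)
    keep = scores > score_thresh
    boxes, scores, labels = pred_boxes[keep], scores[keep], labels[keep]
    kept = nms(boxes, scores, iou_thresh)  # non-maximum suppression retains the final boxes
    return boxes[kept], scores[kept], labels[kept]
```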
4. the data cross-modality transfer learning-based target detection and attitude estimation method according to claim 3, wherein the attitude angle estimation module comprises a feature decoupling module and a cross-category center loss module.
5. The data cross-modal transfer learning-based target detection and pose estimation method according to claim 4, wherein the feature decoupling module is implemented using three channel attention blocks.
6. The method of target detection and pose estimation based on data cross-modality transfer learning of claim 4, wherein the center-loss part of the cross-category center loss module at each angular branch is defined as follows:
L_center^j = (1/2) Σ_{i=1}^{m} || z_j^{(i)} − c_{y_i}^{j} ||_2^2

where z_j^{(i)} is the i-th embedded depth feature, c_{y_i}^{j} is the embedding center of the y_i-th class, which is updated during training, m is the size of the mini-batch, and j denotes each angular branch,

the above part reduces the distance between hidden variables belonging to the same discrete angle truth class, ensuring intra-class consistency; however, hidden variables of different angle categories should also be distributed in decoupled subspaces, which the above center loss does not reflect; to alleviate this deficiency, the decoupling part of the cross-category center loss is further defined as follows:
[The decoupling part L_decouple is given in the original publication only as a formula image and is not reproduced here.]

where j, j′, j″ ∈ {yaw, pitch, roll} and j ≠ j′ ≠ j″; the term appearing in it denotes the cross-angle-class correlation distance, and 1 is added to the denominator to prevent numerical overflow,

the proposed CCC loss contains the above two parts and can be written as:

L_CCC = L_center + α · L_decouple

where α is a hyper-parameter used to trade off the two parts of the loss.
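As an illustration of the center-loss part of the CCC loss in claim 6, a minimal PyTorch sketch is given below. Averaging over the mini-batch and the helper names are assumptions; the decoupling term is not reproduced (its exact formula is only available as an image in the original), so it is simply passed in and weighted by α.

```python
import torch

def center_loss_branch(z: torch.Tensor,        # (m, d) embedded features z_j^(i) of one angle branch
                       centers: torch.Tensor,  # (num_classes, d) class centers c^j
                       labels: torch.Tensor):  # (m,) discrete angle-class indices y_i
    diff = z - centers[labels]                  # z_j^(i) - c_{y_i}^j
    return 0.5 * diff.pow(2).sum(dim=1).mean()  # intra-class pull, averaged over the mini-batch

def ccc_loss(l_center: torch.Tensor, l_decouple: torch.Tensor, alpha: float) -> torch.Tensor:
    return l_center + alpha * l_decouple        # alpha trades off the two parts
```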
7. The method for target detection and posture estimation based on data cross-modal transfer learning according to claim 6, wherein the step S7 specifically comprises the following steps:
S71, generating the transfer learning labels: a domain-adversarial transfer learning algorithm is adopted; for most visual tasks the problem can be cast as the network learning the posterior probability P(C, B | I), where I denotes the image feature, B the detection frame and C the object class; an input picture can then be regarded as a sample of the joint probability distribution P(C, B, I) over B, C and I, and by a simple application of Bayes' rule:
P(C,B,I)=P(C,B|I)P(I)
In most transfer learning problems there is a basic assumption that the conditional distribution in the target domain and the source domain is the same; because the simulation data and the real infrared data are very close in overall data distribution, this can be expressed as the two domains sharing the same P(C, B | I) while differing in P(I). The detection head of the detection network should therefore be the same, i.e. the detection behaviour should be consistent on both the target domain and the source domain. Consequently, in order for the whole network to achieve source-domain-level detection performance on the target domain, the difference in P(I) between the source domain and the target domain must be reduced as much as possible, i.e. the feature-extracting backbone must be migrated so as to reduce the gap between the P(I) of the two domains, and a domain classifier is introduced to accomplish this. The overall structure of the network is that the domain classifier and the target task head are connected as two parallel sub-branches after the feature extraction backbone network; the domain classifier is a binary classifier whose classification label is source domain / target domain, and the target task head is trained with the target task loss, the simulation data set serving as the source domain and the real data set as the target domain. The goal of the domain classifier is to minimize the binary classification error, i.e. to distinguish the two domains as well as possible; the main task of the feature extraction network is to extract a feature shared by the target task head and the domain classifier, and this feature has two objectives: minimizing the target task loss function (serving the target task head), and maximizing the binary classification error (confusing the domain classifier). The second objective is achieved by gradient reversal, which essentially multiplies the gradient of the domain classifier by a negative coefficient when it propagates back into the feature extraction backbone network, so that the backbone is optimized towards confusing the domain classifier. Let the source domain data and the target domain data be distributed over the joint input space as S(x) and T(x), respectively.

The domain classifier thus operates on input samples {x_1, x_2, …, x_N} drawn from the source domain and the target domain; d_i is defined as the domain label of the i-th sample, with d_i ∈ {0, 1} indicating which of the two distributions, S(x) or T(x), the sample x_i is drawn from.

Firstly, the input image passes through the feature extraction backbone network: let the feature extraction network be f = G_f(x; θ_f), converting the input image x into a D-dimensional feature vector f ∈ R^D. In the learning stage, the goal is to minimize the label prediction loss on the labeled part of the training set (namely the source domain data set), so the parameters of the feature extraction network and the target task head are optimized to minimize the loss on the source-domain samples; this ensures the discriminativeness of the feature f and the good prediction performance of the feature extractor and the label predictor combined on the source domain. At the same time, the feature f is required to be domain-invariant, i.e. the goal of optimizing the network during training is to make the distributions S(f) = {G_f(x; θ_f) | x ∼ S(x)} and T(f) = {G_f(x; θ_f) | x ∼ T(x)} as similar as possible, which, under the covariate shift assumption, would make the target-domain label prediction accuracy the same as that of the source domain (Shimodaira, 2000). However, considering that f is high-dimensional and that its distribution itself changes as learning proceeds, measuring the dissimilarity of the distributions S(f) and T(f) is difficult; one way to estimate the discrepancy is to look at the loss of the domain classifier, provided that the parameters of the domain classifier have been trained to distinguish the two feature distributions as well as possible. Therefore, during training, in order to obtain domain-invariant features we seek the parameters θ_f of the feature mapping that maximize the loss of the domain classifier (by making the two feature distributions as similar as possible), while simultaneously seeking the parameters θ_d that minimize the loss of the domain classifier; furthermore, we also seek to minimize the loss of the target task head, so the task can be formulated as:
E(θ_f, θ_y, θ_d) = Σ_{i ∈ source domain} L_y^i(θ_f, θ_y) − λ Σ_{i=1}^{N} L_d^i(θ_f, θ_d)

wherein L_y is the target task loss function, L_d is the domain classifier loss function, and L_y^i and L_d^i denote the respective losses evaluated on the i-th sample; based on the above, the parameters sought are the saddle point:

(θ̂_f, θ̂_y) = argmin_{θ_f, θ_y} E(θ_f, θ_y, θ̂_d)

θ̂_d = argmax_{θ_d} E(θ̂_f, θ̂_y, θ_d)

The forward computation and the error back-propagation that minimize the target task loss are the same as in conventional deep learning; the maximization of the domain classification loss is handled by introducing a gradient reversal module: the gradient obtained by ordinary back-propagation inside the domain classifier is multiplied by a negative coefficient −λ before it enters the feature extraction network, so that the feature backbone is optimized in the opposite direction; that is, the network parameters are updated as follows:
θ_f ← θ_f − μ ( ∂L_y^i/∂θ_f − λ ∂L_d^i/∂θ_f )

θ_y ← θ_y − μ ∂L_y^i/∂θ_y

θ_d ← θ_d − μ ∂L_d^i/∂θ_d

where μ denotes the learning rate. Based on this, the data annotation needs to include a domain label in addition to the target task's own labels; since there are 2 data domains, namely real infrared data and simulation data, 0 is used to represent the real data domain and 1 to represent the simulation data domain, and the data format of each input image is [input image (width, height, c), target task label (taking target detection as an example, (x, y, w, h)), domain label (0 or 1)];
and S72, network training is carried out.
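The gradient reversal described in step S71 can be realized, for instance, with a small autograd function; the sketch below is a common PyTorch formulation (forward identity, backward multiplies the gradient by −λ) and is an illustration under that assumption, not the patent's specific implementation. The names GradReverse and grad_reverse are hypothetical.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None  # reversed, scaled gradient for x; none for lambd

def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lambd)

# Usage sketch: the domain classifier receives reversed gradients, so minimising its
# binary classification loss pushes the feature backbone towards domain-invariant features.
# features = backbone(images)
# domain_logits = domain_classifier(grad_reverse(features, lambd=1.0))
```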
8. The method for target detection and pose estimation based on data cross-modal transfer learning according to claim 7, wherein the step S72 specifically comprises the following steps:
S721, the adversarial learning module is not enabled in the initial training: only the feature extraction network and the target task head are enabled and the input data consist of simulation data only, so that the network performs the target task well on the simulation data set; meanwhile, part of the simulation data is randomly split off as a test data set, and training continues until the model's metrics on this test data set no longer improve;
S722, domain adversarial training is performed: in terms of data input, each batch fed during training is changed to a mixture of simulation data and real data, and the input labels comprise target task labels and domain labels; training then continues on the basis of the model obtained in step S721, with the feature extraction network, the target task head and the domain classifier enabled and a smaller learning rate (for example, 1e-4 for the target detection task), for 5-10 iterations;
and S723, Finetune training is performed on the real infrared training data set without enabling the domain classifier module.
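As a compact illustration of the three training stages S721-S723, the sketch below simply strings them together; the stage functions are passed in as callables because the claims leave their internals (optimizer, exact epoch counts beyond the 5-10 adversarial rounds) open, so everything inside them is an assumption.

```python
from typing import Callable, Iterable

def train_three_stages(
    pretrain_on_sim: Callable[[Iterable], None],    # S721: simulation data only, no adversarial module
    adversarial_train: Callable[[Iterable], None],  # S722: mixed sim + real batches, domain classifier on
    finetune_on_real: Callable[[Iterable], None],   # S723: real infrared data, domain classifier off
    sim_loader: Iterable,
    mixed_loader: Iterable,
    real_loader: Iterable,
) -> None:
    pretrain_on_sim(sim_loader)      # train until metrics on the held-out simulation split stop improving
    adversarial_train(mixed_loader)  # 5-10 epochs at a small learning rate (e.g. 1e-4)
    finetune_on_real(real_loader)    # further improves the network on the real data set
```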
CN202211512260.8A 2022-11-29 2022-11-29 Target detection and attitude estimation method based on data cross-modal transfer learning Pending CN115731441A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211512260.8A CN115731441A (en) 2022-11-29 2022-11-29 Target detection and attitude estimation method based on data cross-modal transfer learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211512260.8A CN115731441A (en) 2022-11-29 2022-11-29 Target detection and attitude estimation method based on data cross-modal transfer learning

Publications (1)

Publication Number Publication Date
CN115731441A true CN115731441A (en) 2023-03-03

Family

ID=85299000

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211512260.8A Pending CN115731441A (en) 2022-11-29 2022-11-29 Target detection and attitude estimation method based on data cross-modal transfer learning

Country Status (1)

Country Link
CN (1) CN115731441A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116312860A (en) * 2023-03-24 2023-06-23 江南大学 Agricultural product soluble solid matter prediction method based on supervised transfer learning
CN116312860B (en) * 2023-03-24 2023-09-12 江南大学 Agricultural product soluble solid matter prediction method based on supervised transfer learning
CN116451757A (en) * 2023-06-19 2023-07-18 山东浪潮科学研究院有限公司 Heterogeneous acceleration method, heterogeneous acceleration device, heterogeneous acceleration equipment and heterogeneous acceleration medium for neural network model
CN116451757B (en) * 2023-06-19 2023-09-08 山东浪潮科学研究院有限公司 Heterogeneous acceleration method, heterogeneous acceleration device, heterogeneous acceleration equipment and heterogeneous acceleration medium for neural network model
CN116541549A (en) * 2023-07-06 2023-08-04 北京邮电大学 Subgraph segmentation method, subgraph segmentation device, electronic equipment and computer readable storage medium
CN116541549B (en) * 2023-07-06 2023-09-19 北京邮电大学 Subgraph segmentation method, subgraph segmentation device, electronic equipment and computer readable storage medium
CN116882486A (en) * 2023-09-05 2023-10-13 浙江大华技术股份有限公司 Method, device and equipment for constructing migration learning weight
CN116882486B (en) * 2023-09-05 2023-11-14 浙江大华技术股份有限公司 Method, device and equipment for constructing migration learning weight
CN116912675A (en) * 2023-09-13 2023-10-20 吉林大学 Underwater target detection method and system based on feature migration
CN116912675B (en) * 2023-09-13 2023-11-28 吉林大学 Underwater target detection method and system based on feature migration
CN117594192A (en) * 2024-01-15 2024-02-23 广东工业大学 Outdoor fitness equipment service system combined with sports prescriptions
CN117594192B (en) * 2024-01-15 2024-04-30 广东工业大学 Outdoor fitness equipment service system combined with sports prescriptions
CN117938957A (en) * 2024-03-22 2024-04-26 精为技术(天津)有限公司 Edge cache optimization method based on federal deep learning

Similar Documents

Publication Publication Date Title
CN115731441A (en) Target detection and attitude estimation method based on data cross-modal transfer learning
CN111489358B (en) Three-dimensional point cloud semantic segmentation method based on deep learning
KR102224253B1 (en) Teacher-student framework for light weighted ensemble classifier combined with deep network and random forest and the classification method based on thereof
CN111612008B (en) Image segmentation method based on convolution network
CN112307995B (en) Semi-supervised pedestrian re-identification method based on feature decoupling learning
CN113763442A (en) Deformable medical image registration method and system
CN107067410B (en) Manifold regularization related filtering target tracking method based on augmented samples
CN110969086A (en) Handwritten image recognition method based on multi-scale CNN (CNN) features and quantum flora optimization KELM
CN110598022B (en) Image retrieval system and method based on robust deep hash network
CN115082293A (en) Image registration method based on Swin transducer and CNN double-branch coupling
CN111915644A (en) Real-time target tracking method of twin guiding anchor frame RPN network
CN109784155B (en) Visual target tracking method based on verification and error correction mechanism and intelligent robot
CN111524140A (en) Medical image semantic segmentation method based on CNN and random forest method
CN114973226A (en) Training method for text recognition system in natural scene of self-supervision contrast learning
Wang et al. A novel multiface recognition method with short training time and lightweight based on ABASNet and H-softmax
Zhu et al. Local information fusion network for 3D shape classification and retrieval
CN114492634A (en) Fine-grained equipment image classification and identification method and system
CN114693923A (en) Three-dimensional point cloud semantic segmentation method based on context and attention
CN114492581A (en) Method for classifying small sample pictures based on transfer learning and attention mechanism element learning application
CN116977712B (en) Knowledge distillation-based road scene segmentation method, system, equipment and medium
CN113642499B (en) Human body behavior recognition method based on computer vision
CN114863132A (en) Method, system, equipment and storage medium for modeling and capturing image spatial domain information
CN115100694A (en) Fingerprint quick retrieval method based on self-supervision neural network
Jun et al. Two-view correspondence learning via complex information extraction
CN113344110A (en) Fuzzy image classification method based on super-resolution reconstruction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination