CN110992378B - Dynamic updating vision tracking aerial photographing method and system based on rotor flying robot - Google Patents


Info

Publication number
CN110992378B
CN110992378B (application CN201911220924.1A)
Authority
CN
China
Prior art keywords
target
frame
image
convolution
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911220924.1A
Other languages
Chinese (zh)
Other versions
CN110992378A (en)
Inventor
谭建豪
谭姗姗
殷旺
刘力铭
王耀南
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN201911220924.1A priority Critical patent/CN110992378B/en
Publication of CN110992378A publication Critical patent/CN110992378A/en
Application granted granted Critical
Publication of CN110992378B publication Critical patent/CN110992378B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20021Dividing image into blocks, subimages or windows
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention belongs to the technical field of unmanned aerial vehicles and discloses a dynamically updated visual tracking aerial photographing method and system based on a rotor flying robot. HOG features combined with an SVM are used to detect the target in the picture; the AlexNet network structure is then improved by considering three important influencing factors of the twin network (receptive field size, total network stride, and feature padding), a smoothing matrix and a background suppression matrix are added, and the features of the previous frames are used effectively. Multi-layer features are fused elementwise to learn target appearance change and background suppression online, and training is performed on continuous video sequences. The invention balances precision and real-time tracking with the dynamic twin network, learns target appearance changes quickly through the dynamically updated network, makes full use of the spatio-temporal information of the target, and effectively alleviates problems such as drift and target occlusion. A deeper network is selected to acquire target features, and appearance learning and background suppression are used for dynamic tracking, which effectively increases robustness.

Description

Dynamic updating vision tracking aerial photographing method and system based on rotor flying robot
Technical Field
The invention belongs to the technical field of unmanned aerial vehicles, and in particular relates to a dynamically updated visual tracking aerial photographing method and system based on a rotor flying robot.
Background
Currently, the closest prior art: an unmanned aerial vehicle (Unmanned Aerial Vehicle, UAV) is an unmanned aircraft operated by a radio remote control device or by programmed control means, capable of autonomously completing flight tasks without human intervention. In the military field, because of its small size, strong maneuverability, ease of control and similar characteristics, the rotor flying robot can operate in extreme environments and is widely applied to anti-terrorism and explosion prevention, traffic monitoring, and earthquake relief. In the civil field, unmanned aerial vehicles can be used for high-altitude photography, pedestrian detection and other applications. When performing a specific task, a rotor flying robot typically needs to track a specific target in flight and transmit information about the target to a ground station in real time. Accordingly, vision-based tracking flight of rotor flying robots is gaining widespread attention and is a current research focus.
The tracking flight of a rotor flying robot means that a camera is carried on a rotor flying robot flying at low altitude, an image frame sequence of a ground moving target is obtained in real time, and the image coordinates of the target are calculated and used as the input of visual servo control to obtain the speed required by the aircraft; the position and attitude of the rotor flying robot are then controlled automatically so that the tracked ground moving target is kept near the center of the camera's field of view. The traditional twin network tracking method has good real-time performance, but when the target is lost due to occlusion and the influence of a complex background or illumination is added, taking the first frame as the standard reference still leads to situations in which the target cannot be tracked correctly. The present invention is aimed at situations in which the target is lost during aerial photographing by the rotor flying robot due to occlusion, target appearance change, tracker drift, background interference and the like.
In summary, the problems of the prior art are: (1) During aerial photographing, the existing rotor flying robot is prone to drift, target loss and similar problems due to occlusion, illumination, background interference and the like.
(2) In the prior art, trackers basically extract features using an AlexNet network; deeper features of the target can be extracted with the deeper CIResNet network, so that the tracker locks onto the target in the search area and the influence of complex backgrounds is reduced.
(3) Although the existing twin network tracker operates at a high frame rate, there is no updating component in its framework, which means that the tracker cannot quickly cope with severe changes of the target or background, and this may cause tracking drift in some cases.
The difficulty of solving the technical problems is as follows: the method of identifying the location of the target in the search area using color features and contour features may fail when the appearance of the target changes drastically during tracking.
The operation time is increased if every frame is re-detected or a threshold is used to determine whether the tracking is lost during the tracking process.
More feature information can be obtained by using the CIResNet network for feature extraction, but the tracker frame rate is slightly reduced because the CIResNet network is deeper than the AlexNet network.
Meaning of solving the technical problems: the tracking precision can be improved by using deeper network extraction features, and the overall performance of the tracker can be improved.
The dynamic updating part increases the robustness of the tracker, and the tracker does not learn the characteristic information of the first frame any more, but continuously learns the tracking result of the previous frame, so that the tracker adapts to the change of the target.
The CIResNet network can effectively extract more sample features, the tracker can learn more feature information of the target, and the capability of adapting to complex backgrounds is improved.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a dynamic updating visual tracking aerial photographing method and system based on a rotor flying robot.
The invention discloses a dynamic updating visual tracking aerial photographing method based on a rotor flying robot, which comprises the following steps of:
firstly, performing target detection on an input image by using an HOG feature extraction algorithm and a Support Vector Machine (SVM) algorithm;
and step two, transmitting target frame information obtained by target detection to a visual tracking part, and tracking the target in real time by adopting a dynamic update twin network based on a CIResNet network.
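As an overview, the two steps combine into a detect-then-track loop; the Python sketch below is an illustrative skeleton only, with the detector and tracker reduced to placeholders (their details are elaborated in the sections that follow) and a hypothetical video file name; it is not the patent's code.

```python
import cv2

def detect_with_hog_svm(frame):
    """Step-one placeholder: HOG + SVM detection. Here it simply returns a fixed
    box (x, y, w, h); see the HOG/SVM sketch further below for a real detector."""
    h, w = frame.shape[:2]
    return (w // 2 - 32, h // 2 - 64, 64, 128)

class DynamicSiameseTracker:
    """Step-two placeholder for the CIResNet-based dynamically updated twin-network
    tracker; update() just echoes the last box instead of really tracking."""
    def __init__(self, frame, box):
        self.box = box
    def update(self, frame):
        return self.box

def run_pipeline(video_path):
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    if not ok:
        return
    box = detect_with_hog_svm(frame)                 # step one: detect the target
    tracker = DynamicSiameseTracker(frame, box)      # step two: initialise the tracker
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        box = tracker.update(frame)                  # real-time tracking of the target
        # box would be passed on as the input of the visual servo control
    cap.release()

# run_pipeline("aerial_sequence.mp4")   # hypothetical file name
```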
Further, in the first step, the target detection method includes:
(1) Dividing the image into a plurality of connected areas which are 8×8 pixel cell units;
(2) Collecting the gradient amplitude and gradient direction of each pixel point in a cell unit, dividing the gradient direction range [-90°, 90°] into 9 intervals (bins) on average, and using the gradient amplitude as a weight;
(3) Carrying out histogram statistics on the gradient amplitude of each pixel in the unit in each direction bin interval to obtain a one-dimensional gradient direction histogram;
(4) Performing contrast normalization on the histogram on the space block;
(5) Extracting HOG descriptors through a detection window, and combining HOG descriptors of all blocks in the detection window to form a final feature vector;
(6) Inputting the feature vector into a linear SVM, and performing target detection by using an SVM classifier;
(7) Dividing the detection window into overlapped blocks, calculating HOG descriptors for the blocks, and putting the formed feature vectors into a linear SVM to perform target/non-target classification;
(8) Scanning all positions and scales of the whole image by a detection window, and performing non-maximum suppression on an output pyramid to detect a target;
The method for performing contrast normalization on the histogram in step (4) is as follows:
the density of each histogram in the block is first calculated, and then each cell unit in the block is normalized according to this density.
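A minimal sketch of this detection flow is given below; it is illustrative only and relies on OpenCV's built-in HOGDescriptor (8×8 cells and 9 orientation bins by default), with the library's pre-trained people detector standing in for a task-specific linear SVM; the window stride, scale factor and NMS threshold are assumptions chosen for the example.

```python
import cv2
import numpy as np

# Illustrative HOG + linear-SVM detector; a task-specific SVM would normally be
# trained on samples of the target class instead of the default people detector.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def detect_target(frame, score_thr=0.0, nms_thr=0.5):
    """Scan all positions/scales of the frame and return (x, y, w, h) boxes
    surviving non-maximum suppression over the detection pyramid."""
    rects, weights = hog.detectMultiScale(frame, winStride=(8, 8),
                                          padding=(8, 8), scale=1.05)
    if len(rects) == 0:
        return []
    boxes = [[int(x), int(y), int(w), int(h)] for (x, y, w, h) in rects]
    scores = [float(s) for s in np.asarray(weights).ravel()]
    keep = cv2.dnn.NMSBoxes(boxes, scores, score_thr, nms_thr)
    return [boxes[i] for i in np.asarray(keep).ravel()]

# Example usage:
# frame = cv2.imread("frame.jpg")
# print(detect_target(frame))
```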
Further, in the first step, the HOG feature extraction method specifically includes:
(1) normalizing the whole image, and normalizing the color space of the input image by adopting a Gamma correction method; the Gamma correction formula is as follows:
f(I) = I^γ

wherein I is the image pixel value and γ is the Gamma correction coefficient;
(2) calculating the gradients in the horizontal and vertical coordinate directions of the image, and calculating the gradient direction value of each pixel position from these gradients; the derivative operation captures contours and some texture information and further weakens the influence of illumination;

Gx(x,y) = H(x+1,y) - H(x-1,y);

Gy(x,y) = H(x,y+1) - H(x,y-1);

wherein Gx(x,y) and Gy(x,y) respectively represent the horizontal gradient and the vertical gradient at pixel point (x,y) in the input image;

G(x,y) = √(Gx(x,y)² + Gy(x,y)²);

α(x,y) = arctan(Gy(x,y)/Gx(x,y));

wherein G(x,y), H(x,y) and α(x,y) respectively represent the gradient amplitude, the pixel value and the gradient direction at pixel point (x,y);
(3) histogram calculation: dividing the image into small cell units to provide an encoding for the local image region;
(4) combining the cell units into a large block, normalizing the gradient histogram within the block;
(5) and collecting HOG characteristics of all overlapped blocks in the detection window, and combining the HOG characteristics into a final characteristic vector for classification.
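The extraction steps can be prototyped directly in NumPy; the sketch below is an illustration (not the patent's implementation) that follows f(I) = I^γ and the centred differences Gx, Gy above, with an assumed γ = 0.5, 8×8-pixel cells, 9 bins and 2×2-cell blocks.

```python
import numpy as np

def hog_cells(img, gamma=0.5, cell=8, bins=9):
    """Gamma-correct a grayscale image, compute gradients and per-cell orientation
    histograms weighted by gradient amplitude (steps (1)-(3))."""
    img = (img.astype(np.float64) / 255.0) ** gamma        # f(I) = I^gamma
    gx = np.zeros_like(img); gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]                 # H(x+1,y) - H(x-1,y)
    gy[1:-1, :] = img[2:, :] - img[:-2, :]                 # H(x,y+1) - H(x,y-1)
    mag = np.hypot(gx, gy)                                 # gradient amplitude G(x,y)
    ang = np.rad2deg(np.arctan(gy / (gx + 1e-12)))         # direction alpha in [-90°, 90°]
    bin_idx = np.minimum(((ang + 90.0) / (180.0 / bins)).astype(int), bins - 1)
    ch, cw = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((ch, cw, bins))
    for i in range(ch):
        for j in range(cw):
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            b = bin_idx[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            hist[i, j] = np.bincount(b.ravel(), weights=m.ravel(), minlength=bins)
    return hist

def block_normalize(hist, block=2, eps=1e-5):
    """L2-normalize overlapping 2x2-cell blocks and concatenate them into the
    final feature vector (steps (4)-(5))."""
    ch, cw, _ = hist.shape
    feats = []
    for i in range(ch - block + 1):
        for j in range(cw - block + 1):
            v = hist[i:i+block, j:j+block].ravel()
            feats.append(v / np.sqrt(np.sum(v ** 2) + eps ** 2))
    return np.concatenate(feats)

# Example usage with a random 64x128 "image":
# feat = block_normalize(hog_cells(np.random.randint(0, 256, (128, 64))))
```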
Further, the step two of tracking the target in real time includes:
(1) Acquiring the first frame of the video sequence as the template frame O_1, acquiring the search region Z_t from the current frame, and obtaining f^l(O_1) and f^l(Z_t) respectively through the CIResNet-16 network;
(2) The network adds a transformation matrix V and a transformation matrix W, both of which can be computed rapidly in the frequency domain by FFT. The transformation matrix V is obtained from the tracking result of frame t-1 and the target of the first frame; it acts on the convolution features of the target template and learns the change of the target, so that the template convolution features at time t are approximately equal to those at time t-1 and the change of the current frame relative to the previous frames is smoothed;
the transformation matrix W is obtained from the tracking result of frame t-1 and acts on the convolution features of the candidate region at time t; it learns background suppression to eliminate the influence caused by irrelevant background features in the target region;
For the transformation matrix V and the transformation matrix W, training is performed with regularized linear regression; f^l(O_1) and f^l(Z_t) are transformed through the matrices into V^l_{t-1} ⊛ f^l(O_1) and W^l_{t-1} ⊛ f^l(Z_t) respectively, where "⊛" denotes the circular convolution operation, V^l_{t-1} ⊛ f^l(O_1) represents the change of the target appearance and gives the currently updated target template, and W^l_{t-1} ⊛ f^l(Z_t) represents the background suppression transformation and gives a search template better suited to the current frame; the final model is as follows:

S^l_t = corr( V^l_{t-1} ⊛ f^l(O_1), W^l_{t-1} ⊛ f^l(Z_t) )

On the basis of the twin network, the final model adds the smoothing matrix V and the background suppression matrix W: the smoothing matrix V learns the appearance change of the previous frames, and the background suppression matrix W eliminates clutter factors in the background.
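To make the structure of this model concrete, the following single-channel NumPy sketch evaluates S^l_t = corr(V ⊛ f^l(O_1), W ⊛ f^l(Z_t)) with FFT-based circular convolution; it is an illustration only, with stand-in feature maps of assumed sizes and identity transforms in place of the learned V and W, not the patent's implementation.

```python
import numpy as np

def circ_conv2(v, f):
    """2-D circular convolution v (*) f, computed via the FFT."""
    return np.real(np.fft.ifft2(np.fft.fft2(v) * np.fft.fft2(f)))

def corr(template, search):
    """Dense correlation of the transformed template over the transformed search
    features; plays the role of corr(., .) in the model above."""
    th, tw = template.shape
    out = np.zeros((search.shape[0] - th + 1, search.shape[1] - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(template * search[i:i + th, j:j + tw])
    return out

rng = np.random.default_rng(0)
f_O1 = rng.standard_normal((6, 6))        # template features f^l(O_1), assumed size
f_Zt = rng.standard_normal((22, 22))      # search-region features f^l(Z_t), assumed size
V = np.zeros((6, 6));  V[0, 0] = 1.0      # identity placeholders for V^l_{t-1}
W = np.zeros((22, 22)); W[0, 0] = 1.0     # identity placeholder for W^l_{t-1}

# S^l_t = corr( V^l_{t-1} (*) f^l(O_1),  W^l_{t-1} (*) f^l(Z_t) )
S = corr(circ_conv2(V, f_O1), circ_conv2(W, f_Zt))
print(S.shape)   # (17, 17) response map; its maximum indicates the target position
```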
Further, in the second step, the CIResNet-based dynamically updated twin network includes:
(I) a 7×7 convolution and a cropping operation, which delete the features affected by padding;
(II) after a max-pooling layer with stride 2, the improved CIResNet (CIR) unit is entered; the network of the CIR unit stage has 3 layers in total, where the first layer is a 1×1 convolution with 64 channels, the second layer is a 3×3 convolution with 64 channels, and the third layer is a 1×1 convolution with 256 channels; the feature maps are added after the convolution layers and then enter a crop operation, which removes the features affected by the padding of 1 in the 3×3 convolution;
(III) a CIR-D unit is then entered; the network of the CIR-D unit stage has 12 layers in total, with the first, second and third layers used as a unit block repeated 4 times; the first layer is a 1×1 convolution with 128 channels, the second layer is a 3×3 convolution with 128 channels, and the third layer is a 1×1 convolution with 512 channels;
and (IV) cross-correlation operation: the improved twin network structure takes an image pair as input, comprising an exemplar image Z and a candidate search image X; image Z represents the object of interest, while X represents the search region in a subsequent video frame and is typically larger; both inputs are processed by a ConvNet with parameters θ; two feature maps are generated, and their cross-correlation is:

f(Z, X) = φ_θ(Z) ⋆ φ_θ(X) + b

where b denotes a bias term; the formula searches the image X with Z as the template so that the maximum value in the response map f matches the target position; the network is trained offline on random image pairs (Z, X) obtained from the training videos together with the corresponding ground-truth labels y, and the parameters θ of the ConvNet are obtained by minimizing the following loss over the training set:

θ* = arg min_θ E_{(Z,X,y)} L(y, f(Z, X; θ))

The basic formula of the loss function is:

l(y, v) = log(1 + exp(-yv));

wherein y ∈ {+1, -1} represents the true label and v represents the actual score at a position of the sample search image; from the sigmoid function, the probability of a positive sample is 1/(1 + e^(-v)) and the probability of a negative sample is 1/(1 + e^(v)); the following is readily derived from the formula of cross entropy:

l(y, v) = -[ (1+y)/2 · log(1/(1+e^(-v))) + (1-y)/2 · log(1/(1+e^(v))) ] = log(1 + exp(-yv))
further, in the step (iii), the first block of the CIR-D unit stage is downsampled by the proposed CIR-D unit, and the number of filters is doubled after downsampling the feature map size; CIR-D changes the convolution steps in the bottleneck layer and the shortcut connection layer from 2 to 1, and inserts cutting again after the adding operation so as to delete the characteristics affected by filling; finally, performing spatial downsampling of the feature map with maximum pooling; the spatial size of the output feature map is 7 x 7, each feature receiving information from an area on the input image plane of size 77 x 77 pixels; adding the characteristic diagram after passing through the convolution layer, and then entering a loop operation and a maximum pooling layer; the key idea of these modifications is to ensure that only functions affected by padding are deleted, while the inherent block structure remains unchanged.
Further, in the second step, a CIResNet-based dynamically updated twin network is adopted to track the target in real time; the dynamic update algorithm comprises:
(1) Inputting a picture to obtain a template image O1;
(2) Determining a candidate frame searching region Zt in a frame to be tracked;
(3) Mapping the original images to a specific feature space through feature mapping, obtaining the two depth features f^l(O_1) and f^l(Z_t) respectively;
(4) Learning the change between the previous-frame tracking result and the first-frame template frame according to the RLR:

min_{V^l_{t-1}} || V^l_{t-1} ⊛ f^l_1 - f^l_{t-1} ||² + λ_v || V^l_{t-1} ||²

Fast computation in the frequency domain gives:

V̂^l_{t-1} = ( conj(f̂^l_1) ⊙ f̂^l_{t-1} ) / ( conj(f̂^l_1) ⊙ f̂^l_1 + λ_v )

thereby obtaining the variation V^l_{t-1}; in this context, f^l_1 = f^l(O_1) and f^l_{t-1} = f^l(O_{t-1}), wherein O represents the target, f represents the feature matrix, the superscript denotes the channel (layer) index and the subscript denotes the frame index, i.e. the features of the previous-frame tracking result and of the first-frame target are used; hats denote Fourier transforms, ⊙ denotes elementwise multiplication, and conj(·) denotes the complex conjugate;
(5) Obtaining the suppression quantity Ŵ^l_{t-1} of the current-frame background according to the RLR calculation formula in the frequency domain:

Ŵ^l_{t-1} = ( conj(f̂^l(Z_{t-1})) ⊙ f̂^l(Ḡ_{t-1}) ) / ( conj(f̂^l(Z_{t-1})) ⊙ f̂^l(Z_{t-1}) + λ_w )

wherein G_{t-1} is a map of the same size as the search region of the previous frame, and Ḡ_{t-1} is G_{t-1} multiplied by a Gaussian smoothing centred on the picture centre; the target variation V^l_{t-1} and the background suppression transformation W^l_{t-1} are thus learned online;
(6) Elementwise multi-layer feature fusion:

S_t = Σ_l w^l ⊙ S^l_t

where w^l is an elementwise weight map for the response of layer l;
(7) Joint training: forward propagation is first performed for a given N-frame video sequence {I_t | t=1,…,N}, which is tracked to obtain N response maps {S_t | t=1,…,N}, with {R_t | t=1,…,N} denoting the N target boxes; the per-frame loss is

L_t = (1/|D|) Σ_{u∈D} l( J_t[u], S_t[u] ),   l(y, v) = log(1 + exp(-yv))

where J_t ∈ {+1, -1} is the label map generated from the target box R_t over the response-map positions D;
(8) Gradient propagation and parameter updating are performed using BPTT and SGD to obtain all the gradients of L_t; starting from ∂L_t/∂S^l_t, the gradients with respect to the transformed template feature V^l_{t-1} ⊛ f^l(O_1) and the transformed search feature W^l_{t-1} ⊛ f^l(Z_t) are calculated, and the loss gradient is then propagated efficiently to f^l through the CirConv and RLR layers on the left; the derivations are carried out in the frequency domain, wherein f̂ = E f represents f after the Fourier transform and E is the discrete Fourier transform matrix; for the multi-feature fusion formula, the gradient is converted correspondingly into the per-layer form ∂L_t/∂S^l_t = w^l ⊙ ∂L_t/∂S_t.
The invention further aims to provide a dynamic updating visual tracking aerial photographing system based on the rotor flying robot, which implements the dynamic updating visual tracking aerial photographing method based on the rotor flying robot.
The invention further aims to provide an information data processing terminal for realizing the dynamic updating visual tracking aerial photographing method based on the rotor flying robot.
It is another object of the present invention to provide a computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the dynamically updated visual tracking aerial photographing method based on a rotor flying robot.
In summary, the advantages and positive effects of the invention are: (1) By adopting the deeper CIResNet network, a classification standard is established automatically through sample learning, adaptability to complex backgrounds is enhanced, and more sample features are extracted effectively.
(2) The invention adds a smoothing transformation matrix V to the traditional twin network, which can learn the target appearance change of the previous frames online and effectively use spatio-temporal information; a background suppression matrix W is also added, which effectively controls the influence of background clutter factors.
(3) Instead of taking a single first frame as the standard reference, appearance learning and background suppression are used for dynamic tracking, which can effectively deal with occlusion and similar problems.
(4) Both the accuracy and the overlap rate are increased, and the speed reaches 16 fps, which basically meets the real-time requirement.
Table 1: tracking the comparison of various indexes
Tracking device Accuracy of Overlap ratio Speed (fps)
Ours 0.5512 0.2905 16.
SiamFC 0.5355 0.2889 65
DSiam 0.5414 0.2804 25
DSST 0.5078 0.1678 134
The algorithm was implemented and debugged under the Ubuntu 16.04 operating system; the computer hardware was configured with an Intel Core i7-8700K CPU (3.7 GHz main frequency) and a GeForce RTX 2080 Ti graphics card.
According to the dynamically updated visual tracking aerial photographing method based on the rotor flying robot, the CIResNet network is used to replace the original AlexNet network; the network hierarchy is deeper in comparison, which is beneficial for acquiring the features of the target. Compared with the traditional twin network, the method adds the smoothing transformation matrix V to learn the target appearance changes of the previous frames online and effectively use spatio-temporal information, and adds the background suppression matrix W to effectively control the influence of background clutter factors. The proposed method does not take the first frame alone as the standard reference; instead, it selects a deeper network to acquire target features and uses appearance learning and background suppression for dynamic tracking, which effectively increases robustness.
Drawings
Fig. 1 is a flowchart of a dynamic update visual tracking aerial method based on a rotorcraft robot according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a dynamic update visual tracking aerial method based on a rotorcraft robot according to an embodiment of the present invention.
Fig. 3 is a frame diagram of a detection section provided in an embodiment of the present invention.
Fig. 4 is a frame diagram of a tracking section provided by an embodiment of the present invention.
Fig. 5 is a schematic diagram of a basic description of a ciranet network according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of a single-layer network structure according to an embodiment of the present invention.
Fig. 7 is a graph of results on a UAV dataset provided by an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The tracking flight of the rotor flying robot refers to that a camera is carried on the rotor flying robot flying in low altitude, an image frame sequence of a ground moving target is obtained in real time, the image coordinates of the target are calculated and used as the input of visual servo control, the speed required by the aircraft is obtained, and then the position and the gesture of the rotor flying robot are automatically controlled, so that the tracked ground moving target is maintained near the center of the visual field of the camera. The traditional twin network tracking method has good real-time performance, but when the influence of complex background or illumination is added after the target is lost due to target shielding, the situation that the target cannot be tracked correctly still occurs by taking the first frame as a standard reference.
Aiming at the problems in the prior art, the invention provides a dynamically updated visual tracking aerial photographing method based on a rotor flying robot, which uses a CIResNet network to replace the original AlexNet network; compared with AlexNet, the CIResNet network has a deeper hierarchy, which is beneficial for acquiring the features of the target. Compared with the traditional twin network, the method adds the smoothing transformation matrix V to learn the target appearance changes of the previous frames online and effectively use spatio-temporal information, and adds the background suppression matrix W to effectively control the influence of background clutter factors. The proposed method does not take the first frame alone as the standard reference; instead, it selects a deeper network to acquire target features and uses appearance learning and background suppression for dynamic tracking, which effectively increases robustness. The present invention will be described in detail with reference to the accompanying drawings.
As shown in fig. 1, the method for dynamically updating vision tracking aerial photography based on the rotor flying robot provided by the embodiment of the invention comprises the following steps:
s101: and utilizing HOG (Histogram of Oriented Gradient) characteristics and a Support Vector Machine (SVM) algorithm to perform target detection on the input image.
Even if the gradient and edge position information corresponding to the object in the image is unknown, its appearance and shape can still be described by the distribution of local gradients or edge directions. The HOG feature builds its feature description by calculating and counting gradient direction histograms of the target area, and in principle it maintains good invariance to geometric changes and optical deformation of the image.
Firstly, the image is divided into a number of connected regions, i.e. cells of 8×8 pixels, called cell units; the gradient amplitude and direction of each pixel point in a cell unit are then collected, the gradient direction range [-90°, 90°] is divided into 9 intervals (bins) on average, and histogram statistics of the gradient amplitude of each pixel in the cell are computed for each direction bin to obtain a one-dimensional gradient direction histogram. In order to improve the invariance of the features to illumination and shadow, the histograms need to be contrast-normalized, typically over a larger range. The density of each histogram in the block is first calculated, and then each cell unit in the block is normalized according to this density; the normalized block descriptor is called the HOG descriptor.
Combining HOG descriptors of all blocks in the detection window to form a final feature vector, and then using an SVM classifier to perform target detection. FIG. 3 depicts a feature extraction and object detection flow, where the detection window is divided into overlapping blocks, HOG descriptors are computed for these blocks, and the resulting feature vectors are placed in a linear SVM for object/non-object classification. The detection window scans all positions and scales of the whole image, and performs non-maximum suppression on the output pyramid to detect the target.
S102: and transmitting the target frame information obtained by target detection to a visual tracking part, and tracking the target in real time by adopting a CIResNet-based dynamic update twin network, wherein a tracking framework is shown in fig. 4.
The first frame is acquired from the video sequence as the template frame O_1, the search region Z_t is acquired from the current frame, and f^l(O_1) and f^l(Z_t) are obtained respectively through the CIResNet-16 network.
The final result of a conventional twin network is represented as follows:

S^l_t = corr( f^l(O_1), f^l(Z_t) )    (1)

The result of this formula is a similarity, where corr denotes correlation filtering and can be replaced by other metric functions, t represents time, and l represents the l-th layer.
Unlike conventional Siamese networks, the proposed network adds two transformation matrices. The first transformation matrix V acts on the convolution features of the target template so that the template convolution features at time t are approximately equal to those at time t-1; this matrix is learned from frame t-1 and can be regarded as a smooth deformation of the target. The second transformation matrix W acts on the convolution features of the candidate region at time t in order to emphasize the target region and eliminate irrelevant background features.
For the transformation matrices V and W, the invention trains with regularized linear regression; f^l(O_1) and f^l(Z_t) are transformed through the matrices into V^l_{t-1} ⊛ f^l(O_1) and W^l_{t-1} ⊛ f^l(Z_t) respectively, where "⊛" denotes the circular convolution operation, V^l_{t-1} ⊛ f^l(O_1) represents the change of the target appearance, and W^l_{t-1} ⊛ f^l(Z_t) represents the background suppression transformation. The final model is as follows:

S^l_t = corr( V^l_{t-1} ⊛ f^l(O_1), W^l_{t-1} ⊛ f^l(Z_t) )    (2)
the model is added with two transformation matrixes of smoothing and background suppression on the basis of a twin network, and the smoothing matrix learns the appearance change of the previous frame and can effectively utilize space-time information; the background inhibition matrix eliminates clutter influencing factors in the background, and robustness is enhanced. Meanwhile, the AlexNet network in the traditional twin network is replaced by the CIResNet-16 network, so that the precision is higher.
Fig. 2 is a schematic diagram of a dynamic update visual tracking aerial method based on a rotorcraft robot according to an embodiment of the present invention.
The detailed description of HOG feature extraction in step S101 is:
1) To reduce the influence of illumination factors, the whole image first needs to be normalized. Because local surface exposure contributes a large proportion of the texture intensity of the image, this compression processing can effectively reduce local shadows and illumination variation in the image. Typically the image is converted to grayscale, and the color space of the input image is then normalized using Gamma correction. Gamma correction can be understood as improving the contrast of dark or bright parts of the image, and it can effectively reduce local shadows and illumination variation; the Gamma correction formula is as follows:

f(I) = I^γ    (3)

wherein I is the image pixel value and γ is the Gamma correction coefficient.
2) The gradients in the horizontal and vertical coordinate directions of the image are calculated, and the gradient direction value of each pixel position is calculated from them; the derivative operation captures contours and some texture information and further weakens the influence of illumination;

Gx(x,y) = H(x+1,y) - H(x-1,y)    (4)

Gy(x,y) = H(x,y+1) - H(x,y-1)    (5)

In the above expressions, Gx(x,y) and Gy(x,y) respectively represent the horizontal gradient and the vertical gradient at pixel point (x,y) in the input image.

G(x,y) = √(Gx(x,y)² + Gy(x,y)²)    (6)

α(x,y) = arctan(Gy(x,y)/Gx(x,y))    (7)

G(x,y), H(x,y) and α(x,y) respectively represent the gradient amplitude, the pixel value and the gradient direction of the pixel point at (x,y).
3) Histogram calculation: the image is divided into small cell units (which may be rectangular or circular) in order to provide an encoding for the local image region.
4) The cell units are combined into large blocks (blocks) with the gradient histograms normalized inside the blocks.
5) All overlapping blocks in the detection window are collected for HOG features and combined into a final feature vector for classification.
The detailed description of the modified network CIResNet-16 in step S102 is:
CIResNet-16 is divided into three phases (stride of 8) consisting of 18 weighted convolution layers.
(1) A 7×7 convolution and a cropping operation (crop size 2) remove the features affected by padding.
(2) After the max-pooling layer with stride 2, the improved CIResNet (CIR) unit is entered; as shown in (a) of fig. 5, the network of the CIR unit at this stage has 3 layers: the first layer is a 1×1 convolution with 64 channels, the second layer is a 3×3 convolution with 64 channels, and the third layer is a 1×1 convolution with 256 channels. As depicted in fig. 5, the feature maps after the convolution layers are added and then enter a crop operation, which removes the features affected by the padding of 1 in the 3×3 convolution.
(3) The network then enters the CIR-D (Downsampling CIR) unit, shown in (b) of fig. 5; this stage has 12 layers in total, with the first, second and third layers used as a unit block repeated 4 times. The first layer is a 1×1 convolution with 128 channels; the second layer is a 3×3 convolution with 128 channels; the third layer is a 1×1 convolution with 512 channels.
The first block of this stage (4 blocks in total) performs downsampling through the proposed CIR-D unit, and the number of filters is doubled after the feature-map size is downsampled, to improve feature discriminability. CIR-D changes the stride of the convolutions in the bottleneck layer and the shortcut connection from 2 to 1, and inserts cropping again after the addition operation to delete the features affected by padding. Finally, max-pooling is employed to perform spatial downsampling of the feature map. The spatial size of the output feature map is 7×7, and each feature receives information from a region of 77×77 pixels on the input image plane. As shown in fig. 5, the feature maps after the convolution layers are added and then enter the crop operation and the max-pooling layer. The key idea of these modifications is to ensure that only features affected by padding are deleted, while the inherent block structure remains unchanged.
(4) Cross-correlation operation:
the improved twin network structure takes an image pair as input, and comprises an example image Z and a candidate search image X. Image Z represents an object of interest (e.g., an image block centered on the target object in a first video frame), while X represents a search area in a subsequent video frame, typically larger. Both inputs are processed by ConvNet with parameter θ. This will produce two feature maps that are cross-correlated:
Figure BDA0002300823490000141
where b represents the deviation term, the whole formula corresponds to an exhaustive search of the image X in Z mode, with the aim of matching the maximum value in the response map f with the target position. To achieve this goal, the network is trained offline by means of a pair of random images (Z, X) obtained from training videos and corresponding ground tags y, the parameter θ in ConvNet being obtained by minimizing the following loss parameters in the training set:
Figure BDA0002300823490000142
the basic formula of the loss function is:
l(y,v)=log(1+exp(-yv)) (10)
where y ∈ {+1, -1} represents the true label and v represents the actual score at a position of the sample search image. From the sigmoid function, the probability of a positive sample is 1/(1 + e^(-v)) and the probability of a negative sample is 1/(1 + e^(v)); the following is readily derived from the formula of cross entropy:

l(y, v) = -[ (1+y)/2 · log(1/(1+e^(-v))) + (1-y)/2 · log(1/(1+e^(v))) ] = log(1 + exp(-yv))    (11)
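A compact PyTorch sketch of formulas (8) and (10) is given below as an illustration; the embedding network φ is replaced by random stand-in feature maps, the label map is generated arbitrarily, and the balancing of positive and negative positions is omitted, so it is not the patent's implementation.

```python
import torch
import torch.nn.functional as F

def siamese_response(phi_z, phi_x, b=0.0):
    """f(Z, X) = phi(Z) cross-correlated with phi(X) + b: the exemplar embedding
    is used as the convolution kernel over the search embedding (formula (8))."""
    return F.conv2d(phi_x, phi_z) + b          # (1, 1, Hx-Hz+1, Wx-Wz+1) score map

def logistic_loss(v, y):
    """Mean of l(y, v) = log(1 + exp(-y v)) over all response-map positions
    (formula (10)); softplus(-y v) equals log(1 + exp(-y v))."""
    return F.softplus(-y * v).mean()

# Stand-in embeddings (a real tracker would obtain them from CIResNet-16):
phi_z = torch.randn(1, 256, 6, 6)      # exemplar features phi(Z)
phi_x = torch.randn(1, 256, 22, 22)    # search features phi(X)
v = siamese_response(phi_z, phi_x)     # (1, 1, 17, 17)
y = torch.where(torch.rand_like(v) > 0.9, torch.tensor(1.0), torch.tensor(-1.0))
print(v.shape, logistic_loss(v, y).item())
```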
the step of the dynamic update algorithm in step S102 is:
(1) Inputting a picture to obtain a template image O1;
(2) Determining a candidate frame searching region Zt in a frame to be tracked;
(3) Mapping the original images to a specific feature space through feature mapping, obtaining the two depth features f^l(O_1) and f^l(Z_t) respectively;
(4) Learning the change between the previous-frame tracking result and the first-frame template frame according to Regularized Linear Regression (RLR):

min_{V^l_{t-1}} || V^l_{t-1} ⊛ f^l_1 - f^l_{t-1} ||² + λ_v || V^l_{t-1} ||²

Fast computation in the frequency domain gives:

V̂^l_{t-1} = ( conj(f̂^l_1) ⊙ f̂^l_{t-1} ) / ( conj(f̂^l_1) ⊙ f̂^l_1 + λ_v )

thereby obtaining the variation V^l_{t-1}. In this context, f^l_1 = f^l(O_1) and f^l_{t-1} = f^l(O_{t-1}), wherein O represents the object, f represents the feature matrix, the superscript denotes the channel (layer) index and the subscript denotes the frame index, that is, the features of the previous-frame tracking result and of the first-frame target are used; hats denote Fourier transforms, ⊙ denotes elementwise multiplication, and conj(·) denotes the complex conjugate (these frequency-domain computations, together with steps (5) and (6), are illustrated in the sketch following this algorithm).
(5) The suppression quantity Ŵ^l_{t-1} of the current-frame background is obtained according to the RLR calculation formula in the frequency domain:

Ŵ^l_{t-1} = ( conj(f̂^l(Z_{t-1})) ⊙ f̂^l(Ḡ_{t-1}) ) / ( conj(f̂^l(Z_{t-1})) ⊙ f̂^l(Z_{t-1}) + λ_w )

wherein G_{t-1} is a map of the same size as the search region of the previous frame, and Ḡ_{t-1} is G_{t-1} multiplied by a Gaussian smoothing at the centre of the picture, the purpose of which is to emphasize the centre and suppress the edges. Through online learning of the target variation V^l_{t-1} and the background suppression transformation W^l_{t-1}, the improved model gives the static twin network online adaptive capability, which improves tracking precision and real-time speed.
(6) Elementwise multi-layer feature fusion:

S_t = Σ_l w^l ⊙ S^l_t

where w^l is an elementwise weight map for the response of layer l.
the center weight of the shallow layer features is high, the peripheral weight of the deep layer features is high, the center is low, if the target is in the center of the search area, the shallow layer features can better position the target, and if the target is in the periphery of the search area, the deep layer features can also effectively determine the position of the target.
That is, when the target is close to the center of the search area, the deeper layer features help to eliminate background interference, and the shallower layer features help to obtain accurate positioning of the target; if the target is located at the periphery of the search area, only deeper layer features can effectively determine the target location.
(7) Joint training is performed: forward propagation is first carried out for a given N-frame video sequence {I_t | t=1,…,N}, which is tracked to obtain N response maps {S_t | t=1,…,N}, with {R_t | t=1,…,N} denoting the N target boxes; the per-frame loss is

L_t = (1/|D|) Σ_{u∈D} l( J_t[u], S_t[u] ),   l(y, v) = log(1 + exp(-yv))

where J_t ∈ {+1, -1} is the label map generated from the target box R_t over the response-map positions D.
(8) A schematic diagram of the single-layer network structure is shown in fig. 6, where "Eltwise" (elementwise multi-layer fusion) trains a weight matrix whose values represent the weights of different locations of the different feature maps. Gradient propagation and parameter updates are performed using BPTT (backpropagation through time) and SGD (stochastic gradient descent). In order to use BPTT- and SGD-trained networks effectively, all the gradients of L_t must be obtained: as shown in fig. 6, starting from ∂L_t/∂S^l_t, the gradients with respect to the transformed template feature V^l_{t-1} ⊛ f^l(O_1) and the transformed search feature W^l_{t-1} ⊛ f^l(Z_t) are calculated, and the loss gradient is then propagated efficiently to f^l through the "CirConv" and "RLR" layers on the left. The derivations are carried out in the frequency domain, wherein f̂ = E f represents f after the Fourier transform and E is the discrete Fourier transform matrix; the elementwise multi-layer fusion can also be calculated using the procedure described above, the multi-feature fusion formula being converted into the per-layer gradients ∂L_t/∂S^l_t = w^l ⊙ ∂L_t/∂S_t.
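The frequency-domain computations of steps (4) and (5) and the elementwise fusion of step (6) can be prototyped compactly; the NumPy sketch below is a single-channel illustration with assumed sizes, λ values and weight maps, not the patent's exact formulation.

```python
import numpy as np

def rlr_transform_fft(f_src, f_dst, lam=1e-2):
    """Closed-form regularized linear regression (RLR) in the Fourier domain:
    returns the FFT of the transform T minimising ||T (*) f_src - f_dst||^2 + lam ||T||^2,
    where (*) denotes circular convolution."""
    F_src, F_dst = np.fft.fft2(f_src), np.fft.fft2(f_dst)
    return (np.conj(F_src) * F_dst) / (np.conj(F_src) * F_src + lam)

rng = np.random.default_rng(0)
f_O1   = rng.standard_normal((6, 6))       # template features f(O_1)
f_Otm1 = rng.standard_normal((6, 6))       # previous-result features f(O_{t-1})
f_Ztm1 = rng.standard_normal((22, 22))     # previous search-region features f(Z_{t-1})

# Step (4): appearance-variation transform V_{t-1} (template -> previous result).
V_hat = rlr_transform_fft(f_O1, f_Otm1)

# Step (5): background-suppression transform W_{t-1}; the regression target is the
# previous search feature weighted by a centre-peaked Gaussian map (emphasize centre).
yy, xx = np.mgrid[0:22, 0:22]
gauss = np.exp(-(((yy - 10.5) ** 2 + (xx - 10.5) ** 2) / (2 * 5.0 ** 2)))
W_hat = rlr_transform_fft(f_Ztm1, f_Ztm1 * gauss)

# Applying a transform is a circular convolution, i.e. an inverse FFT of a product:
updated_template  = np.real(np.fft.ifft2(V_hat * np.fft.fft2(f_O1)))    # approx. f(O_{t-1})
suppressed_search = np.real(np.fft.ifft2(W_hat * np.fft.fft2(f_Ztm1)))  # centre kept, edges damped

# Step (6): elementwise multi-layer fusion of two per-layer response maps.
S1, S2 = rng.standard_normal((17, 17)), rng.standard_normal((17, 17))
yy, xx = np.mgrid[0:17, 0:17]
w_shallow = np.exp(-((yy - 8) ** 2 + (xx - 8) ** 2) / (2 * 4.0 ** 2))   # high weight at the centre
w_deep = 1.0 - w_shallow                                                # high weight at the periphery
S_fused = w_shallow * S1 + w_deep * S2
print(updated_template.shape, S_fused.shape)
```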
The model has reliable online adaptability; it effectively learns foreground and background changes and suppresses background interference without damaging the real-time response capability, and it achieves an excellent balance of tracking performance in the experiments. In addition, the model is trained jointly and directly on labelled video sequences as a whole, rather than on image pairs, so that the rich spatio-temporal information of the moving target can be captured better. Meanwhile, with joint training all parameters can be learned offline through back-propagation, which facilitates training on data. The specific effect is shown in fig. 7.
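As a rough illustration of such joint training on a video sequence (rather than on independent image pairs), the PyTorch sketch below unrolls a toy tracker over N frames, sums the per-frame losses and lets autograd perform the backpropagation-through-time step before an SGD update; the embedding network, label maps and data are placeholders, and the recurrent V/W updates that make the unrolling genuinely time-dependent are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy embedding standing in for CIResNet-16 (assumption for the example).
phi = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 32, 3, padding=1))
optimizer = torch.optim.SGD(phi.parameters(), lr=1e-3, momentum=0.9)

def response(template_img, search_img):
    z, x = phi(template_img), phi(search_img)
    return F.conv2d(x, z)                       # cross-correlation score map

def train_on_sequence(frames, template, label_maps):
    """One training step over an N-frame sequence: accumulate per-frame losses,
    then backpropagate through the whole unrolled sequence (BPTT) and apply SGD."""
    optimizer.zero_grad()
    total_loss = 0.0
    for search, labels in zip(frames, label_maps):
        v = response(template, search)
        total_loss = total_loss + F.softplus(-labels * v).mean()  # log(1+exp(-yv))
    total_loss.backward()                       # gradients through all N frames
    optimizer.step()
    return float(total_loss)

# Placeholder data: N = 4 search frames, one template, ±1 label maps.
N = 4
template = torch.randn(1, 3, 32, 32)
frames = [torch.randn(1, 3, 64, 64) for _ in range(N)]
label_maps = [torch.where(torch.rand(1, 1, 33, 33) > 0.9,
                          torch.tensor(1.0), torch.tensor(-1.0)) for _ in range(N)]
print(train_on_sequence(frames, template, label_maps))
```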
In the above embodiments, the invention may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used in whole or in part, the invention is implemented in the form of a computer program product comprising one or more computer instructions. When the computer instructions are loaded or executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center containing an integration of one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), and the like.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (8)

1. The dynamic updating vision tracking aerial photographing method based on the rotor flying robot is characterized by comprising the following steps of:
firstly, performing target detection on an input image by using an HOG feature extraction algorithm and a Support Vector Machine (SVM) algorithm;
step two, transmitting target frame information obtained by target detection to a visual tracking part, and tracking the target in real time by adopting a dynamic update twin network based on a CIResNet network;
the step two of real-time tracking the target comprises the following steps:
(1) Acquiring the first frame of the video sequence as the template frame O_1, acquiring the search region Z_t from the current frame, and obtaining f^l(O_1) and f^l(Z_t) respectively through the CIResNet-16 network;
(2) The network adds a transformation matrix V and a transformation matrix W, both of which are computed rapidly in the frequency domain by FFT; the transformation matrix V is obtained from the tracking result of frame t-1 and the first-frame target, acts on the convolution features of the target template, and learns the change of the target so that the template convolution features at time t are approximately equal to those at time t-1, smoothing the change of the current frame relative to the previous frames; the transformation matrix W is obtained from the tracking result of frame t-1, acts on the convolution features of the candidate region at time t, and learns background suppression to eliminate the influence caused by irrelevant background features in the target region;
training is performed with regularized linear regression for the transformation matrix V and the transformation matrix W; f^l(O_1) and f^l(Z_t) are transformed through the matrices into V^l_{t-1} ⊛ f^l(O_1) and W^l_{t-1} ⊛ f^l(Z_t) respectively, where "⊛" denotes the circular convolution operation, V^l_{t-1} ⊛ f^l(O_1) represents the change of the target appearance and gives the currently updated target template, and W^l_{t-1} ⊛ f^l(Z_t) represents the background suppression transformation and gives a search template better suited to the current frame; the final model is as follows:

S^l_t = corr( V^l_{t-1} ⊛ f^l(O_1), W^l_{t-1} ⊛ f^l(Z_t) )

on the basis of the twin network, the final model adds the smoothing matrix V and the background suppression matrix W, wherein the smoothing matrix V learns the appearance change of the previous frames and the background suppression matrix W eliminates clutter factors in the background;
in the process of tracking the target in real time by adopting a CIResNet-based dynamic update twin network, a dynamic update algorithm comprises:
(1) Inputting a picture to obtain a template image O1;
(2) Determining a candidate frame searching region Zt in a frame to be tracked;
(3) Mapping the original images to a specific feature space through feature mapping, obtaining the two depth features f^l(O_1) and f^l(Z_t) respectively;
(4) Learning the change between the previous-frame tracking result and the first-frame template frame according to the RLR:

min_{V^l_{t-1}} || V^l_{t-1} ⊛ f^l_1 - f^l_{t-1} ||² + λ_v || V^l_{t-1} ||²

Fast computation in the frequency domain gives:

V̂^l_{t-1} = ( conj(f̂^l_1) ⊙ f̂^l_{t-1} ) / ( conj(f̂^l_1) ⊙ f̂^l_1 + λ_v )

thereby obtaining the variation V^l_{t-1}, wherein f^l_1 = f^l(O_1) and f^l_{t-1} = f^l(O_{t-1}), O represents the target, f represents the feature matrix, the superscript denotes the channel (layer) index and the subscript denotes the frame index, namely the features of the previous-frame tracking result and of the first-frame target are obtained; hats denote Fourier transforms, ⊙ denotes elementwise multiplication, and conj(·) denotes the complex conjugate;
(5) Obtaining the suppression quantity Ŵ^l_{t-1} of the current-frame background according to the RLR calculation formula in the frequency domain:

Ŵ^l_{t-1} = ( conj(f̂^l(Z_{t-1})) ⊙ f̂^l(Ḡ_{t-1}) ) / ( conj(f̂^l(Z_{t-1})) ⊙ f̂^l(Z_{t-1}) + λ_w )

wherein G_{t-1} is a map of the same size as the search region of the previous frame, and Ḡ_{t-1} is G_{t-1} multiplied by a Gaussian smoothing centred on the picture centre; the target variation V^l_{t-1} and the background suppression transformation W^l_{t-1} are thus learned online;
(6) Elementwise multi-layer feature fusion:

S_t = Σ_l w^l ⊙ S^l_t

where w^l is an elementwise weight map for the response of layer l;
(7) Joint training is performed: forward propagation is first carried out for a given N-frame video sequence {I_t | t=1,…,N}, which is tracked to obtain N response maps {S_t | t=1,…,N}, with {R_t | t=1,…,N} denoting the N target boxes; the per-frame loss is

L_t = (1/|D|) Σ_{u∈D} l( J_t[u], S_t[u] ),   l(y, v) = log(1 + exp(-yv))

where J_t ∈ {+1, -1} is the label map generated from the target box R_t over the response-map positions D;
(8) Gradient propagation and parameter updating are performed using BPTT and SGD to obtain all the gradients of L_t; starting from ∂L_t/∂S^l_t, the gradients with respect to the transformed template feature V^l_{t-1} ⊛ f^l(O_1) and the transformed search feature W^l_{t-1} ⊛ f^l(Z_t) are calculated, and the loss gradient is propagated efficiently to f^l through the CirConv and RLR layers on the left; the derivations are carried out in the frequency domain, wherein f̂ = E f represents f after the Fourier transform and E is the discrete Fourier transform matrix; for the multi-feature fusion formula, the gradient is converted correspondingly into the per-layer form ∂L_t/∂S^l_t = w^l ⊙ ∂L_t/∂S_t.
2. The method for dynamically updating visual tracking and aerial photographing based on a rotorcraft robot according to claim 1, wherein in the first step, the target detection method comprises the following steps:
(1) Dividing the image into a plurality of connected areas which are 8×8 pixel cell units;
(2) Collecting the gradient amplitude and gradient direction of each pixel point in a cell unit, dividing the gradient direction range [-90°, 90°] into 9 bin intervals on average, and using the gradient amplitude as a weight;
(3) Carrying out histogram statistics on the gradient amplitude of each pixel in the unit in each direction bin interval to obtain a one-dimensional gradient direction histogram;
(4) Performing contrast normalization on the histogram on the space block;
(5) Extracting HOG descriptors through a detection window, and combining HOG descriptors of all blocks in the detection window to form a final feature vector;
(6) Inputting the feature vector into a linear SVM, and performing target detection by using an SVM classifier;
(7) Dividing the detection window into overlapped blocks, calculating HOG descriptors for the blocks, and putting the formed feature vectors into a linear SVM to perform target/non-target classification;
(8) Scanning all positions and scales of the whole image by a detection window, and performing non-maximum suppression on an output pyramid to detect a target;
the method for performing contrast normalization on the histogram in step (4) is as follows:
the density of each histogram in the block is first calculated, and then each cell unit in the block is normalized according to this density.
3. The method for dynamically updating visual tracking aerial photographs based on a rotorcraft robot of claim 1, wherein in step one, the HOG feature extraction algorithm specifically comprises:
(1) normalizing the whole image, and normalizing the color space of the input image by adopting a Gamma correction method; the Gamma correction formula is as follows:
f(I) = I^γ

wherein I is the image pixel value and γ is the Gamma correction coefficient;

(2) calculating the gradients in the horizontal and vertical coordinate directions of the image, and calculating the gradient direction value of each pixel position from these gradients; the derivative operation captures contours and some texture information and further weakens the influence of illumination;

Gx(x,y) = H(x+1,y) - H(x-1,y);

Gy(x,y) = H(x,y+1) - H(x,y-1);

wherein Gx(x,y) and Gy(x,y) respectively represent the horizontal gradient and the vertical gradient at pixel point (x,y) in the input image;

G(x,y) = √(Gx(x,y)² + Gy(x,y)²);

α(x,y) = arctan(Gy(x,y)/Gx(x,y));

wherein G(x,y), H(x,y) and α(x,y) respectively represent the gradient amplitude, the pixel value and the gradient direction at pixel point (x,y);
(3) histogram calculation: dividing the image into small cell units to provide an encoding for the local image region;
(4) combining the cell units into a large block, normalizing the gradient histogram within the block;
(5) and collecting HOG characteristics of all overlapped blocks in the detection window, and combining the HOG characteristics into a final characteristic vector for classification.
4. The method for dynamically updating visual tracking aerial photograph based on a rotorcraft robot of claim 1, wherein in step two, the dynamically updating twin network based on ciranet comprises:
(I) a 7×7 convolution and a cropping operation, which delete the features affected by padding;
(II) after a max-pooling layer with stride 2, the improved CIResNet (CIR) unit is entered; the network of the CIR unit stage has 3 layers in total, where the first layer is a 1×1 convolution with 64 channels, the second layer is a 3×3 convolution with 64 channels, and the third layer is a 1×1 convolution with 256 channels; the feature maps are added after the convolution layers and then enter a crop operation, which removes the features affected by the padding of 1 in the 3×3 convolution;
(III) a CIR-D unit is then entered; the network of the CIR-D unit stage has 12 layers in total, with the first, second and third layers used as a unit block repeated 4 times; the first layer is a 1×1 convolution with 128 channels, the second layer is a 3×3 convolution with 128 channels, and the third layer is a 1×1 convolution with 512 channels;
and (IV) cross-correlation operation: the improved twin network structure takes an image pair as input, comprising an exemplar image Z and a candidate search image X; image Z represents the object of interest, while X represents the search region in a subsequent video frame and is typically larger; both inputs are processed by a ConvNet with parameters θ; two feature maps are generated, and their cross-correlation is:

f(Z, X) = φ_θ(Z) ⋆ φ_θ(X) + b

where b denotes a bias term; the formula searches the image X with Z as the template so that the maximum value in the response map f matches the target position; the network is trained offline on random image pairs (Z, X) obtained from the training videos together with the corresponding ground-truth labels y, and the parameters θ of the ConvNet are obtained by minimizing the following loss over the training set:

θ* = arg min_θ E_{(Z,X,y)} L(y, f(Z, X; θ))
the basic formula of the loss function is:
l(y,v)=log(1+exp(-yv));
wherein y ∈ {+1, -1} represents the true label and v represents the actual score at a position of the sample search image; from the sigmoid function, the probability of a positive sample is 1/(1 + e^(-v)) and the probability of a negative sample is 1/(1 + e^(v)); the following is readily derived from the formula of cross entropy:

l(y, v) = -[ (1+y)/2 · log(1/(1+e^(-v))) + (1-y)/2 · log(1/(1+e^(v))) ] = log(1 + exp(-yv))
5. The method for dynamically updating visual tracking aerial photographing based on a rotor flying robot according to claim 4, wherein in step (III), the first block of the CIR-D unit stage performs downsampling through the proposed CIR-D unit, and the number of filters is doubled after the feature-map size is downsampled; CIR-D changes the convolution stride in the bottleneck layer and the shortcut connection from 2 to 1, and inserts cropping again after the addition operation so as to delete the features affected by padding; finally, spatial downsampling of the feature map is performed with max-pooling; the spatial size of the output feature map is 7×7, and each feature receives information from a region of 77×77 pixels on the input image plane; the feature maps are added after the convolution layers and then enter the crop operation and the max-pooling layer; the key idea of these modifications is to ensure that only features affected by padding are deleted, while the inherent block structure remains unchanged.
6. A dynamic updating visual tracking aerial photographing system based on a rotor flying robot, which implements the dynamic updating visual tracking aerial photographing method based on a rotor flying robot of claim 1.
7. An information data processing terminal for implementing the dynamic updating visual tracking aerial photographing method based on a rotor flying robot according to any one of claims 1 to 5.
8. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the dynamic updating visual tracking aerial photographing method based on a rotor flying robot according to any one of claims 1 to 5.
CN201911220924.1A 2019-12-03 2019-12-03 Dynamic updating vision tracking aerial photographing method and system based on rotor flying robot Active CN110992378B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911220924.1A CN110992378B (en) 2019-12-03 2019-12-03 Dynamic updating vision tracking aerial photographing method and system based on rotor flying robot

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911220924.1A CN110992378B (en) 2019-12-03 2019-12-03 Dynamic updating vision tracking aerial photographing method and system based on rotor flying robot

Publications (2)

Publication Number Publication Date
CN110992378A (en) 2020-04-10
CN110992378B (en) 2023-05-16

Family

ID=70089566

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911220924.1A Active CN110992378B (en) 2019-12-03 2019-12-03 Dynamic updating vision tracking aerial photographing method and system based on rotor flying robot

Country Status (1)

Country Link
CN (1) CN110992378B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113610888B (en) * 2021-06-29 2023-11-24 南京信息工程大学 Twin network target tracking method based on Gaussian smoothing
CN114863267B (en) * 2022-03-30 2023-05-23 南京邮电大学 Precise statistical method for number of aerial trees based on multi-track intelligent prediction
CN115984333B (en) * 2023-02-14 2024-01-19 北京拙河科技有限公司 Smooth tracking method and device for airplane target
CN116088580B (en) * 2023-02-15 2023-11-07 北京拙河科技有限公司 Flying object tracking method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110490906A (en) * 2019-08-20 2019-11-22 南京邮电大学 A kind of real-time vision method for tracking target based on twin convolutional network and shot and long term memory network

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3076377B1 (en) * 2017-12-29 2021-09-24 Bull Sas PREDICTION OF DISPLACEMENT AND TOPOLOGY FOR A NETWORK OF CAMERAS.
CN108898620B (en) * 2018-06-14 2021-06-18 厦门大学 Target tracking method based on multiple twin neural networks and regional neural network
CN109272530B (en) * 2018-08-08 2020-07-21 北京航空航天大学 Target tracking method and device for space-based monitoring scene
CN109993774B (en) * 2019-03-29 2020-12-11 大连理工大学 Online video target tracking method based on depth cross similarity matching
CN110070562A (en) * 2019-04-02 2019-07-30 西北工业大学 A kind of context-sensitive depth targets tracking
CN110443827B (en) * 2019-07-22 2022-12-20 浙江大学 Unmanned aerial vehicle video single-target long-term tracking method based on improved twin network

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110490906A (en) * 2019-08-20 2019-11-22 南京邮电大学 A kind of real-time vision method for tracking target based on twin convolutional network and shot and long term memory network

Also Published As

Publication number Publication date
CN110992378A (en) 2020-04-10

Similar Documents

Publication Publication Date Title
CN110992378B (en) Dynamic updating vision tracking aerial photographing method and system based on rotor flying robot
WO2021043168A1 (en) Person re-identification network training method and person re-identification method and apparatus
CN110222787B (en) Multi-scale target detection method and device, computer equipment and storage medium
CN106960446B (en) Unmanned ship application-oriented water surface target detection and tracking integrated method
CN111460968B (en) Unmanned aerial vehicle identification and tracking method and device based on video
CN109903331B (en) Convolutional neural network target detection method based on RGB-D camera
CN109685045B (en) Moving target video tracking method and system
WO2019007253A1 (en) Image recognition method, apparatus and device, and readable medium
CN102156995A (en) Video movement foreground dividing method in moving camera
CN109389609B (en) Interactive self-feedback infrared target detection method based on FART neural network
CN112364865B (en) Method for detecting small moving target in complex scene
CN111160365A (en) Unmanned aerial vehicle target tracking method based on combination of detector and tracker
CN109101926A (en) Aerial target detection method based on convolutional neural networks
Hu et al. An infrared target intrusion detection method based on feature fusion and enhancement
Zou et al. Microarray camera image segmentation with Faster-RCNN
CN108345835B (en) Target identification method based on compound eye imitation perception
CN111860488A (en) Method, device, equipment and medium for detecting and identifying bird nest of tower
CN109635649B (en) High-speed detection method and system for unmanned aerial vehicle reconnaissance target
CN104715476A (en) Salient object detection method based on histogram power function fitting
CN113763417B (en) Target tracking method based on twin network and residual error structure
Guangjing et al. Research on static image recognition of sports based on machine learning
Liu et al. [Retracted] Mean Shift Fusion Color Histogram Algorithm for Nonrigid Complex Target Tracking in Sports Video
CN112069997B (en) Unmanned aerial vehicle autonomous landing target extraction method and device based on DenseHR-Net
Songtao et al. Saliency detection of infrared image based on region covariance and global feature
CN111027427B (en) Target gate detection method for small unmanned aerial vehicle racing match

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant