CN111260687A - Aerial video target tracking method based on semantic perception network and related filtering - Google Patents

Aerial video target tracking method based on a semantic perception network and correlation filtering

Info

Publication number
CN111260687A
Authority
CN
China
Prior art keywords
target
frame
mask
network
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010028112.3A
Other languages
Chinese (zh)
Other versions
CN111260687B (en)
Inventor
李映
尹霄越
朱奕昕
薛希哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202010028112.3A priority Critical patent/CN111260687B/en
Publication of CN111260687A publication Critical patent/CN111260687A/en
Application granted granted Critical
Publication of CN111260687B publication Critical patent/CN111260687B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20016Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20024Filtering details

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an aerial video target tracking method based on a semantic perception network and correlation filtering, aimed at the target blurring and occlusion problems that a correlation filtering algorithm finds difficult to solve. A detection module and a segmentation module are introduced: the category of the target is recorded in the first frame, and in subsequent frames the target candidate region is detected and semantically segmented to obtain candidate boxes and segmentation masks of the same category within the region; the candidate boxes and masks are then fused to process the response map of the correlation filtering algorithm, and non-target regions with large response values are cropped out to obtain accurate target localization. Thanks to the above measures, the present invention can achieve very robust results in a variety of challenging aerial scenes.

Description

Aerial video target tracking method based on a semantic perception network and correlation filtering
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an aerial video target tracking method based on a semantic perception network and correlation filtering.
Background
In recent years, aerial video tracking technology has developed rapidly in both the military and civilian fields, offering notable diversity and flexibility. Compared with video shot by common handheld devices, aerial video provides more flexible angles, scales and fields of view. The development of aerial video target tracking has spawned many new and important applications such as crowd monitoring, target tracking and aerial navigation. In traditional general-scene target tracking, many algorithms locate a bounding box throughout a video according to a given initial state in the first frame; however, factors such as weather conditions, flight height, target size and camera view angle affect the tracking result. Meanwhile, because of shadows, background interference and low-light conditions introduced by the highly tilted shooting angle, aerial video may lose much of the rich texture and detail of the object. In recent years, correlation-filtering-based methods have developed greatly and show good tracking performance in both accuracy and speed, so they can meet the requirements of aerial video to a certain extent. However, a tracker can be misled when objects captured in aerial video appear blurred in shadows or are occluded by other objects. In this case, when the target is lost for a period of time, the conventional correlation filtering method may suffer model drift, so that the target cannot be re-located and tracked. Therefore, designing a robust target tracking algorithm for aerial scenes is significant and urgent.
Disclosure of Invention
Technical problem to be solved
Aiming at the problem that camera motion in aerial video causes targets to become blurred or occluded, which leads to appearance-model drift and easily to tracking failure, a robust real-time target tracking method is designed on the basis of an efficient correlation filtering algorithm, making full use of the fact that target semantic information is not easily affected by appearance changes and combining it with target detection technology.
Technical scheme
An aerial video target tracking method based on a semantic perception network and correlation filtering, characterized by comprising the following steps:
step 1: reading the first frame image data and the parameters of the target box in the first frame image, R_target = [x, y, w, h], wherein x, y represent the horizontal and vertical coordinates of the upper left corner of the target, and w, h represent the width and height of the target;
step 2: determining a target region R_1 according to the target center position and the length and width of the first frame, R_1 = [x_center, y_center, 1.5w, 1.5h];
step 3: performing feature extraction in the region R_1, wherein the feature extraction network uses a ResNet50 residual network with a feature pyramid FPN to obtain 256-dimensional depth features J at 5 different scales S = {1.0, 0.8, 0.4, 0.2, 0.1} times the size of the original image;
step 4: inputting the features obtained in step 3 into the correlation filtering module and the detection module respectively; in the correlation filtering module, the part J_target of the feature J corresponding to R_target is cut out and used as the target template y_1; the target feature J_target is input into the category judgment branch of the detection module, and the network outputs the category information of the target;
step 5: reading the k-th frame image, wherein k ≥ 2 and the initial value is 2, and determining the k-th frame target region R_k according to the target parameters [x_{k-1}, y_{k-1}, w_{k-1}, h_{k-1}] of the previous frame; performing feature extraction on R_k using the method in step 3 to obtain the target feature J_k, and inputting J_k into the correlation filtering module, the detection module and the semantic segmentation module respectively;
step 6: in the correlation filtering module, let J_k be the training sample x_k of this frame, and train the correlation filter w in combination with the target template y_k of this frame; the training of w uses the following optimization model:
min_w L( f(x_k, w), y_k ) + λ‖w‖²
wherein f(·) represents the correlation operation, L(·) represents the squared loss function, and λ is the regularization parameter; for convenience of solution, x_k and y_k are transformed by the discrete Fourier transform to obtain X_k and Y_k, the above formula is converted into a frequency-domain calculation, W represents w in the frequency domain, and the solution is
W^h = ( X̄_k^h ⊙ Y_k ) / ( Σ_{h=1}^{H} X̄_k^h ⊙ X_k^h + λ )
wherein h represents the feature dimension of the training sample; after obtaining the correlation filter W, the initial response map r output by the correlation filtering module is calculated according to the following formula:
r = F⁻¹( Σ_{h=1}^{H} W̄^h ⊙ X_k^h )
wherein F⁻¹(·) represents the inverse Fourier transform, ⊙ represents the dot product, and ‾ represents the complex conjugate operator;
step 7: the detection module first applies a convolution with kernel size 3×3 to J_k; the output of this convolution is fed into a category judgment branch, a target box regression branch and a mask branch respectively; the category judgment branch applies a 3×3 convolution to its input with output dimension 80, i.e. the number of categories in the COCO dataset, where each dimension represents the confidence score of belonging to that category; the target box regression branch applies a 3×3 convolution with output dimension 4, comprising the coordinates of the upper-left and lower-right corners of the target box; the mask branch applies a 3×3 convolution with output dimension 32 and a tanh activation on the output to generate the coefficients c_i for each pixel, which are used by the semantic segmentation module to generate the target mask; the detection module needs to be pre-trained before the tracking algorithm is executed;
step 8: combining the category confidences and the regression boxes, the category and target box can be obtained pixel by pixel; anchors are set according to aspect ratios {1:2, 1:1, 2:1}, and candidate boxes are obtained through non-maximum suppression (NMS); the candidate boxes are screened according to the target category obtained in the first frame, so that the detection boxes in region R_k with the same category as the target are taken as the output of the detection module; at the same time, the mask coefficients of the corresponding pixels are obtained, expressed as C = tanh([c_1, c_2, ..., c_t]) ∈ R^{t×32}, where t represents the number of screened target boxes;
step 9: the semantic segmentation module inputs J_k into a fully convolutional network FCN, which first applies 3 convolution layers with 3×3 kernels keeping the dimensionality unchanged, then one layer of 2× upsampling, then one 3×3 convolution, and finally a 1×1 convolution outputting a 32-dimensional semantic segmentation prototype, expressed as D = [d_1, d_2, ..., d_32] ∈ R^{32×n}, where n is the dimension of the feature map, i.e. the product of the feature map length and width; the semantic segmentation module needs to be pre-trained before the tracking algorithm is executed;
step 10: combining the mask coefficients C output in step 7, the target masks M_t are generated according to the following formula, where p_{i,x,y} represents an element of the matrix C·D and t indicates that there are t target masks in total:
M_i(x, y) = p_{i,x,y},  P = C·D ∈ R^{t×n},  i = 1, ..., t
step 11: selecting from M_t according to the following formula to obtain the final target mask M, where score represents the category confidence, dist represents the distance from the mask center to the center of region R_k, and i represents the index of the mask; the mask with the maximum ratio is taken as the final target mask M:
M = M_{i*},  i* = argmax_i ( score_i / dist_i )
step 12: according to the target box output by the detection module, the initial response map r of the correlation filter is cropped: values inside the target box area are kept and values outside are set to 0, giving a new response map r_b; then, combining the output of the segmentation module according to the following formula, the final semantically fused response map r_m is obtained, where p represents the weight of the mask M;
r_m = (1 - p)·r_b + p·M
step 13: finding the position with the maximum response value in r_m and taking it as the target position of this frame, and updating the correlation filter w according to the following formula:
w_k = (1 - η)·w_{k-1} + η·w_k
wherein η represents the learning rate;
step 14: judging whether all the images are processed or not, and if so, ending the process; otherwise, returning to the step 5.
The value of λ is 0.003.
The value of H is 50.
The value of p is 0.2.
The value of η is 0.03.
The detection module and the semantic segmentation module are jointly pre-trained as follows:
1) normalizing the images of the COCO2017 dataset so that the data distribution conforms to a standard normal distribution, and then randomly cropping the images to a fixed size of 500 × 500;
2) the category judgment branch uses the smooth-L1 loss function, the target box regression branch uses the standard cross-entropy loss function, and the semantic segmentation module, combining the mask coefficients output by the detection network, adopts the loss function shown by the following formula:
L_mask = (1/S) Σ_{i=1}^{S} BCE( σ(c_i·D), G_i )
wherein G represents the ground-truth mask labels and S represents the number of masks in the image;
the total loss function of the network is the sum of the above 3 loss functions;
3) initializing the feature extraction network FPN + ResNet50 with network model parameters pre-trained on ImageNet; training is optimized with the stochastic gradient descent SGD algorithm, and the optimizer parameters are set as follows: learning rate 0.001, momentum 0.9, weight decay 5×10⁻⁴;
4) inputting the data into the network for training: 27 epochs are trained with 20000 images per epoch, and the resulting network model is used in the tracking process.
Advantageous effects
The invention provides an aerial video target tracking method based on a semantic perception network and correlation filtering, which adopts a correlation filtering tracking algorithm and locates the target more robustly and accurately by fusing the semantic information of the target region. Aiming at the target blurring and occlusion problems that are difficult for a correlation filtering algorithm to solve, the invention introduces a detection module and a segmentation module: the category information of the target is recorded in the first frame, and in subsequent frames the target candidate region is detected and semantically segmented to obtain target candidate boxes and segmentation masks of the same category within the region; the candidate boxes and masks are then fused to process the response map of the correlation filtering algorithm, and non-target regions with large response values in the response map are cropped out to obtain accurate target localization. Thanks to the above measures, the present invention can achieve very robust results in a variety of challenging aerial scenes.
Drawings
FIG. 1 flow chart of the present invention
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
1.1 tracking procedure
1) Reading the first frame image data and the parameters of the target box in the first frame image, R_target = [x, y, w, h], where x, y represent the horizontal and vertical coordinates of the upper left corner of the target and w, h represent the width and height of the target.
2) Determining a target region R_1 according to the target center position and the length and width of the first frame, R_1 = [x_center, y_center, 1.5w, 1.5h].
3) Feature extraction is performed in the region R_1; the feature extraction network uses a ResNet50 residual network with a feature pyramid FPN to obtain 256-dimensional depth features J at 5 different scales S = {1.0, 0.8, 0.4, 0.2, 0.1} times the original size.
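By way of illustration only, the following Python sketch extracts multi-scale 256-dimensional features with torchvision's ResNet50 + FPN backbone; the helper name extract_region_features and the cropping convention are assumptions made for this example, not part of the patented method.

```python
# Hedged sketch of step 3): multi-scale feature extraction with a ResNet50 + FPN backbone.
import torch
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# Depending on the torchvision version, `pretrained=True` may instead be a `weights=` argument.
backbone = resnet_fpn_backbone("resnet50", pretrained=True).eval()

def extract_region_features(frame: torch.Tensor, region):
    """frame: (3, H, W) float tensor; region: [x_center, y_center, w, h] (assumed layout)."""
    xc, yc, w, h = region
    x0, y0 = int(xc - 0.5 * w), int(yc - 0.5 * h)
    crop = frame[:, max(y0, 0):y0 + int(h), max(x0, 0):x0 + int(w)]
    with torch.no_grad():
        # Returns an OrderedDict of pyramid levels, each with 256 channels.
        feats = backbone(crop.unsqueeze(0))
    return feats  # e.g. keys '0'..'3' and 'pool', values of shape (1, 256, h_l, w_l)
```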
4) The features obtained in step 3) are input into the correlation filtering module and the detection module respectively. In the correlation filtering module, we cut out the part J_target of the feature J corresponding to R_target and use it as the target template y_1; the target feature J_target is input into the class decision branch of the detection module, and the network outputs the category information of the target.
5) Reading the k-th frame image, where k ≥ 2 with initial value 2; the k-th frame target region R_k is determined according to the target parameters [x_{k-1}, y_{k-1}, w_{k-1}, h_{k-1}] of the previous frame; feature extraction is performed on R_k with the method of step 3) to obtain the target feature J_k, and J_k is input into the correlation filtering module, the detection module and the semantic segmentation module respectively.
6) In the correlation filtering module, let J_k be the training sample x_k of this frame; the correlation filter w is trained in combination with the target template y_k of this frame. The training of w uses the following optimization model:
min_w L( f(x_k, w), y_k ) + λ‖w‖²
f(·) represents the correlation operation, L(·) represents the squared loss function, and λ is the regularization parameter with value 0.003. For convenience of solution, x_k and y_k are transformed by the discrete Fourier transform to obtain X_k and Y_k, the above formula is converted into a frequency-domain calculation, W represents w in the frequency domain, and the solution is
W^h = ( X̄_k^h ⊙ Y_k ) / ( Σ_{h=1}^{H} X̄_k^h ⊙ X_k^h + λ )
H denotes the feature dimension of the training sample, H = 50. After obtaining the correlation filter W, the initial response map r output by the correlation filtering module is calculated according to the following formula.
r = F⁻¹( Σ_{h=1}^{H} W̄^h ⊙ X_k^h )
F⁻¹(·) represents the inverse Fourier transform, ⊙ the dot product, and ‾ represents the complex conjugate operator.
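For concreteness, the frequency-domain training and response computation of step 6) can be sketched as below, assuming the standard multi-channel ridge-regression form of a discriminative correlation filter; the exact formulation in the patent may differ, and the function names are illustrative.

```python
# Hedged sketch of step 6): closed-form multi-channel correlation filter in the
# frequency domain (standard DCF ridge-regression form is assumed).
import numpy as np

def train_filter(x, y, lam=0.003):
    """x: (H_ch, h, w) feature channels; y: (h, w) desired response."""
    X = np.fft.fft2(x, axes=(-2, -1))
    Y = np.fft.fft2(y)
    num = np.conj(X) * Y                        # per-channel numerator
    den = np.sum(np.conj(X) * X, axis=0) + lam  # shared denominator with regularizer
    return num / den                            # W, shape (H_ch, h, w)

def response(W, z):
    """z: (H_ch, h, w) features of the search region; returns the response map r."""
    Z = np.fft.fft2(z, axes=(-2, -1))
    return np.fft.ifft2(np.sum(np.conj(W) * Z, axis=0)).real
```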
7) The detection module (for the pre-training process see 1.2) first applies a convolution with kernel size 3 × 3 to J_k; the feature dimension is unchanged and remains 256. The output of this convolution is input into a category decision branch, a target box regression branch and a mask branch respectively. The category decision branch applies a 3 × 3 convolution to its input with output dimension 80, i.e. the number of categories in the COCO dataset, where each dimension represents the confidence score of belonging to that category; the target box regression branch applies a 3 × 3 convolution with output dimension 4, comprising the coordinates of the upper-left and lower-right corners of the target box; the mask branch applies a 3 × 3 convolution with output dimension 32 and a tanh activation on the output to generate the coefficients c_i for each pixel, which are used by the semantic segmentation module to generate the target mask.
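A minimal PyTorch sketch of a three-branch head consistent with the text of step 7) (80 class scores, 4 box coordinates, 32 tanh mask coefficients per location) is given below; the module name DetectionHead, the ReLU after the shared convolution and the 256-channel input are assumptions, not the patent's exact architecture.

```python
# Hedged sketch of step 7): a three-branch prediction head.
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    def __init__(self, in_channels: int = 256):
        super().__init__()
        self.shared = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.cls_branch = nn.Conv2d(in_channels, 80, 3, padding=1)   # COCO categories
        self.box_branch = nn.Conv2d(in_channels, 4, 3, padding=1)    # x1, y1, x2, y2
        self.mask_branch = nn.Conv2d(in_channels, 32, 3, padding=1)  # mask coefficients

    def forward(self, feat):
        h = torch.relu(self.shared(feat))
        cls_scores = self.cls_branch(h)           # (N, 80, H, W)
        boxes = self.box_branch(h)                # (N, 4, H, W)
        coeffs = torch.tanh(self.mask_branch(h))  # (N, 32, H, W), values in [-1, 1]
        return cls_scores, boxes, coeffs
```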
8) By combining the category confidences and the regression boxes, the category and target box can be obtained pixel by pixel. Anchors are set according to aspect ratios {1:2, 1:1, 2:1}, and more accurate candidate boxes are obtained through non-maximum suppression (NMS). The candidate boxes are screened according to the target category obtained in the first frame, and the detection boxes in region R_k with the same category as the target are taken as the output of the detection module. At the same time, the mask coefficients of the corresponding pixels are obtained, expressed as C = tanh([c_1, c_2, ..., c_t]) ∈ R^{t×32}, where t represents the number of screened target boxes.
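The class screening and NMS of step 8) can be illustrated with torchvision's nms as below; the score threshold, the IoU threshold, and the assumption that boxes are already decoded to (x1, y1, x2, y2) are hypothetical choices not given in the patent.

```python
# Hedged sketch of step 8): keep only boxes of the target's class, then run NMS.
import torch
from torchvision.ops import nms

def select_candidates(boxes, cls_scores, coeffs, target_class,
                      score_thr=0.3, iou_thr=0.5):
    """boxes: (N, 4); cls_scores: (N, 80); coeffs: (N, 32) tanh mask coefficients."""
    scores = cls_scores[:, target_class]
    keep = scores > score_thr                       # same-class, confident detections
    boxes, scores, coeffs = boxes[keep], scores[keep], coeffs[keep]
    kept = nms(boxes, scores, iou_thr)              # non-maximum suppression
    return boxes[kept], scores[kept], coeffs[kept]  # C = coeffs, shape (t, 32)
```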
9) The segmentation module (for the pre-training procedure see 1.2) inputs J_k into a fully convolutional network (FCN), which applies 3 convolution layers with 3 × 3 kernels keeping the dimensionality unchanged, then one layer of 2× upsampling, then one 3 × 3 convolution, and finally a 1 × 1 convolution outputting a 32-dimensional semantic segmentation prototype, expressed as D = [d_1, d_2, ..., d_32] ∈ R^{32×n}, where n is the dimension of the feature map, i.e. the product of the feature map length and width.
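A minimal PyTorch sketch of a prototype-generating FCN matching the layer sequence of step 9) (three 3 × 3 convolutions, 2× upsampling, one 3 × 3 convolution, one 1 × 1 convolution to 32 channels) follows; the ReLU activations and the 256-channel intermediate width are assumptions.

```python
# Hedged sketch of step 9): a small FCN that produces 32 prototype maps.
import torch.nn as nn

class PrototypeFCN(nn.Module):
    def __init__(self, in_channels: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, 32, 1),  # 32-dimensional prototypes D
        )

    def forward(self, feat):
        # feat: (N, 256, h, w) -> prototypes: (N, 32, 2h, 2w)
        return self.net(feat)
```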
10) Combining the mask coefficients C output in step 7), the target masks M_t are generated according to the following formula, where p_{i,x,y} represents an element of the matrix C·D and t indicates that there are t target masks in total:
M_i(x, y) = p_{i,x,y},  P = C·D ∈ R^{t×n},  i = 1, ..., t
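One plausible reading of step 10) is sketched below: each screened detection contributes a coefficient row that is linearly combined with the 32 prototypes and reshaped to the feature-map size; whether an extra nonlinearity is applied on top of the product is not stated in the patent, so none is added here.

```python
# Hedged sketch of step 10): combine mask coefficients with prototypes.
# C: (t, 32) tanh mask coefficients; D: (32, n) prototypes, n = feat_h * feat_w.
import numpy as np

def assemble_masks(C, D, feat_h, feat_w):
    P = C @ D                          # (t, n): one linear combination per detection
    return P.reshape(-1, feat_h, feat_w)  # M_1..M_t, each of shape (feat_h, feat_w)
```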
11) A selection is made from M_t according to the following formula to obtain the final target mask M, where score represents the category confidence, dist denotes the distance from the mask center to the center of region R_k, and i represents the index of the mask; the mask with the largest ratio is taken as the final target mask M.
M = M_{i*},  i* = argmax_i ( score_i / dist_i )
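The selection rule of step 11) can be sketched as below; using the centroid of the binarized mask as the "mask center" and the small epsilon guarding against zero distance are assumptions.

```python
# Hedged sketch of step 11): pick the mask whose score/distance ratio is largest.
import numpy as np

def select_mask(masks, scores, region_center, eps=1e-6):
    """masks: (t, h, w); scores: (t,); region_center: (cx, cy) of R_k."""
    ratios = []
    for m, s in zip(masks, scores):
        ys, xs = np.nonzero(m > 0)          # support of the binarized mask
        if xs.size == 0:
            ratios.append(-np.inf)          # empty mask: never selected
            continue
        cx, cy = xs.mean(), ys.mean()       # mask centroid as "mask center"
        dist = np.hypot(cx - region_center[0], cy - region_center[1])
        ratios.append(s / (dist + eps))
    return masks[int(np.argmax(ratios))]    # final target mask M
```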
12) According to the target box output by the detection module, the initial response map r of the correlation filter is cropped: values inside the target box area are kept and values outside are set to 0, giving a new response map r_b. Then, combining the output of the segmentation module according to the following formula, the final semantically fused response map r_m is obtained, where p represents the weight of the mask M and is set to 0.2 in the invention.
r_m = (1 - p)·r_b + p·M
13) The position with the maximum response value in r_m is taken as the target position of this frame, and the correlation filter w is updated according to the following formula, where η represents the learning rate and is set to 0.03.
w_k = (1 - η)·w_{k-1} + η·w_k
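Steps 12) and 13) can be sketched as below, under the assumption that the segmentation mask M has been resized to the response-map resolution; the helper names and the box convention (x1, y1, x2, y2 in response-map coordinates) are illustrative.

```python
# Hedged sketch of steps 12)-13): crop the response map with the detected box,
# fuse it with the segmentation mask, locate the peak, and update the filter.
import numpy as np

def fuse_and_locate(r, box, M, p=0.2):
    """r: (h, w) CF response; box: (x1, y1, x2, y2); M: (h, w) target mask."""
    x1, y1, x2, y2 = box
    r_b = np.zeros_like(r)
    r_b[y1:y2, x1:x2] = r[y1:y2, x1:x2]      # keep values inside the detection box only
    r_m = (1 - p) * r_b + p * M              # semantic fusion of box-cropped response and mask
    peak = np.unravel_index(np.argmax(r_m), r_m.shape)
    return peak, r_m

def update_filter(W_prev, W_new, eta=0.03):
    return (1 - eta) * W_prev + eta * W_new  # linear (moving-average) filter update
```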
14) Judging whether all the images are processed or not, and if so, ending the process; otherwise go back to step 5).
1.2 detection and semantic segmentation Module Joint Pre-training
1) The images of the COCO2017 dataset are normalized so that the data distribution conforms to the standard normal distribution, and then the images are randomly cropped to a fixed size of 500 × 500.
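A possible torchvision preprocessing pipeline for this step is sketched below; the per-channel mean/std values (ImageNet statistics) are an assumption, since the patent only states that the distribution is normalized.

```python
# Hedged sketch of pre-training step 1): normalization + random 500x500 crop.
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # assumed statistics
                         std=[0.229, 0.224, 0.225]),
    transforms.RandomCrop(500, pad_if_needed=True),   # fixed 500x500 crops
])
```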
2) The network structure of the detection module and the segmentation module is as described in 1.1. The category judgment branch uses the smooth-L1 loss function, the target box regression branch uses the standard cross-entropy loss function, and the semantic segmentation module, combining the mask coefficients output by the detection network, adopts the loss function shown by the following formula, where the meanings of C, D and n are as in 1.1, G represents the ground-truth mask labels, and S represents the number of masks in the image:
L_mask = (1/S) Σ_{i=1}^{S} BCE( σ(c_i·D), G_i )
the total loss function of the network is the sum of the 3 loss functions described above.
3) The feature extraction network FPN + ResNet50 is initialized with network model parameters pre-trained on ImageNet. Training is optimized using the stochastic gradient descent (SGD) algorithm with the optimizer parameters set to: learning rate 0.001, momentum 0.9, weight decay 5×10⁻⁴.
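The stated optimizer settings map directly onto PyTorch's SGD, as sketched below; the placeholder module standing in for the full detection + segmentation network is an assumption.

```python
# Hedged sketch of pre-training step 3): SGD with the hyperparameters stated above.
import torch
import torch.nn as nn

model = nn.Conv2d(3, 256, 3)  # placeholder standing in for the detection + segmentation network

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.001,           # learning rate
    momentum=0.9,       # momentum
    weight_decay=5e-4,  # weight decay 5x10^-4
)
```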
4) The data are input into the network for training: 27 epochs are trained with 20000 images per epoch, and the resulting network model is used in the tracking process.

Claims (6)

1. An aerial video target tracking method based on a semantic perception network and correlation filtering, characterized by comprising the following steps:
step 1: reading the first frame image data and the parameters of the target box in the first frame image, R_target = [x, y, w, h], wherein x, y represent the horizontal and vertical coordinates of the upper left corner of the target, and w, h represent the width and height of the target;
step 2: determining a target region R_1 according to the target center position and the length and width of the first frame, R_1 = [x_center, y_center, 1.5w, 1.5h];
step 3: performing feature extraction in the region R_1, wherein the feature extraction network uses a ResNet50 residual network with a feature pyramid FPN to obtain 256-dimensional depth features J at 5 different scales S = {1.0, 0.8, 0.4, 0.2, 0.1} times the size of the original image;
step 4: inputting the features obtained in step 3 into the correlation filtering module and the detection module respectively; in the correlation filtering module, the part J_target of the feature J corresponding to R_target is cut out and used as the target template y_1; the target feature J_target is input into the category judgment branch of the detection module, and the network outputs the category information of the target;
step 5: reading the k-th frame image, wherein k ≥ 2 and the initial value is 2, and determining the k-th frame target region R_k according to the target parameters [x_{k-1}, y_{k-1}, w_{k-1}, h_{k-1}] of the previous frame; performing feature extraction on R_k using the method in step 3 to obtain the target feature J_k, and inputting J_k into the correlation filtering module, the detection module and the semantic segmentation module respectively;
step 6: in the correlation filtering module, let J_k be the training sample x_k of this frame, and train the correlation filter w in combination with the target template y_k of this frame; the training of w uses the following optimization model:
min_w L( f(x_k, w), y_k ) + λ‖w‖²
wherein f(·) represents the correlation operation, L(·) represents the squared loss function, and λ is the regularization parameter; for convenience of solution, x_k and y_k are transformed by the discrete Fourier transform to obtain X_k and Y_k, the above formula is converted into a frequency-domain calculation, W represents w in the frequency domain, and the solution is
W^h = ( X̄_k^h ⊙ Y_k ) / ( Σ_{h=1}^{H} X̄_k^h ⊙ X_k^h + λ )
wherein h represents the feature dimension of the training sample; after obtaining the correlation filter W, the initial response map r output by the correlation filtering module is calculated according to the following formula:
r = F⁻¹( Σ_{h=1}^{H} W̄^h ⊙ X_k^h )
wherein F⁻¹(·) represents the inverse Fourier transform, ⊙ represents the dot product, and ‾ represents the complex conjugate operator;
step 7: the detection module first applies a convolution with kernel size 3×3 to J_k; the output of this convolution is fed into a category judgment branch, a target box regression branch and a mask branch respectively; the category judgment branch applies a 3×3 convolution to its input with output dimension 80, i.e. the number of categories in the COCO dataset, where each dimension represents the confidence score of belonging to that category; the target box regression branch applies a 3×3 convolution with output dimension 4, comprising the coordinates of the upper-left and lower-right corners of the target box; the mask branch applies a 3×3 convolution with output dimension 32 and a tanh activation on the output to generate the coefficients c_i for each pixel, which are used by the semantic segmentation module to generate the target mask; the detection module needs to be pre-trained before the tracking algorithm is executed;
step 8: combining the category confidences and the regression boxes, the category and target box can be obtained pixel by pixel; anchors are set according to aspect ratios {1:2, 1:1, 2:1}, and candidate boxes are obtained through non-maximum suppression (NMS); the candidate boxes are screened according to the target category obtained in the first frame, so that the detection boxes in region R_k with the same category as the target are taken as the output of the detection module; at the same time, the mask coefficients of the corresponding pixels are obtained, expressed as C = tanh([c_1, c_2, ..., c_t]) ∈ R^{t×32}, where t represents the number of screened target boxes;
step 9: the semantic segmentation module inputs J_k into a fully convolutional network FCN, which first applies 3 convolution layers with 3×3 kernels keeping the dimensionality unchanged, then one layer of 2× upsampling, then one 3×3 convolution, and finally a 1×1 convolution outputting a 32-dimensional semantic segmentation prototype, expressed as D = [d_1, d_2, ..., d_32] ∈ R^{32×n}, where n is the dimension of the feature map, i.e. the product of the feature map length and width; the semantic segmentation module needs to be pre-trained before the tracking algorithm is executed;
step 10: combining the mask coefficients C output in step 7, the target masks M_t are generated according to the following formula, where p_{i,x,y} represents an element of the matrix C·D and t indicates that there are t target masks in total:
M_i(x, y) = p_{i,x,y},  P = C·D ∈ R^{t×n},  i = 1, ..., t
step 11: selecting from M_t according to the following formula to obtain the final target mask M, where score represents the category confidence, dist represents the distance from the mask center to the center of region R_k, and i represents the index of the mask; the mask with the maximum ratio is taken as the final target mask M:
M = M_{i*},  i* = argmax_i ( score_i / dist_i )
step 12: according to the target box output by the detection module, the initial response map r of the correlation filter is cropped: values inside the target box area are kept and values outside are set to 0, giving a new response map r_b; then, combining the output of the segmentation module according to the following formula, the final semantically fused response map r_m is obtained, where p represents the weight of the mask M;
r_m = (1 - p)·r_b + p·M
step 13: finding the position with the maximum response value in r_m and taking it as the target position of this frame, and updating the correlation filter w according to the following formula:
w_k = (1 - η)·w_{k-1} + η·w_k
wherein η represents the learning rate;
step 14: judging whether all the images are processed or not, and if so, ending the process; otherwise, returning to the step 5.
2. The aerial video target tracking method based on the semantic aware network and the correlation filtering as claimed in claim 1, wherein λ is 0.003.
3. The semantic aware network and correlation filtering based aerial video target tracking method according to claim 1, wherein H = 50.
4. The method for tracking the aerial video target based on the semantic aware network and the related filtering according to claim 1, wherein the value of p is 0.2.
5. The aerial video target tracking method based on the semantic aware network and the correlation filtering as claimed in claim 1, wherein the value η is 0.03.
6. The method for tracking aerial video target based on semantic aware network and correlation filtering as claimed in claim 1, wherein the detection module and semantic segmentation module are jointly pre-trained as follows:
1) normalizing the images of the COCO2017 dataset so that the data distribution conforms to a standard normal distribution, and then randomly cropping the images to a fixed size of 500 × 500;
2) the category judgment branch uses the smooth-L1 loss function, the target box regression branch uses the standard cross-entropy loss function, and the semantic segmentation module, combining the mask coefficients output by the detection network, adopts the loss function shown by the following formula:
L_mask = (1/S) Σ_{i=1}^{S} BCE( σ(c_i·D), G_i )
wherein G represents the ground-truth mask labels and S represents the number of masks in the image;
the total loss function of the network is the sum of the 3 loss functions;
3) initializing the feature extraction network FPN + ResNet50 with network model parameters pre-trained on ImageNet; training is optimized with the stochastic gradient descent SGD algorithm, and the optimizer parameters are set as follows: learning rate 0.001, momentum 0.9, weight decay 5×10⁻⁴;
4) inputting the data into the network for training: 27 epochs are trained with 20000 images per epoch, and the resulting network model is used in the tracking process.
CN202010028112.3A 2020-01-10 2020-01-10 Aerial video target tracking method based on semantic perception network and related filtering Active CN111260687B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010028112.3A CN111260687B (en) 2020-01-10 2020-01-10 Aerial video target tracking method based on semantic perception network and related filtering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010028112.3A CN111260687B (en) 2020-01-10 2020-01-10 Aerial video target tracking method based on semantic perception network and related filtering

Publications (2)

Publication Number Publication Date
CN111260687A true CN111260687A (en) 2020-06-09
CN111260687B CN111260687B (en) 2022-09-27

Family

ID=70943935

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010028112.3A Active CN111260687B (en) 2020-01-10 2020-01-10 Aerial video target tracking method based on semantic perception network and related filtering

Country Status (1)

Country Link
CN (1) CN111260687B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112883836A (en) * 2021-01-29 2021-06-01 中国矿业大学 Video detection method for deformation of underground coal mine roadway
CN113298036A (en) * 2021-06-17 2021-08-24 浙江大学 Unsupervised video target segmentation method
TWI797527B (en) * 2020-12-28 2023-04-01 國家中山科學研究院 Object re-identification detection system and method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106952288A (en) * 2017-03-31 2017-07-14 西北工业大学 Based on convolution feature and global search detect it is long when block robust tracking method
CN108734151A (en) * 2018-06-14 2018-11-02 厦门大学 Robust long-range method for tracking target based on correlation filtering and the twin network of depth
WO2018232378A1 (en) * 2017-06-16 2018-12-20 Markable, Inc. Image processing system
CN109740448A (en) * 2018-12-17 2019-05-10 西北工业大学 Video object robust tracking method of taking photo by plane based on correlation filtering and image segmentation
CN109816689A (en) * 2018-12-18 2019-05-28 昆明理工大学 A kind of motion target tracking method that multilayer convolution feature adaptively merges
CN110163887A (en) * 2019-05-07 2019-08-23 国网江西省电力有限公司检修分公司 The video target tracking method combined with foreground segmentation is estimated based on sport interpolation
CN110310303A (en) * 2019-05-06 2019-10-08 南昌嘉研科技有限公司 Image analysis multi-object tracking method

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106952288A (en) * 2017-03-31 2017-07-14 西北工业大学 Based on convolution feature and global search detect it is long when block robust tracking method
WO2018232378A1 (en) * 2017-06-16 2018-12-20 Markable, Inc. Image processing system
CN108734151A (en) * 2018-06-14 2018-11-02 厦门大学 Robust long-range method for tracking target based on correlation filtering and the twin network of depth
CN109740448A (en) * 2018-12-17 2019-05-10 西北工业大学 Video object robust tracking method of taking photo by plane based on correlation filtering and image segmentation
CN109816689A (en) * 2018-12-18 2019-05-28 昆明理工大学 A kind of motion target tracking method that multilayer convolution feature adaptively merges
CN110310303A (en) * 2019-05-06 2019-10-08 南昌嘉研科技有限公司 Image analysis multi-object tracking method
CN110163887A (en) * 2019-05-07 2019-08-23 国网江西省电力有限公司检修分公司 The video target tracking method combined with foreground segmentation is estimated based on sport interpolation

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
HAMED KIANI GALOOGAHI ET AL: "Learning Background-Aware Correlation Filters for Visual Tracking", 《ARXIV:1703.04590V2 [CS.CV]》 *
ION GIOSAN ET AL: "A solution for probabilistic inference and tracking of obstacles classification in urban traffic scenarios", 《2012 IEEE 8TH INTERNATIONAL CONFERENCE ON INTELLIGENT COMPUTER COMMUNICATION AND PROCESSING》 *
YANGLIU KUAI ET AL: "Target-Aware Correlation Filter Tracking in RGBD Videos", 《IEEE SENSORS JOURNAL 》 *
孙彦景 et al.: "Adaptive decision fusion target tracking algorithm based on multi-layer convolutional features", Journal of Electronics & Information Technology (电子与信息学报) *
谢超 et al.: "Research on video target tracking technology", China Doctoral Dissertations Full-text Database, Information Science and Technology (中国博士学位论文全文数据库 信息科技辑) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI797527B (en) * 2020-12-28 2023-04-01 國家中山科學研究院 Object re-identification detection system and method
CN112883836A (en) * 2021-01-29 2021-06-01 中国矿业大学 Video detection method for deformation of underground coal mine roadway
CN112883836B (en) * 2021-01-29 2024-04-16 中国矿业大学 Video detection method for deformation of underground coal mine roadway
CN113298036A (en) * 2021-06-17 2021-08-24 浙江大学 Unsupervised video target segmentation method
CN113298036B (en) * 2021-06-17 2023-06-02 浙江大学 Method for dividing unsupervised video target

Also Published As

Publication number Publication date
CN111260687B (en) 2022-09-27

Similar Documents

Publication Publication Date Title
US10719940B2 (en) Target tracking method and device oriented to airborne-based monitoring scenarios
CN110929578B (en) Anti-shielding pedestrian detection method based on attention mechanism
CN111310862B (en) Image enhancement-based deep neural network license plate positioning method in complex environment
CN109800689B (en) Target tracking method based on space-time feature fusion learning
CN108596101B (en) Remote sensing image multi-target detection method based on convolutional neural network
CN104378582B (en) A kind of intelligent video analysis system and method cruised based on Pan/Tilt/Zoom camera
CN113065558A (en) Lightweight small target detection method combined with attention mechanism
CN111539370A (en) Image pedestrian re-identification method and system based on multi-attention joint learning
CN111260687B (en) Aerial video target tracking method based on semantic perception network and related filtering
CN112926410A (en) Target tracking method and device, storage medium and intelligent video system
CN108960404B (en) Image-based crowd counting method and device
CN111126278B (en) Method for optimizing and accelerating target detection model for few-class scene
CN110929593A (en) Real-time significance pedestrian detection method based on detail distinguishing and distinguishing
CN116343330A (en) Abnormal behavior identification method for infrared-visible light image fusion
CN112084952B (en) Video point location tracking method based on self-supervision training
CN110310305A (en) A kind of method for tracking target and device based on BSSD detection and Kalman filtering
CN115272876A (en) Remote sensing image ship target detection method based on deep learning
Li et al. Weak moving object detection in optical remote sensing video with motion-drive fusion network
CN111683221A (en) Real-time video monitoring method and system for natural resources embedded with vector red line data
CN110334703B (en) Ship detection and identification method in day and night image
CN116630828A (en) Unmanned aerial vehicle remote sensing information acquisition system and method based on terrain environment adaptation
CN107730535B (en) Visible light infrared cascade video tracking method
CN115861709A (en) Intelligent visual detection equipment based on convolutional neural network and method thereof
WO2023086398A1 (en) 3d rendering networks based on refractive neural radiance fields
CN112614158B (en) Sampling frame self-adaptive multi-feature fusion online target tracking method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant