CN110929560A

CN110929560A - Video semi-automatic target labeling method integrating target detection and tracking

Info

Publication number: CN110929560A
Application number: CN201910963482.3A
Authority: CN
Inventors: 徐英; 谷雨; 刘俊; 彭冬亮; 陈庆林
Original assignee: Hangzhou Electronic Science and Technology University
Current assignee: Hangzhou Electronic Science and Technology University
Priority date: 2019-10-11
Filing date: 2019-10-11
Publication date: 2020-03-27
Anticipated expiration: 2039-10-11
Also published as: CN110929560B

Abstract

The invention discloses a video semi-automatic target marking method integrating target detection and tracking. In the subsequent frames, fusing an image-based target detection algorithm and an image sequence-based video target tracking algorithm to estimate the position of a target in an image; the invention judges whether the target marking is finished according to the target tracking algorithm. If the target marking is finished, extracting the video key frames according to the size of the significant value of each frame of target to obtain a target marking result, and otherwise, continuously estimating the position of the target in the video image; the method for extracting the video key frame based on the target significance enables the key frame to reflect the diversity of target changes. The invention adopts multi-lens multi-ship video to carry out experimental test, and verifies the effectiveness of the method provided by the invention.

Description

Video semi-automatic target labeling method integrating target detection and tracking

Technical Field

The invention belongs to the field of video data marking, and relates to a video target marking method which integrates target detection and target tracking and extracts video key frames according to target significance.

Background

In recent years, the deep learning technology is rapidly developed, and the target detection and target tracking field is promoted to continuously realize new breakthroughs. Because the deep learning technology needs the support of big data, obtaining a large amount of accurate label training data with sample diversity is the key for obtaining excellent performance by the deep learning technology.

At present, two methods of manual marking and automatic marking are mainly used for acquiring training data. The manual marking adopts a manual mode to mark the target position and the label in a single image, a large number of continuous image frames exist in the video, the manual marking efficiency is low, and the automatic marking becomes possible due to the fact that the target in the video has the characteristic of space-time continuity. In the prior art, only a target tracking algorithm based on relevant filtering is used for video target labeling, and the accuracy of a labeling result cannot meet the requirement of being used as training data. And only the target detection algorithm is used for marking the video target, the detector marks all targets which accord with the target type in the subsequent frame according to the type of the target of the initial frame, and whether the targets are the same as the initial frame cannot be judged, or the detector fails to detect due to factors such as jitter and blurring of the target and the like to cause inconsistent marking of the video target. The invention integrates the detection and tracking algorithms, combines the advantages of the two algorithms, can improve the accuracy of automatic labeling, can determine the same target by utilizing the space-time continuity of the tracking algorithm, solves the problem of detection omission of the detector, can automatically judge that the target disappears and improves the labeling efficiency.

The invention provides a video semi-automatic labeling method, which comprises the steps of manually labeling a target position in an initial frame, automatically labeling the target position in a subsequent frame, and finally automatically extracting a plurality of key frames to obtain a labeling result. The main problems to be solved include: (1) how to improve the accuracy and the consistency of video target labeling is the first problem to be solved. (2) In order to reduce manual participation and improve the labeling efficiency, it is necessary to automatically determine the target disappearance and the end of labeling. (3) The extracted key frames can reflect the diversity of changes of target dimension size, angle, illumination and the like.

Aiming at the condition that the current independent target detection algorithm or target tracking algorithm cannot meet the automatic labeling requirement of the video target, the invention integrates target detection and target tracking through reasonable rules, thereby greatly improving the efficiency and accuracy of labeling the video target; in addition, a method for extracting video key frames based on target significance is provided, so that the extracted key frames can accurately reflect the diversity of target changes.

Disclosure of Invention

The invention provides a video semi-automatic target marking method integrating target detection and tracking, aiming at solving the technical problems that the existing automatic marking means is low in precision and continuity or low in manual marking speed.

Firstly, a certain frame is selected as an initial frame in a video image, the initial position of a target is marked manually, and a category label of the target is determined. And in the subsequent frames, fusing an image-based target detection algorithm and an image sequence-based video target tracking algorithm to estimate the position of the target in the image, and judging whether the target marking is finished according to the target tracking algorithm. And if the target marking is finished, extracting the video key frames according to the size of the significant value of each frame of target to obtain a target marking result, and otherwise, continuously estimating the position of the target in the video image. .

The technical scheme adopted by the invention comprises the following steps:

1. the video semi-automatic target marking method integrating target detection and tracking is characterized by comprising the following steps of:

selecting a certain frame as an initial frame in a certain shot of a video, manually marking the initial position and size of a target, and determining a category label of the target;

step (2), adopting automatic labeling for other subsequent frames after the initial frame, specifically fusing an image-based target detection algorithm and an image sequence-based video target tracking algorithm to estimate the position of a target in an image; the method comprises the following steps:

2.1 detecting the target in each frame of image by adopting YOLO V3 and marking a detection frame;

the YOLO V3 is a training sample obtained by adjusting the size of the labeled target image to a fixed scale, and training YOLO-V3; wherein, the YOLO layer is increased to 4 layers, and four different receptive field characteristic maps with different scales of 13 multiplied by 13, 26 multiplied by 26, 52 multiplied by 52 and 104 multiplied by 104 are obtained through multi-scale characteristic fusion; using three prior boxes of (116x90), (156x198) and (373x326) to predict the 13 x13 feature map, detecting a larger object; predicting the 26 × 26 feature map by using three prior boxes of (30x61), (62x45) and (59x119), and detecting objects with medium sizes; using three prior boxes of (10x13), (16x30) and (33x23) to predict a 52 x 52 feature map, detecting smaller objects; the 104 x 104 feature map is predicted by using three kinds of prior boxes of (5x6), (8x15) and (16x10) which are newly added, and a smaller target is detected;

2.2 acquiring a tracking frame of the target by adopting a KCF related filtering tracking algorithm;

firstly, HOG characteristics are extracted according to the target position and size of the previous frame, then the HOG characteristics are converted into a frequency domain through Fourier transformation, the obtained frequency domain characteristics are mapped to a high dimension through a Gaussian kernel function, and a filtering template α is obtained according to the formula (1):

wherein x represents the HOG characteristic of the sample, ^ represents Fourier transform, g is a two-dimensional Gaussian function with the center as the peak value, and λ is a regularization parameter used for controlling overfitting of training; k is a radical of^xxThe kernel autocorrelation matrix of x in the high-dimensional space is represented, and the calculation mode is given by the formula (2):

in which σ is a Gaussian kernel functionThe width parameter, which controls the radial extent of action of the function, represents the complex conjugate, ⊙ represents the dot product,

representing the inverse fourier transform, c is the number of channels of the HOG feature x;

when the target is tracked on the image of the t-th frame, the update of the correlation filter α is given by:

η is an update parameter;

to accommodate the scale change of the object, the filter α of the current frame_tScaling is needed so as to predict the size of the next frame of target; wherein the scaling ratios are [1.1,1.05,1,0.95,0.9]；

Extracting a candidate sample HOG characteristic z at a t frame target position on a t +1 frame image; in conjunction with each of the above-mentioned size-scaled filters, each corresponding filtered output response plot f is shown in equation (4):

where m ═ 1,2,3,4,5, corresponding to scaled ratios [1.1,1.05,1,0.95,0.9], respectively; x represents the HOG characteristic of the t frame target;

the maximum value f is screened from the 5 response graphs f maximum values max (f)_max,f_maxThe corresponding position is the position of the target center, f_maxThe corresponding scaling is the target size, and a tracking frame of the t +1 th frame is obtained;

2.3 fusing the results of target detection and target tracking to determine the labeled target frame;

firstly, judging whether each frame of image contains a detection frame or not, and if not, taking the target frame as a tracking frame; if yes, continuously judging whether the detection frame is only one, if yes, calculating the IOU values of the tracking frame and the detection frame, if the IOU value is larger than a threshold value, taking the target frame as the detection frame, initializing a KCF tracking algorithm by using the detection frame, and if not, taking the target frame as the tracking frame; if the number of the detection frames is multiple, the IOU value of the tracking frame and each detection frame needs to be calculated, the maximum IOU value is further screened out, if the maximum IOU value is larger than a threshold value, the target frame is the detection frame corresponding to the maximum IOU value, a KCF tracking algorithm is initialized by using the detection frame, and if not, the target frame is the tracking frame;

the IOU value is used for evaluating the coincidence degree of the tracking frame and each detection frame under the current frame, and the formula is as follows:

wherein S_IIndicates the overlapping area of the tracking frame and each detection frame under the same frame, S_URepresenting the area of the set part of the tracking frame and each detection frame under the same frame, wherein the area of the set part is the sum of the areas of the tracking frame and the detection frame minus the overlapping area;

step (3), judging whether the target marking is finished or not according to a target tracking algorithm;

judging whether max (f) is smaller than a set threshold value theta and whether the Peak Sidelobe Ratio (PSR) is smaller than the set threshold value theta according to a response graph f of the KCF correlation filtering tracker_PSRWhen, namely:

max(f)<θandPSR<θ_PSR(7)

if yes, judging that the target marking is finished, and turning to the step (4) to select the key frame; otherwise, turning to the step (2), and continuing to estimate the position of the target in the next frame image;

the peak side lobe ratio (PSR) is calculated as follows:

where max (f) is the peak value of the correlation filter response map f, and Φ is 0.5，μ_Φ(f) And σ_Φ(f) Mean and standard deviation of 50% response area centered on the f peak, respectively;

step (4), calculating a significant value of each frame of target in the current shot; extracting a set number of video key frames according to the significant value of each frame of target to obtain a target labeling result; the method comprises the following steps:

4.1LBP (local binary pattern) extracting the texture feature of the image, the basic idea is that the texture feature is defined in the neighborhood of pixel 3x3, the neighborhood center pixel is used as a threshold value, the gray value of 8 adjacent pixels is compared with the threshold value, if the surrounding pixels are larger than the center pixel value, the position of the pixel point is marked as 1, otherwise, the position is 0; comparing 8 points in 3-by-3 neighborhood to generate 8-bit binary number, converting into decimal number to obtain LBP value of central pixel, and reflecting LBP information of the region with the value; the specific calculation formula is shown as (8):

wherein (x)₀,y₀) Is the coordinate of the central pixel, p is the p-th pixel of the neighborhood, j_pIs the gray value of the neighborhood pixel, j₀The gray value of the central pixel; s (x) is a sign function:

4.2 the calculation formula of the color saliency map is as follows:

wherein the patch is an original image of the target frame area_gaussianThe method is characterized in that the image is the image of patch after Gaussian filtering processing with 5 multiplied by 5 Gaussian kernel and 0 standard deviation, | | represents absolute value, i represents channel number, and (x, y) is pixel coordinate;

4.3 obtaining edge saliency characteristic map for pixel points of target edge region in each frame of image target frame

In the target edge area in the target frame, pixel values can jump, derivatives are obtained for the pixel values, and the first derivative of the derivatives is an extreme value at the edge position, namely the edge is at the extreme value, which is the principle used by the Sobel operator; if the second derivative is calculated for the pixel value, the derivative value at the edge is 0; the method for realizing the Laplace function is that first Sobel operators are used for calculating second-order x and y derivatives, then edge significance characteristic graphs are obtained through summation, and the calculation formula is as follows:

wherein I represents an image in the target frame, and (x, y) represents pixel coordinates of a target edge region in the target frame;

4.4, carrying out average weighted fusion on the LBP texture characteristics, the color saliency characteristics, the edge saliency characteristics and other characteristics to obtain a fusion value mean, wherein a fusion calculation formula is as follows:

wherein the content of the first and second substances,

respectively representing the values of pixel points (x, y) in an LBP texture characteristic graph, a color saliency characteristic graph and an edge saliency characteristic graph in the t frame;

4.5 the color histogram variation value Dist is obtained by calculating the babbitt distance between the color histogram of the selected target area of the initial frame and the target area of the t-th frame, and the calculation formula is as follows:

wherein H₀Manually labeling a selected target frame color histogram for an initial frame, H_tAutomatically labeling the color histogram of the target frame for the t-th frame, n representing the total number of color histogram bins,

the calculation formula of (a) is given by:

wherein k is 0 or t;

4.6 the scale change value is obtained by calculating the width and height change of the initial frame target frame and the t frame target frame, and the calculation formula is as follows:

wherein

For the width and height of the target box of the initial frame,

and

the width and height of the target frame of the t frame;

4.7 according to the fusion value, the color histogram change value and the scale change value of the image target frame region, the calculation formula of the target significant value of the t-th frame is as follows:

wherein T represents the total number of frames of the video;

4.8 saliency S of each frame object in video_tConstructing a significant value line graph, and solving all peak values and corresponding frames;

assuming that the video has T frames, setting the number of extracted key frames as n; the number of the significant value peak values is k, if n is less than k, the peak values are sorted in a descending order, frames corresponding to the first n peak values are extracted as key frames, if k is less than n and less than T, frames corresponding to all the peak values are extracted, and the rest n-k key frames adopt a random and unrepeated extraction mode; if n is larger than T, all video frames are used as key frames;

and (5) returning to the step (1) to label the target of the next video shot.

Compared with the prior art, the invention has the following remarkable advantages: (1) the invention creatively fuses the target detection algorithm and the target tracking algorithm, thereby improving the accuracy of target positioning and the continuity of target state estimation in the video image; (2) only the target initial position needs to be marked manually in the initial frame, and the marking is automatically judged to be finished in the marking process, so that the times of artificial participation are reduced; (3) and fusing the LBP texture features, the color saliency features and the edge saliency features of the target region, and calculating the target saliency by combining the color histogram change and the scale change, so that the extracted key frame can reflect the diversity of the target change.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a flow chart of fused target detection and target tracking;

FIG. 3 is a flow chart of target saliency calculation;

FIG. 4 is a detection result of a 2 nd frame image in an example video;

FIG. 5 is a tracking result of a 2 nd frame image in an example video;

FIG. 6 is a fusion detection and tracking result for a 2 nd frame of image in an example video;

FIG. 7 is a KCF response plot peak change curve for the 2 nd lens of an example video;

FIG. 8 is a plot of the peak-to-side lobe ratio variation of the KCF response plot for the 2 nd lens of the example video;

FIG. 9 is a 243 frame image of an example video shot 2;

FIG. 10 is a 1 st frame image of an example video shot 3;

FIG. 11 is a target saliency curve for an example video shot 6;

fig. 12 is a key frame extracted for the 6 th shot of an example video.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

As shown in FIG. 1, the method comprises the following steps:

selecting a certain frame in a video image as an initial frame, manually marking the initial position of a target, and determining the category label of the target.

And (2) fusing an image-based target detection algorithm and an image sequence-based video target tracking algorithm to estimate the position of a target in an image in a subsequent frame. The invention adopts a YOLO V3 detection algorithm and a KCF related filtering tracking algorithm, and a fusion method is shown as a figure 2 and specifically comprises the following steps:

2.1 the detector of the invention adopts the fast YOLO V3 detection algorithm in the current mainstream detection network, which meets the requirements of real-time and accuracy in the video annotation technology, including the feature extraction network Darknet-53 and the prediction network, the Darknet53 network adopts the ResNet shortcut connection, avoiding the gradient disappearance, in the prediction phase, the algorithm uses the method of region of interest extraction based on anchors in the RPN network, and fpn (feature Pyramid) network uses 3 scales of feature maps, small feature maps provide semantic information, large feature maps have finer granularity information, small feature maps are fused by upsampling and large scale, better detection effect is realized, in addition, compared with V1 and V2, yoo V3 does not use soft max loss function any more, but uses sigmod max + entropy cross function, thereby supporting the prediction of multiple labels.

The invention carries out the following improvement and optimization on the basis of the original model:

firstly, initializing training parameters by adopting a darknet53.conv.74 pre-training model in a feature extraction part, then increasing a YOLO layer of an original model to 4 layers, obtaining four different receptive field feature maps of 13 x13, 26 x 26, 52 x 52 and 104 x 104 in different scales through multi-scale feature fusion, and then predicting the feature map of 13 x13 by using three prior frames (116x90), (156x198) and (373x326) to detect a larger object; using (30x61), (62x45), (59x119) to predict the 26 x 26 feature map, detecting objects of medium size; the 52 × 52 feature map is predicted using (10x13), (16x30), (33x23) to detect smaller objects, and the 104 × 104 feature map is predicted using the newly added (5x6), (8x15), (16x10) three prior blocks to detect smaller objects. Compared with the original model, the improved detection network integrates the characteristics of lower layers, so that the detection rate of the small target is improved.

In each detection operation, inputting a t +1 th frame image, firstly resize to a fixed scale, and finally obtaining a detection frame containing an object category and a score value as a detection result of the t +1 th frame through a feature extraction network and a prediction network.

2.2 the KCF related filtering tracking algorithm firstly extracts HOG characteristics according to the target position and size of the t-th frame, then transfers the HOG characteristics to a frequency domain through Fourier transformation, maps the obtained frequency domain characteristics to a high dimension through a Gaussian kernel function, and obtains a filtering template α according to the formula (1):

where x represents the HOG characteristics of the sample, a represents the Fourier transform, g is a two-dimensional Gaussian function centered at the peak, and λ is a regularization parameter used to control the overfitting of the training. k is a radical of^xxThe kernel autocorrelation matrix of x in the high-dimensional space is represented, and the calculation mode is given by the formula (2):

where σ is the width parameter of the gaussian kernel function, which controls the radial extent of action of the function, denotes the complex conjugate, ⊙ denotes the dot product,

denotes the inverse fourier transform and c is the number of channels of the HOG feature x.

When performing object tracking on the tth frame image, the update of the correlation filter α is given by:

and 2.3, fusing the results of target detection and target tracking to determine the labeled target frame.

wherein S_IIndicates the overlapping area of the tracking frame and each detection frame under the same frame, S_UAnd the area of the collection part of the tracking frame and each detection frame under the same frame is represented. The area of the set part is the total area of the tracking frame and the detection frame minus the overlapping area;

and (3) the peak value of the response image f of the KCF correlation filtering tracker represents the confidence that the corresponding position is the target, and the higher the peak value is, the higher the probability that the position is the target is. The peak-to-side lobe ratio (PSR) measures the peak intensity of the correlation filtering output, and the higher the PSR value is, the higher the reliability of the tracking result is. If the peak value and the PSR are lower than the set threshold values, the target is possibly disappeared, and therefore the video target marking is judged to be finished. The peak side lobe ratio (PSR) is calculated as follows:

where max (f) is the peak value of the correlation filter response diagram f, Φ is 0.5, μ_Φ(f) And σ_Φ(f) Mean and standard deviation, respectively, of a 50% response region centered at the f-peak. If max (f) is less than the set threshold θ and PSR is less than the set threshold θ_PSRWhen, namely:

max(f)<θandPSR<θ_PSR(7)

and (4) judging that the target marking is finished, and turning to the step (4) to select the key frame. Otherwise, turning to the step (2), and continuing to estimate the position of the target in the next frame image.

And (4) calculating a significant value of each frame of target, as shown in fig. 3, in the labeling process, acquiring a target region by using the target frame obtained in the step (2), then performing LBP texture feature, color significant feature and edge significant feature fusion on the target region, and calculating the significant value of the target by combining color histogram change and scale change. The method comprises the following specific steps:

4.1LBP (local binary pattern) extracts the texture feature of the target region, the basic idea is to define in the neighborhood of pixel 3x3, with the neighborhood center pixel as the threshold, the gray value of the adjacent 8 pixels is compared with it, if the surrounding pixel is greater than the center pixel value, the position of the pixel is marked as 1, otherwise, it is 0. 8 points in 3-by-3 neighborhood can generate 8-bit binary numbers through comparison, the LBP value of the central pixel can be obtained through conversion into decimal numbers, and the LBP information of the area is reflected through the value. The specific calculation formula is shown as (8):

wherein (x)₀,y₀) Is the coordinate of the central pixel, p is the p-th pixel of the neighborhood, j_pIs the gray value of the neighborhood pixel, j₀Is the gray value of the neighborhood pixel. s (x) is a sign function:

4.2 the calculation formula of the color saliency map is as follows:

wherein the patch is a target area image_gaussianFor the image after patch is subjected to gaussian filtering with a gaussian kernel of 5 × 5 and a standard deviation of 0, | | represents an absolute value, i represents the number of channels of the image, and (x, y) is a horizontal coordinate and a vertical coordinate.

4.3 in the edge region of the target region image, the pixel values will "jump", and the derivative of these pixel values is determined, and its first derivative is extreme at the edge position, which is the principle used by the Sobel operator-the extreme is the edge. If the second derivative is taken over the pixel value, the derivative value at the edge is 0. The method for realizing the Laplace function is that first Sobel operators are used for calculating second-order x and y derivatives, then edge significance characteristic graphs are obtained through summation, and the calculation formula is as follows:

wherein I represents an image, (x, y) represents pixel coordinates of an edge region of the object in the object frame;

wherein the content of the first and second substances,

and respectively representing the values of pixel points (x, y) in an LBP texture characteristic graph, a color saliency characteristic graph and an edge saliency characteristic graph in the t-th frame.

4.5 color histogram of the target area image represents the distribution of color components in the image, showing different types of colors and the number of pixels in each color. The color histogram change value Dist is obtained by calculating the babbit distance between the color histogram of the initial frame selected target area and the color histogram of the t frame target area, the greater the Dist value is, the lower the similarity is, the more obvious the target change is, and the calculation formula is as follows:

wherein H₀Selecting a target region color histogram for the initial frame, H_tIs the color histogram of the target region of the t-th frame, n represents the total number of color histogram bins,

the calculation formula of (a) is given by:

where k is 0 or t.

wherein

For the width and height of the target box of the initial frame,

and

the width and height of the target box of the t-th frame.

4.7 through the above calculation, the calculation formula of the target significant value of the t-th frame is as follows:

where T represents the total number of video frames for a shot.

And 4.8 drawing a significant value line graph according to the significant value of each frame of target in the scene shot to obtain all peak values and corresponding frames. Assuming that the shot has T video frames, the number of key frames to be extracted is n, the number of peak values is k, if n is less than k, the peak values are sorted in a descending order, the frames corresponding to the first n peak values are extracted as key frames, if k is less than n, the frames corresponding to all the peak values are extracted, and the rest n-k key frames adopt a random and unrepeated extraction mode; if n > T, all video frames are used as key frames.

And (5) returning to the step (1) to mark the target of the next lens.

In order to verify the effectiveness of the method provided by the invention, a section of multi-lens multi-ship video is adopted for experimental testing. The video has 9 scene shots of multiple ships, the frame number of each scene shot is shown in table 1, and for accelerated calculation, the experiment is labeled once every 5 frames.

TABLE 1 video shot and frame number

In the stage of target detection, a single-stage target detection algorithm YOLO V3 is trained on a large number of labeled samples with ship label information and position information to obtain a detection model, and then the model is used as a detector. Considering that the original algorithm has low capability of detecting small targets, the method adds small scale on the original basisThe Anchor improves the defect of low detection precision, improves the detection capability of targets with various scales on the premise of ensuring the detection speed, and realizes accurate real-time detection. In the target tracking stage, the parameter setting lambda of the KCF tracking algorithm is 1 multiplied by 10^-4And sigma is 0.5, and η is 0.02, considering that the original algorithm cannot adapt to the change of the target scale, the scale judgment is added to the KCF tracking algorithm, and the improved KCF tracking algorithm is used as a tracker.

In the stage of fusing the detection result and the tracking result, the IOU threshold is set to be 0.5. And if the IOU value of the tracking frame and each detection frame is less than 0.5, the detector does not detect the target to be marked, and the target frame of the target is the tracking frame. If the IOU values of the tracking frame and one or more detection frames are greater than 0.5, which indicates that the detector detects the target to be labeled, the target frame of the target is the detection frame corresponding to the maximum IOU value. For example, after the 1 st frame of the video shot 1 is manually marked with the target, the detection result and the tracking result of the 2 nd frame are shown in fig. 4 and 5. As can be seen from the figure, there are multiple targets in the detection result of the detector, and the tracking result of the tracker has only one target. By calculating the IOU values of the tracking frame and each detection frame, only one detection frame and the IOU value of the tracking frame are greater than the threshold value of 0.5, the result of fusing and outputting the target frame is shown in fig. 6, and the result of fusing and outputting the detection frame is output.

When judging whether the target marking is finished, setting a peak value threshold theta of a KCF tracker to be 0.3 and setting a peak value sidelobe ratio threshold theta_PSRAnd 3.5, and when the ratio of the peak value to the peak sidelobe is smaller than the threshold value, the marking is finished. For example, when the target disappears during the process of labeling the 2 nd lens of the video, the response diagram of the KCF tracking algorithm has smaller peak value and peak-to-side lobe ratio, as shown in fig. 7 and 8. In 0-47 frames under a scene lens, the ratio value of the peak value and the peak value sidelobe of a response graph of a KCF tracking algorithm is larger, the peak value and the peak value sidelobe in the 48 th frame are smaller, the target of the frame disappears, 243 frames are actually and just corresponding to the scene lens, the scene lens is marked once every 5 frames, and the scene lens of the next frame of 243 frames is switched. Wherein the 243 st frame image of the shot 2 and the 1 st frame image of the shot 3 are as shown in fig. 9 and 10. In the figure it can be seen that the video is cut by shot 2And the target disappears due to the fact that the shot 3 is replaced, which shows that the method judges that the marking is finished accurately.

When the tracker judges that the video shot target marking is finished, a video shot target significant value curve is obtained according to the target significant value of each frame, key frames are extracted at the local maximum value of the curve, and in the experiment, 10 frames are extracted from each shot to serve as the key frames. For example, the target saliency curve for shot 6 is shown in fig. 11. The local maxima are arranged from large to small, then the frames corresponding to the first 10 local maxima are taken as key frames, and the extracted key frames are shown in fig. 12 (a-j). As can be seen from the figure, the extracted key frame has strong representativeness, and the diversity of changes such as the size, the angle and the like of the target size can be accurately reflected.

The results of this experiment are shown in table 2,

TABLE 2 Key Frames for each shot

Lens barrel	Key frame
		1	5，10，25，30，40，50，55，65，75，80
2	90，110，125，135，145，160，180，195，205，215
		3	325，340，365，380，400，420，430，445，460，480
4	1099，1109，1119，1139，1149，1159，1169，1179，1329，1369
		5	1424，1519，1559，1594，1604，1624，1634，1674，1754，1764
6	1779，1854，1869，1994，2054，2064，2089，2114，2144，2154
		7	2194，2199，2214，2229，2249，2269，2279，2289，2294，2314
8	2349，2359，2379，2399，2414，2424，2444，2459，2474，2539
		9	2974，3094，3164，3179，3189，3199，3214，3229，3259，3274

The extraction ranges of the key frames are all in the corresponding shots, and further prove that the method can distinguish different shots and automatically judge the end of target marking. The method adopts the local maximum value of the target significant value as the extraction basis of the key frame, so that the extracted key frame is representative. According to the experimental result, the video target labeling method based on the fusion target detection algorithm and the target tracking algorithm obtains higher accuracy.

Claims

η is an update parameter;

max(f)<θandPSR<θ_PSR(7)

the peak side lobe ratio (PSR) is calculated as follows:

where max (f) is the peak value of the correlation filter response diagram f, Φ is 0.5, μ_Φ(f) And σ_Φ(f) Mean and standard deviation of 50% response area centered on the f peak, respectively;

4.2 the calculation formula of the color saliency map is as follows:

wherein the patch is an original image of the target frame area_gaussianFor the image after the patch is processed by the gaussian filter with a gaussian kernel of 5 × 5 and a standard deviation of 0, | | represents the absolute value, | represents the number of channels, and (x, y) is the pixel coordinate；

wherein the content of the first and second substances,

wherein H₀Manually labeling a selected target frame color histogram for an initial frame, H_tIs the t-th frameAutomatically labeling the color histogram of the target box, n representing the total number of color histogram bins,

the calculation formula of (a) is given by:

wherein k is 0 or t;

wherein

For the width and height of the target box of the initial frame,

and

the width and height of the target frame of the t frame;

wherein T represents the total number of frames of the video;

and (5) returning to the step (1) to label the target of the next video shot.