CN111488948A - Method for marking sparse samples in jitter environment - Google Patents

Method for marking sparse samples in jitter environment

Info

Publication number
CN111488948A
CN111488948A (application CN202010358369.5A)
Authority
CN
China
Prior art keywords
layer
frame
rpn
anchors
mask
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010358369.5A
Other languages
Chinese (zh)
Other versions
CN111488948B (en)
Inventor
张学睿
张帆
姚远
郑志浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Chongqing Institute of Green and Intelligent Technology of CAS
Original Assignee
Chongqing Institute of Green and Intelligent Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Institute of Green and Intelligent Technology of CAS filed Critical Chongqing Institute of Green and Intelligent Technology of CAS
Priority to CN202010358369.5A priority Critical patent/CN111488948B/en
Publication of CN111488948A publication Critical patent/CN111488948A/en
Application granted granted Critical
Publication of CN111488948B publication Critical patent/CN111488948B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F18/214: Pattern recognition; Analysing; Design or setup of recognition systems; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2415: Pattern recognition; Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N3/045: Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
    • G06N3/08: Computing arrangements based on biological models; Neural networks; Learning methods
    • G06T3/02: Geometric image transformations in the plane of the image; Affine transformations
    • G06T5/70: Image enhancement or restoration; Denoising; Smoothing
    • G06T5/73: Image enhancement or restoration; Deblurring; Sharpening
    • G06V10/25: Image or video recognition or understanding; Image preprocessing; Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V10/757: Image or video recognition or understanding; Image or video pattern matching; Matching configurations of points or features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method for marking sparse samples in a jittering environment, and belongs to the technical field of image recognition. The method comprises the following steps: S1: de-jittering the input video file with a de-jittering algorithm; S2: identifying sparse samples with an improved Mask fast RCNN model; S3: constructing an intelligent marking system and manually marking the identified sparse samples; S4: updating the training set: the marked data are returned to the training data set for the next round of training of the improved Mask fast RCNN model. Aimed at the problem of sparse valid samples in surveillance video of sparsely populated environments, and taking into account the difficulty of video instability caused by a shaking environment, the method can mark specific targets and improve the usefulness of the video.

Description

Method for marking sparse samples in jitter environment
Technical Field
The invention belongs to the technical field of image recognition, and relates to a method for marking sparse samples in a shaking environment.
Background
In some specific scenes, targets enter and leave the camera's field of view infrequently and the camera is prone to shaking, so few recognition samples are collected, and the scarcity of samples makes recognition algorithms less effective. Current methods for recognizing samples in video rarely consider the influence of scarce samples on recognition accuracy. A sparse-sample marking method is therefore developed to improve target recognition in such specific scenes.
Disclosure of Invention
In view of the above, the present invention provides a method for marking sparse samples in a jittering environment. Aimed at the problem of sparse valid samples in surveillance video of sparsely populated environments, and taking into account the difficulty of video instability caused by a shaking environment, the method can mark specific targets and improve the usefulness of the video.
In order to achieve the purpose, the invention provides the following technical scheme:
a method for marking sparse samples in a jitter environment specifically comprises the following steps:
s1: adopting a debounce algorithm to debounce the input video file;
s2: identifying sparse samples by using an improved Mask fast RCNN model;
s3: constructing an intelligent marking system, and manually marking the identified sparse samples;
s4: updating a training set: the marked data are returned to the training data set for the next round of training of the improved Mask fast RCNN model.
Further, in step S1, the debouncing algorithm specifically includes the following steps:
s11: inputting a video file, and calculating the SIFT feature points and descriptors of each frame image;
s12: performing optimal matching on feature points between adjacent frames, specifically comprising:
s121: calculating, through a nearest neighbor algorithm, the 2 best-matching feature points (namely the matching feature points) corresponding to each feature point (namely the original feature point) of the previous frame image;
s122: calculating the Euclidean distances of the 2 best-matching feature points; if the distance is smaller than a certain threshold, the original feature point is matched successfully and the matched points can serve as its matching feature points; otherwise the original feature point has no corresponding matching feature point and the matching fails.
S123: all successfully matched feature point pairs form the optimal matching between adjacent frames;
s13: calculating a 2 × 3 affine transformation matrix T between adjacent frames from the best-matching feature points, thereby obtaining an estimate of the camera motion trajectory;
s14: performing parameter calculation on the affine transformation matrix, calculating the following parameters: T[0][2], T[1][2], atan2(T[1][0], T[0][0]), sqrt(T[1][0]^2 + T[0][0]^2);
s15: smoothing the parameters, ensuring that the absolute difference between each smoothed parameter and its original value is smaller than a certain threshold;
s16: recalculating the affine transformation matrix according to the smoothed parameters;
s17: carrying out translation, rotation, scaling, shearing, reflection and other transformations on the original video image frame by frame according to the new affine transformation matrix;
s18: and uniformly cutting the transformed video images and combining the video images into a new video, namely the de-jittering video.
Further, in step S2, the improved Mask fast RCNN model comprises: a backbone network for feature extraction, an FPN feature pyramid network and an RPN region generation network.
Further, the backbone network for feature extraction comprises five large layers; the first and second layers reduce the feature map size by setting certain convolution kernels, strides, paddings and kernel numbers. The second layer also comprises a BN layer, a ReLU layer and a pooling layer, where the BN layer is used to normalize parameters; in addition, the second layer adds a down-sampling branch beside the main path to double the number of channels of the original input feature map.
Further, the FPN feature pyramid network specifically comprises: the feature map output by the fifth layer of the backbone network is up-sampled with the number of channels unchanged and is then added to the feature map of the fourth layer; repeating this operation with the resulting feature maps yields three different feature maps, namely P4 for the fourth layer, P3 for the third layer and P2 for the second layer, while P5 of the fifth layer is unchanged; P2 to P5 then each undergo one convolution to eliminate the aliasing effect brought by the up-sampling process, and the feature map obtained from P5 is used as input to a down-sampling process to obtain the output.
Further, the RPN region generation network comprises a network of a convolution layer plus ReLU, a classification layer and a regression layer. After features are extracted through the backbone network, the feature map is divided into h × w regions according to the obtained feature map size, each region being determined by its pixel point; each pixel point generates k candidate regions possibly including the target on the original image, where k corresponds to the different aspect ratios of the anchor boxes. The anchors of each candidate region are then discriminated and given positive and negative labels: an anchor whose overlap IoU with a ground-truth box reaches k is given a positive label, k being a set threshold; if no anchor's IoU reaches k, the anchor with the highest overlap among the 3 anchors is given the positive label, and anchors whose IoU overlap is less than 1 - k are given negative labels. After the convolution layer each anchor has a foreground score and a background score representing the probability of foreground and background, and the regression layer calculates the offsets (x, y and the width and height in log scale) required to transform the anchor toward the ground-truth box.
Further, the RPN region generation network specifically includes:
1) after the feature maps enter the RPN network, the feature map obtained at each level is traversed, a 3 × 3 convolution with 512 channels is performed on each map to multiply the number of channels, and classification and regression operations are then performed separately;
in the classification operation, a 1 × 1 convolution is first applied to obtain an output with 2 × 3 channels, which is then reshaped with the reshape function into [N, w × h × 3, 2]; this is the classifier score data rpn_class_logits used later to calculate the classification loss. The classifier probability data rpn_probs, obtained after the classifier data is processed by softmax, represents the confidence, i.e. the probability, of positive and negative samples; its structure is [N, w × h × 3, 2], where N is the set batch_size, w × h × 3 is the number of anchors generated per feature map, and 2 is the two dimensions corresponding to positive and negative samples. In the regression operation, a 1 × 1 convolution is first used to obtain an output with 4 × 3 channels, which is then reshaped with the reshape function into [N, w × h × 3, 4]; this is the anchor coordinate offset rpn_bbox, where 4 represents the 4 coordinates of the prediction box. Each feature map therefore outputs three kinds of data: rpn_class_logits, rpn_probs, rpn_bbox;
2) inputting the three kinds of data output in step 1) into the ProposalLayer, which processes the result of the RPN network to generate suitable target boxes of interest rois;
firstly the probability data of the positive samples in rpn_probs obtained from the previous layer are input, then the rpn_bbox coordinate offsets, and all anchor boxes generated for the feature map are obtained; the top_k anchors ranked by probability score are taken, top_k being a set parameter, the indexes of these top_k anchors are then obtained, and the probability scores, coordinate offsets and anchor boxes themselves are gathered according to the indexes; the selected anchors are then coordinate-corrected using the offsets, and the resulting corrected boxes are combined with the original score data; the coordinates of the four corners of the whole image and of the obtained corrected boxes are then normalized and uniformly constrained to lie between 0 and 1, and once a corner coordinate of a corrected box is less than 0 or greater than 1 it is set to 0 or 1 respectively, i.e. the detection box is forcibly limited to the whole image and the excess part is clipped; finally, the top_k boxes are screened again with the NMS non-maximum suppression algorithm to obtain the final rois regions of interest, i.e. the scores and coordinates of all rois, and if the number of generated rois is lower than the set parameter, zeros are directly padded.
3) Inputting the output of the ProposalLayer into the DetectionTargetLayer;
the input data include the preselected boxes proposals of the rois, the ground-truth categories gt_class_ids, the ground-truth boxes gt_boxes and the ground-truth masks gt_masks; because the input is zero-padded to fix its shape, the padded zeros are first removed completely, and after deletion the preselected boxes, ground-truth boxes and ground-truth masks corresponding to the non-zero ground-truth boxes (no more than the original number) are obtained; particularly crowded instances are then handled: the numbers of crowded instances and of normal instances are recorded separately, with gt_class_ids used for the judgement, so that if the id is greater than 0 the instance is a normal instance and its category is recorded, while if it is less than 0 the instance is a crowded instance; afterwards only the categories, boxes and masks of normal instances are used. IoU calculation is performed between all the obtained preselected boxes and the boxes of the crowded instances: the four coordinates of the two boxes are intersected to obtain the four coordinates of the intersection part and hence the intersection area S1; the union area S2 is obtained by adding the two box areas and subtracting the intersection area; finally the intersection area S1 is divided by the union area S2 to obtain the IoU. If the largest IoU of a preselected box against all the crowded instances is less than 0.001, that preselected box can be used;
4) selecting positive and negative samples and generating training samples: calculating the IoU overlap between the preselected boxes and the ground-truth boxes of the normal instances, and recording the index values of positive and negative samples separately; setting the number k of preselected boxes to be trained on each image, then selecting k × 0.33 positive sample indexes and randomly selecting two thirds of negative samples; for each positive sample index, finding the ground-truth box with the maximum IoU, recording the position and category of the corresponding ground-truth box, and calculating the deviation values between the positive samples and their ground-truth boxes; assigning a ground-truth mask to each training preselected box according to its corresponding ground-truth box;
5) assigning a positive sample to each ground-truth box and ground-truth mask according to the overlap; the real category of the target, the offsets of the positive samples relative to their corresponding ground-truth boxes and the masks of the corresponding ground-truth boxes are then input and enter the next layer for classification and regression operations, while the mask is generated through a parallel branch. During the classification and regression operations, the data first enters a ROIAlign layer; after the layer number corresponding to each roi is obtained, the corresponding region is taken out of the corresponding feature map, the rois are then pooled, and all rois are resized to a uniform size by means of bilinear interpolation. The obtained output then passes in sequence through a convolution layer, a BN layer, a ReLU activation layer, another convolution layer, a BN layer and a ReLU activation layer, and two fully connected layers; the feature map size does not change in this process and remains the size obtained after pooling. The final classifier scores are obtained, a softmax function is then used to obtain the classification probabilities, and the coordinates of the final detection boxes are obtained through offset calculation.
Furthermore, the RPN region generation network further comprises a mask generation network whose input is the rois: the data first enters a ROIAlign layer to calculate the corresponding level, then passes through 5 fully convolutional layers, each consisting of a convolution layer, a BN layer and a ReLU layer, and the map size is finally recovered through up-sampling. Because the fully convolutional layers perform strong pixel-level semantic segmentation, each pixel point carries class information, so the finally generated mask has a class; the mask position information with its class is output at the end.
Further, in step 2), the formula for performing coordinate correction on the selected anchors by using the offset is as follows:
G'x = Aw · dx + Ax
G'y = Ah · dy + Ay
G'w = Aw · exp(dw)
G'h = Ah · exp(dh)
where Ax, Ay, Aw and Ah are the preset anchor center-point coordinates and width and height, dx, dy, dw and dh are the previously calculated offsets, and G'x, G'y, G'w and G'h are the updated anchor center-point coordinates and width and height.
Further, the formula for calculating the deviation value between the positive sample and the true box is:
dy = (gt_center_y - center_y) / height
dx = (gt_center_x - center_x) / width
dh = ln(gt_height / height)
dw = ln(gt_width / width)
where center_x, center_y and gt_center_x, gt_center_y are the center-point coordinates of the positive sample and of the ground-truth box respectively, height, width and gt_height, gt_width are their heights and widths, and dx, dy, dh, dw are the offsets.
The invention has the following beneficial effects: the method reduces the influence of a shaking environment on target recognition by adopting a de-jittering algorithm, uses the improved Mask fast RCNN model to recognize sparse samples, and stores the recognition results on a server. Manual intervention is added to mark the recognition results, the marking results are fed back to the training database of the Mask fast RCNN model, and the Mask fast RCNN model is further trained to improve recognition accuracy.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a flow chart of a labeling method according to the present invention;
FIG. 2 shows the picture recognition result of the intelligent marking system.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
Referring to fig. 1-2, fig. 1 is a method for labeling sparse samples in a jitter environment, including: (1) a debounce algorithm; (2) improving a Mask fast RCNN model to identify sparse samples; (3) manually marking; (4) and updating the training set.
The detailed process of each step is as follows:
(1) de-dithering algorithm
1) Inputting a video file and calculating the SIFT feature points and descriptors of each frame image.
2) The optimal matching of the feature points between adjacent frames specifically comprises the following steps:
a. calculating, through a nearest neighbor algorithm, the 2 best-matching feature points (namely the matching feature points) corresponding to each feature point (namely the original feature point) of the previous frame image;
b. calculating the Euclidean distances of the 2 best-matching feature points; if the distance is smaller than a certain threshold, the original feature point is matched successfully and the matched points can serve as its matching feature points; otherwise the original feature point has no corresponding matching feature point and the matching fails;
c. all successfully matched feature point pairs form the optimal matching between adjacent frames;
3) Calculating a 2 × 3 affine transformation matrix T between adjacent frames from the best-matching feature points, thereby obtaining an estimate of the camera motion trajectory.
4) Performing parameter calculation on the affine transformation matrix, calculating the following parameters: T[0][2], T[1][2], atan2(T[1][0], T[0][0]), sqrt(T[1][0]^2 + T[0][0]^2).
5) Smoothing the parameters, ensuring that the absolute difference between each smoothed parameter and its original value is smaller than a certain threshold.
6) And recalculating the affine transformation matrix according to the smoothed parameters.
7) And (4) carrying out translation, rotation, scaling, shearing, reflection and other transformations on the original video image frame by frame according to the new affine transformation matrix.
8) And uniformly cutting the transformed video images and combining the video images into a new video, namely the de-jittering video.
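As an illustration only, steps 1)-8) above might be sketched in Python with OpenCV roughly as follows; the function names, the matching threshold and the smoothing radius are assumptions made for this sketch and are not part of the invention.

import cv2
import numpy as np

def estimate_interframe_affine(prev_gray, curr_gray, dist_thresh=200.0):
    # Steps 1)-3): SIFT features, nearest-neighbour matching, 2x3 affine estimate.
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(prev_gray, None)
    kp2, des2 = sift.detectAndCompute(curr_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    pairs = matcher.knnMatch(des1, des2, k=2)                  # 2 best matches per point
    good = [m for m, n in pairs if m.distance < dist_thresh]   # Euclidean-distance threshold
    src = np.float32([kp1[m.queryIdx].pt for m in good])
    dst = np.float32([kp2[m.trainIdx].pt for m in good])
    T, _ = cv2.estimateAffinePartial2D(src, dst)               # 2x3 affine matrix T
    return T

def affine_params(T):
    # Step 4): translation, rotation angle and scale extracted from T.
    return np.array([T[0][2], T[1][2],
                     np.arctan2(T[1][0], T[0][0]),
                     np.sqrt(T[1][0] ** 2 + T[0][0] ** 2)])

def smooth_params(raw, radius=15, max_dev=None):
    # Step 5): moving-average smoothing of the per-frame parameter trajectory,
    # optionally clamped so that |smoothed - raw| stays below max_dev.
    kernel = np.ones(2 * radius + 1) / (2 * radius + 1)
    smoothed = np.vstack([np.convolve(col, kernel, mode="same") for col in raw.T]).T
    if max_dev is not None:
        smoothed = raw + np.clip(smoothed - raw, -max_dev, max_dev)
    return smoothed

The recomputed affine matrices (steps 6) and 7)) can then be applied frame by frame with cv2.warpAffine, after which the frames are uniformly cropped and combined into the de-jittered video (step 8)).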
(2) Recognizing sparse samples with the improved Mask fast RCNN model
The picture is input as a gray-scale image, so the image input is (H, W, 1), where W is the width, H is the height and 1 is the number of channels. Image preprocessing is then carried out: the height and width of the image are required to be equal (1024 × 1024 is taken as input here), the side length is determined by the longest edge, the image is treated as a square input, and the part shorter than the longest edge is directly zero-padded. The image features are then extracted by the backbone network ResNet-FPN.
This network comprises two parts: one part pushes the extracted features from low dimension to high dimension, and the other samples them from high dimension back to low dimension. The feature-extraction backbone networks in Mask RCNN are generally ResNet101 and ResNet50; the two differ little and can both be divided into 5 large layers. The first layer is a convolution layer with 64 convolution kernels of 7 × 7, stride 2 and padding 3, and the processed size is
(1024 - 7 + 2 × 3) / 2 + 1 = 512 (rounded down)
which halves the feature map size, resulting in a 512 × 512 × 64 feature map. This then enters the second layer, which begins with a pooling process using a 3 × 3 kernel, stride 2 and padding 1; the feature map size is calculated in the same way:
(512 - 3 + 2 × 1) / 2 + 1 = 256 (rounded down)
giving a 256 × 256 × 64 feature map, i.e. the feature map size is halved again. Within the second layer, 64 convolution kernels of 1 × 1 are then applied without changing the size, after which the feature map values undergo the normalization processing of the BN layer. When this layer of the network processes input data, the data can be regarded as a four-dimensional matrix (m, f, h, w), where m is how many batches of data are processed each time, f is the number of feature maps (channels), and h and w are the height and width respectively; m × h × w parameters are thus processed at one time. The normalization process is:
Input: B = {x1, ..., xm}, where x1, ..., xm are the input parameter values, i.e. the values on the feature map.
Mean and variance:
μB = (1/m) Σ xi
σB² = (1/m) Σ (xi - μB)²
Output:
x̂i = (xi - μB) / sqrt(σB² + ε)
where ε is a very small positive number that prevents the denominator from being 0.
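A minimal NumPy sketch of the normalization above, assuming the (m, f, h, w) data layout described in the text; gamma and beta are the standard learnable scale and shift of batch normalization and are an addition for completeness, not taken from the invention.

import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # x has shape (m, f, h, w); each channel f is normalized over its m*h*w values.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)   # mu_B per channel
    var = x.var(axis=(0, 2, 3), keepdims=True)     # sigma_B^2 per channel
    x_hat = (x - mean) / np.sqrt(var + eps)        # normalized values
    return gamma * x_hat + beta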
Batch normalization allows the network to converge faster and prevents the gradient from vanishing. It is followed by a ReLU activation function for forward conduction; ReLU, whose equation is f(x) = max(0, x), increases the nonlinearity of the network and speeds up convergence. The following layers are 64 convolution kernels of 3 × 3 with edge padding 1, which do not change the feature map size and give a 256 × 256 × 64 result, then BN layer normalization and a ReLU activation function, then 256 convolution kernels of 1 × 1 giving a 256 × 256 × 256 feature map, then BN layer normalization. Meanwhile, the second layer adds a down-sampling branch beside the main trunk: the number of channels of the second-layer input feature map is 64 while the number after processing is 256, so for the two to be added this branch must expand the number of channels of the original input feature map to 256; it is therefore made of 256 convolution kernels of 1 × 1 followed by a BN layer. The branch output is added to the main-path output and passed through a final ReLU activation, which completes one block; the second large layer stacks 3 such blocks, and in the remaining blocks the input and output channel numbers already match, so no branch is needed. The third large layer is built from similar blocks of 128 convolution kernels of 1 × 1, a BN layer and a ReLU layer, 128 convolution kernels of 3 × 3, and 512 convolution kernels of 1 × 1, except that in its first block the 3 × 3 convolution uses stride 2, so that
(256 - 3 + 2 × 1) / 2 + 1 = 128 (rounded down)
and the feature map size is halved to 128 × 128; then follow the BN layer, the ReLU layer, the 512 convolution kernels of 1 × 1 giving a 128 × 128 × 512 feature map, and the BN layer as before, while the branch is likewise formed from 512 convolution kernels of 1 × 1 applied to the input, and finally the ReLU layer. This first block is followed by blocks of the same structure in which the 3 × 3 convolution has stride 1, the feature map size is unchanged and no branch exists; the third large layer has 4 such blocks in total. By analogy, the fourth large layer takes the 128 × 128 × 512 feature map as input and outputs 64 × 64 × 1024; the difference between ResNet50 and ResNet101 is that the fourth large layer has 6 blocks in the former and 23 in the latter. The fifth large layer takes 64 × 64 × 1024 as input and outputs 32 × 32 × 2048.
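A hedged PyTorch sketch of one such block, assuming the structure described above (1 × 1 conv, BN, ReLU, 3 × 3 conv, BN, ReLU, 1 × 1 conv, BN, plus a 1 × 1 branch when the channel count or stride changes); the class name and default channel numbers are illustrative only.

import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # down-sampling branch that matches channels/size so the two paths can be added
        self.branch = None
        if stride != 1 or in_ch != out_ch:
            self.branch = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        identity = x if self.branch is None else self.branch(x)
        return torch.relu(self.main(x) + identity)

# e.g. the first block of the second large layer: 64 -> 256 channels, size unchanged
block = Bottleneck(in_ch=64, mid_ch=64, out_ch=256, stride=1)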
That completes the feature-extraction backbone network. The features then enter the FPN feature pyramid network: the output of each layer from the second to the fifth layer is convolved by 256 convolution kernels of 1 × 1, giving feature maps with 256 channels at four different sizes (256, 128, 64, 32). Taking the 32 × 32 × 256 feature map of the fifth layer as an example, one up-sampling step changes its size to 64, giving a 64 × 64 × 256 feature map with the number of channels unchanged, which is then added to the 64 × 64 × 256 feature map of the fourth layer. Repeating this operation with the resulting feature maps gives three different feature maps, namely P4 for the fourth layer, P3 for the third layer and P2 for the second layer, while P5 of the fifth layer is unchanged. P2 to P5 then each undergo a 3 × 3 convolution to eliminate the aliasing effect introduced by up-sampling, and the resulting 32 × 32 × 256 feature map from P5 is used as input to a down-sampling process, yielding P6 with an output size of 16 × 16 × 256.
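A hedged PyTorch sketch of the top-down path just described (1 × 1 lateral convolutions to 256 channels, up-sample and add, a 3 × 3 convolution on P2-P5, and a down-sampling step on P5 to obtain P6); the class and argument names are illustrative.

import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, padding=1) for _ in in_channels)

    def forward(self, c2, c3, c4, c5):
        p5 = self.lateral[3](c5)
        p4 = self.lateral[2](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[1](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        p2 = self.lateral[0](c2) + F.interpolate(p3, scale_factor=2, mode="nearest")
        p2, p3, p4, p5 = [s(p) for s, p in zip(self.smooth, (p2, p3, p4, p5))]
        p6 = F.max_pool2d(p5, kernel_size=1, stride=2)   # extra down-sampling step for P6
        return p2, p3, p4, p5, p6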
The resulting feature maps are entered as input into the RPN region generation network, which is a network of a convolution layer plus ReLU, a classification layer and a regression layer. After feature extraction through the backbone network, the feature map is divided into h × w regions according to the resulting feature map size; each region is determined by its pixel point, and each pixel point generates k candidate regions that may include the target on the original image, where k is the number of different aspect ratios of the anchor boxes; there are three ratios [0.5, 1, 2] in the RPN network, so the value of k is 3. The anchors of each candidate region are then discriminated and given positive and negative labels: an anchor whose overlap IoU with a ground-truth box reaches k is given a positive label, where k is the set threshold, typically 0.7; if no anchor's IoU reaches k, the anchor with the highest overlap among the 3 is given the positive label, and anchors whose IoU overlap is less than 1 - k are given negative labels. After the convolution layer, each anchor has a foreground score and a background score that represent its probability of being foreground or background, and the regression layer computes the offsets (x, y and the width and height in log scale) needed to transform each anchor toward the ground-truth box.
The specific operation is that after entering the RPN network, the feature maps obtained at each level are traversed, a 3 × 3 convolution with 512 channels is performed on each map to multiply the number of channels, and classification and regression operations are then carried out separately. In the classification operation, a 1 × 1 convolution is first applied to obtain an output with 2 × 3 channels, which is then reshaped with the reshape function into [N, w × h × 3, 2]; this is the classifier score data rpn_class_logits used later to calculate the classification loss. The classifier probability data rpn_probs, obtained after the classifier data is processed by softmax, represents the confidence, i.e. the probability, of positive and negative samples; its structure is [N, w × h × 3, 2], where N is the set batch_size, w × h × 3 is the number of anchors generated per feature map, and 2 is the two dimensions corresponding to positive and negative samples. In the regression operation, a 1 × 1 convolution is first used to obtain an output with 4 × 3 channels, which is then reshaped with the reshape function into [N, w × h × 3, 4]; this is the anchor coordinate offset rpn_bbox, where 4 represents the 4 coordinates of the prediction box. Each feature map therefore outputs three kinds of data: rpn_class_logits, rpn_probs and rpn_bbox.
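A hedged PyTorch sketch of the operation just described: a shared 3 × 3 convolution with 512 channels, then 1 × 1 convolutions producing 2 × k class scores and 4 × k box offsets per position, reshaped to [N, w·h·k, 2] and [N, w·h·k, 4]. The output names follow the text; the module itself is illustrative.

import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_ch=256, k=3):              # k anchor aspect ratios per position
        super().__init__()
        self.shared = nn.Conv2d(in_ch, 512, 3, padding=1)
        self.cls = nn.Conv2d(512, 2 * k, 1)          # foreground/background logits
        self.reg = nn.Conv2d(512, 4 * k, 1)          # dx, dy, dw, dh per anchor

    def forward(self, feat):
        n = feat.shape[0]
        x = torch.relu(self.shared(feat))
        rpn_class_logits = self.cls(x).permute(0, 2, 3, 1).reshape(n, -1, 2)
        rpn_probs = torch.softmax(rpn_class_logits, dim=-1)
        rpn_bbox = self.reg(x).permute(0, 2, 3, 1).reshape(n, -1, 4)
        return rpn_class_logits, rpn_probs, rpn_bbox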
The three outputs then enter the ProposalLayer as input. This layer processes the result of the RPN network to generate suitable target boxes of interest rois. First the probability data of the positive samples in rpn_probs from the previous layer are input, then the rpn_bbox coordinate offsets, and all anchor boxes generated for the feature map are obtained. The top_k anchors ranked by probability score are taken, where top_k is a set parameter; the indexes of these top_k anchors are then obtained, and the probability scores, coordinate offsets and anchor boxes themselves are gathered according to the indexes. The offsets are then used to coordinate-correct the selected anchors, with the following formula:
G'x = Aw · dx + Ax
G'y = Ah · dy + Ay
G'w = Aw · exp(dw)
G'h = Ah · exp(dh)
where Ax, Ay, Aw and Ah are the preset anchor center-point coordinates and width and height, dx, dy, dw and dh are the previously calculated offsets, and G'x, G'y, G'w and G'h are the updated anchor center-point coordinates and width and height. The resulting corrected boxes are combined with the original score data. The coordinates of the four corners of the whole image and of the obtained corrected boxes are then normalized and uniformly constrained to lie between 0 and 1; once a corner coordinate of a corrected box is less than 0 or greater than 1 it is set to 0 or 1 respectively, i.e. the detection box is forcibly limited to the whole image and the excess part is clipped. Finally, the top_k boxes are screened again with the NMS non-maximum suppression algorithm to obtain the final rois regions of interest, i.e. the scores and coordinates of all rois; if the number of generated rois is lower than the set parameter, zeros are directly padded.
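A hedged NumPy sketch of the correction formula above together with the clipping of normalized boxes; anchors and deltas are arrays of shape (num_anchors, 4), with anchors given here as (center_x, center_y, w, h) for simplicity.

import numpy as np

def refine_anchors(anchors, deltas):
    ax, ay, aw, ah = anchors.T
    dx, dy, dw, dh = deltas.T
    gx = aw * dx + ax                 # G'x = Aw * dx + Ax
    gy = ah * dy + ay                 # G'y = Ah * dy + Ay
    gw = aw * np.exp(dw)              # G'w = Aw * exp(dw)
    gh = ah * np.exp(dh)              # G'h = Ah * exp(dh)
    return np.stack([gx, gy, gw, gh], axis=1)

def clip_normalized_boxes(boxes):
    # corner boxes (y1, x1, y2, x2) in normalized coordinates are forced into [0, 1]
    return np.clip(boxes, 0.0, 1.0)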
The data then enters the DetectionTargetLayer. The preselected boxes proposals of the rois, the ground-truth categories gt_class_ids, the ground-truth boxes gt_boxes and the ground-truth masks gt_masks need to be input. Because the input is zero-padded to fix its shape, the padded zeros are first removed completely; after deletion, the preselected boxes, ground-truth boxes and ground-truth masks corresponding to the non-zero ground-truth boxes are obtained. Particularly crowded instances are then handled: the numbers of crowded instances and of normal instances are recorded separately, with gt_class_ids used for the judgement, so that if the id is greater than 0 the instance is a normal instance and its category is recorded, while if it is less than 0 the instance is a crowded instance; afterwards only the categories, boxes and masks of normal instances are used. IoU calculation is then performed between all the obtained preselected boxes and the boxes of the crowded instances: the four coordinates of the two boxes are intersected to obtain the four coordinates of the intersection part and hence the intersection area S1; the union area S2 is obtained by adding the two box areas and subtracting the intersection area; finally the intersection area S1 is divided by the union area S2 to obtain the IoU. If the largest IoU of a preselected box against all the crowded instances is less than 0.001, that preselected box can be used.
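A hedged sketch of the IoU calculation just described, for two corner-format boxes (y1, x1, y2, x2); the 0.001 crowd threshold from the text is shown as a usage example with made-up boxes.

import numpy as np

def iou(box_a, box_b):
    y1 = max(box_a[0], box_b[0]); x1 = max(box_a[1], box_b[1])
    y2 = min(box_a[2], box_b[2]); x2 = min(box_a[3], box_b[3])
    s1 = max(0.0, y2 - y1) * max(0.0, x2 - x1)                # intersection area S1
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    s2 = area_a + area_b - s1                                 # union area S2
    return s1 / s2 if s2 > 0 else 0.0

crowd_boxes = [np.array([0.1, 0.1, 0.4, 0.4])]                # made-up crowded instance
proposal = np.array([0.6, 0.6, 0.9, 0.9])                     # made-up preselected box
usable = max(iou(proposal, c) for c in crowd_boxes) < 0.001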
Next comes the selection of positive and negative samples to generate training samples: the IoU overlap between the preselected boxes and the ground-truth boxes of the normal instances is calculated. A threshold needs to be set; if the threshold is 0.7, boxes with overlap greater than 0.7 are positive and boxes with overlap less than 0.7 are negative, and the index values of positive and negative samples are recorded separately. The number k of preselected boxes to be trained on each image is set, then k × 0.33 positive sample indexes are selected, i.e. one third of the samples are taken as positive, and two thirds are randomly selected as negative samples. For each positive sample index, the ground-truth box with the maximum IoU is found, the position and category of the corresponding ground-truth box are recorded, and the deviation values between the positive samples and their ground-truth boxes are calculated:
dy = (gt_center_y - center_y) / height
dx = (gt_center_x - center_x) / width
dh = ln(gt_height / height)
dw = ln(gt_width / width)
where center_x, center_y and gt_center_x, gt_center_y are the center-point coordinates of the positive sample and of the ground-truth box respectively, height, width and gt_height, gt_width are their heights and widths, and dx, dy, dh, dw are the offsets. A ground-truth mask is then assigned to each training preselected box according to its corresponding ground-truth box; the ground-truth mask is labeled, and the label is the sequence ID of the object category.
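A hedged NumPy sketch of the deviation (regression target) formula above; boxes are passed as (center_x, center_y, width, height), matching the variable names in the text.

import numpy as np

def box_deltas(positive_box, gt_box):
    center_x, center_y, width, height = positive_box
    gt_center_x, gt_center_y, gt_width, gt_height = gt_box
    dy = (gt_center_y - center_y) / height
    dx = (gt_center_x - center_x) / width
    dh = np.log(gt_height / height)
    dw = np.log(gt_width / width)
    return np.array([dy, dx, dh, dw])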
Positive and negative samples are then sampled, keeping the ratio of positive and negative samples at 2:1, and a positive sample is assigned to each ground-truth box and ground-truth mask according to the overlap, in preparation for the corresponding loss calculation; the output of this layer is the rois containing the positive and negative samples, with normalized coordinates. The real category of the target, the offsets of the positive samples relative to their corresponding ground-truth boxes and the masks of the corresponding ground-truth boxes are then input and enter the next layer for classification and regression operations, while the mask is generated through a parallel branch; this parallel branch is also the part in which Mask RCNN differs from Faster RCNN.
During the classification and regression operations, the data first enters a ROIAlign layer, which calculates from which layer each roi takes its features. The calculation formula is:
k = floor(k0 + log2(sqrt(w · h) / c))
where k0 is a set constant, generally 4, c is the width or height corresponding to the fourth layer and can therefore be set to 64, and w, h are the width and height of the input roi, so the calculation formula here is:
k = floor(4 + log2(sqrt(w · h) / 64))
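A hedged sketch of the level assignment above, using the constants k0 = 4 and c = 64 given in the text; the clamping of the result to the available levels P2-P5 is an assumption.

import math

def roi_level(w, h, k0=4, c=64.0):
    k = int(math.floor(k0 + math.log2(math.sqrt(w * h) / c)))
    return max(2, min(5, k))          # keep the result inside the available levels P2..P5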
and after the corresponding layer number is obtained, the corresponding area is taken out from the corresponding characteristic diagram, then the rois is subjected to pooling operation, and the sizes of all the rois are changed into a uniform size by utilizing a bilinear interpolation calculation mode.
The obtained output then passes in sequence through a convolution layer, a BN layer and a ReLU activation layer, another convolution layer, a BN layer and a ReLU activation layer, and two fully connected layers; the feature map size does not change in this process and remains the size obtained after pooling. The final classifier scores are obtained, the classification probabilities are then obtained with a softmax function, and at the same time the coordinates of the final detection boxes are obtained through offset calculation.
The input of the mask generation network on the other branch is the rois; it also first enters a ROIAlign layer for calculation of the corresponding level, then passes through 5 fully convolutional layers, each consisting of a convolution layer, a BN layer and a ReLU layer, and the map size is finally recovered through up-sampling. Because of the strong pixel-level semantic segmentation of the fully convolutional layers, each pixel point carries class information, so the finally generated mask also has a class, and the final output is the predicted mask position information with its class.
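A hedged PyTorch sketch of such a mask branch (stacked convolution, BN and ReLU layers, up-sampling to recover the map size, and one output mask per class); the channel count, roi size and class count are illustrative and not taken from the invention.

import torch
import torch.nn as nn

class MaskHead(nn.Module):
    def __init__(self, in_ch=256, num_classes=2):
        super().__init__()
        layers = []
        for _ in range(5):                            # 5 convolution + BN + ReLU layers
            layers += [nn.Conv2d(in_ch, 256, 3, padding=1),
                       nn.BatchNorm2d(256), nn.ReLU(inplace=True)]
            in_ch = 256
        self.convs = nn.Sequential(*layers)
        self.up = nn.ConvTranspose2d(256, 256, 2, stride=2)   # recover the map size
        self.mask = nn.Conv2d(256, num_classes, 1)            # one mask per class

    def forward(self, roi_feat):                      # roi_feat: [num_rois, 256, 14, 14]
        x = torch.relu(self.up(self.convs(roi_feat)))
        return torch.sigmoid(self.mask(x))            # per-pixel, per-class mask probabilities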
Resnet101/Resnet50: residual networks; the trailing number is the number of layers of the residual network. They are designed to solve the problem of the gradient vanishing during network training and the problem of the network error increasing as the network keeps deepening; 50-layer, 101-layer and 152-layer versions are commonly designed.
Down-sampling: can generally be understood as shrinking an image; its effect is to make the image size meet requirements and to generate a thumbnail of the image.
Up-sampling: can be regarded as enlarging an image, i.e. performing interpolation expansion on the image data so that the image has a larger size.
ResNet-FPN: the combination of the residual network and the feature pyramid network; features at different scales are extracted by the residual network while the FPN processes low-level and high-level features at the same time, and a better prediction effect is obtained after fusion.
RPN region generation network: the Region Proposal Network predicts the required boxes by exhaustively enumerating boxes of different sizes at different anchor points.
The true value type gt _ class _ ids is the type of the object marked during marking;
a true value box gt _ boxes is a marking box during marking;
a truth value shade gt _ masks is a shade generated according to the label;
ROIAlign, namely, the quantization operation in the original pooling process is changed into continuous operation by using a bilinear interpolation method, so that the deviation in the quantization process is eliminated, and the pixel-level mask generation is more accurate;
IoU: the intersection-over-union ratio, namely the intersection area of two regions divided by their union area;
Non-Maximum Suppression: many boxes are generated because of the anchor points; the role of the NMS algorithm here is to directly and completely discard the boxes whose degree of overlap is higher than a set threshold and to select the locally least-overlapping box for the next processing step;
rois: Regions of Interest, boxes on the feature map, namely the detection boxes predicted by the network;
the fully convolutional layers carry out pixel-level classification of the image and are very accurate; they recover the class of each pixel from the extracted features, thereby solving the semantic segmentation problem, and are an important part of mask generation;
Bilinear interpolation: a calculation method for eliminating quantization. For example, if a 32 x 32 map is to be pooled into 7 x 7, the original map can be regarded as a 32 x 32 lattice of values; dividing it evenly into 7 x 7 gives a 7 x 7 lattice whose points generally do not coincide with the points of the 32 x 32 lattice, and each point of the 7 x 7 lattice falls inside some 1 x 1 cell of the 32 x 32 lattice, so a more accurate value can be calculated from the four surrounding points.
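A hedged NumPy sketch of bilinear sampling at one non-integer position of a feature map, i.e. the value is computed from the four surrounding lattice points as described above; the sampling coordinates in the usage line are arbitrary.

import numpy as np

def bilinear_sample(feature_map, y, x):
    h, w = feature_map.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feature_map[y0, x0] + wx * feature_map[y0, x1]
    bottom = (1 - wx) * feature_map[y1, x0] + wx * feature_map[y1, x1]
    return (1 - wy) * top + wy * bottom

fm = np.random.rand(32, 32)               # e.g. a 32 x 32 map pooled towards 7 x 7 bins
value = bilinear_sample(fm, 10.37, 21.84)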
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (10)

1. A method for marking sparse samples in a jitter environment is characterized by specifically comprising the following steps:
s1: adopting a debounce algorithm to debounce the input video file;
s2: identifying sparse samples by using an improved Mask fast RCNN model;
s3: constructing an intelligent marking system, and manually marking the identified sparse samples;
s4: updating a training set: the marked data are returned to the training data set for the next round of training of the improved Mask fast RCNN model.
2. The method according to claim 1, wherein in step S1, the de-dithering algorithm specifically includes the following steps:
s11: inputting a video file, and calculating a sift characteristic point and a descriptor of each frame of image;
s12: performing optimal matching on feature points between adjacent frames, specifically comprising:
s121: calculating 2 most matched feature points corresponding to each feature point of the previous frame of image through a nearest neighbor algorithm;
s122: calculating the Euclidean distances of the 2 best-matching feature points; if the distance is smaller than a certain threshold, the original feature point is matched successfully and the matched points can serve as its matching feature points; otherwise the original feature point has no corresponding matching feature point and the matching fails;
s123: all successfully matched feature point pairs form the optimal matching between adjacent frames;
s13: calculating an affine transformation matrix between adjacent frames through the optimal matching feature points, and thus obtaining motion trail estimation of the camera;
s14: performing parameter calculation on the affine transformation matrix;
s15: smoothing the parameters and ensuring that the absolute value of the smoothed parameters and the original value is less than a certain threshold value;
s16: recalculating the affine transformation matrix according to the smoothed parameters;
s17: according to the new affine transformation matrix, the original video image is translated, rotated, scaled, cut and reflected frame by frame;
s18: and uniformly cutting the transformed video images and combining the video images into a new video, namely the de-jittering video.
3. The method as claimed in claim 1, wherein in step S2, the improved Mask fast RCNN model comprises: a backbone network for feature extraction, an FPN feature pyramid network and an RPN region generation network.
4. The method for marking sparse samples in a jittering environment as claimed in claim 3, wherein the backbone network for feature extraction comprises five large layers, wherein the first layer and the second layer reduce the size of the feature map by setting certain convolution kernels, strides, paddings and kernel numbers; the second layer further comprises a BN layer, a ReLU layer and a pooling layer, the BN layer being used to normalize parameters; and the second layer additionally adds a down-sampling branch beside the main path to double the number of channels of the original input feature map.
5. The method as claimed in claim 3, wherein the FPN feature pyramid network specifically comprises: the feature map output by the fifth layer of the backbone network is up-sampled with the number of channels unchanged and is then added to the feature map of the fourth layer; repeating this operation with the resulting feature maps yields three different feature maps, namely P4 for the fourth layer, P3 for the third layer and P2 for the second layer, while P5 of the fifth layer is unchanged; P2 to P5 then each undergo one convolution to eliminate the aliasing effect brought by the up-sampling process, and the feature map obtained from P5 is used as input to a down-sampling process to obtain the output.
6. The method as claimed in claim 3, wherein the RPN region generation network comprises a network of a convolution layer plus ReLU, a classification layer and a regression layer; after features are extracted through the backbone network, the feature map is divided into h × w regions according to the obtained feature map size, each region being determined by its pixel point; each pixel point generates k candidate regions possibly including the target on the original image, where k corresponds to the different aspect ratios of the anchor boxes; the anchors of each candidate region are then discriminated and given positive and negative labels: an anchor whose overlap IoU with a ground-truth box reaches k is given a positive label, k being a set threshold; if no anchor's IoU reaches k, the anchor with the highest overlap among the 3 anchors is given the positive label, and anchors whose IoU overlap is less than 1 - k are given negative labels; after the convolution layer each anchor has a foreground score and a background score representing the probability of foreground and background, and the regression layer calculates the offsets (x, y and the width and height in log scale) required to transform the anchor toward the ground-truth box.
7. The method according to claim 6, wherein the RPN region generation network specifically comprises:
1) after the feature maps enter the RPN network, the feature map obtained at each level is traversed, a 3 × 3 convolution with 512 channels is performed on each map to multiply the number of channels, and classification and regression operations are then performed separately;
in the classification operation, a 1 × 1 convolution is first applied to obtain an output with 2 × 3 channels, which is then reshaped with the reshape function into [N, w × h × 3, 2]; this is the classifier score data rpn_class_logits used later to calculate the classification loss. The classifier probability data rpn_probs, obtained after the classifier data is processed by softmax, represents the confidence, i.e. the probability, of positive and negative samples; its structure is [N, w × h × 3, 2], where N is the set batch_size, w × h × 3 is the number of anchors generated per feature map, and 2 is the two dimensions corresponding to positive and negative samples. In the regression operation, a 1 × 1 convolution is first used to obtain an output with 4 × 3 channels, which is then reshaped with the reshape function into [N, w × h × 3, 4]; this is the anchor coordinate offset rpn_bbox, where 4 represents the 4 coordinates of the prediction box. Each feature map therefore outputs three kinds of data: rpn_class_logits, rpn_probs, rpn_bbox;
2) inputting the three kinds of data output in step 1) into the ProposalLayer, which processes the result of the RPN network to generate suitable target boxes of interest rois;
firstly the probability data of the positive samples in rpn_probs obtained from the previous layer are input, then the rpn_bbox coordinate offsets, and all anchor boxes generated for the feature map are obtained; the top_k anchors ranked by probability score are taken, top_k being a set parameter, the indexes of these top_k anchors are then obtained, and the probability scores, coordinate offsets and anchor boxes themselves are gathered according to the indexes; the selected anchors are then coordinate-corrected using the offsets, and the resulting corrected boxes are combined with the original score data; the coordinates of the four corners of the whole image and of the obtained corrected boxes are then normalized and uniformly constrained to lie between 0 and 1, and once a corner coordinate of a corrected box is less than 0 or greater than 1 it is set to 0 or 1 respectively, i.e. the detection box is forcibly limited to the whole image and the excess part is clipped; finally, the top_k boxes are screened again with the NMS non-maximum suppression algorithm to obtain the final rois regions of interest, i.e. the scores and coordinates of all rois, and if the number of generated rois is lower than the set parameter, zeros are directly padded;
3) inputting the output of the ProposalLayer into the DetectionTargetLayer;
the input data include the preselected boxes proposals of the rois, the ground-truth categories gt_class_ids, the ground-truth boxes gt_boxes and the ground-truth masks gt_masks; because the input is zero-padded to fix its shape, the padded zeros are first removed completely, and after deletion the preselected boxes, ground-truth boxes and ground-truth masks corresponding to the non-zero ground-truth boxes (no more than the original number) are obtained; particularly crowded instances are then handled: the numbers of crowded instances and of normal instances are recorded separately, with gt_class_ids used for the judgement, so that if the id is greater than 0 the instance is a normal instance and its category is recorded, while if it is less than 0 the instance is a crowded instance; afterwards only the categories, boxes and masks of normal instances are used. IoU calculation is performed between all the obtained preselected boxes and the boxes of the crowded instances: the four coordinates of the two boxes are intersected to obtain the four coordinates of the intersection part and hence the intersection area S1; the union area S2 is obtained by adding the two box areas and subtracting the intersection area; finally the intersection area S1 is divided by the union area S2 to obtain the IoU. If the largest IoU of a preselected box against all the crowded instances is less than 0.001, that preselected box can be used;
4) selecting positive and negative samples and generating training samples: calculating the IoU overlap between the preselected boxes and the ground-truth boxes of the normal instances, and recording the index values of positive and negative samples separately; setting the number k of preselected boxes to be trained on each image, then selecting k × 0.33 positive sample indexes and randomly selecting two thirds of negative samples; for each positive sample index, finding the ground-truth box with the maximum IoU, recording the position and category of the corresponding ground-truth box, and calculating the deviation values between the positive samples and their ground-truth boxes; assigning a ground-truth mask to each training preselected box according to its corresponding ground-truth box;
5) assigning a positive sample to each ground-truth box and ground-truth mask according to the overlap; the real category of the target, the offsets of the positive samples relative to their corresponding ground-truth boxes and the masks of the corresponding ground-truth boxes are then input and enter the next layer for classification and regression operations, while the mask is generated through a parallel branch; during the classification and regression operations, the data first enters a ROIAlign layer, the corresponding region is taken out of the corresponding feature map after the layer number corresponding to each roi is obtained, the rois are then pooled and all rois are resized to a uniform size by means of bilinear interpolation; the obtained output passes in sequence through a convolution layer, a BN layer, a ReLU activation layer, another convolution layer, a BN layer and a ReLU activation layer, and two fully connected layers, the classification probabilities are obtained with a softmax function, and the coordinates of the final detection boxes are obtained through offset calculation.
8. The method for labeling sparse samples in a jittering environment as recited in claim 7, wherein the RPN region generation network further comprises a mask generation network: the rois are input and first enter a ROIAlign layer where the corresponding levels are calculated, then pass through 5 fully connected layers, each consisting of a convolutional layer, a BN layer and a ReLU layer, and finally the spatial size is recovered through upsampling to generate a mask with a category, which is output as the predicted mask position information.
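A corresponding PyTorch sketch of the mask generation network of claim 8 might look as follows; the channel count (256), the 3x3 kernels and the single transposed-convolution upsampling step are assumptions for illustration.

```python
import torch.nn as nn

class MaskHead(nn.Module):
    def __init__(self, in_channels=256, num_classes=81):
        super().__init__()
        blocks = []
        for _ in range(5):                       # five blocks, each convolution + BN + ReLU
            blocks += [nn.Conv2d(in_channels, 256, 3, padding=1),
                       nn.BatchNorm2d(256), nn.ReLU()]
            in_channels = 256
        self.convs = nn.Sequential(*blocks)
        self.upsample = nn.Sequential(nn.ConvTranspose2d(256, 256, 2, stride=2), nn.ReLU())
        self.mask = nn.Conv2d(256, num_classes, 1)            # one mask per category

    def forward(self, pooled_rois):              # (N, C, 14, 14) taken from ROIAlign
        x = self.upsample(self.convs(pooled_rois))            # recover the spatial size by upsampling
        return self.mask(x).sigmoid()                         # predicted per-category mask probabilities
```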
9. The method for labeling sparse samples in a jittering environment as claimed in claim 7, wherein in step 2), the formula for performing coordinate correction on the selected anchors by using the offsets is as follows:
G'x = Aw·dx + Ax
G'y = Ah·dy + Ay
G'w = Aw·exp(dw)
G'h = Ah·exp(dh)
wherein Ax, Ay, Aw, Ah are the center-point coordinates and the width and height of the preset anchors, dx, dy, dw, dh are the offsets, and G'x, G'y, G'w, G'h are the center-point coordinates and the width and height of the updated anchors.
10. The method as claimed in claim 7, wherein the formula for calculating the deviation value between the positive sample and the truth frame in step 4) is:
dy=(gt_center_y-center_y)/height
dx=(gt_center_x-center_x)/width
dh=ln(gt_height/height)
dw=ln(gt_width/width)
wherein center_x and center_y are the center-point coordinates of the positive sample, gt_center_x and gt_center_y are the center-point coordinates of the truth frame, height and width are the height and width of the positive sample, gt_height and gt_width are the height and width of the truth frame, and dy, dx, dh, dw are the offsets.
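A minimal NumPy sketch of claim 10's offset formulas, computing the regression target between a positive sample frame and its matched truth frame; the (y1, x1, y2, x2) box layout is an assumption.

```python
import numpy as np

def box_deltas(box, gt_box):
    y1, x1, y2, x2 = box
    gy1, gx1, gy2, gx2 = gt_box
    height, width = y2 - y1, x2 - x1                      # positive sample height and width
    gt_height, gt_width = gy2 - gy1, gx2 - gx1            # truth frame height and width
    center_y, center_x = y1 + 0.5 * height, x1 + 0.5 * width
    gt_center_y, gt_center_x = gy1 + 0.5 * gt_height, gx1 + 0.5 * gt_width

    dy = (gt_center_y - center_y) / height
    dx = (gt_center_x - center_x) / width
    dh = np.log(gt_height / height)                       # ln of the height ratio
    dw = np.log(gt_width / width)                         # ln of the width ratio
    return np.array([dy, dx, dh, dw])
```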
CN202010358369.5A 2020-04-29 2020-04-29 Method for marking sparse samples in jitter environment Active CN111488948B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010358369.5A CN111488948B (en) 2020-04-29 2020-04-29 Method for marking sparse samples in jitter environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010358369.5A CN111488948B (en) 2020-04-29 2020-04-29 Method for marking sparse samples in jitter environment

Publications (2)

Publication Number Publication Date
CN111488948A true CN111488948A (en) 2020-08-04
CN111488948B CN111488948B (en) 2021-07-20

Family

ID=71811836

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010358369.5A Active CN111488948B (en) 2020-04-29 2020-04-29 Method for marking sparse samples in jitter environment

Country Status (1)

Country Link
CN (1) CN111488948B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180330238A1 (en) * 2017-05-09 2018-11-15 Neurala, Inc. Systems and methods to enable continual, memory-bounded learning in artificial intelligence and deep learning continuously operating applications across networked compute edges
CN109284669A (en) * 2018-08-01 2019-01-29 辽宁工业大学 Pedestrian detection method based on Mask RCNN
CN110245683A (en) * 2019-05-13 2019-09-17 华中科技大学 The residual error relational network construction method that sample object identifies a kind of less and application
CN111062310A (en) * 2019-12-13 2020-04-24 哈尔滨工程大学 Few-sample unmanned aerial vehicle image identification method based on virtual sample generation

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
TAO WANG ET AL.: "Few-shot Adaptive Faster R-CNN", https://arxiv.org/abs/1903.09372 *
HOU QI: "Research on the Application of Deep Learning in Image Clustering and Classification", China Master's Theses Full-text Database, Information Science and Technology *
WANG XINPENG ET AL.: "Design of a Real-time Visual *** for Micro Flapping-wing Aircraft", Semiconductor Optoelectronics *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113553984A (en) * 2021-08-02 2021-10-26 中再云图技术有限公司 Video mask detection method based on context assistance
CN113553984B (en) * 2021-08-02 2023-10-13 中再云图技术有限公司 Video mask detection method based on context assistance
CN115049814A (en) * 2022-08-15 2022-09-13 聊城市飓风工业设计有限公司 Intelligent eye protection lamp adjusting method adopting neural network model

Also Published As

Publication number Publication date
CN111488948B (en) 2021-07-20

Similar Documents

Publication Publication Date Title
CN112036231B (en) Vehicle-mounted video-based lane line and pavement indication mark detection and identification method
CN112233129B (en) Deep learning-based parallel multi-scale attention mechanism semantic segmentation method and device
CN113610087B (en) Priori super-resolution-based image small target detection method and storage medium
CN110909724B (en) Thumbnail generation method of multi-target image
CN111488948B (en) Method for marking sparse samples in jitter environment
CN113052170B (en) Small target license plate recognition method under unconstrained scene
CN113486887B (en) Target detection method and device in three-dimensional scene
CN117253154B (en) Container weak and small serial number target detection and identification method based on deep learning
CN115984969A (en) Lightweight pedestrian tracking method in complex scene
CN115205636B (en) Image target detection method, system, equipment and storage medium
CN113706584A (en) Streetscape flow information acquisition method based on computer vision
CN111723713B (en) Video key frame extraction method and system based on optical flow method
CN111833362A (en) Unstructured road segmentation method and system based on superpixel and region growing
CN112785626A (en) Twin network small target tracking method based on multi-scale feature fusion
CN113850136A (en) Yolov5 and BCNN-based vehicle orientation identification method and system
CN113763427A (en) Multi-target tracking method based on coarse-fine shielding processing
CN111882581A (en) Multi-target tracking method for depth feature association
CN114550134A (en) Deep learning-based traffic sign detection and identification method
CN114519717A (en) Image processing method and device, computer equipment and storage medium
CN113627481A (en) Multi-model combined unmanned aerial vehicle garbage classification method for smart gardens
CN108765384B (en) Significance detection method for joint manifold sequencing and improved convex hull
CN114743045B (en) Small sample target detection method based on double-branch area suggestion network
CN116503622A (en) Data acquisition and reading method based on computer vision image
CN114820987A (en) Three-dimensional reconstruction method and system based on multi-view image sequence
CN110599518B (en) Target tracking method based on visual saliency and super-pixel segmentation and condition number blocking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210525

Address after: 400714 No. 266 Fangzheng Road, Beibei District, Chongqing.

Applicant after: Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences

Applicant after: Chongqing University

Address before: 400714 No. 266 Fangzheng Road, Beibei District, Chongqing.

Applicant before: Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences

GR01 Patent grant