CN113610885B - Semi-supervised target video segmentation method and system using difference contrast learning network - Google Patents

Semi-supervised target video segmentation method and system using difference contrast learning network

Info

Publication number
CN113610885B
CN113610885B · CN202110785106.7A · CN202110785106A
Authority
CN
China
Prior art keywords
target
feature
convolution
global
size
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110785106.7A
Other languages
Chinese (zh)
Other versions
CN113610885A (en)
Inventor
杨大伟
董美辰
毛琳
张汝波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Minzu University
Original Assignee
Dalian Minzu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Minzu University
Priority to CN202110785106.7A
Publication of CN113610885A
Application granted
Publication of CN113610885B
Legal status: Active (current)
Anticipated expiration legal-status

Classifications

    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL › G06T7/00 Image analysis › G06T7/10 Segmentation; Edge detection › G06T7/194 involving foreground-background segmentation
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL › G06T7/00 Image analysis › G06T7/10 Segmentation; Edge detection › G06T7/11 Region-based segmentation
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL › G06T7/00 Image analysis › G06T7/10 Segmentation; Edge detection › G06T7/12 Edge-based segmentation
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL › G06T7/00 Image analysis › G06T7/10 Segmentation; Edge detection › G06T7/136 involving thresholding
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL › G06T2207/00 Indexing scheme for image analysis or image enhancement › G06T2207/10 Image acquisition modality › G06T2207/10016 Video; Image sequence
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL › G06T2207/00 Indexing scheme for image analysis or image enhancement › G06T2207/20 Special algorithmic details › G06T2207/20081 Training; Learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS › Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE › Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION › Y02T10/00 Road transport of goods or passengers › Y02T10/10 Internal combustion engine [ICE] based vehicles › Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a semi-supervised target video segmentation method and system using a difference contrast learning network, and relates to the technical field of video segmentation. Global and local feature information of the target is extracted according to the initial frame mask; following the idea of contrast learning, the similarity between the target's global and local features is increased and the separation between target and background features is enlarged, yielding a more robust target feature representation. The obtained global features are then compared with the current frame at the pixel level and combined with the reference-frame segmentation result to ensure that the target and background regions are divided accurately in the video segmentation result.

Description

Semi-supervised target video segmentation method and system using difference contrast learning network
Technical Field
The application relates to the technical field of video segmentation, in particular to a semi-supervised target video segmentation method and system using a difference contrast learning network.
Background
The semi-supervised target video segmentation task finely separates the target objects in an entire video sequence from the background based on a given initial-frame mask, thereby achieving accurate target localization; it has wide application value and practical demand in fields such as video understanding, human-computer interaction, and autonomous driving. However, because the target and background change continuously within a video, and because of factors such as illumination changes and interference from similar backgrounds, single-target video segmentation still faces many challenges.
Existing semi-supervised video segmentation methods can be divided into three categories: motion-propagation-based, detection-based, and template-matching-based. Motion-propagation-based methods mainly exploit the temporal correlation of target motion and rely on the spatio-temporal relationships between pixels; when the target's position and shape change relatively smoothly, accurate segmentation can be achieved, but when temporally discontinuous factors such as occlusion or fast motion occur, drift arises. Detection-based methods do not depend on temporal information: they learn an appearance model from the target information in the initial-frame segmentation result and then detect and segment the target in each video frame; at test time, the initial-frame image and its segmentation map are augmented so that the trained model can be fine-tuned, yielding more accurate instance feature information, but this online training brings a large computational cost. Template-matching-based methods match the current video frame against the initial-frame features at the pixel level and segment pixels according to the comparison result; the segmentation result is not affected by accumulated propagation errors, but spatio-temporal information is not fully exploited and the requirements on initial-frame feature extraction are high.
Disclosure of Invention
To address the problems in the prior art, the application provides a semi-supervised target video segmentation method and system using a difference contrast learning network, which obtains robust and distinguishable target features in feature space and improves the performance of the video segmentation algorithm by adopting the idea of contrast learning combined with spatio-temporal information.
To achieve the above purpose, the technical solution of the application is as follows. A semi-supervised target video segmentation method using a difference contrast learning network includes the steps below (an illustrative end-to-end sketch of the whole procedure, under assumed interfaces, is given after step 8):
Step 1: input the initial video frame of size h×w into the backbone network to obtain general visual features with c feature channels, and apply edge-enhancement convolution to obtain visual features with clearer detail texture, which serve as the basis for the subsequent comparison network; multiply the visual features with the segmentation result, respectively, and adjust the size to obtain the target features and the background features.
Step 2: extract the global mapping feature of the target features.
Step 3: perform a pixel-level similarity comparison between the global mapping feature and the target features to obtain a similarity response map with c channels and size m×n.
Step 4: perform a pixel-level similarity comparison between the global mapping feature and the background features to obtain a difference-degree response map with c channels and size m×n.
Step 5: compare the global mapping feature with the visual features of the frame pixel by pixel, combine the reference-frame segmentation result, and, through convolution, distinguish the target from the background according to the pixel-level similarity between the global mapping feature and the target and background features, obtaining the target region and the background region.
Step 6: a better global feature mapping and the global mapping feature of the initial frame are obtained by learning through steps 3 to 5; sharing the convolution-layer parameters, repeat step 1 with a subsequent video frame of size h×w as input to obtain its visual features.
Step 7: taking the global mapping feature of the initial frame and the visual features of the subsequent frame as the basis, combine the reference-frame segmentation result and repeat step 5 to output the segmentation result of the subsequent frame.
Step 8: repeat steps 6 and 7 until the target segmentation task for the whole video is completed.
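The following end-to-end sketch illustrates how steps 1 to 8 fit together. It is a hedged sketch, not the implementation of this application: `backbone`, `edge_conv` and `branches` are placeholder names for the modules described above, and their interfaces are assumptions.

```python
import torch.nn.functional as F

def segment_video(frames, init_mask, backbone, edge_conv, branches):
    """Hedged sketch of steps 1-8. `frames` is a list of (3, h, w) tensors and
    `init_mask` the given initial-frame mask of shape (1, h, w); `backbone`,
    `edge_conv` and `branches` are stand-ins, not the modules of this application."""
    # Step 1: general visual features of the initial frame, then edge enhancement.
    feat0 = edge_conv(backbone(frames[0].unsqueeze(0)))                 # (1, c, m, n)
    mask0 = F.interpolate(init_mask.unsqueeze(0), size=feat0.shape[-2:])
    target_feat, background_feat = feat0 * mask0, feat0 * (1 - mask0)

    # Step 2: global mapping feature of the target (pooling + fully connected layer).
    global_feat = branches.global_mapping(target_feat)                  # (1, c)

    # Steps 3-5 (training) drive the similarity / difference contrast branches and the
    # reference branch from target_feat, background_feat and global_feat (sketched below).

    # Steps 6-8: segment each subsequent frame with shared convolution parameters,
    # taking the previous frame's result as the reference-frame segmentation.
    results, reference = [init_mask.unsqueeze(0)], init_mask.unsqueeze(0)
    for frame in frames[1:]:
        feat = edge_conv(backbone(frame.unsqueeze(0)))                  # shared parameters
        pred = branches.reference_branch(global_feat, feat, reference)  # (1, 1, m, n)
        results.append(pred)
        reference = pred
    return results
```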
Further, the visual features and the segmentation result are multiplied and the size is adjusted to obtain the target features and the background features.
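A minimal sketch of this multiplication follows; the displayed formula of the original is not reproduced here, so obtaining the background features with the complement of the mask is an assumption consistent with the definition of the background feature given later.

```python
import torch.nn.functional as F

def split_target_background(visual_feat, seg_mask):
    """visual_feat: (1, c, m, n) edge-enhanced visual features;
    seg_mask: (1, 1, H, W) initial-frame mask with 1 = target, 0 = background."""
    # Resize the mask to the feature resolution, then separate the two regions.
    mask = F.interpolate(seg_mask, size=visual_feat.shape[-2:], mode="nearest")
    target_feat = visual_feat * mask             # keep only the target region
    background_feat = visual_feat * (1 - mask)   # assumed complement-mask form
    return target_feat, background_feat
```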
Further, extracting the global mapping feature from the target features comprises two parts, global average pooling and a fully connected layer, which are respectively:
(1) First, global average pooling with convolution kernels J_{3×3,c} is performed on the target features to output a c-dimensional feature vector.
Here H_{average}(x, J_{k×k,c}, s, p) denotes the average pooling function applied as a convolution operation; convolution kernels with step size s = 1 and kernel size k = 3 pool the pixel features of the c feature channels in turn until the c-dimensional feature vector is output. This reduces the number of parameters and the amount of computation while preserving the feature integrity of the image content.
(2) The c-dimensional feature vector produced by global average pooling is input to the fully connected layer to obtain the global mapping feature.
In this mapping, μ is a mapping coefficient and η is a correction amount. The fully connected layer improves the purity of the global feature and reduces the influence of position on the feature expression.
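A sketch of the two-part extraction is given below; since the original formulas are not reproduced here, modelling the mapping coefficient μ and the correction amount η as the weight and bias of a linear layer is an assumption.

```python
import torch
import torch.nn as nn

class GlobalMapping(nn.Module):
    """3x3 average pooling (stride 1) over the target features, global averaging to a
    c-dimensional vector, then a fully connected layer; the mapping coefficient and
    correction amount are modelled by the linear layer's weight and bias."""
    def __init__(self, c):
        super().__init__()
        self.local_pool = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)
        self.fc = nn.Linear(c, c)                  # assumed to keep dimension c

    def forward(self, target_feat):                # (1, c, m, n)
        pooled = self.local_pool(target_feat)
        vector = pooled.mean(dim=(2, 3))           # c-dimensional feature vector
        return self.fc(vector)                     # global mapping feature, shape (1, c)
```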
Further, a similarity response map with c channels and size m×n is obtained, where i = 1, 2, ..., m, j = 1, 2, ..., n and l = 1, 2, ..., c index the map, and H_{standard} is a normalization function that maps the similarity score of each pixel into the interval [0, 1]. Each pixel takes its highest r scores, giving a three-channel scoring result map of size m×n, and an average pooling operation on this scoring map yields the final similar-contrast response map.
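A sketch of the per-pixel comparison used by the contrast branches follows. Using a sigmoid as a stand-in for the normalization H_{standard}, and scoring each channel by an elementwise product with the global mapping feature, are both assumptions.

```python
import torch

def contrast_response(pixel_feat, global_feat, r=3):
    """pixel_feat: (1, c, m, n); global_feat: (1, c). Each channel of every pixel is
    scored against the corresponding channel of the global mapping feature, scores are
    normalised to [0, 1], and the highest r scores per pixel are averaged."""
    scores = pixel_feat * global_feat[:, :, None, None]   # (1, c, m, n) channel-wise scores
    scores = torch.sigmoid(scores)                        # assumed stand-in for H_standard
    top_r, _ = scores.topk(r, dim=1)                      # keep the r best channel scores per pixel
    return top_r.mean(dim=1)                              # (1, m, n) response map
```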
Further, a difference-degree response map with c channels and size m×n is obtained, where i = 1, 2, ..., m, j = 1, 2, ..., n and l = 1, 2, ..., c index the map.
The highest r scores of each pixel are taken to obtain a three-channel scoring result map of size m×n, and an average pooling operation on this scoring map yields the final difference-contrast response map.
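Since the difference branch mirrors the similarity branch, the earlier `contrast_response` sketch can be reused with the background features; the dummy shapes below are illustrative only.

```python
import torch

# Reusing contrast_response from the sketch above with dummy shapes (c = 256, m = n = 4).
target_feat, background_feat = torch.rand(1, 256, 4, 4), torch.rand(1, 256, 4, 4)
global_feat = torch.rand(1, 256)

# Similarity branch: global mapping feature scored against the target features.
similarity_map = contrast_response(target_feat, global_feat, r=3)
# Difference branch: the same comparison against the background features; training then
# raises the similarity response and suppresses the difference response (contrast learning).
difference_map = contrast_response(background_feat, global_feat, r=3)
```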
further, the target areaAnd background area->The calculation formula of (2) is as follows:
and sigma is a threshold value, and is obtained through training and used for judging the target and background areas in the video frame. Setting the convolution kernel size as 1×1, and the step length s=1, performing convolution operation on the primary segmentation results of the target and the background, performing fine processing, and outputting a segmentation mapThe formula is:
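A sketch of the region decision and the 1×1-convolution refinement follows; representing the trained threshold σ as a learnable parameter and the preliminary result as a two-channel map are assumptions.

```python
import torch
import torch.nn as nn

class RegionRefiner(nn.Module):
    """Pixels whose response exceeds the threshold sigma form the preliminary target
    region, the rest the background region; a 1x1 convolution with stride 1 then
    refines the two-channel preliminary result into the output segmentation map."""
    def __init__(self):
        super().__init__()
        self.sigma = nn.Parameter(torch.tensor(0.5))         # threshold learned in training
        self.refine = nn.Conv2d(2, 1, kernel_size=1, stride=1)

    def forward(self, response):                              # (1, m, n) pixel responses
        # Hard comparison for clarity; a soft comparison would be needed for the
        # threshold to receive gradients during training.
        target_region = (response > self.sigma).float()
        background_region = 1.0 - target_region
        preliminary = torch.stack([target_region, background_region], dim=1)  # (1, 2, m, n)
        return torch.sigmoid(self.refine(preliminary))        # refined segmentation map
```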
the application also provides a semi-supervised target video segmentation system using the difference contrast learning network, comprising:
a difference contrast learning network, which obtains the general visual features of the initial video frame through backbone-network processing and then obtains, through edge-enhancement convolution processing, visual features with clearer detail texture; the visual features are multiplied with the initial-frame segmentation map to obtain the target features and the background features; the target features are processed by global average pooling to obtain a feature vector, from which the global mapping feature is then obtained;
a similarity comparison branch unit, which improves the ability of the global mapping feature to describe the target by increasing the similarity between the global mapping feature and the target's local features: the feature vector of each pixel of the target features is compared with the global mapping feature through a convolution with kernel size 1×1 to obtain a similarity scoring map with c channels and size m×n, i.e. each pixel comprises c channels and each channel has a corresponding similarity score; the top k scores are retained and average pooling is applied to obtain the final similarity response map. Because the receptive field of local features is limited, the similarity comparison branch improves the ability of the global mapping feature to capture information from different local regions through the comparison and learning between the local features and the global mapping feature;
a difference comparison branch unit, which improves the model's ability to divide the target from the background by enlarging the degree of distinction between the background features and the global mapping feature: the feature vector of each pixel of the background features is compared with the global mapping feature through a convolution with kernel size 1×1 to obtain a similarity scoring map with c channels and size m×n, i.e. each pixel comprises c channels and each channel has a corresponding similarity score; as in the similarity branch, the top k scores are retained and average pooling is applied to obtain the final difference response map;
a reference learning branch unit, which compares the global mapping feature and the visual features pixel by pixel through a convolution with kernel size 1×1 to obtain a similarity scoring map with c channels and size m×n, combines the reference-frame segmentation result, obtains a more accurate response map through a convolution with kernel size 3×3, and finally outputs the segmentation result of the target and the background.
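A sketch of the reference learning branch follows. The exact 1×1 comparison operator is not specified beyond a 1×1 convolution, so concatenating the broadcast global mapping feature with the frame features before that convolution is an assumption.

```python
import torch
import torch.nn as nn

class ReferenceBranch(nn.Module):
    """Pixel-wise comparison of the current frame's visual features with the global
    mapping feature via a 1x1 convolution, concatenation of the reference-frame mask,
    and a 3x3 convolution that outputs the target/background segmentation."""
    def __init__(self, c):
        super().__init__()
        self.compare = nn.Conv2d(2 * c, c, kernel_size=1)
        self.refine = nn.Conv2d(c + 1, 2, kernel_size=3, padding=1)

    def forward(self, global_feat, visual_feat, reference_mask):
        b, c, m, n = visual_feat.shape
        tiled = global_feat[:, :, None, None].expand(b, c, m, n)     # broadcast global feature
        scores = self.compare(torch.cat([visual_feat, tiled], dim=1))
        ref = nn.functional.interpolate(reference_mask, size=(m, n), mode="nearest")
        logits = self.refine(torch.cat([scores, ref], dim=1))        # target / background maps
        return logits.softmax(dim=1)[:, :1]                          # target probability map
```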
By adopting the technical scheme, the application can obtain the following technical effects:
(1) Completing video target segmentation and target tracking multi-domain tasks
The difference contrast learning network can complete the target tracking task while performing video target segmentation and at the same time improves tracking accuracy, narrowing the gap between the segmentation and tracking tasks and enlarging the application range of the network.
(2) Target segmentation task applicable to automatic driving
By combining the reference-frame segmentation result, the application effectively improves segmentation precision when the target moves rapidly or deforms; it is suitable for the autonomous-driving field, where accurate segmentation results enable accurate obstacle avoidance.
(3) Real-time tracking task suitable for automatic driving
The method and system can be applied to the tracking module in autonomous driving: bounding the target segmentation result yields a real-time tracking box for targets such as pedestrians, supporting the subsequent path planning of autonomous driving.
(4) Security monitoring system
The difference contrast learning network increases the distinction between target and background, and pixel-level comparison realizes accurate segmentation, so the target can be accurately located and segmented from the background in complex scenes; the method can therefore be applied to security monitoring systems.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; for a person skilled in the art, other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic overall framework of the present method;
FIG. 2 is a schematic diagram of the real-time tracking task of the autopilot target in example 1;
FIG. 3 is a schematic illustration of an autopilot obstacle avoidance task of example 2;
fig. 4 is a schematic diagram of a security monitoring designated target task in example 3.
Detailed Description
The embodiments of the application are implemented on the premise of the technical solution of the application, and detailed implementations and specific operation processes are given, but the protection scope of the application is not limited to the following embodiments.
The embodiments provide a semi-supervised target video segmentation method and system using a difference contrast learning network: global and local feature information of the target is extracted according to the initial frame mask; following the idea of contrast learning, the similarity between the target's global and local features is increased and the separation between target and background features is enlarged, yielding a more robust target feature representation. Pixel-level comparison is then performed with the obtained global features, combined with the reference-frame segmentation result, to ensure that the target and background regions are divided accurately in the video segmentation result.
In the application, the initial frame is the first frame of the task video, for which the segmentation result of the target and the background is given. The segmentation result is the result of accurately separating the target and background regions along the target contour. The reference-frame segmentation result refers to the segmentation result of the frame preceding the current test frame. The test frame is any subsequent video frame of the task video, other than the initial frame, that needs to be segmented, i.e. the video frame currently being segmented. The general visual features are basic visual features, including color, shape and spatial relationships, extracted through the backbone network. The clear visual features are the result of enhancing the detail texture and edge expression of the image through the edge-enhancement convolution network. The target feature is the area of the overall feature map that contains the target. The background feature is the overall feature map with the region containing the target removed. The global mapping feature refers to a global feature expression that can represent the target. A feature vector is a mathematical form of feature expression. A feature channel is where the convolution layer exchanges information and is also an expression of the features' mapping regions. The similarity response map reflects the similarity relationship between the compared input features, while the difference response map reflects the difference relationship between them. The target region is the part of the image whose comparison with the target's global mapping feature scores above the set threshold and is therefore judged to be the target. The background region is the part whose comparison with the target's global mapping feature scores below the set threshold and is therefore judged to be the background.
The input video frame can be a 1280×720 RGB three-channel image; after backbone-network processing, the output general visual features can be of size 640×360. The number of output channels c of each backbone layer is {32, 64, 128, 256, 512}, so general feature maps of different sizes, (1, 640, 360, 32), (1, 640, 360, 64), (1, 640, 360, 128), (1, 640, 360, 256) and (1, 640, 360, 512), can be output as required. After processing by the CNN convolution layer, a feature map of size (1, 640, 360, 256) can be output. In the similarity comparison branch and the difference comparison branch, the highest r scores of each pixel are obtained through convolution-based similarity comparison and then average pooled, where r is taken from {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}.
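The quoted sizes can be checked with a small stand-in stage; the layer below is hypothetical and is not the backbone of this application.

```python
import torch
import torch.nn as nn

# Stand-in stage: a 1280x720 RGB frame reduced to 640x360 feature maps with 32 channels,
# matching the first entry of the channel list {32, 64, 128, 256, 512} above.
frame = torch.rand(1, 3, 720, 1280)                          # RGB three-channel input
stage = nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1)
features = stage(frame)
assert features.shape == (1, 32, 360, 640)

# Candidate values of r kept per pixel in the similarity / difference branches.
r_candidates = list(range(1, 11))                            # {1, 2, ..., 10}
```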
When extracting the global mapping features of the target in the initial frame and in subsequent frames, the convolution parameters are shared. In the similarity comparison branch and the difference comparison branch, each pixel comprises c channels, each channel has a corresponding similarity score, the highest r scores of each pixel are taken, and the values of c and r are the same in the two branches.
Example 1:
real-time tracking task for automatic driving target
This embodiment is directed at the autonomous-driving target tracking task. The application is applied to a vehicle-mounted camera to locate and track the vehicle's surroundings in real time, preparing for the system's path planning and ensuring driving safety. The autonomous-driving real-time localization task is shown in FIG. 2.
Example 2:
automatic driving obstacle avoidance task
This embodiment targets the autonomous-driving process: applied to a vehicle-mounted camera, it improves the accuracy of locating and segmenting road obstacles in the captured image and realizes accurate obstacle avoidance. The autonomous-driving obstacle avoidance task is shown in FIG. 3.
Example 3:
target task appointed by security monitoring system
This embodiment is applied to a security monitoring system: it locates and segments designated targets in complex scenes and improves the efficiency of monitoring and inspection; the designated-target task is shown in FIG. 4.
The foregoing descriptions of specific exemplary embodiments of the present application are presented for purposes of illustration and description. It is not intended to limit the application to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain the specific principles of the application and its practical application to thereby enable one skilled in the art to make and utilize the application in various exemplary embodiments and with various modifications as are suited to the particular use contemplated. It is intended that the scope of the application be defined by the claims and their equivalents.

Claims (2)

1. A semi-supervised target video segmentation method using a difference contrast learning network, characterized by comprising the following steps:
step 1: inputting an initial video frame of size h×w into a backbone network to obtain general visual features with c feature channels, and performing edge-enhancement convolution to obtain visual features with clearer detail texture; multiplying said visual features with the segmentation result and adjusting the size to obtain target features and background features;
step 2: extracting the global mapping feature of the target features;
step 3: performing a pixel-level similarity comparison between the global mapping feature and the target features to obtain a similarity response map with c channels and size m×n;
step 4: performing a pixel-level similarity comparison between the global mapping feature and the background features to obtain a difference-degree response map with c channels and size m×n;
step 5: comparing the global mapping feature with the visual features pixel by pixel, combining the reference-frame segmentation result, and, through convolution, distinguishing the target from the background according to the pixel-level similarity between the global mapping feature and the target and background features, to obtain a target region and a background region;
step 6: sharing the convolution-layer parameters, repeating step 1 with a subsequent video frame of size h×w as input to obtain its visual features;
step 7: taking the global mapping feature of the initial frame and the visual features of the subsequent frame as the basis, combining the reference-frame segmentation result and repeating step 5 to output the segmentation result of the subsequent frame;
step 8: repeating steps 6 and 7 until the target segmentation task for the whole video is completed;
wherein the visual features and the segmentation result are multiplied and the size is adjusted to obtain the target features and the background features;
extracting the global mapping feature from the target features comprises two parts, global average pooling and a fully connected layer, which are respectively:
(1) first, global average pooling with convolution kernels J_{3×3,c} is performed on the target features to output a c-dimensional feature vector,
wherein H_{average}(x, J_{k×k,c}, s, p) is the average pooling function applied as a convolution operation, and the pixel features of the c feature channels are sequentially pooled using a convolution kernel with step size s = 1 and kernel size k = 3 until the c-dimensional feature vector is output;
(2) the c-dimensional feature vector produced by global average pooling is input to the fully connected layer to obtain the global mapping feature,
wherein μ is a mapping coefficient and η is a correction amount;
the similarity response map with c channels and size m×n is obtained for i = 1, 2, ..., m, j = 1, 2, ..., n and l = 1, 2, ..., c,
wherein H_{standard} is a normalization function that maps the similarity score of each pixel into the interval [0, 1]; the highest r scores of each pixel are taken to obtain a three-channel scoring result map of size m×n, and an average pooling operation on this scoring map yields the final similar-contrast response map;
the difference-degree response map with c channels and size m×n is obtained in the same manner:
the highest r scores of each pixel are taken to obtain a three-channel scoring result map of size m×n, and an average pooling operation on this scoring map yields the final difference-contrast response map;
the target region and the background region are determined as follows:
wherein σ is a threshold obtained through training and used to decide the target and background regions in the video frame; with the convolution kernel size set to 1×1 and the step size s = 1, a convolution operation is applied to the preliminary segmentation results of the target and the background for refinement, and the segmentation map is output.
2. A semi-supervised target video segmentation system using a difference contrast learning network to implement the semi-supervised target video segmentation method of claim 1, characterized by comprising:
a difference contrast learning network for obtaining the general visual features of the initial video frame through backbone-network processing and then, through edge-enhancement convolution processing, visual features with clearer detail texture, wherein the visual features are multiplied with the initial-frame segmentation map to obtain the target features and the background features, the target features are processed by global average pooling to obtain a feature vector, and the global mapping feature is then obtained from the feature vector;
a similarity comparison branch unit for comparing the feature vector of each pixel of the target features with the global mapping feature through a convolution with kernel size 1×1 to obtain a similarity scoring map with c channels and size m×n, i.e. each pixel comprises c channels and each channel has a corresponding similarity score, wherein the top k scores are retained and average pooling is applied to obtain the final similarity response map;
a difference comparison branch unit for comparing the feature vector of each pixel of the background features with the global mapping feature through a convolution with kernel size 1×1 to obtain a similarity scoring map with c channels and size m×n, i.e. each pixel comprises c channels and each channel has a corresponding similarity score, wherein, as in the similarity branch, the top k scores are retained and average pooling is applied to obtain the final difference response map; and
a reference learning branch unit for comparing the global mapping feature and the visual features pixel by pixel through a convolution with kernel size 1×1 to obtain a similarity scoring map with c channels and size m×n, combining the reference-frame segmentation result, obtaining a more accurate response map through a convolution with kernel size 3×3, and finally outputting the segmentation result of the target and the background.
CN202110785106.7A 2021-07-12 2021-07-12 Semi-supervised target video segmentation method and system using difference contrast learning network Active CN113610885B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110785106.7A CN113610885B (en) 2021-07-12 2021-07-12 Semi-supervised target video segmentation method and system using difference contrast learning network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110785106.7A CN113610885B (en) 2021-07-12 2021-07-12 Semi-supervised target video segmentation method and system using difference contrast learning network

Publications (2)

Publication Number Publication Date
CN113610885A CN113610885A (en) 2021-11-05
CN113610885B 2023-08-22

Family

ID=78337461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110785106.7A Active CN113610885B (en) 2021-07-12 2021-07-12 Semi-supervised target video segmentation method and system using difference contrast learning network

Country Status (1)

Country Link
CN (1) CN113610885B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800689A (en) * 2019-01-04 2019-05-24 西南交通大学 A kind of method for tracking target based on space-time characteristic fusion study
CN110163875A (en) * 2019-05-23 2019-08-23 南京信息工程大学 One kind paying attention to pyramidal semi-supervised video object dividing method based on modulating network and feature
CN111968123A (en) * 2020-08-28 2020-11-20 北京交通大学 Semi-supervised video target segmentation method
CN112861830A (en) * 2021-04-13 2021-05-28 北京百度网讯科技有限公司 Feature extraction method, device, apparatus, storage medium, and program product

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10671855B2 (en) * 2018-04-10 2020-06-02 Adobe Inc. Video object segmentation by reference-guided mask propagation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
融合时空多特征表示的无监督视频分割算法 [Unsupervised video segmentation algorithm fusing spatio-temporal multi-feature representation]; Li Xuejun; Zhang Kaihua; Song Huihui; Computer Applications (No. 11); full text *

Also Published As

Publication number Publication date
CN113610885A (en) 2021-11-05

Similar Documents

Publication Publication Date Title
CN109784333B (en) Three-dimensional target detection method and system based on point cloud weighted channel characteristics
CN111209810B (en) Boundary frame segmentation supervision deep neural network architecture for accurately detecting pedestrians in real time through visible light and infrared images
CN109615611B (en) Inspection image-based insulator self-explosion defect detection method
CN107452015B (en) Target tracking system with re-detection mechanism
CN106780560A (en) A kind of feature based merges the bionic machine fish visual tracking method of particle filter
CN109657538B (en) Scene segmentation method and system based on context information guidance
CN113763427A (en) Multi-target tracking method based on coarse-fine shielding processing
Qing et al. A novel particle filter implementation for a multiple-vehicle detection and tracking system using tail light segmentation
CN110334703B (en) Ship detection and identification method in day and night image
CN116665097A (en) Self-adaptive target tracking method combining context awareness
CN107122756A (en) A kind of complete non-structural road edge detection method
CN113610885B (en) Semi-supervised target video segmentation method and system using difference contrast learning network
CN111914749A (en) Lane line recognition method and system based on neural network
CN113470074B (en) Self-adaptive space-time regularization target tracking method based on block discrimination
CN112801020B (en) Pedestrian re-identification method and system based on background graying
CN109934853B (en) Correlation filtering tracking method based on response image confidence region adaptive feature fusion
Ueda et al. Data Augmentation for Semantic Segmentation Using a Real Image Dataset Captured Around the Tsukuba City Hall
CN113112522A (en) Twin network target tracking method based on deformable convolution and template updating
CN112949389A (en) Haze image target detection method based on improved target detection network
Zhou et al. An anti-occlusion tracking system for UAV imagery based on Discriminative Scale Space Tracker and Optical Flow
Zhao et al. A traffic sign detection method based on saliency detection
Lin et al. Breaking of brightness consistency in optical flow with a lightweight CNN network
Han et al. A robust object detection algorithm based on background difference and LK optical flow
Tao et al. A sky region segmentation method for outdoor visual-inertial SLAM
US20240193964A1 (en) Lane line recognition method, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant