CN107644429B - Video segmentation method based on strong target constraint video saliency


Info

Publication number: CN107644429B
Application number: CN201710946156.2A
Authority: CN (China)
Prior art keywords: frame, target, pixel, video, segmentation
Legal status: Expired - Fee Related
Other versions: CN107644429A (Chinese, zh)
Inventors: 韩守东, 张珑, 刘昱均, 陈阳, 胡卓
Assignees: Huazhong University of Science and Technology; Shenzhen Huazhong University of Science and Technology Research Institute
Application filed by Huazhong University of Science and Technology and Shenzhen Huazhong University of Science and Technology Research Institute
Publication of application CN107644429A, application granted, publication of grant CN107644429B

Landscapes

  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video segmentation method based on strong-target-constraint video saliency, belonging to the technical field of image processing. The method introduces strong target constraints on top of image saliency: the position and scale constraints of the target are obtained through a multi-scale tracking algorithm with optical flow correction, the color constraint of the target is obtained from the segmentation results of historical frames, and a video saliency result is computed from these constraints. A histogram classification operation on the video saliency result yields a label mask image, from which a foreground/background prior probability model of the current frame is calculated. A spatio-temporally continuous fully connected conditional random field model based on superpixels is then constructed for the current frame: the data term is defined with the prior probability model, the intra-frame and inter-frame smoothing terms are defined by combining the color distances, spatial distances and edge relations among superpixels, and the model is optimized with a fast high-dimensional Gaussian filtering algorithm to complete video target segmentation. The method effectively improves the accuracy and time efficiency of video segmentation.

Description

Video segmentation method based on strong target constraint video saliency
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a video segmentation method based on strong target constraint video saliency.
Background
Video segmentation is the task of segmenting the foreground object in each frame of a video, i.e., completely determining the foreground contour. Video segmentation usually sits at the lowest level of a video processing pipeline, and its result feeds higher-level video applications such as foreground feature extraction, foreground classification and foreground recognition. In industrial production there is also demand for video segmentation, for example in video splicing, three-dimensional video reconstruction and video semantic analysis.
In the video segmentation problem, the foreground region in every frame must be segmented. Video segmentation algorithms generally reduce the problem to two parts, identification of possible foreground regions and image segmentation: a possible foreground region is first determined in the current video frame, and the region is then refined and segmented with an existing image segmentation method to obtain the foreground segmentation result of the current frame. Video segmentation can be classified into unsupervised and supervised video segmentation according to whether manual interaction is involved. Unsupervised video segmentation requires no manual interaction: given an input video, the algorithm computes the result directly. This fully automatic mode avoids manual labour, so its application cost is low. However, because unsupervised methods always process all regions of every video frame, their efficiency is very low and the accuracy of the segmentation result is ordinary, which makes them hard to use in many practical situations. Supervised video segmentation adds manual interaction: typically the foreground region to be segmented is marked in the first frame of the video, or a foreground is marked in selected key frames, or a target learning model is pre-built for a specific foreground class. With the resulting prior foreground/background information, the target region of the video to be processed is narrowed and computed in a refined manner; at the cost of a small amount of manual effort, the time efficiency of the video segmentation algorithm is greatly improved and a more accurate segmentation result is obtained, so supervised methods have higher practical value in industrial production.
However, even with manual interaction, only the possible foreground regions of a very small portion of the key frames can be determined. How to obtain the possible foreground regions in all video frames from the key-frame prior information is therefore a major problem in video segmentation. Existing solutions in the field, such as directed acyclic graphs, point trajectory tracking and saliency, rest on many assumptions and place harsh requirements on the application scene. As for applying image segmentation algorithms to video segmentation, the GraphCut model obtains relatively effective segmentation results and is therefore widely adopted in image/video segmentation; however, its computational efficiency is low, and the problem becomes more pronounced for video segmentation with its large data volume. To cope with the huge data volume of video processing, a more concise and faster image segmentation algorithm needs to be introduced into video segmentation.
Disclosure of Invention
Aiming at the defects or improvement requirements in the prior art, the invention provides a video segmentation method based on strong target constraint video saliency, and aims to adopt strong target constraint video saliency extraction and a spatial-temporal continuous full-connection conditional random field video segmentation algorithm based on superpixels, so that the problems of inaccuracy and low efficiency of the existing video segmentation technology are solved.
In order to achieve the above object, the present invention provides a video segmentation method for constraining video saliency based on a strong target, the method comprising:
(1) calculating the optical flow motion information and the super-pixel segmentation result of the video frame;
(2) performing interactive segmentation on a first frame picture in a video frame sequence to obtain a target frame; calibrating a target foreground through a target frame and completing segmentation to obtain a target segmentation result; initializing a multi-scale tracking model by using the position and size information of the target frame on the basis of the target segmentation result;
(3) reading a next frame of image, obtaining a target frame of a current frame by using a multi-scale tracking model, correcting the target frame of the current frame by using the optical flow motion information of the current frame and a target segmentation result of a previous frame to obtain a corrected target frame, and updating the multi-scale tracking model by using the target position and scale information of the target frame;
(4) acquiring a target color model of the current frame according to the target segmentation results of the first frame and the previous frame; integrating the target color model, the target position and the scale information into image significance in a strong target constraint mode, and calculating to obtain strong target constraint video frame significance;
(5) threshold limiting and histogram segmentation operations are performed on the video frame saliency result to obtain a rough label mask image of the current frame, and a foreground/background prior probability model of the current frame is calculated by combining the target segmentation result of the first frame;
(6) establishing a spatial-temporal continuous full-connection conditional random field model based on superpixels, defining data items by using a front/background prior probability model of a current frame, and defining an intra-frame smoothing item and an inter-frame smoothing item by combining color distances, spatial distances and edge relations among the superpixels; performing optimization solution by using a fast high-dimensional Gaussian filtering algorithm to obtain a target segmentation result of the current frame;
(7) repeating steps (3) to (6) until the video segmentation is finished.
Further, the calculating process of the optical flow motion information in the step (1) includes:
(11) computing the optical flow gradient strength at pixel q in the video frame:

b_q^m = 1 - exp(-λ_m·‖∇v_q‖),

where λ_m is an intensity parameter and ∇v_q is the gradient value at pixel q;
(12) calculating the maximum difference in motion direction between each pixel and its neighbouring pixels in the video frame:

b_q^θ = 1 - exp(-λ_θ·max_{q'∈Q_q} δθ(q, q')),

where δθ(q, q') is the angular difference between pixel q and pixel q', Q_q is the neighborhood of pixel q, and λ_θ is an angle difference parameter;
(13) the pixel velocity difference in the video frame is:

b_q = 1 if b_q^m ≥ T_m, and b_q = b_q^θ otherwise,

where T_m is a decision threshold; the invention adaptively computes the decision threshold T_m with the histogram-iteration optimal threshold method;
(14) if b_q is larger than a difference threshold, pixel q is judged to be a contour pixel of the motion region; the difference threshold ranges from 0.4 to 0.6 and is preferably 0.5;
(15) taking any pixel in the video frame as the origin, a ray is emitted every 45 degrees clockwise starting from the 12 o'clock direction to obtain 8 rays, and the number of intersections of each ray with the contour pixels is counted; if more than 4 rays intersect the contour an odd number of times, the pixel is judged to be a valid motion pixel; otherwise it is judged to be a noise pixel (a code sketch of steps (11)-(15) is given below).
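By way of illustration, steps (11) to (15) can be prototyped directly on a dense optical flow field. The NumPy sketch below is a minimal, non-authoritative rendering of the description: the exponential forms of b_q^m and b_q^θ, the use of the histogram mean as a stand-in for the histogram-iteration optimal threshold, and counting a "crossing" as entry into a contiguous contour run are all assumptions, not the patent's reference implementation.

```python
import numpy as np

def motion_contour_and_valid_pixels(flow, lambda_m=1.0, lambda_theta=1.0, diff_thresh=0.5):
    """Sketch of steps (11)-(15); flow is an (H, W, 2) dense optical flow field."""
    H, W, _ = flow.shape

    # (11) optical flow gradient strength (assumed form: 1 - exp(-lambda_m * ||grad flow||))
    gu = np.gradient(flow[..., 0])          # [du/dy, du/dx]
    gv = np.gradient(flow[..., 1])          # [dv/dy, dv/dx]
    grad_mag = np.sqrt(gu[0] ** 2 + gu[1] ** 2 + gv[0] ** 2 + gv[1] ** 2)
    b_m = 1.0 - np.exp(-lambda_m * grad_mag)

    # (12) maximum angular difference between each pixel and its 8-neighbourhood
    angle = np.arctan2(flow[..., 1], flow[..., 0])
    max_dtheta = np.zeros((H, W))
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            shifted = np.roll(np.roll(angle, dy, axis=0), dx, axis=1)
            d = np.abs(angle - shifted)
            max_dtheta = np.maximum(max_dtheta, np.minimum(d, 2 * np.pi - d))
    b_theta = 1.0 - np.exp(-lambda_theta * max_dtheta)

    # (13)-(14) decision threshold T_m (histogram mean used here as a stand-in for the
    # histogram-iteration optimal threshold) and contour decision
    T_m = b_m.mean()
    b = np.where(b_m >= T_m, 1.0, b_theta)
    contour = b > diff_thresh

    # (15) 8 rays every 45 degrees; a pixel is a valid motion pixel if more than 4 rays
    # cross the contour an odd number of times (crossing = entering a contour run).
    # The patent accelerates this with an integral-image scheme; this loop is the naive version.
    dirs = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]
    valid = np.zeros((H, W), dtype=bool)
    for y in range(H):
        for x in range(W):
            odd_rays = 0
            for dy, dx in dirs:
                crossings, inside = 0, False
                cy, cx = y + dy, x + dx
                while 0 <= cy < H and 0 <= cx < W:
                    if contour[cy, cx] and not inside:
                        crossings += 1
                    inside = contour[cy, cx]
                    cy, cx = cy + dy, cx + dx
                if crossings % 2 == 1:
                    odd_rays += 1
            valid[y, x] = odd_rays > 4
    return contour, valid
```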
Further, the calculating the super-pixel segmentation result in the step (1) specifically includes:
(16) the video frame is segmented with the SLIC over-segmentation method to obtain M superpixels; r_i denotes a superpixel, i ∈ M, the number of pixels in region r_i is O, and the center of gravity of region r_i is taken as the superpixel coordinate:

x_{r_i} = (1/O)·Σ_{o=1..O} x_{r_i}^o,  y_{r_i} = (1/O)·Σ_{o=1..O} y_{r_i}^o,

where x_{r_i}^o and y_{r_i}^o are the abscissa and ordinate of the o-th pixel within superpixel region r_i, and x_{r_i} and y_{r_i} are the abscissa and ordinate of the superpixel;
(17) the physical distance feature f_dis(r_1, r_2) between superpixels r_1 and r_2 is defined as a decreasing function of the distance between their centroid coordinates, controlled by a physical distance characteristic parameter θ_γ; θ_γ ranges from 2 to 4 and is preferably 3;
(18) the color distance feature between superpixels r_1 and r_2 is:

D_r(r_1, r_2) = Σ_{i=1..n_1} Σ_{j=1..n_2} f(c_1, i)·f(c_2, j)·D(c_1,i, c_2,j),

where n_1 and n_2 are the numbers of distinct colors in the two superpixel regions, f(c_k, i) is the probability that the i-th color of the k-th superpixel region appears in that region, and D(c_1,i, c_2,j) is the Euclidean distance between the colors c_1,i and c_2,j;
(19) the edge feature f_edge(r_1, r_2) under the superpixel condition is defined as a function of the superpixel positions and colors controlled by the edge feature parameters θ_α and θ_β; θ_α ranges from 18 to 22, θ_β ranges from 30 to 35, and θ_α and θ_β are preferably 20 and 33, respectively (a code sketch of the features in steps (16)-(19) is given below).
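The following sketch of the superpixel features in steps (16)-(19) assumes scikit-image's SLIC for the over-segmentation. The exact kernel forms of f_dis and f_edge appear only as images in the publication, so Gaussian/bilateral kernels with the stated parameters θ_γ, θ_α, θ_β are assumed here; only the colour distance D_r follows the double sum over per-region colour distributions spelled out in step (18).

```python
import numpy as np
from skimage.segmentation import slic

def superpixel_features(image, n_segments=250, theta_gamma=3.0, theta_alpha=20.0,
                        theta_beta=33.0, bins=8):
    """image: (H, W, 3) float array in [0, 1]. Returns the label map, centroids and the
    pairwise features f_dis, D_r and f_edge of steps (16)-(19)."""
    labels = slic(image, n_segments=n_segments, start_label=0)
    M = labels.max() + 1
    ys, xs = np.indices(labels.shape)

    # (16) centre of gravity of each superpixel as its coordinate
    centroids = np.stack([[xs[labels == i].mean(), ys[labels == i].mean()] for i in range(M)])
    mean_color = np.stack([image[labels == i].mean(axis=0) for i in range(M)])
    d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)

    # (17) physical distance feature: Gaussian kernel of the centroid distance (assumed form)
    f_dis = np.exp(-(d ** 2) / (2.0 * theta_gamma ** 2))

    # (18) colour distance: double sum over the quantised colour distributions of the regions
    def color_hist(i):
        q = np.floor(image[labels == i] * (bins - 1e-6)).astype(int)
        idx = q[:, 0] * bins * bins + q[:, 1] * bins + q[:, 2]
        h = np.bincount(idx, minlength=bins ** 3).astype(float)
        return h / h.sum()
    hists = np.stack([color_hist(i) for i in range(M)])
    centres = (np.indices((bins, bins, bins)).reshape(3, -1).T + 0.5) / bins
    cdist = np.linalg.norm(centres[:, None, :] - centres[None, :, :], axis=-1)
    D_r = hists @ cdist @ hists.T     # D_r[i, j] = sum_ab f_i(a) f_j(b) ||c_a - c_b||

    # (19) edge feature: bilateral (position + colour) kernel with theta_alpha, theta_beta (assumed form)
    dc = np.linalg.norm(mean_color[:, None, :] - mean_color[None, :, :], axis=-1)
    f_edge = np.exp(-(d ** 2) / (2.0 * theta_alpha ** 2) - (dc ** 2) / (2.0 * theta_beta ** 2))

    return labels, centroids, f_dis, D_r, f_edge
```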
Further, the step (3) of correcting the current frame target frame by using the optical flow motion information and the target segmentation result of the previous frame specifically includes:
(31) obtaining effective motion pixels in the current frame image;
(32) counting the proportion of the effective moving pixels in the current frame target frame, and if the proportion is greater than a proportion threshold value, entering a step (33); otherwise, entering a step (34); the value range of the proportional threshold is 0.7-0.9, preferably 0.8;
(33) comparing the circumscribed rectangle of the valid motion pixels inside the current frame target frame, and the current frame target frame itself, with the circumscribed rectangle of the previous frame's target segmentation result; the one with the smaller difference is taken as the correction result; if both differ from the circumscribed rectangle of the previous frame's segmentation result by more than the maximum difference threshold, the circumscribed rectangle of the previous frame's segmentation result is used as the correction result; the maximum difference threshold ranges from 25% to 45% and is preferably 35%; then proceed to step (35);
(34) judging a current frame target frame as a correction result;
(35) the length and width of the correction result are each expanded by the ratio multiple; if the expanded correction result is smaller than the original image, it is returned as the current frame target frame; otherwise the correction result is expanded to the size of the original image and returned as the current frame target frame; the ratio multiple ranges from 1.0 to 1.4 and is preferably 1.2 (a code sketch of steps (31)-(35) is given below).
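The correction of steps (31)-(35) reduces to bounding-box bookkeeping. The sketch below uses integer (x, y, w, h) boxes and measures "difference" as 1 − IoU, which is an assumption; the thresholds default to the preferred values 0.8, 35% and 1.2.

```python
import numpy as np

def box_diff(a, b):
    """Assumed difference measure between two (x, y, w, h) boxes: 1 - IoU."""
    ax2, ay2, bx2, by2 = a[0] + a[2], a[1] + a[3], b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return 1.0 - inter / union if union > 0 else 1.0

def correct_target_box(track_box, motion_mask, prev_seg_box, img_w, img_h,
                       ratio_thresh=0.8, max_diff=0.35, expand=1.2):
    """Sketch of steps (31)-(35); boxes are integer (x, y, w, h) tuples and
    motion_mask is a boolean (H, W) array of valid motion pixels."""
    x, y, w, h = track_box
    inside = motion_mask[y:y + h, x:x + w]
    ratio = inside.mean() if inside.size else 0.0

    if ratio > ratio_thresh:                                  # (32) -> (33)
        ys, xs = np.nonzero(inside)
        flow_box = (x + xs.min(), y + ys.min(), xs.ptp() + 1, ys.ptp() + 1)
        candidates = [flow_box, track_box]
        diffs = [box_diff(c, prev_seg_box) for c in candidates]
        corrected = prev_seg_box if min(diffs) > max_diff else candidates[int(np.argmin(diffs))]
    else:                                                     # (34) keep the tracked box
        corrected = track_box

    # (35) expand length and width by `expand`; fall back to the full image if it overflows
    cx, cy = corrected[0] + corrected[2] / 2.0, corrected[1] + corrected[3] / 2.0
    nw, nh = corrected[2] * expand, corrected[3] * expand
    if nw > img_w or nh > img_h:
        return (0, 0, img_w, img_h)
    nx = int(np.clip(cx - nw / 2.0, 0, img_w - nw))
    ny = int(np.clip(cy - nh / 2.0, 0, img_h - nh))
    return (nx, ny, int(nw), int(nh))
```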
Further, the step (4) of obtaining the target color model of the current frame according to the target segmentation result of the first frame and the previous frame specifically includes:
(41) a Gaussian mixture model is established for the target foreground pixels of the first frame to obtain the color model H_fore-1; Gaussian mixture models are established separately for the target foreground and background of the previous frame to obtain the foreground color model H_fore-2 and the background color model H_back-2;
(42) the target foreground color model H_fore of the current frame is obtained by weighting H_fore-1 into H_fore-2 with a color model weighting coefficient, whose value ranges from 0.2 to 0.4 and is preferably 0.3; the color model of the target background of the current frame is H_back = H_back-2;
(43) from the foreground color model H_fore and the background color model H_back, the probability H_fore(q_k) that a superpixel q_k in the current frame belongs to the foreground and the probability H_back(q_k) that it belongs to the background are further obtained; the normalized foreground probability value H_fore-F(q_k) and background probability value H_back-F(q_k) are:

H_fore-F(q_k) = H_fore(q_k) / (H_fore(q_k) + H_back(q_k)),
H_back-F(q_k) = 1 - H_fore-F(q_k)

(a code sketch of steps (41)-(43) is given below).
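Steps (41)-(43) can be prototyped with scikit-learn Gaussian mixtures. Blending H_fore-1 and H_fore-2 as a convex combination of densities with weight 0.3 is an assumption, since the exact weighting formula appears only as an image in the publication.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm(pixels, k=5):
    """pixels: (N, 3) colour samples."""
    return GaussianMixture(n_components=k, covariance_type='full', random_state=0).fit(pixels)

def density(gmm, pixels):
    """Per-pixel likelihood under the mixture."""
    return np.exp(gmm.score_samples(pixels))

def color_model_probabilities(first_fg, prev_fg, prev_bg, frame_pixels, delta=0.3):
    """Sketch of steps (41)-(43).
    first_fg: foreground pixels of the first frame      -> H_fore-1
    prev_fg : foreground pixels of the previous frame   -> H_fore-2
    prev_bg : background pixels of the previous frame   -> H_back-2
    frame_pixels: (N, 3) pixels of the current frame."""
    H_fore_1, H_fore_2, H_back_2 = fit_gmm(first_fg), fit_gmm(prev_fg), fit_gmm(prev_bg)

    # (42) assumed convex blend of the first-frame and previous-frame foreground densities
    H_fore = delta * density(H_fore_1, frame_pixels) + (1.0 - delta) * density(H_fore_2, frame_pixels)
    H_back = density(H_back_2, frame_pixels)      # H_back = H_back-2

    # (43) normalised foreground / background probabilities
    H_fore_F = H_fore / (H_fore + H_back + 1e-12)
    return H_fore_F, 1.0 - H_fore_F
```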
Further, the calculating in the step (4) to obtain the saliency of the strong target constraint video frame specifically includes:
The closer a region is to the target color model, the stronger its saliency, so the video frame saliency is:

S_v(r_k) = w_s(r_k)·w_o(r_k)·Σ_{r_i≠r_k} exp(-D_s(r_k, r_i)/σ_s²)·w(r_i)·D_r(r_k, r_i),

where D_s(r_k, r_i) is the centre-of-gravity distance between the regions of superpixels r_k and r_i; σ_s is a distance coefficient; w(r_i) is the weight of superpixel r_i for the saliency of superpixel r_k, and the more pixels region r_i contains, the greater its influence; D_r(r_k, r_i) is the color distance between r_k and r_i;

for any superpixel region r_k and target frame b, with dis(r_k, b) denoting a function of the distance between the region and the target frame, the distance weight w_s(r_k) of the superpixel is obtained as:

w_s(r_k) = t_1 if dis(r_k, b) ≤ T_1, t_2 if T_1 < dis(r_k, b) ≤ T_2, and t_3 otherwise,

where t_i, i = 1, 2, 3 are empirical weights between 0 and 1, and T_1 and T_2 are empirical thresholds whose values are determined by the size of the image;

the color model weight w_o(r_k) in S_v(r_k) is an increasing function of H_fore-F(r_k), the foreground probability value of superpixel r_k. A code sketch of the saliency computation is given below.
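The sketch below illustrates the strong-target-constraint saliency of step (4), reusing the superpixel features from step (1). The product combination of w_s, w_o and the spatially weighted colour contrast, the step form of w_s and the choice w_o(r_k) = H_fore-F(r_k) are assumptions consistent with the description rather than the exact formulas of the patent.

```python
import numpy as np

def video_saliency(centroids, D_r, region_size, H_fore_F, target_center, target_scale,
                   sigma_s=0.25, t=(1.0, 0.6, 0.2), T=(0.5, 1.0)):
    """Strong-target-constraint saliency for M superpixels.
    centroids: (M, 2) coordinates normalised to [0, 1]; D_r: (M, M) colour distances;
    region_size: (M,) pixel counts; H_fore_F: (M,) colour-model foreground probabilities;
    target_center, target_scale: centre and scale of the corrected target frame (normalised).
    The weights t, thresholds T and the step form of w_s are assumed empirical values."""
    D_s = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    w = region_size / region_size.sum()           # larger regions have more influence

    # spatially weighted colour contrast (classical region-contrast image saliency)
    contrast = (np.exp(-D_s / sigma_s ** 2) * w[None, :] * D_r).sum(axis=1)

    # piecewise distance weight w_s relative to the target frame
    dist = np.linalg.norm(centroids - np.asarray(target_center), axis=1) / target_scale
    w_s = np.where(dist <= T[0], t[0], np.where(dist <= T[1], t[1], t[2]))

    # colour-model weight w_o: assumed to be the foreground probability itself
    w_o = H_fore_F

    S_v = w_s * w_o * contrast
    return S_v / (S_v.max() + 1e-12)              # normalised to [0, 1]
```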
Further, the step (5) specifically includes:
(51) the region saliency is classified into three classes of labels by a threshold definition and histogram segmentation algorithm:
label(r_k) = 0, if S_v(r_k) < T_basic; 2, if S_v(r_k) > T_his; 1, otherwise;

for each superpixel region r_k a label label(r_k) is set, with the values 0, 1 and 2 denoting the background, unknown and foreground regions respectively; S_v(r_k) is the video frame saliency result; a limiting threshold T_basic is set, and when the saliency of superpixel r_k is less than T_basic, region r_k is a background region; T_his is the saliency mean value, regions greater than T_his are marked as foreground regions, and the remaining regions are marked as unknown regions; T_basic ranges from 0.4 to 0.5 and is preferably 0.45;
(52) calculating a prior probability model of the front/background of the current frame:
Θfore=Θfore-1+ρΘfore-S
Θback=Θback-S
where ρ is a color model weighting coefficient; Θ_fore-S is the foreground Gaussian mixture model constructed from the foreground region pixels; Θ_back-S is the background Gaussian mixture model constructed from the background region pixels; Θ_fore-1 is the Gaussian mixture model of the target foreground of the first frame; ρ ranges from 0.3 to 0.4 and is preferably 0.35;
(53) the normalized foreground probability value Θ_fore-F of the current frame is:

Θ_fore-F = Θ_fore / (Θ_fore + Θ_back)

(a code sketch of step (5) is given below).
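A sketch of step (5): the saliency result is split into background / unknown / foreground labels with T_basic and the saliency mean T_his of the non-background part, and the foreground prior blends the first-frame model with the saliency-derived model using ρ. Applying the weighted sum at the density level is an assumption; fit_gmm and density are the helpers from the colour-model sketch above.

```python
import numpy as np

def label_mask(S_v, T_basic=0.45):
    """Three-class superpixel labels: 0 background, 1 unknown, 2 foreground."""
    labels = np.ones_like(S_v, dtype=int)
    labels[S_v < T_basic] = 0
    non_bg = S_v[S_v >= T_basic]
    T_his = non_bg.mean() if non_bg.size else 1.0   # saliency mean of the non-background part
    labels[S_v > T_his] = 2
    return labels

def frame_prior(sp_saliency_labels, sp_index_map, frame, first_fg_pixels, rho=0.35):
    """Per-pixel normalised foreground prior Theta_fore-F for the current frame.
    sp_index_map: (H, W) superpixel index map; frame: (H, W, 3) image;
    fit_gmm / density are the helpers from the colour-model sketch above."""
    px = frame.reshape(-1, 3)
    lab = sp_saliency_labels[sp_index_map].reshape(-1)   # lift superpixel labels to pixels
    theta_fore_S = fit_gmm(px[lab == 2])                 # foreground-labelled pixels
    theta_back_S = fit_gmm(px[lab == 0])                 # background-labelled pixels
    theta_fore_1 = fit_gmm(first_fg_pixels)              # first-frame target foreground

    # Theta_fore = Theta_fore-1 + rho * Theta_fore-S ; Theta_back = Theta_back-S
    # (the weighted sum is applied at the density level, which is an assumption)
    fore = density(theta_fore_1, px) + rho * density(theta_fore_S, px)
    back = density(theta_back_S, px)
    return (fore / (fore + back + 1e-12)).reshape(frame.shape[:2])
```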
further, the spatial-temporal continuous full-connection conditional random field model based on the superpixel in the step (6) is specifically as follows:
A random variable y_i is defined for each superpixel region r_i to represent its segmentation label, y_i ∈ {0, 1}, where 0 is the background and 1 is the foreground; the superpixel-based spatio-temporally continuous fully connected conditional random field model is established as:

E_super-pixel(y) = Σ_{i∈M} Ψ_i(y_i) + Σ_{i,j∈M} Ψ_ij(y_i, y_j) + Σ_{i∈M, j∈N} Φ_ij(y_i, y_j),
where Ψ_i is the data term, Ψ_ij is the intra-frame smoothing term, Φ_ij is the inter-frame smoothing term, and M and N are the total numbers of superpixels in the current frame and the adjacent frame, respectively;
the foreground probability Θ_fore-F(r_i) of superpixel r_i is:

Θ_fore-F(r_i) = (1/O)·Σ_{o=1..O} Θ_fore-F(q_o),

where Θ_fore-F(q_o) is the foreground probability value of pixel q_o in the current frame and O is the number of pixels within superpixel region r_i;
the data term Ψ_i is:

Ψ_i(y_i) = -log Θ_fore-F(r_i) for y_i = 1, and Ψ_i(y_i) = -log(1 - Θ_fore-F(r_i)) for y_i = 0;
the intra-frame smoothing term Ψ_ij is defined as:

Ψ_ij(r_i, r_j) = w_1·f_dis(r_i, r_j) + w_2·D_r(r_i, r_j) + w_3·f_edge(r_i, r_j), i, j ∈ M,

where f_dis(r_i, r_j) is the physical distance feature between superpixels r_i and r_j; D_r(r_i, r_j) is the color distance feature between them; f_edge(r_i, r_j) is the edge feature between them; w_1, w_2 and w_3 are the proportions of f_dis, D_r and f_edge in the intra-frame smoothing term; w_1, w_2 and w_3 range from 5 to 6, 9 to 10 and 1 to 2 respectively, and are preferably 6, 10 and 2;
the inter-frame smoothing term Φ_ij compares the label y_i of superpixel r_i in the current frame with the label y'_j of region r_j in the adjacent frame, weighted by a temporal connection kernel of the distance between the two regions, where y'_j is the label of the adjacent-frame region r_j and α is the temporal connection distance characteristic parameter; α ranges from 3 to 5 and is preferably 4.0. A code sketch of these energy terms is given below.
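For a modest number of superpixels, the three energy terms of step (6) can be assembled as dense matrices, as sketched below. The negative-log data term, the use of Ψ_ij as the strength of a label-disagreement (Potts) penalty, and the exponential temporal kernel of Φ_ij are assumptions consistent with the definitions above; the patent itself solves the model with fast high-dimensional Gaussian filtering.

```python
import numpy as np

def crf_terms(theta_fore_sp, f_dis, D_r, f_edge, prev_labels, temporal_dist,
              w1=6.0, w2=10.0, w3=2.0, alpha=4.0):
    """theta_fore_sp: (M,) superpixel foreground probabilities Theta_fore-F(r_i);
    f_dis, D_r, f_edge: (M, M) intra-frame pairwise features;
    prev_labels: (N,) labels y'_j of the previous frame's superpixels;
    temporal_dist: (M, N) centroid distances between current and previous-frame superpixels."""
    eps = 1e-12

    # data term Psi_i(y_i): assumed negative log-likelihood of the superpixel prior
    unary = np.stack([-np.log(1.0 - theta_fore_sp + eps),    # cost of y_i = 0
                      -np.log(theta_fore_sp + eps)], axis=1) # cost of y_i = 1

    # intra-frame smoothing term Psi_ij = w1*f_dis + w2*D_r + w3*f_edge, used here as the
    # strength of a label-disagreement (Potts) penalty between superpixels i and j
    pairwise = w1 * f_dis + w2 * D_r + w3 * f_edge

    # inter-frame smoothing term Phi_ij: assumed disagreement cost with the previous frame's
    # labels, weighted by an exponential temporal connection kernel controlled by alpha
    temporal = np.exp(-(temporal_dist ** 2) / (alpha ** 2))  # (M, N)
    inter = np.stack([temporal[:, prev_labels == 1].sum(axis=1),   # cost of choosing y_i = 0
                      temporal[:, prev_labels == 0].sum(axis=1)], axis=1)
    return unary, pairwise, inter
```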
Generally, compared with the prior art, the technical scheme of the invention has the following technical characteristics and beneficial effects:
(1) the method introduces strong target constraint on the basis of the image significance, and obtains the significance result of the video level, namely the strong target constraint video significance provided by the invention is used for enabling the target segmentation process to be efficient and accurate in video segmentation;
(2) in the single-frame segmentation stage, unlike traditional video segmentation methods that use a graph-cut model for single-frame segmentation, the method adopts and improves the fully connected conditional random field image segmentation algorithm: superpixels replace pixels as the basic modeling unit, inter-frame connections are added, and a superpixel-based spatio-temporally continuous fully connected conditional random field video segmentation model is constructed, which effectively improves the accuracy and time efficiency of video segmentation.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is the first frame of the video with the initial tracking frame of the target marked manually according to an embodiment of the present invention;
FIG. 3 is an intermediate result of the target scale and location information correction process in an embodiment of the present invention;
FIG. 4 is a saliency map computed using a video saliency method in an embodiment of the present invention;
fig. 5 shows a segmentation result calculated by using a video segmentation method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The invention mainly comprises two aspects, which are respectively as follows:
firstly, extracting the saliency of a strong target constraint video;
video saliency evolved from the concept of image saliency, and it is generally believed that image saliency characterizes the most salient regions in the current image, i.e., the regions most likely to be labeled as foreground, and that effective saliency results will greatly reduce human resource consumption in computer vision problems. Meanwhile, in an image, a salient region generally satisfies two assumptions: the color difference between the salient region and any other region in the image is large; the salient region is closer to the center of the image than other regions. However, the features present in video are richer than images, and the present invention extracts saliency results in the video context by adding strong target constraints to the assumption of image saliency. First, the present invention proposes:
firstly, obtaining the position and scale information of a target by a multi-scale tracking algorithm and optical flow correction;
then, estimating a color model of the target through a front/background Gaussian mixture model of the historical frame segmentation result;
and finally, blending the color model and the corrected target position and scale information into the image significance in a strong target constraint mode, and calculating to obtain a strong target constraint video significance result.
1. Optical flow motion information calculation:
firstly, an optical flow algorithm is used among video frame sequences to obtain an optical flow field in each frame of image. While the motion region in the optical flow field results has two distinct features: on one hand, the motion states of all pixels in the motion area are consistent and have obvious outlines, and the motion states of the pixels in the non-motion area are disordered and have no obvious object area outlines; on the other hand, the motion direction of the edge pixel of the motion area is greatly different from the motion direction of the pixel of the non-motion area of the neighborhood. Therefore, the present invention requires preprocessing of the optical flow field according to the above features to obtain a more efficient optical flow area.
First, the gradient of the optical flow field of the video frame is calculated. Inside a motion region the gradient values are close to 0 because the motion states of the pixels are consistent; outside the motion region the pixel motion has no significant feature and the gradient values are chaotic and small; at the edge of the motion region the gradient values are larger because of the large motion difference between inner and outer pixels, so the motion edge image of the video frame is obtained after normalizing this result. In the present invention, the gradient value at pixel q is defined as ∇v_q, where v_q is the optical flow velocity of the pixel, and the optical flow gradient strength at pixel q can then be expressed as:

b_q^m = 1 - exp(-λ_m·‖∇v_q‖),

where λ_m is an intensity parameter.
According to the gradient strength result b_q^m of the video frame, most contour pixels of the motion region can be distinguished from the global non-contour pixels. In particular, pixels whose gradient strength value is greater than some decision threshold T_m move significantly enough that they can immediately be judged as contour pixels of the motion region; for pixels whose gradient strength value is less than T_m, their attribution must be determined further, using the fact that the motion direction of motion-region contour pixels differs from that of the surrounding pixels. Therefore, the invention also computes the maximum difference in motion direction (angle) between each pixel and its neighborhood pixels:
b_q^θ = 1 - exp(-λ_θ·max_{q'∈Q_q} δθ(q, q')),

where δθ(q, q') is the angular difference between pixel q and pixel q', Q_q is the neighborhood of pixel q, and λ_θ is an angle difference parameter.
By combining the gradient strength values of the optical flow field with the differences in motion direction between pixels, the following pixel velocity difference result in the video frame is obtained:

b_q = 1 if b_q^m ≥ T_m, and b_q = b_q^θ otherwise.
since the threshold value T is determined under different scenesmThe difference exists, therefore, the invention uses the histogram iteration optimal threshold method to self-adaptively calculate and obtain the decision threshold value Tm. For bqPixels greater than 0.5, we consider them to be motion region contour pixels.
From the above operations, a rough contour map of the motion region in a video frame can be obtained. However, in video scenes where distant and close views meet, the noise interference in the coarse contour cannot be eliminated with morphological processing, so the motion information of the video frame must be extracted further on the basis of the rough contour map. It is observed that noise generally does not have a distinct contour, so the noise effect can be removed according to the contour characteristics. Specifically, taking any pixel in the video frame as the origin, the method emits a ray every 45 degrees clockwise starting from the 12 o'clock direction, giving 8 rays, and counts the number of times each ray intersects the contour pixels. Clearly, when the pixel lies inside the contour of a closed region, each ray intersects the contour an odd number of times; when the pixel lies outside the contour of a closed region, each ray intersects the contour an even number of times. If more than 4 of a pixel's rays intersect the contour an odd number of times, the pixel is judged to be inside the contour, i.e., a valid motion pixel; otherwise it is judged to be outside the contour, i.e., a noise pixel. In particular, an integral image algorithm can be used to obtain the judgement quickly. Through this process the optical flow motion information in the video frame is obtained.
2. Calculating the target position and scale information:
the target position and scale information cannot be accurately obtained by simply using a target tracking algorithm or an optical flow algorithm. Such as: when the target area moves slowly, the optical flow field is in a disordered state, and the position and the scale of the foreground target can be effectively positioned by a tracking algorithm; when the target area moves violently, the optical flow field presents a state of motion consistency in the moving area, and the tracking algorithm is easy to lose or bias the foreground target. Therefore, in order to obtain Accurate target position and scale information, the invention firstly uses a multi-scale target Tracking algorithm DSST (Accurate scale estimation for Robust Visual Tracking) based on kernel-correlation filtering to obtain preliminary foreground target position and scale information, and then verifies and necessarily corrects the Tracking result by using the optical flow motion information and the target segmentation result of the previous frame. The specific calculation process is as follows:
(1) and verifying whether the optical flow field is effective, wherein the optical flow field effectiveness judgment criterion is whether the proportion of effective motion pixels positioned in the video tracking target frame is greater than a given threshold value, and the threshold value is usually set to be 0.8. If the current frame optical flow result is invalid, abandoning the current frame optical flow result and entering into the step (4), and if the current frame optical flow result is valid, entering into the step (2);
(2) calculating optical flow motion information and a contour thereof;
(3) comparing the tracking result with the optical flow motion contour; if the difference is small, go to correction step (4); otherwise, take the video tracking result as the correction result;
(4) respectively comparing the tracking result and the optical flow motion contour with the previous frame segmentation result, judging the one with small difference as a correction result, and if the two are greatly different from the previous frame segmentation result, using the previous frame segmentation result as the correction result;
(5) and enlarging the correction result obtained by the operation by a certain proportion and returning the enlarged correction result as an effective correction result. Through the operation, accurate target position and scale information can be obtained through calculation;
(6) and updating tracking model parameters.
Accurate foreground target position and scale information not only provides accurate foreground target feature information for video frame segmentation but also greatly compresses the region to be segmented: most of the background region is removed after the correction operation, and the reduction of redundant information further improves the accuracy of the segmentation algorithm. In the video frame target segmentation stage, building the segmentation graph model only for the region to be segmented significantly improves the running efficiency of the algorithm.
3. Target color model estimation:
besides the position and scale information of the target, the invention also estimates the color model of the target according to the existing target segmentation result. In a video scene, the color model of a foreground object to be segmented usually has only slight change in a shot, and the color model is basically consistent as a whole. In the video input stage, the video segmentation algorithm provided by the invention performs manual interactive segmentation on the first frame of video to obtain an accurate target segmentation result, and the color model of the foreground target of the first frame represents the color model of the target in a complete shot to a great extent. At the same time, we note that the color model of the current frame object is closest to the color model of the object in the previous frame video. Therefore, the color model of the foreground object in the first frame is used as the basic color model, and is fused into the color model in the previous frame in a weighting mode, so that the color model of the current frame object can be estimated.
A Gaussian mixture model established on the foreground target of the first frame gives the color model H_fore-1, and Gaussian mixture models established separately on the foreground target and the background of the previous frame give the color models H_fore-2 and H_back-2. By weighting H_fore-1 and H_fore-2, the foreground target color model H_fore of the current frame can be estimated, with the color model weighting coefficient generally set to 0.3; and the background color model of the current frame is taken directly from the previous frame:

H_back = H_back-2.

Meanwhile, since the scenes of two video frames within the same shot are similar, the invention uses the background color model H_back-2 of the previous frame to estimate the background color model H_back of the current frame.
From the estimated foreground color model H_fore and background color model H_back, the probability H_fore(q_k) that each pixel q_k in the video frame belongs to the foreground and the probability H_back(q_k) that it belongs to the background can be further obtained. Since the probabilities of a pixel belonging to the foreground and to the background may be quite close, the foreground and background color models are combined to normalize the probability of the pixel belonging to the foreground or the background, in order to better determine the label to which the pixel belongs:

H_fore-F(q_k) = H_fore(q_k) / (H_fore(q_k) + H_back(q_k)),
H_back-F(q_k) = 1 - H_fore-F(q_k).

The pixel foreground probability value H_fore-F(q_k) and background probability value H_back-F(q_k) obtained above effectively represent the target color model constraint of the video frame.
4. Image saliency rests on two assumptions: in an image, the greater the color difference between a region and the other regions, the stronger the saliency of that region; and the closer a region is to the center of the image, the more salient it is. From these assumptions the conventional image saliency can be obtained, and any region r_k has saliency S_I(r_k):

S_I(r_k) = w_s(r_k)·Σ_{r_i≠r_k} exp(-D_s(r_k, r_i)/σ_s²)·w(r_i)·D_r(r_k, r_i),

where w_s(r_k) indicates the distance of region r_k from the center of the image, D_s(r_k, r_i) is the centre-of-gravity distance between region r_k and another region r_i in the image, σ_s is a distance coefficient, and w(r_i) is the weight of region r_i for the saliency of region r_k, with regions containing more pixels having a larger influence. Finally, the color distance D_r(r_k, r_i) between region r_k and region r_i is computed as:

D_r(r_1, r_2) = Σ_{i=1..n_1} Σ_{j=1..n_2} f(c_1, i)·f(c_2, j)·D(c_1,i, c_2,j),

which defines the inter-region color distance D_r(r_1, r_2) under the saliency concept, where r_1 and r_2 are two superpixel regions, n_1 and n_2 are the numbers of color classes of the two regions, f(c_k, i) is the probability that the i-th color of the k-th region appears in that region, and D(c_1,i, c_2,j) is the Euclidean distance of the pixel colors.
In fact, richer feature information can be extracted in video than in images. Since a video has features relating consecutive frames and the target segmentation result is extracted frame by frame by the video segmentation algorithm, the already-segmented target information can provide more help for computing image/video saliency. As described above, by correcting the target tracking result with the optical flow motion information and the previous frame's target segmentation result, accurate position and scale information of the target can be acquired; meanwhile, the color model constraint of the current frame target can be calculated from the existing segmentation results. All of this information characterizes the target to some extent.
Therefore, on the basis of the traditional image saliency, the assumption that the closer to the center of the image, the stronger the saliency is extended: in video, the closer a region is to the center of the target, the stronger its saliency. Meanwhile, the target color model is further introduced, i.e., it is assumed that the closer a region is to the target color model, the stronger its saliency, giving the video saliency S_v(r_k):

S_v(r_k) = w_s(r_k)·w_o(r_k)·Σ_{r_i≠r_k} exp(-D_s(r_k, r_i)/σ_s²)·w(r_i)·D_r(r_k, r_i).

Here, for an arbitrary region r_k and target frame b, with t_i, i = 1, 2, 3 the empirical weights, T_1 and T_2 empirical thresholds, and dis(r_k, b) a function of the distance between the region and the target frame, the distance weight w_s(r_k) of the region can be obtained as:

w_s(r_k) = t_1 if dis(r_k, b) ≤ T_1, t_2 if T_1 < dis(r_k, b) ≤ T_2, and t_3 otherwise.
The piecewise function is used here to significantly increase the saliency value inside the target scale and effectively decrease it outside the target scale; other reasonable nonlinear functions could be used instead.
Further, the color model weight w_o(r_k) in S_v(r_k) can be obtained as an increasing function of H_fore-F(r_k), the foreground probability value of region r_k calculated from the estimated target color model.
And secondly, performing space-time continuous full-connection conditional random field video segmentation algorithm based on the superpixels.
1. Constructing a target color model based on the significance of the strong target constraint video:
the traditional video segmentation method based on object proposal segments a video frame into a plurality of object candidate regions through over-segmentation operation, finds out the region with the maximum object probability in the video frame from the candidate regions according to the corresponding object probability calculation criteria, and then constructs a color model of a foreground object by using the region. In general, the candidate region with the highest probability of an object in the algorithm can cover most of the foreground object. However, the super-pixel region obtained by using the over-segmentation method is difficult to completely cover the foreground object and even the segmentation error may occur, so that the video segmentation method based on the object proposal is difficult to construct an accurate object color model even if the most suitable object candidate region can be always selected.
From the experimental results of the strong target constraint video saliency algorithm presented above, it can be found that pixels within the foreground target region consistently exhibit high saliency, while pixels outside the target region have very low saliency or even almost 0. Meanwhile, by observing various video scenes, the appearance of the foreground object to be segmented in the video is integrally consistent under the whole lens, namely the foreground color models between any two frames are similar.
Inspired by the above observations, the invention proposes a method for constructing the target color model based on strong-target-constraint video saliency. The method first divides the saliency result into three classes of labels through a threshold limiting and histogram segmentation algorithm:
label(r_k) = 0, if S_v(r_k) < T_basic; 2, if S_v(r_k) > T_his; 1, otherwise.

For each superpixel region r_k a label label(r_k) can be set, with the values 0, 1 and 2 representing the background, unknown and foreground regions respectively. At the same time, the algorithm sets a limiting threshold T_basic: when the saliency of region r_k is less than T_basic, the region is considered a background region; in the present invention T_basic is taken as 0.45. Further, the method counts the saliency values of the non-background regions to obtain a histogram and takes the saliency mean value in the histogram as the segmentation threshold T_his, dividing the non-background regions into two classes: regions greater than T_his are marked as foreground regions and the rest are marked as unknown regions.
According to the region labels of the video frame, the pixels of the foreground-labelled regions and the pixels of the background-labelled regions are used respectively to construct the foreground Gaussian mixture model Θ_fore-S and the background Gaussian mixture model Θ_back-S. Above, the invention proposed taking the color model of the target in the first frame as the basic model and blending in the color model of the target in the previous frame by weighting to estimate the target color model of the current frame. However, introducing the color model of the previously segmented target may accumulate errors. Therefore, at this stage the method instead combines the strong-target-constraint video saliency result with the first-frame target color model by weighting, to construct a more accurate current-frame target color model:
Θfore=Θfore-1+ρΘfore-S
Θback=Θback-S
where ρ is a color model weighting coefficient, usually set to 0.35. Finally, the normalized target foreground probability value Θ_fore-F is:

Θ_fore-F = Θ_fore / (Θ_fore + Θ_back).
2. a space-time continuous super-pixel video segmentation method based on strong target constraint video saliency comprises the following steps:
the full-connection conditional random field segmentation method is an image segmentation method based on the full-connection conditional random field. Because the method is fast in solving and excellent in segmentation effect, the space-time continuous full-connection conditional random field video segmentation method is designed by increasing inter-frame connection on the basis of the method. Meanwhile, in the image segmentation problem, pixels are generally used as basic operation units, and the pixels can sufficiently express information in image data, thereby ensuring high accuracy of segmentation results. However, in a video scene with tens of frames of images per second, still using pixels as the basic segmentation unit to calculate the video segmentation result consumes a lot of system resources and may result in poor video segmentation efficiency. In fact, in most cases, some similar and similar pixels express the same information, and if these pixels can be collectively treated as the same pixel in the preprocessing stage, the execution efficiency of the video segmentation algorithm can be certainly greatly improved. Therefore, the present invention proposes to use superpixels instead of pixels as the basic unit of operation for the segmentation algorithm. Under the condition of super-pixel, each basic operation unit is not simply represented by pixel information any more, and the mutual relation among the basic units is changed correspondingly, so that the segmentation model represented by the super-pixel basic unit needs to be redefined.
M superpixels are obtained with the SLIC over-segmentation method, with M usually set to 250; each superpixel r_i represents a region with i ∈ M, the number of pixels in region r_i is O, and the center of gravity of region r_i is defined as the superpixel coordinate:

x_{r_i} = (1/O)·Σ_{o=1..O} x_{r_i}^o,  y_{r_i} = (1/O)·Σ_{o=1..O} y_{r_i}^o,

where x_{r_i}^o and y_{r_i}^o are the abscissa and ordinate of the o-th pixel of superpixel r_i, and x_{r_i} and y_{r_i} are the coordinates of the superpixel.
From the definition of the superpixel coordinates, the physical distance feature f_dis(r_1, r_2) between two superpixels can be derived as a decreasing function of the distance between their centroid coordinates, where the physical distance characteristic parameter θ_γ is typically set to 3.
Using the inter-region color distance defined above, the color distance feature D_r(r_1, r_2) between two superpixels can be obtained. Finally, the edge feature f_edge(r_1, r_2) under the superpixel condition is defined as a function of the superpixel positions and colors controlled by the edge feature parameters θ_α and θ_β, which in the present invention are set to 20 and 33 respectively. With the above definitions, the superpixels and their feature representations in the segmentation model are determined.
First, a random variable y_i is defined for each superpixel region to indicate its segmentation label, where 0 is the background and 1 is the foreground. Based on the image segmentation energy function of the fully connected conditional random field, the energy function of the fully connected conditional random field for video segmentation is defined as:

E_super-pixel(y) = Σ_{i∈M} Ψ_i(y_i) + Σ_{i,j∈M} Ψ_ij(y_i, y_j) + Σ_{i∈M, j∈N} Φ_ij(y_i, y_j),

where Ψ_i is the data term, Ψ_ij the intra-frame smoothing term, Φ_ij the inter-frame smoothing term, and M and N the total numbers of superpixels in the current frame and the adjacent frame, respectively. Compared with the energy function of the conventional fully connected conditional random field image segmentation method, the energy function E_super-pixel(y) adds the inter-frame smoothing term Φ_ij. In fact, solving the energy function optimization with only the data term and the intra-frame smoothing term already yields a relatively accurate single-frame video segmentation result. However, in a video scene the foreground objects of adjacent frames generally do not deform greatly, so adding the connection relation in the time dimension prevents jitter, noise interference and the like generated in the video segmentation process and yields a spatio-temporally continuous video segmentation result.
From the normalized target foreground probability value Θ_fore-F(q_o) of each pixel q_o within superpixel region r_i, the foreground probability Θ_fore-F(r_i) of the superpixel region can be calculated as:

Θ_fore-F(r_i) = (1/O)·Σ_{o=1..O} Θ_fore-F(q_o).
Further, the data term Ψ_i can be defined as:

Ψ_i(y_i) = -log Θ_fore-F(r_i) for y_i = 1, and Ψ_i(y_i) = -log(1 - Θ_fore-F(r_i)) for y_i = 0.
In a segmentation scene, the closer the relative positions of two regions, the higher the probability that they belong to the same label, and the more similar their colors, the higher the probability that their labels are the same; meanwhile, labels along the same region edge also tend to be consistent with greater probability. Therefore, considering the physical distance feature f_dis(r_i, r_j), the color distance feature D_r(r_i, r_j) and the edge feature f_edge(r_i, r_j), the intra-frame smoothing term Ψ_ij is defined as:

Ψ_ij(r_i, r_j) = w_1·f_dis(r_i, r_j) + w_2·D_r(r_i, r_j) + w_3·f_edge(r_i, r_j), i, j ∈ M,

where w_1, w_2 and w_3 are the proportions of the different features in the intra-frame smoothing term, typically set to 6, 10 and 2, respectively.
In addition, the inter-frame smoothing term is used to improve the smoothness of the segmentation result and ensure that the output does not jitter. The inter-frame smoothing term Φ_ij compares the label y_i of superpixel r_i in the current frame with the label y'_j of region r_j in the adjacent frame, weighted by a temporal connection kernel of the distance between the two regions, where α is the temporal connection distance characteristic parameter, set to 4.0 in the invention, and y'_j is the label of the adjacent-frame region r_j.
In the video segmentation process, the particularity of the video scene and the contingency of actual program operation may cause a large deviation in the segmentation result. In that case, a video segmentation algorithm built on whole-video frames not only produces a low-quality segmentation result but also cannot return to a stable state through external intervention. The superpixel-based spatio-temporally continuous random field video segmentation method proposed by the invention uses a frame-by-frame target segmentation framework, which effectively avoids these problems. Therefore, when segmenting the target of each frame, only the inter-frame smoothing term between the current frame and the previous frame is considered in the energy function. Meanwhile, since the target segmentation result of the previous frame is known, the known labels y'_j of the previous frame are used to calculate the inter-frame smoothing term Φ_ij when the energy function is solved. Finally, the superpixel-based spatio-temporally continuous fully connected conditional random field video frame segmentation result can be inferred using fast high-dimensional Gaussian filtering. A naive mean-field sketch of this inference, without the fast filtering, is given below.
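For completeness, a naive mean-field solver over the superpixel model is sketched below. It iterates directly on the dense matrices produced by the crf_terms sketch given earlier and therefore scales as O(M²) per iteration; the fast high-dimensional Gaussian filtering used by the invention is what removes this cost.

```python
import numpy as np

def mean_field_segment(unary, pairwise, inter, n_iter=10):
    """Naive mean-field inference over M superpixels and 2 labels (0 = background, 1 = foreground).
    unary: (M, 2) data term; pairwise: (M, M) Potts weights; inter: (M, 2) inter-frame costs."""
    pairwise = pairwise.copy()
    np.fill_diagonal(pairwise, 0.0)
    q = np.exp(-unary)
    q /= q.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Potts message: the cost of a label is the pairwise-weighted belief in the other label
        msg = pairwise @ q[:, ::-1]
        q = np.exp(-(unary + inter + msg))
        q /= q.sum(axis=1, keepdims=True)
    return q.argmax(axis=1)
```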
As shown in fig. 1, a video segmentation process of a video bear according to an embodiment of the present invention includes the following steps:
(1) the optical flow motion information and the super-pixel pre-segmentation result of the video frame are calculated in a pre-processing mode, because the optical flow motion information and the super-pixel result in the video frame need to be obtained in the implementation process of the method and the calculation of the optical flow motion information and the super-pixel result can be completed before all processing, the optical flow motion information and the super-pixel result of the video bear are calculated in the pre-processing stage;
(2) when the first frame of the bear video is read, because the invention does not know what the target to be segmented is, an interactive interface needs to be provided, and the target to be segmented is given artificially, as shown in fig. 2, the target is the first frame of the bear video, and the rectangular frame in the picture is marked by the person. After the position and size information of the rectangular frame is obtained through man-machine interaction, the information is displayed in the image. After an initial target frame is obtained, a target foreground is calibrated through the target frame and segmentation is completed to obtain a target segmentation result; initializing a multi-scale tracking model by using the position and size information of the target frame on the basis of the target segmentation result;
(3) tracking, correcting and updating the tracking model parameters for the current frame target: from the target frame of the previous frame, a rough target frame in the current frame can be obtained with the multi-scale tracking model; however, even the best video tracking algorithm cannot fully capture a target that changes drastically, so the invention proposes to compute video motion information with an optical flow algorithm to help correct the tracking result:
first, pixel velocity intensity, velocity gradient, and velocity intensity gradient in the light flow map are calculated. Determining stable motion pixels in the image according to the calculation result; counting the proportion of the stable motion pixels in the approximate target frame, if the proportion is smaller than an empirical threshold, judging that no stable motion area exists in the approximate target frame, and at the moment, the target tends to have no change, namely the video tracking operation obtains quite accurate target scale and position information; if the ratio is greater than the empirical threshold, it is determined that a stable motion area exists in the approximate target frame, and the optical flow may be used to correct the target frame to find a bounding rectangle of the stable motion pixels in the current target frame. Respectively comparing the tracking target frame and the optical flow motion frame with a circumscribed rectangle frame of a previous frame segmentation result, and selecting a proper window as a target frame according to the given criteria;
finally, amplifying the target frame in a certain proportion, and identifying the result as target scale and position information; as shown in fig. 3, the results of each stage in the foreground object dimension and position calculation step in the bear video are shown. In the figure, a black window is a video tracking result, a red window is an effective optical flow area, and a blue window is a target frame for segmentation after correction;
(4) extracting the significance of a strong target constraint video, and establishing a new graph structure of a current frame image by taking a super pixel as a basic unit according to a super pixel result; in the new graph structure, each node is represented by a super pixel, the position of the node is the geometric center of the super pixel, and the color distance between the node and other nodes can be obtained by accumulating the difference value of each pixel in two super pixel areas; meanwhile, according to the target scale, the current frame is divided into 3 saliency areas with different scales, and the closer the super-pixel distance to the center of the target, the larger the acquired saliency weight is; finally, adding a target Gaussian mixture model estimated from the segmentation result of the first frame and the last frame, and solving a final significance result; FIG. 4 shows the result of the saliency calculation for a video bear in a second frame;
(5) estimating a rough label mask image and a front/background prior probability model, dividing a video significance result into three types of labels (a background area, an unknown area and a foreground area) by using a threshold value limiting and histogram segmentation algorithm, estimating the rough label mask image of a current frame, and further respectively constructing a foreground Gaussian mixture model and a background Gaussian mixture model by using pixels of the foreground label area and pixels of the background label area; in the video segmentation problem, the color change part is usually the background, and the foreground color change amplitude is small, so the method can more accurately calculate the prior probability model of the front/background of the current frame by adopting a weighting mode of a front/background Gaussian mixture model in combination with the target segmentation result of the first frame;
(6) and constructing a spatial-temporal continuous full-connection condition random field model based on the superpixel, rapidly optimizing and solving, and establishing a spatial-temporal continuous full-connection condition random field graph structure based on the superpixel according to the video segmentation model. Then, the graph structure is solved by using a fast high-dimensional Gaussian filter algorithm, and the segmentation result of the current frame is obtained. The segmentation algorithm is used in the sample video bear, and the obtained result is shown in fig. 5;
(7) The segmentation result and the target frame of the current frame are kept, the next video frame and its optical flow motion information are read, and steps (3) to (7) are repeated until the video segmentation is finished.
It will be appreciated by those skilled in the art that the foregoing is only a preferred embodiment of the invention, and is not intended to limit the invention, such that various modifications, equivalents and improvements may be made without departing from the spirit and scope of the invention.

Claims (7)

1. A video segmentation method based on strong target constraint video saliency is characterized by comprising the following steps:
(1) calculating the optical flow motion information and the super-pixel segmentation result of the video frame, wherein calculating the super-pixel segmentation result specifically comprises:
(16) segmenting the video frame by the SLIC over-segmentation method to obtain M super pixels, wherein r_i denotes a super pixel, i ∈ M, the number of pixels in the region r_i is O, and the center of gravity of the region r_i represents the super pixel coordinate, obtained as:

X_{r_i} = (1/O) · Σ_{o=1}^{O} x_o^{r_i},    Y_{r_i} = (1/O) · Σ_{o=1}^{O} y_o^{r_i}

wherein x_o^{r_i} and y_o^{r_i} respectively represent the abscissa and the ordinate of the o-th pixel within the region of super pixel r_i, and X_{r_i} and Y_{r_i} represent the horizontal and vertical coordinates of the super pixel;
(17) the physical distance between super pixels r_1 and r_2 is characterized by the feature f_dis(r_1, r_2) [formula given as an image in the original publication], wherein θ_γ represents the physical distance characteristic parameter;
(18) the color distance between super pixels r_1 and r_2 is characterized by:

D_r(r_1, r_2) = Σ_{i=1}^{n_1} Σ_{j=1}^{n_2} f(c_{1,i}) · f(c_{2,j}) · D(c_{1,i}, c_{2,j})

wherein n_1 and n_2 respectively represent the number of color types in the two super pixel regions; f(c_{k,i}) represents the probability that the color of the i-th pixel in the k-th super pixel region appears in that region; and D(c_{1,i}, c_{2,j}) represents the Euclidean color distance between pixels c_{1,i} and c_{2,j};
(19) the edge feature under the super pixel condition is f_edge(r_1, r_2) [formula given as an image in the original publication], wherein θ_α and θ_β are the edge feature parameters;
(2) performing interactive segmentation on a first frame picture in a video frame sequence to obtain a target frame; calibrating a target foreground through a target frame and completing segmentation to obtain a target segmentation result; initializing a multi-scale tracking model by using the position and size information of the target frame on the basis of the target segmentation result;
(3) reading a next frame of image, obtaining a target frame of a current frame by using a multi-scale tracking model, correcting the target frame of the current frame by using current frame optical flow motion information and a target segmentation result of a previous frame, and updating the multi-scale tracking model by using the target position and scale information of the corrected target frame;
(4) acquiring a target color model of the current frame according to the target segmentation results of the first frame and the previous frame; integrating the target color model and the target position and scale information into the image saliency as strong target constraints, and calculating the strong-target-constrained video frame saliency;
(5) performing threshold limiting and histogram segmentation operations on the video frame saliency result to obtain a rough label mask of the current frame, and calculating a foreground/background prior probability model of the current frame in combination with the target segmentation result of the first frame;
(6) establishing a superpixel-based spatio-temporally continuous fully-connected conditional random field model, defining the data term with the foreground/background prior probability model of the current frame, and defining the intra-frame smoothing term and the inter-frame smoothing term by combining the color distances, spatial distances and edge relations among the superpixels; performing the optimization solution with a fast high-dimensional Gaussian filtering algorithm to obtain the target segmentation result of the current frame;
(7) repeating the steps (3) to (6) until the video segmentation is finished.
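As a companion to the superpixel features of items (16)–(19) above, the Python sketch below computes SLIC superpixels, their centroids, and a Gaussian-falloff physical-distance feature. Since the claim's distance and edge formulas are published only as images, the exponential kernel and the parameter `theta_gamma` are assumptions, and the color distance is simplified to a mean-color Euclidean distance rather than the per-pixel accumulation of item (18).

```python
import numpy as np
from skimage.segmentation import slic

def superpixel_features(image, n_segments=300, theta_gamma=30.0):
    """Centroids and a pairwise physical-distance feature for SLIC superpixels.

    Returns (labels, centers, f_dis) where f_dis[i, j] = exp(-d_ij^2 / theta_gamma^2)
    is an assumed Gaussian falloff, not the patent's exact kernel.
    """
    labels = slic(image, n_segments=n_segments, compactness=10)
    ids = np.unique(labels)
    # Center of gravity of each superpixel region, in (row, col) order.
    centers = np.array([np.argwhere(labels == i).mean(axis=0) for i in ids])
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    f_dis = np.exp(-(d ** 2) / (theta_gamma ** 2))
    return labels, centers, f_dis

def color_distance(image_lab, labels, i, j):
    """Mean-color Euclidean distance between superpixels i and j (a simplification
    of the per-pixel accumulation in item (18))."""
    ci = image_lab[labels == i].mean(axis=0)
    cj = image_lab[labels == j].mean(axis=0)
    return float(np.linalg.norm(ci - cj))
```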
2. The method for video segmentation based on strong target constraint video saliency as claimed in claim 1, wherein the calculation process of the optical flow motion information in step (1) includes:
(11) computing the optical flow gradient strength at pixel q in the video frame [formula given as an image in the original publication], wherein λ_m represents an intensity parameter and ∇f_q denotes the gradient value at pixel q;
(12) calculating, for each pixel in the video frame, the maximum difference in motion direction between the pixel and its neighboring pixels [formula given as an image in the original publication], wherein δθ(q, q′) represents the angular difference between pixel q and pixel q′, Q_q is the neighborhood of pixel q, and λ_θ represents the angle difference parameter;
(13) computing the pixel velocity difference b_q in the video frame [formula given as an image in the original publication], wherein T_m indicates a decision threshold;
(14) if b_q is larger than the difference threshold, pixel q is judged to be a contour pixel of the motion region;
(15) taking any pixel in the video frame as a reference, emitting one ray clockwise every 45 degrees starting from the 12 o'clock direction to obtain 8 rays, and counting the number of intersections of each ray with the contour pixels; if more than 4 rays intersect the contour pixels an odd number of times, the pixel is judged to be an effective motion pixel; otherwise, the pixel is judged to be a noise pixel.
3. The method according to claim 1, wherein the correcting the current frame target frame in step (3) using the optical flow motion information and the target segmentation result of the previous frame specifically comprises:
(31) obtaining effective motion pixels in the current frame image;
(32) counting the proportion of the effective moving pixels in the current frame target frame, and if the proportion is greater than a proportion threshold value, entering a step (33); otherwise, entering a step (34);
(33) comparing the bounding rectangle of the effective motion pixels inside the current frame target frame, and the current frame target frame itself, respectively, with the bounding rectangle of the previous frame's target segmentation result, and taking the one with the smaller difference as the correction result; if both differences from the bounding rectangle of the previous frame's segmentation result exceed the maximum difference threshold, using the bounding rectangle of the previous frame's segmentation result as the correction result; then proceeding to step (35);
(34) taking the current frame target frame as the correction result;
(35) expanding the length and the width of the correction result by a fixed ratio; if the expanded correction result is smaller than the original image, returning it as the current frame target frame; otherwise, limiting the correction result to the size of the original image and returning it as the current frame target frame.
4. The method according to claim 1, wherein the obtaining the target color model of the current frame according to the target segmentation result of the first frame and the previous frame in the step (4) specifically comprises:
(41) establishing a Gaussian mixture model for the target foreground pixels of the first frame to obtain a color model H_fore-1; respectively establishing Gaussian mixture models for the target foreground and the target background of the previous frame to obtain a foreground color model H_fore-2 and a background color model H_back-2;
(42) the target foreground color model H_fore of the current frame is a weighted combination of H_fore-1 and H_fore-2 [formula given as an image in the original publication], and the target background color model of the current frame is H_back = H_back-2, wherein the quantity shown in the image is the weighting coefficient of the color model;
(43) the foreground probability value H_fore-F(q_k) and the background probability value H_back-F(q_k) of super pixel q_k are:

H_fore-F(q_k) = P_fore(q_k) = H_fore(q_k) / (H_fore(q_k) + H_back(q_k)),    H_back-F(q_k) = 1 − P_fore(q_k)

wherein H_fore(q_k) and H_back(q_k) are the probabilities that super pixel q_k belongs to the foreground and to the background, respectively, obtained from the foreground color model H_fore and the background color model H_back.
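A hedged sketch of the weighted color model in this claim, assuming the three Gaussian mixture models have already been fitted (here with scikit-learn). The weighting coefficient `alpha` and the use of mean superpixel colors as samples are illustrative choices, since the claim's exact weighting formula is only published as an image.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def foreground_probability(sp_colors, fg_first, fg_prev, bg_prev, alpha=0.5):
    """Per-superpixel foreground probability from weighted color models.

    sp_colors -- M x 3 mean colors of the current-frame superpixels
    fg_first  -- GaussianMixture fitted on first-frame foreground pixels   (H_fore-1)
    fg_prev   -- GaussianMixture fitted on previous-frame foreground       (H_fore-2)
    bg_prev   -- GaussianMixture fitted on previous-frame background       (H_back-2)
    alpha     -- weighting coefficient between the two foreground models (hypothetical)
    """
    # score_samples returns log-likelihoods; convert to densities before mixing.
    h_fore = alpha * np.exp(fg_first.score_samples(sp_colors)) \
             + (1.0 - alpha) * np.exp(fg_prev.score_samples(sp_colors))
    h_back = np.exp(bg_prev.score_samples(sp_colors))

    p_fore = h_fore / (h_fore + h_back + 1e-12)        # H_fore-F(q_k)
    return p_fore, 1.0 - p_fore                        # (foreground, background) probabilities
```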
5. The video segmentation method based on the saliency of the strong target constraint video according to claim 1, wherein the saliency of the strong target constraint video frame calculated in the step (4) is specifically:
the closer a region is to the target color model, the stronger its saliency; the video frame saliency S_v(r_k) is computed from the following quantities [formula given as an image in the original publication]: D_s(r_k, r_i) denotes the distance between the centers of gravity of the super pixel regions r_k and r_i; σ_s is a distance coefficient; w(r_i) is the saliency weight of super pixel r_i with respect to super pixel r_k, where the more pixels the region r_i contains, the greater its influence; and D_r(r_k, r_i) is the color distance between r_k and r_i;
for any super pixel region r_k and target frame b, with empirical weights t_i, i = 1, 2, 3, empirical thresholds T_1 and T_2, and dis(r_k, b) the distance function between the region and the target frame, the distance weight w_s(r_k) of the super pixel is:

w_s(r_k) = t_1, if dis(r_k, b) < T_1;   t_2, if T_1 ≤ dis(r_k, b) < T_2;   t_3, otherwise

the color model weight in S_v(r_k) is determined by H_fore-F(r_k) [formula given as an image in the original publication], wherein H_fore-F(r_k) represents the foreground probability value of super pixel r_k.
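The sketch below combines the quantities of claim 5 into a per-superpixel saliency score under stated assumptions: the claim's exact fusion is published as an image, so the multiplicative combination, the piecewise scale weights, and all numeric thresholds here are hypothetical stand-ins.

```python
import numpy as np

def strong_target_saliency(D_s, D_r, n_pix, dist_to_box, H_fore,
                           sigma_s=0.25, t1=0.3, t2=0.6, weights=(1.0, 0.6, 0.3)):
    """Hedged sketch of the strong-target-constrained saliency of claim 5.

    D_s          -- pairwise spatial distances between superpixel centroids (normalised)
    D_r          -- pairwise color distances between superpixels
    n_pix        -- pixel count of each superpixel (its contrast weight w(r_i))
    dist_to_box  -- distance of each superpixel to the target frame (normalised)
    H_fore       -- foreground probability of each superpixel from the target color model
    t1, t2 and weights are hypothetical stand-ins for the empirical T1, T2 and t_i.
    """
    # Global-contrast term: color contrast weighted by spatial proximity and region size.
    contrast = (np.exp(-D_s / sigma_s ** 2) * n_pix[None, :] * D_r).sum(axis=1)

    # Target-scale weight: superpixels closer to the target get a larger weight.
    w_s = np.where(dist_to_box < t1, weights[0],
                   np.where(dist_to_box < t2, weights[1], weights[2]))

    s = contrast * w_s * H_fore        # fuse contrast, position constraint, color constraint
    return (s - s.min()) / (s.max() - s.min() + 1e-8)
```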
6. The method for video segmentation based on strong target constraint video saliency as claimed in claim 1, wherein said step (5) specifically comprises:
(51) classifying the region saliency into three kinds of labels by the threshold limiting and histogram segmentation algorithm:

label(r_k) = 0 (background), if S_v(r_k) < T_basic;   2 (foreground), if S_v(r_k) > T_his;   1 (unknown), otherwise

wherein, for each super pixel region r_k, the label label(r_k) takes the values 0, 1 and 2 to represent the background, unknown and foreground regions, respectively; S_v(r_k) is the video frame saliency result; T_basic is the fixed limiting threshold, and a super pixel r_k whose saliency is smaller than T_basic belongs to the background region; T_his is the saliency mean, and regions with saliency greater than T_his are marked as foreground; the remaining part is marked as the unknown region;
(52) calculating the foreground/background prior probability model of the current frame:

Θ_fore = Θ_fore-1 + ρ·Θ_fore-S
Θ_back = Θ_back-S

wherein ρ is the color model weighting coefficient; Θ_fore-S represents the foreground Gaussian mixture model constructed from the foreground region pixels; Θ_back-S represents the background Gaussian mixture model constructed from the background region pixels; and Θ_fore-1 represents the Gaussian mixture model of the target foreground of the first frame;
(53) the normalized foreground probability Θ_fore-F of the current frame is:

Θ_fore-F = Θ_fore / (Θ_fore + Θ_back)
7. The video segmentation method based on strong target constraint video saliency as claimed in claim 1, wherein the superpixel-based spatio-temporally continuous fully-connected conditional random field model in step (6) is specifically:
defining, for each super pixel region r_i, a random variable x_{r_i} representing its segmentation label, x_{r_i} ∈ {0, 1}, where 0 is background and 1 is foreground; establishing the superpixel-based spatio-temporally continuous fully-connected conditional random field model:

E(X) = Σ_{i∈M} Ψ_i(x_{r_i}) + Σ_{i,j∈M} Ψ_ij(x_{r_i}, x_{r_j}) + Σ_{i∈M, j∈N} Φ_ij(x_{r_i}, x_{r_j})

wherein Ψ_i is the data term; Ψ_ij is the intra-frame smoothing term; Φ_ij is the inter-frame smoothing term; and M and N respectively represent the total number of superpixels in the current frame and in the adjacent frame;
the foreground probability Θ_fore-F(r_i) of super pixel r_i is:

Θ_fore-F(r_i) = (1/O) · Σ_{o=1}^{O} Θ_fore-F(q_o)

wherein Θ_fore-F(q_o) represents the foreground probability value of the current frame pixel q_o, and O is the number of pixels in the region of super pixel r_i;
the data term Ψ_i is defined from this probability [formula given as an image in the original publication];
the intra-frame smoothing term Ψ_ij is defined as:

Ψ_ij(r_i, r_j) = w_1·f_dis(r_i, r_j) + w_2·D_r(r_i, r_j) + w_3·f_edge(r_i, r_j),   i, j ∈ M

wherein f_dis(r_i, r_j) represents the physical distance feature between super pixels r_i and r_j; D_r(r_i, r_j) represents the color distance feature between them; f_edge(r_i, r_j) represents the edge feature between them; and w_1, w_2 and w_3 are the respective proportions of f_dis(r_i, r_j), D_r(r_i, r_j) and f_edge(r_i, r_j) in the intra-frame smoothing term;
the inter-frame smoothing term Φ_ij is defined accordingly [formulas given as images in the original publication], wherein the corresponding quantity of the adjacent-frame region r_j appears in the formula and α is the temporal connection distance characteristic parameter.
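Because the CRF formulas above are only partly recoverable from the published text, the following is a naive, self-contained mean-field sketch of a superpixel-level fully connected model in the same spirit: explicit affinity matrices stand in for the fast high-dimensional Gaussian filtering, and every weight and function name is a hypothetical choice rather than the patent's actual parameterization.

```python
import numpy as np

def mean_field_segmentation(unary_fg, K_intra, K_inter, q_prev,
                            n_iters=10, w_intra=1.0, w_inter=0.5):
    """Naive mean-field inference for a superpixel fully-connected CRF.

    unary_fg -- foreground probability of each current-frame superpixel (data term source)
    K_intra  -- M x M affinity between current-frame superpixels (distance/color/edge)
    K_inter  -- M x N affinity to the N superpixels of the adjacent frame
    q_prev   -- N x 2 label distribution of the adjacent frame (background, foreground)
    """
    unary = -np.log(np.stack([1.0 - unary_fg, unary_fg], axis=1) + 1e-8)  # Psi_i for labels {0, 1}
    q = np.exp(-unary)
    q /= q.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # Potts-style messages: reward agreeing with similar superpixels (self-term removed).
        msg_intra = w_intra * (K_intra @ q - np.diag(K_intra)[:, None] * q)
        msg_inter = w_inter * (K_inter @ q_prev)
        energy = unary - msg_intra - msg_inter
        q = np.exp(-energy)
        q /= q.sum(axis=1, keepdims=True)

    return q[:, 1] > 0.5        # foreground mask over superpixels
```

In the patent the pairwise kernels are evaluated implicitly through high-dimensional Gaussian filtering, which is what makes the fully connected model tractable for many superpixels; the dense matrices used here are only workable for small M.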

Publications (2)

Publication Number Publication Date
CN107644429A CN107644429A (en) 2018-01-30
CN107644429B true CN107644429B (en) 2020-05-19







Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
Granted publication date: 20200519
Termination date: 20200930