CN114820423A - Automatic cutout method based on saliency target detection and matching system thereof - Google Patents

Automatic cutout method based on saliency target detection and matching system thereof

Info

Publication number
CN114820423A
CN114820423A
Authority
CN
China
Prior art keywords
matting
image
decode
loss
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111060436.6A
Other languages
Chinese (zh)
Inventor
孙创开
黄海龙
伍俊英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Faisco Internet Technology Co ltd
Original Assignee
Guangzhou Faisco Internet Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Faisco Internet Technology Co ltd filed Critical Guangzhou Faisco Internet Technology Co ltd
Priority to CN202111060436.6A priority Critical patent/CN114820423A/en
Publication of CN114820423A publication Critical patent/CN114820423A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/0002 - Inspection of images, e.g. flaw detection
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 - Computing arrangements using knowledge-based models
    • G06N5/04 - Inference or reasoning models
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 - Image enhancement or restoration
    • G06T5/20 - Image enhancement or restoration using local operators
    • G06T5/30 - Erosion or dilatation, e.g. thinning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/11 - Region-based segmentation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/194 - Segmentation; Edge detection involving foreground-background segmentation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20212 - Image combination
    • G06T2207/20221 - Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of computer vision and discloses an automatic matting method based on salient object detection, together with a supporting system. The method comprises the following steps: the image is input into a salient object detection neural network module to obtain a mask that separates foreground from background; the mask and the original image then enter a matting module, which generates an image with a transparent background and a visible foreground; the user judges from this matting result whether further modification is needed; if the result is complete and accurate, the output of the salient object detection network can be used directly; if the matting is incomplete or not accurate enough, the process switches to an interactive matting module, where the user makes local modifications according to the actual situation; finally, the matting result is recomputed from the modified mask and the original image. The method automatically extracts the foreground of the original image, allows the automatic result to be edited more precisely, and conveniently and efficiently helps the user obtain an accurate matting result.

Description

Automatic cutout method based on saliency target detection and matching system thereof
Technical Field
The invention relates to the technical field of digital image processing, and in particular to an automatic matting method based on salient object detection and a supporting system thereof.
Background
On e-commerce design platforms, users often need to perform various editing operations on images, and matting is one of the most frequent scenarios. Specifically, matting means segmenting the target region of an image so that the foreground is accurately separated from the background. It is an important technique in image and video processing, and designers usually perform matting with tools such as Photoshop. This places high demands on the designer's skill with image processing software, and extracting objects with complicated details (such as hair, mesh, or translucent glass) usually takes considerable time and effort; the efficiency is low and the results are not stable.
With the development of image processing technology, and in particular the rapid progress of deep convolutional neural network (CNN) techniques in recent years, many excellent matting algorithms have emerged. In 2017, Xu et al. (Ning Xu, Brian Price, Scott Cohen and Thomas Huang, Deep Image Matting, CVPR 2017) proposed a two-stage approach: first, the image and its corresponding trimap are fed into a deep convolutional encoder-decoder network that predicts the alpha matte (mask) of the image; the mask produced by the first stage is then refined with a small convolutional network to obtain sharper edges. In 2019, Cai et al. (Shaofan Cai, Xiaoshuai Zhang, Haoqiang Fan, Haibin Huang, Jiangyu Liu, Jiaming Liu, Jiaying Liu, Jue Wang, and Jian Sun, Disentangled Image Matting, ICCV 2019) proposed decomposing the matting problem into two subtasks: trimap adaptation (a classification task) and alpha estimation (a regression task). Trimap adaptation is a pixel-level classification problem that infers the global structure of the input image by identifying the definite foreground, definite background, and semi-transparent regions; alpha estimation is a regression problem that computes the opacity value of each pixel.
Although these methods have achieved great breakthroughs in matting accuracy, they still depend on a trimap, and the quality of the trimap strongly influences the final matting result, so truly end-to-end, high-accuracy automatic matting has not been achieved. In addition, to address the shortcomings of salient object detection algorithms in multi-object and complex scenes encountered in practice, the invention further provides an assisted matting system combining automatic matting and interactive matting.
Disclosure of Invention
To solve the problems in the prior art, and inspired by the principle of salient object detection, the invention provides an automatic matting method based on salient object detection. The algorithm automatically identifies the most visually salient object in an image and segments it accurately; neither the input nor any intermediate step depends on a trimap, so truly end-to-end automatic matting is realized. Meanwhile, to address the shortcomings of salient object detection algorithms in multi-object and complex scenes in practical applications and to let the user flexibly edit and modify the matting target, the invention also provides a matting assistance system that combines automatic matting with interactive matting.
In order to achieve the above object, a first aspect of the present invention provides an automatic matting method based on salient object detection, including:
S1_1, firstly, an image is input into a Res_Swish backbone network formed from "Bottleneck" modules whose original ReLU activation function is replaced with the more accurate Swish function, and passes through five encoding stages, Encode_stage1, Encode_stage2, Encode_stage3, Encode_stage4 and Encode_stage5, and decoding stages Decode_stage4, Decode_stage3, Decode_stage2 and Decode_stage1; meanwhile, each output of an Encode stage is added to the symmetric Decode stage so that multi-scale feature information is effectively utilized;
S1_2, a convolution with kernel size 3 × 3 and padding 1 is applied to the output of Decode_stage1; because no down-sampling is performed at Decode_stage1, the output tensor M1 is obtained directly without up-sampling. Convolutions with kernel size 3 × 3 and padding 1 are then applied to the outputs of Decode_stage2, Decode_stage3, Decode_stage4 and Decode_stage5, and the results are up-sampled with a bilinear interpolation algorithm to obtain output tensors M2, M3, M4 and M5 of the same size as the input.
S1_3, M1, M2, M3, M4 and M5 are transversely spliced along dimension 1 into a 6-channel tensor M0, and M0 is then convolved into a single-channel tensor with a convolution kernel of size 1 × 1.
S1_4, finally, a sigmoid operation is applied to the single-channel tensor obtained in step S1_3 to obtain a probability matrix M[:,:] of each pixel belonging to the foreground or the background. Multiplying the probability matrix M by 255 yields the mask map Alpha_SOD predicted by the salient object detection module;
to achieve the above object, a second aspect of the present invention provides a combined automatic and interactive matting aid system, comprising:
S2_1, inputting the image Image to be processed into a salient object detection module, automatically inferring, through a trained neural network model, the probability P(x_ij) that each pixel in the image belongs to the foreground, and obtaining a two-dimensional probability matrix M[:,:] of the whole image; multiplying the probability matrix M by 255 to obtain a mask map Alpha_SOD predicted by the salient object detection module;
S2_2, importing the mask map Alpha_SOD obtained in step S2_1 and the original image Image into a Matting module to obtain a foreground/background-separated image Matting;
S2_3, the user judges the effect of the matting obtained in step S2_2: if the requirements on the completeness and precision of the subject-region matting are met, the Matting image is downloaded for use (steps S2_4 to S2_9 need not be executed); otherwise, the interactive matting module is entered;
S2_4, binarizing the Alpha_SOD obtained in step S2_1 to obtain a binary image Binary_alpha1;
S2_5, performing a mask calculation on the binary image Binary_alpha1 obtained in step S2_4 and the original image, and locally modifying the subject region with a GrabCut-based interactive matting algorithm (blue strokes mark regions to keep, red strokes mark regions to delete) to obtain a more accurate binary mask image Binary_alpha2;
S2_6, calculating the difference between Binary_alpha1 and Binary_alpha2, and generating a Trimap of the locally modified regions by applying erosion, dilation and similar operations to the edges of the differing pixel regions (a trimap is generated only for the modified regions, not regenerated for all regions);
S2_7, inputting the original image and the Trimap generated in step S2_6 into a trained trimap-based matting neural network model to generate a more accurate mask image Alpha_mat;
S2_8, importing the more accurate mask image Alpha_mat obtained in step S2_7 into the Matting module to obtain the Matting image;
S2_9, the user judges the effect of the matting obtained in step S2_8: if the requirements on the completeness and precision of the subject-region matting are met, the Matting image is downloaded; otherwise, the interactive matting module is entered again and steps S2_4 to S2_8 are executed until the matting effect satisfies the user;
Compared with the online matting tools offered by e-commerce design platforms on the market, the invention provides an automatic matting method based on salient object detection. A salient object detection neural network is trained with a large amount of high-precision training data, and the trained network is then used to infer the matting result for the image to be processed; the process requires no manual interaction and does not depend on a trimap. Manual spot checks show that the automatic matting method based on salient object detection achieves a completeness rate of about 90% on the subject region of images, which greatly reduces the amount of manual work and improves matting efficiency.
Meanwhile, to better address cases where a salient object detection algorithm segments the subject region inaccurately, which occur in practice in some multi-object and complex scenes, the invention also provides a matting assistance system combining automatic matting and interactive matting. It lets the user edit and modify the subject region according to the actual situation, further improving the flexibility of the matting product and the accuracy of the final matting result.
drawings
FIG. 1 is a general system framework flow diagram of the present invention.
FIG. 2 is a block diagram of a salient object detection neural network based on a U-Net structure according to the present invention.
FIG. 3 is a modified Res _ Swish module structure diagram based on the "Bottleneck" module according to the present invention.
Fig. 4 is a diagram showing the specific effects of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
According to a first aspect of the invention, the embodiment discloses an automatic matting method based on saliency object detection.
As shown in fig. 2, the operation of this embodiment is further explained as follows:
1) The overall network proposed in step S1_1 is an improvement on the U-Net architecture with its classic Encode-Decode structure, in which the encoding and decoding network units of each stage are built from the "Bottleneck" module. Specifically, a new Res_Swish module is formed by replacing the ReLU activation function of the "Bottleneck" module with the more accurate Swish function (the structure is shown in fig. 3). The Swish function is expressed as:
swish(x)=x·sigmoid(x)
where x represents the input.
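For illustration, a minimal PyTorch sketch of such a bottleneck residual unit with Swish in place of ReLU is given below; the channel widths, stride handling and use of BatchNorm are assumptions, since the patent does not specify them.

import torch
import torch.nn as nn

class Swish(nn.Module):
    # swish(x) = x * sigmoid(x)
    def forward(self, x):
        return x * torch.sigmoid(x)

class ResSwishBottleneck(nn.Module):
    """Bottleneck residual block (1x1 -> 3x3 -> 1x1) with Swish activations.
    Channel widths and the use of BatchNorm are illustrative assumptions."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False), nn.BatchNorm2d(mid_ch), Swish(),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), Swish(),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        # Projection shortcut when the shape changes, identity otherwise.
        self.shortcut = (nn.Identity() if stride == 1 and in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False))
        self.act = Swish()

    def forward(self, x):
        return self.act(self.body(x) + self.shortcut(x))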
2) The input image is uniformly resized to 480 × 480. In the Encode stages, down-sampling is realized with max pooling of kernel_size = 2 and stride = 2; in the Decode stages, up-sampling is implemented with a bilinear interpolation algorithm. The five encoding stages Encode_stage1, Encode_stage2, Encode_stage3, Encode_stage4 and Encode_stage5 and the decoding stages Decode_stage4, Decode_stage3, Decode_stage2 and Decode_stage1 of the feature extraction part correspond to spatial resolutions of 480, 240, 120, 60 and 30. Adding each Encode-stage output to the symmetric Decode stage, as described in step S1_1, is motivated by the following consideration: during the multi-stage feature extraction of the encoding phase, features at different scales carry different information; the high-resolution features of the lower layers contain rich local information, which helps locate the key points of the salient object, while the low-resolution features of the higher layers contain global semantic information about the whole object. Exploiting this multi-scale feature information therefore improves the accuracy of the neural network;
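By way of illustration, the following PyTorch sketch shows the general shape of such a symmetric encoder-decoder with max-pool down-sampling, bilinear up-sampling and additive skip connections. The per-stage channel widths and the single convolutional block per stage are assumptions; the patent builds each stage from Res_Swish ("Bottleneck" with Swish) units, which the stage() helper only approximates.

import torch
import torch.nn as nn
import torch.nn.functional as F

def stage(in_ch, out_ch):
    # Placeholder for a stack of Res_Swish bottleneck units (see previous sketch).
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.BatchNorm2d(out_ch), nn.SiLU())

class SODEncoderDecoder(nn.Module):
    """U-Net-style backbone: 5 encode stages with 2x max-pool between them,
    decode stages with bilinear up-sampling and additive skip connections.
    Channel widths (32..512) are illustrative assumptions."""
    def __init__(self):
        super().__init__()
        chs = [32, 64, 128, 256, 512]
        self.enc = nn.ModuleList([stage(3 if i == 0 else chs[i - 1], chs[i]) for i in range(5)])
        self.dec = nn.ModuleList([stage(chs[i + 1], chs[i]) for i in range(4)])  # decode stage4..1
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):                      # x: (N, 3, 480, 480)
        feats = []
        for i, enc in enumerate(self.enc):
            x = enc(x if i == 0 else self.pool(x))
            feats.append(x)                    # resolutions 480, 240, 120, 60, 30
        d = feats[-1]                          # deepest features (30 x 30)
        decoded = [d]
        for i in range(3, -1, -1):             # decode stage4 .. stage1
            d = F.interpolate(d, scale_factor=2, mode="bilinear", align_corners=False)
            d = self.dec[i](d) + feats[i]      # additive skip from the symmetric encode stage
            decoded.append(d)
        return list(reversed(decoded))         # shallowest (480 x 480) first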
3) In step S1_2, since the resolution at Decode_stage1, which involves no down-sampling, is still 480 × 480, only a convolution with kernel size 3 × 3 and padding 1 needs to be applied to the output of Decode_stage1, and the output tensor M1 is obtained directly. Because the other decoding stages involve down-sampling, after applying convolutions with kernel size 3 × 3 and padding 1 to the outputs of Decode_stage2, Decode_stage3, Decode_stage4 and Decode_stage5, the results must be up-sampled to 480 × 480 with a bilinear interpolation algorithm; the corresponding output tensors are denoted M2, M3, M4 and M5.
4) In step S1_3, M1, M2, M3, M4 and M5 are transversely spliced into a 6-channel tensor and then convolved with a convolution kernel of size 1 × 1 to obtain the single-channel fused feature map M0:
M0=conv(concat(M1,M2,M3,M4,M5))
Where M0 represents the fused feature map, conv represents the convolutional layer, and concat represents the transverse splicing operation.
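A hedged PyTorch sketch of this multi-scale fusion head follows: one 3 × 3 convolution (padding 1) per decode stage, bilinear up-sampling to 480 × 480, channel-wise concatenation along dimension 1, and a 1 × 1 fusion convolution followed by sigmoid. The per-stage channel counts are assumptions, and only the five named maps M1-M5 are concatenated here even though the text speaks of a 6-channel tensor.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """Side outputs M1..M5 from the decode stages, fused into M0.
    Per-stage channel counts are illustrative assumptions."""
    def __init__(self, stage_channels=(32, 64, 128, 256, 512), out_size=480):
        super().__init__()
        self.out_size = out_size
        # One 3x3 conv (padding=1) per decode stage, producing a single-channel map.
        self.side = nn.ModuleList(nn.Conv2d(c, 1, kernel_size=3, padding=1) for c in stage_channels)
        # Note: the patent describes a 6-channel concatenation; only M1..M5 are named,
        # so 5 channels are fused here.
        self.fuse = nn.Conv2d(len(stage_channels), 1, kernel_size=1)  # 1x1 fusion conv

    def forward(self, decode_feats):
        # decode_feats: decode-stage outputs, shallowest (480x480) first.
        side_maps = []
        for conv, f in zip(self.side, decode_feats):
            m = conv(f)
            if m.shape[-1] != self.out_size:   # up-sample all but the full-resolution stage
                m = F.interpolate(m, size=(self.out_size, self.out_size),
                                  mode="bilinear", align_corners=False)
            side_maps.append(m)
        m0 = self.fuse(torch.cat(side_maps, dim=1))   # concat along dim 1, then 1x1 conv
        probs = torch.sigmoid(m0)                     # per-pixel foreground probability M
        alpha_sod = probs * 255.0                     # predicted mask map Alpha_SOD
        return m0, side_maps, alpha_sod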
5) In step S1_4, a sigmoid operation is performed on the single-channel feature map M0 obtained in step S1_3 to obtain the probability matrix M[:,:] of each pixel belonging to the foreground or the background. Multiplying the probability matrix M by 255 yields the mask map Alpha_SOD predicted by the salient object detection module. During training, a binary cross-entropy loss is used to optimize the model parameters; the loss function is:
loss = −(1/(h·w)) Σ_{i=1}^{h·w} [ t_i·log(o_i) + (1 − t_i)·log(1 − o_i) ]
where h and w represent the height and width of the image, respectively; t_i represents the ground-truth probability that pixel i belongs to the foreground, and o_i represents the probability predicted by the neural network that pixel i belongs to the foreground.
The whole loss process uses a multi-stage binary cross-entropy loss function to supervise the multilayer network, and comprises 6 loss functions of feature maps M0, M1, M2, M3, M4 and M5. The global loss function is defined as follows:
loss_total = loss_M0 + loss_M1 + loss_M2 + loss_M3 + loss_M4 + loss_M5
where loss_total represents the total loss and loss_M0 through loss_M5 denote the losses of the corresponding stages. The overall salient object detection neural network is trained for 600 epochs using the Adam optimization algorithm, with an initial learning rate of 0.001 and an exponential decay schedule for the learning rate.
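Under these stated settings (multi-stage BCE over M0-M5, Adam, initial learning rate 0.001, exponential decay, 600 epochs), a minimal PyTorch training sketch might look as follows; the decay factor, the data-loader interface and the model's return signature are assumptions not given in the patent.

import torch
import torch.nn.functional as F

def multi_stage_bce(m0, side_maps, target):
    """Sum of binary cross-entropy losses over the fused map M0 and side maps M1..M5.
    `target` is the ground-truth foreground mask, shape (N, 1, 480, 480), values in [0, 1]."""
    loss = F.binary_cross_entropy_with_logits(m0, target)
    for m in side_maps:
        loss = loss + F.binary_cross_entropy_with_logits(m, target)
    return loss

# Assumed setup: `model` returns (m0, side_maps, alpha_sod) as in the fusion-head sketch;
# `loader` yields (image, mask) batches -- both are placeholders for the patent's actual pipeline.
def train(model, loader, epochs=600, device="cuda"):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.95)  # decay factor assumed
    model.to(device).train()
    for _ in range(epochs):
        for img, mask in loader:
            img, mask = img.to(device), mask.to(device)
            m0, side_maps, _ = model(img)
            loss = multi_stage_bce(m0, side_maps, mask)
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()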
To verify the effectiveness of the method, 1000 images are randomly selected from images uploaded by users and predicted with the trained model. The predictions are reviewed manually and measured with two indicators, completeness and precision; the statistics show that both the completeness and the precision of the subject-region matting are about 90%, demonstrating the strong end-to-end matting capability of the method.
According to a second aspect of the present invention, this embodiment discloses a combined automatic and interactive matting aid system. As shown in fig. 1, the further operation of this embodiment is explained as follows:
1) In step S2_1, the image Image to be processed is input into the salient object detection module, and the trained salient object detection neural network model automatically infers the probability P(x_ij) that each pixel in the image belongs to the foreground, yielding a two-dimensional probability matrix M[:,:] of the whole image. Multiplying the probability matrix M by 255 yields the mask map Alpha_SOD predicted by the salient object detection module.
2) In step S2_2, the mask map Alpha_SOD obtained in step S2_1 and the original image Image are imported into the Matting module to obtain the foreground/background-separated image Matting.
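The patent does not spell out how the Matting module combines the mask with the original image; a simple sketch using OpenCV and NumPy, assuming Alpha_SOD is an 8-bit single-channel mask used directly as the alpha channel, could be:

import cv2
import numpy as np

def apply_mask(image_bgr: np.ndarray, alpha_sod: np.ndarray) -> np.ndarray:
    """Combine the original image with the predicted mask into a BGRA image
    whose background becomes transparent (assumed behaviour of the Matting module)."""
    if alpha_sod.shape[:2] != image_bgr.shape[:2]:
        alpha_sod = cv2.resize(alpha_sod, (image_bgr.shape[1], image_bgr.shape[0]))
    matting = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2BGRA)
    matting[:, :, 3] = alpha_sod          # use Alpha_SOD as the alpha channel
    return matting

# Example usage (file names are placeholders):
# img = cv2.imread("input.jpg"); mask = cv2.imread("alpha_sod.png", cv2.IMREAD_GRAYSCALE)
# cv2.imwrite("matting.png", apply_mask(img, mask))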
3) In step S2_3, the user judges the quality of the Matting obtained in step S2_2; if it meets the requirements, the Matting image is downloaded for use (steps S2_4 to S2_9 need not be executed); otherwise, the interactive matting module is entered.
4) In step S2_4, the Alpha_SOD obtained in step S2_1 is binarized to obtain the binary image Binary_alpha1;
5) In step S2_5, a mask calculation is performed on the binary image Binary_alpha1 obtained in step S2_4 and the original image, and the subject region is locally modified with a GrabCut-based interactive matting algorithm (blue strokes mark regions to keep, red strokes mark regions to delete) to obtain a more accurate binary mask image Binary_alpha2.
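A hedged sketch of such a GrabCut-based local refinement with OpenCV is given below; the representation of the user's blue/red strokes as boolean maps and the number of GrabCut iterations are assumptions.

import cv2
import numpy as np

def refine_with_grabcut(image_bgr, binary_alpha1, keep_strokes, delete_strokes, iters=5):
    """Locally refine the binary mask with GrabCut, seeded by the automatic result
    and by user strokes. `keep_strokes` / `delete_strokes` are boolean maps of the
    blue (keep) and red (delete) scribbles -- the stroke format is an assumption."""
    mask = np.where(binary_alpha1 > 0, cv2.GC_PR_FGD, cv2.GC_PR_BGD).astype(np.uint8)
    mask[keep_strokes] = cv2.GC_FGD       # user says: definitely foreground
    mask[delete_strokes] = cv2.GC_BGD     # user says: definitely background
    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, None, bgd_model, fgd_model, iters, cv2.GC_INIT_WITH_MASK)
    binary_alpha2 = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 255, 0)
    return binary_alpha2.astype(np.uint8)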
6) In step S2_6, the difference between Binary_alpha1 and Binary_alpha2 is computed, and a Trimap of the locally modified regions is generated by applying erosion, dilation and similar morphological operations to the edges of the differing pixel regions (the trimap is generated only for the modified regions; it is not regenerated for the whole image).
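One simple interpretation of this local trimap generation, sketched with OpenCV, is to dilate the changed pixels into an "unknown" band around the modified edges; the exact morphological recipe and band width are assumptions.

import cv2
import numpy as np

def local_trimap(binary_alpha1, binary_alpha2, band=10):
    """Build a trimap only around the locally modified regions.
    Convention assumed: 0 = background, 128 = unknown, 255 = foreground;
    the width of the unknown band (`band`) is an assumption."""
    trimap = np.where(binary_alpha2 > 0, 255, 0).astype(np.uint8)
    changed = cv2.absdiff(binary_alpha1, binary_alpha2)              # pixels whose label changed
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (band, band))
    unknown = cv2.morphologyEx(changed, cv2.MORPH_DILATE, kernel)    # widen the changed edges
    trimap[unknown > 0] = 128                                        # mark the band as unknown
    return trimap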
7) In step S2_7, the original image and the Trimap generated in step S2_6 are input into the trained trimap-based matting neural network model to generate a more accurate mask image Alpha_mat.
8) In step S2_8, the more accurate mask map Alpha_mat obtained in step S2_7 is imported into the Matting module to obtain the Matting image.
9) In step S2_9, the user judges the quality of the matting obtained in step S2_8: if the requirements on the completeness and precision of the subject-region matting are met, the Matting image is downloaded; otherwise, the interactive matting module is entered again and steps S2_4 to S2_8 are executed until the matting result satisfies the user.
Compared with the assisted matting tools currently on the market, the invention first uses the salient object detection neural network to obtain an accurate matting result, reducing the uncertainty introduced by manual interaction and trimap drawing and truly realizing end-to-end automatic matting. The invention improves the U-Net network and fuses multi-scale feature information from different stages, giving stronger detail-segmentation capability (as shown in fig. 4). Meanwhile, to better compensate for the shortcomings of the salient object detection algorithm in practical applications, the invention also provides a matting assistance system combining automatic matting and interactive matting. It lets the user make a second round of fine adjustments to the subject region based on the matting result of the first step, which further improves the flexibility of the matting product, yields a more accurate matting result, greatly reduces the difficulty of matting, and improves the image-editing efficiency of e-commerce design platforms.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (6)

1. An automatic matting method based on saliency object detection, characterized by comprising the following steps:
S1_1, firstly, inputting an image into a Res_Swish network formed by "Bottleneck" modules in which the original ReLU activation function is replaced with the more accurate Swish function, passing it through five encoding stages Encode_stage1, Encode_stage2, Encode_stage3, Encode_stage4 and Encode_stage5 and decoding stages Decode_stage4, Decode_stage3, Decode_stage2 and Decode_stage1, and adding each output of the Encode stage to the symmetric Decode stage to effectively utilize multi-scale feature information;
S1_2, performing a convolution with kernel size 3 × 3 and padding = 1 on the output of Decode_stage1, the output tensor M1 being obtained without an up-sampling operation because no down-sampling is performed at Decode_stage1; performing convolutions with kernel size 3 × 3 and padding = 1 on the outputs of Decode_stage2, Decode_stage3, Decode_stage4 and Decode_stage5, and up-sampling the results with a bilinear interpolation algorithm to obtain output tensors M2, M3, M4 and M5 of the same size as the input;
s1_3, transversely splicing M1, M2, M3, M4 and M5 into a 6-channel tensor M0 according to the dimension 1, and then convolving the M0 into a single-channel tensor by using a convolution kernel with the size of 1 x 1;
and S1_4, finally, carrying out sigmoid operation on the single-channel tensor obtained in the step S1_3 to obtain a probability matrix M [: for each pixel belonging to the foreground and the background. And multiplying the probability matrix M by 255 to obtain a mask map Alpha _ SOD predicted by the significance target detection module.
2. The automatic matting method based on saliency object detection as claimed in claim 1, characterized in that: the neural network based on saliency object detection is improved on a U-Net architecture with a classic Encode-Decode structure, wherein the encoding and decoding network unit of each stage is improved by a "Bottleneck" module; specifically, the ReLU activation function of the "Bottleneck" module is replaced by the more accurate Swish function to form a new Res_Swish network, and the Swish function is expressed as:
swish(x)=x·sigmoid(x)
where x represents the input.
3. The automatic matting method based on saliency object detection as claimed in claim 1, characterized in that: in step S1_3, M1, M2, M3, M4 and M5 are transversely spliced to generate a 6-channel tensor M0, and M0 is then convolved into a single-channel tensor using a convolution kernel of size 1 × 1, realizing the fusion of multi-scale feature information from different stages, according to the following formula:
M0=conv(concat(M1,M2,M3,M4,M5))
where M0 represents the fused feature map, conv represents the convolutional layer, and concat represents the transverse splicing operation.
4. The automatic matting method based on saliency object detection as claimed in claim 1, characterized in that: in the training process, a binary cross-entropy loss is used to optimize the model parameters, the loss function being as follows:
loss = −(1/(h·w)) Σ_{i=1}^{h·w} [ t_i·log(o_i) + (1 − t_i)·log(1 − o_i) ]
where h and w represent the height and width of the image, respectively; t_i represents the ground-truth probability that pixel i belongs to the foreground, and o_i represents the probability predicted by the neural network that pixel i belongs to the foreground.
5. The automatic matting method based on saliency object detection as claimed in claim 1 characterized by: the whole loss process uses a multi-stage binary cross-entropy loss function to supervise the multilayer network, and comprises 6 loss functions of feature maps M0, M1, M2, M3, M4 and M5, wherein the whole-process loss function is defined as follows:
loss_total = loss_M0 + loss_M1 + loss_M2 + loss_M3 + loss_M4 + loss_M5
where loss_total represents the total loss, and loss_M0 through loss_M5 represent the losses of the corresponding stages.
6. A matting aid system combining automatic matting and interactive matting, comprising the steps of:
S2_1, inputting the image Image to be processed into a salient object detection module, automatically inferring, through a trained neural network model, the probability P(x_ij) that each pixel in the image belongs to the foreground, and obtaining a two-dimensional probability matrix M[:,:] of the whole image; multiplying the probability matrix M by 255 to obtain a mask map Alpha_SOD predicted by the salient object detection module;
S2_2, importing the mask map Alpha_SOD obtained in step S2_1 and the original image Image into a Matting module to obtain a foreground/background-separated image Matting;
S2_3, judging, by the user, the effect of the matting obtained in step S2_2: if the requirements on the completeness and precision of the subject-region matting are met, downloading the Matting image for use (steps S2_4 to S2_9 need not be executed); otherwise, entering an interactive matting module;
S2_4, binarizing the Alpha_SOD obtained in step S2_1 to obtain a binary image Binary_alpha1;
S2_5, performing a mask calculation on the binary image Binary_alpha1 obtained in step S2_4 and the original image, and locally modifying the subject region with a GrabCut-based interactive matting algorithm (blue strokes mark regions to keep, red strokes mark regions to delete) to obtain a more accurate binary mask image Binary_alpha2;
S2_6, calculating the difference between Binary_alpha1 and Binary_alpha2, and generating a Trimap of the locally modified regions by applying erosion, dilation and similar operations to the edges of the differing pixel regions (a trimap is generated only for the modified regions, not regenerated for all regions);
S2_7, inputting the original image and the Trimap generated in step S2_6 into a trained trimap-based matting neural network model to generate a more accurate mask image Alpha_mat;
S2_8, importing the more accurate mask image Alpha_mat obtained in step S2_7 into the Matting module to obtain the Matting image;
S2_9, judging, by the user, the effect of the matting obtained in step S2_8: if the requirements on the completeness and precision of the subject-region matting are met, downloading the Matting image; otherwise, entering the interactive matting module again and executing steps S2_4 to S2_8 until the matting effect satisfies the user.
CN202111060436.6A 2021-09-10 2021-09-10 Automatic cutout method based on saliency target detection and matching system thereof Pending CN114820423A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111060436.6A CN114820423A (en) 2021-09-10 2021-09-10 Automatic cutout method based on saliency target detection and matching system thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111060436.6A CN114820423A (en) 2021-09-10 2021-09-10 Automatic cutout method based on saliency target detection and matching system thereof

Publications (1)

Publication Number Publication Date
CN114820423A true CN114820423A (en) 2022-07-29

Family

ID=82526219

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111060436.6A Pending CN114820423A (en) 2021-09-10 2021-09-10 Automatic cutout method based on saliency target detection and matching system thereof

Country Status (1)

Country Link
CN (1) CN114820423A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116167922A (en) * 2023-04-24 2023-05-26 广州趣丸网络科技有限公司 Matting method and device, storage medium and computer equipment

Similar Documents

Publication Publication Date Title
CN110322495B (en) Scene text segmentation method based on weak supervised deep learning
CN111210443B (en) Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN111047551B (en) Remote sensing image change detection method and system based on U-net improved algorithm
CN109299274B (en) Natural scene text detection method based on full convolution neural network
CN111539887B (en) Channel attention mechanism and layered learning neural network image defogging method based on mixed convolution
CN113807355B (en) Image semantic segmentation method based on coding and decoding structure
CN110276354B (en) High-resolution streetscape picture semantic segmentation training and real-time segmentation method
CN111353544B (en) Improved Mixed Pooling-YOLOV 3-based target detection method
CN110610509A (en) Optimized matting method and system capable of assigning categories
CN112508960A (en) Low-precision image semantic segmentation method based on improved attention mechanism
CN110781850A (en) Semantic segmentation system and method for road recognition, and computer storage medium
CN114048822A (en) Attention mechanism feature fusion segmentation method for image
CN116189180A (en) Urban streetscape advertisement image segmentation method
CN116645592B (en) Crack detection method based on image processing and storage medium
CN114820579A (en) Semantic segmentation based image composite defect detection method and system
CN113139544A (en) Saliency target detection method based on multi-scale feature dynamic fusion
CN114565770A (en) Image segmentation method and system based on edge auxiliary calculation and mask attention
CN110866938A (en) Full-automatic video moving object segmentation method
CN111553351A (en) Semantic segmentation based text detection method for arbitrary scene shape
CN113762265A (en) Pneumonia classification and segmentation method and system
CN115100652A (en) Electronic map automatic generation method based on high-resolution remote sensing image
CN112883926A (en) Identification method and device for table medical images
CN116486393A (en) Scene text detection method based on image segmentation
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
Ding et al. Rethinking click embedding for deep interactive image segmentation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination