CN113449820A - Image processing method, electronic device, and storage medium - Google Patents


Info

Publication number
CN113449820A
Authority
CN
China
Prior art keywords
class
category
image
mask
foreground
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110996642.1A
Other languages
Chinese (zh)
Other versions
CN113449820B (en)
Inventor
李艺 (Yi Li)
旷章辉 (Zhanghui Kuang)
陈益民 (Yimin Chen)
张伟 (Wei Zhang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Sensetime Technology Co Ltd
Original Assignee
Shenzhen Sensetime Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Sensetime Technology Co Ltd filed Critical Shenzhen Sensetime Technology Co Ltd
Priority to CN202110996642.1A priority Critical patent/CN113449820B/en
Publication of CN113449820A publication Critical patent/CN113449820A/en
Application granted granted Critical
Publication of CN113449820B publication Critical patent/CN113449820B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/70 Denoising; Smoothing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20172 Image enhancement details
    • G06T2207/20182 Noise reduction or smoothing in the temporal domain; Spatio-temporal filtering


Abstract

The application discloses an image processing method, an electronic device, and a storage medium. The image processing method includes: acquiring a training image and its class response map; splitting the class response map to obtain a classification response map for each of a plurality of classes; performing coefficient-of-variation smoothing on each class's classification response map to obtain a smoothed classification response map per class; obtaining a foreground mask for each class using the class's smoothed classification response map and the training image; obtaining a scale matrix using the per-class foreground masks; and generating a pseudo-mask image based on the per-class foreground masks and the scale matrix. The image processing method improves the quality of the generated pseudo masks.

Description

Image processing method, electronic device, and storage medium
Technical Field
The present application relates to the field of image processing application technologies, and in particular, to an image processing method, an electronic device, and a storage medium.
Background
Semantic segmentation is a basic computer vision task that aims to predict the pixel-level classification results of images. However, semantic segmentation requires the collection of class labels at the pixel level, which is both time consuming and expensive compared to other tasks such as classification and detection.
Recently, a great deal of research has been conducted on weakly supervised semantic segmentation, which attempts to achieve segmentation performance comparable to fully supervised methods using weak supervision such as image-level classification labels, scribbles, and bounding boxes.
Currently, weakly supervised segmentation generally relies on class response maps (CAMs) to generate pseudo masks. However, a class response map usually responds only at the most discriminative regions of an object and misses the remaining areas, i.e., the partial response problem, so the quality of the generated pseudo masks is low.
Disclosure of Invention
The application provides an image processing method, an electronic device and a storage medium.
One technical solution adopted by the present application is to provide an image processing method, including:
acquiring a training image and its class response map;
splitting the class response map to obtain a classification response map for each of a plurality of classes;
performing coefficient-of-variation smoothing on each class's classification response map to obtain a smoothed classification response map for the class;
obtaining a foreground mask for each class using the class's smoothed classification response map and the training image;
obtaining a scale matrix using the foreground masks of the classes;
and generating a pseudo-mask image based on the per-class foreground masks and the scale matrix.
In this way, coefficient-of-variation smoothing and proportional pseudo-mask generation together produce a high-quality pseudo mask, improving the quality of pseudo-mask generation.
Wherein the step of performing coefficient-of-variation smoothing on each class's classification response map to obtain a smoothed classification response map includes:
obtaining the coefficient of variation of each class's classification response map;
using the coefficient of variation of each class's classification response map as the smoothing parameter for that class;
and smoothing the pixels of each class's classification response map based on the class's smoothing parameter to obtain the class's smoothed classification response map.
In this manner, coefficient-of-variation smoothing expands the activated region of the classification response map, overcoming the partial response problem of class response maps.
Wherein obtaining the coefficient of variation of each class's classification response map comprises:
obtaining the confidence distribution of each class's classification response map;
obtaining the confidence deviation and the confidence mean of each class's classification response map based on the confidence distribution and a preset threshold;
and obtaining the coefficient of variation from the confidence deviation and the confidence mean.
In this manner, a concrete coefficient-of-variation smoothing scheme effectively expands the activated region of the target object.
Wherein obtaining the foreground mask of each class using the class's smoothed classification response map and the training image comprises:
obtaining the class-specific background of each class's smoothed classification response map;
and obtaining the foreground binary mask within each class-specific background using a preset algorithm and the training image, then combining the foreground binary masks of all classes into a foreground matrix.
In this manner, the foreground matrix represents the importance of each foreground position.
Wherein obtaining the scale matrix using the foreground mask of each class includes:
obtaining a class foreground score from the foreground binary mask within each class-specific background;
obtaining a pixel class score for each pixel of the training image;
obtaining the sum of the class foreground scores of all classes;
and obtaining the scale matrix from the pixel class scores and the sum of the class foreground scores.
In this manner, the scale matrix computes the importance of each position of each class independently.
Wherein generating the pseudo-mask image based on the foreground mask of each class and the scale matrix comprises:
multiplying the elements of the foreground matrix and the elements of the scale matrix along the channel dimension of the training image to generate the pseudo-mask image.
In this manner, proportional pseudo-mask generation optimizes the process from class response map to pseudo mask.
The image processing method further comprises:
normalizing the class response map;
and the step of splitting the class response map to obtain a plurality of classification response maps includes:
splitting the normalized class response map to obtain a classification response map for each of a plurality of classes.
In this manner, the statistical distributions of class response maps are made uniform across images.
Wherein the image processing method further comprises:
inputting the pseudo-mask image into a preset segmentation model and obtaining the loss mean produced by training on the pseudo-mask image;
processing the loss mean with a preset strategy when the loss mean is smaller than a preset loss threshold;
and training the preset segmentation model with the processed loss mean.
In this manner, the noise problem of the pseudo-mask image is handled by adjusting the loss values of the segmentation model.
Wherein processing the loss mean with a preset strategy comprises one of the following:
setting the loss mean to a preset threshold when the loss mean is greater than or equal to the preset threshold;
or scaling the loss mean using the preset threshold;
or setting the loss mean to 0 when the loss mean is greater than or equal to the preset threshold.
In this manner, the segmentation model is adjusted with an incomplete fitting strategy, improving the noise robustness of the segmentation.
Wherein the image processing method further comprises:
obtaining the pseudo-mask image output by the preset segmentation model;
and using the output pseudo-mask image as the input of the next round of training of the preset segmentation model.
In this manner, the output of the segmentation model serves as a new pseudo-mask input, i.e., a cyclic pseudo mask, improving the quality of the training annotations.
Another technical solution adopted by the present application is to provide an electronic device, including:
the acquisition module is used for acquiring a training image and a class response map thereof;
the segmentation module is used for segmenting the category response graph to obtain a plurality of category classification response graphs;
the processing module is used for carrying out coefficient of variation smoothing processing on each class classification response graph to obtain a smooth classification response graph of each class;
a computing module, configured to obtain a foreground mask of each class by using the smooth classification response map of each class and the training image, and further obtain a scaling matrix by using the foreground mask of each class;
a generating module, configured to generate a pseudo-mask image based on the foreground mask of each class and the scale matrix.
Another technical solution adopted by the present application is to provide an electronic device, which includes a memory and a processor coupled to the memory;
wherein the memory is configured to store program data and the processor is configured to execute the program data to implement the image processing method as described above.
Another technical solution adopted by the present application is to provide a computer storage medium for storing program data, which when executed by a computer, is used to implement the image processing method as described above.
The beneficial effects of the present application are as follows: the electronic device acquires a training image and its class response map; splits the class response map to obtain a classification response map for each of a plurality of classes; performs coefficient-of-variation smoothing on each class's classification response map to obtain a smoothed classification response map per class; obtains a foreground mask for each class using the class's smoothed classification response map and the training image; obtains a scale matrix using the per-class foreground masks; and generates a pseudo-mask image based on the per-class foreground masks and the scale matrix. The image processing method combines coefficient-of-variation smoothing with proportional pseudo-mask generation to produce high-quality pseudo masks, improving the quality of pseudo-mask generation.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic flowchart of an embodiment of an image processing method provided in the present application;
FIG. 2 is a block diagram of an embodiment of an image processing method provided in the present application;
FIG. 3 is a schematic flow chart of step S103 of the image processing method shown in FIG. 1;
FIG. 4 is a schematic flowchart of another embodiment of an image processing method provided in the present application;
FIG. 5 is a schematic structural diagram of an embodiment of an electronic device provided in the present application;
FIG. 6 is a schematic structural diagram of another embodiment of an electronic device provided herein;
fig. 7 is a schematic structural diagram of a computer-readable storage medium of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Specifically, referring to fig. 1 and fig. 2, fig. 1 is a schematic flowchart of an embodiment of an image processing method provided by the present application, and fig. 2 is a schematic framework diagram of an embodiment of the image processing method. The image processing method of the embodiment of the application can be applied to an electronic device, where the electronic device can be a server, a terminal device, a system in which a server and a terminal device cooperate, or a device with processing capability (such as a processor). Accordingly, the parts included in the electronic device, such as its units, sub-units, modules, and sub-modules, may all be deployed in the server, all in the terminal device, or distributed between the server and the terminal device.
Further, the server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as a plurality of software or software modules, for example, software or software modules for providing distributed servers, or as a single software or software module, and is not limited herein.
As shown in fig. 1, the image processing method according to the embodiment of the present application may specifically include the following steps:
step S101: and acquiring a training image and a class response map thereof.
The electronic device of the embodiment of the application first acquires the training image and then obtains its class response map; the manner of obtaining a class response map from a training image follows the prior art and is not repeated here. A class response map projects the response magnitudes of a feature map back onto the original image, making the model's behavior easier to interpret; it may take forms such as an attention map or a heat map.
The electronic device further normalizes the class response map A; specifically, min-max normalization is applied to A as:

Â(h, w, c) = (A(h, w, c) - min(A)) / (max(A) - min(A))

where h is the ordinate of the class response map, w is the abscissa, c is the channel index, A(h, w, c) is the value of the pixel at coordinates (h, w, c), min(A) is the smallest pixel value in the class response map, and max(A) is the largest pixel value in the class response map.
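As an illustration only (not code from the patent), the min-max normalization above can be sketched in NumPy; the small epsilon guarding against a constant-valued map is an added assumption:

```python
import numpy as np

def normalize_cam(cam: np.ndarray) -> np.ndarray:
    """Min-max normalize a class response map A of shape (C, H, W)
    so every value falls into [0, 1]."""
    a_min = cam.min()
    a_max = cam.max()
    # epsilon (an assumption, not in the patent) guards a constant map
    return (cam - a_min) / (a_max - a_min + 1e-8)

cam = np.array([[[0.0, 2.0], [4.0, 8.0]]])  # one class, 2x2 responses
norm = normalize_cam(cam)
```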
Step S102: and dividing the class response map to obtain a classification response map of a plurality of classes.
Step S103: and carrying out coefficient of variation smoothing treatment on each class classification response graph to obtain a smooth classification response graph of each class.
As shown in fig. 2, the electronic device performs coefficient of variation smoothing on the class response map to obtain the coefficient of variation of each class, and performs exponential transformation.
For example, the electronic device may divide the category response map into a plurality of categories of classification response maps, i.e., CAM slices in fig. 2, calculate a confidence distribution of the classification response map for each category, and smooth the classification response map according to the confidence distribution for each category. Specifically, the category response map has multi-dimensional information, each dimension is embodied as an image channel, that is, the category of the classification response map of the embodiment of the present application, and the classification response map of each category represents image information of one dimension of the category response map.
Referring to fig. 3, fig. 3 is a flowchart illustrating step S103 of the image processing method shown in fig. 1. As shown in fig. 3, step S103 in the embodiment of the present application specifically includes the following steps:
step S131: and obtaining the variation coefficient of each class classification response graph.
In the embodiment of the present application, the motivation of coefficient-of-variation smoothing is to smooth the class response map according to the variation of confidence over the spatial domain. Different images and different classes require different smoothing strengths depending on their confidence distributions. To measure a confidence distribution, the embodiment of the present application introduces the coefficient of variation cv, defined by:

cv = σ / μ

where σ is the deviation of the pixel confidences of a class's classification response map, μ is the mean of those pixel confidences, and the statistics are taken over the confidences f of the pixels in that class's classification response map.
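For illustration, the coefficient of variation of a single class's classification response map might be computed as below; the `tau` cutoff that drops near-zero background confidences is an assumed stand-in for the preset threshold mentioned earlier, not a value from the patent:

```python
import numpy as np

def coefficient_of_variation(resp: np.ndarray, tau: float = 0.1) -> float:
    """cv = sigma / mu over the confidence distribution of one class's
    classification response map; tau (assumed) drops near-zero
    background confidences before the statistics are taken."""
    fg = resp[resp > tau]
    if fg.size == 0:
        return 0.0
    sigma = fg.std()   # confidence deviation
    mu = fg.mean()     # confidence mean
    return float(sigma / (mu + 1e-8))
```

A flat map yields cv = 0 (little smoothing needed), while a sharply peaked map yields a larger cv.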
Step S132: and taking the variation coefficient of each class classification response graph as a smoothing parameter of each class.
Step S133: and smoothing the pixels in the classification response map of each class based on the smoothing parameters of the class to obtain a smooth classification response map of the class.
In an embodiment of the application, the electronic device raises each pixel of the classification response map to a power derived from the coefficient of variation cv. Since the normalized response values lie in [0, 1], an exponent below 1 reduces the differences between foreground pixels and yields a smoother classification response map.

In the embodiment of the application, the coefficient of variation serves as the smoothing parameter for every pixel of the classification response map: the exponent applied to the map is the difference between 1 and the product of the coefficient of variation and a preset scale factor s, giving the smoothed classification response map:

Â_n = (A_n)^(1 - s·cv)

where Â_n is the smoothed classification response map of class n, cv is the coefficient of variation, and s is a preset scale factor.
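A minimal sketch of the smoothing step, assuming the exponent 1 - s·cv described above; the clamp keeping the exponent non-negative is an added safeguard, not from the patent:

```python
import numpy as np

def cvs_smooth(resp: np.ndarray, cv: float, s: float = 1.0) -> np.ndarray:
    """Raise each pixel of a normalized classification response map to
    the power (1 - s*cv); an exponent below 1 pulls foreground
    confidences closer together, expanding the activated region."""
    exponent = max(1.0 - s * cv, 0.0)  # clamp is an added safeguard
    return np.power(resp, exponent)
```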
Step S104: and acquiring the foreground mask of each category by using the smooth classification response graph of each category and the training image.
In the embodiment of the application, to optimize the process from class response map to pseudo mask, proportional pseudo-mask generation is further proposed.
As shown in fig. 2, after performing coefficient-of-variation smoothing on the class response map, the electronic device obtains a foreground matrix from the smoothed classification response maps and the training image. An important point in weakly supervised semantic segmentation is that class response maps are obtained from binary classifiers trained independently under binary cross-entropy loss, so the embodiment of the present application generates a class-specific background for each class's smoothed response map through a bg function and applies a CRF (conditional random field) to it. The electronic device then introduces the training image and computes a foreground binary mask with an fg function; this can be summarized as:

F_n = fg(CRF(I, bg(Â_n)))

where I is the training image, Â_n is the smoothed classification response map of class n, and F_n is the foreground binary mask of class n.

It should be noted that the foreground matrix in the embodiment of the present application is formed by combining the foreground binary masks of all class-specific backgrounds; in its representation, each column (or each row) of the foreground matrix holds the foreground binary mask of one class-specific background.
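The bg/fg/CRF pipeline depends on a CRF implementation; as a rough, hypothetical stand-in (a constant background threshold replaces the bg function, and the CRF refinement is omitted entirely), the shape of the computation is:

```python
import numpy as np

def foreground_binary_mask(smooth_resp: np.ndarray, bg_thresh: float = 0.3) -> np.ndarray:
    """Toy fg step for one class: a pixel is foreground where the
    smoothed response beats a constant class-specific background score.
    bg_thresh is an assumed stand-in for the patent's bg function."""
    return (smooth_resp > bg_thresh).astype(np.uint8)

def foreground_matrix(smooth_resps: np.ndarray) -> np.ndarray:
    """Stack the per-class binary masks into the foreground matrix F
    of shape (C, H, W)."""
    return np.stack([foreground_binary_mask(r) for r in smooth_resps])
```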
Step S105: the scaling matrix is obtained using the foreground mask for each class.
In the embodiment of the application, the electronic device obtains the scale matrix from the foreground matrix. Specifically, the electronic device obtains a class foreground score from the foreground binary mask of each class-specific background, obtains a pixel class score for each pixel of the training image, and divides each pixel class score by the sum of the class foreground scores to obtain the scale matrix. The class response map is produced by binary classifiers: each pixel receives a pixel class score from its classifier, and the class foreground score of a foreground binary mask is the sum of the pixel class scores of all pixels inside that mask.

As shown in fig. 2, the electronic device divides the class score of each pixel by the sum of the class foreground scores, where the class scores of the pixels are output by the binary classifiers during training and the class foreground scores are provided by the foreground matrix.

Therefore, the scale matrix in the embodiment of the present application is implemented as:

P(c, h, w) = s(c, h, w) / Σ_n S_n

where S_n is the class foreground score of the foreground binary mask of class n (the sum runs over all classes) and s(c, h, w) is the class score of the pixel at (c, h, w).
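Following the description above (each pixel class score divided by the sum of the class foreground scores), the scale matrix can be sketched as follows; the epsilon guarding an empty foreground is an added assumption:

```python
import numpy as np

def scale_matrix(scores: np.ndarray, fg_masks: np.ndarray) -> np.ndarray:
    """scores: per-pixel class scores of shape (C, H, W); fg_masks:
    per-class foreground binary masks of the same shape. S_n sums the
    scores inside class n's mask; every pixel score is divided by the
    sum of all S_n."""
    s_n = (scores * fg_masks).reshape(scores.shape[0], -1).sum(axis=1)
    total = s_n.sum() + 1e-8  # epsilon guards an empty foreground
    return scores / total
```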
Step S106: a pseudo-mask image is generated based on the foreground mask and the scaling matrix for each category.
In an embodiment of the application, the electronic device selects, with an argmax function, the channel with the maximum product of the foreground matrix and the scale matrix. Specifically, the elements of the foreground binary masks and the elements of the scale matrix are multiplied along the channel dimension, and the pseudo mask takes, at each position, the class with the largest product:

M(h, w) = argmax_c [ F(c, h, w) · P(c, h, w) ]

Through the processing of the above steps, a high-quality pseudo mask is generated for the training image and can serve as the input of subsequent semantic segmentation training.
In the embodiment of the application, the electronic equipment acquires a training image and a class response map thereof; dividing the class response graph to obtain a plurality of class classification response graphs; carrying out coefficient of variation smoothing on each class classification response graph to obtain a smooth classification response graph of each class; acquiring a foreground mask of each category by using the smooth classification response image of each category and the training image; acquiring a proportional matrix by using the foreground mask of each category; a pseudo-mask image is generated based on the foreground mask and the scaling matrix for each category. The image processing method provides the proportion pseudo mask with smooth coefficient of variation to generate the high-quality pseudo mask, and the generation quality of the pseudo mask is improved.
Referring to fig. 4, fig. 4 is a schematic flowchart illustrating an image processing method according to another embodiment of the present disclosure. As shown in fig. 4, the image processing method according to the embodiment of the present application may specifically include the following steps:
step S201: a pseudo mask image of a training image is obtained.
In this embodiment of the application, the method for obtaining the high-quality pseudo mask image in the training image may be implemented in the above embodiment of the image processing method, and details are not repeated here.
Step S202: and inputting the pseudo mask image into a preset segmentation model, and acquiring a loss mean value obtained by training the pseudo mask image.
In the embodiment of the application, the electronic device inputs the training image with the pseudo mask into the preset segmentation model for training the preset segmentation model. In the segmentation training process, in order to solve the noise problem, an incomplete fitting strategy is provided in an embodiment of the present application, which specifically refers to the following steps:
step S203: and under the condition that the loss average value is smaller than a preset loss threshold value, processing the loss average value by adopting a preset strategy.
In the embodiment of the present application, a pseudo-mask image used as a supervision signal for training semantic segmentation is noisy compared with manual annotation. Current research focuses on generating higher-quality pseudo masks to reduce this noise; few works attempt to suppress the noise during model training itself.

The embodiment of the present application provides a method for re-weighting the loss values of potentially noisy pixels during weakly supervised segmentation optimization, so as to reduce the influence of noisy pixels on the training of the preset segmentation model. Specifically, an incomplete fitting strategy is provided. First, the electronic device obtains the loss mean produced by training the preset segmentation model on the pseudo-mask image, where the loss mean of the pseudo-mask image is the average of the training loss values of all pixels in the pseudo-mask image.
Then, the electronic device judges, through a preset loss threshold β, whether the losses of the pseudo-mask image need to be adjusted in this training step, with the following logic:

L̃ = pus(L̄) if L̄ < β, and L̃ = L̄ otherwise

where pus() denotes the operation of the incomplete fitting strategy and L̄ is the loss mean of the pseudo-mask image.

When the loss mean of the pseudo-mask image is greater than or equal to the preset loss threshold β, the electronic device does not need to adjust the losses of the pseudo-mask image in this training step. When the loss mean is smaller than β, the electronic device adjusts the training losses of the pseudo-mask image through an incomplete fitting strategy such as clipping, exponential scaling, or ignoring.
Specifically, the three incomplete-fit strategies provided herein are as follows:
Figure 429873DEST_PATH_IMAGE023
Figure 341635DEST_PATH_IMAGE024
Figure 360407DEST_PATH_IMAGE025
wherein,
Figure 507354DEST_PATH_IMAGE026
the loss average is set to a preset threshold k, and
Figure 269774DEST_PATH_IMAGE027
the missing values of these pixels are discarded and,
Figure 451356DEST_PATH_IMAGE028
a scaling strategy is performed on the loss means by an exponential function.
In particular, f_clip(·) keeps the loss mean value unchanged when it is smaller than the preset threshold k, and sets it to the fixed value k when it is greater than or equal to k. f_exp(·) performs an exponential transformation on the loss mean value through k to realize the loss mean value scaling. f_ignore(·) keeps the loss mean value unchanged when it is smaller than the preset threshold k, and sets it to 0 when it is greater than or equal to k.
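The judgment logic and the three strategies above can be sketched in code. This is an illustrative sketch only, not the patented implementation: the function names are assumptions, and the exact form of the exponential scaling (here taken as raising the loss mean to the power k) is one plausible reading of "exponential transformation through k".

```python
# Illustrative sketch of the incomplete fitting strategy (pus).
# Assumptions: function names are hypothetical; the exponential
# scaling form (loss_mean ** k) is not fully specified by the text.

def pus(loss_mean, beta, strategy, k):
    """Re-weight the pseudo-mask loss mean only when it is below beta."""
    if loss_mean >= beta:
        return loss_mean           # no adjustment needed
    return strategy(loss_mean, k)  # apply an incomplete fitting strategy

def f_clip(loss_mean, k):
    # unchanged below k; capped at the fixed value k otherwise
    return min(loss_mean, k)

def f_exp(loss_mean, k):
    # exponential transformation of the loss mean through k (assumed form)
    return loss_mean ** k

def f_ignore(loss_mean, k):
    # unchanged below k; discarded (set to 0) otherwise
    return loss_mean if loss_mean < k else 0.0
```

For example, with beta = 0.5 and k = 0.3, a loss mean of 0.4 falls below beta, so it would be capped to 0.3 under f_clip and zeroed under f_ignore.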
Step S204: training the preset segmentation model by using the processed loss mean value.
In the embodiment of the application, the electronic device trains the preset segmentation model by using the processed loss mean value, thereby effectively mitigating the noise problem of the pseudo mask image.
Further, the embodiment of the application also addresses the problem of low pseudo mask accuracy by using the output of the preset segmentation model as a new pseudo mask image, i.e., a cyclic pseudo mask, so as to improve the quality of the training annotations. Specifically, the electronic device may obtain a pseudo mask image output by the preset segmentation model and take the output pseudo mask image as the input of the next training of the preset segmentation model; that is, the segmentation model is retrained with its own output as a new pseudo mask image, updating the training annotation quality and thus improving the training precision of the segmentation model.
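The cyclic pseudo mask scheme can be sketched as a simple training loop. The helper names `train_one_round` and `predict_masks` below are hypothetical placeholders for the actual training and inference routines of the preset segmentation model:

```python
# Illustrative sketch of the cyclic pseudo mask scheme: after each
# round, the segmentation model's predictions replace the pseudo mask
# annotations used in the next round.

def cyclic_pseudo_mask_training(model, images, pseudo_masks,
                                train_one_round, predict_masks, rounds=3):
    for _ in range(rounds):
        # train the segmentation model on the current pseudo mask images
        model = train_one_round(model, images, pseudo_masks)
        # the model's output becomes the pseudo mask annotation for the
        # next round, updating the training annotation quality
        pseudo_masks = predict_masks(model, images)
    return model, pseudo_masks
```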
The above embodiments are merely common implementations of the present application and do not limit its technical scope; any minor modification, equivalent change, or refinement made to the above content in accordance with the essence of the present application still falls within the technical scope of the present application.
With continued reference to fig. 5, fig. 5 is a schematic structural diagram of an embodiment of an electronic device provided in the present application. The electronic device 300 includes an obtaining module 31, a dividing module 32, a processing module 33, a calculating module 34, and a generating module 35.
The obtaining module 31 is configured to obtain a training image and a category response map thereof.
The dividing module 32 is configured to divide the category response map to obtain classification response maps of multiple categories.
The processing module 33 is configured to perform coefficient of variation smoothing on the classification response map of each category to obtain a smoothed classification response map of each category.
The calculating module 34 is configured to obtain a foreground mask of each category by using the smoothed classification response map of each category and the training image, and further obtain a scaling matrix by using the foreground mask of each category.
The generating module 35 is configured to generate a pseudo mask image based on the foreground mask of each category and the scaling matrix.
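Under stated assumptions, the pipeline implemented by these modules can be sketched as follows. This is a heavily simplified illustration, not the patented method itself: the category response map is assumed to be a C×H×W class activation map, the coefficient of variation is used directly as a per-class smoothing exponent (one possible choice), and plain thresholding stands in for the "preset algorithm" that extracts the foreground binary masks.

```python
import numpy as np

# Simplified sketch of pseudo mask generation: normalization, coefficient
# of variation smoothing, per-class foreground masks, a scaling matrix,
# and their combination along the class (channel) dimension.
# All names and simplifications are illustrative assumptions.

def generate_pseudo_mask(cam: np.ndarray, fg_thresh: float = 0.5) -> np.ndarray:
    # normalize each class response map to [0, 1]
    cam = cam / (cam.max(axis=(1, 2), keepdims=True) + 1e-8)

    # coefficient of variation per class: std / mean, used here as a
    # per-class smoothing exponent (one possible smoothing scheme)
    cv = cam.std(axis=(1, 2)) / (cam.mean(axis=(1, 2)) + 1e-8)
    smoothed = cam ** cv[:, None, None]

    # foreground binary mask per class (thresholding stands in for the
    # "preset algorithm" of the description)
    fg = (smoothed > fg_thresh).astype(np.float32)

    # scaling matrix: each pixel's class score divided by the sum of the
    # class foreground scores, so classes compete proportionally
    scores = smoothed * fg
    scale = scores / (scores.sum(axis=0, keepdims=True) + 1e-8)

    # combine foreground masks with the scaling matrix along the class
    # dimension; argmax yields the final pseudo mask labels
    weighted = fg * scale
    return weighted.argmax(axis=0)
```

For example, a 2-class activation map whose first class responds in the top half of the image and whose second class responds in the bottom half yields a pseudo mask labeled 0 on top and 1 on the bottom.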
With continued reference to fig. 6, fig. 6 is a schematic structural diagram of another embodiment of the electronic device provided in the present application. The electronic device 500 of the embodiment of the present application includes a processor 51, a memory 52, an input-output device 53, and a bus 54.
The processor 51, the memory 52 and the input/output device 53 are respectively connected to the bus 54, the memory 52 stores program data, and the processor 51 is used for executing the program data to implement the image processing method according to the above embodiment.
In the embodiment of the present application, the processor 51 may also be referred to as a CPU (Central Processing Unit). The processor 51 may be an integrated circuit chip having signal processing capabilities. The processor 51 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general purpose processor may be a microprocessor, or the processor 51 may be any conventional processor or the like.
Please refer to fig. 7, fig. 7 is a schematic structural diagram of an embodiment of a computer storage medium provided in the present application, the computer storage medium 600 stores program data 61, and the program data 61 is used to implement the image processing method according to the above embodiment when being executed by a processor.
Embodiments of the present application may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part thereof contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only of embodiments of the present application and is not intended to limit its patent scope; any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application thereof in other related technical fields, is likewise included in the patent protection scope of the present application.

Claims (13)

1. An image processing method, characterized in that the image processing method comprises:
acquiring a training image and a category response map thereof;
dividing the category response map to obtain classification response maps of a plurality of categories;
performing coefficient of variation smoothing on the classification response map of each category to obtain a smoothed classification response map of each category;
obtaining a foreground mask of each category by using the smoothed classification response map of each category and the training image;
obtaining a scaling matrix by using the foreground mask of each category;
and generating a pseudo mask image based on the foreground mask of each category and the scaling matrix.
2. The image processing method according to claim 1,
wherein the step of performing coefficient of variation smoothing on the classification response map of each category to obtain a smoothed classification response map comprises:
obtaining the coefficient of variation of the classification response map of each category;
taking the coefficient of variation of the classification response map of each category as a smoothing parameter of the category;
and smoothing the pixels in the classification response map of each category based on the smoothing parameter of the category to obtain a smoothed classification response map of the category.
3. The image processing method according to claim 2,
wherein the obtaining of the coefficient of variation of the classification response map of each category comprises:
obtaining the confidence distribution of the classification response map of each category;
obtaining a confidence deviation and a confidence average value of the classification response map of each category based on the confidence distribution and a preset threshold;
and obtaining the coefficient of variation by using the confidence deviation and the confidence average value.
4. The image processing method according to any one of claims 1 to 3,
wherein the obtaining of the foreground mask of each category by using the smoothed classification response map of each category and the training image comprises:
obtaining a class-specific background in the smoothed classification response map of each category;
and obtaining a foreground binary mask in the class-specific background of each category by using a preset algorithm and the training image, and combining the foreground binary masks of all categories to form a foreground matrix.
5. The image processing method according to claim 4,
wherein the obtaining of the scaling matrix by using the foreground mask of each category comprises:
obtaining a category foreground score by using the foreground binary mask in the class-specific background of each category;
obtaining a pixel category score of each pixel in the training image;
obtaining the sum of the category foreground scores of all categories;
and obtaining the scaling matrix based on the pixel category score and the sum of the category foreground scores.
6. The image processing method according to claim 5,
wherein the generating of the pseudo mask image based on the foreground mask of each category and the scaling matrix comprises:
multiplying the elements of the foreground matrix by the elements of the scaling matrix along the channel dimension of the training image to generate the pseudo mask image.
7. The image processing method according to claim 1,
wherein the image processing method further comprises:
normalizing the category response map;
and the step of dividing the category response map to obtain classification response maps of a plurality of categories comprises:
dividing the normalized category response map to obtain classification response maps of a plurality of categories.
8. The image processing method according to claim 1, characterized in that the image processing method further comprises:
inputting the pseudo mask image into a preset segmentation model, and obtaining a loss mean value from training on the pseudo mask image;
processing the loss mean value by adopting a preset strategy when the loss mean value is smaller than a preset loss threshold;
and training the preset segmentation model by using the processed loss mean value.
9. The image processing method according to claim 8,
wherein the processing of the loss mean value by adopting the preset strategy comprises:
setting the loss mean value to a preset threshold when the loss mean value is greater than or equal to the preset threshold;
or, scaling the loss mean value by using the preset threshold;
or, setting the loss mean value to 0 when the loss mean value is greater than or equal to the preset threshold.
10. The image processing method according to claim 8,
wherein the image processing method further comprises:
obtaining a pseudo mask image output by the preset segmentation model;
and taking the output pseudo mask image as the input of the next training of the preset segmentation model.
11. An electronic device, characterized in that the electronic device comprises:
an obtaining module, configured to obtain a training image and a category response map thereof;
a dividing module, configured to divide the category response map to obtain classification response maps of a plurality of categories;
a processing module, configured to perform coefficient of variation smoothing on the classification response map of each category to obtain a smoothed classification response map of each category;
a calculating module, configured to obtain a foreground mask of each category by using the smoothed classification response map of each category and the training image, and further obtain a scaling matrix by using the foreground mask of each category;
and a generating module, configured to generate a pseudo mask image based on the foreground mask of each category and the scaling matrix.
12. An electronic device, comprising a memory and a processor coupled to the memory;
wherein the memory is used for storing program data, and the processor is used for executing the program data to realize the image processing method according to any one of claims 1-10.
13. A computer storage medium for storing program data for implementing an image processing method according to any one of claims 1 to 10 when executed by a computer.
CN202110996642.1A 2021-08-27 2021-08-27 Image processing method, electronic device, and storage medium Active CN113449820B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110996642.1A CN113449820B (en) 2021-08-27 2021-08-27 Image processing method, electronic device, and storage medium


Publications (2)

Publication Number Publication Date
CN113449820A true CN113449820A (en) 2021-09-28
CN113449820B CN113449820B (en) 2022-01-18

Family

ID=77818867


Country Status (1)

Country Link
CN (1) CN113449820B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008962A (en) * 2019-04-11 2019-07-12 福州大学 Weakly supervised semantic segmentation method based on attention mechanism
CN110870770A (en) * 2019-11-21 2020-03-10 大连理工大学 ICA-CNN classified fMRI space activation map smoothing and broadening method
CN111462163A (en) * 2020-01-03 2020-07-28 华中科技大学 Weakly supervised semantic segmentation method and application thereof
CN111915618A (en) * 2020-06-02 2020-11-10 华南理工大学 Example segmentation algorithm and computing device based on peak response enhancement
CN113096138A (en) * 2021-04-13 2021-07-09 西安电子科技大学 Weak supervision semantic image segmentation method for selective pixel affinity learning
US20210241034A1 (en) * 2020-01-31 2021-08-05 Element Al Inc. Method of and system for generating training images for instance segmentation machine learning algorithm


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Jun Wei et al., "Shallow Feature Matters for Weakly Supervised Object Localization", arXiv:2108.00873v1 [cs.CV] *
Yang Yunxin, "Research on Image Semantic Segmentation Methods Based on Weakly Supervised Learning", China Master's Theses Full-text Database (Information Science and Technology) *

Also Published As

Publication number Publication date
CN113449820B (en) 2022-01-18

Similar Documents

Publication Publication Date Title
CN109829448B (en) Face recognition method, face recognition device and storage medium
US11669711B2 (en) System reinforcement learning method and apparatus, and computer storage medium
CN111160407B (en) Deep learning target detection method and system
CN109726195B (en) Data enhancement method and device
CN112257738A (en) Training method and device of machine learning model and classification method and device of image
CN111178261B (en) Face detection acceleration method based on video coding technology
CN111695462A (en) Face recognition method, face recognition device, storage medium and server
EP4270247A1 (en) Neural network model training method and apparatus, and data processing method and apparatus
CN112446379A (en) Self-adaptive intelligent processing method for dynamic large scene
CN110909665A (en) Multitask image processing method and device, electronic equipment and storage medium
CN110880018B (en) Convolutional neural network target classification method
CN116525517B (en) Positioning control method and system for conveying semiconductor wafers
CN114998595A (en) Weak supervision semantic segmentation method, semantic segmentation method and readable storage medium
CN113449820B (en) Image processing method, electronic device, and storage medium
CN115795355B (en) Classification model training method, device and equipment
CN116129496A (en) Image shielding method and device, computer equipment and storage medium
CN114172708A (en) Method for identifying network flow abnormity
CN114519675A (en) Image processing method and device, electronic equipment and readable storage medium
Kumar et al. Age Classification Based On Integrated Approach
CN112836819B (en) Neural network model generation method and device
CN111898421B (en) Regularization method for video behavior recognition
CN111832460B (en) Face image extraction method and system based on multi-feature fusion
WO2023134068A1 (en) Digit recognition model training method and apparatus, device, and storage medium
US20240161245A1 (en) Image optimization
Jia et al. Image salient object detection based on perceptually homogeneous patch

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant