CN116071628B - Image processing method, device, electronic equipment and storage medium


Info

Publication number
CN116071628B
Authority
CN
China
Prior art keywords
feature
processing
map
feature map
image
Prior art date
Legal status
Active
Application number
CN202310114092.5A
Other languages
Chinese (zh)
Other versions
CN116071628A (en)
Inventor
刘军伟
杨叶辉
曹星星
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310114092.5A
Publication of CN116071628A
Application granted
Publication of CN116071628B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks


Abstract

The present disclosure provides an image processing method, and relates to the field of artificial intelligence, in particular to the technical fields of deep learning, computer vision and image processing. The specific implementation scheme is as follows: extracting features of an image to be processed to obtain an initial feature map, wherein the image to be processed contains a target object; respectively performing suppression processing of the most discriminative features and suppression processing of irrelevant features on the initial feature map to obtain a first feature processing map and a second feature processing map; fusing the first feature processing map and the second feature processing map to obtain a fused feature map; performing feature selection on the fused feature map to obtain an output feature map; and determining the category and the position of the target object in the image to be processed according to the output feature map. The present disclosure also provides an image processing apparatus, an electronic device, and a storage medium.

Description

Image processing method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly to the field of deep learning, computer vision, and image processing techniques. More specifically, the present disclosure provides an image processing method, apparatus, electronic device, and storage medium.
Background
In the field of computer vision, tasks such as target classification, target detection, and image segmentation are inseparable from the localization of a target object in an image. Target objects in images are generally localized using deep-learning-based target localization methods.
Disclosure of Invention
The present disclosure provides an image processing method, apparatus, device, and storage medium.
According to a first aspect, there is provided an image processing method comprising: extracting features of an image to be processed to obtain an initial feature map, wherein the image to be processed contains a target object; respectively performing suppression processing of the most discriminative features and suppression processing of irrelevant features on the initial feature map to obtain a first feature processing map and a second feature processing map; fusing the first feature processing map and the second feature processing map to obtain a fused feature map; performing feature selection on the fused feature map to obtain an output feature map; and determining the category and the position of the target object in the image to be processed according to the output feature map.
According to a second aspect, there is provided an image processing apparatus comprising: an extraction module configured to extract features of an image to be processed to obtain an initial feature map, wherein the image to be processed contains a target object; a processing module configured to respectively perform suppression processing of the most discriminative features and suppression processing of irrelevant features on the initial feature map to obtain a first feature processing map and a second feature processing map; a fusion module configured to fuse the first feature processing map and the second feature processing map to obtain a fused feature map; a selection module configured to perform feature selection on the fused feature map to obtain an output feature map; and a determining module configured to determine the category and the position of the target object in the image to be processed according to the output feature map.
According to a third aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method provided in accordance with the present disclosure.
According to a fourth aspect, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a method provided according to the present disclosure.
According to a fifth aspect, there is provided a computer program product comprising a computer program stored on at least one of a readable storage medium and an electronic device, the computer program, when executed by a processor, implementing a method provided according to the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of an exemplary system architecture to which image processing methods and apparatus may be applied, according to one embodiment of the present disclosure;
FIG. 2 is a flow chart of an image processing method according to one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an image processing method according to one embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a nonlinear curve of different curvatures under a first branch according to one embodiment of the disclosure;
fig. 5 is a block diagram of an image processing apparatus according to one embodiment of the present disclosure;
fig. 6 is a block diagram of an electronic device of an image processing method according to one embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Deep-learning-based target localization methods include strongly supervised target localization methods and weakly supervised target localization methods.
The strongly supervised target localization method requires pixel-level labeling or bounding-box-level labeling of the target object in the image during the preparation phase. Pixel-level labeling includes a pixel-level mask, e.g., setting pixels in the image that belong to the target object to 1 and pixels that do not belong to the target object to 0. Bounding-box-level labeling marks the boundary of the region in the image where the target object is located, for example using a bounding box.
The weakly supervised target localization method requires only image-level annotations during the preparation phase, for example annotating the class of the image. Unlike fine-grained labeling at the pixel level or bounding-box level, image-level labeling only requires annotators to indicate whether an object is contained in the image, without labeling the specific location or contour of the target object.
Compared with a strongly supervised target positioning method, the weakly supervised target positioning method uses image-level category labeling information to replace accurate target position labeling information for supervision training in the network training process, so that the model can infer the position information of a target object in an image.
Pixel-level labeling and bounding-box-level labeling consume a great deal of manual labeling effort. Labeling images in some professional fields is especially time-consuming. For example, lesion locations in medical images are relatively scattered, medical image labeling is highly specialized and must be performed by professional medical personnel, and lesion-level labeling (such as pixel-level masks or bounding boxes) is therefore difficult and costly. Image-level labeling, by contrast, is much faster and cheaper, and more labeled samples can be obtained relatively quickly.
Therefore, weakly supervised target localization methods, which learn and model image-level annotation data so that a model can infer the position of a target object in an image, have attracted more and more attention.
Currently, weakly supervised target localization methods mainly include methods based on the class activation map (CAM, Class Activation Map). The CAM method locates the target object using a weighted sum of the last convolutional feature map (e.g., the output of the last convolutional layer of the convolutional network) of a deep-learning-based image classification model. The weights come from the weight parameters associated with the class of the object in the last fully connected layer of the image classification model.
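For reference, the weighted sum used by the standard CAM formulation can be written as follows; this is a sketch of the commonly published CAM equation, with f_k, w_k^y and M_y introduced here for illustration rather than taken from the patent text:

```latex
% Standard CAM: class activation map for the predicted class y as a weighted sum
% over the K channels of the last convolutional feature map.
% f_k(i,j): activation of channel k at spatial position (i,j);
% w_k^y: weight in the last fully connected layer linking channel k to class y.
M_y(i,j) = \sum_{k=1}^{K} w_k^{y} \, f_k(i,j)
```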
However, CAM methods tend to focus only on the most discriminative regions in the image, so that weakly supervised localization captures only a partial region of the target object and the localization is incomplete. For example, only the cat's head region is obtained while the region of the cat's body is missed. In addition, the target object often co-occurs with some irrelevant objects, which may be mistakenly localized as the target object during weakly supervised localization. For example, a "person" that often appears together with the target "dog" is also localized as the target.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the user's personal information comply with relevant laws and regulations, and do not violate public order and good customs.
In the technical scheme of the disclosure, the authorization or consent of the user is obtained before the personal information of the user is obtained or acquired.
FIG. 1 is a schematic diagram of an exemplary system architecture to which image processing methods and apparatus may be applied, according to one embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios.
As shown in fig. 1, a system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links, and the like.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. The terminal devices 101, 102, 103 may be a variety of electronic devices including, but not limited to, smartphones, tablets, laptop portable computers, and the like.
The image processing method provided by the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the image processing apparatus provided by the embodiments of the present disclosure may be generally provided in the server 105. The image processing method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the image processing apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
Fig. 2 is a flowchart of an image processing method according to one embodiment of the present disclosure.
As shown in fig. 2, the image processing method 200 includes operations S210 to S250.
In operation S210, features of an image to be processed are extracted, and an initial feature map is obtained.
The image to be processed contains a target object. The target object may be, for example, a cat or a dog, or may be a lesion or the like.
For example, an image classification model based on a deep learning model is used to extract the features of the image to be processed and obtain the initial feature map. The image classification model may include a convolutional network, which may include a plurality of convolutional layers (e.g., 16 layers). The initial feature map may be the feature map output by any one of the convolutional layers of the convolutional network. Illustratively, the initial feature map may be the output of layer 7, layer 10, or layer 13 of the convolutional network, or the like.
The feature values (which may also be referred to as activation values) of the features in the initial feature map represent the relevance of the feature to the class of the target object. The larger the activation value of a feature, the more important the feature is to the classification result, and the more discriminative the feature is. The smaller the activation value of a feature, the less important (or irrelevant) the feature to the classification result, the less discriminative the feature.
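As an illustration of taking the initial feature map from an intermediate convolutional layer, the following PyTorch sketch registers a forward hook on a VGG16 backbone; the backbone choice and the layer index (features[16]) are assumptions for illustration, not layers specified by the patent:

```python
import torch
import torchvision

# Minimal sketch: capture the output of an intermediate convolutional layer of a
# VGG16 backbone and treat it as the "initial feature map" F.
backbone = torchvision.models.vgg16(weights=None)
captured = {}

def hook(module, inputs, output):
    # output: the feature map produced by this layer, shape (N, C, H, W)
    captured["initial_feature_map"] = output

backbone.features[16].register_forward_hook(hook)

image = torch.randn(1, 3, 224, 224)       # stand-in for an image to be processed
_ = backbone(image)
F = captured["initial_feature_map"]        # initial feature map, one sample: C x H x W
print(F.shape)
```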
In operation S220, suppression processing of the most discriminative features and suppression processing of irrelevant features are respectively performed on the initial feature map to obtain a first feature processing map and a second feature processing map.
For example, the initial feature map may be processed in two branches. The first branch is used to suppress the most discriminative features in the initial feature map; for example, the activation values of regions in the initial feature map whose activation values are greater than a first threshold are forcibly suppressed to 0, obtaining the first feature processing map. The processing of the first branch enables the model to pay attention to features other than the most discriminative features, which avoids the problem of incomplete target localization caused by focusing only on the most discriminative features.
The second branch is used to suppress the least discriminative (or irrelevant) features in the initial feature map; for example, the activation values of regions in the initial feature map whose activation values are smaller than a second threshold (such as background regions) are forcibly suppressed to 0, obtaining the second feature processing map. The processing of the second branch enables the model to fully attend to the informative features other than the irrelevant features, removes irrelevant interference, improves the model's attention to all category-relevant information, and avoids mistakenly localizing irrelevant objects as the target object.
The first threshold may be determined from the maximum activation value in the initial feature map, for example 80% of the maximum activation value being taken as the first threshold. The second threshold value may also be determined from the maximum activation value in the initial feature map, for example 40% of the maximum activation value being taken as the second threshold value.
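To make the two-branch suppression concrete, the following PyTorch sketch implements the hard-thresholding behavior described above, using the example ratios of 80% and 40% of the per-channel maximum; note that the patent's first branch additionally applies a nonlinear enhancement to the retained values (formula (1), discussed later), which is omitted here, and torch.where/torch.amax are generic tensor operations chosen for illustration:

```python
import torch

def suppress_branches(F: torch.Tensor, cr: float = 0.8, t: float = 0.4):
    """Hard-threshold sketch of the two branches.

    F: initial feature map of shape (C, H, W).
    cr: ratio defining the first threshold (most-discriminative suppression).
    t:  ratio defining the second threshold (irrelevant-feature suppression).
    """
    # Per-channel maximum activation, shape (C, 1, 1) for broadcasting.
    max_per_channel = torch.amax(F, dim=(1, 2), keepdim=True)

    # First branch D: zero out the most discriminative regions (>= cr * max).
    first_threshold = cr * max_per_channel
    D = torch.where(F < first_threshold, F, torch.zeros_like(F))

    # Second branch S: zero out irrelevant regions (< t * max).
    second_threshold = t * max_per_channel
    S = torch.where(F >= second_threshold, F, torch.zeros_like(F))
    return D, S
```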
In operation S230, the first feature processing diagram and the second feature processing diagram are fused to obtain a fused feature diagram.
Since the most discriminative regions are suppressed in the first feature processing map and the regions irrelevant to the target object are suppressed in the second feature processing map, both the first feature processing map and the second feature processing map have missing information.
In order to solve the problem of missing information, the first feature processing map and the second feature processing map can be fused to obtain a fused feature map. For example, let the initial feature map be F ∈ R^{C×H×W}, where H and W are the height and width of the initial feature map, respectively, and C is the number of channels (the channel dimension). Then the first feature processing map may be D ∈ R^{C×H×W} and the second feature processing map may be S ∈ R^{C×H×W}. Stacking the first feature processing map and the second feature processing map along the channel dimension yields a fused feature map M ∈ R^{2C×H×W}.
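Continuing the sketch above, the channel-wise stacking of D and S into the fused feature map M can be illustrated with torch.cat along the channel dimension, which is consistent with M ∈ R^{2C×H×W}; the function name is an assumption for illustration:

```python
import torch

def fuse(D: torch.Tensor, S: torch.Tensor) -> torch.Tensor:
    # D, S: the two feature processing maps, each of shape (C, H, W).
    # Stacking along the channel dimension yields M with 2C channels.
    return torch.cat([D, S], dim=0)   # shape (2C, H, W)
```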
In operation S240, feature selection is performed on the fusion feature map, and an output feature map is obtained.
The information in the fused feature map M is relatively redundant for the image classification model. Thus, feature selection can be performed on the fusion feature map M to reduce redundancy.
For example, the 2C channels of the fused feature map M may be weighted, and the weight of the features in each channel may be obtained by learning, so that informative features are preferentially selected from the fused feature map M, yielding a weighted feature map. The 2C channel dimensions of the weighted feature map are then converted into C channel dimensions to obtain the output feature map, so that the number of channels of the output feature map is consistent with that of the initial feature map. In this way, the two-branch processing of the initial feature map and the processing of the fused feature map into the output feature map (operations S220 to S240) can be used as a component inserted into the image classification model; the component is compatible with various classification networks and is plug-and-play.
In operation S250, a category and a position of a target object in the image to be processed are determined according to the output feature map.
For example, the output feature map may be input into the subsequent network of the image classification model and finally into the fully connected network of the image classification model, which may output the target category of the target object. For example, the fully connected network classifies the image as one of category 1, category 2 and category 3; if the fully connected network classifies the current image as category 1, then category 1 is the target category of the target object.
The fully-connected network may also output weights associated with the various categories, including weights associated with category 1, weights associated with category 2, and weights associated with category 3. If category 1 is the target category of the target object, then the weight associated with category 1 may be used as the weight of the feature in the output feature map.
And then, weighting the output feature map according to the weights of the features in the output feature map to obtain a category activation map. And positioning the target object according to the category activation graph. For example, an area with a larger activation value in the category activation graph may be determined as an area where the target object is located, so as to obtain a positioning frame of the target object. The activation values in the class activation map can also be subjected to binarization processing to obtain a pixel-level mask map.
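The step of weighting the output feature map by the class-related weights can be illustrated as follows; fc_weight, its shape, and the function name are assumptions for illustration, not the patent's exact network definition:

```python
import torch

def class_activation_map(F_out: torch.Tensor, fc_weight: torch.Tensor, target_class: int) -> torch.Tensor:
    """F_out: output feature map, shape (C, H, W).
    fc_weight: weight matrix of the final fully connected layer, shape (num_classes, C).
    Returns the class activation map for target_class, shape (H, W)."""
    w = fc_weight[target_class]                   # per-channel weights for the target class
    cam = (w[:, None, None] * F_out).sum(dim=0)   # weighted sum over channels
    return cam
```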
According to the embodiments of the present disclosure, suppression of the most discriminative features and suppression of irrelevant features are respectively performed on the initial feature map of the image to be processed, the feature processing maps produced by the two branches are fused, feature selection is performed on the fused feature map, and weakly supervised localization is performed according to the output feature map obtained after feature selection. This avoids the problem of incomplete target localization caused by focusing only on the most discriminative features in the related art, and also avoids mislocalization caused by irrelevant features, so that the image classification model makes full use of all category-relevant information in the image to be processed and the localization accuracy of the target object is improved.
The image processing method provided by the present disclosure is described in detail below with reference to fig. 3.
Fig. 3 is a schematic diagram of an image processing method according to one embodiment of the present disclosure.
As shown in FIG. 3, the initial feature map F ∈ R^{C×H×W} is, for example, output by a certain convolutional layer (e.g., layer 10) of the convolutional network of the image classification model. Feature processing of two branches is respectively performed on the initial feature map F to obtain a first feature processing map D ∈ R^{C×H×W} and a second feature processing map S ∈ R^{C×H×W}.
According to the embodiment of the disclosure, the first branch is used for carrying out nonlinear enhancement processing on the features with the feature values smaller than the first threshold value in the initial feature map, and removing the features with the feature values not smaller than the first threshold value in the initial feature map to obtain a first feature processing map.
For example, the initial feature map F may be processed as the first feature processing map D by the following formula (1).
In formula (1), F(c,i,j) denotes the feature value at row i, column j of the c-th channel in the initial feature map, D(c,i,j) denotes the feature value at row i, column j of the c-th channel in the first feature processing map, max(F(c)) denotes the maximum feature value in the c-th channel, cr is a hyperparameter with 0 ≤ cr ≤ 1, and α is a parameter controlling the curvature of the nonlinear curve.
When a feature value F(c,i,j) in the initial feature map satisfies F(c,i,j) < cr×max(F(c)), it is mapped nonlinearly to a value between 0 and max(F(c)) according to formula (1), and the mapping increases the feature value; that is, whenever F(c,i,j) < cr×max(F(c)), D(c,i,j) > F(c,i,j) always holds.
When a feature value F(c,i,j) in the initial feature map satisfies cr×max(F(c)) ≤ F(c,i,j) ≤ max(F(c)), it is forcibly suppressed to 0 according to formula (1), so that the network focuses on regions other than the most activated regions, i.e., the regions where F(c,i,j) < cr×max(F(c)).
Wherein the parameter α controlling the curvature of the nonlinear curve can be expressed by the following formula (2).
By setting the parameter α according to formula (2), the intersection point of the nonlinear mapping curve of formula (1) and the linear mapping curve (the linear mapping refers to the relationship D(c,i,j) = F(c,i,j)) can be made to occur at F(c,i,j) = cr×max(F(c)) for different values of cr. Thus, the feature values before the intersection point (i.e., the feature values of the region where F(c,i,j) < cr×max(F(c)) in the initial feature map) increase nonlinearly, in order to attract more attention from the network.
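Collecting the constraints stated above in one place, the first-branch mapping has the piecewise structure below; this is a sketch only, and the concrete nonlinear enhancement function, written abstractly here as φ_α, is defined by the patent's formula (1), which is not reproduced in this text:

```latex
% Piecewise structure of the first branch (sketch; phi_alpha stands for the
% nonlinear enhancement of formula (1), with curvature controlled by alpha).
D(c,i,j) =
\begin{cases}
\varphi_{\alpha}\!\left(F(c,i,j)\right), & F(c,i,j) < cr \cdot \max(F(c)) \\[4pt]
0, & cr \cdot \max(F(c)) \le F(c,i,j) \le \max(F(c))
\end{cases}
```

The stated properties are φ_α(x) > x for x < cr·max(F(c)), and φ_α(cr·max(F(c))) = cr·max(F(c)), i.e., the nonlinear curve meets the linear mapping D = F at the first threshold.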
According to an embodiment of the disclosure, the second branch is configured to remove features in the initial feature map, where the feature value is smaller than the second threshold, to obtain a second feature processing map. The second branch aims to suppress as much as possible the regions of the initial feature map F which are not related to the target object (regions with lower activation values), and to preserve all the regions with information content (regions with higher activation values).
For example, the initial feature map F may be processed into the second feature processing map S by the following formula (3).
In formula (3), F(c,i,j) denotes the feature value at row i, column j of the c-th channel in the initial feature map, S(c,i,j) denotes the feature value at row i, column j of the c-th channel in the second feature processing map, max(F(c)) denotes the maximum feature value in the c-th channel, t×max(F(c)) is the second threshold, and t is a hyperparameter controlling the activation-value retention ratio, with 0 < t < 1.
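Formula (3) itself is not reproduced above; based on the definitions in the surrounding text, it is consistent with the following form (a reconstruction rather than a verbatim copy of the patent's formula; the handling of the boundary case F(c,i,j) = t·max(F(c)) is an assumption):

```latex
% Second branch (sketch): keep activations at or above t * max(F(c)), suppress the rest.
S(c,i,j) =
\begin{cases}
F(c,i,j), & F(c,i,j) \ge t \cdot \max(F(c)) \\[4pt]
0, & F(c,i,j) < t \cdot \max(F(c))
\end{cases}
```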
The two branches above effectively decompose the information in the initial feature map F, yielding the first feature processing map D and the second feature processing map S. The most discriminative regions are suppressed in the first feature processing map D, and the background information with smaller activation values is suppressed in the second feature processing map S. The first feature processing map D and the second feature processing map S have different emphases, but each has missing information.
In order to solve the problem of missing information, the first feature processing map D and the second feature processing map S can be fused along the channel dimension to obtain a fused feature map M ∈ R^{2C×H×W}. However, for the image classification model the information in the fused feature map M is redundant, so feature selection needs to be performed on the fused feature map M to reduce redundancy.
According to an embodiment of the present disclosure, feature selection of the fused feature map M includes: determining a first weight for the features in the fused feature map M, wherein the first weight represents the importance of a feature to the classification result of the target object; processing the fused feature map M into a weighted feature map M_W according to the first weight; and then processing the channel dimension (number of channels) of the weighted feature map M_W to be consistent with the channel dimension of the initial feature map F, to obtain an output feature map F*.
The role of the first weights is that informative features can be preferentially selected from the fused feature map M based on the first weights. Determining the first weights includes: compressing the fused feature map M to obtain a first fused feature vector W_1; performing information interaction on the features in the first fused feature vector W_1 to obtain a second fused feature vector W_2; processing the dimension of the second fused feature vector W_2 to be consistent with the dimension of the first fused feature vector W_1 to obtain a third fused feature vector W_3; and normalizing the third fused feature vector W_3 to obtain a weight vector W*, wherein the elements in the weight vector W* represent the first weights, and the first weights represented by the elements of the weight vector correspond to the features in the fused feature map.
Referring to FIG. 3, a global max pooling operation with a pooling kernel of size H×W is performed on the fused feature map M, so that the fused feature map M is compressed into a first fused feature vector W_1 ∈ R^{2C}.
The first fused feature vector W_1 is passed through a fully connected layer to realize information interaction among the channels of the feature vector, obtaining a second fused feature vector W_2 ∈ R^C. The second fused feature vector W_2 is then passed through another fully connected layer so that its number of channels (dimension) is restored to be consistent with that of the first fused feature vector W_1, obtaining a third fused feature vector W_3 ∈ R^{2C}.
Next, the third fused feature vector W_3 is normalized by a Sigmoid function, so that the feature values of W_3 are mapped into the range (0, 1), obtaining a weight vector W* ∈ R^{2C}. The elements of W* represent the first weights, and the 2C first weights in W* correspond one-to-one to the features of the 2C channels of the fused feature map M.
The fused feature map M ∈ R^{2C×H×W} is then weighted with the weight vector W* ∈ R^{2C} to obtain a weighted feature map M_W ∈ R^{2C×H×W}. For example, the feature of the i-th channel of the weighted feature map M_W is obtained by M_W(i) = W*(i) × M(i), where W*(i) is the i-th element of the weight vector W* and M(i) is the feature of the i-th channel of the fused feature map M.
Next, a convolution kernel of size 1×1 is used to reduce the channels of the weighted feature map M_W ∈ R^{2C×H×W}, obtaining an output feature map F* ∈ R^{C×H×W}, so that the number of channels of the output feature map F* is consistent with the number of channels of the initial feature map F. Therefore, the processing from the initial feature map F to the output feature map F* (i.e., the content in the dashed box in FIG. 3) can be used as a component inserted into the image classification model; the component is plug-and-play, compatible with various other network structures, reduces the number of parameters, and improves computation speed.
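The feature selection stage described in the preceding paragraphs (global max pooling, two fully connected layers, Sigmoid normalization, channel-wise weighting, and a 1×1 convolution back to C channels) could be sketched as the small PyTorch module below; the module name and the use of nn.AdaptiveMaxPool2d, nn.Linear and nn.Conv2d are implementation assumptions rather than the patent's reference code:

```python
import torch
from torch import nn

class FeatureSelection(nn.Module):
    """Sketch of the fused-feature selection step: 2C channels -> weights -> C channels."""

    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AdaptiveMaxPool2d(1)           # global max pooling over H x W
        self.fc1 = nn.Linear(2 * channels, channels)  # W_1 (2C) -> W_2 (C): channel interaction
        self.fc2 = nn.Linear(channels, 2 * channels)  # W_2 (C) -> W_3 (2C): restore dimension
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)  # 1x1 conv back to C

    def forward(self, M: torch.Tensor) -> torch.Tensor:
        # M: fused feature map, shape (N, 2C, H, W)
        w1 = self.pool(M).flatten(1)          # (N, 2C)  first fused feature vector
        w3 = self.fc2(self.fc1(w1))           # (N, 2C)  third fused feature vector
        w_star = torch.sigmoid(w3)            # (N, 2C)  weight vector in (0, 1)
        M_w = M * w_star[:, :, None, None]    # channel-wise weighting
        return self.reduce(M_w)               # output feature map, shape (N, C, H, W)
```

For example, with C = 512 (a typical channel count for a late VGG16 feature map), FeatureSelection(512) maps a fused feature map of shape (N, 1024, H, W) to an output feature map of shape (N, 512, H, W).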
Weakly supervised target localization can be performed based on the CAM method using the output feature map. Performing weakly supervised target localization according to the output feature map includes the following steps: inputting the output feature map into the image classification model to obtain the target category of the target object and second weights related to the target category; processing the output feature map into a class activation map according to the second weights; and determining the position of the target object in the image to be processed according to the class activation map.
For example, the output feature map is input into the network structure of the image classification model that follows, which may include the remaining convolutional layers of the convolutional network as well as the fully connected network. The fully connected network can output the target category of the target object and the second weights related to the target category; the output feature map is weighted according to the second weights to obtain the class activation map, and a bounding box or a pixel-level mask can be further obtained from the class activation map.
According to the embodiment, the fusion feature map is compressed, the weights of the features of all channels of the fusion feature map are automatically learned by training of the image classification model, so that the image classification model can adaptively learn the features which are most effective for classification results, the output of two branches is fully utilized, the classification accuracy of a target object is improved, and the accuracy of the subsequent weak supervision target positioning by using the output feature map is further improved.
According to an embodiment of the present disclosure, determining a position of a target object in an image to be processed according to a category activation map includes: performing binarization processing on the characteristic values in the class activation graph to obtain a characteristic mask graph; and determining the position information of the target object according to the characteristic mask map.
The feature values in the class activation map may be binarized according to a third threshold value for controlling feature binarization in the class activation map. For example, a region in the class activation map, where the activation value is greater than or equal to the third threshold, is set to 1, and the remaining regions are set to 0, so that the class activation map is converted into a binarized feature mask map.
Then, connected regions with values greater than 0 can be found in the mask map, and a connected region is determined as the region where the target object is located. The outer bounding rectangle of the connected region may also be determined as the target bounding box.
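The binarization and connected-region steps can be sketched as follows; scipy.ndimage is one possible implementation of connected-component analysis, the threshold ratio tau is illustrative, and keeping the largest connected region is one possible choice rather than something specified above:

```python
import numpy as np
from scipy import ndimage

def locate_from_cam(cam: np.ndarray, tau: float = 0.2):
    """cam: class activation map, shape (H, W).
    tau: binarization ratio relative to the maximum activation (illustrative value).
    Returns the binary feature mask and the bounding box (x0, y0, x1, y1)
    of the largest connected region, or None if the mask is empty."""
    mask = (cam >= tau * cam.max()).astype(np.uint8)   # feature mask map
    labels, num = ndimage.label(mask)                  # connected regions > 0
    if num == 0:
        return mask, None
    # Keep the largest connected region as the target region.
    sizes = ndimage.sum(mask, labels, index=range(1, num + 1))
    largest = int(np.argmax(sizes)) + 1
    ys, xs = np.where(labels == largest)
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))  # outer bounding rectangle
    return mask, box
```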
For the image processing method provided by the disclosure, experimental data is further provided by the disclosure. The experimental data comprise comparison data of the positioning accuracy of the image processing method of the embodiment and the conventional target positioning method based on the category activation diagram. The following describes the experimental contents.
In the above formula (1) of the first branch, the parameter α value is different, the curvature of the nonlinear curve is different, and the first feature processing map obtained by processing the first branch on the initial feature map is also different. Thus, a specific α may be determined first, and the nonlinear mapping relationship (i.e., equation (1)) may be determined based on the specific α. A comparative experiment was performed based on the determined formula (1).
Fig. 4 is a schematic diagram of a nonlinear curve of different curvatures under a first branch according to one embodiment of the disclosure.
As shown in fig. 4, the abscissa represents the original values of the features in the initial feature map, i.e., F (c, i, j). The ordinate represents the output value corresponding to the nonlinear portion of equation (1), i.e., D (c, i, j). Curve 401 is a nonlinear curve when α=0.4, curve 402 is a nonlinear curve when α=0.6, curve 403 is a nonlinear curve when α=0.8, and straight line 404 is a linear relationship of D (c, i, j) =f (c, i, j).
Due to the setting of the α value by the above formula (2), the intersection point of the nonlinear mapping curve of formula (1) and the linear mapping curve (the linear mapping refers to the relationship D(c,i,j) = F(c,i,j)) is at F(c,i,j) = cr×max(F(c)).
For example, the intersection point of the curve 401 and the straight line 404 is F (c, i, j) =0.4×max (F (c)), the intersection point of the curve 402 and the straight line 404 is F (c, i, j) =0.6×max (F (c)), and the intersection point of the curve 403 and the straight line 404 is F (c, i, j) =0.8×max (F (c)).
As can be seen from FIG. 4, when α = 0.8, the enhancement by curve 403 of the original values in the region F(c,i,j) < cr×max(F(c)) is evident, which can attract more network attention. Thus, this example uses formula (1) with α = 0.8 in the experiment.
The experimental images can come from a public data set. Class activation maps of the experimental images are obtained using the image processing method provided by the present disclosure and, separately, using the traditional class-activation-map-based target localization method. Next, the accuracy of the class activation maps obtained by the present disclosure and those obtained by the traditional method can be compared using the MaxBoxAcc metric.
For example, the MaxBoxAcc metric is calculated as follows (formula (4)):
MaxBoxAcc = max_τ BoxAcc(τ, δ)    (4)
where τ is the threshold parameter controlling the binarization of the class activation map. In the calculation of the MaxBoxAcc metric, τ traverses from 0 to 1 in steps of 0.001. δ is the IoU (Intersection over Union) hit threshold of the localization bounding box; that is, if the IoU of the predicted box and the ground-truth box is greater than or equal to δ, the position of the target object is considered to be successfully predicted. In this experiment, δ takes the two values 0.5 and 0.7. Given τ and δ, BoxAcc(τ, δ) measures the localization accuracy of the predicted bounding boxes on the test set.
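For reference, the IoU hit criterion underlying BoxAcc can be sketched as follows; this is a generic IoU computation written for illustration, not code from the reported experiments:

```python
def iou(box_a, box_b) -> float:
    """Intersection over Union of two boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

def box_hit(pred_box, gt_box, delta: float = 0.5) -> bool:
    """A prediction counts as correct if IoU(prediction, ground truth) >= delta."""
    return iou(pred_box, gt_box) >= delta
```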
The image processing method provided by the disclosure is based on the VGG16 classification network, the plug-and-play components are inserted, and weak supervision positioning is performed on the test set. The conventional target positioning method based on the class activation diagram also uses the VGG16 classification network and performs weak supervision positioning on the same test set. The positioning results of the image processing method provided by the present disclosure and the positioning results of the conventional class activation diagram-based target positioning method are shown in table 1 below.
TABLE 1
Method                                               MaxBoxAcc (δ=0.5)   MaxBoxAcc (δ=0.7)
Traditional CAM-based method (VGG16)                 61.11               12.06
VGG16 + one component of the present disclosure      62.2                12.39
VGG16 + two components of the present disclosure     64.62               19.24
As shown in table 1, in the case of δ=0.5 and δ=0.7, the positioning accuracy is improved by inserting only one component provided by the present disclosure into the VGG16 network. For example, in the case of δ=0.5, the positioning accuracy of the conventional method is 61.11, and the positioning accuracy of the image processing method provided by the present disclosure is 62.2. In the case of δ=0.7, the positioning accuracy of the conventional method is 12.06, and the positioning accuracy of the image processing method provided by the present disclosure is 12.39.
When two components provided by the present disclosure are inserted into the VGG16 network, the positioning accuracy is further improved in the cases of δ=0.5 and δ=0.7. For example, in the case of δ=0.5, the positioning accuracy of the image processing method provided by the present disclosure is 64.62; in the case of δ=0.7, it is 19.24.
Experiments provided by the embodiment prove that the image processing method provided by the disclosure can improve the positioning accuracy of the target object.
Fig. 5 is a block diagram of an image processing apparatus according to one embodiment of the present disclosure.
As shown in fig. 5, the image processing apparatus 500 includes an extraction module 501, a processing module 502, a fusion module 503, a selection module 504, and a determination module 505.
The extracting module 501 is configured to extract features of an image to be processed, to obtain an initial feature map, where the image to be processed includes a target object.
The processing module 502 is configured to respectively perform suppression processing of the most discriminative features and suppression processing of irrelevant features on the initial feature map to obtain a first feature processing map and a second feature processing map.
The fusion module 503 is configured to fuse the first feature processing diagram and the second feature processing diagram to obtain a fused feature diagram.
The selection module 504 is configured to perform feature selection on the fused feature map to obtain an output feature map.
The determining module 505 is configured to determine a category and a location of the target object in the image to be processed according to the output feature map.
The selection module 504 includes a first determination unit, a first processing unit, and a second processing unit.
The first determining unit is used for determining a first weight of the feature in the fusion feature map, wherein the first weight represents importance of the feature to a classification result of the target object.
And the first processing unit is used for processing the fusion characteristic diagram into a weighted characteristic diagram according to the first weight.
And the second processing unit is used for processing the dimension of the weighted feature map to be consistent with the dimension of the initial feature map so as to obtain an output feature map.
The first determining unit comprises a compression subunit, an interaction subunit, a dimension processing subunit and a normalization processing subunit.
The compression subunit is used for compressing the fusion feature map to obtain a first fusion feature vector.
And the interaction subunit is used for carrying out information interaction on each feature in the first fusion feature vector to obtain a second fusion feature vector.
The dimension processing subunit is configured to process the dimension of the second fusion feature vector to be consistent with the dimension of the first fusion feature vector, so as to obtain a third fusion feature vector.
The normalization processing subunit is configured to normalize the third fused feature vector to obtain a weight vector, where elements in the weight vector represent first weights, and the first weights represented by the multiple elements in the weight vector correspond to multiple features in the fused feature graph.
The processing module 502 includes a third processing unit and a fourth processing unit.
And the third processing unit is used for carrying out nonlinear enhancement processing on the features with the feature values smaller than the first threshold value in the initial feature map, and removing the features with the feature values not smaller than the first threshold value in the initial feature map to obtain a first feature processing map.
And the fourth processing unit is used for removing the characteristic with the characteristic value smaller than the second threshold value in the initial characteristic diagram to obtain a second characteristic processing diagram.
The third processing unit is used for processing the initial feature map into a first feature processing map according to the following formula:
wherein F (c, i, j) represents a feature value of an ith row and jth column position of a c-th channel in the initial feature map, D (c, i, j) represents a feature value of an ith row and jth column position of the c-th channel in the first feature processing map, max (F (c)) represents a maximum feature value in the c-th channel, cr×max (F (c)) represents a first threshold, cr is a super-parameter, and α is a parameter for controlling non-linearization.
The fourth processing unit is configured to process the initial feature map into a second feature processing map according to the following formula:
wherein F (c, i, j) represents a feature value of an ith row and jth column position of a c-th channel in the initial feature map, S (c, i, j) represents a feature value of an ith row and jth column position of the c-th channel in the second feature processing map, max (F (c)) represents a maximum feature value in the c-th channel, t×max (F (c)) represents a second threshold, and t is a super parameter.
The determining module 505 includes a fifth processing unit, a sixth processing unit, and a second determining unit.
And the fifth processing unit is used for inputting the output feature map into the image classification model to obtain the target category of the target object and the second weight related to the target category.
And the sixth processing unit is used for processing the output characteristic map into a category activation map according to the second weight.
The second determining unit is used for determining the position of the target object in the image to be processed according to the category activation diagram.
The second determination unit includes a binarization subunit and a determination subunit.
And the binarization subunit is used for performing binarization processing on the characteristic values in the class activation graph to obtain a characteristic mask graph.
The determining subunit is configured to determine a position of the target object according to the feature mask map.
According to an embodiment of the present disclosure, the image classification model comprises a fully connected network, and the fifth processing unit is configured to determine a target class of the target object and a second weight related to the target class through the fully connected network.
According to an embodiment of the disclosure, the image classification model comprises a convolutional network, and the extraction module is used for inputting the image to be processed into the convolutional network to obtain an initial feature map.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 6 illustrates a schematic block diagram of an example electronic device 600 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the apparatus 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM603, various programs and data required for the operation of the device 600 may also be stored. The computing unit 601, ROM 602, and RAM603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Various components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, mouse, etc.; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the respective methods and processes described above, for example, an image processing method. For example, in some embodiments, the image processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When a computer program is loaded into the RAM603 and executed by the computing unit 601, one or more steps of the image processing method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the image processing method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (23)

1. An image processing method, comprising:
extracting features of an image to be processed to obtain an initial feature map, wherein the image to be processed comprises a target object;
respectively performing, on the initial feature map, suppression processing of the most discriminative features and suppression processing of irrelevant features, to obtain a first feature processing map and a second feature processing map;
fusing the first feature processing map and the second feature processing map to obtain a fused feature map;
performing feature selection on the fused feature map to obtain an output feature map; and
determining the category and the position of the target object in the image to be processed according to the output feature map;
wherein obtaining the first feature processing map includes:
performing nonlinear enhancement processing on features whose feature values are smaller than a first threshold in the initial feature map, and removing features whose feature values are not smaller than the first threshold from the initial feature map, to obtain the first feature processing map;
wherein the nonlinear enhancement processing is performed according to the following formula:
wherein F(c, i, j) represents the feature value at the i-th row and j-th column of the c-th channel in the initial feature map, D(c, i, j) represents the feature value at the i-th row and j-th column of the c-th channel after the nonlinear enhancement processing, max(F(c)) represents the maximum feature value in the c-th channel, cr*max(F(c)) represents the first threshold, cr is a hyperparameter, and α is a parameter for controlling the nonlinear enhancement.
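The sketch below illustrates the suppression of the most discriminative features described in claim 1. Since the enhancement formula itself is not reproduced in the text above, the power-law boost used here is only an assumed stand-in; what follows the claim wording is the per-channel threshold cr*max(F(c)), the zeroing of values at or above that threshold, and a nonlinear boost of the remaining values controlled by α. The function name and default values are illustrative, not taken from the patent.

```python
import numpy as np

def suppress_most_discriminative(feature_map, cr=0.7, alpha=2.0):
    """Per-channel suppression of the most discriminative features (claim 1 sketch).

    feature_map: array of shape (C, H, W), assumed non-negative (e.g. after ReLU).
    Values at or above the first threshold cr * max(F(c)) are removed (set to 0);
    the remaining values receive an assumed power-law boost controlled by alpha,
    standing in for the enhancement formula that is not reproduced in the text.
    """
    out = np.zeros_like(feature_map, dtype=np.float32)
    for c in range(feature_map.shape[0]):
        channel = feature_map[c]
        ch_max = float(channel.max())
        if ch_max <= 0.0:
            continue
        threshold = cr * ch_max                  # the "first threshold" cr * max(F(c))
        keep = channel < threshold               # only features below the threshold survive
        # assumed nonlinear enhancement: push kept values toward the threshold
        out[c][keep] = threshold * (channel[keep] / threshold) ** (1.0 / alpha)
    return out

# toy usage
F = np.abs(np.random.randn(4, 8, 8)).astype(np.float32)
D = suppress_most_discriminative(F, cr=0.7, alpha=2.0)
```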
2. The method of claim 1, wherein the performing feature selection on the fused feature map to obtain an output feature map comprises:
determining a first weight of a feature in the fused feature map, wherein the first weight represents the importance of the feature to the classification result of the target object;
processing the fused feature map into a weighted feature map according to the first weight; and
processing the dimension of the weighted feature map to be consistent with the dimension of the initial feature map, to obtain the output feature map.
3. The method of claim 2, wherein the determining the first weight of a feature in the fused feature map comprises:
compressing the fused feature map to obtain a first fusion feature vector;
performing information interaction among the features in the first fusion feature vector to obtain a second fusion feature vector;
processing the dimension of the second fusion feature vector to be consistent with the dimension of the first fusion feature vector to obtain a third fusion feature vector; and
performing normalization processing on the third fusion feature vector to obtain a weight vector, wherein elements in the weight vector represent the first weights, and the first weights represented by a plurality of elements in the weight vector respectively correspond to a plurality of features in the fused feature map.
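Claims 2 and 3 read as a squeeze-and-excitation-style channel attention: compress the fused map to a vector, let its elements interact, restore the dimension, normalize into weights, and reweight the map. The sketch below assumes that reading; the channel count, reduction ratio, and choice of sigmoid normalization are illustrative assumptions, not specified by the claims.

```python
import torch
import torch.nn as nn

class FeatureSelection(nn.Module):
    """Squeeze-and-excitation-style reading of claims 2-3 (illustrative only)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                     # compress map -> first fusion feature vector
        self.fc1 = nn.Linear(channels, channels // reduction)   # information interaction -> second fusion feature vector
        self.fc2 = nn.Linear(channels // reduction, channels)   # restore dimension -> third fusion feature vector
        self.act = nn.ReLU(inplace=True)
        self.gate = nn.Sigmoid()                                # normalization -> weight vector (the first weights)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:     # fused: (N, C, H, W)
        n, c, _, _ = fused.shape
        v1 = self.pool(fused).view(n, c)
        weights = self.gate(self.fc2(self.act(self.fc1(v1)))).view(n, c, 1, 1)
        return fused * weights                                  # weighted feature map of claim 2

# toy usage
fused_map = torch.randn(1, 64, 14, 14)
output_map = FeatureSelection(64)(fused_map)
```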
4. The method according to any one of claims 1 to 3, wherein the respectively performing, on the initial feature map, suppression processing of the most discriminative features and suppression processing of irrelevant features, to obtain a first feature processing map and a second feature processing map comprises:
removing features whose feature values are smaller than a second threshold from the initial feature map, to obtain the second feature processing map.
5. The method of claim 1, wherein features in the initial feature map having feature values not less than the first threshold are removed according to the following formula:
D(c, i, j) = 0, for cr*max(F(c)) ≤ F(c, i, j) ≤ max(F(c))
wherein D(c, i, j) represents the feature value at the i-th row and j-th column of the c-th channel after feature removal.
6. The method of claim 4, wherein the removing features in the initial feature map having feature values less than a second threshold value to obtain the second feature processing map comprises:
processing the initial feature map into the second feature processing map according to the following formula:
wherein F(c, i, j) represents the feature value at the i-th row and j-th column of the c-th channel in the initial feature map, S(c, i, j) represents the feature value at the i-th row and j-th column of the c-th channel in the second feature processing map, max(F(c)) represents the maximum feature value in the c-th channel, t*max(F(c)) represents the second threshold, and t is a hyperparameter.
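A minimal sketch of the irrelevant-feature suppression of claims 4 and 6: per channel, values below the second threshold t*max(F(c)) are zeroed and the rest are kept unchanged. The value t = 0.1 is an illustrative choice, and any additional rescaling in the omitted formula is not reproduced here.

```python
import numpy as np

def suppress_irrelevant(feature_map, t=0.1):
    """Per-channel removal of irrelevant features (claims 4 and 6 sketch)."""
    out = np.asarray(feature_map, dtype=np.float32).copy()
    for c in range(out.shape[0]):
        threshold = t * float(out[c].max())      # the "second threshold" t * max(F(c))
        out[c][out[c] < threshold] = 0.0         # remove features below the threshold
    return out

# toy usage
F = np.abs(np.random.randn(4, 8, 8)).astype(np.float32)
S = suppress_irrelevant(F, t=0.1)
```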
7. The method of claim 1, wherein the determining the category and location of the target object in the image to be processed from the output feature map comprises:
inputting the output feature map into an image classification model to obtain a target class of the target object and a second weight related to the target class;
processing the output feature map into a class activation map according to the second weight; and
determining the position of the target object in the image to be processed according to the class activation map.
8. The method of claim 7, wherein the determining the position of the target object in the image to be processed according to the class activation map comprises:
performing binarization processing on the feature values in the class activation map to obtain a feature mask map; and
determining the position of the target object according to the feature mask map.
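The sketch below follows claims 7 and 8 as far as they are stated: a class activation map is formed as the channel-weighted sum of the output feature map using the second weights of the predicted class, then binarized into a feature mask map, from which a position is read off. The relative binarization threshold and the bounding-box readout are assumptions added for illustration.

```python
import numpy as np

def localize_from_cam(output_map, class_weights, mask_ratio=0.2):
    """Class activation map -> feature mask map -> position (claims 7-8 sketch).

    output_map:    (C, H, W) output feature map.
    class_weights: (C,) second weights tied to the predicted target class, e.g.
                   the row of a fully connected classifier for that class.
    mask_ratio is an assumed binarization threshold relative to the CAM maximum.
    """
    cam = np.tensordot(class_weights, output_map, axes=([0], [0]))  # weighted sum over channels
    cam = np.maximum(cam, 0.0)
    cam = cam / (cam.max() + 1e-8)                                  # normalize to [0, 1]
    mask = (cam >= mask_ratio).astype(np.uint8)                     # feature mask map
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return cam, mask, None
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
    return cam, mask, box

# toy usage
cam, mask, box = localize_from_cam(np.random.rand(64, 7, 7), np.random.rand(64))
```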
9. The method of claim 7, wherein the image classification model comprises a fully connected network, and the inputting the output feature map into the image classification model to obtain a target class of the target object and a second weight related to the target class comprises:
determining the target class of the target object and the second weight related to the target class through the fully connected network.
10. The method of claim 7, wherein the image classification model further comprises a convolutional network, and the extracting features of the image to be processed to obtain an initial feature map comprises:
inputting the image to be processed into the convolutional network to obtain the initial feature map.
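One possible way to obtain the initial feature map of claim 10 is to run the image through the convolutional part of a standard classification backbone. ResNet-18 and a recent torchvision API are purely illustrative assumptions; the claims do not name a specific network.

```python
import torch
import torchvision.models as models

# Convolutional feature extractor: keep everything up to the last conv stage,
# drop the average pooling and the fully connected head.
backbone = models.resnet18(weights=None)
conv_net = torch.nn.Sequential(*list(backbone.children())[:-2])

image = torch.randn(1, 3, 224, 224)       # stand-in for the preprocessed input image
initial_feature_map = conv_net(image)     # shape: (1, 512, 7, 7)
```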
11. An image processing apparatus comprising:
the extraction module is used for extracting features of an image to be processed to obtain an initial feature map, wherein the image to be processed comprises a target object;
the processing module is used for respectively performing, on the initial feature map, suppression processing of the most discriminative features and suppression processing of irrelevant features, to obtain a first feature processing map and a second feature processing map;
the fusion module is used for fusing the first feature processing map and the second feature processing map to obtain a fused feature map;
the selection module is used for performing feature selection on the fused feature map to obtain an output feature map; and
the determining module is used for determining the category and the position of the target object in the image to be processed according to the output feature map;
wherein the processing module comprises:
a third processing unit used for performing nonlinear enhancement processing on features whose feature values are smaller than a first threshold in the initial feature map, and removing features whose feature values are not smaller than the first threshold from the initial feature map, to obtain the first feature processing map;
wherein the third processing unit is configured to perform the nonlinear enhancement processing according to the following formula:
wherein F(c, i, j) represents the feature value at the i-th row and j-th column of the c-th channel in the initial feature map, D(c, i, j) represents the feature value at the i-th row and j-th column of the c-th channel in the first feature processing map, max(F(c)) represents the maximum feature value in the c-th channel, cr*max(F(c)) represents the first threshold, cr is a hyperparameter, and α is a parameter for controlling the nonlinear enhancement.
12. The apparatus of claim 11, wherein the selection module comprises:
a first determining unit, configured to determine a first weight of a feature in the fused feature map, where the first weight represents importance of the feature to a classification result of the target object;
a first processing unit used for processing the fused feature map into a weighted feature map according to the first weight; and
a second processing unit used for processing the dimension of the weighted feature map to be consistent with the dimension of the initial feature map, to obtain the output feature map.
13. The apparatus of claim 12, wherein the first determining unit comprises:
a compression subunit used for compressing the fused feature map to obtain a first fusion feature vector;
an interaction subunit used for performing information interaction among the features in the first fusion feature vector to obtain a second fusion feature vector;
a dimension processing subunit used for processing the dimension of the second fusion feature vector to be consistent with the dimension of the first fusion feature vector to obtain a third fusion feature vector; and
a normalization subunit used for performing normalization processing on the third fusion feature vector to obtain a weight vector, wherein elements in the weight vector represent the first weights, and the first weights represented by a plurality of elements in the weight vector respectively correspond to a plurality of features in the fused feature map.
14. The apparatus of any of claims 11 to 13, wherein the processing module comprises:
a fourth processing unit used for removing features whose feature values are smaller than a second threshold from the initial feature map, to obtain the second feature processing map.
15. The apparatus of claim 11, wherein the third processing unit is configured to remove features in the initial feature map having feature values not less than the first threshold according to the following formula:
D(c, i, j) = 0, for cr*max(F(c)) ≤ F(c, i, j) ≤ max(F(c))
wherein D(c, i, j) represents the feature value at the i-th row and j-th column of the c-th channel after feature removal.
16. The apparatus of claim 14, wherein the fourth processing unit is configured to process the initial feature map into the second feature processing map according to the following formula:
wherein F(c, i, j) represents the feature value at the i-th row and j-th column of the c-th channel in the initial feature map, S(c, i, j) represents the feature value at the i-th row and j-th column of the c-th channel in the second feature processing map, max(F(c)) represents the maximum feature value in the c-th channel, t*max(F(c)) represents the second threshold, and t is a hyperparameter.
17. The apparatus of claim 11, wherein the determining module comprises:
a fifth processing unit configured to input the output feature map into an image classification model, to obtain a target class of the target object and a second weight related to the target class;
a sixth processing unit configured to process the output feature map into a class activation map according to the second weight; and
a second determining unit used for determining the position of the target object in the image to be processed according to the class activation map.
18. The apparatus of claim 17, wherein the second determining unit comprises:
a binarization subunit used for performing binarization processing on the feature values in the class activation map to obtain a feature mask map; and
a determining subunit used for determining the position of the target object according to the feature mask map.
19. The apparatus of claim 17, wherein the image classification model comprises a fully connected network; the fifth processing unit is configured to determine, through the fully connected network, a target class of the target object and a second weight related to the target class.
20. The apparatus of claim 17, wherein the image classification model further comprises a convolutional network; and the extraction module is used for inputting the image to be processed into the convolutional network to obtain the initial feature map.
21. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 10.
22. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1 to 10.
23. A computer program product comprising a computer program which is stored on at least one of a readable storage medium and an electronic device and which, when executed by a processor, implements the method according to any one of claims 1 to 10.
CN202310114092.5A 2023-02-06 2023-02-06 Image processing method, device, electronic equipment and storage medium Active CN116071628B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310114092.5A CN116071628B (en) 2023-02-06 2023-02-06 Image processing method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310114092.5A CN116071628B (en) 2023-02-06 2023-02-06 Image processing method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116071628A CN116071628A (en) 2023-05-05
CN116071628B (en) 2024-04-05

Family

ID=86178361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310114092.5A Active CN116071628B (en) 2023-02-06 2023-02-06 Image processing method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116071628B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111967464A (en) * 2020-06-30 2020-11-20 西安电子科技大学 Weak supervision target positioning method based on deep learning
CN112116599A (en) * 2020-08-12 2020-12-22 南京理工大学 Sputum smear tubercle bacillus semantic segmentation method and system based on weak supervised learning
CN113989569A (en) * 2021-10-29 2022-01-28 北京百度网讯科技有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN114581710A (en) * 2022-03-04 2022-06-03 腾讯科技(深圳)有限公司 Image recognition method, device, equipment, readable storage medium and program product
CN114612743A (en) * 2022-03-10 2022-06-10 北京百度网讯科技有限公司 Deep learning model training method, target object identification method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101879207B1 (en) * 2016-11-22 2018-07-17 주식회사 루닛 Method and Apparatus for Recognizing Objects in a Weakly Supervised Learning Manner
US20180330205A1 (en) * 2017-05-15 2018-11-15 Siemens Aktiengesellschaft Domain adaptation and fusion using weakly supervised target-irrelevant data
CN110866908B (en) * 2019-11-12 2021-03-26 腾讯科技(深圳)有限公司 Image processing method, image processing apparatus, server, and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Adversarial Complementary Learning for Weakly Supervised Object Localization";Xiaolin Zhang;《arXiv》;第1-10页 *
"Re-Attention Transformer for Weakly Supervised Object Localization";Hui Su;《aiXiv》;第1-11页 *

Also Published As

Publication number Publication date
CN116071628A (en) 2023-05-05

Similar Documents

Publication Publication Date Title
US20230087526A1 (en) Neural network training method, image classification system, and related device
JP7414901B2 (en) Living body detection model training method and device, living body detection method and device, electronic equipment, storage medium, and computer program
KR102459123B1 (en) Image processing method, device, server and storage medium
WO2023010758A1 (en) Action detection method and apparatus, and terminal device and storage medium
JP7393472B2 (en) Display scene recognition method, device, electronic device, storage medium and computer program
US11893773B2 (en) Finger vein comparison method, computer equipment, and storage medium
CN113869449A (en) Model training method, image processing method, device, equipment and storage medium
CN115690443B (en) Feature extraction model training method, image classification method and related devices
US11967125B2 (en) Image processing method and system
CN112560993A (en) Data screening method and device, electronic equipment and storage medium
US20230021551A1 (en) Using training images and scaled training images to train an image segmentation model
CN111400528A (en) Image compression method, device, server and storage medium
CN111340213B (en) Neural network training method, electronic device, and storage medium
CN114255381B (en) Training method of image recognition model, image recognition method, device and medium
CN114692778A (en) Multi-modal sample set generation method, training method and device for intelligent inspection
CN113887630A (en) Image classification method and device, electronic equipment and storage medium
WO2023232031A1 (en) Neural network model training method and apparatus, electronic device and medium
CN116343233B (en) Text recognition method and training method and device of text recognition model
CN116071628B (en) Image processing method, device, electronic equipment and storage medium
CN115457329B (en) Training method of image classification model, image classification method and device
CN116994319A (en) Model training method, face recognition equipment and medium
CN115482436B (en) Training method and device for image screening model and image screening method
CN116092101A (en) Training method, image recognition method apparatus, device, and readable storage medium
CN115631370A (en) Identification method and device of MRI (magnetic resonance imaging) sequence category based on convolutional neural network
CN113221519B (en) Method, apparatus, device, medium and product for processing form data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant