CN113361540A

CN113361540A - Image processing method and device, electronic equipment and storage medium

Info

Publication number: CN113361540A
Application number: CN202110573067.4A
Authority: CN
Inventors: 陈博宇; 李楚鸣
Original assignee: Sensetime Group Ltd
Current assignee: Sensetime Group Ltd
Priority date: 2021-05-25
Filing date: 2021-05-25
Publication date: 2021-09-07
Also published as: WO2022247103A1

Abstract

The present disclosure relates to an image processing method and apparatus, an electronic device, and a storage medium, the method including: determining a plurality of first image block features corresponding to a target image; performing feature enhancement n times based on a self-attention mechanism according to the plurality of first image block features to obtain a plurality of second image block features, wherein the second image block features have the same number and the same number of channels as the first image block features, and n is an integer greater than or equal to 1; performing feature pooling on the plurality of second image block features to obtain a plurality of third image block features, wherein the number of the third image block features is smaller than that of the second image block features, and the number of channels of the third image block features is larger than that of the second image block features; and performing a target image processing operation on the target image according to the plurality of third image block features to obtain an image processing result. The embodiments of the disclosure can improve the precision of the image processing result.

Description

Image processing method and device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to an image processing method and apparatus, an electronic device, and a storage medium.

Background

Recently, self-attention networks have been widely used in natural language processing, and the self-attention networks perform feature enhancement by establishing a connection between features, thereby improving the final performance of the network. With the proposal of the vision self-attention network, the self-attention network is also applied in the field of computer vision on a large scale, and shows great potential. However, the existing design of the visual self-attention network is only a simple way to follow the design in natural language processing, and the computer vision characteristics are not improved, so that the performance of the visual self-attention network is poor.

Disclosure of Invention

The disclosure provides an image processing method and device, an electronic device and a storage medium.

According to an aspect of the present disclosure, there is provided an image processing method including: determining a plurality of first image block features corresponding to a target image; performing feature enhancement n times based on a self-attention mechanism according to the plurality of first image block features to obtain a plurality of second image block features, wherein the second image block features have the same number and the same number of channels as the first image block features, and n is an integer greater than or equal to 1; performing feature pooling on the plurality of second image block features to obtain a plurality of third image block features, wherein the number of the third image block features is smaller than that of the second image block features, and the number of channels of the third image block features is larger than that of the second image block features; and performing a target image processing operation on the target image according to the plurality of third image block features to obtain an image processing result.

In a possible implementation manner, the performing feature enhancement n times based on a self-attention mechanism according to the plurality of first image block features to obtain a plurality of second image block features includes: based on the self-attention mechanism, performing feature enhancement on the input features corresponding to the ith feature enhancement to obtain the output features corresponding to the ith feature enhancement, wherein i is an integer greater than or equal to 1 and less than or equal to n; and, in the case that i is equal to n, determining the output features corresponding to the ith feature enhancement as the plurality of second image block features; wherein, in the case that i is equal to 1, the input features corresponding to the 1st feature enhancement are the plurality of first image block features, and in the case that i is greater than 1, the input features corresponding to the ith feature enhancement are the output features corresponding to the (i-1)th feature enhancement.

In a possible implementation manner, the performing feature enhancement on the input feature corresponding to the ith feature enhancement based on the self-attention mechanism to obtain the output feature corresponding to the ith feature enhancement includes: determining a first feature vector, a second feature vector and a third feature vector according to the input features corresponding to the ith feature enhancement; determining an attention feature map corresponding to the ith feature enhancement according to the first feature vector and the second feature vector; and determining the output feature corresponding to the ith feature enhancement according to the attention feature map and the third feature vector corresponding to the ith feature enhancement.

In a possible implementation manner, in a case that i satisfies a preset condition, the method further includes: determining the output features corresponding to the ith feature enhancement as the input features corresponding to the (i+1)th feature enhancement; determining the attention feature map corresponding to the ith feature enhancement as the attention feature map corresponding to the (i+1)th feature enhancement; and performing feature enhancement on the input features corresponding to the (i+1)th feature enhancement by using the attention feature map corresponding to the (i+1)th feature enhancement to obtain the output features corresponding to the (i+1)th feature enhancement.

In a possible implementation manner, the performing feature enhancement on the input features corresponding to the (i+1)th feature enhancement by using the attention feature map corresponding to the (i+1)th feature enhancement to obtain the output features corresponding to the (i+1)th feature enhancement includes: determining a fourth feature vector according to the input features corresponding to the (i+1)th feature enhancement; and determining the output features corresponding to the (i+1)th feature enhancement according to the attention feature map corresponding to the (i+1)th feature enhancement and the fourth feature vector.

In a possible implementation manner, the performing feature pooling on the plurality of second image block features to obtain a plurality of third image block features includes: performing convolution processing on the plurality of second image block features to obtain a plurality of fourth image block features, wherein the number of the fourth image block features is the same as that of the second image block features, and the number of channels of the fourth image block features is greater than the number of channels of the second image block features; and performing pooling processing on the plurality of fourth image block features to obtain the plurality of third image block features.

In one possible implementation manner, the image processing method is implemented by a self-attention neural network, and the self-attention neural network comprises a self-attention module and a feature pooling module; the performing feature enhancement n times based on a self-attention mechanism according to the plurality of first image block features to obtain a plurality of second image block features comprises: performing, by using the self-attention module, feature enhancement n times based on the self-attention mechanism according to the plurality of first image block features to obtain the plurality of second image block features; and the performing feature pooling on the plurality of second image block features to obtain a plurality of third image block features comprises: performing, by using the feature pooling module, feature pooling on the plurality of second image block features to obtain the plurality of third image block features.

In one possible implementation, the self-attention module includes n self-attention layers, where each self-attention layer is used for performing feature enhancement once, and at least two adjacent self-attention layers share the same attention feature map; and/or the feature pooling module comprises a convolution layer and a maximum pooling layer, wherein the convolution kernel size corresponding to the convolution layer is smaller than a threshold value.

In one possible implementation, the method further includes: constructing a network structure search space, wherein the network structure search space comprises a plurality of network hyper-parameters corresponding to the self-attention neural network; constructing a super network according to the network structure search space, wherein the network structure search space comprises a plurality of selectable network structures constructed according to the plurality of network hyper-parameters; determining a target network structure from the plurality of selectable network structures by performing network training on the super network; and constructing the self-attention neural network according to the target network structure.

In one possible implementation, the plurality of network hyper-parameters includes: the image block feature number parameter, the image block feature channel number parameter, the layer parameter corresponding to the self-attention module, and the position parameters of at least two adjacent self-attention layers in the self-attention module, which need to share the same attention feature map.

In a possible implementation manner, the self-attention layers included in the same self-attention module correspond to the same image block feature number parameter value and the same image block feature channel number parameter value.

According to an aspect of the present disclosure, there is provided an image processing apparatus including: a feature determining module, configured to determine a plurality of first image block features corresponding to a target image; a self-attention module, configured to perform feature enhancement n times based on a self-attention mechanism according to the plurality of first image block features to obtain a plurality of second image block features, wherein the second image block features have the same number and the same number of channels as the first image block features, and n is an integer greater than or equal to 1; a feature pooling module, configured to perform feature pooling on the plurality of second image block features to obtain a plurality of third image block features, wherein the number of the third image block features is smaller than that of the second image block features, and the number of channels of the third image block features is larger than the number of channels of the second image block features; and a target image processing module, configured to perform a target image processing operation on the target image according to the plurality of third image block features to obtain an image processing result.

According to an aspect of the present disclosure, there is provided an electronic device including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.

According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.

In the embodiments of the disclosure, a plurality of first image block features corresponding to a target image are determined; feature enhancement is performed n times based on a self-attention mechanism according to the plurality of first image block features to obtain a plurality of second image block features, wherein the second image block features have the same number and the same number of channels as the first image block features, and n is an integer greater than or equal to 1; feature pooling is performed on the plurality of second image block features to obtain a plurality of third image block features, wherein the number of the third image block features is less than that of the second image block features, and the number of channels of the third image block features is greater than the number of channels of the second image block features; and a target image processing operation is performed on the target image according to the plurality of third image block features to obtain an image processing result. Feature pooling is applied to the image block features that have been enhanced based on the self-attention mechanism, which reduces the number of image block features and increases their number of channels. This both reduces spatially redundant features and improves the semantic expression capability of the image block features, so that performing the target image processing operation with these image block features can improve the precision of the image processing result.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.

FIG. 1 shows a flow diagram of an image processing method according to an embodiment of the present disclosure;

FIG. 2 illustrates a schematic diagram of determining a plurality of first image block features corresponding to a target image according to an embodiment of the disclosure;

FIG. 3 illustrates a network architecture diagram of a self-attention neural network, according to an embodiment of the disclosure;

FIG. 4 shows a schematic diagram of a self-attention module in a self-attention neural network, in accordance with an embodiment of the present disclosure;

FIG. 5 shows a block diagram of an image processing apparatus according to an embodiment of the present disclosure;

FIG. 6 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure;

FIG. 7 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure.

Detailed Description

Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

The term "and/or" herein merely describes an association between associated objects and means that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality; for example, including at least one of A, B, and C may mean including any one or more elements selected from the group consisting of A, B, and C.

Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.

Fig. 1 shows a flow chart of an image processing method according to an embodiment of the present disclosure. The image processing method may be performed by an electronic device such as a terminal device or a server, where the terminal device may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like, and the method may be implemented by a processor calling a computer readable instruction stored in a memory. Alternatively, the method may be performed by a server. As shown in fig. 1, the image processing method may include:

In step S11, a plurality of first image block features corresponding to the target image are determined.

The target image is an image to be processed which needs image processing. In order to better acquire the internal correlation of the target image, the target image may be divided into a plurality of image blocks, and feature extraction may be performed on each image block to obtain a plurality of first image block features corresponding to the target image. The number of the plurality of image blocks may be determined according to actual conditions, which is not specifically limited by the present disclosure.

Fig. 2 illustrates a schematic diagram of determining a plurality of first image block features corresponding to a target image according to an embodiment of the present disclosure. As shown in fig. 2, the target image is divided into L image blocks, and feature extraction is performed on each of the L image blocks to obtain L first image block features, where the number of channels of each first image block feature is d. The number of channels of each first image block feature may be set according to actual conditions, which is not specifically limited in this disclosure.

In order to facilitate subsequent processing of the plurality of first image block features, the plurality of first image block features may be converted into a sequence of first image block features. The specific manner of converting the plurality of first image block features into the first image block feature sequence may be determined according to actual situations, and this disclosure does not specifically limit this.
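As a minimal illustrative sketch of the block division described above, the following Python code splits a single-channel image into non-overlapping image blocks and arranges them as a sequence. The function name `split_into_blocks`, the raster ordering, and the use of flattened pixel values as a stand-in for feature extraction are assumptions made for illustration only; the disclosure does not specify the feature extractor or the sequence conversion.

```python
def split_into_blocks(image, block):
    """Split an H x W image (a list of rows) into non-overlapping
    block x block tiles, returned in raster order as flat lists.

    Flattening the pixels of each tile is only a placeholder for the
    unspecified feature extraction step (assumption)."""
    height, width = len(image), len(image[0])
    if height % block or width % block:
        raise ValueError("image size must be a multiple of the block size")
    blocks = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            tile = [image[r][c]
                    for r in range(top, top + block)
                    for c in range(left, left + block)]
            blocks.append(tile)
    return blocks


# A 4 x 4 toy image divided into L = 4 blocks, each with d = 4 values.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
first_features = split_into_blocks(image, 2)
```

Here `first_features` plays the role of the first image block feature sequence, with L equal to `len(first_features)` and d equal to the length of each entry.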

In step S12, feature enhancement is performed n times based on the self-attention mechanism according to the plurality of first image block features to obtain a plurality of second image block features, where the second image block features have the same number and the same number of channels as the first image block features, and n is an integer greater than or equal to 1.

Each feature enhancement leaves the number of image block features and their number of channels unchanged. Therefore, after feature enhancement is performed n times based on the self-attention mechanism according to the plurality of first image block features, the resulting second image block features have the same number and the same number of channels as the first image block features before feature enhancement. The feature enhancement process will be described in detail later in conjunction with possible embodiments of the present disclosure, and will not be described in detail here.

Still taking the L first image block features with the channel number d shown in fig. 2 as an example, according to the L first image block features, feature enhancement is performed n times based on the self-attention mechanism to obtain L second image block features, where the channel number of each second image block feature is still d.

In the case that the plurality of first image block features have been converted into the first image block feature sequence, feature enhancement may be performed n times based on the self-attention mechanism according to the first image block feature sequence to obtain a second image block feature sequence including a plurality of second image block features.

In step S13, feature pooling is performed on the plurality of second image block features to obtain a plurality of third image block features, where the number of the third image block features is smaller than the number of the second image block features, and the number of channels of the third image block features is greater than the number of channels of the second image block features.

The feature enhancement process amounts to further feature extraction on the target image, and as feature extraction deepens, redundant features may arise in the spatial dimension. Performing feature pooling on the plurality of second image block features therefore reduces the image block features in the spatial dimension and expands them in the channel dimension, so that, with the amount of computation kept unchanged, spatially redundant features are reduced and the plurality of third image block features obtained after feature pooling have higher semantic expression capability. The feature pooling process will be described in detail later in conjunction with possible embodiments of the present disclosure and will not be described in detail here.

Still taking the L second image block features with the channel number d as an example, feature pooling is performed on the L second image block features to obtain L/4 third image block features, where the channel number of each third image block feature is 2d. This reduces the L second image block features with the channel number d in the spatial dimension and expands them in the channel dimension. The specific number of the third image block features and their number of channels may be set according to actual conditions, which is not specifically limited by the present disclosure.

In the case that the plurality of second image block features are in the form of the second image block feature sequence, feature pooling is performed on the second image block feature sequence to obtain a third image block feature sequence including a plurality of third image block features.

In step S14, a target image processing operation is performed on the target image according to the plurality of third image block features to obtain an image processing result.
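The feature pooling step can be sketched as a convolution that raises the channel count followed by max pooling that reduces the number of features, in line with the convolution layer and maximum pooling layer mentioned in the summary above. The sketch below is an assumption-laden illustration: the 1x1 convolution, the 2x2 pooling window, the square raster layout of the features, and all function names and weight values are hypothetical choices, not details given by the disclosure.

```python
def pointwise_conv(features, weight):
    """1x1 convolution (assumed kernel size): project every feature
    from d channels to len(weight) channels; the count stays L."""
    return [[sum(w * x for w, x in zip(row, feat)) for row in weight]
            for feat in features]


def max_pool_2x2(features, grid):
    """2x2 max pooling over features laid out in raster order on a
    grid x grid map (assumed layout); the count drops from L to L/4."""
    pooled = []
    for top in range(0, grid, 2):
        for left in range(0, grid, 2):
            window = [features[(top + dr) * grid + (left + dc)]
                      for dr in (0, 1) for dc in (0, 1)]
            pooled.append([max(channel) for channel in zip(*window)])
    return pooled


def feature_pooling(second_features, grid, weight):
    fourth_features = pointwise_conv(second_features, weight)  # L features, 2d channels
    return max_pool_2x2(fourth_features, grid)                 # L/4 features, 2d channels


# L = 16 features with d = 2 channels, pooled to L/4 = 4 features with 2d = 4 channels.
second_features = [[float(i), float(-i)] for i in range(16)]
weight = [[1, 0], [0, 1], [1, 1], [1, -1]]  # arbitrary illustrative 4 x 2 projection
third_features = feature_pooling(second_features, 4, weight)
```

In this sketch the intermediate `fourth_features` keeps the number of features while doubling the channels, and only the pooling stage reduces the number of features.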

The third image block features have higher semantic expression capability, so that the third image block features can be used for carrying out target image processing operation according to image processing requirements, and an image processing result with higher precision can be obtained.

In a possible implementation manner, performing feature enhancement n times based on a self-attention mechanism according to a plurality of first image block features to obtain a plurality of second image block features includes: based on the self-attention mechanism, performing feature enhancement on the input features corresponding to the ith feature enhancement to obtain the output features corresponding to the ith feature enhancement, wherein i is an integer greater than or equal to 1 and less than or equal to n; and, in the case that i is equal to n, determining the output features corresponding to the ith feature enhancement as the plurality of second image block features; wherein, in the case that i is equal to 1, the input features corresponding to the 1st feature enhancement are the plurality of first image block features, and in the case that i is greater than 1, the input features corresponding to the ith feature enhancement are the output features corresponding to the (i-1)th feature enhancement.

The plurality of first image block features are initialized as the input features corresponding to the 1st feature enhancement, and the subsequent feature enhancements are then executed iteratively based on the self-attention mechanism. In the iteration process, the output features corresponding to the previous feature enhancement are determined as the input features corresponding to the next feature enhancement, so that the plurality of second image block features are obtained after n feature enhancements. The specific value of n can be determined according to actual conditions, and the disclosure does not specifically limit this.

For example, n takes the value 6. In the initialization process, the plurality of first image block features are initialized as the input features corresponding to the 1st feature enhancement, and feature enhancement is performed on these input features based on the self-attention mechanism to obtain the output features corresponding to the 1st feature enhancement. The output features corresponding to the 1st feature enhancement are determined as the input features corresponding to the 2nd feature enhancement, and feature enhancement is performed on these input features based on the self-attention mechanism to obtain the output features corresponding to the 2nd feature enhancement. These steps are repeated until the output features corresponding to the 6th feature enhancement are obtained, and the output features corresponding to the 6th feature enhancement are determined as the plurality of second image block features.

The feature enhancement performed n times based on the self-attention mechanism does not change the number of the image block features nor the number of channels of the image block features. Therefore, taking the above-mentioned L first image block features, where the number of channels of each first image block feature is d as an example, after performing feature enhancement n times based on the self-attention mechanism according to the L first image block features, L second image block features can be obtained, where the number of channels of each second image block feature is still d.
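The chaining of the n feature enhancements can be sketched as a simple loop in which the output of one enhancement becomes the input of the next. In this hypothetical example, `run_feature_enhancements` and `toy_layer` are illustrative names, and `toy_layer` is only a placeholder that preserves the number of features and channels; it does not implement self-attention.

```python
def run_feature_enhancements(first_features, layers):
    """Chain n enhancement steps: the output of step i is the input of step i+1."""
    features = first_features       # input features of the 1st enhancement
    for layer in layers:            # i = 1 .. n
        features = layer(features)  # output features of the ith enhancement
    return features                 # output of the nth enhancement


def toy_layer(features):
    """Placeholder enhancement (assumption): add the mean feature to every
    feature, leaving the number of features and channels unchanged."""
    count = len(features)
    mean = [sum(channel) / count for channel in zip(*features)]
    return [[x + m for x, m in zip(feat, mean)] for feat in features]


# n = 6 enhancements applied to L = 3 features with d = 2 channels.
first_features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
second_features = run_feature_enhancements(first_features, [toy_layer] * 6)
```

The result has the same number of features and the same number of channels as the input, matching the invariant stated above.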

In a possible implementation manner, based on a self-attention mechanism, performing feature enhancement on an input feature corresponding to an i-th feature enhancement to obtain an output feature corresponding to the i-th feature enhancement includes: determining a first feature vector, a second feature vector and a third feature vector according to the input features corresponding to the ith feature enhancement; determining an attention feature map corresponding to the ith feature enhancement according to the first feature vector and the second feature vector; and determining the output feature corresponding to the ith feature enhancement according to the attention feature map and the third feature vector corresponding to the ith feature enhancement.

By determining the attention feature map corresponding to the ith feature enhancement, the feature enhancement can be performed on the input feature corresponding to the ith feature enhancement according to the attention feature map corresponding to the ith feature enhancement, and the output feature corresponding to the ith feature enhancement is effectively obtained.

For example, the input features corresponding to the ith feature enhancement are converted into three different feature vectors: a first feature vector $Q^i$, a second feature vector $K^i$, and a third feature vector $V^i$, each with $d$ channels. Feature enhancement can then be achieved by the following equation (1):

$$\mathrm{Att}(Q^i, K^i, V^i) = \mathrm{Softmax}\left(\frac{Q^i \cdot (K^i)^T}{\sqrt{d}}\right) \cdot V^i \tag{1}$$

where Softmax(·) represents a normalization function and · represents a vector dot product. As shown in equation (1), a dot product operation is performed on the first feature vector $Q^i$ and the second feature vector $K^i$ to obtain the dot product result $Q^i \cdot (K^i)^T$; the dot product result is normalized using the channel number $d$ and the normalization function Softmax(·) to obtain the attention feature map $\mathrm{Softmax}\left(Q^i \cdot (K^i)^T / \sqrt{d}\right)$ corresponding to the ith feature enhancement; and a dot product operation is performed on this attention feature map and the third feature vector $V^i$ to obtain the output feature $\mathrm{Att}(Q^i, K^i, V^i)$ corresponding to the ith feature enhancement.
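The computation just described can be sketched in plain Python as follows. This is a hedged illustration rather than the disclosed implementation: it assumes the conventional scaled dot-product form in which the dot product result is divided by the square root of the channel number d before Softmax, and it uses the input features directly as Q, K, and V because the disclosure does not state how the three feature vectors are derived. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Row-wise Softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_map(q, k):
    """Attention feature map: Softmax(Q K^T / sqrt(d)) (assumed scaling)."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    return softmax_rows([[s / math.sqrt(d) for s in row] for row in scores])


def attend(q, k, v):
    """Output features: the attention feature map applied to V."""
    return matmul(attention_map(q, k), v)


# L = 3 input features with d = 2 channels; Q = K = V = x is an assumption.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
a_map = attention_map(x, x)
output = attend(x, x, x)
```

The attention feature map has shape L x L with each row summing to 1, and the output keeps the L x d shape of the input.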

In a possible implementation manner, in a case where i satisfies a preset condition, the image processing method further includes: determining the output features corresponding to the ith feature enhancement as the input features corresponding to the (i+1)th feature enhancement; determining the attention feature map corresponding to the ith feature enhancement as the attention feature map corresponding to the (i+1)th feature enhancement; and performing feature enhancement on the input features corresponding to the (i+1)th feature enhancement by using the attention feature map corresponding to the (i+1)th feature enhancement to obtain the output features corresponding to the (i+1)th feature enhancement.

In the case that i satisfies the preset condition, the attention feature map corresponding to the ith feature enhancement can be reused in the (i+1)th feature enhancement, so that the amount of computation can be reduced and the feature enhancement efficiency can be effectively improved.

In one example, every k adjacent feature enhancements among the n feature enhancements may be divided into a group, with the k feature enhancements in the same group sharing the same attention feature map. In this case, the preset condition may be that i is not equal to mk, where k is a positive integer greater than or equal to 1, and m is an integer greater than or equal to 0.

For example, if n is 9 and k is 3, then among the 9 feature enhancements, every 3 feature enhancements are divided into one group, and the 3 feature enhancements in the same group share the same attention feature map. In this case, the 1st to 3rd feature enhancements share the same attention feature map (for example, the 2nd and 3rd feature enhancements share the attention feature map corresponding to the 1st feature enhancement), the 4th to 6th feature enhancements share the same attention feature map (for example, the 5th and 6th feature enhancements share the attention feature map corresponding to the 4th feature enhancement), and the 7th to 9th feature enhancements share the same attention feature map (for example, the 8th and 9th feature enhancements share the attention feature map corresponding to the 7th feature enhancement). Therefore, in the case where i is less than or equal to 9 and i is not equal to 3m (m is equal to 0, 1, 2; that is, i is not equal to 0, 3, 6), the (i+1)th feature enhancement can reuse the attention feature map corresponding to the ith feature enhancement; and in the case that i is equal to 3m (m is equal to 0, 1, 2; that is, i is equal to 0, 3, 6), the (i+1)th feature enhancement needs to determine its attention feature map according to the input features corresponding to the (i+1)th feature enhancement.
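The grouping rule in this example can be checked with a short sketch. The function name `sharing_schedule` and the labels "compute" and "reuse" are hypothetical; the rule itself restates the preset condition above, in which the (i+1)th feature enhancement reuses the attention feature map of the ith feature enhancement whenever i is not a multiple of k.

```python
def sharing_schedule(n, k):
    """For each feature enhancement j = 1 .. n, report whether it computes
    a fresh attention feature map or reuses the previous one.

    With i = j - 1, the jth enhancement reuses the map when i != m * k,
    and computes a new map when i is a multiple of k (including i = 0)."""
    return ["compute" if (j - 1) % k == 0 else "reuse"
            for j in range(1, n + 1)]


schedule = sharing_schedule(9, 3)
fresh_steps = [j for j, action in enumerate(schedule, start=1)
               if action == "compute"]
```

For n = 9 and k = 3, `fresh_steps` contains the 1st, 4th, and 7th feature enhancements, so only three attention feature maps are computed for nine feature enhancements.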

In one example, the preset condition may directly specify the particular feature enhancements that share the same attention feature map. For example, when the preset condition is that i is 2 or 5, the (i+1)-th feature enhancement may share the attention feature map corresponding to the i-th feature enhancement. The preset condition may also be set in other forms according to actual situations, which is not specifically limited in the present disclosure.

In a possible implementation manner, performing feature enhancement on the input features corresponding to the (i+1)-th feature enhancement by using the attention feature map corresponding to the (i+1)-th feature enhancement, to obtain the output features corresponding to the (i+1)-th feature enhancement, includes: determining a fourth feature vector according to the input features corresponding to the (i+1)-th feature enhancement; and determining the output features corresponding to the (i+1)-th feature enhancement according to the attention feature map corresponding to the (i+1)-th feature enhancement and the fourth feature vector.

In the related art, the attention feature map needs to be determined every time feature enhancement is performed. In contrast, here the attention feature map generated in the previous feature enhancement is reused, so it does not need to be determined separately in the current feature enhancement. This reduces the amount of calculation and improves the feature enhancement efficiency.

Taking the above formula (1) as an example, after the output feature Att(Qⁱ, Kⁱ, Vⁱ) corresponding to the i-th feature enhancement is obtained by using formula (1), that output feature is determined as the input feature corresponding to the (i+1)-th feature enhancement. When i satisfies the preset condition, the (i+1)-th feature enhancement may share the attention feature map corresponding to the i-th feature enhancement, and thus the attention feature map corresponding to the i-th feature enhancement is determined as the attention feature map corresponding to the (i+1)-th feature enhancement. In this case, the input feature Att(Qⁱ, Kⁱ, Vⁱ) corresponding to the (i+1)-th feature enhancement is converted only into the fourth feature vector Vⁱ⁺¹, and a dot product operation is then performed on the attention feature map corresponding to the (i+1)-th feature enhancement and the fourth feature vector Vⁱ⁺¹ to determine the output feature Att(Qⁱ, Kⁱ, Vⁱ⁺¹) corresponding to the (i+1)-th feature enhancement.

By sharing the same attention feature map in the process of at least two times of feature enhancement, the calculation amount can be reduced in the process of n times of feature enhancement, and therefore the feature enhancement efficiency is effectively improved.
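
The sharing of an attention feature map between two consecutive feature enhancements can be illustrated with a minimal pure-Python sketch. It assumes that formula (1) is the standard scaled dot-product attention, softmax(QKᵀ/√d)·V; that form, the helper names, and the toy projection weights are assumptions made for illustration and are not taken from the disclosure.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Apply a numerically stable softmax to every row of m."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_map(q, k):
    """Assumed form of the attention feature map: softmax(Q K^T / sqrt(d))."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    return softmax_rows([[v / math.sqrt(d) for v in row] for row in scores])


# Toy input: 3 image block features with 2 channels each.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
w_q, w_k, w_v = identity, identity, identity   # hypothetical projection weights
w_v_next = [[0.5, 0.0], [0.0, 2.0]]            # hypothetical weights of the next layer

# i-th enhancement: compute Q, K, V and the attention feature map.
attn = attention_map(matmul(x, w_q), matmul(x, w_k))
out_i = matmul(attn, matmul(x, w_v))

# (i+1)-th enhancement: only V is computed; the attention feature map is reused.
out_next = matmul(attn, matmul(out_i, w_v_next))

print(len(out_next), len(out_next[0]))  # 3 2
```

In the second step the query and key projections and the softmax are skipped entirely, which is where the saving in calculation comes from.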

In one example, when i does not satisfy the preset condition (for example, when i is equal to 0, 3, or 6), the (i+1)-th feature enhancement cannot reuse the attention feature map corresponding to the i-th feature enhancement. In this case, for the (i+1)-th feature enhancement, the attention feature map corresponding to the (i+1)-th feature enhancement may be determined and the feature enhancement may be implemented in the manner shown in the above formula (1). For the specific process, reference may be made to the description of formula (1), which is not repeated here.

In a possible implementation manner, performing feature pooling on the plurality of second image block features to obtain a plurality of third image block features includes: performing convolution processing on the plurality of second image block features to obtain a plurality of fourth image block features, wherein the number of the fourth image block features is the same as that of the second image block features, and the number of channels of the fourth image block features is greater than that of the channels of the second image block features; and performing pooling processing on the plurality of fourth image block characteristics to obtain a plurality of third image block characteristics.

Through the feature pooling, the number of the image block features is reduced, and the number of channels of the image block features is increased, so that not only can the spatial redundancy features be reduced, but also the semantic expression capability of the image block features can be improved.

Still taking the above-mentioned L second image block features, each with d channels, as an example, a convolution kernel of smaller size may be used to perform one-dimensional convolution processing on the L second image block features to obtain L fourth image block features, each with 2d channels, thereby increasing the number of channels of the image block features. One-dimensional max pooling is then performed on the L fourth image block features to obtain L/4 third image block features, each still with 2d channels, thereby reducing the number of image block features.

The pooling processing may be the one-dimensional max pooling described above, one-dimensional average pooling, or another pooling method, which is not specifically limited in the present disclosure.
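
The two-step feature pooling described above (convolution to raise the channel count, then pooling to reduce the number of features) can be sketched as follows. The sketch assumes a convolution kernel of size 1 and non-overlapping max pooling with a window of 4; the function names, the toy features, and the weights are hypothetical.

```python
def pointwise_conv1d(features, weight):
    """1-D convolution with kernel size 1 (an assumed 'small' kernel).

    features: L rows of d channels; weight: out_channels rows of d values.
    Returns L rows of out_channels channels.
    """
    return [[sum(w * v for w, v in zip(w_row, f)) for w_row in weight] for f in features]


def max_pool1d(features, window):
    """Non-overlapping 1-D max pooling along the sequence axis, per channel."""
    pooled = []
    for start in range(0, len(features) - window + 1, window):
        chunk = features[start:start + window]
        pooled.append([max(col) for col in zip(*chunk)])
    return pooled


L, d = 8, 2
second = [[float(i), float(-i)] for i in range(L)]            # L x d
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]    # 2d x d, hypothetical

fourth = pointwise_conv1d(second, weight)   # L x 2d: same count, more channels
third = max_pool1d(fourth, window=4)        # L/4 x 2d: fewer features, same channels

print(len(fourth), len(fourth[0]))  # 8 4
print(len(third), len(third[0]))    # 2 4
```

Replacing `max(col)` with an average over `col` would give the one-dimensional average pooling variant.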

In the embodiment of the present disclosure, the above-mentioned n feature enhancements and the feature pooling may be iterated a plurality of times, finally yielding a plurality of third image block features whose number is reduced and whose number of channels is increased, so that the image processing operation on the target image may be completed by using the plurality of third image block features. The number of iterations of the n feature enhancements and the feature pooling can be determined according to actual conditions, which is not specifically limited in the present disclosure.

In one possible implementation, the target image processing operation includes one of: target detection, target tracking, image recognition and image classification.

Performing the target image processing operation by using the plurality of third image block features, which have a higher semantic expression capability, makes it possible to obtain an image processing result with higher precision.

For example, the target image is subjected to image classification according to the characteristics of the plurality of third image blocks, and an image classification result with higher classification accuracy corresponding to the target image is obtained. The target image processing operation may include other image processing operations according to actual image processing requirements, besides the above-mentioned target detection, target tracking, image recognition, and image classification, which is not specifically limited by the present disclosure.

In one possible implementation manner, the image processing method is implemented by a self-attention neural network, and the self-attention neural network includes a self-attention module and a feature pooling module. Performing feature enhancement n times based on a self-attention mechanism according to the plurality of first image block features to obtain a plurality of second image block features includes: performing, by the self-attention module, feature enhancement n times based on the self-attention mechanism according to the plurality of first image block features to obtain the plurality of second image block features. Performing feature pooling on the plurality of second image block features to obtain a plurality of third image block features includes: performing, by the feature pooling module, feature pooling on the plurality of second image block features to obtain the plurality of third image block features.

The self-attention module and the feature pooling module are arranged in the self-attention neural network, so that the feature strengthening of the image block features can be realized by the self-attention module, and the feature pooling module is further used for performing feature pooling on the image block features after the feature strengthening, so that the number of the image block features is reduced, the number of channels of the image block features is increased, the spatial redundancy features can be reduced, the semantic expression capability of the image block features can be improved, and the network performance of the self-attention neural network is effectively improved.

Fig. 3 illustrates a network architecture diagram of a self-attention neural network according to an embodiment of the present disclosure. As shown in fig. 3, the self-attention neural network includes three self-attention modules (self-attention modules A, B, and C) and two feature pooling modules (feature pooling modules D and E). The target image is divided into L image blocks, and feature extraction is performed on each of the L image blocks to obtain an image block feature sequence (with dimension L × d) including L first image block features. The number of self-attention modules and feature pooling modules included in the self-attention neural network may be set according to actual situations, which is not specifically limited in the present disclosure.

The L × d image block feature sequence is input into self-attention module A, which performs N₁ feature enhancements on it to obtain a feature-enhanced L × d image block feature sequence. The feature-enhanced L × d image block feature sequence is then input into feature pooling module D, which performs feature pooling on it to obtain a feature-pooled L/2 × 1.5d image block feature sequence. The specific value of N₁, and the number of image block features and the number of channels after feature pooling, can be set according to actual conditions, which is not specifically limited in the present disclosure.

The feature-pooled L/2 × 1.5d image block feature sequence is input into self-attention module B, which performs N₂ feature enhancements on it to obtain a feature-enhanced L/2 × 1.5d image block feature sequence. The feature-enhanced L/2 × 1.5d image block feature sequence is then input into feature pooling module E, which performs feature pooling on it to obtain a feature-pooled L/4 × 2d image block feature sequence. The specific value of N₂, and the number of image block features and the number of channels after feature pooling, can be set according to actual conditions, which is not specifically limited in the present disclosure.

The feature-pooled L/4 × 2d image block feature sequence is input into self-attention module C, which performs N₃ feature enhancements on it to obtain a feature-enhanced L/4 × 2d image block feature sequence. This L/4 × 2d image block feature sequence is taken as the feature finally obtained by the self-attention neural network and is used for the subsequent target image processing operation, for example, as cls features for image classification. The specific value of N₃, and the number of image block features and the number of channels, can be set according to actual conditions, which is not specifically limited in the present disclosure.
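
The sequence shapes in the FIG. 3 example can be traced with a short sketch. The concrete values of L and d, and the helper name `pooled_shape`, are hypothetical; the scaling ratios (L to L/2 to L/4, and d to 1.5d to 2d) are the ones given in the example.

```python
from fractions import Fraction


def pooled_shape(shape, length_divisor, channel_factor):
    """Return the (number of features, channels) after one feature pooling module."""
    num_features, channels = shape
    return num_features // length_divisor, int(channels * channel_factor)


L, d = 196, 256                      # hypothetical sizes chosen for illustration
stage_a = (L, d)                                        # self-attention module A
stage_b = pooled_shape(stage_a, 2, Fraction(3, 2))      # after pooling module D
stage_c = pooled_shape(stage_b, 2, Fraction(4, 3))      # after pooling module E

print(stage_a)  # (196, 256)  -> L x d
print(stage_b)  # (98, 384)   -> L/2 x 1.5d
print(stage_c)  # (49, 512)   -> L/4 x 2d
```

The self-attention modules themselves leave the shape unchanged, so only the two pooling modules appear in the trace.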

In one possible implementation, the self-attention module includes a plurality of self-attention layers, wherein each self-attention layer is used for performing feature enhancement once, and at least two adjacent self-attention layers share the same attention feature map; and/or the feature pooling module comprises a convolution layer and a maximum pooling layer, wherein the convolution kernel size corresponding to the convolution layer is smaller than a threshold value.

By sharing the same attention feature map in at least two adjacent self-attention layers, the self-attention module can reduce the calculation amount, thereby effectively improving the feature enhancement efficiency.

Still taking FIG. 3 as an example, as shown in FIG. 3, self-attention module A includes N₁ self-attention layers (N₁ feature enhancements can be performed by using the N₁ self-attention layers), self-attention module B includes N₂ self-attention layers (N₂ feature enhancements can be performed by using the N₂ self-attention layers), and self-attention module C includes N₃ self-attention layers (N₃ feature enhancements can be performed by using the N₃ self-attention layers). For at least one self-attention module, at least two adjacent self-attention layers in the self-attention module share the same attention feature map. For example, among the N₁ self-attention layers in self-attention module A, every two adjacent self-attention layers are taken as a group, and the two self-attention layers in each group share the same attention feature map. The number and the position distribution of the adjacent self-attention layers sharing the same attention feature map can be determined according to actual situations, which is not specifically limited in the present disclosure.

FIG. 4 illustrates a schematic diagram of a self-attention module in a self-attention neural network according to an embodiment of the present disclosure. As shown in fig. 4, one self-attention module includes a plurality of self-attention layers. For any self-attention module, at least two adjacent self-attention layers in the self-attention module share the same attention feature map. For example, the i-th self-attention layer corresponds to the i-th feature enhancement. The i-th self-attention layer converts its input feature (i.e., the input feature corresponding to the i-th feature enhancement) into three different feature vectors: a first feature vector Qⁱ, a second feature vector Kⁱ, and a third feature vector Vⁱ. According to the first feature vector Qⁱ and the second feature vector Kⁱ, the attention feature map corresponding to the i-th self-attention layer (i.e., corresponding to the i-th feature enhancement) is determined by using formula (1). A dot product is then performed on the attention feature map corresponding to the i-th self-attention layer and the third feature vector Vⁱ to obtain the output feature Att(Qⁱ, Kⁱ, Vⁱ) corresponding to the i-th self-attention layer (i.e., corresponding to the i-th feature enhancement).

Since the i-th self-attention layer and the (i+1)-th self-attention layer share the same attention feature map, the output feature Att(Qⁱ, Kⁱ, Vⁱ) corresponding to the i-th self-attention layer and the attention feature map corresponding to the i-th self-attention layer are both input into the (i+1)-th self-attention layer. The (i+1)-th self-attention layer converts the input feature Att(Qⁱ, Kⁱ, Vⁱ) into only one fourth feature vector Vⁱ⁺¹, and directly performs a dot product on the input attention feature map and the fourth feature vector Vⁱ⁺¹ to obtain the output feature corresponding to the (i+1)-th self-attention layer (i.e., corresponding to the (i+1)-th feature enhancement). In this way, the amount of calculation in the (i+1)-th feature enhancement is reduced, computational redundancy is reduced, and the network performance of the self-attention neural network is effectively improved.

Further, if the same attention feature map is shared only between two adjacent self-attention layers, then in the above example the attention feature map corresponding to the i-th self-attention layer is generated in the i-th self-attention layer and is directly shared by the (i+1)-th self-attention layer; the attention feature map corresponding to the (i+2)-th self-attention layer is generated in the (i+2)-th self-attention layer and is directly shared by the (i+3)-th self-attention layer; and so on.

The number of self-attention layers included in the self-attention module, and the positions and the number of self-attention layers that need to share the same attention feature map may be determined according to actual situations, and this disclosure does not specifically limit this.

In one example, the self-attention module includes 6 self-attention layers, wherein each two adjacent self-attention layers are in a group and share the same attention feature map. That is, the same attention feature map is shared between the 1 st and 2 nd self-attention layers (sharing the attention feature map generated in the 1 st self-attention layer), the same attention feature map is shared between the 3 rd and 4 th self-attention layers (sharing the attention feature map generated in the 3 rd self-attention layer), and the same attention feature map is shared between the 5 th and 6 th self-attention layers (sharing the attention feature map generated in the 5 th self-attention layer).
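
The pairing in this example can be expressed as a mapping from each self-attention layer to the layer whose attention feature map it uses. The function name `attention_source_layers` is hypothetical, and the sketch assumes equally sized groups of adjacent layers.

```python
def attention_source_layers(num_layers: int, group_size: int) -> list:
    """For layers 1..num_layers, return the layer that generates the map each one uses.

    Layers are grouped into consecutive blocks of group_size; the first layer
    of every block generates the attention feature map for the whole block.
    """
    return [((layer - 1) // group_size) * group_size + 1
            for layer in range(1, num_layers + 1)]


sources = attention_source_layers(num_layers=6, group_size=2)
print(sources)  # [1, 1, 3, 3, 5, 5]

# A layer computes its own attention feature map only when it is its own source.
generating = [layer for layer, src in enumerate(sources, start=1) if layer == src]
print(generating)  # [1, 3, 5]
```

Irregular sharing patterns, where only some layers share a map, would need an explicit table instead of this closed-form rule.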

In another example, the self-attention module includes 6 self-attention layers, wherein the same attention feature map is shared only among the 3rd, 4th, and 5th self-attention layers (sharing the attention feature map generated in the 3rd self-attention layer), and the other self-attention layers are independent of each other.

The number of channels of the image block features can be increased by using the convolution layer with the convolution kernel size smaller than the threshold value, and then the number of the image block features can be reduced by using the maximum pooling layer, so that the spatial redundancy features are effectively reduced, and the semantic expression capability of the image block features is improved. The specific value of the threshold may be determined according to actual conditions, and is not specifically limited by the present disclosure.

Still taking the above fig. 3 as an example, the feature pooling module is arranged between adjacent self-attention modules, so that the feature pooling module is utilized to perform dimension reduction on the spatial dimension and dimension increase on the channel dimension on the image block feature as the network depth of the self-attention neural network increases, thereby effectively reducing the spatial redundancy feature and improving the network performance of the self-attention neural network under the condition of keeping the calculated amount unchanged.

In one possible implementation, the self-attention neural network may be a vision Transformer. The vision Transformer based on feature pooling and attention sharing of the embodiment of the present disclosure is constructed by adding feature pooling modules to a vision Transformer of the related art, defining the plurality of self-attention layers between feature pooling modules as one self-attention module, and sharing attention (sharing the same attention feature map) between at least two adjacent self-attention layers.

When constructing the self-attention neural network, the number of self-attention modules, the number of self-attention layers in the self-attention modules, the number of image block features and the number of channels corresponding to each attention layer, the number and the position distribution of adjacent self-attention layers that need to share the same attention feature map, and the like are network hyper-parameters that need to be considered.

In one possible implementation, the image processing method further includes: constructing a network structure search space, wherein the network structure search space includes a plurality of network hyper-parameters corresponding to the self-attention neural network; constructing a super network according to the network structure search space, wherein the super network includes a plurality of selectable network structures constructed according to the plurality of network hyper-parameters; determining a target network structure from the plurality of selectable network structures by performing network training on the super network; and constructing the self-attention neural network according to the target network structure.

Based on a plurality of network hyper-parameters corresponding to the self-attention neural network, a network structure search space can be constructed to realize the construction of the super-network by using the search space, and the search of the target network structure and the construction of the self-attention neural network based on the target network structure obtained by the search are realized through the training of the super-network, so that the manual design of the network hyper-parameters and the network structure can be avoided, the automatic construction of the self-attention neural network is realized, and the network construction efficiency is effectively improved.

In one possible implementation, the plurality of network hyper-parameters comprises: the number of image block features, the number of image block feature channels, the number of layers corresponding to the self-attention module, and the positions of at least two adjacent self-attention layers in the self-attention module, which need to share the same attention feature map.

In one example, if the number of image block features corresponding to each self-attention layer has S_t options, the number of image block feature channels has S_f options, the self-attention neural network has L self-attention layers, and the manner in which a self-attention layer uses the attention feature map (i.e., the position of the self-attention layers that need to share the same attention feature map) has S_s options, then a network structure search space containing (S_t × S_f × S_s)^L selectable network structures can be constructed. For example, in the case of S_t = 4, S_f = 4, S_s = 4, and L = 36, the network structure search space includes about 1.1 × 10⁶⁵ selectable network structures. Such a search space is too large, which results in low search efficiency.
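
The size quoted in this example can be checked directly with Python's arbitrary-precision integers. The variable names are chosen here for illustration.

```python
# Number of options per self-attention layer, as given in the example.
s_t, s_f, s_s = 4, 4, 4
num_layers = 36

# Each layer independently picks one of s_t * s_f * s_s configurations.
search_space_size = (s_t * s_f * s_s) ** num_layers

print(search_space_size == 2 ** 216)   # True, since 64 = 2**6 and 6 * 36 = 216
print(f"{search_space_size:.1e}")      # 1.1e+65
```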

In one possible implementation, the respective attention layers included in the same self-attention module correspond to the same number of image block features and the same number of channels.

The principles for constructing the self-attention neural network may include: 1) the number of self-attention modules is limited (e.g., 3 self-attention modules); 2) the self-attention layers included in the same self-attention module correspond to the same number of image block features and the same number of image block feature channels; 3) as the network depth increases, the number of image block features corresponding to each self-attention module decreases and the number of image block feature channels increases.

Based on the construction principle of the self-attention neural network, the selectable network structures which are included in the super network and do not conform to the construction principle are deleted, so that the size of the search space of the network structure can be reduced, and the subsequent search efficiency of the target network structure is improved.

After the network structure search space is reduced based on the construction principles of the self-attention neural network, the super network includes only selectable network structures that conform to those principles. The super network is then trained based on the Single Path One-Shot (SPOS) algorithm to obtain the target network structure for constructing the self-attention neural network.

The SPOS algorithm selects only one selectable network structure in each training iteration and updates the network parameters of the selected structure in the super network. When another selectable network structure is selected for a later training iteration, it inherits the already trained network parameters from the super network and continues to update them rather than being trained from scratch. This effectively improves the training efficiency of the super network, so that the target network structure for constructing the self-attention neural network can be found quickly.
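
The single-path training idea can be illustrated with a toy sketch. It is not the SPOS implementation: the "weights" are single numbers, the gradient step is replaced by a fixed increment, and uniform random sampling of the path is an assumption of this sketch. All names are hypothetical.

```python
import random


def train_supernet(choices_per_layer, iterations, seed=0):
    """Toy single-path training loop over a super network with shared weights."""
    rng = random.Random(seed)
    # One shared parameter per (layer, candidate); a stand-in for real weights.
    weights = {(layer, c): 0.0
               for layer, n in enumerate(choices_per_layer) for c in range(n)}
    updates = {key: 0 for key in weights}

    for _ in range(iterations):
        # Select exactly one candidate per layer: a single path through the super network.
        path = [rng.randrange(n) for n in choices_per_layer]
        for layer, choice in enumerate(path):
            # Placeholder for a gradient step; only the sampled candidate changes,
            # and it continues from whatever value earlier iterations left behind.
            weights[(layer, choice)] += 0.1
            updates[(layer, choice)] += 1
    return weights, updates


choices = [2, 3, 2]   # hypothetical number of candidates in each of 3 layers
weights, updates = train_supernet(choices, iterations=50)

# Every iteration touches exactly one candidate per layer.
per_layer = [sum(updates[(layer, c)] for c in range(n)) for layer, n in enumerate(choices)]
print(per_layer)  # [50, 50, 50]
```

Because the parameters live in the super network rather than in any one path, a path sampled later starts from the values left by earlier iterations, which is the inheritance described above.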

A self-attention neural network can be constructed based on the target network architecture, and the self-attention neural network comprises a self-attention module and a feature pooling module.

The self-attention module comprises a plurality of self-attention layers, and the attention layers in the same self-attention module correspond to the same number of image block features and the same number of image block feature channels. The self-attention module is used for performing feature enhancement on the image block features based on a self-attention mechanism. The feature enhancement process of the self-attention module is similar to the related feature enhancement process described above, and is not described herein again. The same attention feature map is shared between at least two adjacent self-attention layers in the self-attention module, so that the calculated amount of the feature enhancement process is reduced, and the feature enhancement efficiency is effectively improved.

The feature pooling module is used for performing space dimension reduction and channel dimension increasing on the image block features after feature strengthening along with the increase of the depth of the self-attention neural network, so that under the condition of keeping the calculated amount unchanged, space redundancy features are reduced, the semantic representation capability of the image block features is improved, and further the network performance of the self-attention neural network is effectively improved.

According to the image processing requirement, the self-attention neural network disclosed by the disclosure can be applied to image processing tasks such as target detection, target tracking, image recognition, image classification and the like, and the disclosure does not specifically limit the tasks.

It can be understood that the above-mentioned method embodiments of the present disclosure can be combined with each other to form combined embodiments without departing from the underlying principles and logic; due to space limitations, the details are not repeated in the present disclosure. Those skilled in the art will appreciate that, in the above methods of the specific embodiments, the specific order of execution of the steps should be determined by their functions and possible inherent logic.

In addition, the present disclosure also provides an image processing apparatus, an electronic device, a computer-readable storage medium, and a program, each of which can be used to implement any one of the image processing methods provided by the present disclosure. For the corresponding technical solutions and descriptions, reference may be made to the corresponding descriptions in the method section, which are not repeated here for brevity.

Fig. 5 shows a block diagram of an image processing apparatus according to an embodiment of the present disclosure. As shown in fig. 5, the apparatus 50 includes:

a feature determining module 51, configured to determine a plurality of first image block features corresponding to a target image;

the self-attention module 52 is configured to perform feature enhancement n times based on a self-attention mechanism according to the plurality of first image block features to obtain a plurality of second image block features, where the number of second image block features and their number of channels are the same as the number of first image block features and their number of channels, and n is an integer greater than or equal to 1;

the feature pooling module 53 is configured to perform feature pooling on the plurality of second image block features to obtain a plurality of third image block features, where the number of the third image block features is smaller than that of the second image block features, and the number of channels of the third image block features is greater than that of the channels of the second image block features;

and the target image processing module 54 is configured to perform a target image processing operation on the target image according to the characteristics of the plurality of third image blocks to obtain an image processing result.

In one possible implementation, the self-attention module 52 includes:

the ith self-attention submodule is used for performing feature enhancement on the input features corresponding to the ith feature enhancement to obtain output features corresponding to the ith feature enhancement based on a self-attention mechanism, wherein i is an integer smaller than or equal to n;

the nth determining submodule is used for determining output characteristics corresponding to the ith feature enhancement as a plurality of second image block characteristics under the condition that i is equal to n;

under the condition that i is equal to 1, the input features corresponding to the 1 st feature enhancement are a plurality of first image block features; and in the case that i is larger than 1, the input feature corresponding to the ith feature enhancement is the output feature corresponding to the i-1 th feature enhancement.

In a possible implementation manner, the ith self-attention submodule includes:

the first determining unit is used for determining a first feature vector, a second feature vector and a third feature vector according to the input features corresponding to the ith feature enhancement;

the second determining unit is used for determining an attention feature map corresponding to the ith feature enhancement according to the first feature vector and the second feature vector;

and the third determining unit is used for determining the output characteristics corresponding to the ith feature enhancement according to the attention feature map and the third feature vector corresponding to the ith feature enhancement.

In a possible implementation manner, in a case that i satisfies a preset condition, the apparatus 50 further includes: the (i + 1) th self-attention submodule; an i +1 th self-attention submodule comprising:

the fourth determining unit is used for determining the output characteristic corresponding to the ith characteristic enhancement as the input characteristic corresponding to the (i + 1) th characteristic enhancement;

the fifth determining unit is used for determining the attention feature map corresponding to the ith feature enhancement as the attention feature map corresponding to the (i + 1) th feature enhancement;

and the sixth determining unit is used for performing feature enhancement on the input features corresponding to the i +1 th feature enhancement by using the attention feature map corresponding to the i +1 th feature enhancement to obtain the output features corresponding to the i +1 th feature enhancement.

In a possible implementation manner, the sixth determining unit is specifically configured to:

determining a fourth feature vector according to the input features corresponding to the i +1 th feature enhancement;

and determining the output characteristic corresponding to the i +1 th feature enhancement according to the attention characteristic diagram corresponding to the i +1 th feature enhancement and the fourth feature vector.
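The reuse of an attention feature map can be sketched as follows. The sketch assumes that the fourth feature vector is a linear projection of the input features of the (i+1)th enhancement (analogous to a value projection); that choice and the names `enhance_with_shared_map` and `w_v_next` are hypothetical. The point shown is that no new attention feature map is computed for the (i+1)th enhancement.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def enhance_with_shared_map(x_next: Matrix, shared_map: Matrix, w_v_next: Matrix) -> Matrix:
    """(i+1)th enhancement that reuses the attention map of the ith enhancement.

    Only the fourth feature vector is computed from the new input; the
    attention feature map is taken as given.
    """
    fourth = matmul(x_next, w_v_next)   # fourth feature vector (assumed projection)
    return matmul(shared_map, fourth)   # weight the fourth feature vector by the shared map


shared_map = [[0.5, 0.5], [0.0, 1.0]]   # attention map from the ith enhancement
x_next = [[2.0, 4.0], [6.0, 8.0]]       # output of the ith enhancement
identity = [[1.0, 0.0], [0.0, 1.0]]     # hypothetical projection weights
output_next = enhance_with_shared_map(x_next, shared_map, identity)
print(output_next)  # [[4.0, 6.0], [6.0, 8.0]]
```

Skipping the computation of the first and second feature vectors and of their pairwise products is where the saving of the shared map comes from in this sketch.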

In one possible implementation, the feature pooling module 53 includes:

the convolution sub-module is used for performing convolution processing on the plurality of second image block features to obtain a plurality of fourth image block features, wherein the number of the fourth image block features is the same as that of the second image block features, and the number of channels of the fourth image block features is greater than that of the channels of the second image block features;

and the pooling sub-module is used for pooling the plurality of fourth image block features to obtain a plurality of third image block features.
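The two sub-modules can be sketched in pure Python under simplifying assumptions: the image block features are treated as a one-dimensional sequence, the convolution has kernel size 1 (a per-feature linear map that raises the channel count), and the pooling is a maximum over non-overlapping windows of two features. The kernel size, window size and function names are illustrative assumptions, not values taken from the disclosure.

```python
from typing import List

Matrix = List[List[float]]


def pointwise_conv(features: Matrix, weights: Matrix) -> Matrix:
    """Kernel-size-1 convolution: maps C channels to C' channels per feature.

    The number of features is unchanged; weights has shape C x C'.
    """
    return [[sum(v * w for v, w in zip(row, col)) for col in zip(*weights)]
            for row in features]


def max_pool_features(features: Matrix, window: int) -> Matrix:
    """Channel-wise maximum over non-overlapping groups of `window` features."""
    pooled = []
    for start in range(0, len(features) - window + 1, window):
        group = features[start:start + window]
        pooled.append([max(channel) for channel in zip(*group)])
    return pooled


second = [[1.0, 2.0], [3.0, 0.0], [0.0, 5.0], [2.0, 2.0]]  # 4 features, 2 channels
weights = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]               # 2 -> 3 channels
fourth = pointwise_conv(second, weights)                    # 4 features, 3 channels
third = max_pool_features(fourth, window=2)                 # 2 features, 3 channels
print(third)  # [[3.0, 2.0, 3.0], [2.0, 5.0, 5.0]]
```

In this example the fourth image block features keep the count of the second image block features while gaining a channel, and the third image block features are fewer in number than the second while having more channels.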

In one possible implementation, the apparatus 50 is implemented by a self-attention neural network that includes a self-attention module 52 and a feature pooling module 53.

In one possible implementation, the self-attention module 52 includes n self-attention layers, where each self-attention layer is used for performing a feature enhancement, and at least two adjacent self-attention layers share the same attention feature map; and/or, the feature pooling module 53 includes convolution layers and a maximum pooling layer, wherein convolution kernel sizes corresponding to the convolution layers are smaller than a threshold.

In one possible implementation, the apparatus 50 further includes:

the search space construction module is used for constructing a network structure search space, wherein the network structure search space comprises a plurality of network hyper-parameters corresponding to the self-attention neural network;

the super network construction module is used for constructing a super network according to the network structure search space, wherein the network structure search space comprises a plurality of selectable network structures constructed according to the plurality of network hyper-parameters;

the network training module is used for determining a target network structure from a plurality of selectable network structures by performing network training on the super network;

and the self-attention neural network construction module is used for constructing the self-attention neural network according to the target network structure.

In one possible implementation, the plurality of network hyper-parameters include: the image block feature number parameter, the image block feature channel number parameter, the layer number parameter corresponding to the self-attention module, and the position parameters of the at least two adjacent self-attention layers in the self-attention module that need to share the same attention feature map.

In one possible implementation, the self-attention layers included in the same self-attention module correspond to the same image block feature number parameter value and the same image block feature channel number parameter value.
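A minimal sketch of how such a search space might be enumerated for a single self-attention module is given below. The class `StageConfig`, the function `build_search_space` and all candidate values are hypothetical; the sketch covers only the construction of selectable structures from the four kinds of hyper-parameters and does not model the super network or its training.

```python
import itertools
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class StageConfig:
    """One selectable structure for a single self-attention module."""
    num_block_features: int          # image block feature number parameter
    num_channels: int                # image block feature channel number parameter
    num_layers: int                  # layer number parameter of the module
    shared_after: Tuple[int, ...]    # 1-based layer i: layer i+1 reuses layer i's map


def build_search_space(
    feature_counts: Sequence[int],
    channel_counts: Sequence[int],
    layer_counts: Sequence[int],
) -> List[StageConfig]:
    """Enumerate every combination of the hyper-parameter values.

    All layers of one module share a single feature count and channel count,
    so each is represented by one value per configuration.
    """
    space = []
    for n_feat, n_chan, n_layers in itertools.product(
        feature_counts, channel_counts, layer_counts
    ):
        positions = range(1, n_layers)  # adjacent pairs (i, i+1) that may share a map
        for r in range(len(positions) + 1):
            for chosen in itertools.combinations(positions, r):
                space.append(StageConfig(n_feat, n_chan, n_layers, chosen))
    return space


# Hypothetical candidate values, chosen only for illustration.
space = build_search_space([196, 49], [64, 128], [2, 3])
print(len(space))  # 24
```

With 2 layers there is one adjacent pair (2 sharing choices) and with 3 layers there are two pairs (4 choices), giving 6 structures for each of the 4 combinations of feature count and channel count.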

In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.

Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-mentioned method. The computer readable storage medium may be a volatile or non-volatile computer readable storage medium.

An embodiment of the present disclosure further provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.

The disclosed embodiments also provide a computer program product comprising computer readable code or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, the processor in the electronic device performs the above method.

The electronic device may be provided as a terminal, server, or other form of device.

Fig. 6 illustrates a block diagram of an electronic device in accordance with an embodiment of the disclosure. As shown in fig. 6, the electronic device 800 may be a terminal such as a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, or a personal digital assistant.

Referring to fig. 6, electronic device 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.

The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.

The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

The power component 806 provides power to the various components of the electronic device 800. The power component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.

The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.

The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.

The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor component 814 includes one or more sensors for providing status assessments of various aspects of the electronic device 800. For example, the sensor component 814 may detect an open/closed state of the electronic device 800 and the relative positioning of components, such as a display and keypad of the electronic device 800. The sensor component 814 may also detect a change in the position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor component 814 may also include a light sensor, such as a Complementary Metal Oxide Semiconductor (CMOS) or Charge Coupled Device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as a wireless network (WiFi), a second generation mobile communication technology (2G) or a third generation mobile communication technology (3G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.

In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the electronic device 800 to perform the above-described methods.

Fig. 7 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure. As shown in fig. 7, the electronic device 1900 may be provided as a server. Referring to fig. 7, electronic device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.

The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as the Microsoft server operating system (Windows Server™), the graphical-user-interface-based operating system from Apple Inc. (Mac OS X™), the multi-user, multi-process computer operating system (Unix™), the free and open-source Unix-like operating system (Linux™), the open-source Unix-like operating system (FreeBSD™), or the like.

In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described methods.

The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.

The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.

The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.

The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, the electronic circuitry that can execute the computer-readable program instructions implements aspects of the present disclosure by utilizing the state information of the computer-readable program instructions to personalize the electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA).

Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The computer program product may be embodied in hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a software product, such as a Software Development Kit (SDK), or the like.

Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. An image processing method, comprising:

determining a plurality of first image block features corresponding to a target image;

performing n times of feature enhancement based on a self-attention mechanism according to the plurality of first image block features to obtain a plurality of second image block features, wherein the second image block features and the first image block features are the same in number and in number of channels, and n is an integer greater than or equal to 1;

performing feature pooling on the plurality of second image block features to obtain a plurality of third image block features, wherein the number of the third image block features is smaller than that of the second image block features, and the number of channels of the third image block features is larger than that of the second image block features;

and performing a target image processing operation on the target image according to the plurality of third image block features to obtain an image processing result.

2. The method according to claim 1, wherein the performing n times of feature enhancement based on a self-attention mechanism according to the plurality of first image block features to obtain a plurality of second image block features comprises:

based on a self-attention mechanism, performing feature enhancement on input features corresponding to ith feature enhancement to obtain output features corresponding to the ith feature enhancement, wherein i is an integer which is greater than or equal to 1 and less than or equal to n;

determining the output features corresponding to the ith feature enhancement as the plurality of second image block features in the case that i is equal to n;

wherein, in the case that i is equal to 1, the input features corresponding to the 1st feature enhancement are the plurality of first image block features; and in the case that i is greater than 1, the input features corresponding to the ith feature enhancement are the output features corresponding to the (i-1)th feature enhancement.

3. The method according to claim 2, wherein the performing feature enhancement on the input feature corresponding to the i-th feature enhancement based on the self-attention mechanism to obtain the output feature corresponding to the i-th feature enhancement comprises:

determining a first feature vector, a second feature vector and a third feature vector according to the input features corresponding to the ith feature enhancement;

determining an attention feature map corresponding to the ith feature enhancement according to the first feature vector and the second feature vector;

and determining the output feature corresponding to the ith feature enhancement according to the attention feature map and the third feature vector corresponding to the ith feature enhancement.

4. The method according to claim 3, wherein in case i satisfies a preset condition, the method further comprises:

determining the output features corresponding to the ith feature enhancement as the input features corresponding to the (i+1)th feature enhancement;

determining the attention feature map corresponding to the ith feature enhancement as the attention feature map corresponding to the (i+1)th feature enhancement;

and performing feature enhancement on the input features corresponding to the (i+1)th feature enhancement by using the attention feature map corresponding to the (i+1)th feature enhancement to obtain the output features corresponding to the (i+1)th feature enhancement.

5. The method according to claim 4, wherein the performing feature enhancement on the input features corresponding to the (i+1)th feature enhancement by using the attention feature map corresponding to the (i+1)th feature enhancement to obtain the output features corresponding to the (i+1)th feature enhancement comprises:

determining a fourth feature vector according to the input features corresponding to the (i+1)th feature enhancement;

and determining the output features corresponding to the (i+1)th feature enhancement according to the attention feature map corresponding to the (i+1)th feature enhancement and the fourth feature vector.

6. The method according to any one of claims 1 to 5, wherein the performing feature pooling on the plurality of second image block features to obtain a plurality of third image block features comprises:

performing convolution processing on the plurality of second image block features to obtain a plurality of fourth image block features, wherein the number of the fourth image block features is the same as that of the second image block features, and the number of channels of the fourth image block features is greater than that of the channels of the second image block features;

and performing pooling processing on the plurality of fourth image block features to obtain the plurality of third image block features.

7. The method according to any one of claims 1 to 6, wherein the image processing method is implemented by a self-attention neural network, and the self-attention neural network comprises a self-attention module and a feature pooling module;

the performing n times of feature enhancement based on a self-attention mechanism according to the plurality of first image block features to obtain a plurality of second image block features comprises:

performing, by using the self-attention module, n times of feature enhancement based on the self-attention mechanism according to the plurality of first image block features to obtain the plurality of second image block features;

the performing feature pooling on the plurality of second image block features to obtain a plurality of third image block features includes:

and performing feature pooling on the plurality of second image block features by using the feature pooling module to obtain a plurality of third image block features.

8. The method of claim 7, wherein the self-attention module comprises n self-attention layers, wherein each self-attention layer is used for performing one feature enhancement, and at least two adjacent self-attention layers share the same attention feature map; and/or,

the feature pooling module comprises convolution layers and a maximum pooling layer, wherein the sizes of convolution kernels corresponding to the convolution layers are smaller than a threshold value.

9. The method according to any one of claims 7 to 8, further comprising:

constructing a network structure search space, wherein the network structure search space comprises a plurality of network hyper-parameters corresponding to the self-attention neural network;

constructing a super network according to the network structure search space, wherein the network structure search space comprises a plurality of selectable network structures constructed according to the plurality of network hyper-parameters;

determining a target network structure from the plurality of selectable network structures by network training the super network;

and constructing the self-attention neural network according to the target network structure.

10. The method of claim 9, wherein the plurality of network hyper-parameters comprises: the image block feature number parameter, the image block feature channel number parameter, the layer number parameter corresponding to the self-attention module, and the position parameters of the at least two adjacent self-attention layers in the self-attention module that need to share the same attention feature map.

11. The method according to claim 10, wherein the self-attention layers included in the same self-attention module correspond to the same image block feature number parameter value and the same image block feature channel number parameter value.

12. An image processing apparatus, comprising:

the feature determining module is used for determining a plurality of first image block features corresponding to the target image;

the self-attention module is used for carrying out n times of feature enhancement based on a self-attention mechanism according to the first image block features to obtain a plurality of second image block features, wherein the number and the channel number of the second image block features and the first image block features are the same, and n is an integer greater than or equal to 1;

the feature pooling module is used for performing feature pooling on the plurality of second image block features to obtain a plurality of third image block features, wherein the number of the third image block features is smaller than that of the second image block features, and the number of channels of the third image block features is larger than that of the channels of the second image block features;

and the target image processing module is used for performing a target image processing operation on the target image according to the plurality of third image block features to obtain an image processing result.

13. An electronic device, comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to invoke the memory-stored instructions to perform the method of any of claims 1 to 11.

14. A computer readable storage medium having computer program instructions stored thereon, which when executed by a processor implement the method of any one of claims 1 to 11.