CN109996023B - Image processing method and device - Google Patents

Image processing method and device

Info

Publication number
CN109996023B
CN109996023B
Authority
CN
China
Prior art keywords
image
feature map
pixels
image processing
height
Prior art date
Legal status
Active
Application number
CN201711471002.9A
Other languages
Chinese (zh)
Other versions
CN109996023A (en)
Inventor
杨帆
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201711471002.9A
Priority to PCT/CN2018/120830 (published as WO2019128726A1)
Publication of CN109996023A
Application granted
Publication of CN109996023B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/01 Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level
    • H04N7/0117 Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level involving conversion of the spatial resolution of the incoming video signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Graphics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The application provides an image processing method and device. The method includes: acquiring a first image; performing B-fold downsampling processing on the first image through a first image processing layer of a convolutional neural network to obtain a second image, where the convolutional neural network includes a plurality of image processing layers, the plurality of image processing layers include the first image processing layer, and B is an integer greater than 1; and performing A-fold upsampling processing on the second image through a second image processing layer in the plurality of image processing layers to obtain a third image, where A is an integer greater than 1 and A is not equal to B. With the image processing method and device, non-integer-multiple upsampling or non-integer-multiple downsampling of an image can be realized.

Description

Image processing method and device
Technical Field
The present application relates to the field of image processing, and more particularly, to an image processing method and apparatus in the field of image processing.
Background
With the continuous development of image processing technology and the continuously rising requirements on image display quality, the deep-learning-based convolutional neural network (CNN), with its special structure of local weight sharing, has developed rapidly in the field of image processing and has gradually become an important technical choice in the industry.
In practical applications, there are scenes that require the resolution of an image to be enlarged from 720 progressive (p) to 1080p, i.e., that require non-integer-multiple upsampling of the image, or reduced from 1080p to 720p, i.e., that require non-integer-multiple downsampling of the image. However, only integer-multiple upsampling of an image (including a magnification of 1), as in image super-resolution algorithms, can be achieved by a convolutional neural network model composed of convolutional layers, such as the efficient sub-pixel convolutional neural network (ESPCN) model or the fast super-resolution convolutional neural network (FSRCNN) model.
Therefore, it is desirable to provide an image processing method that solves the problem of how to implement non-integer-multiple up-sampling or non-integer-multiple down-sampling of an image.
Disclosure of Invention
The application provides an image processing method and device, which can realize non-integer-multiple upsampling or non-integer-multiple downsampling of an image.
In a first aspect, the present application provides an image processing method, including:
acquiring a first image;
performing B-fold downsampling processing on the first image through a first image processing layer of a convolutional neural network to obtain a second image, where the convolutional neural network includes a plurality of image processing layers, the plurality of image processing layers include the first image processing layer, and B is an integer greater than 1;
and performing A-fold upsampling processing on the second image through a second image processing layer in the plurality of image processing layers to obtain a third image, where A is an integer greater than 1 and A is not equal to B.
According to the image processing method provided in this embodiment of the application, B-fold downsampling processing is performed on the acquired first image through the first image processing layer of the convolutional neural network to obtain the second image, and A-fold upsampling processing is performed on the second image through the second image processing layer of the convolutional neural network to obtain the third image, so that non-integer-multiple upsampling or non-integer-multiple downsampling of the first image can be realized.
In addition, because the downsampling processing is performed before the upsampling processing, the amount of data processed by the convolutional neural network is reduced, which can reduce the computational complexity of image processing and improve image processing efficiency.
It is to be understood that the size of the image may include multiple dimensions, including height and width when the dimension of the image is two-dimensional; when the dimension of the image is three-dimensional, the size of the image includes width, height, and depth.
It should also be understood that a pixel is the most fundamental element that makes up an image, being a logical unit of size.
It is also understood that the height of an image may be understood as the number of pixels the image comprises in the height direction; the width of an image may be understood as the number of pixels the image comprises in the width direction; the depth of an image may be understood as the number of channels of the image.
It should also be understood that, in the convolutional neural network model, the depth of an image can be understood as the number of feature maps included in the image, where the width and height of any one feature map of the image are the same as those of the other feature maps of the image. That is, one image is a three-dimensional image, and it can be understood that the three-dimensional image is composed of a plurality of two-dimensional feature maps, and the two-dimensional feature maps have the same size.
It should also be understood that, in this embodiment of the application, when the downsampling magnification B is greater than the upsampling magnification A, non-integer-multiple downsampling of the first image can be implemented; when the downsampling magnification B is smaller than the upsampling magnification A, non-integer-multiple upsampling of the first image can be implemented.
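For illustration only, the following is a minimal Python sketch (the helper name cascaded_size is an assumption of this rewrite, not part of the claimed method) showing how the two cascaded integer magnifications compose into an overall non-integer scale of A/B:

```python
# A minimal sketch (assumed helper, not from this application) of how a B-fold
# downsampling followed by an A-fold upsampling yields an overall A/B scale.

def cascaded_size(height: int, width: int, a: int, b: int) -> tuple[int, int]:
    """Size after B-fold downsampling followed by A-fold upsampling."""
    assert height % b == 0 and width % b == 0, "H and W must be divisible by B"
    return (height // b) * a, (width // b) * a

# 720p -> 1080p: downsample by B=2, upsample by A=3 (overall 1.5x)
print(cascaded_size(720, 1280, a=3, b=2))   # (1080, 1920)
# 1080p -> 720p: downsample by B=3, upsample by A=2 (overall 2/3x)
print(cascaded_size(1080, 1920, a=2, b=3))  # (720, 1280)
```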
With reference to the first aspect, in a first possible implementation manner of the first aspect, the first image includes M first feature maps, the height of each first feature map in the M first feature maps is H pixels, the width of each first feature map is W pixels, H and W are integers greater than 1, and M is an integer greater than 0; the performing B-fold downsampling processing on the first image through the first image processing layer of the convolutional neural network to obtain the second image includes: dividing, through the first image processing layer, each first feature map into (H×W)/B² image blocks that do not overlap with each other, where the height of each image block in the (H×W)/B² image blocks is B pixels and the width of each image block is B pixels; and obtaining B² second feature maps according to the (H×W)/B² image blocks, where the height of each second feature map in the B² second feature maps is H/B pixels, the width of each second feature map is W/B pixels, each pixel in each second feature map is taken from one of the (H×W)/B² image blocks, and the position of each pixel in each second feature map is associated with the position, in the first feature map, of the image block to which the pixel belongs. Here, (H×W)/B², H/B, and W/B are integers.
It should be understood that the first image includes M first feature maps, each of the M first feature maps has a height of H pixels and a width of W pixels, and the first image is a three-dimensional image, and the size of the three-dimensional first image is H × W × M, that is, the height of the first image is H pixels, the width of the first image is W pixels, and the depth of the first image is M first feature maps, that is, the three-dimensional first image includes M H × W two-dimensional first feature maps.
Optionally, the first image may be an originally acquired image to be processed, or the first image may be a preprocessed image, or the first image may be an image processed by another image processing layer in the convolutional neural network, or the first image may be an image processed by another image processing device, which is not limited in this embodiment of the application.
Optionally, the first image may be acquired in a plurality of different manners in this application embodiment, which is not limited in this application embodiment.
For example, when the first image is an originally acquired image to be processed, the first image may be acquired from an image acquisition device; when the first image is an image obtained after being processed by other image processing layers in the convolutional neural network, the first image output by the other image processing layers can be obtained; when the first image is an image processed by another image processing apparatus, the first image output by the another image processing apparatus may be acquired.
It should be understood that performing B-fold downsampling processing on one first feature map yields B² second feature maps; therefore, performing B-fold downsampling processing on the M first feature maps yields M×B² second feature maps, which constitute the second image.
It should also be understood that the second image includes M×B² second feature maps, the height of each second feature map is H/B pixels and the width of each second feature map is W/B pixels, and the second image is a three-dimensional image whose size is (H/B) × (W/B) × (M×B²); that is, the height of the second image is H/B pixels, the width of the second image is W/B pixels, and the depth of the second image is M×B² second feature maps, i.e., the three-dimensional second image includes M×B² two-dimensional (H/B) × (W/B) second feature maps.
It should also be understood that the position of each pixel in each second feature map is associated with the position of the image block to which each pixel belongs in the first feature map, and it can be understood that the relative position of each pixel included in the second feature map is the same as the relative position of the image block to which each pixel belongs in the first feature map.
In the image processing method provided in this embodiment of the application, the B² second feature maps are obtained by splitting, combining, and rearranging the pixels included in each first feature map, which can realize B-fold downsampling processing of the first image.
In addition, the B² second feature maps include all pixels in the first feature map, that is, the B² second feature maps retain all image information in each first feature map; and the relative positions between the pixels included in each second feature map are determined according to the relative positions, in the first feature map, of the image blocks to which the pixels belong. That is, each second feature map obtained from one first feature map is a thumbnail of the first feature map.
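As a minimal sketch of the rearrangement described above (assuming NumPy; the function name space_to_depth is illustrative, not from this application), the following shows how one first feature map is split into B × B blocks and regrouped into B² thumbnail-like second feature maps:

```python
# A minimal NumPy sketch (an assumption, not reference code from this
# application) of B-fold downsampling by pixel rearrangement: pixels at the
# same position within every B x B block are gathered into one of B*B
# (H/B) x (W/B) second feature maps.
import numpy as np

def space_to_depth(feature_map: np.ndarray, b: int) -> np.ndarray:
    """(H, W) -> (B*B, H/B, W/B); each output map is a thumbnail of the input."""
    h, w = feature_map.shape
    assert h % b == 0 and w % b == 0
    # Split into (H/B, B, W/B, B) blocks, then group pixels by in-block position.
    blocks = feature_map.reshape(h // b, b, w // b, b)
    return blocks.transpose(1, 3, 0, 2).reshape(b * b, h // b, w // b)

x = np.arange(16).reshape(4, 4)   # one 4 x 4 first feature map
print(space_to_depth(x, 2)[0])    # [[ 0  2] [ 8 10]] -- a 2 x 2 thumbnail
```

Each of the B² output maps keeps one pixel per image block, at the block's relative position, which matches the "thumbnail" property noted above.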
With reference to the first aspect, in a second possible implementation manner of the first aspect, the first image includes M first feature maps, the height of each first feature map in the M first feature maps is H pixels, the width of each first feature map is W pixels, H and W are integers greater than 1, and M is an integer greater than 0; the performing B-fold downsampling processing on the first image through the first image processing layer of the convolutional neural network to obtain the second image includes: performing a convolution operation on the M first feature maps through the first image processing layer to obtain the second image, where the convolution steps of the convolution operation in both the width direction and the height direction are B, the convolution operation uses N convolution kernels, the height of each convolution kernel in the N convolution kernels is K pixels, the width of each convolution kernel is J pixels, the depth of each convolution kernel is M feature maps, each first feature map is padded with a height boundary of P pixels and a width boundary of P pixels, the second image includes N second feature maps, the height of each second feature map in the N second feature maps is $\lfloor (H - K + 2P)/B \rfloor + 1$ pixels, and the width of each second feature map is $\lfloor (W - J + 2P)/B \rfloor + 1$ pixels, where N is an integer greater than 0, P is an integer greater than or equal to 0, and J and K are greater than or equal to B.
It should be understood that the convolution kernel is a filter used to extract the feature map of the image. The size of the convolution kernel includes width, height, and depth, wherein the depth of the convolution kernel is the same as the depth of the input image. The convolution operation is carried out on an input image by using different convolution kernels, so that different feature maps can be extracted.
Alternatively, in the convolution layer of the convolutional neural network, the same image may be convolved multiple times by setting convolution kernels of different sizes, different weight values, or different convolution steps to extract features of the image as much as possible.
It should also be understood that the convolution step refers to a distance by which the convolution kernel slides between performing convolution operations in the height direction and the width direction in the process of sliding the convolution kernel on the feature map of the input image to extract the feature map of the input image.
It will be appreciated that the convolution step size may determine the downsampling magnification of the input image, for example, a convolution step size in the width (or height) direction of B may enable the input feature map to be downsampled by a factor of B in the width (or height) direction.
It should also be understood that in convolutional neural networks, convolutional layers primarily play a role in extracting features. The convolution operation is mainly carried out on the input image according to a set convolution kernel.
It should be understood that, when a K × K convolution kernel is used to perform a convolution operation on a two-dimensional input image, the K × K image block covered when the convolution kernel slides on the image is dot-multiplied by the convolution kernel, that is, the gray value of each point on the image block is multiplied by the weight value at the same position on the convolution kernel to obtain K × K results, and after accumulation and offset addition, a result is obtained and output as a single pixel of an output image, where the coordinate position of the pixel on the output image corresponds to the coordinate position of the center of the image block on the input image.
It should also be understood that when a convolution kernel is used to perform a convolution operation on an input three-dimensional image, the dimension of the convolution kernel is also three-dimensional, and the third dimension (depth) of the convolution kernel is the same as the third dimension (depth or number of feature maps) of the three-dimensional image. The convolution operation of the three-dimensional image and the three-dimensional convolution kernel can be converted into a two-dimensional convolution operation of splitting the three-dimensional image and the convolution kernel into a plurality of two-dimensional characteristic graphs and the convolution kernel by the depth (image channel number or characteristic graph number) dimension, and finally accumulating the two-dimensional characteristic graphs and the convolution kernel in the image depth dimension to finally obtain a two-dimensional image output.
It should also be understood that in a convolutional neural network, the output image of the convolutional layer usually also includes a plurality of feature maps, the three-dimensional convolutional kernel processes the three-dimensional input image to obtain a two-dimensional output feature map, and a plurality of three-dimensional convolutional kernels are required to obtain the plurality of output feature maps, so the dimensionality of the convolutional kernels is 1 greater than that of the input image, and the number of the increased dimensionality corresponds to the depth of the output image, that is, the number of feature maps included in the output image.
It should also be understood that the convolution operation is classified as either a padding (padding) approach or a non-padding approach. The padding mode can be understood as a preprocessing operation on the image, and includes a same padding (same padding) mode and a valid padding (valid padding) mode.
It should also be understood that the same padding method refers to adding a same border to both the width and the height of the input image, and performing a convolution operation on the image after adding the border, wherein the border refers to the outer border of the input image. For example, when the size of the input image is 5 × 5 × 2, and the convolution operation is performed by the same padding method, the height boundary of the input image padding is 1 pixel, and the width boundary of the input image padding is 1 pixel, a 7 × 7 × 2 image can be obtained, and then the convolution operation is performed on the 7 × 7 × 2 image.
It should be understood that, when the input image is padded with a width boundary of (convolution kernel width − 1)/2 and a height boundary of (convolution kernel height − 1)/2 and the convolution step is 1, the output image obtained after the input image is convolved with the convolution kernel has the same width and height as the input image.
Alternatively, in general, when the size of the convolution kernel is 3 × 3, the height boundary and the width boundary of the input image filling are both 1 pixel; when the size of the convolution kernel is 5 multiplied by 5, the height boundary and the width boundary filled by the input image are both 2 pixels; when the size of the convolution kernel is 7 × 7, the height boundary and the width boundary of the input image are both 3 pixels, but this is not limited in the embodiment of the present application.
It should be further understood that, when the convolution operation is performed in the same padding manner, setting the values of the boundary elements to 0 is merely an exemplary description; the values of the boundary elements may also be other values, which is not limited in this embodiment.
Alternatively, assuming that the width (or height) of the input feature map is W, the width (or height) of the convolution kernel is F, the convolution step is S, the convolution operation is performed in the same padding manner, and the width (or height) boundary padded on the input feature map is P, then the width (or height) of the obtained output feature map can be expressed as:

$\left\lfloor \dfrac{W - F + 2P}{S} \right\rfloor + 1$

where W, F, and S are integers greater than 0, P is an integer greater than or equal to 0, and $\lfloor \cdot \rfloor$ denotes rounding down (the floor operation).
It should be understood that if the convolution operation is performed in a non-padding manner, P may be considered to be 0.
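A small illustrative helper (an assumption of this rewrite, not code from this application) implementing the output-size formula above:

```python
# Output width (or height) after a convolution: floor((W - F + 2P) / S) + 1.

def conv_output_size(w: int, f: int, s: int, p: int = 0) -> int:
    """Width (or height) of the output feature map after a convolution."""
    return (w - f + 2 * p) // s + 1

# Example: W=5, 3x3 kernel, stride B=2, padding P=1 -> floor((5-3+2)/2)+1 = 3
print(conv_output_size(5, 3, 2, 1))  # 3
```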
According to the image processing method provided in this embodiment of the application, a convolution operation is performed on the first image through the first image processing layer of the convolutional neural network, where the convolution steps of the convolution operation in both the width direction and the height direction are B, the convolution operation uses N convolution kernels, the height of each convolution kernel in the N convolution kernels is K pixels, the width of each convolution kernel is J pixels, the depth of each convolution kernel is M feature maps, and the first image is padded with a height boundary of P pixels and a width boundary of P pixels, so that B-fold downsampling of the first image can be realized.
In addition, J and K are greater than or equal to B, so that the convolution kernel covers every pixel in the first image at least once during the convolution process, that is, all image information in the first image is retained.
With reference to the second possible implementation manner of the first aspect, in a third possible implementation manner of the first aspect, M, N, and B satisfy the following formula: N ≥ M×B/2.
Since a small portion of image information carried by the input image is lost in the process of abstracting the image features through convolution operation, the number of feature maps included in the output image can be increased to achieve the purpose of better retaining the image information.
In this embodiment of the application, N ≥ M×B/2, that is, the depth of the second image is increased subject to this constraint, which can compensate for the image information lost from the first image.
With reference to the first aspect, in a fourth possible implementation manner of the first aspect, the first image includes M first feature maps, a height of each first feature map in the M first feature maps is H pixels, a width of each first feature map is W pixels, H and W are integers greater than 1, and M is an integer greater than 0; the method for performing B-fold downsampling processing on a first image through a first image processing layer of a convolutional neural network to obtain a second image comprises the following steps: performing pooling operation on each first feature map in the M first feature maps through the first image processing layer to obtain the second image, wherein the pooling step of the pooling operation in the width direction and the height direction is B, the height of a pooling kernel of the pooling operation is B pixels, the width of the pooling kernel is B pixels, the second image comprises M second feature maps, the height of each second feature map in the M second feature maps is H/B pixels, and the width of each second feature map is W/B pixels. Wherein H/B and W/B are integers.
Alternatively, two common pooling operations are mean pooling (average pooling) and maximum pooling (max pooling), which are performed in two dimensions, i.e., width and height of the feature map, and do not affect the depth of the output feature map.
In addition, the mean pooling operation refers to finding a mean in each region over which the pooling kernel slides.
According to the image processing method provided in this embodiment of the application, B-fold downsampling processing is performed on the first image through the pooling layer, which can reduce the data volume of the feature layer, thereby reducing the computational complexity of the convolutional neural network and its cache bandwidth requirement.
With reference to the first aspect and any one of the first to fourth possible implementation manners of the first aspect, in a fifth possible implementation manner of the first aspect, A and B are coprime.
When A and B have a common divisor greater than 1, more image information may be lost during the downsampling process, and the continuity of image texture information may be damaged.
Therefore, according to the image processing method provided in this embodiment of the application, making A and B coprime can guarantee the integrity of the image information of the first image and the continuity of the image texture information to a greater extent than the case where A and B are not coprime.
With reference to the first aspect and any one of the first to fifth possible implementation manners of the first aspect, in a sixth possible implementation manner of the first aspect, the performing A-fold upsampling processing on the second image through the convolutional neural network to obtain the third image includes: performing first processing on the second image through the convolutional neural network to obtain a fourth image, and performing A-fold upsampling processing on the fourth image to obtain the third image.
Alternatively, the first process is an operation other than the upsampling process or the downsampling process, such as a convolution operation with a convolution step size of 1.
In a second aspect, the present application provides an image processing method, comprising:
acquiring a first image;
performing A-fold upsampling processing on the first image through a first image processing layer of a convolutional neural network to obtain a second image, where the convolutional neural network includes a plurality of image processing layers, the plurality of image processing layers include the first image processing layer, and A is an integer greater than 1;
and performing B-fold downsampling processing on the second image through a second image processing layer in the plurality of image processing layers to obtain a third image, where B is an integer greater than 1 and A is not equal to B.
According to the image processing method provided in this embodiment of the application, A-fold upsampling processing is performed on the acquired first image through the first image processing layer of the convolutional neural network to obtain the second image, and B-fold downsampling processing is performed on the second image through the second image processing layer of the convolutional neural network to obtain the third image, so that non-integer-multiple upsampling or non-integer-multiple downsampling of the first image can be realized.
With reference to the second aspect, in a first possible implementation manner of the second aspect, the second image includes M second feature maps, the height of each second feature map in the M second feature maps is H pixels, the width of each second feature map is W pixels, H and W are integers greater than 1, and M is an integer greater than 0; the performing B-fold downsampling processing on the second image through the second image processing layer of the convolutional neural network to obtain the third image includes: dividing, through the second image processing layer, each second feature map into (H×W)/B² image blocks that do not overlap with each other, where the height of each image block in the (H×W)/B² image blocks is B pixels and the width of each image block is B pixels; and obtaining B² third feature maps according to the (H×W)/B² image blocks, where the height of each third feature map in the B² third feature maps is H/B pixels, the width of each third feature map is W/B pixels, each pixel in each third feature map is taken from one of the (H×W)/B² image blocks, and the position of each pixel in each third feature map is associated with the position, in the second feature map, of the image block to which the pixel belongs. Here, (H×W)/B², H/B, and W/B are integers.
With reference to the second aspect, in a second possible implementation manner of the second aspect, the second image includes M second feature maps, the height of each second feature map in the M second feature maps is H pixels, the width of each second feature map is W pixels, H and W are integers greater than 1, and M is an integer greater than 0; the performing B-fold downsampling processing on the second image through the second image processing layer of the convolutional neural network to obtain the third image includes: performing a convolution operation on the M second feature maps through the second image processing layer to obtain the third image, where the convolution steps of the convolution operation in both the width direction and the height direction are B, the convolution operation uses N convolution kernels, the height of each convolution kernel in the N convolution kernels is K pixels, the width of each convolution kernel is J pixels, the depth of each convolution kernel is M feature maps, each second feature map is padded with a height boundary of P pixels and a width boundary of P pixels, the third image includes N third feature maps, the height of each third feature map in the N third feature maps is $\lfloor (H - K + 2P)/B \rfloor + 1$ pixels, and the width of each third feature map is $\lfloor (W - J + 2P)/B \rfloor + 1$ pixels, where N is an integer greater than 0, P is an integer greater than or equal to 0, and J and K are greater than or equal to B.
With reference to the second possible implementation manner of the second aspect, in a third possible implementation manner of the second aspect, M, N, and B satisfy the following formula: N ≥ M×B/2.
With reference to the second aspect, in a fourth possible implementation manner of the second aspect, the second image includes M second feature maps, the height of each second feature map in the M second feature maps is H pixels, the width of each second feature map is W pixels, H and W are integers greater than 1, and M is an integer greater than 0; the performing B-fold downsampling processing on the second image through the second image processing layer of the convolutional neural network to obtain the third image includes: performing a pooling operation on each second feature map in the M second feature maps through the second image processing layer to obtain the third image, where the pooling step of the pooling operation in both the width direction and the height direction is B, the height of the pooling kernel of the pooling operation is B pixels and the width of the pooling kernel is B pixels, the third image includes M third feature maps, the height of each third feature map in the M third feature maps is H/B pixels, and the width of each third feature map is W/B pixels. Here, H/B and W/B are integers.
With reference to the second aspect and any one of the first to fourth possible implementation manners of the second aspect, in a fifth possible implementation manner of the second aspect, A and B are coprime.
In a third aspect, the present application provides an image processing apparatus configured to perform the method of the first aspect or any possible implementation manner of the first aspect.
In a fourth aspect, the present application provides an image processing apparatus configured to perform the method of the second aspect or any possible implementation manner of the second aspect.
In a fifth aspect, the present application provides an image processing apparatus comprising: memory, processor, communication interface and computer program stored on the memory and executable on the processor, characterized in that the processor executes the computer program to perform the method of the first aspect or any possible implementation manner of the first aspect.
In a sixth aspect, the present application provides an image processing apparatus comprising: memory, processor, communication interface and computer program stored on the memory and executable on the processor, characterized in that the processor executes the computer program to perform the method of the second aspect or any possible implementation of the second aspect.
In a seventh aspect, the present application provides a computer-readable medium for storing a computer program comprising instructions for performing the method of the first aspect or any possible implementation manner of the first aspect.
In an eighth aspect, the present application provides a computer readable medium for storing a computer program comprising instructions for performing the method of the second aspect or any possible implementation of the second aspect.
In a ninth aspect, the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the first aspect or any possible implementation manner of the first aspect.
In a tenth aspect, the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the second aspect described above or any possible implementation of the second aspect.
In an eleventh aspect, the present application provides a chip comprising: an input interface, an output interface, at least one processor, a memory, the input interface, the output interface, the processor and the memory being in communication with each other via an internal connection path, the processor being configured to execute code in the memory, and when executed, the processor being configured to perform the method of the first aspect or any possible implementation manner of the first aspect.
In a twelfth aspect, the present application provides a chip comprising: an input interface, an output interface, at least one processor, a memory, the input interface, the output interface, the processor and the memory are in communication with each other through an internal connection path, the processor is configured to execute code in the memory, and when the code is executed, the processor is configured to perform the method in the second aspect or any possible implementation manner of the second aspect.
Drawings
FIG. 1 is a schematic diagram of the height, width and depth of a three-dimensional image;
FIG. 2 is a schematic diagram of a convolution layer implementation convolution operation;
FIG. 3 is a schematic diagram of a pooling layer implementing a pooling operation;
FIG. 4 is a schematic diagram of a sub-pixel convolution layer implementing a sub-pixel convolution operation;
FIG. 5 is a schematic diagram of an application scenario provided in an embodiment of the present application;
FIG. 6 is a schematic flow chart diagram of an image processing method provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of a downsampling process provided by an embodiment of the present application;
FIG. 8 is a diagram illustrating a process for performing convolution operations using convolution kernels of different sizes according to an embodiment of the present application;
FIG. 9 is a schematic block diagram of an image processing apparatus provided in an embodiment of the present application;
FIG. 10 is a schematic block diagram of another image processing apparatus provided in an embodiment of the present application.
Detailed Description
For the sake of clarity, the terms used in this application are first explained.
1. Pixel
A pixel is the most basic element that makes up an image, being a logical unit of size.
2. Size of image
The size of the image may include a plurality of dimensions, and when the dimension of the image is two-dimensional, the size of the image includes a height and a width; when the dimension of the image is three-dimensional, the size of the image includes width, height, and depth.
It is to be understood that the height of an image may be understood as the number of pixels the image comprises in the height direction; the width of an image may be understood as the number of pixels the image comprises in the width direction; the depth of an image may be understood as the number of channels of the image.
In the convolutional neural network model, the depth of an image can be understood as the number of feature maps included in the image, where the width and height of any one feature map of the image are the same as those of the other feature maps of the image.
That is, one image is a three-dimensional image, and it can be understood that the three-dimensional image is composed of a plurality of two-dimensional feature maps, and the plurality of two-dimensional feature maps have the same size.
It should be understood that an image includes M feature maps, each of the M feature maps having a height of H pixels and a width of W pixels, and that the image is a three-dimensional image having a size of H × W × M, that is, the three-dimensional image includes M H × W two-dimensional feature maps. Wherein H, W is an integer greater than 1, and M is an integer greater than 0.
FIG. 1 shows a 5 × 5 × 3 image, which includes 3 feature maps (e.g., red (R), green (G), and blue (B)), each having a size of 5 × 5.
It should be understood that the feature maps of different colors can be understood as different channels of the image, and different channels can be considered as different feature maps in the convolutional neural network.
It should be further understood that FIG. 1 only illustrates an image with a depth of 3; the depth of the image may also be other values, for example, the depth of a grayscale image is 1 and the depth of an RGB-depth (RGB-D) image is 4, which is not limited in this embodiment of the application.
It is also understood that the resolution of an image (or feature map) can be understood as the product of the width and height of the image (or feature map), i.e., if the height of the image (or feature map) is H pixels and the width of the image (or feature map) is W pixels, then the resolution of the image (or feature map) is H × W.
3. Convolution kernel
The convolution kernel is a filter for extracting a feature map of an image. The size of the convolution kernel includes width, height, and depth, wherein the depth of the convolution kernel is the same as the depth of the input image. The convolution operation is carried out on an input image by using different convolution kernels, so that different feature maps can be extracted.
For example, one output feature map can be obtained by performing a convolution operation on a 7 × 7 × 3 input image using one 5 × 5 × 3 convolution kernel, and a plurality of different output feature maps can be obtained by performing a convolution operation on a 7 × 7 × 3 input image using a plurality of different 5 × 5 × 3 convolution kernels.
Alternatively, in the convolution layer of the convolutional neural network, the same image may be convolved multiple times by setting convolution kernels of different sizes, different weight values, or different convolution steps to extract features of the image as much as possible.
4. Convolution step size
The convolution step refers to a distance by which a convolution kernel slides between two convolution operations performed in the height direction and the width direction in the process of sliding the convolution kernel on the feature map of the input image to extract the feature map of the input image.
It will be appreciated that the convolution step size may determine the downsampling magnification of the input image, for example, a convolution step size in the width (or height) direction of B, which may enable the input feature map to be downsampled by a factor of B in the width (or height) direction, B being an integer greater than 1.
5. Convolution layer (convolutional layer)
In convolutional neural networks, convolutional layers mainly play a role in extracting features. The convolution operation is mainly carried out on the input image according to a set convolution kernel.
It should be understood that, when a K × K convolution kernel is used to perform a convolution operation on a two-dimensional input image, the K × K image block covered when the convolution kernel slides on the image is dot-multiplied by the convolution kernel, that is, the gray value of each point on the image block is multiplied by the weight value at the same position on the convolution kernel to obtain K × K results, and after accumulation and offset addition, a result is obtained and output as a single pixel of an output image, where the coordinate position of the pixel on the output image corresponds to the coordinate position of the center of the image block on the input image.
It should also be understood that when a convolution kernel is used to perform a convolution operation on an input three-dimensional image, the dimension of the convolution kernel is also three-dimensional, and the third dimension (depth) of the convolution kernel is the same as the third dimension (depth or number of feature maps) of the three-dimensional image. The convolution operation of the three-dimensional image and the three-dimensional convolution kernel can be converted into a two-dimensional convolution operation of splitting the three-dimensional image and the convolution kernel into a plurality of two-dimensional characteristic graphs and the convolution kernel by the depth (image channel number or characteristic graph number) dimension, and finally accumulating the two-dimensional characteristic graphs and the convolution kernel in the image depth dimension to finally obtain a two-dimensional image output.
It should also be understood that in a convolutional neural network, the output image of the convolutional layer usually also includes a plurality of feature maps, the three-dimensional convolutional kernel processes the three-dimensional input image to obtain a two-dimensional output feature map, and a plurality of three-dimensional convolutional kernels are required to obtain the plurality of output feature maps, so the dimensionality of the convolutional kernels is 1 greater than that of the input image, and the number of the increased dimensionality corresponds to the depth of the output image, that is, the number of feature maps included in the output image.
It should also be understood that the convolution operation is classified as either a padding (padding) approach or a non-padding approach. The padding mode can be understood as a preprocessing operation on the image, and includes a same padding (same padding) mode and a valid padding (valid padding) mode.
It should be understood that the padding manners described in the embodiments of the present application all refer to same padding manner, but the embodiments of the present application are not limited thereto.
It should also be understood that the same padding method refers to adding a same border to both the width and the height of the input image, and performing a convolution operation on the image after adding the border, wherein the border refers to the outer border of the input image. For example, when the size of the input image is 5 × 5 × 2, and the convolution operation is performed by the same padding method, the height boundary of the input image padding is 1 pixel, and the width boundary of the input image padding is 1 pixel, a 7 × 7 × 2 image can be obtained, and then the convolution operation is performed on the 7 × 7 × 2 image.
It should be understood that, when the input image is padded with a width boundary of (convolution kernel width − 1)/2 and a height boundary of (convolution kernel height − 1)/2 and the convolution step is 1, the output image obtained after the input image is convolved with the convolution kernel has the same width and height as the input image.
Alternatively, in general, when the size of the convolution kernel is 3 × 3, the height boundary and the width boundary of the input image filling are both 1 pixel; when the size of the convolution kernel is 5 multiplied by 5, the height boundary and the width boundary filled by the input image are both 2 pixels; when the size of the convolution kernel is 7 × 7, the height boundary and the width boundary of the input image are both 3 pixels, but this is not limited in the embodiment of the present application.
It should be further understood that, when the convolution operation is performed in the same padding manner, setting the values of the boundary elements to 0 is merely an exemplary description; the values of the boundary elements may also be other values, which is not limited in this embodiment.
It should also be understood that, assuming that the width (or height) of the input feature map is W, the width (or height) of the convolution kernel is F, the convolution step is S, the convolution operation is performed in the same padding manner, and the width (or height) boundary padded on the input feature map is P, the width (or height) of the obtained output feature map can be expressed as:

$\left\lfloor \dfrac{W - F + 2P}{S} \right\rfloor + 1$

where W, F, and S are integers greater than 0, P is an integer greater than or equal to 0, and $\lfloor \cdot \rfloor$ denotes rounding down (the floor operation).
FIG. 2 shows a process in which a convolution layer performs a convolution operation on an input image. The size of the three-dimensional input image is 5 × 5 × 3, and the height boundary and the width boundary filled in the input image are both 1 pixel, giving a 7 × 7 × 3 padded input image. The convolution step of the convolution operation in both the width direction and the height direction is 2, and the convolution operation is performed by using a convolution kernel w0 whose size is 3 × 3 × 3. The 3 input feature maps included in the input image (input feature map 1, input feature map 2, and input feature map 3) are convolved with the three depth layers of the convolution kernel (convolution kernel w0-1, convolution kernel w0-2, and convolution kernel w0-3), respectively, so that output feature map 1 is obtained, and the size of output feature map 1 is 3 × 3 × 1.
Specifically, the elements of the first depth layer of w0 (i.e., w0-1) are multiplied by the elements at the corresponding positions in the blue box of input feature map 1 and then summed to obtain 0; similarly, the other two depth layers of the convolution kernel w0 (i.e., w0-2 and w0-3) are convolved with input feature map 2 and input feature map 3 to obtain 2 and 0, respectively, so the first element of output feature map 1 in FIG. 2 is 0 + 2 + 0 = 2. After the first convolution operation of the convolution kernel w0, the blue boxes sequentially slide along the width direction and the height direction of each input feature map and the next convolution operation continues, where the distance of each slide is 2 (i.e., the convolution step in the width direction and the height direction is 2), until the convolution operation on the input image is completed, so as to obtain the 3 × 3 × 1 output feature map 1.
Similarly, if the convolution operation also uses another convolution kernel w1 to convolve the input image, output feature map 2 can be obtained based on a process similar to that of the convolution kernel w0, and the size of output feature map 2 is also 3 × 3 × 1, so that the output image composed of the two output feature maps is 3 × 3 × 2.
Optionally, the output characteristic map 1 and the output characteristic map 2 may also be activated by an activation function, so as to obtain an activated output characteristic map 1 and an activated output characteristic map 2.
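The following NumPy sketch (illustrative only; the function name and the random test data are assumptions of this rewrite) mirrors the FIG. 2 walkthrough: the M input feature maps are padded, convolved depth-wise with each M-deep kernel under a stride of B, and accumulated into one output feature map per kernel:

```python
# A NumPy sketch (an assumption, mirroring the FIG. 2 walkthrough) of a strided
# convolution over a three-dimensional input.
import numpy as np

def conv3d_stride(image: np.ndarray, kernels: np.ndarray,
                  stride: int, pad: int) -> np.ndarray:
    """image: (M, H, W); kernels: (N, M, K, K) -> output: (N, H', W')."""
    m, h, w = image.shape
    n, _, k, _ = kernels.shape
    padded = np.pad(image, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    h_out = (h - k + 2 * pad) // stride + 1
    w_out = (w - k + 2 * pad) // stride + 1
    out = np.zeros((n, h_out, w_out))
    for c in range(n):
        for i in range(h_out):
            for j in range(w_out):
                patch = padded[:, i*stride:i*stride+k, j*stride:j*stride+k]
                # element-wise product, summed over depth, height, and width
                out[c, i, j] = np.sum(patch * kernels[c])
    return out

img = np.random.rand(3, 5, 5)     # M=3 input feature maps of 5 x 5
ker = np.random.rand(2, 3, 3, 3)  # N=2 kernels w0, w1 of 3 x 3 x 3
print(conv3d_stride(img, ker, stride=2, pad=1).shape)  # (2, 3, 3)
```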
6. Pooling layer (pooling layer)
On the one hand, the pooling layer makes the width and the height of the feature map smaller, reducing the computational complexity of the convolutional neural network by reducing the data volume of the feature layer; on the other hand, it performs feature compression and extracts the main features.
Alternatively, two common pooling operations are mean pooling (average pooling) and maximum pooling (max pooling), which are performed in two dimensions, i.e., width and height of the feature map, and do not affect the depth of the output feature map.
FIG. 3 shows a process of performing a pooling operation on an input image by a pooling layer. The input image is a 4 × 4 × 1 image, and a max pooling operation is performed on the input image through a 2 × 2 pooling kernel, that is, a maximum value is found in each area that the pooling kernel slides over and is used as a pixel in the output image, where the position of each pixel in the output image is the same as the position, in the input image, of the area to which the pixel belongs. The pooling step is 2, and the main features are finally extracted from the input image to obtain the output image.
In addition, the mean pooling operation refers to finding a mean in each region over which the pooling kernel slides.
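A minimal NumPy sketch (an assumption of this rewrite; the sample matrix is arbitrary, not the FIG. 3 data) of 2 × 2 pooling with step 2 over non-overlapping windows, using np.max for max pooling and np.mean for mean pooling:

```python
# Pooling by reshaping into non-overlapping B x B windows and reducing them.
import numpy as np

def pool2d(fmap: np.ndarray, b: int, op=np.max) -> np.ndarray:
    """(H, W) -> (H/B, W/B) by applying `op` over non-overlapping B x B windows."""
    h, w = fmap.shape
    assert h % b == 0 and w % b == 0
    return op(fmap.reshape(h // b, b, w // b, b), axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 7],
              [8, 2, 0, 1],
              [3, 9, 4, 2]])
print(pool2d(x, 2))           # max pooling:  [[6 7] [9 4]]
print(pool2d(x, 2, np.mean))  # mean pooling: [[3.5  3.75] [5.5  1.75]]
```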
7. Deconvolution layer (deconvolution layer)
The deconvolution layer is also called an inverse convolution layer. The upsampling magnification of the input image can be determined by setting the deconvolution step; for example, a deconvolution step of A in the width (or height) direction can achieve A-fold upsampling of the input feature map in the width (or height) direction, where A is an integer greater than 1.
It should be understood that the deconvolution operation can be understood as the inverse of the convolution operation shown in fig. 2.
It should also be understood that, assuming that the width (or height) of the input feature map is W, the width (or height) of the convolution kernel is F, the deconvolution step is S, and the width (or height) boundary clipped from the output feature map is P, the width (or height) of the obtained output feature map can be expressed as S × (W − 1) + F − 2P, where W, F, and S are integers greater than 0 and P is an integer greater than or equal to 0.
For example, in a deconvolution operation on a 5 × 5 × 1 input image using a 3 × 3 × 3 convolution kernel, each pixel in the input feature map of the input image is multiplied by each weight at the first layer depth of the 3 × 3 deconvolution kernel to obtain a 3 × 3 image block corresponding to each pixel, the image block is placed on a 7 × 7 × 1 output feature map 1, the center position of the image block is the position of each pixel, and the distance between the center positions of two adjacent image blocks is equal to the deconvolution step size. Then, accumulating a plurality of values assigned to each pixel in the output feature map to obtain a final output feature map, and in the same way, deconvolving the input feature map through the second layer depth and the third layer depth of the deconvolution kernel to obtain an output feature map 2 and an output feature map 3, and then cutting 1 pixel on the boundary of the 3 feature maps to obtain an output image of 5 × 5 × 3.
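An illustrative helper (assumed, not from this application) for the deconvolution output-size expression S × (W − 1) + F − 2P, checked against the example above:

```python
# Output width (or height) after a deconvolution: S * (W - 1) + F - 2P.

def deconv_output_size(w: int, f: int, s: int, p: int = 0) -> int:
    """Width (or height) of the output feature map after a deconvolution."""
    return s * (w - 1) + f - 2 * p

# Example above: W=5, 3x3 kernel, step S=1, no cropping -> 1*(5-1)+3 = 7,
# and cropping P=1 pixel from the boundary gives 1*(5-1)+3-2 = 5.
print(deconv_output_size(5, 3, 1, 0))  # 7
print(deconv_output_size(5, 3, 1, 1))  # 5
```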
8. Sub-pixel convolution layer (sub-pixel convolution layer)
The sub-pixel convolution layer integrates a plurality of feature maps in the dimension of the depth of the input image, and achieves the effect of amplifying the width and the height of the input image in an integer ratio.
The sub-pixel convolution operation may be understood as a method of rearranging data arrangement of a feature map included in an input image.
For example, the input feature layer of a sub-pixel convolution layer has a size of H × W × r² (r is the magnification of the image). The sub-pixel convolution layer rearranges the pixels at the same position in the r² feature maps into an r × r image block, which corresponds to an r × r image block at the corresponding position in the output feature map, so that the H × W × r² input is rearranged into an output feature map of rH × rW × 1. Although this transformation is called sub-pixel convolution, no convolution operation is actually performed; the pixels in the r² H × W input feature maps are merely permuted and combined.
For example, as shown in FIG. 4, four 2 × 2 input feature maps, after the sub-pixel convolution operation of the sub-pixel convolution layer, yield the output feature map shown in FIG. 4, whose size is 4 × 4. It should be understood that, for clarity of description, the bracketed number on each pixel in FIG. 4 represents the number or identifier of the pixel, rather than the pixel value.
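A minimal NumPy sketch (assumed names, mirroring FIG. 4) of the sub-pixel rearrangement: the r² input feature maps of size H × W are permuted into a single rH × rW output feature map without any actual convolution, the inverse of the space_to_depth sketch given earlier:

```python
# Sub-pixel "convolution" as a pure pixel rearrangement (pixel shuffle).
import numpy as np

def pixel_shuffle(fmaps: np.ndarray, r: int) -> np.ndarray:
    """(r*r, H, W) -> (r*H, r*W): the pixels at the same position across the
    r*r maps form one r x r block of the output."""
    c, h, w = fmaps.shape
    assert c == r * r
    return fmaps.reshape(r, r, h, w).transpose(2, 0, 3, 1).reshape(h * r, w * r)

x = np.arange(16).reshape(4, 2, 2)  # four 2 x 2 input feature maps
y = pixel_shuffle(x, 2)
print(y.shape)  # (4, 4)
# Pixel (0, 0) of maps 0..3 fills the first 2 x 2 block: [[0 4] [8 12]].
print(y[:2, :2])
```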
It should be understood that the technical solution provided in the embodiment of the present application may be applied to various scenes that need to perform image processing on an input image to obtain a corresponding output image, and the embodiment of the present application does not limit this.
For example, as shown in FIG. 5, the technical solution of the embodiment of the present application may be applied to a terminal device, which may be mobile or fixed. For example, the terminal device may be a mobile phone with an image processing function, a tablet personal computer (TPC), a media player, a smart television, a laptop computer (LC), a personal digital assistant (PDA), a personal computer (PC), a camera, a video camera, a smart watch, a wearable device (WD), and the like, which is not limited in this embodiment of the application.
Fig. 6 shows a schematic flowchart of an image processing method 600 provided by an embodiment of the present application, which may be executed by an image processing apparatus, for example.
S610, acquiring a first image.
S620, performing B-fold downsampling processing on the first image through a first image processing layer of a convolutional neural network to obtain a second image, where the convolutional neural network includes a plurality of image processing layers, the plurality of image processing layers include the first image processing layer, and B is an integer greater than 1.
S630, performing A-fold upsampling processing on the second image through a second image processing layer in the plurality of image processing layers to obtain a third image, where A is an integer greater than 1 and A is not equal to B.
According to the image processing method provided in this embodiment of the application, B-fold downsampling processing is performed on the acquired first image through the first image processing layer of the convolutional neural network to obtain the second image, and A-fold upsampling processing is performed on the second image through the second image processing layer of the convolutional neural network to obtain the third image, so that non-integer-multiple upsampling or non-integer-multiple downsampling of the first image can be realized.
It should also be understood that, in this embodiment of the application, when the downsampling magnification B is greater than the upsampling magnification A, non-integer-multiple downsampling of the first image can be implemented; when the downsampling magnification B is smaller than the upsampling magnification A, non-integer-multiple upsampling of the first image can be implemented.
Suppose that the first image includes M first feature maps, the height of each first feature map in the M first feature maps is H pixels, the width of each first feature map is W pixels, H and W are integers greater than 1, and M is an integer greater than 0.
It should be understood that, since the first image includes M first feature maps and each of the M first feature maps has a height of H pixels and a width of W pixels, the first image is a three-dimensional image of size H × W × M; that is, the first image has a height of H pixels, a width of W pixels and a depth of M feature maps, or in other words, the three-dimensional image includes M two-dimensional H × W first feature maps.
Optionally, the first image may be an originally acquired image to be processed, or the first image may be a preprocessed image, or the first image may be an image processed by another image processing layer in the convolutional neural network, or the first image may be an image processed by another image processing device, which is not limited in this embodiment of the application.
Optionally, the first image may be acquired in a plurality of different manners, which is not limited in this embodiment of the application.
For example, when the first image is an originally acquired to-be-processed image, S610 may be to acquire the first image from an image acquisition device; when the first image is an image obtained after being processed by another image processing layer in the convolutional neural network, S610 may be to obtain the first image output by the another image processing layer; when the first image is an image processed by another image processing apparatus, S610 may be to acquire the first image output by the another image processing apparatus.
It should be understood that the convolutional neural network described in the embodiment of the present application may include a plurality of image processing layers, wherein a first image processing layer of the convolutional neural network may include a part of the image processing layers, and a second image processing layer of the convolutional neural network may include another part of the image processing layers, which is not limited in the embodiment of the present application.
Optionally, in S620, the image processing apparatus may perform B-fold down-sampling processing on the first image through the first image processing layer to obtain the second image, which is not limited in this embodiment of the application.
As an alternative embodiment, the image processing apparatus divides, through the first image processing layer, each first feature map into (H × W)/B² non-overlapping image blocks, where each of the (H × W)/B² image blocks has a height of B pixels and a width of B pixels; and obtains B² second feature maps from the (H × W)/B² image blocks, where each of the B² second feature maps has a height of H/B pixels and a width of W/B pixels, each pixel in each second feature map is taken from a different one of the (H × W)/B² image blocks, and the position of each pixel in each second feature map is associated with the position, in the first feature map, of the image block to which the pixel belongs. Here, (H × W)/B², H/B and W/B are integers.
It should be understood that B² second feature maps can be obtained by performing B-fold downsampling on one first feature map; performing the B-fold downsampling processing on the M first feature maps therefore yields M × B² second feature maps, that is, the second image.
It should also be understood that the second image includes M × B² second feature maps, each of which has a height of H/B pixels and a width of W/B pixels; that is, the second image is a three-dimensional image of size (H/B) × (W/B) × (M × B²), with a height of H/B pixels, a width of W/B pixels and a depth of M × B² feature maps, or in other words, the three-dimensional second image includes M × B² two-dimensional (H/B) × (W/B) second feature maps.
It should also be understood that "the position of each pixel in each second feature map is associated with the position of the image block to which the pixel belongs in the first feature map" means that the relative positions of the pixels included in a second feature map are the same as the relative positions, in the first feature map, of the image blocks from which those pixels are taken.
For example, fig. 6 is a schematic diagram in which a 4 × 4 × 1 input image is processed by a convolutional neural network to obtain an output image. It should be understood that, for clarity of description, the bracketed number on each pixel in fig. 6 is the number or identification of the pixel, rather than the pixel value.
As shown in fig. 6, the input image includes one 4 × 4 input feature map, which is divided into four 2 × 2 image blocks: image block 1 includes the pixels numbered 1, 2, 5 and 6; image block 2 includes the pixels numbered 3, 4, 7 and 8; image block 3 includes the pixels numbered 9, 10, 13 and 14; and image block 4 includes the pixels numbered 11, 12, 15 and 16. One pixel is taken from the upper-left corner of each image block to form output feature map 1, and the relative position of each pixel in output feature map 1 is the same as the relative position, in the input feature map, of the image block to which the pixel belongs; that is, the relative positions of the pixels numbered 1, 3, 9 and 11 in output feature map 1 are the same as the relative positions of image blocks 1, 2, 3 and 4.
That is, the relative positions between the pixels included in output feature map 1 remain unchanged no matter in which direction the feature map is translated.
Similarly, one pixel is taken from the upper-right corner of each image block to form output feature map 2, one pixel from the lower-left corner of each image block to form output feature map 3, and one pixel from the lower-right corner of each image block to form output feature map 4; the size of the output image is therefore 2 × 2 × 4.
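The splitting just described is in effect a space-to-depth rearrangement. A minimal NumPy sketch reproducing the fig. 6 example follows (the helper name space_to_depth is introduced here for illustration only):

```python
import numpy as np

def space_to_depth(x, B):
    """Split an (H, W) feature map into B*B second feature maps of (H/B, W/B).

    Output map (i, j) collects the pixel at offset (i, j) inside every
    non-overlapping B x B image block, so each map keeps the relative
    layout of the blocks in the input feature map.
    """
    H, W = x.shape
    # (H/B, B, W/B, B) -> (B, B, H/B, W/B): pixel offsets first, block grid second
    return x.reshape(H // B, B, W // B, B).transpose(1, 3, 0, 2)

x = np.arange(1, 17).reshape(4, 4)  # pixels numbered 1..16 as in fig. 6
maps = space_to_depth(x, 2)
print(maps[0, 0])  # [[ 1  3] [ 9 11]]: the upper-left pixel of each image block
```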
In the image processing method provided by the embodiment of the application, the B² second feature maps are obtained by splitting, combining and rearranging the pixels included in each first feature map, so that B-fold downsampling processing of the first image can be realized.
In addition, the B² second feature maps include all the pixels in the first feature map, that is, all the image information of the first feature map is retained in the B² second feature maps; and the relative positions between the pixels included in each second feature map are determined according to the relative positions, in the first feature map, of the image blocks to which those pixels belong, that is, each second feature map obtained from one first feature map is a thumbnail of that first feature map.
As another optional embodiment, the image processing apparatus performs a convolution operation on the M first feature maps through the first image processing layer to obtain the second image. The convolution steps of the convolution operation in the width direction and the height direction are both B, the convolution operation uses N convolution kernels, each of the N convolution kernels has a height of K pixels, a width of J pixels and a depth of M feature maps, and each first feature map is padded with a height boundary of P pixels and a width boundary of P pixels. The second image includes N second feature maps, and each of the N second feature maps has a height of ⌊(H + 2P − K)/B⌋ + 1 pixels and a width of ⌊(W + 2P − J)/B⌋ + 1 pixels, where N is an integer greater than 0, P is an integer greater than or equal to 0, and J and K are greater than or equal to B.
Alternatively, the first image processing layer may be, for example, a convolutional layer, but this is not limited in this embodiment of the present application.
For example, fig. 7 shows a schematic diagram of the convolution process when convolution operations are performed on a 6 × 6 × 1 input image using convolution kernels of three different sizes, namely 1 × 1 × 1, 3 × 3 × 1 and 5 × 5 × 1, with a convolution step of 3 pixels in both the height direction and the width direction and with "same" padding.
As can be understood from fig. 7: (1) when the width (or height) of the convolution kernel is smaller than the convolution step in the width (or height) direction (for example, the convolution in fig. 7 with the 1 × 1 × 1 kernel and a step of 3 pixels in the width and height directions), the convolution regions corresponding to the kernel in two adjacent convolution operations do not overlap, and the sliding convolution kernel does not cover all pixels of the input image; in this case, viewed from the signal source, the output image may lose a large amount of image information because not all pixels of the input image are used in the calculation.
(2) When the width (or height) of the convolution kernel is equal to the convolution step in the width (or height) direction (for example, the convolution in fig. 7 with the 3 × 3 × 1 kernel and a step of 3 pixels in the width and height directions), the convolution regions corresponding to the kernel in two adjacent convolution operations do not overlap, but the sliding convolution kernel covers all pixels of the input image.
(3) When the width (or height) of the convolution kernel is larger than the convolution step in the width (or height) direction (for example, the convolution in fig. 7 with the 5 × 5 × 1 kernel and a step of 3 pixels in the width and height directions), the convolution regions corresponding to the kernel in two adjacent convolution operations overlap, and the sliding convolution kernel covers all pixels of the input image.
Therefore, in both cases (2) and (3), viewed from the signal source, the output image does not lose a large amount of image information, because all pixels of the input image are used in the calculation.
It should be understood that, since a small portion of the image information carried by the input image is lost in the process of abstracting image features, the number of feature maps included in the output image can be increased to better retain the image information.
Therefore, when the B-fold downsampling processing is performed on the first image in the above manner, M, N and B may satisfy the following formula: N ≥ M × B/2; that is, the depth of the output image is increased subject to this constraint, so as to compensate for the image information lost from the input image.
According to the image processing method provided by the embodiment of the application, the convolution operation is performed on the first image through the first image processing layer, with a convolution step of B in both the width direction and the height direction, using N convolution kernels each having a height of K pixels, a width of J pixels and a depth of M feature maps, and with the first image padded with a height boundary of P pixels and a width boundary of P pixels, so that B-fold downsampling of the first image can be realized.
In addition, J and K are greater than or equal to B, so that the convolution kernel covers every pixel of the first image at least once during the convolution process, that is, all image information in the first image is retained.
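The output sizes given by the formula above can be checked with a short helper (illustrative code only; the "same"-style padding values P = (K − 1)/2 are an assumption chosen to match the fig. 7 setting):

```python
import math

def conv_output_size(size, kernel, stride, pad):
    """One spatial axis of a strided convolution: floor((size + 2*pad - kernel) / stride) + 1."""
    return math.floor((size + 2 * pad - kernel) / stride) + 1

# The three fig. 7 cases: 6 x 6 input, step B = 3, kernels 1, 3 and 5 with
# "same"-style padding P = (K - 1) / 2 -- every case yields a 2 x 2 output.
for kernel, pad in [(1, 0), (3, 1), (5, 2)]:
    print(conv_output_size(6, kernel, stride=3, pad=pad))  # prints 2 each time
```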
As a further alternative embodiment, the image processing apparatus performs a pooling operation on the M first feature maps through the first image processing layer to obtain the second image. The pooling steps of the pooling operation in the width direction and the height direction are both B, and the pooling kernel of the pooling operation has a height of B pixels and a width of B pixels. The second image includes M second feature maps, and each of the M second feature maps has a height of H/B pixels and a width of W/B pixels, where H/B and W/B are integers.
Optionally, the first image processing layer may be a pooling layer, which is not limited in this embodiment of the present application.
For a specific process, reference may be made to the above explanation of the pooling layer in the term explanation and fig. 3, and details are not repeated here to avoid repetition.
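A minimal sketch of such a pooling-based downsampling, assuming non-overlapping max pooling (hypothetical illustration code, not from the patent; average pooling would be analogous):

```python
import numpy as np

def max_pool(x, B):
    """B-fold downsampling of an (M, H, W) stack by non-overlapping B x B max pooling.

    The number of feature maps M is unchanged; only height and width shrink.
    """
    M, H, W = x.shape
    return x.reshape(M, H // B, B, W // B, B).max(axis=(2, 4))

x = np.random.rand(3, 6, 6)  # M = 3 first feature maps of 6 x 6
print(max_pool(x, 2).shape)  # (3, 3, 3): M second feature maps of H/B x W/B
```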
Alternatively, the first image processing layer may include P sub-image processing layers, P being an integer greater than 1.
Accordingly, the image processing apparatus implements Bi-fold downsampling processing on the first image through the i-th sub-image processing layer of the P sub-image processing layers, where the factors Bi satisfy the following formula:

B = B1 × B2 × … × BP,

where each Bi > 1.
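For example, under the assumption that each sub-image processing layer is a simple pooling step (an illustration only), a 6-fold downsampling can be realized as a 2-fold layer followed by a 3-fold layer, since 2 × 3 = 6:

```python
import numpy as np

def pool(x, b):
    """b-fold downsampling of a 2-D map by non-overlapping b x b averaging."""
    h, w = x.shape
    return x.reshape(h // b, b, w // b, b).mean(axis=(1, 3))

x = np.random.rand(12, 12)
y = pool(pool(x, 2), 3)  # B1 = 2, then B2 = 3: net B = 2 * 3 = 6
print(y.shape)           # (2, 2)
```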
Optionally, in S630, the image processing apparatus may perform the A-fold upsampling processing on the second image through the second image processing layer by using any of a plurality of different operations to obtain the third image, which is not limited in this embodiment of the application.
As an alternative embodiment, the image processing apparatus may perform a deconvolution operation on the second image through the second image processing layer to obtain the third image, where the deconvolution steps of the deconvolution operation in the width direction and the height direction are both A.
Optionally, the second image processing layer may be a deconvolution layer, but this is not limited in this embodiment of the present application.
For a specific up-sampling process, reference may be made to the explanation of the deconvolution operation process implemented by the deconvolution layer in the above term explanation, and details are not described here again to avoid repetition.
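For illustration, the following sketch shows the usual "scatter" view of a stride-A transposed convolution; the helper name and the all-ones kernel are assumptions made for this example, not the trained deconvolution layer itself:

```python
import numpy as np

def transposed_conv(x, kernel, A):
    """A-fold upsampling by a stride-A transposed convolution (deconvolution).

    Each input pixel scatters a copy of the kernel, scaled by its value, onto
    the output grid at A-pixel spacing; overlapping contributions are summed.
    """
    H, W = x.shape
    k = kernel.shape[0]
    out = np.zeros((A * (H - 1) + k, A * (W - 1) + k))
    for i in range(H):
        for j in range(W):
            out[A * i:A * i + k, A * j:A * j + k] += x[i, j] * kernel
    return out

second_image = np.random.rand(3, 3)
third_image = transposed_conv(second_image, np.ones((2, 2)), A=2)
print(third_image.shape)  # (6, 6): A times the input size when the kernel size equals A
```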
As another alternative embodiment, the image processing apparatus may perform a sub-pixel convolution operation on the second image through the second image processing layer to obtain the third image, where the width (or height) of the third image is A times the width (or height) of the second image, and the depth of the third image is 1/A² of the depth of the second image.
Optionally, the second image processing layer may be a sub-pixel convolution layer, but this is not limited in this embodiment of the present application.
For a specific upsampling process, reference may be made to the explanation of the sub-pixel convolution layer implementing the sub-pixel convolution process in the above term explanation and fig. 4, and details are not repeated here in order to avoid repetition.
Alternatively, the second image processing layer may include Q sub-image processing layers, Q being an integer greater than 1.
Accordingly, the image processing apparatus implements Ai-fold upsampling processing on the second image through the i-th sub-image processing layer of the Q sub-image processing layers, where the factors Ai satisfy the following formula:

A = A1 × A2 × … × AQ,

where each Ai > 1.
Optionally, S630 may include: the image processing apparatus performs first processing on the second image through the second image processing layer to obtain a fourth image, and then performs A-fold upsampling processing on the fourth image through the second image processing layer to obtain the third image.
The first processing may be non-sampling processing, that is, processing that performs neither upsampling nor downsampling; for example, the second image may be processed by a convolution operation with a convolution step of 1 in both the width direction and the height direction, or by an activation function, which is not limited in this embodiment of the present application.
It should be understood that, in the embodiment of the present application, the image processing apparatus may perform integral-multiple downsampling processing on the first image to obtain the second image and then perform integral-multiple upsampling processing on the second image to obtain the third image, thereby implementing non-integral-multiple downsampling or non-integral-multiple upsampling of the first image; however, the scope of the embodiment of the present application is not limited thereto. Alternatively, the image processing apparatus may perform A-fold upsampling processing on the first image to obtain a fifth image and then perform B-fold downsampling processing on the fifth image to obtain a sixth image, which likewise implements non-integral-multiple downsampling or upsampling of the first image.
The image processing method provided by the embodiment of the present application is described in detail above with reference to fig. 1 to 8, and the image processing apparatus provided by the embodiment of the present application will be described below with reference to fig. 9 to 10.
Fig. 9 shows a schematic block diagram of an image processing apparatus 900 provided in an embodiment of the present application. The apparatus 900 includes:
an acquisition unit 910 configured to acquire a first image;
a processing unit 920, configured to perform B-fold downsampling processing on the first image acquired by the acquiring unit 910 through a first image processing layer of a convolutional neural network to obtain a second image, where the convolutional neural network includes a plurality of image processing layers, the plurality of image processing layers include the first image processing layer, and B is an integer greater than 1; and to perform A-fold upsampling processing on the second image through a second image processing layer of the plurality of image processing layers to obtain a third image, where A is an integer greater than 1, and A is not equal to B.
Optionally, the first image includes M first feature maps, where each of the M first feature maps has a height of H pixels and a width of W pixels, H and W are integers greater than 1, and M is an integer greater than 0. The processing unit is specifically configured to: divide each first feature map into (H × W)/B² non-overlapping image blocks, where each of the (H × W)/B² image blocks has a height of B pixels and a width of B pixels; and obtain B² second feature maps from the (H × W)/B² image blocks, where each of the B² second feature maps has a height of H/B pixels and a width of W/B pixels, each pixel in each second feature map is taken from a different one of the (H × W)/B² image blocks, and the position of each pixel in each second feature map is associated with the position, in the first feature map, of the image block to which the pixel belongs. Here, (H × W)/B², H/B and W/B are integers.
Optionally, the first image includes M first feature maps, where each of the M first feature maps has a height of H pixels and a width of W pixels, H and W are integers greater than 1, and M is an integer greater than 0. The processing unit is specifically configured to perform a convolution operation on the M first feature maps through the first image processing layer to obtain the second image, where the convolution steps of the convolution operation in the width direction and the height direction are both B, the convolution operation uses N convolution kernels, each of the N convolution kernels has a height of K pixels, a width of J pixels and a depth of M feature maps, each first feature map is padded with a height boundary of P pixels and a width boundary of P pixels, and the second image includes N second feature maps, each of which has a height of ⌊(H + 2P − K)/B⌋ + 1 pixels and a width of ⌊(W + 2P − J)/B⌋ + 1 pixels, where N is an integer greater than 0, P is an integer greater than or equal to 0, and J and K are greater than or equal to B.
Optionally, M, N and B satisfy the following formula: N ≥ M × B/2.
Optionally, the first image includes M first feature maps, where each of the M first feature maps has a height of H pixels and a width of W pixels, H and W are integers greater than 1, and M is an integer greater than 0. The processing unit is specifically configured to perform a pooling operation on the M first feature maps through the first image processing layer to obtain the second image, where the pooling steps of the pooling operation in the width direction and the height direction are both B, the pooling kernel of the pooling operation has a height of B pixels and a width of B pixels, and the second image includes M second feature maps, each of which has a height of H/B pixels and a width of W/B pixels, where H/B and W/B are integers.
Optionally, A and B are relatively prime numbers.
It should be understood that the image processing apparatus 900 herein is embodied in the form of a functional unit. The term "unit" herein may refer to an Application Specific Integrated Circuit (ASIC), an electronic circuit, a processor (e.g., a shared, dedicated, or group processor) and memory that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that support the described functionality. In an alternative example, as will be understood by those skilled in the art, the image processing apparatus 900 may be embodied as an image processing apparatus in the foregoing method 600 embodiment, and the image processing apparatus 900 may be configured to execute each flow and/or step corresponding to the image processing apparatus in the foregoing method 600 embodiment, and is not described herein again to avoid repetition.
Fig. 10 shows a schematic block diagram of an image processing apparatus 1000 according to an embodiment of the present application, where the image processing apparatus 1000 may be the image processing apparatus shown in fig. 9, and the image processing apparatus may adopt a hardware architecture as shown in fig. 10. The image processing apparatus may include a processor 1010, a communication interface 1020, and a memory 1030, the processor 1010, the communication interface 1020, and the memory 1030 communicating with each other through an internal connection path. The related functions implemented by the processing unit 920 in fig. 9 may be implemented by the processor 1010, and the related functions implemented by the obtaining unit 910 may be implemented by the processor 1010 controlling the communication interface 1020.
The processor 1010 may include one or more processors, such as one or more Central Processing Units (CPUs), and in the case of one CPU, the CPU may be a single-core CPU or a multi-core CPU.
The communication interface 1020 is used for transmitting and/or receiving data. The communication interface may include a transmit interface for transmitting data and a receive interface for receiving data.
The memory 1030 includes, but is not limited to, a Random Access Memory (RAM), a read-only memory (ROM), an Erasable Programmable Read Only Memory (EPROM), and a compact disc read-only memory (CD-ROM), and the memory 1030 is used for storing relevant instructions and data.
The memory 1030 is used to store program codes and data of the image processing apparatus, and may be a separate device or integrated in the processor 1010.
Specifically, the processor 1010 is configured to control the communication interface 1020 to transmit data to and receive data from other devices, for example, other image processing apparatuses. For details, reference may be made to the description of the method embodiment, which is not repeated here.
It will be appreciated that fig. 10 shows only a simplified design of the image processing apparatus. In practical applications, the image processing apparatus may further include other necessary components, including but not limited to any number of communication interfaces, processors, controllers and memories, and all image processing apparatuses that can implement the present application fall within the protection scope of the present application.
In one possible design, the image processing apparatus 1000 may be replaced by a chip apparatus, for example, a chip usable in the image processing apparatus, for implementing the related functions of the processor 1010 in the image processing apparatus. The chip apparatus may be a field programmable gate array, an application-specific integrated circuit, a system chip, a central processing unit, a network processor, a digital signal processing circuit or a microcontroller for implementing the related functions, or may adopt a programmable controller or another integrated chip. The chip may optionally include one or more memories for storing program code that, when executed, causes the processor to implement the corresponding functions.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a U disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (6)

1. An image processing method, comprising:
acquiring a first image;
performing downsampling processing on the first image by a factor of B through a first image processing layer of a convolutional neural network to obtain a second image, wherein the convolutional neural network comprises a plurality of image processing layers, the plurality of image processing layers comprise the first image processing layer, and B is an integer greater than 1;
performing A-fold upsampling processing on the second image through a second image processing layer of the plurality of image processing layers to obtain a third image, wherein A is an integer greater than 1 and is not equal to B;
the first image comprises M first feature maps, the height of each first feature map in the M first feature maps is H pixels, the width of each first feature map is W pixels, H and W are integers greater than 1, and M is an integer greater than 0;
the downsampling processing of the first image by the factor of B is carried out by a first image processing layer of a convolutional neural network to obtain a second image, and the downsampling processing comprises the following steps:
dividing each first feature map into (H multiplied by W)/B which are not overlapped with each other2An image block of (H x W)/B2The height of each image block in the image blocks is B pixels, and the width of each image block is B pixels;
according to the formula (H x W)/B2An image block to obtain B2Zhang second feature diagram, B2The height of each second feature map in the second feature map is H/B pixels, the width of each second feature map is W/B pixels, and each pixel in each second feature map is taken from the (H multiplied by W)/B pixels2Different ones of the image blocks, each of saidThe position of the pixel in each second feature map is associated with the position of the image block to which each pixel belongs in the first feature map.
2. The method of claim 1, wherein a and B are relatively prime numbers.
3. An image processing apparatus characterized by comprising:
an acquisition unit configured to acquire a first image;
a processing unit, configured to perform B-fold downsampling processing on the first image acquired by the acquisition unit through a first image processing layer of a convolutional neural network to obtain a second image, wherein the convolutional neural network comprises a plurality of image processing layers, the plurality of image processing layers comprise the first image processing layer, and B is an integer greater than 1; and to perform A-fold upsampling processing on the second image through a second image processing layer of the plurality of image processing layers to obtain a third image, wherein A is an integer greater than 1 and is not equal to B;
the first image comprises M first feature maps, the height of each first feature map in the M first feature maps is H pixels, the width of each first feature map is W pixels, H and W are integers greater than 1, and M is an integer greater than 0;
the processing unit is specifically configured to:
dividing each first feature map into (H × W)/B² non-overlapping image blocks, wherein each of the (H × W)/B² image blocks has a height of B pixels and a width of B pixels; and

obtaining B² second feature maps from the (H × W)/B² image blocks, wherein each of the B² second feature maps has a height of H/B pixels and a width of W/B pixels, each pixel in each second feature map is taken from a different one of the (H × W)/B² image blocks, and the position of each pixel in each second feature map is associated with the position, in the first feature map, of the image block to which the pixel belongs.
4. The apparatus of claim 3, wherein A and B are relatively prime numbers.
5. An image processing apparatus comprising a memory, a processor, a communication interface and instructions stored on the memory and executable on the processor, wherein the memory, the processor and the communication interface are in communication with each other via an internal connection path, wherein the processor executes the instructions to cause the apparatus to implement the method of any one of claims 1 to 2.
6. A computer-readable medium for storing a computer program, wherein the computer program, when executed by a computer, implements the method of any of claims 1 to 2.




