CN113284055A - Image processing method and device

Image processing method and device

Info

Publication number
CN113284055A
Authority
CN
China
Prior art keywords
image
input image
guide
feature
features
Prior art date
Legal status
Pending
Application number
CN202110293366.2A
Other languages
Chinese (zh)
Inventor
王宪
汪涛
郑卓然
任文琦
操晓春
Current Assignee
Huawei Technologies Co Ltd
Institute of Information Engineering of CAS
Original Assignee
Huawei Technologies Co Ltd
Institute of Information Engineering of CAS
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd, Institute of Information Engineering of CAS filed Critical Huawei Technologies Co Ltd
Priority to CN202110293366.2A
Publication of CN113284055A

Classifications

    • G06T 5/73: Image enhancement or restoration; deblurring, sharpening
    • G06N 3/04: Neural networks; architecture, e.g. interconnection topology
    • G06N 3/084: Neural networks; learning methods, backpropagation, e.g. using gradient descent
    • G06T 3/4038: Scaling of whole images or parts thereof; image mosaicing, e.g. composing plane images from plane sub-images
    • G06T 5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T 5/70: Image enhancement or restoration; denoising, smoothing

Abstract

The application provides an image processing method and device in the field of artificial intelligence. Features are extracted separately from a plurality of channels of an input image to guide the up-sampling of bilateral grid data, so that a better image enhancement effect, such as defogging, is achieved in every channel of the image, lightweight image defogging is realized, and user experience is improved. The method comprises: acquiring an input image, wherein the input image comprises information of a plurality of channels; extracting features separately from the information of the plurality of channels of the input image to obtain a plurality of guide maps; acquiring bilateral grid data corresponding to the input image, wherein the bilateral grid data comprise data formed by information of a luminance dimension arranged over a spatial dimension, and the resolution of the bilateral grid data is lower than that of the input image; up-sampling the bilateral grid data with each of the plurality of guide maps as a guidance condition to obtain a plurality of feature maps; and fusing the plurality of feature maps to obtain an output image.

Description

Image processing method and device
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a method and an apparatus for image processing.
Background
Defogging technology has broadly gone through three stages of development. Early mainstream methods were traditional ones that model and estimate light propagation; however, hand-designed models and image priors cannot be accurately applied to the wide variety of complicated real pictures. After convolutional neural networks were applied to a large number of visual tasks and achieved significant breakthroughs, learning-based methods became mainstream, replacing some modules of the traditional methods with learnable network layers. More recent methods discard hand-designed modules such as light estimation and adopt an end-to-end network to solve the defogging task; that is, the whole image transformation process is learned by the network, and the algorithm is completely driven by data. However, an end-to-end network requires a large amount of computation and cannot process images in real time. Therefore, how to achieve a lightweight and better image enhancement effect is a problem to be solved urgently.
Disclosure of Invention
The application provides an image processing method and device that extract features separately from a plurality of channels of an input image to guide the up-sampling of bilateral grid data, so that a better image enhancement effect, such as defogging, can be achieved in every channel of the image, lightweight image defogging can be realized, and user experience is improved.
In view of the above, in a first aspect, the present application provides an image processing method, including: acquiring an input image, wherein the input image includes information of a plurality of channels; extracting features separately from the information of the plurality of channels of the input image to obtain a plurality of guide maps, wherein the guide maps correspond to the channels one by one, that is, each channel has a corresponding guide map; acquiring bilateral grid data corresponding to the input image, wherein the bilateral grid data include data formed by information of a luminance dimension arranged over a spatial dimension, the information of the luminance dimension is obtained from features extracted from the input image, the resolution of the bilateral grid data is lower than that of the input image, and the spatial dimension is a preset space or a space determined according to the input image; up-sampling the bilateral grid data with each of the plurality of guide maps as a guidance condition to obtain a plurality of feature maps, wherein each guide map can be used to guide the selection, from the luminance dimension of the bilateral grid data, of the information corresponding to its channel for up-sampling; and fusing the plurality of feature maps to obtain an output image.
Therefore, in this embodiment of the application, features may be extracted in the dimension of each channel of the input image to serve as guide maps for up-sampling the bilateral grid data of the input image, so that noise in the input image is smoothed and the definition of the image is improved, while information loss in each channel is avoided, further improving the definition of the image. Moreover, image enhancement is realized by up-sampling low-resolution bilateral grid data; because the resolution of the bilateral grid data is low, little computation is consumed, so that lightweight image enhancement is realized. For example, when an image needs to be defogged, the method provided by this embodiment can extract features from each channel of the image, so that more detailed information of each channel of the input image is retained in the final output image, the defogging effect is realized in a lightweight manner, and user experience is improved.
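As a concrete illustration of the per-channel guide maps described above, the following is a minimal PyTorch sketch of one possible guide-map generation step, assuming each channel is processed by its own small convolutional branch; the branch depth, widths and the use of a sigmoid to bound the guide values are assumptions, not details taken from this application.

```python
import torch
import torch.nn as nn

class GuideMapGenerator(nn.Module):
    """Sketch: one small convolutional branch per input channel, each producing
    a single-channel guide map in [0, 1]. Layer sizes are assumptions."""
    def __init__(self, num_channels=3):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(16, 1, 1), nn.Sigmoid(),
            )
            for _ in range(num_channels)
        ])

    def forward(self, image):  # image: (N, C, H, W)
        # One guide map per channel, extracted from that channel's information only.
        return [branch(image[:, c:c + 1]) for c, branch in enumerate(self.branches)]
```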
In a possible implementation, the up-sampling of the bilateral grid data with each of the plurality of guide maps as a guidance condition may include: up-sampling the bilateral grid data with a first guide map as the guidance condition to obtain an up-sampled feature, wherein the first guide map is any one of the plurality of guide maps; and fusing the up-sampled feature with the input image to obtain a first feature map, wherein the first feature map is one of the plurality of feature maps.
In this embodiment of the application, a guide map can be used as the guidance condition for up-sampling the bilateral grid data, so that the up-sampling can refer to the characteristics of each channel of the input image. Under the guidance of the characteristics of each channel, a better up-sampling effect is achieved: the up-sampled features describe the details of the input image more accurately and the noise in the input image is smoothed, realizing a denoising effect.
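The guided up-sampling ("slicing") of the low-resolution bilateral grid can be sketched as follows, assuming the grid is stored as a 5-D tensor (batch, channels, luma bins, grid height, grid width) and using PyTorch's grid_sample for trilinear interpolation (PyTorch >= 1.10 assumed for the meshgrid indexing argument); the tensor layout and value ranges are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def slice_bilateral_grid(grid, guide):
    """Up-sample a low-resolution bilateral grid under the guidance of a
    full-resolution, single-channel guide map.

    grid:  (N, C, D, Hg, Wg)  -- C feature values per luma bin and grid cell
    guide: (N, 1, H, W)       -- guide values assumed to lie in [0, 1]
    returns: (N, C, H, W) full-resolution up-sampled features
    """
    n, _, h, w = guide.shape
    # Normalised spatial sampling positions in [-1, 1].
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=guide.device),
        torch.linspace(-1, 1, w, device=guide.device),
        indexing="ij",
    )
    xs = xs.expand(n, h, w)
    ys = ys.expand(n, h, w)
    # The guide value selects the position along the luma (depth) dimension.
    zs = guide.squeeze(1) * 2 - 1
    # grid_sample expects (N, D_out, H_out, W_out, 3) coordinates in (x, y, z) order.
    coords = torch.stack([xs, ys, zs], dim=-1).unsqueeze(1)
    sliced = F.grid_sample(grid, coords, align_corners=True)  # (N, C, 1, H, W)
    return sliced.squeeze(2)
```

Because only the sampling happens at full resolution while the grid itself stays small, the cost scales with the grid's low resolution rather than with the input image, which matches the lightweight character described above.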
In a possible implementation, fusing the up-sampled feature with the input image to obtain the first feature map may include: compressing the up-sampled feature to obtain a compressed feature, wherein the number of channels of the compressed feature is smaller than that of the up-sampled feature; and computing an element-wise product of the compressed feature and the input image, that is, multiplying the value of each pixel in the compressed feature by the value of the corresponding pixel in the input image, to obtain the first feature map.
In this embodiment of the application, the up-sampled feature can be compressed into a compressed feature with fewer channels, which reduces the subsequent amount of computation when fusing it with the input image and helps realize lightweight image enhancement, so that the method can be applied to a wide variety of devices and has better generalization capability.
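A minimal sketch of this compression-and-fusion step, assuming a 1 x 1 convolution performs the channel compression; the channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CompressAndFuse(nn.Module):
    """Sketch: squeeze the up-sampled features to as many channels as the input
    image with a 1x1 convolution, then multiply element-wise with the input."""
    def __init__(self, up_channels=16, image_channels=3):
        super().__init__()
        self.compress = nn.Conv2d(up_channels, image_channels, kernel_size=1)

    def forward(self, upsampled, image):
        compressed = self.compress(upsampled)  # fewer channels than the up-sampled feature
        return compressed * image              # element-wise product with the input image
```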
In a possible implementation, acquiring the bilateral grid data corresponding to the input image may include: down-sampling the input image to obtain a down-sampled image; extracting features from the down-sampled image to obtain down-sampled features; and obtaining the bilateral grid data from the down-sampled features.
Therefore, in this embodiment of the application, the input image can be down-sampled to obtain a low-resolution image, and feature extraction is performed on the low-resolution image, so that when the bilateral grid data are subsequently up-sampled, the up-sampling can be performed under the guidance of the guide maps, achieving the effect of smoothing noise.
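The following sketch illustrates one way such bilateral grid data could be produced from a down-sampled copy of the input, assuming a small convolutional feature extractor whose output channels are reshaped into luma bins; the down-sampled size, grid resolution and channel counts are assumptions, not values from this application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilateralGridGenerator(nn.Module):
    """Sketch: down-sample the input, extract low-resolution features, and
    reshape them into a grid with spatial and luma dimensions."""
    def __init__(self, in_channels=3, grid_channels=16, luma_bins=8):
        super().__init__()
        self.grid_channels = grid_channels
        self.luma_bins = luma_bins
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, grid_channels * luma_bins, 3, padding=1),
        )

    def forward(self, image):
        low_res = F.interpolate(image, size=(256, 256), mode="bilinear",
                                align_corners=False)   # down-sampled copy of the input
        feats = self.features(low_res)                 # (N, C*D, Hg, Wg)
        n, _, hg, wg = feats.shape
        # Unfold the channel dimension into (channels per cell, luma bins).
        return feats.view(n, self.grid_channels, self.luma_bins, hg, wg)
```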
In a possible implementation, fusing the plurality of feature maps to obtain the output image includes: splicing the plurality of feature maps to obtain a spliced image; performing feature extraction on the spliced image at least once to obtain at least one first feature; and fusing the at least one first feature with the input image to obtain the output image.
Therefore, in this embodiment of the application, splicing (concatenation) along the channel dimension combines the information of the channels, so that a multi-channel output image is obtained.
In a possible implementation, splicing the plurality of feature maps may include: splicing the plurality of feature maps together with the input image to obtain the spliced image.
In this embodiment of the application, the input image can be merged in when the feature maps are spliced, so that the detail information of the spliced image is supplemented by the information included in the input image, information in the input image is prevented from being lost, and the definition of the image is improved.
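A minimal sketch of this splicing and fusion step, assuming the per-channel feature maps and the input image are concatenated along the channel dimension and refined by a couple of convolutions; the layer widths, channel counts and the residual-style fusion with the input are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ImageReconstruction(nn.Module):
    """Sketch: splice the per-channel feature maps together with the input image,
    refine the spliced tensor, and fuse the result back with the input."""
    def __init__(self, fused_channels=9, image_channels=3):  # e.g. three 3-channel maps
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(fused_channels + image_channels, 32, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, image_channels, 3, padding=1),
        )

    def forward(self, feature_maps, image):
        stitched = torch.cat(feature_maps + [image], dim=1)  # channel-dimension splice
        residual = self.refine(stitched)                     # at least one feature extraction
        return residual + image                              # fuse with the input image
```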
In a second aspect, the present application provides a neural network training method, including: acquiring a training set, wherein the training set includes a plurality of image samples and a truth image corresponding to each image sample, and each image sample includes information of a plurality of channels; and performing at least one iteration of training on the neural network using the training set to obtain a trained neural network. In any iteration of training, the neural network extracts features separately from the information of the plurality of channels of the input image to obtain a plurality of guide maps, where the guide maps correspond to the channels one by one, that is, each channel corresponds to one guide map; acquires bilateral grid data corresponding to the input image; up-samples the bilateral grid data with each of the plurality of guide maps as a guidance condition to obtain a plurality of feature maps; fuses the plurality of feature maps to obtain an output image; and updates the neural network according to the output image and the truth image corresponding to the input image to obtain the neural network updated in the current iteration. The bilateral grid data include data formed by information of a luminance dimension arranged in a preset space, the information of the luminance dimension is obtained from features extracted from the input image, and the resolution of the bilateral grid data is lower than that of the input image.
Therefore, in the method provided by the application, when the neural network is trained, features can be extracted in the dimension of each channel of the input image to serve as guide maps for up-sampling the bilateral grid data of the input image, so that noise in the input image is smoothed and the definition of the image is improved, while information loss in each channel is avoided, further improving the definition of the image output by the neural network. Moreover, image enhancement is realized by up-sampling low-resolution bilateral grid data; because the resolution of the bilateral grid data is low, little computation is consumed, so that lightweight image enhancement is realized. For example, when an image needs to be defogged, the neural network trained by the method provided by this embodiment can extract features from each channel of the image, so that more detailed information of each channel of the input image is retained in the final output image, the defogging effect is realized, and user experience is improved.
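A hedged sketch of the iterative training described above; the data loader, the L1 loss, the Adam optimizer and all hyper-parameters are assumptions and not taken from this application.

```python
import torch
import torch.nn as nn

def train(model, data_loader, epochs=10, lr=1e-4, device="cuda"):
    """Sketch: each batch pairs a degraded (e.g. hazy) image sample with its
    truth image, and the network is updated by back-propagating a loss."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.L1Loss()
    for _ in range(epochs):
        for sample, truth in data_loader:       # training-set pairs
            sample, truth = sample.to(device), truth.to(device)
            output = model(sample)              # guide maps, grid, slicing, fusion
            loss = criterion(output, truth)     # compare output with the truth image
            optimizer.zero_grad()
            loss.backward()                     # back-propagate the error
            optimizer.step()                    # update the neural network
    return model
```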
In a possible implementation, the up-sampling of the bilateral grid data with each of the plurality of guide maps as a guidance condition may include: up-sampling the bilateral grid data with a first guide map as the guidance condition to obtain an up-sampled feature, wherein the first guide map is any one of the plurality of guide maps; and fusing the up-sampled feature with the input image to obtain a first feature map, wherein the first feature map is one of the plurality of feature maps.
In this embodiment of the application, a guide map can be used as the guidance condition for up-sampling the bilateral grid data, so that the up-sampling can refer to the characteristics of each channel of the input image. Under the guidance of the characteristics of each channel, a better up-sampling effect is achieved: the up-sampled features describe the details of the input image more accurately and the noise in the input image is smoothed, realizing a denoising effect.
In a possible implementation, fusing the up-sampled feature with the input image may include: compressing the up-sampled feature to obtain a compressed feature, wherein the number of channels of the compressed feature is smaller than that of the up-sampled feature; and computing an element-wise product of the compressed feature and the input image to obtain a first feature map.
In this embodiment of the application, the up-sampled feature can be compressed into a compressed feature with fewer channels, which reduces the subsequent amount of computation when fusing it with the input image and helps realize lightweight image enhancement, so that the method can be applied to a wide variety of devices and has better generalization capability.
In a possible implementation, acquiring the bilateral grid data corresponding to the input image may include: down-sampling the input image to obtain a down-sampled image; and extracting features from the down-sampled image to obtain down-sampled features, wherein the bilateral grid data include the down-sampled features.
Therefore, in this embodiment of the application, the input image can be down-sampled to obtain a low-resolution image, and feature extraction is performed on the low-resolution image, so that when the bilateral grid data are subsequently up-sampled, the up-sampling can be performed under the guidance of the guide maps, achieving the effect of smoothing noise.
In a possible implementation, fusing the plurality of feature maps includes: splicing the plurality of feature maps to obtain a spliced image; performing feature extraction on the spliced image at least once to obtain at least one first feature; and fusing the at least one first feature with the input image to obtain the output image.
Therefore, in this embodiment of the application, splicing (concatenation) along the channel dimension combines the information of the channels, so that a multi-channel output image is obtained.
In a possible implementation, splicing the plurality of feature maps may include: splicing the plurality of feature maps together with the input image to obtain the spliced image.
In this embodiment of the application, the input image can be merged in when the feature maps are spliced, so that the detail information of the spliced image is supplemented by the information included in the input image, information in the input image is prevented from being lost, and the definition of the image is improved.
In a third aspect, the present application provides a neural network, which may include: a bilateral grid generation network, a guide map generation network, a feature reconstruction network, an image reconstruction network and the like.
The bilateral grid generation network may be configured to: down-sample a full-resolution input image to obtain a low-resolution image, and then generate bilateral grid data from the down-sampled image, wherein the bilateral grid data include information related to the spatial dimensions and luminance and form at least three-dimensional data.
The guide map generation network is configured to: extract features from each channel of the full-resolution input image to obtain a guide map corresponding to each channel, that is, each channel corresponds to one guide map.
The feature reconstruction network is configured to: for each channel, up-sample the bilateral grid under the guidance of the corresponding guide map to obtain a feature map corresponding to that channel.
The image reconstruction network is configured to: fuse the feature maps corresponding to the channels, and fuse the fused features with the input image to obtain an output image.
Furthermore, it is to be understood that the neural network may be used to perform the method steps of the first aspect or any of the alternative embodiments of the first aspect as described above.
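For illustration, the four sub-networks named above could be wired together as in the following sketch; the sub-network interfaces are assumptions that mirror the earlier sketches rather than the application's actual implementation.

```python
import torch.nn as nn

class EnhancementNetwork(nn.Module):
    """High-level sketch of composing the four sub-networks; interfaces assumed."""
    def __init__(self, grid_net, guide_net, feature_net, reconstruction_net):
        super().__init__()
        self.grid_net = grid_net                    # bilateral grid generation network
        self.guide_net = guide_net                  # one guide map per channel
        self.feature_net = feature_net              # guided up-sampling / feature reconstruction
        self.reconstruction_net = reconstruction_net

    def forward(self, image):
        grid = self.grid_net(image)                 # low-resolution bilateral grid data
        guides = self.guide_net(image)              # list of full-resolution guide maps
        feature_maps = [self.feature_net(grid, g, image) for g in guides]
        return self.reconstruction_net(feature_maps, image)
```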
In a fourth aspect, an embodiment of the present application provides an image processing apparatus having a function of implementing the image processing method of the first aspect. The function can be realized by hardware, and can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the functions described above.
In a fifth aspect, an embodiment of the present application provides a training apparatus, which has a function of implementing the neural network training method according to the second aspect. The function can be realized by hardware, and can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the functions described above.
In a sixth aspect, an embodiment of the present application provides an image processing apparatus, including: a processor and a memory, wherein the processor and the memory are interconnected by a line, and the processor calls the program code in the memory for executing the processing-related functions of the method for image processing according to any of the first aspect. Alternatively, the image processing device may be a chip.
In a seventh aspect, an embodiment of the present application provides a training apparatus, including: a processor and a memory, wherein the processor and the memory are interconnected by a line, and the processor calls the program code in the memory to execute the processing-related functions of the neural network training method according to any one of the second aspect. Alternatively, the training device may be a chip.
In an eighth aspect, the present application provides an image processing apparatus, which may also be referred to as a digital processing chip or chip, where the chip includes a processing unit and a communication interface, and the processing unit obtains program instructions through the communication interface, and the program instructions are executed by the processing unit, and the processing unit is configured to execute functions related to processing in the first aspect or any one of the optional implementations of the first aspect.
In a ninth aspect, embodiments of the present application provide a training apparatus, which may also be referred to as a digital processing chip or chip, where the chip includes a processing unit and a communication interface, the processing unit obtains program instructions through the communication interface, the program instructions are executed by the processing unit, and the processing unit is configured to execute functions related to processing in the second aspect or any optional implementation of the second aspect.
In a tenth aspect, an embodiment of the present application provides a computer-readable storage medium, which includes instructions that, when executed on a computer, cause the computer to perform the method in any optional implementation manner of the first aspect or the second aspect.
In an eleventh aspect, embodiments of the present application provide a computer program product containing instructions, which when run on a computer, cause the computer to perform the method in any of the optional embodiments of the first or second aspects.
Drawings
FIG. 1 is a schematic diagram of an artificial intelligence body framework for use in the present application;
fig. 2 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present disclosure;
FIG. 3 is a system architecture diagram according to an embodiment of the present application;
FIG. 4 is a schematic diagram of another system architecture according to an embodiment of the present application;
fig. 5 is a schematic view of an application scenario provided in an embodiment of the present application;
fig. 6 is a schematic view of another application scenario provided in the embodiment of the present application;
fig. 7 is a schematic flowchart of an image processing method according to an embodiment of the present application;
fig. 8 is a schematic flowchart of another image processing method according to an embodiment of the present application;
fig. 9 is a schematic flowchart of a process of generating bilateral mesh data according to an embodiment of the present application;
FIG. 10 is a flow chart illustrating a method for generating a guidance diagram according to an embodiment of the present disclosure;
FIG. 11 is a schematic flow chart of another example of generating a guidance diagram according to the present disclosure;
fig. 12 is a schematic flowchart of feature reconstruction of one of the channels according to an embodiment of the present disclosure;
fig. 13 is a schematic flowchart of another feature reconstruction of one of the channels according to an embodiment of the present disclosure;
fig. 14 is a schematic flowchart of image reconstruction provided in an embodiment of the present application;
fig. 15 is a schematic structural diagram of a neural network according to an embodiment of the present application;
FIG. 16 is a schematic flow chart of a neural network training method provided herein;
fig. 17 is a schematic diagram of an image enhancement effect provided in the present application;
fig. 18A is a schematic diagram of an image enhancement effect provided in the present application;
FIG. 18B is a schematic diagram of another image enhancement effect provided by the present application;
FIG. 18C is a schematic diagram of another image enhancement effect provided by the present application;
fig. 19 is a schematic structural diagram of an image processing apparatus provided in the present application;
FIG. 20 is a schematic diagram of a neural network training architecture provided in the present application;
fig. 21 is a schematic structural diagram of another image processing apparatus provided in the present application;
FIG. 22 is a schematic diagram of another neural network training configuration provided herein;
fig. 23 is a schematic structural diagram of a chip provided in the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The general workflow of an artificial intelligence system will be described first. Referring to fig. 1, which shows a schematic diagram of an artificial intelligence main framework, the framework is explained below from two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects the series of processes from data acquisition onward, for example the general processes of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision making, and intelligent execution and output. In this process, the data undergoes a refinement of "data - information - knowledge - wisdom". The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure and information (the technologies for providing and processing it) to the industrial ecology of the system.
(1) Infrastructure
The infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and provides support through a base platform. The infrastructure communicates with the outside through sensors; computing power is provided by intelligent chips, such as a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other hardware acceleration chips; the base platform includes related platform guarantees and support such as a distributed computing framework and networks, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside to acquire data, and the data are provided to the intelligent chips in the distributed computing system provided by the base platform for computation.
(2) Data
Data at the upper level of the infrastructure is used to represent the data source for the field of artificial intelligence. The data relates to graphs, images, voice and texts, and also relates to the data of the Internet of things of traditional equipment, including service data of the existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
The machine learning and the deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Inference means a process of simulating an intelligent human inference mode in a computer or an intelligent system, using formalized information to think about and solve a problem by a machine according to an inference control strategy, and a typical function is searching and matching.
The decision-making refers to a process of making a decision after reasoning intelligent information, and generally provides functions of classification, sequencing, prediction and the like.
(4) General capabilities
After the above-mentioned data processing, further based on the result of the data processing, some general capabilities may be formed, such as algorithms or a general system, e.g. translation, analysis of text, computer vision processing, speech recognition, recognition of images, etc.
(5) Intelligent product and industrial application
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they are the encapsulation of the overall artificial intelligence solution, commercializing intelligent information decision-making and realizing practical applications. The application fields mainly include: intelligent terminals, intelligent transportation, intelligent medical treatment, autonomous driving, safe cities, and the like.
The embodiments of the present application relate to a large number of related applications of neural networks, and in order to better understand the scheme of the embodiments of the present application, the following first introduces related terms and concepts of neural networks that may be related to the embodiments of the present application.
(1) Neural network
The neural network may be composed of neural units. A neural unit may refer to an operation unit that takes $x_s$ and an intercept of 1 as inputs, and the output of the operation unit may be as shown in formula (1-1):
$h_{W,b}(x) = f\left(W^{T}x\right) = f\left(\sum_{s=1}^{n} W_{s}x_{s} + b\right) \qquad (1\text{-}1)$
where $s = 1, 2, \ldots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit. $f$ is an activation function of the neural unit, used to introduce a nonlinear characteristic into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by joining a plurality of such single neural units together, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field may be a region composed of several neural units.
(2) Deep neural network
Deep neural networks (DNNs), also called multi-layer neural networks, can be understood as neural networks with multiple intermediate layers. According to the positions of the different layers, the layers inside a DNN can be divided into three categories: input layer, intermediate layers, and output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are intermediate layers, also called hidden layers. The layers are fully connected, that is, any neuron of the i-th layer is connected with every neuron of the (i+1)-th layer.
Although a DNN appears complex, the operation of each layer can be expressed as the linear relation $\vec{y} = \alpha(W\vec{x} + \vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is an offset vector (also referred to as a bias parameter), $W$ is a weight matrix (also referred to as coefficients), and $\alpha(\cdot)$ is an activation function. Each layer simply applies this operation to its input vector $\vec{x}$ to obtain its output vector $\vec{y}$. Because a DNN has many layers, the number of coefficients $W$ and offset vectors $\vec{b}$ is also large. These parameters are defined in the DNN as follows, taking the coefficient $w$ as an example: assume that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $w_{24}^{3}$, where the superscript 3 denotes the layer of the coefficient $W$, and the subscripts correspond to the output index 2 in the third layer and the input index 4 in the second layer. In general, the coefficient from the kth neuron at layer $L-1$ to the jth neuron at layer $L$ is defined as $W_{jk}^{L}$. Note that the input layer has no $W$ parameters. In a deep neural network, more intermediate layers make the network better able to characterize complex situations in the real world. Theoretically, a model with more parameters has higher complexity and a larger "capacity", which means it can accomplish more complex learning tasks. Training a deep neural network is thus a process of learning the weight matrices; its final goal is to obtain the weight matrix of every layer of the trained deep neural network (the weight matrices formed by the vectors $W$ of many layers).
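As a small numerical illustration of the per-layer expression above (all values are made up, for illustration only):

```python
import torch

# A single layer y = alpha(W x + b) with made-up numbers.
x = torch.tensor([1.0, 2.0, 3.0])   # input vector
W = torch.randn(4, 3)               # weight matrix of this layer
b = torch.zeros(4)                  # offset (bias) vector
y = torch.sigmoid(W @ x + b)        # output vector after the activation alpha
print(y.shape)                      # torch.Size([4])
```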
(3) Convolutional neural network
A Convolutional Neural Network (CNN) is a deep neural network with a convolutional structure. The convolutional neural network comprises a feature extractor consisting of convolutional layers and sub-sampling layers, which can be regarded as a filter. The convolutional layer is a neuron layer for performing convolutional processing on an input signal in a convolutional neural network. In convolutional layers of convolutional neural networks, one neuron may be connected to only a portion of the neighbor neurons. In a convolutional layer, there are usually several characteristic planes, and each characteristic plane may be composed of several neural units arranged in a rectangular shape. The neural units of the same feature plane share weights, where the shared weights are convolution kernels. Sharing weights may be understood as the way image information is extracted is location independent. The convolution kernel can be initialized in the form of a matrix of random size, and can be learned to obtain reasonable weights in the training process of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing connections between layers of the convolutional neural network, while reducing the risk of overfitting.
The network for extracting features mentioned below in the present application may include one or more convolutional layers, and may be implemented by using CNN, for example.
(4) Loss function
In training a deep neural network, the output of the network is expected to be as close as possible to the value that is really desired. Therefore, the weight vector of each layer can be updated according to the difference between the current predicted value of the network and the really desired target value (of course, there is usually an initialization process before the first update, namely presetting parameters for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to lower it, and the adjustment continues until the deep neural network can predict the really desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the role of loss functions or objective functions, which are important equations for measuring that difference. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing this loss as much as possible. The loss function may be, for example, a mean square error, cross entropy, logarithmic or exponential loss. For example, the mean square error can be used as the loss function, defined as $L = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$, where $y_i$ is the target value and $\hat{y}_i$ is the predicted value. The specific loss function can be selected according to the actual application scenario.
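A short illustration of the mean-square-error loss with made-up values:

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()
predicted = torch.tensor([0.9, 0.2, 0.4])   # predicted values (made up)
target = torch.tensor([1.0, 0.0, 0.5])      # truly desired target values
loss = criterion(predicted, target)          # mean of the squared differences
print(loss.item())                           # approximately 0.02
```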
(5) Back propagation algorithm
The neural network can adopt a Back Propagation (BP) algorithm to correct the size of parameters in the initial neural network model in the training process, so that the reconstruction error loss of the neural network model is smaller and smaller. Specifically, the error loss is generated by transmitting the input signal in the forward direction until the output, and the parameters in the initial neural network model are updated by reversely propagating the error loss information, so that the error loss is converged. The back propagation algorithm is a back propagation motion with error loss as a dominant factor, aiming at obtaining the optimal parameters of the neural network model, such as a weight matrix.
(6) Receptive Field (Receptive Field)
A term used in the field of deep neural networks for computer vision, denoting the size of the region of the original image perceived by neurons at different positions within the neural network. The larger the value of a neuron's receptive field, the larger the range of the original image it can touch, which also means it may contain more global features with a higher semantic level; the smaller the value, the more local and detailed the features it contains. The receptive field value can be used to approximate the level of abstraction at each layer.
(7) Image (image quality) enhancement
The technique of processing the brightness, color, contrast, saturation, dynamic range, etc. of an image to satisfy a certain specific index, or referred to as image quality enhancement, is equivalent to improving the quality of the image to make the image clearer.
(8) Defogging
It is a kind of image enhancement, which makes the image with some blur clearer. For example, an image can be shot in an environment with fog, and the shot image may not be clear at this time, and the defogging processing can be performed by the method provided by the application, so that the definition or contrast of the image is improved, and the image is clearer.
(9) RGB image
An RGB image is an image having at least three channels: red, green and blue. These three colors constitute all the colors that vision can perceive, and RGB is one of the most widely used color systems. For example, a frame of an RGB image is an M x N x 3 array of color pixels, where M x N is the size of the image and each color pixel is a set of three values corresponding to the red, green and blue components, respectively.
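For example, such a frame can be represented as follows (sizes and pixel values are arbitrary):

```python
import numpy as np

m, n = 4, 6
image = np.zeros((m, n, 3), dtype=np.uint8)  # M x N x 3 array of colour pixels
image[..., 0] = 255                          # set the red component everywhere
image[1, 2] = (0, 255, 0)                    # one pure-green pixel at row 1, column 2
print(image.shape)                           # (4, 6, 3)
```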
Generally, CNN is a common neural network, and as mentioned below in this application, the network for feature extraction may be CNN or other networks including convolutional layers, and for understanding, the following exemplary structure of the convolutional neural network is described.
The structure of CNN is described in detail below with reference to fig. 2. As described in the introduction of the basic concept above, the convolutional neural network is a deep neural network with a convolutional structure, and is a deep learning (deep learning) architecture, where the deep learning architecture refers to performing multiple levels of learning at different abstraction levels through a machine learning algorithm. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons can respond to images input thereto.
As shown in fig. 2, a Convolutional Neural Network (CNN)200 may include an input layer 210, a convolutional/pooling layer 220 (where the pooling layer is optional), and a fully connected layer 230. In the following embodiments of the present application, each layer is referred to as a stage for ease of understanding. The relevant contents of these layers are described in detail below.
Convolutional layer/pooling layer 220:
The convolutional layer/pooling layer 220 shown in fig. 2 may include layers 221 to 226 as examples. For example, in one implementation, 221 is a convolutional layer, 222 is a pooling layer, 223 is a convolutional layer, 224 is a pooling layer, 225 is a convolutional layer and 226 is a pooling layer; in another implementation, 221 and 222 are convolutional layers, 223 is a pooling layer, 224 and 225 are convolutional layers, and 226 is a pooling layer. That is, the output of a convolutional layer may be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
The inner working principle of a convolutional layer will be described below by taking convolutional layer 221 as an example.
Convolutional layer 221 may include many convolution operators, also called kernels. In image processing, a convolution operator acts as a filter that extracts specific information from the input image matrix. A convolution operator is essentially a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually slid over the input image in the horizontal direction one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride), so as to extract specific features from the image. The size of the weight matrix should be related to the size of the image. Note that the depth dimension of the weight matrix is the same as the depth dimension of the input image, and during the convolution operation the weight matrix extends to the entire depth of the input image. Thus, convolving with a single weight matrix produces a convolved output with a single depth dimension, but in most cases a single weight matrix is not used; instead, a plurality of weight matrices of the same size (rows x columns), that is, a plurality of matrices of the same type, are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image, where this dimension can be understood as being determined by the "plurality" described above. Different weight matrices may be used to extract different features of the image: for example, one weight matrix extracts image edge information, another extracts a particular color of the image, and yet another blurs unwanted noise in the image. The plurality of weight matrices have the same size (rows x columns), so the feature maps they extract also have the same size, and the extracted feature maps of the same size are combined to form the output of the convolution operation.
The weight values in these weight matrices need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 200 can make correct prediction.
When convolutional neural network 200 has multiple convolutional layers, the initial convolutional layers (e.g., 221) tend to extract more general features, which may also be referred to as low-level features. As the depth of convolutional neural network 200 increases, the later convolutional layers (e.g., 226) extract more complex features, such as features with high-level semantics; features with higher-level semantics are more suitable for the problem to be solved.
In the following embodiments of the present application, the process of extracting features may be a process of extracting features by convolution. For example, a convolutional layer may include a separable convolution with a 3 x 3 kernel (sep_conv_3x3), a separable convolution with a 5 x 5 kernel (sep_conv_5x5), a dilated (hole) convolution with a 3 x 3 kernel and a dilation rate of 2 (dil_conv_3x3), a dilated convolution with a 5 x 5 kernel and a dilation rate of 2 (dil_conv_5x5), and the like, as sketched below.
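The following sketch shows how such separable and dilated convolutions could look in PyTorch; the channel count is an assumption.

```python
import torch
import torch.nn as nn

channels = 16  # assumed channel count

# Separable convolution: a depth-wise 3x3 convolution followed by a 1x1 point-wise one.
sep_conv_3x3 = nn.Sequential(
    nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
    nn.Conv2d(channels, channels, 1),
)
# Dilated ("hole") convolution: a 3x3 kernel spread out with a dilation rate of 2.
dil_conv_3x3 = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)

x = torch.randn(1, channels, 64, 64)
print(sep_conv_3x3(x).shape, dil_conv_3x3(x).shape)  # both keep the 64 x 64 size
```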
A pooling layer:
Since it is often necessary to reduce the number of training parameters, a pooling layer is often periodically introduced after a convolutional layer. A pooling layer may also be referred to as a down-sampling layer and can be used to down-sample an image. In the layers 221 to 226 illustrated by 220 in FIG. 2, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. The purpose of the pooling layer is to reduce the spatial size of the image during image processing. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to a smaller size. The average pooling operator computes the average of the pixel values within a certain range of the image as the result of average pooling, and the max pooling operator takes the pixel with the largest value within a particular range as the result of max pooling. In addition, just as the size of the weight matrix in a convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the input image.
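A tiny example of the two pooling operators on a 4 x 4 feature map:

```python
import torch
import torch.nn as nn

x = torch.arange(16.0).view(1, 1, 4, 4)  # a 4x4 single-channel feature map
print(nn.AvgPool2d(2)(x))                # each output pixel is a 2x2 sub-region average
print(nn.MaxPool2d(2)(x))                # each output pixel is a 2x2 sub-region maximum
```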
Fully connected layer 230:
After processing by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet able to output the required output information, because, as described above, the convolutional layer/pooling layer 220 only extracts features and reduces the number of parameters brought by the input image. To generate the final output information (the required class information or other relevant information), the convolutional neural network 200 needs to use the fully connected layer 230 to generate an output, or a set of outputs, of the required number of classes. Therefore, the fully connected layer 230 may include a plurality of hidden layers (231, 232 to 23n shown in fig. 2), and the parameters included in these hidden layers may be obtained by pre-training on training data related to a specific task type; for example, the task type may include image enhancement, image recognition, image classification, image super-resolution reconstruction, and the like.
After the hidden layers of the fully connected layer 230, the last layer of the whole convolutional neural network 200 is the output layer 240. The output layer 240 has a loss function similar to the categorical cross entropy and is specifically used for calculating the prediction error. Once the forward propagation of the whole convolutional neural network 200 is completed (in fig. 2, propagation in the direction from 210 to 240 is the forward propagation), the backward propagation (in fig. 2, propagation in the direction from 240 to 210 is the backward propagation) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200 and the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
It should be noted that the convolutional neural network 200 shown in fig. 2 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models. For example, including only a portion of the network structure shown in fig. 2, for example, a convolutional neural network employed in an embodiment of the present application may include only an input layer 210, a convolutional/pooling layer 220, and an output layer 240.
In the present application, the convolutional neural network 200 shown in fig. 2 may be used to process the image to be processed, so as to obtain an enhanced image. As shown in fig. 2, the image to be processed is processed by the input layer 210, the convolutional layer/pooling layer 220 and the fully connected layer, and then the enhanced image with more clarity and more texture information is output.
The method for image processing provided by the embodiment of the present application may be executed on a server, and may also be executed on a terminal device, or a neural network mentioned below in the present application may be deployed on the server, and may also be deployed on the terminal, and may specifically be adjusted according to an actual application scenario. The terminal device may be a mobile phone with an image processing function, a Tablet Personal Computer (TPC), a media player, a smart tv, a notebook computer (LC), a Personal Digital Assistant (PDA), a Personal Computer (PC), a camera, a camcorder, a smart watch, a Wearable Device (WD), an autonomous vehicle, or the like, which is not limited in the embodiment of the present application.
As shown in fig. 3, the present embodiment provides a system architecture 100. In fig. 3, a data acquisition device 160 is used to acquire training data. In some alternative implementations, for image enhancement, the sample pairs included in the training data may include images with low quality and clear images, for example, one sample pair may include an image captured in foggy days and a clear image (or called a truth image) after a large amount of processing.
After the training data is collected, data collection device 160 stores the training data in database 130, and training device 120 trains target model/rule 101 based on the training data maintained in database 130. Alternatively, the training set mentioned in the following embodiments of the present application may be obtained from the database 130, or may be obtained by inputting data by a user.
The target model/rule 101 may be a neural network trained in the embodiment of the present application.
Describing the target model/rule 101 obtained by the training device 120 based on the training data, the training device 120 processes the input original image, and compares the output image with the original image until the difference between the output image and the original image of the training device 120 is smaller than a certain threshold, thereby completing the training of the target model/rule 101.
The target model/rule 101 can be used for implementing a neural network obtained by training the method for image processing according to the embodiment of the present application, that is, the data to be processed (e.g., an image) is input into the target model/rule 101 after being subjected to relevant preprocessing, so that a processing result can be obtained. The target model/rule 101 in the embodiment of the present application may specifically be a first neural network mentioned below in the present application, and the first neural network may be a CNN, DNN, RNN, or other type of neural network described above. It should be noted that, in practical applications, the training data maintained in the database 130 may not necessarily all come from the acquisition of the data acquisition device 160, and may also be received from other devices. It should be noted that, the training device 120 does not necessarily perform the training of the target model/rule 101 based on the training data maintained by the database 130, and may also obtain the training data from the cloud or other places for performing the model training.
The target model/rule 101 obtained by training with the training device 120 may be applied to different systems or devices, for example, the execution device 110 shown in fig. 3. The execution device 110 may also be referred to as a computing device, and may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (AR)/virtual reality (VR) device or a vehicle-mounted terminal, or may be a server, a cloud device, or the like. In fig. 3, the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices, and a user may input data to the I/O interface 112 through the client device 140, where the input data may include the data to be processed input by the client device.
The preprocessing module 113 and the preprocessing module 114 are configured to perform preprocessing according to input data (such as data to be processed) received by the I/O interface 112, and in this embodiment, the input data may be processed directly by the computing module 111 without the preprocessing module 113 and the preprocessing module 114 (or only one of them may be used).
In the process that the execution device 110 preprocesses the input data or in the process that the calculation module 111 of the execution device 110 executes the calculation or other related processes, the execution device 110 may call the data, the code, and the like in the data storage system 150 for corresponding processes, and may store the data, the instruction, and the like obtained by corresponding processes in the data storage system 150.
Finally, the I/O interface 112 returns the processing result to the client device 140 to provide it to the user. For example, if the first neural network is used for image classification and the processing result is a classification result, the I/O interface 112 returns the obtained classification result to the client device 140 and provides it to the user.
It should be noted that the training device 120 may generate corresponding target models/rules 101 based on different training data for different targets or different tasks, and the corresponding target models/rules 101 may be used to achieve the targets or complete the tasks, so as to provide the user with the required results. In some scenarios, the execution device 110 and the training device 120 may be the same device or may be located within the same computing device; for ease of understanding, the execution device and the training device are described separately, which is not intended to be limiting.
In the case shown in fig. 3, the user may manually give the input data through an interface provided by the I/O interface 112. Alternatively, the client device 140 may automatically send the input data to the I/O interface 112; if the client device 140 is required to obtain authorization from the user before automatically sending the input data, the user may set the corresponding permission in the client device 140. The user can view the result output by the execution device 110 at the client device 140, and the specific presentation form may be display, sound, action, or the like. The client device 140 may also act as a data collection terminal, collecting the input data input to the I/O interface 112 and the prediction tag output from the I/O interface 112 as new sample data and storing the new sample data in the database 130. Of course, the input data input to the I/O interface 112 and the prediction tag output from the I/O interface 112 as shown in the figure may also be directly stored into the database 130 as new sample data by the I/O interface 112 without being collected by the client device 140.
It should be noted that fig. 3 is only a schematic diagram of a system architecture provided in an embodiment of the present application, and a positional relationship between devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 3, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may be disposed in the execution device 110.
As shown in fig. 3, a target model/rule 101 is obtained through training by the training device 120. In this embodiment, the target model/rule 101 may be the neural network in the present application; specifically, the neural network provided in this embodiment may be a CNN, a deep convolutional neural network (DCNN), a recurrent neural network (RNN), or the like.
Referring to fig. 4, the present application further provides a system architecture 400. The execution device 110 is implemented by one or more servers (e.g., a server cluster), optionally in cooperation with other computing devices, such as data storage devices, routers, and load balancers; the execution device 110 may be disposed at one physical site or distributed across multiple physical sites. The execution device 110 may use data in the data storage system 150 or call program code in the data storage system 150 to implement the steps of the image processing method or the neural network training method of the present application corresponding to fig. 7 to 16 below.
The user may operate respective user devices (e.g., local device 401 and local device 402) to interact with the execution device 110. Each local device may represent any computing device, such as a personal computer, a computer workstation, a smartphone, a tablet computer, a smart camera, a smart car, a media consumption device, a wearable device, a set-top box, a game console, and so forth.
Each user's local device may interact with the execution device 110 via a communication network of any communication mechanism/standard, such as a wide area network, a local area network, a peer-to-peer connection, or any combination thereof. In particular, the communication network may include a wireless network, a wired network, or a combination of the two. The wireless network includes but is not limited to: a fifth-generation mobile communication technology (5th-Generation, 5G) system, a long term evolution (LTE) system, a global system for mobile communication (GSM) or code division multiple access (CDMA) network, a wideband code division multiple access (WCDMA) network, wireless fidelity (WiFi), Bluetooth, the Zigbee protocol, radio frequency identification (RFID), long range (LoRa) wireless communication, near field communication (NFC), or a combination of any one or more of these. The wired network may include a fiber optic communication network or a network of coaxial cables, among others.
In another implementation, one or more aspects of the execution device 110 may be implemented by each local device, e.g., the local device 401 may provide local data or feedback calculations for the execution device 110. The local device may also be referred to as a computing device.
It should be noted that all of the functions of the execution device 110 may also be performed by a local device. For example, the local device 401 may implement the functions of the execution device 110 and provide services to its own user, or provide services to a user of the local device 402, such as an image optimization service.
In general, in some scenes with fog or in scenes where the light intensity is weak, the captured image may be blurred or have low contrast, making the image unclear. Taking defogging as an example, the early mainstream methods were traditional methods that model and estimate light propagation; however, the artificially designed models and image priors cannot be accurately applied to complicated real pictures of different types. After convolutional neural networks were applied to a large number of visual tasks and achieved significant breakthroughs, learning-based methods became the mainstream, and some modules of the traditional methods were replaced by learnable network layers. Existing methods discard artificially designed modules such as light estimation and adopt an end-to-end network to solve the defogging task, that is, the whole process of image transformation is learned by the network and the algorithm is completely data-driven. However, current end-to-end networks that achieve the state-of-the-art (SOTA) effect have a large demand for computation, take much time to process images with higher resolution, and cannot process some higher-resolution images in real time.
Some common defogging methods use, for example, U-Net as the backbone network and borrow methods from the denoising field to gradually enhance the defogging effect, and provide a feature fusion module to fuse features of different scales, thereby overcoming the defects that the native U-Net loses spatial information and lacks non-adjacent level connections. However, such methods have high computational complexity and take a long time to process an image, so real-time processing cannot be realized.
For another example, a low-resolution image may be processed to obtain transform coefficients stored in a bilateral grid, and features of the full-resolution image may be extracted to obtain a guide map; under the guidance of the guide map, the bilateral grid coefficients are upsampled to obtain a pixel transformation matrix, and each pixel of the input image is transformed by the pixel transformation matrix to obtain the final output. However, obtaining the guide map requires compressing the features of the image, which may result in information loss, and the upsampling of the bilateral grid coefficients is implemented inefficiently and takes a long time.
Therefore, the present application provides an image processing method that achieves efficient image enhancement by a lightweight framework, and even achieves real-time image processing, thereby improving the efficiency of image enhancement.
First, some scenarios applied in the present application will be exemplarily described.
For example, as shown in fig. 5, in scenarios such as smart city, outdoor scene monitoring, indoor monitoring, in-vehicle monitoring, pet monitoring, field shooting, and field monitoring, low-quality video data collected by each monitoring device may be collected and stored in a memory. When the video data is played, the image processing method provided by the application can be used to perform image enhancement on the video data, such as defogging or contrast improvement, so that clearer video data is obtained and the viewing experience of the user is improved.
For another example, the image processing method provided by the present application may be applied to a live video scene, and as shown in fig. 6, a server may send a video stream to a client used by a user. After the client receives the data stream sent by the server, the data stream can be subjected to image enhancement processing by the image processing method provided by the application, so that video data with higher definition can be obtained, and the watching experience of a user is improved.
For another example, in an automatic driving scenario, a camera provided on a smart car may capture an image of the surroundings of the vehicle, and based on the image, a travel path may be planned for the vehicle or a travel decision may be made. By the image processing method, the image shot by the camera can be enhanced, a clearer image can be obtained, and the defogging effect is realized particularly in a foggy scene, so that the driving safety of a vehicle is improved, and the user experience is improved.
For example, a user can use the terminal to take a picture in an environment with fog, the taken image may be unclear due to the influence of the fog, and the taken image can be subjected to defogging processing by the image processing method provided by the application, so that the definition of the image is improved, and the user experience is improved.
For example, in some scenes with large ambient light differences, an image shot by a user may be unclear due to low contrast, and at this time, the shot image may be processed by the image processing method provided by the present application, so that the contrast of the image is improved, the image is clearer, and the user experience is improved.
The following describes the image processing method provided by the present application in detail with reference to the foregoing scene and system architecture.
Moreover, the image processing method provided by the present application may be implemented by a neural network, for example, the neural network may be deployed on a terminal of a user, and the terminal may operate the neural network, thereby implementing the steps of the image processing method provided by the present application.
Referring to fig. 7, a flow chart of an image processing method provided in the present application is shown as follows.
701. An input image is acquired.
The input image may be an image captured by the terminal or a received image, and the input image has a plurality of channels.
The input image may be input by a user, for example, referring to fig. 3, the user may send input data to the execution device 110 via the client device 140, and the input data may carry the input image.
Alternatively, if the image processing method provided by the present application is executed by a terminal, the input image may also be an image captured by the terminal. For example, the input image may be an image captured by the terminal in a foggy scene, and the defogging process is required to improve the definition of the image.
702. Features are extracted from information of a plurality of channels of an input image respectively to obtain a plurality of guide maps.
The input image has a plurality of channels. Features can be extracted from the dimension of each channel respectively to obtain a feature map for each channel dimension, namely a guide map; each channel can correspond to one feature map, so a plurality of guide maps are obtained. The plurality of guide maps may subsequently be used to guide the upsampling of the bilateral grid data, equivalently serving as guided upsampling.
For example, the input image may include information of three channels, and the features are extracted from the information of each channel, that is, the features corresponding to each channel may be obtained, and the guide map corresponding to each channel may be obtained.
703. Bilateral grid data corresponding to the input image are acquired.
The bilateral mesh data may include data formed by information of a space dimension and information of a brightness dimension, in other words, the bilateral mesh data includes data formed by information of a brightness dimension arranged in a preset space. It will be appreciated that the bilateral mesh data may include data in at least three dimensions.
Specifically, the spatial dimension may be preset, or may be determined according to the size of the input image, where the spatial dimension includes at least two dimensions, that is, the spatial dimension may include a two-dimensional space, or may include a three-dimensional space, and the like.
For example, the spatial dimension corresponds to the size of the input image, and each position in the spatial dimension corresponds to one or more pixel points in the input image. Accordingly, after features are extracted from the input image, the features may be processed to obtain information related to the brightness of the input image, and this information is assigned to the spatial dimension, so that the bilateral grid data are obtained.
In one possible implementation, an input image is downsampled to obtain a downsampled image; features are extracted from the downsampled image to obtain downsampled features, and the downsampled features can be included in the bilateral mesh data. The down-sampling mode can include various modes, such as bilinear interpolation or bicubic interpolation.
This can be understood as follows: first, the input image is downsampled to reduce the resolution of the image, which reduces the subsequent computational complexity and enables lightweight image enhancement; then, features are extracted from the downsampled image, average pooling is performed on the extracted features to obtain information related to the brightness of the input image, namely the downsampled features, and the downsampled features are mapped to the spatial dimension to obtain the bilateral grid data.
The spatial dimension in the bilateral grid data may be a preset dimension, or a dimension determined according to the size of the input image. For example, a size corresponding to the spatial dimension of the bilateral grid data may be preset, and the pixel points in the downsampled image have a correspondence with this size. For instance, if the preset size is 100 × 100 and the resolution of the downsampled image is 200 × 200, every 4 pixel points in the downsampled image correspond to one point of the preset size; after the downsampled features are extracted from the downsampled image, they are assigned to the corresponding points according to this correspondence, thereby obtaining the bilateral grid data.
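The correspondence between the downsampled image and the preset spatial dimension described above can be realized, for example, with average pooling. The following is a minimal illustrative sketch in PyTorch; the feature channel count and the sizes are only the example values from the text, not a prescribed implementation:

```python
import torch
import torch.nn.functional as F

# Illustrative sketch only (not the patented implementation): features extracted
# from a 200 x 200 downsampled image are assigned to a preset 100 x 100 spatial
# dimension, so that every 4 pixel points (a 2 x 2 block) contribute to one point.
feat = torch.randn(1, 8, 200, 200)                  # hypothetical per-pixel features
preset_hw = (100, 100)                              # preset spatial dimension
grid_feat = F.adaptive_avg_pool2d(feat, preset_hw)  # pooling realizes the correspondence
print(grid_feat.shape)                              # torch.Size([1, 8, 100, 100])
```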
It should be noted that, in the present application, the execution sequence of step 702 and step 703 is not limited, and step 702 may be executed first, step 703 may be executed first, step 702 and step 703 may also be executed simultaneously, and the specific method may be adjusted according to an actual application scenario.
704. The bilateral grid data are upsampled, with each of the plurality of guide maps as a guidance condition, to obtain a plurality of feature maps.
After the plurality of guide maps are obtained, the bilateral grid data may be upsampled for each channel under the guidance of the guide map corresponding to that channel, to obtain a plurality of feature maps corresponding to the plurality of channels. During upsampling, each guide map is used to guide the selection of the information related to the corresponding channel from the brightness-dimension information in the bilateral grid data, so that a feature map corresponding to each channel is obtained.
The upsampling mode may adopt various interpolation algorithms, such as bilinear interpolation, bicubic interpolation, or trilinear interpolation, to obtain a feature map with a resolution higher than that of the bilateral grid data.
Optionally, taking a processing procedure of one of the plurality of guidance maps (referred to as a first guidance map) as an example, the first guidance map is used as a guidance condition to perform upsampling on the bilateral mesh data to obtain an upsampled feature, and then the upsampled feature and the input image are fused to obtain a first feature map, that is, one of the plurality of feature maps. Fusion modalities may include, but are not limited to: multiplication, concatenation, or weighted fusion, etc.
For example, the brightness-dimension information in the bilateral grid data may be used as coefficients, and upsampling may be performed by interpolation. Each point in the spatial dimension of the bilateral grid data may correspond to a plurality of coefficient values; during upsampling, one or more of the coefficients corresponding to each point (or each group of points) are selected for interpolation under the guidance of the guide map, so as to obtain an upsampled feature with the same size as the guide map. The specific guidance works as follows: the value of each pixel point in the guide map can be understood as a brightness level, and each point in the guide map has a correspondence with a point of the spatial dimension in the bilateral grid data; according to the brightness level of each point in the guide map, a coefficient matching that brightness level is selected from the plurality of coefficients corresponding to that point (or group of points) in the bilateral grid data, and interpolation is performed to obtain the upsampled feature.
Therefore, in the embodiment of the application, the guide graph can be used as a guide to perform upsampling on the bilateral grid data, so that the upsampled features with higher resolution are obtained, and the guide graph of each channel is used to guide the upsampling of the bilateral grid data, so that the feature graph can be obtained for each channel, and the finally obtained output image can be better and clearer in each channel.
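As an illustration of the guided upsampling described above, the sketch below implements a bilateral-grid slicing operation with trilinear interpolation in PyTorch. The tensor layout, the use of torch.nn.functional.grid_sample, and the assumption that guide values lie in [0, 1] are illustrative choices, not the patented implementation:

```python
import torch
import torch.nn.functional as F

def slice_bilateral_grid(grid, guide):
    """Upsample a bilateral grid under the guidance of a single-channel guide map.

    grid  : (N, C, D, Hg, Wg)  coefficients; D is the brightness dimension and
            (Hg, Wg) is the low-resolution spatial dimension.
    guide : (N, 1, H, W)       full-resolution guide map with values in [0, 1],
            interpreted as a per-pixel brightness level.
    returns (N, C, H, W)       upsampled features, trilinearly interpolated.
    """
    n, _, h, w = guide.shape
    # Normalized spatial coordinates of every full-resolution pixel in [-1, 1].
    ys = torch.linspace(-1, 1, h, device=guide.device)
    xs = torch.linspace(-1, 1, w, device=guide.device)
    gy, gx = torch.meshgrid(ys, xs, indexing='ij')
    gx = gx.expand(n, h, w)
    gy = gy.expand(n, h, w)
    # The guide value selects the position along the brightness dimension.
    gz = guide.squeeze(1) * 2 - 1
    # grid_sample on a 5D input performs trilinear interpolation; the sampling
    # grid's last dimension is ordered (x, y, z) = (width, height, brightness).
    coords = torch.stack([gx, gy, gz], dim=-1).unsqueeze(1)   # (N, 1, H, W, 3)
    out = F.grid_sample(grid, coords, mode='bilinear', align_corners=True)
    return out.squeeze(2)                                     # (N, C, H, W)
```

For example, with bilateral grid data of shape (1, 12, 8, 32, 32) and a guide map of shape (1, 1, H, W), the function returns an upsampled feature of shape (1, 12, H, W).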
Optionally, the manner of fusing the upsampled features and the input image may include: compressing the upsampled features to obtain compressed features, and then fusing the compressed features with the input image to realize feature reconstruction and obtain one corresponding feature map. The specific fusion manner may include item-by-item product, weighted fusion, or the like. The compression reduces the number of channels of the upsampled features, so that the compressed features have fewer channels than the upsampled features, thereby reducing the data volume. The compression can be performed by convolution, which is equivalent to further feature extraction from the upsampled features; by adjusting the number of convolution kernels, compressed features with a smaller data size are obtained.
Therefore, in this embodiment of the application, the upsampled features can be compressed and the compressed features fused with the input image; feature reconstruction is realized by this fusion, so that the features of the content of the input image in each channel can be represented more accurately.
705. The plurality of feature maps are fused to obtain an output image.
After obtaining the plurality of feature maps, the plurality of feature maps can be fused, which is equivalent to restoring each channel to obtain a multi-channel output image. The number of channels of the output image and the number of channels of the input image are usually the same.
Therefore, in this embodiment of the application, features can be extracted from the dimensions of the multiple channels of the input image to obtain multiple guide maps; the low-resolution bilateral grid data are upsampled under the guidance of the multiple guide maps to obtain multiple feature maps; and the multiple feature maps are fused to obtain the output image. By upsampling the low-resolution bilateral grid data, noise in the input image is smoothed, the definition of the input image is improved, and a defogging effect or an improvement in contrast is achieved, so that a clearer output image is obtained. Moreover, image enhancement is realized by upsampling the low-resolution bilateral grid data; because the resolution of the bilateral grid data is low, the amount of computation consumed is small, so lightweight image enhancement is realized.
Optionally, the specific process of fusing the plurality of feature maps may include: splicing the plurality of feature maps to obtain a spliced image; performing feature extraction on the spliced image at least once (when there are multiple feature extractions, they may be performed iteratively, that is, the current features are extracted from the previously extracted features) to obtain at least one first feature; and fusing the at least one first feature with the input image to obtain the output image. Therefore, in this embodiment of the application, the plurality of feature maps can be fused by splicing. Since the feature maps are obtained by upsampling the bilateral grid data under the guidance of the guide map of each channel, splicing the plurality of feature maps recovers the channels, so that the spliced image has multiple channels; it is then fused with the input image to obtain a multi-channel output image. This is equivalent to enhancing the input image in the dimension of each channel to obtain an enhanced output image, which improves the definition of the image and improves user experience.
Optionally, when the plurality of feature maps are spliced, the input image may be spliced together with them to obtain the spliced image. In this way, the details included in the input image, such as low-frequency information of color or brightness, can supplement the details of the feature maps, so that the spliced image has richer details, excessive smoothing caused by the preceding steps is avoided, and the definition of the output image is further improved.
The foregoing describes the flow of the image processing method provided in the present application in detail, and for convenience of understanding, the following describes the flow of the image processing method provided in the present application in more detail by taking a specific application scenario as an example.
Referring to fig. 8, a flow chart of another image processing method provided by the present application is schematically illustrated.
For the sake of understanding, the image processing method of the image of the present application is described by dividing into a plurality of steps, such as bilateral mesh generation 801, guide map generation 802, feature reconstruction 803, and image reconstruction 804 shown in fig. 8.
Bilateral mesh generation 801: the input image I with full resolution can be downsampled to obtain a low-resolution image, and bilateral grid data g are then generated based on the downsampled image; the bilateral grid data include information related to the spatial dimension and to brightness, forming at least three-dimensional data.
Guide map generation 802: features are extracted from each channel of the full-resolution input image I to obtain a guide map corresponding to each channel, such as the guide maps G1/G2/G3 shown in FIG. 8.
Feature reconstruction 803: for each channel, under the guidance of the corresponding guide map, the bilateral grid is upsampled to obtain the feature map corresponding to that channel, i.e., F1/F2/F3.
Image reconstruction 804: the feature maps corresponding to the channels are fused, and the fused features are fused with the input image to obtain an output image O. Optionally, when the feature maps are fused, the input image may also be fused in, so that the details included in the input image (low-frequency features such as color or brightness) are combined with the features of each channel, and a fused image with richer details is obtained.
In this embodiment of the application, feature reconstruction can be carried out based on the bilateral grid data under the guidance of the guide map of each channel, so that more accurate and clearer features can be reconstructed for each channel of the input image; these features are further fused with the input image, so that the noise in the input image can be smoothed, the input image becomes clearer, and the effect of defogging or contrast improvement is realized.
Further, the respective steps are described in more detail below.
One, two-sided mesh generation
The input image I with full resolution is downsampled to obtain a low-resolution image, i.e., the downsampled image; the downsampling may be performed in various ways, such as bilinear interpolation, trilinear interpolation, or bicubic interpolation. Features are then extracted from the low-resolution image to obtain downsampled features, and the downsampled features are assigned to the corresponding space to obtain the bilateral grid data. The space may be preset, or may be obtained according to the size of the input image, for example, a two-dimensional space obtained by reducing the size of the input image.
For example, the generation of bilateral grid data may be described with reference to FIG. 9. After the low-resolution image is obtained, it is used as the input of a multi-layer convolution, and the extracted features may be used to represent the low-resolution image; the extracted features are then pooled by average pooling to obtain the downsampled features. A two-dimensional space can be preset, and the two-dimensional space has a correspondence with the input image or the low-resolution image; for example, one or more pixel points in the input image correspond to a point in the two-dimensional space. The downsampled features are then mapped to each point in the two-dimensional space to obtain the bilateral grid data. That is, the grid data includes at least three-dimensional data, comprising the two-dimensional space and information related to the brightness of the input image, which can be represented by the downsampled features.
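The bilateral mesh generation described above can be summarized by the following hedged sketch in PyTorch. The layer widths, the downsampled resolution, the grid size, and the number of brightness levels are all illustrative assumptions; only the overall pipeline (downsample, convolve, average-pool onto a preset two-dimensional space, arrange a brightness dimension) follows the description:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilateralGridGenerator(nn.Module):
    """Illustrative sketch: downsample, extract features with a few convolutions,
    average-pool onto a preset two-dimensional space, and reshape the channel axis
    into (coefficients x brightness levels) to form the bilateral grid. All layer
    widths, the grid size, and the number of brightness levels are assumptions."""

    def __init__(self, in_ch=3, coeff_ch=12, depth=8, grid_hw=(32, 32), low_res=(256, 256)):
        super().__init__()
        self.low_res = low_res
        self.grid_hw = grid_hw
        self.coeff_ch = coeff_ch
        self.depth = depth
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.PReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.PReLU(),
            nn.Conv2d(32, coeff_ch * depth, 3, padding=1),
        )

    def forward(self, image):
        # 1) downsample the full-resolution input (bilinear interpolation here).
        low = F.interpolate(image, size=self.low_res, mode='bilinear', align_corners=False)
        # 2) extract features from the downsampled image with multi-layer convolution.
        feat = self.features(low)
        # 3) average pooling assigns the features to the preset spatial dimension.
        pooled = F.adaptive_avg_pool2d(feat, self.grid_hw)
        # 4) unfold the channel axis into (coefficients, brightness levels).
        n = image.shape[0]
        return pooled.reshape(n, self.coeff_ch, self.depth, *self.grid_hw)
```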
Second, guide diagram generation
The input image I may include information of a plurality of channels, for example, an RGB image includes information of three channels. Information can be extracted from each channel of the input image I to obtain a feature map corresponding to each channel, and the feature map corresponding to each channel is used as a guide map for guiding subsequent up-sampling of the bilateral grid data.
For example, referring to FIG. 10, the input image I has three channels, such as channel 1, channel 2, and channel 3 shown in FIG. 10; an RGB image may be divided into the three channels R, G, and B. Feature extraction is then performed on each channel to obtain a feature map corresponding to each channel, such as the feature maps G1, G2, and G3 shown in FIG. 10.
Specifically, as shown in FIG. 11, the feature extraction may be implemented by multi-layer convolution: one or more convolution layers and one or more parametric rectified linear unit (PReLU) layers may be used as a feature extraction network, each channel of the input image is used as an input of the feature extraction network, and the feature map of each channel, i.e., G1, G2, and G3, is output.
Therefore, in the embodiment of the application, feature extraction is performed on each channel respectively, a plurality of guide maps are obtained, more details in the input image can be extracted, and the loss of the details is reduced.
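A minimal sketch of such a per-channel guide-map generation network is given below, assuming one small convolution + PReLU branch per channel; the number of layers, the hidden width, and the final Sigmoid that bounds the guide values to [0, 1] are assumptions added for illustration:

```python
import torch
import torch.nn as nn

class GuideMapGenerator(nn.Module):
    """Sketch of per-channel guide-map extraction: each input channel is passed
    through its own small stack of convolution + PReLU layers. The layer count,
    hidden width, and trailing Sigmoid are illustrative assumptions."""

    def __init__(self, num_channels=3, hidden=16):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(1, hidden, 3, padding=1), nn.PReLU(),
                nn.Conv2d(hidden, 1, 3, padding=1), nn.Sigmoid(),  # guide values in [0, 1]
            )
            for _ in range(num_channels)
        ])

    def forward(self, image):
        # split the input image into single-channel slices and process each one
        channels = torch.split(image, 1, dim=1)
        return [branch(c) for branch, c in zip(self.branches, channels)]  # [G1, G2, G3]
```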
Third, feature reconstruction
For example, as shown in fig. 12, after obtaining the guidance map and the bilateral grid data, under guidance of the guidance map, upsampling the bilateral grid data by using a manner including, but not limited to, bilinear interpolation, bicubic interpolation, or trilinear interpolation to obtain an upsampled feature U, then performing feature compression on the upsampled feature U to obtain a compressed feature C, and then performing feature reconstruction based on the input image I and the compressed feature C, for example, performing a product-by-product on the input image I and the compressed feature C to obtain a feature map F corresponding to the current channel. For example, the resolution of the input image I may be 100 × 100, the resolution of the compression feature C may also be 100 × 100, the pixel points of the input image I correspond to the pixel points of the compression feature C one to one, and the value of each pixel point in the input image I may be multiplied by the value of the corresponding pixel point in the compression feature C to obtain the feature map F. Of course, besides the product-by-product, the feature reconstruction may also be implemented in a manner of weighted fusion or splicing, etc., in the present application, multiplication is exemplarily presented, but not limited, and the specific feature reconstruction manner may be adjusted according to the actual application scenario.
Specifically, as shown in FIG. 13, the feature maps G1, G2, and G3 respectively serve as guide maps to guide the upsampling of the bilateral grid g, yielding an upsampled feature U; the number of channels of the upsampled feature U is then reduced by convolution to obtain a compressed feature C, and the compressed feature C is multiplied by the input image I to obtain the feature map F.
Therefore, in the embodiment of the present application, the feature reconstruction can be performed by multiplying the compressed feature by the input image, which is advantageous for the convergence of the training of the model, and is advantageous for the image enhancement with a small amount of calculation and a light weight.
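Building on the slice_bilateral_grid sketch given earlier, the feature reconstruction of one channel could look as follows; the channel counts and the 1x1 convolution used for compression are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FeatureReconstruction(nn.Module):
    """Sketch of one channel's feature reconstruction: slice the bilateral grid
    under the guide map (slice_bilateral_grid from the earlier sketch), compress
    the channel count with a 1x1 convolution, then multiply element-wise with the
    input image. Channel counts are illustrative assumptions."""

    def __init__(self, coeff_ch=12, out_ch=3):
        super().__init__()
        self.compress = nn.Conv2d(coeff_ch, out_ch, kernel_size=1)  # fewer channels than U

    def forward(self, grid, guide, image):
        u = slice_bilateral_grid(grid, guide)   # up-sampled feature U, (N, coeff_ch, H, W)
        c = self.compress(u)                    # compressed feature C
        return c * image                        # item-by-item product -> feature map F
```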
Fourth, image reconstruction
Taking an input image with three channels as an example, after the feature reconstruction is performed, the feature maps F1, F2, and F3 are obtained. The feature maps F1, F2, and F3 can be spliced to obtain a spliced image; then, at least one feature is obtained by performing at least one iterative feature extraction on the spliced image, and the at least one feature is spliced and fused with the input image to obtain the final output image.
Specifically, the image reconstruction may proceed as shown in fig. 14. First, the feature maps F1, F2, and F3 are spliced to obtain a spliced image C1; optionally, when splicing the feature maps F1, F2, and F3, the input image can be spliced with them at the same time, so that the spliced image contains richer information and loss of detail is avoided. At least one iterative feature extraction is then performed on C1, and the feature obtained by each extraction is recorded; each feature extraction can be realized by one or more basic units (blocks) formed by stacked convolutions, and each block can output a feature map. The extracted features are spliced to obtain a spliced feature C2, the spliced feature C2 is fused through a block to obtain a fused feature M, and the fused feature M is fused with the input image to obtain the full-resolution enhanced output image O. The fusion of the feature M with the input image may be weighted fusion or item-by-item product, and may be adjusted according to the actual application scenario.
Therefore, in this embodiment of the application, the feature maps of the channels can be spliced, thereby restoring the channels of the image. During splicing, the input image can also be spliced in to supplement the details of the feature maps, so that the obtained spliced image has richer details. Moreover, fusing the feature M with the input image by multiplication can realize defogging of the input image and obtain the enhanced image.
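A hedged sketch of this image reconstruction step is shown below; the number of blocks and their widths are assumptions, and only the structure described above (splice F1/F2/F3 with I, extract features iteratively, splice the intermediate features, fuse, and multiply with the input image) is reproduced:

```python
import torch
import torch.nn as nn

class ImageReconstruction(nn.Module):
    """Sketch of the image-reconstruction step: concatenate the per-channel feature
    maps with the input image, run a few convolutional blocks while recording their
    outputs, concatenate those outputs, fuse them with one more block, and multiply
    the fused feature with the input image. Block count and widths are assumptions."""

    def __init__(self, in_ch=3, num_maps=3, hidden=16, num_blocks=3):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.PReLU())
        # F1..F3 (each assumed to have in_ch channels) concatenated with the input I
        self.entry = block(in_ch * (num_maps + 1), hidden)
        self.blocks = nn.ModuleList([block(hidden, hidden) for _ in range(num_blocks)])
        self.fuse = block(hidden * num_blocks, in_ch)    # produces the fused feature M

    def forward(self, feature_maps, image):
        c1 = torch.cat(list(feature_maps) + [image], dim=1)   # spliced image C1
        x = self.entry(c1)
        outs = []
        for blk in self.blocks:                          # iterative feature extraction
            x = blk(x)
            outs.append(x)
        c2 = torch.cat(outs, dim=1)                      # spliced feature C2
        m = self.fuse(c2)                                # fused feature M
        return m * image                                 # full-resolution output O
```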
The foregoing describes in detail the flow of the image processing method provided in the present application, and a neural network for implementing the image processing method and a training method of the neural network provided in the present application are described in addition, and the neural network can be used for implementing the steps of the methods corresponding to the foregoing fig. 7 to 14.
Referring to fig. 15, a schematic diagram of a neural network according to the present application is shown.
Corresponding to the aforementioned fig. 8, the neural network can be divided into a plurality of modules, such as the bilateral mesh generation network, guide map generation network, feature reconstruction network, and image reconstruction network shown in fig. 15.
The bilateral mesh generation network may be configured to: downsample the full-resolution input image I to obtain a low-resolution image, and then generate bilateral grid data based on the downsampled image, where the bilateral grid data include information related to the spatial dimension and to brightness, forming at least three-dimensional data.
The guide map generation network is configured to: extract features from each channel of the full-resolution input image I to obtain a guide map corresponding to each channel, such as the guide maps G1/G2/G3 shown in FIG. 8.
The feature reconstruction network is configured to: for each channel, under the guidance of the corresponding guide map, upsample the bilateral grid to obtain the feature map corresponding to that channel, i.e., F1/F2/F3.
The image reconstruction network is configured to: fuse the feature maps corresponding to the channels, and fuse the fused features with the input image to obtain an output image O. Optionally, when fusing the feature maps, the input image may also be fused in, so that the details included in the input image are combined with the features of the channels and a fused image with richer details is obtained.
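Putting the four sub-networks together, a forward pass could be sketched as follows, reusing the illustrative classes defined in the earlier sketches (all shapes and hyper-parameters remain assumptions):

```python
import torch
import torch.nn as nn

class EnhancementNet(nn.Module):
    """End-to-end sketch combining the four sub-networks described above;
    it reuses the illustrative classes from the earlier sketches."""

    def __init__(self):
        super().__init__()
        self.grid_gen = BilateralGridGenerator()            # bilateral mesh generation network
        self.guide_gen = GuideMapGenerator()                 # guide map generation network
        self.feat_rec = nn.ModuleList(                       # one feature-reconstruction
            [FeatureReconstruction() for _ in range(3)])     # branch per input channel
        self.img_rec = ImageReconstruction()                 # image reconstruction network

    def forward(self, image):                                # image: (N, 3, H, W)
        grid = self.grid_gen(image)                          # (N, 12, 8, 32, 32)
        guides = self.guide_gen(image)                       # [G1, G2, G3], each (N, 1, H, W)
        feats = [rec(grid, g, image) for rec, g in zip(self.feat_rec, guides)]
        return self.img_rec(feats, image)                    # output image O, (N, 3, H, W)
```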
Based on the structure of the neural network, a training method of the neural network is described below.
Referring to fig. 16, a schematic flow chart of a neural network training method provided by the present application is described as follows.
1601. A training set is obtained.
The training set comprises a plurality of image samples and labels corresponding to the image samples, and each image sample comprises information of a plurality of channels.
In particular, the manner of obtaining the training set may include various manners, for example, if the present application is performed by the training device 120 shown in fig. 2, the training set may be information extracted from the database 130 or data transmitted by the client device 140.
For example, the training set may include a plurality of sample pairs, each sample pair having an image sample and a corresponding true-value image (i.e., a label) whose sharpness is higher than that of the image sample. For example, after a clear multi-channel image is captured, the image is used as the true-value image, and the true-value image is subjected to degradation processing, such as adding fog or reducing contrast, to obtain an image sample, which together with the true-value image forms a sample pair.
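As a simple illustration of how such a pair could be synthesized, the following sketch degrades a clear image with a uniform atmospheric scattering model I = J·t + A·(1−t); the model choice and the values of t and A are assumptions, not something prescribed by the present application:

```python
import torch

def make_training_pair(clear, t=0.6, airlight=0.9):
    """Synthesize a hazy image sample from a clear true-value image using a uniform
    atmospheric scattering model I = J*t + A*(1-t). The values of t and A are
    assumptions; a per-pixel transmission map could be used instead."""
    hazy = clear * t + airlight * (1.0 - t)
    return hazy, clear            # (image sample, label / true-value image)
```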
1602. At least one iterative training is performed on the neural network using the training set to obtain the trained neural network.
The image samples in the training set can be used as the input of the neural network, so that the neural network is subjected to at least one iterative training to obtain the trained neural network. The trained neural network may be used for image enhancement, such as implementing the steps of the corresponding methods of fig. 7-14 described previously.
The following is an example of a training process, which is described as steps 16021-16026 below, using an iterative process as an example.
16021. Features are extracted from information of a plurality of channels of an input image respectively to obtain a plurality of guide maps.
In an iterative training process, one or more image samples in a training set may be used as an input to a neural network, and this application exemplarily describes one input image as an example.
16022. Bilateral grid data corresponding to the input image are acquired.
16023. The bilateral grid data are upsampled, with each of the plurality of guide maps as a guidance condition, to obtain a plurality of feature maps.
16024. The plurality of feature maps are fused to obtain an output image.
Steps 16021 to 16024 are similar to the aforementioned steps 702 to 705, and are not described herein again.
16025. The neural network is updated according to the output image and the corresponding true-value image to obtain the currently updated neural network.
Various loss functions can be used, such as mean squared error, cross entropy, logarithmic, or exponential loss functions; this corresponds to calculating the deviation between the output image and the true-value image with a function.
After the loss value is calculated, a gradient of the parameter of the neural network may be calculated based on the loss value, and the gradient may be used to represent a derivative when the parameter of the neural network is updated, so that the parameter of the neural network may be updated according to the gradient to obtain an updated neural network.
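One iteration of this update could be sketched as follows, using the illustrative EnhancementNet from the earlier sketches; the mean-squared-error loss is one of the options listed above, while the Adam optimizer and learning rate are assumptions:

```python
import torch
import torch.nn as nn

# Minimal sketch of one training iteration (not the patented training procedure).
model = EnhancementNet()                         # illustrative network from earlier sketches
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

def train_step(image_sample, true_value_image):
    optimizer.zero_grad()
    output = model(image_sample)                 # forward pass (steps 16021-16024)
    loss = criterion(output, true_value_image)   # loss between output and true-value image
    loss.backward()                              # gradients of the network parameters
    optimizer.step()                             # update the parameters along the gradient
    return loss.item()
```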
16026. It is determined whether the convergence condition is satisfied, if yes, step 1603 is executed, otherwise, step 16021 is executed.
After the updated neural network is obtained, whether the convergence condition is met or not can be judged, if the convergence condition is met, the updated neural network can be output, and the training of the neural network is completed. If the convergence condition is not met, the training of the neural network may be continued, i.e., the step 16021 is repeatedly executed until the convergence condition is met.
Wherein the convergence condition may include one or more of: the training frequency of the neural network reaches a preset frequency, or the output precision of the neural network is higher than a preset precision value, or the average precision of the neural network is higher than a preset average value, or the training time length of the neural network exceeds a preset time length, and the like.
In this embodiment of the application, when the upsampled features and the input image are fused, the upsampled features can be compressed to obtain compressed features, and an item-by-item product of the compressed features and the input image is then computed; the product quickly feeds the information in the input image back into the feature map, which accelerates the convergence of the neural network so that the trained neural network is obtained efficiently.
1603. The updated neural network is output.
After the convergence condition is met, the trained neural network can be output, the neural network can be deployed in a terminal or a server, such as a mobile phone, a camera, a smart car or a monitoring device, and the like, and can be used for enhancing the acquired image and improving the definition of the image in scenes such as mobile phone photographing, camera photographing, automatic driving or smart cities, and the like, and especially can be used for enhancing the image through the light-weighted model provided by the application in scenes with higher requirements on real-time performance and computational overhead.
Therefore, the method provided by the present application can efficiently obtain a converged neural network, can be applied to various lightweight scenarios, and can be deployed on the computing nodes of related devices, especially devices with weak computing power, so that pictures degraded by fog can be enhanced to obtain clear results with normal colors, or the contrast of an image can be improved, thereby achieving the image enhancement effect.
In order to further facilitate understanding of the image enhancement effect of the method provided by the present application, the image enhancement effect achieved by the method provided by the present application is introduced in some common ways.
First, some common image enhancement methods include: a fast image dehazing algorithm (CAP), non-local image dehazing (NLD), single image haze removal using the dark channel prior (DCP), efficient image dehazing with boundary constraint and contextual regularization (BCCR), an end-to-end system for single image haze removal (DehazeNet), an all-in-one dehazing network (AOD-Net), a dehazing method based on multi-scale convolutional networks, a robust single image dehazing method (PMS), an attention-based multi-scale network for image dehazing, domain adaptation for image dehazing (DA), and a multi-scale boosted dehazing network with dense feature fusion (MSBDN), among others. Illustratively, the peak signal-to-noise ratio (PSNR) achieved by the above image enhancement methods and by the image processing method provided by the present application may be as shown in fig. 17. Clearly, the image processing method provided by the present application obtains images with a higher PSNR at a lower running time, achieves lighter-weight processing with a better defogging effect, and can process 4K images in real time.
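For reference, the PSNR metric used in this comparison can be computed as follows; this is a standard definition, assuming images normalized to [0, max_val], not code from the present application:

```python
import torch

def psnr(output, reference, max_val=1.0):
    """Peak signal-to-noise ratio: 10 * log10(MAX^2 / MSE)."""
    mse = torch.mean((output - reference) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```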
In more detail, Table 1 below compares the output results of the above conventional defogging methods and the image processing method provided by the present application on the O-HAZE data set, using PSNR and the structural similarity index (SSIM) as metrics.
TABLE 1: PSNR/SSIM comparison on the O-HAZE data set
Clearly, the image processing method provided by the present application consumes less computation while achieving high PSNR and high SSIM, and obtains images with a better defogging effect.
More intuitively, the comparison between the image enhancement effect of the present application and some common methods is shown in fig. 18A, fig. 18B, and fig. 18C. Clearly, the defogging effect achieved by the image processing method provided by the present application is better, the achieved PSNR and SSIM are higher, the definition of the image is greatly improved, and user experience is improved.
The flow of the method and the neural network provided by the present application are described in detail above, and the structure of the apparatus provided by the present application for performing the steps of the method is described below.
Referring to fig. 19, a schematic structural diagram of an image processing apparatus provided in the present application is shown.
The image processing apparatus may include:
a transceiver module 1901, configured to obtain an input image, where the input image includes information of multiple channels;
a feature extraction module 1902, configured to extract features from information of multiple channels of an input image, respectively, to obtain multiple guide maps;
a bilateral generation module 1903, configured to acquire bilateral mesh data corresponding to an input image, where the bilateral mesh data includes data formed by luminance dimension information arranged in a spatial dimension, the luminance dimension information is obtained according to a feature extracted from the input image, a resolution of the bilateral mesh data is lower than a resolution of the input image, and the spatial dimension is a preset space or a space determined according to the input image;
a guidance module 1904, configured to take each of the multiple guidance maps as a guidance condition, and perform upsampling on the bilateral grid data to obtain multiple feature maps;
a fusion module 1905, configured to fuse the multiple feature maps to obtain an output image.
In a possible implementation, the guiding module 1904 may be specifically configured to: using a first guide graph as a guide condition, carrying out up-sampling on bilateral grid data to obtain an up-sampling characteristic, wherein the first guide graph is any one of a plurality of guide graphs; and fusing the up-sampling characteristic and the input image to obtain a first characteristic diagram, wherein the first characteristic diagram is included in the plurality of characteristic diagrams.
In a possible implementation, the guiding module 1904 may be specifically configured to: compressing the upsampling features to obtain compressed features, wherein the number of channels of the compressed features is less than that of the channels of the upsampling features; and carrying out item-by-item product on the compressed features and the input image to obtain a first feature map.
In a possible implementation manner, the bilateral generation module 1903 may be specifically configured to: downsample the input image to obtain a downsampled image; and extract features from the downsampled image to obtain downsampled features, where the bilateral grid data include the downsampled features.
In a possible implementation, the fusion module 1905 may be specifically configured to: splicing the plurality of characteristic graphs to obtain a spliced image; performing at least one time of feature extraction on the spliced image to obtain at least one first feature; and fusing the at least one first characteristic and the input image to obtain an output image.
In a possible implementation, the fusion module 1905 may be specifically configured to splice a plurality of feature maps and the input image to obtain a spliced image.
Referring to fig. 20, a schematic diagram of a neural network training structure provided by the present application is described as follows.
The training device may comprise:
an obtaining module 2001, configured to obtain a training set, where the training set includes a plurality of image samples and a true value image corresponding to each image sample, and each image sample includes information of a plurality of channels;
a training module 2002, configured to perform at least one iterative training on a neural network using a training set to obtain a trained neural network;
in any iterative training process, the neural network extracts features from information of a plurality of channels of an input image respectively to obtain a plurality of guide graphs, bilateral grid data corresponding to the input image are obtained and are up-sampled respectively by taking each guide graph in the guide graphs as a guide condition to obtain a plurality of feature graphs, the plurality of feature graphs are fused to obtain an output image, the neural network is updated according to a true value image corresponding to the output image and the input image to obtain the current updated neural network, the bilateral grid data comprise data formed by brightness dimension information arranged in a preset space, the brightness dimension information is obtained according to the features extracted from the input image, and the resolution of the bilateral grid data is lower than that of the input image.
In a possible implementation, the training module 2002 may be specifically configured to: using a first guide graph as a guide condition, carrying out up-sampling on bilateral grid data to obtain an up-sampling characteristic, wherein the first guide graph is any one of a plurality of guide graphs; and fusing the up-sampling characteristic and the input image to obtain a first characteristic diagram, wherein the first characteristic diagram is included in the plurality of characteristic diagrams.
In a possible implementation, the training module 2002 may be specifically configured to: compressing the upsampling features to obtain compressed features, wherein the number of channels of the compressed features is less than that of the channels of the upsampling features; and carrying out item-by-item product on the compressed features and the input image to obtain a first feature map.
In a possible implementation, the training module 2002 may be specifically configured to: carrying out down-sampling on an input image to obtain a down-sampled image; and extracting features from the downsampled image to obtain downsampled features, wherein the bilateral grid data comprise the downsampled features.
In a possible implementation, the training module 2002 may be specifically configured to: splicing the plurality of characteristic graphs to obtain a spliced image; performing at least one time of feature extraction on the spliced image to obtain at least one first feature; and fusing the at least one first characteristic and the input image to obtain an output image.
In one possible implementation, the training module 2002 may be specifically configured to concatenate the plurality of feature maps and the input image to obtain a concatenated image.
Referring to fig. 21, a schematic structural diagram of another image processing apparatus provided in the present application is shown as follows.
The image processing apparatus may include a processor 2101 and a memory 2102. The processor 2101 and the memory 2102 are interconnected by a line. The memory 2102 stores program instructions and data.
The memory 2102 stores program instructions and data corresponding to the steps described above in fig. 7-14.
The processor 2101 is configured to perform the method steps performed by the image processing apparatus shown in any of the embodiments of fig. 7-14 described above.
Optionally, the image processing apparatus may further include a transceiver 2103 for receiving or transmitting data.
Also provided in embodiments of the present application is a computer-readable storage medium, which stores a program that, when executed on a computer, causes the computer to perform the steps of the method described in the foregoing embodiments shown in fig. 7-14.
Alternatively, the aforementioned image processing apparatus shown in fig. 21 is a chip.
Referring to fig. 22, a schematic structural diagram of another training device provided in the present application is shown as follows.
The training device may include a processor 2201 and a memory 2202. The processor 2201 and the memory 2202 are interconnected by a line. The memory 2202 stores therein program instructions and data.
The memory 2202 stores therein program instructions and data corresponding to the steps of fig. 15-16 described above.
The processor 2201 is configured to perform the method steps performed by the training apparatus shown in any one of the embodiments of fig. 15-16.
Optionally, the training device may further comprise a transceiver 2203 for receiving or transmitting data.
Also provided in embodiments of the present application is a computer-readable storage medium having a program stored therein, which when run on a computer causes the computer to perform the steps of the method as described in the embodiments of fig. 15-16.
Optionally, the aforementioned training device shown in fig. 22 is a chip.
The embodiment of the present application further provides an image processing apparatus, which may also be referred to as a digital processing chip or a chip, where the chip includes a processing unit and a communication interface, the processing unit obtains program instructions through the communication interface, and the program instructions are executed by the processing unit, and the processing unit is configured to execute the method steps shown in any one of the foregoing embodiments in fig. 7 to fig. 14.
Embodiments of the present application further provide a training apparatus, which may also be referred to as a digital processing chip or a chip, where the chip includes a processing unit and a communication interface, and the processing unit obtains program instructions through the communication interface, and the program instructions are executed by the processing unit, and the processing unit is configured to execute the steps of the method shown in any one of the foregoing fig. 15 to fig. 16.
The embodiment of the application also provides a digital processing chip. The digital processing chip integrates a circuit and one or more interfaces for implementing the functions of the processor 2101, the processor 2201, or both. When integrated with memory, the digital processing chip may perform the method steps of any one or more of the preceding embodiments. When the digital processing chip is not integrated with memory, it can be connected to an external memory through a communication interface and implements the method steps in the above embodiments according to program code stored in the external memory.
Embodiments of the present application also provide a computer program product, which when run on a computer, causes the computer to execute the steps of the method as described in any of the embodiments of fig. 7-14 or any of the embodiments of fig. 15-16.
The image processing apparatus or the training apparatus provided in the embodiment of the present application may be a chip, and the chip includes: a processing unit, which may be, for example, a processor, and a communication unit, which may be, for example, an input/output interface, a pin, or a circuit. The processing unit may execute the computer-executable instructions stored by the storage unit to cause the chip in the server to perform the image processing method described in the embodiments shown in fig. 7 to 14. Optionally, the storage unit is a storage unit in the chip, such as a register or a cache; the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
Specifically, the aforementioned processing unit or processor may be a central processing unit (CPU), a network processor (NPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
Referring to fig. 23, fig. 23 is a schematic structural diagram of a chip according to an embodiment of the present disclosure, where the chip may be represented as a neural network processor NPU 230, and the NPU 230 is mounted on a main CPU (Host CPU) as a coprocessor, and the Host CPU allocates tasks. The core portion of the NPU is an arithmetic circuit 2303, and the controller 2304 controls the arithmetic circuit 2303 to extract matrix data in the memory and perform multiplication.
In some implementations, the arithmetic circuit 2303 includes a plurality of processing units (PEs). In some implementations, the arithmetic circuit 2303 is a two-dimensional systolic array; it may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 2303 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 2302 and buffers it in each PE of the arithmetic circuit. The arithmetic circuit then takes the matrix A data from the input memory 2301, performs a matrix operation with the matrix B, and stores the partial results or the final result of the matrix in an accumulator 2308.
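As a minimal software analogue of this buffer-and-accumulate behavior, the sketch below streams tiles of the matrix A against a buffered matrix B and sums the partial products into an accumulator array. The numpy implementation, the function name and the tile size are assumptions for illustration only; they do not describe the actual circuit.

```python
import numpy as np

def tiled_matmul(A, B, tile=4):
    """Software analogue of the accumulate-as-you-go matrix multiply:
    partial products over tiles of the shared dimension are summed into
    an accumulator, mirroring how partial results land in accumulator 2308."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N))                       # plays the role of the accumulator
    for k0 in range(0, K, tile):               # stream tiles of A against the buffered B
        k1 = min(k0 + tile, K)
        C += A[:, k0:k1] @ B[k0:k1, :]         # partial result accumulated
    return C

A = np.random.rand(8, 6)
B = np.random.rand(6, 5)
assert np.allclose(tiled_matmul(A, B), A @ B)  # matches a direct matrix product
```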
The unified memory 2306 is used for storing input data and output data. The weight data is transferred to the weight memory 2302 directly through a direct memory access controller (DMAC) 2305, and the input data is also carried into the unified memory 2306 by the DMAC.
A bus interface unit (BIU) 2310 is used for the interaction between the AXI bus on one side and the DMAC and the instruction fetch buffer (IFB) 2309 on the other.
The bus interface unit 2310 is used by the instruction fetch buffer 2309 to obtain instructions from the external memory, and is also used by the storage unit access controller (DMAC) 2305 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 2306, to transfer weight data to the weight memory 2302, or to transfer input data to the input memory 2301.
The vector calculation unit 2307 includes a plurality of operation processing units and, where necessary, further processes the output of the arithmetic circuit, for example by vector multiplication, vector addition, exponential operation, logarithmic operation, or magnitude comparison. It is mainly used for the non-convolution/non-fully-connected layer computation in the neural network, such as batch normalization (batch normalization), pixel-level summation, and up-sampling of feature planes.
In some implementations, the vector calculation unit 2307 can store the processed output vectors in the unified memory 2306. For example, the vector calculation unit 2307 may apply a linear function and/or a nonlinear function to the output of the arithmetic circuit 2303, such as linear interpolation of the feature planes extracted by a convolution layer, or a nonlinearity applied to a vector of accumulated values to generate activation values. In some implementations, the vector calculation unit 2307 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vectors can be used as activation inputs to the arithmetic circuit 2303, for example for use in subsequent layers of the neural network.
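For readers who want a concrete picture of this kind of post-processing, the sketch below applies batch-norm-style normalization, a ReLU activation, and a pixel-level channel sum to a feature tensor. The function name, the choice of ReLU, and the tensor layout are assumptions made for the example, not a description of the hardware.

```python
import numpy as np

def postprocess(raw, eps=1e-5):
    """Normalize per channel, apply a nonlinearity, and sum across channels:
    the kind of non-convolution work delegated to the vector unit."""
    mean = raw.mean(axis=(0, 2, 3), keepdims=True)   # per-channel statistics
    var = raw.var(axis=(0, 2, 3), keepdims=True)
    normalized = (raw - mean) / np.sqrt(var + eps)   # normalized values
    activated = np.maximum(normalized, 0.0)          # ReLU as an example activation
    summed = activated.sum(axis=1, keepdims=True)    # pixel-level summation over channels
    return activated, summed

feat = np.random.rand(2, 3, 8, 8)                    # (batch, channels, H, W)
activated, summed = postprocess(feat)
```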
An instruction fetch buffer (IFB) 2309 connected to the controller 2304 is used for storing the instructions used by the controller 2304.
The unified memory 2306, the input memory 2301, the weight memory 2302, and the instruction fetch buffer 2309 are all on-chip memories. The external memory is a memory external to the NPU hardware architecture.
The operations at each layer of the recurrent neural network may be performed by the arithmetic circuit 2303 or the vector calculation unit 2307.
Any of the processors mentioned above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits configured to control the execution of the programs of the methods in any of the embodiments of fig. 7 to fig. 14 or fig. 15 to fig. 16.
It should be noted that the apparatus embodiments described above are merely illustrative. The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments. In addition, in the drawings of the apparatus embodiments provided in the present application, the connection relationship between modules indicates that they have a communication connection, which may be specifically implemented as one or more communication buses or signal lines.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus the necessary general-purpose hardware, and certainly can also be implemented by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. In general, any function performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function can vary, for example analog circuits, digital circuits, or dedicated circuits. For the present application, however, a software implementation is preferable in most cases. Based on such an understanding, the technical solutions of the present application may be embodied in the form of a software product that is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc of a computer, and that includes instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments of the present application.
All or some of the above embodiments may be implemented by software, hardware, firmware, or any combination thereof. When software is used for the implementation, the embodiments may be implemented wholly or partly in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center over a wired link (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless link (e.g., infrared, radio, or microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), a semiconductor medium (e.g., a solid-state drive (SSD)), or the like.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Finally, it should be noted that the foregoing descriptions are merely specific embodiments of the present application, and the protection scope of the present application is not limited thereto. Any change or substitution readily conceivable by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

1. A method of image processing, comprising:
acquiring an input image, wherein the input image comprises information of a plurality of channels;
extracting features from the information of the plurality of channels of the input image respectively to obtain a plurality of guide maps, wherein the plurality of guide maps are in one-to-one correspondence with the plurality of channels;
acquiring bilateral grid data corresponding to the input image, wherein the bilateral grid data comprises data formed by information of a luminance dimension arranged in a spatial dimension, the information of the luminance dimension is obtained from features extracted from the input image, the resolution of the bilateral grid data is lower than that of the input image, and the spatial dimension is a preset space or a space determined from the input image;
performing up-sampling on the bilateral grid data with each of the plurality of guide maps as a guide condition, respectively, to obtain a plurality of feature maps; and
fusing the plurality of feature maps to obtain an output image.
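As a purely illustrative companion to claim 1, the numpy sketch below walks through the recited flow: one guide map per channel, a low-resolution bilateral grid, guided up-sampling ("slicing") of that grid, and fusion of the resulting feature maps into an output image. The trivial guide-map extractor, the nearest-neighbour slicing rule, and the mean-based fusion are all assumptions chosen for readability, not the claimed implementation.

```python
import numpy as np

def make_guide_maps(img):
    """One guide map per channel; here simply a copy of each channel
    (stands in for the learned per-channel feature extraction)."""
    return [channel for channel in img]              # img: (C, H, W)

def slice_bilateral_grid(grid, guide):
    """Guided up-sampling ("slicing"): look up each pixel's coefficient in the
    low-resolution grid using its spatial position and the guide value as the
    luminance coordinate."""
    Hg, Wg, D = grid.shape
    H, W = guide.shape
    ys = (np.arange(H) * Hg // H)[:, None]           # coarse row index per pixel
    xs = (np.arange(W) * Wg // W)[None, :]           # coarse column index per pixel
    zs = np.clip((guide * D).astype(int), 0, D - 1)  # luminance bin from the guide map
    return grid[ys, xs, zs]                          # (H, W) per-pixel coefficient

def process(img, grid):
    guides = make_guide_maps(img)
    feature_maps = [slice_bilateral_grid(grid, g) * c   # fuse sliced coefficients with the input
                    for g, c in zip(guides, img)]
    return np.mean(feature_maps, axis=0)                # simple fusion into one output image

img = np.random.rand(3, 64, 64)                         # 3-channel input, values in [0, 1]
grid = np.random.rand(16, 16, 8)                        # low-resolution bilateral grid
out = process(img, grid)
```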
2. The method of claim 1, wherein the performing up-sampling on the bilateral grid data with each of the plurality of guide maps as a guide condition, respectively, comprises:
performing up-sampling on the bilateral grid data with a first guide map as a guide condition to obtain an up-sampling feature, wherein the first guide map is any one of the plurality of guide maps;
and fusing the up-sampling feature and the input image to obtain a first feature map, wherein the first feature map is included in the plurality of feature maps.
3. The method of claim 2, wherein the fusing the up-sampling feature and the input image to obtain the first feature map comprises:
compressing the up-sampling feature to obtain a compressed feature, wherein the number of channels of the compressed feature is less than that of the up-sampling feature;
and performing an element-wise product of the compressed feature and the input image to obtain the first feature map.
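To make the compression-then-product step concrete, here is a small numpy illustration in the same spirit as the sketch above; averaging channel groups as the "compression" and the chosen channel counts are assumptions for brevity, not the claimed compression operator.

```python
import numpy as np

def compress_channels(upsampled, out_channels):
    """Reduce the channel count of the up-sampling feature by averaging
    groups of channels (a stand-in for a learned 1x1 projection)."""
    c, h, w = upsampled.shape
    groups = np.array_split(np.arange(c), out_channels)
    return np.stack([upsampled[g].mean(axis=0) for g in groups])  # (out_channels, H, W)

up_feat = np.random.rand(12, 64, 64)       # up-sampling feature with 12 channels
image = np.random.rand(3, 64, 64)          # input image with 3 channels
compressed = compress_channels(up_feat, out_channels=3)
first_feature_map = compressed * image     # element-wise product with the input image
```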
4. The method according to any one of claims 1-3, wherein the acquiring bilateral grid data corresponding to the input image comprises:
down-sampling the input image to obtain a down-sampled image;
extracting features from the down-sampled image to obtain down-sampled features;
and determining the bilateral grid data according to the down-sampled features.
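The sketch below illustrates one way such a low-resolution grid could be assembled: average-pool the input, treat the pooled luminance as a simple down-sampled feature, and scatter it into cells indexed by coarse position and luminance bin. The pooling factor, the luminance estimate, and the bin count are illustrative assumptions only.

```python
import numpy as np

def downsample(image, factor=4):
    """Average-pool the input image by an integer factor (the down-sampling step)."""
    c, h, w = image.shape
    return image.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))

def build_bilateral_grid(image, factor=4, depth=8):
    """Down-sample the input, treat the down-sampled luminance as a simple feature,
    and scatter it into a grid indexed by coarse position and luminance bin."""
    small = downsample(image, factor)                # (C, H/f, W/f) down-sampled image
    feat = small.mean(axis=0)                        # crude down-sampled feature (luminance)
    gh, gw = feat.shape
    grid = np.zeros((gh, gw, depth))
    bins = np.clip((feat * depth).astype(int), 0, depth - 1)
    grid[np.arange(gh)[:, None], np.arange(gw)[None, :], bins] = feat
    return grid                                      # resolution lower than the input image

img = np.random.rand(3, 64, 64)
grid = build_bilateral_grid(img)                     # shape (16, 16, 8)
```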
5. The method according to any one of claims 1-4, wherein said fusing the plurality of feature maps to obtain the output image comprises:
stitching the plurality of feature maps to obtain a stitched image;
performing feature extraction at least once on the stitched image to obtain at least one first feature;
and fusing the at least one first feature and the input image to obtain the output image.
6. The method of claim 5, wherein the stitching the plurality of feature maps to obtain the stitched image comprises:
stitching the plurality of feature maps and the input image to obtain the stitched image.
7. A neural network training method, comprising:
acquiring a training set, wherein the training set comprises a plurality of image samples and a ground-truth image corresponding to each image sample, and each image sample comprises information of a plurality of channels;
performing at least one iterative training on the neural network by using the training set to obtain a trained neural network;
wherein, in any one iterative training process, the neural network extracts features from the information of a plurality of channels of an input image respectively to obtain a plurality of guide maps, the plurality of guide maps being in one-to-one correspondence with the plurality of channels; acquires bilateral grid data corresponding to the input image; performs up-sampling on the bilateral grid data with each of the plurality of guide maps as a guide condition, respectively, to obtain a plurality of feature maps; fuses the plurality of feature maps to obtain an output image; and is updated according to the output image and the ground-truth image corresponding to the input image, to obtain the neural network updated in the current iteration; the bilateral grid data comprises data formed by information of a luminance dimension arranged in a preset space, the information of the luminance dimension is obtained from features extracted from the input image, and the resolution of the bilateral grid data is lower than that of the input image.
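As an illustration of the iterative loop recited in this claim (a forward pass producing an output image, a comparison against the ground-truth image, and a parameter update), the PyTorch sketch below uses a stand-in single-convolution network, a synthetic training set, and an L1 loss, all of which are assumptions made for brevity rather than the disclosed network, data, or loss.

```python
import torch
import torch.nn as nn

# Stand-in network: a single 3x3 convolution mapping a 3-channel input image
# to a 3-channel output image (purely a placeholder for the claimed network).
net = nn.Conv2d(3, 3, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()

# Tiny synthetic training set: (image sample, ground-truth image) pairs.
training_set = [(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)) for _ in range(4)]

for epoch in range(2):                      # at least one iterative training pass
    for sample, ground_truth in training_set:
        output = net(sample)                # forward pass produces the output image
        loss = loss_fn(output, ground_truth)
        optimizer.zero_grad()
        loss.backward()                     # update the network from output vs. ground truth
        optimizer.step()
```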
8. The method of claim 7, wherein the performing up-sampling on the bilateral grid data with each of the plurality of guide maps as a guide condition, respectively, comprises:
performing up-sampling on the bilateral grid data with a first guide map as a guide condition to obtain an up-sampling feature, wherein the first guide map is any one of the plurality of guide maps;
and fusing the up-sampling feature and the input image to obtain a first feature map, wherein the first feature map is included in the plurality of feature maps.
9. An apparatus for image processing, comprising:
a transceiver module, configured to acquire an input image, wherein the input image comprises information of a plurality of channels;
a feature extraction module, configured to extract features from the information of the plurality of channels of the input image respectively to obtain a plurality of guide maps, wherein the plurality of guide maps are in one-to-one correspondence with the plurality of channels;
a bilateral generation module, configured to acquire bilateral grid data corresponding to the input image, wherein the bilateral grid data comprises data formed by information of a luminance dimension arranged in a spatial dimension, the information of the luminance dimension is obtained from features extracted from the input image, the resolution of the bilateral grid data is lower than that of the input image, and the spatial dimension is a preset space or a space determined from the input image;
a guiding module, configured to perform up-sampling on the bilateral grid data with each of the plurality of guide maps as a guide condition, respectively, to obtain a plurality of feature maps;
and a fusion module, configured to fuse the plurality of feature maps to obtain an output image.
10. The apparatus according to claim 9, wherein the guiding module is specifically configured to:
performing up-sampling on the bilateral grid data with a first guide map as a guide condition to obtain an up-sampling feature, wherein the first guide map is any one of the plurality of guide maps;
and fusing the up-sampling feature and the input image to obtain a first feature map, wherein the first feature map is included in the plurality of feature maps.
11. The apparatus according to claim 10, wherein the guiding module is specifically configured to:
compressing the up-sampling feature to obtain a compressed feature, wherein the number of channels of the compressed feature is less than that of the up-sampling feature;
and performing an element-wise product of the compressed feature and the input image to obtain the first feature map.
12. The apparatus according to any one of claims 9 to 11, wherein the bilateral generation module is specifically configured to:
down-sampling the input image to obtain a down-sampled image;
extracting features from the down-sampled image to obtain down-sampled features;
and determining the bilateral grid data according to the down-sampled features.
13. The apparatus according to any one of claims 9 to 12, wherein the fusion module is specifically configured to:
stitching the plurality of feature maps to obtain a stitched image;
performing feature extraction at least once on the stitched image to obtain at least one first feature;
and fusing the at least one first feature and the input image to obtain the output image.
14. The apparatus of claim 13, wherein
the fusion module is specifically configured to stitch the plurality of feature maps and the input image to obtain the stitched image.
15. A training apparatus, comprising:
an acquisition module, configured to acquire a training set, wherein the training set comprises a plurality of image samples and a ground-truth image corresponding to each image sample, and each image sample comprises information of a plurality of channels;
and a training module, configured to perform at least one iterative training on the neural network by using the training set to obtain the trained neural network;
wherein, in any one iterative training process, the neural network extracts features from the information of a plurality of channels of an input image respectively to obtain a plurality of guide maps, the plurality of guide maps being in one-to-one correspondence with the plurality of channels; acquires bilateral grid data corresponding to the input image; performs up-sampling on the bilateral grid data with each of the plurality of guide maps as a guide condition, respectively, to obtain a plurality of feature maps; fuses the plurality of feature maps to obtain an output image; and is updated according to the output image and the ground-truth image corresponding to the input image, to obtain the neural network updated in the current iteration; the bilateral grid data comprises data formed by information of a luminance dimension arranged in a preset space, the information of the luminance dimension is obtained from features extracted from the input image, and the resolution of the bilateral grid data is lower than that of the input image.
16. The apparatus of claim 15, wherein the training module is specifically configured to:
performing up-sampling on the bilateral grid data with a first guide map as a guide condition to obtain an up-sampling feature, wherein the first guide map is any one of the plurality of guide maps;
and fusing the up-sampling feature and the input image to obtain a first feature map, wherein the first feature map is included in the plurality of feature maps.
17. An image processing apparatus, comprising a processor, wherein the processor is coupled to a memory, the memory stores a program, and when the program instructions stored in the memory are executed by the processor, the method of any one of claims 1 to 6 is implemented.
18. A training apparatus, comprising a processor, wherein the processor is coupled to a memory, the memory stores a program, and when the program instructions stored in the memory are executed by the processor, the method of claim 7 or 8 is implemented.
19. A computer readable storage medium comprising a program which, when executed by a processing unit, performs the method of any of claims 1 to 8.
20. A computer program product, characterized in that it comprises a software code for performing the method according to any one of claims 1 to 8.
CN202110293366.2A 2021-03-18 2021-03-18 Image processing method and device Pending CN113284055A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110293366.2A CN113284055A (en) 2021-03-18 2021-03-18 Image processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110293366.2A CN113284055A (en) 2021-03-18 2021-03-18 Image processing method and device

Publications (1)

Publication Number Publication Date
CN113284055A true CN113284055A (en) 2021-08-20

Family

ID=77276003

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110293366.2A Pending CN113284055A (en) 2021-03-18 2021-03-18 Image processing method and device

Country Status (1)

Country Link
CN (1) CN113284055A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113628144A (en) * 2021-08-25 2021-11-09 厦门美图之家科技有限公司 Portrait restoration method and device, electronic equipment and storage medium
CN116310276A (en) * 2023-05-24 2023-06-23 泉州装备制造研究所 Target detection method, target detection device, electronic equipment and storage medium
CN116310276B (en) * 2023-05-24 2023-08-08 泉州装备制造研究所 Target detection method, target detection device, electronic equipment and storage medium
CN117575969A (en) * 2023-10-31 2024-02-20 广州成至智能机器科技有限公司 Infrared image quality enhancement method and device, electronic equipment and storage medium
CN117575969B (en) * 2023-10-31 2024-05-07 广州成至智能机器科技有限公司 Infrared image quality enhancement method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
WO2021164731A1 (en) Image enhancement method and image enhancement apparatus
CN112308200B (en) Searching method and device for neural network
CN109993707B (en) Image denoising method and device
WO2022042713A1 (en) Deep learning training method and apparatus for use in computing device
CN111402130B (en) Data processing method and data processing device
CN112446380A (en) Image processing method and device
CN112446834A (en) Image enhancement method and device
WO2022134971A1 (en) Noise reduction model training method and related apparatus
CN113066017B (en) Image enhancement method, model training method and equipment
CN113284055A (en) Image processing method and device
CN111914997B (en) Method for training neural network, image processing method and device
CN112070664B (en) Image processing method and device
CN112581379A (en) Image enhancement method and device
CN112862828B (en) Semantic segmentation method, model training method and device
CN112257759A (en) Image processing method and device
CN113011562A (en) Model training method and device
WO2022165722A1 (en) Monocular depth estimation method, apparatus and device
CN111797882A (en) Image classification method and device
CN113673545A (en) Optical flow estimation method, related device, equipment and computer readable storage medium
CN111832592A (en) RGBD significance detection method and related device
CN111797881A (en) Image classification method and device
CN113066018A (en) Image enhancement method and related device
CN114627034A (en) Image enhancement method, training method of image enhancement model and related equipment
CN116547694A (en) Method and system for deblurring blurred images
CN115131256A (en) Image processing model, and training method and device of image processing model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination