CN115063301A - Video denoising method, video processing method and device

Video denoising method, video processing method and device

Info

Publication number
CN115063301A
CN115063301A
Authority
CN
China
Prior art keywords
image
fusion
current frame
fused
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110221439.7A
Other languages
Chinese (zh)
Inventor
Matteo Maggioni
Yibin Huang
Cheng Li
Zhongqian Fu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202110221439.7A
Publication of CN115063301A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 - Image enhancement or restoration
    • G06T5/70 - Denoising; Smoothing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G06F17/14 - Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F17/148 - Wavelet transforms
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 - Image enhancement or restoration
    • G06T5/50 - Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/90 - Determination of colour characteristics
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10016 - Video; Image sequence
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Image Processing (AREA)

Abstract

The application provides a video denoising method, a video processing method and a video processing device, which are used for improving the denoising effect while keeping denoising lightweight, obtaining a clearer image and further obtaining video data with better image quality. The method comprises the following steps: acquiring a current frame and a first fusion image in video data, wherein the first fusion image comprises information of at least one frame adjacent to the current frame in the video data according to a preset sequence; extracting features from the current frame and the first fusion image to obtain a first feature and a second feature, and then determining a first fusion weight and a second fusion weight, wherein the weight corresponding to the foreground in the current frame is not less than the weight corresponding to the foreground in the first fusion image, and the weight corresponding to the background in the current frame is not more than the weight corresponding to the background in the first fusion image; fusing the current frame and the first fusion image according to the first fusion weight and the second fusion weight to obtain a second fused image; and denoising the second fused image to obtain a denoised image.

Description

Video denoising method, video processing method and device
Technical Field
The present application relates to the field of image processing, and in particular, to a video denoising method, a video processing method, and a video processing apparatus.
Background
Video denoising in computational photography is crucial to video imaging quality. Photographing and imaging on a terminal device are limited by the hardware performance of its optical sensor, and owing to imperfections of the acquisition process, the formation of a digital image is always affected by different forms of noise and degradation, so an image restoration algorithm is required to restore the degraded input into a high-quality output. Image restoration algorithms include, for example, denoising, demosaicing and super-resolution.
For video denoising, the spatio-temporal fusion of video information can be processed in a recurrent, recursive manner, but the denoising quality obtained in this way is low. Therefore, how to improve video denoising quality has become an urgent problem to be solved.
Disclosure of Invention
The application provides a video denoising method, a video processing method and a video processing device, which are used for improving the denoising effect while keeping denoising lightweight, obtaining a clearer image and further obtaining video data with better image quality.
In view of the above, in a first aspect, the present application provides a video denoising method, including: firstly, acquiring a current frame and a first fusion image, wherein the current frame is any frame image behind a first frame arranged in the video data according to a preset sequence, namely the current frame is a non-first frame in the video data, and the first fusion image comprises information of at least one frame adjacent to the current frame in the video data according to the preset sequence; extracting features from the current frame to obtain first features, and extracting features from the first fusion image to obtain second features; then, determining to obtain a first fusion weight and a second fusion weight according to the first characteristic and the second characteristic, wherein the first fusion weight comprises a weight corresponding to the current frame, the second fusion weight comprises a weight corresponding to the first fusion image, the first fusion weight and the second fusion weight can be set according to the foreground and the background, the weight corresponding to the foreground in the current frame is not less than the weight corresponding to the foreground in the first fusion image, and the weight corresponding to the background in the current frame is not more than the weight corresponding to the background in the first fusion image; fusing the current frame and the first fused image according to the first fusion weight and the second fusion weight to obtain a second fused image; and denoising the second fusion image to obtain a denoised image.
Therefore, in the embodiment of the present application, fusing the current frame and the first fused image is equivalent to combining the temporally correlated information between adjacent frames of the scene, smoothing the noise in the current frame, reducing the noise in the image, and obtaining a second fused image with less noise. In other words, the temporal correlation between images of the video is used to smooth the noise in each frame, improving the denoising effect for each frame in the video data while avoiding ghosting, thereby obtaining video data with better image quality and improving the user experience.
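As a minimal sketch only (not the patent's implementation), the weighted temporal fusion described above can be expressed as a per-pixel blend; the function and variable names below are assumptions introduced for illustration.

```python
import numpy as np

def temporal_fusion(current_frame, first_fused, w_current, w_fused, eps=1e-8):
    """Blend the current frame with the running fused image.

    w_current / w_fused are per-pixel weight maps derived from the extracted
    features: in foreground (moving) regions w_current should dominate to
    avoid ghosting, while in background regions w_fused should dominate so
    that temporal information accumulates over frames.
    """
    total = w_current + w_fused + eps  # normalise so the weights sum to 1
    return (w_current / total) * current_frame + (w_fused / total) * first_fused

# Toy usage: a mostly static pixel trusts the accumulated fused image more.
frame = np.random.rand(4, 4).astype(np.float32)
fused = np.random.rand(4, 4).astype(np.float32)
out = temporal_fusion(frame, fused,
                      w_current=np.full((4, 4), 0.2, dtype=np.float32),
                      w_fused=np.full((4, 4), 0.8, dtype=np.float32))
```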
In one possible embodiment, after denoising the second fused image, the method further includes: fusing the second fused image and the denoised image to obtain an updated denoised image.
When the image is denoised, over-smoothing may occur; at this time the second fused image contains more details than the denoised image, so fusing the details contained in the second fused image with the denoised image yields an image with richer details and improves the image quality.
In one possible embodiment, fusing the second fused image and the denoised image comprises: extracting features from the second fused image to obtain a third feature; extracting features from the denoised image to obtain a fourth feature; determining a third fusion weight and a fourth fusion weight according to the third feature and the fourth feature, wherein the third fusion weight is a weight corresponding to the third feature, the fourth fusion weight is a weight corresponding to the fourth feature, and, in the third feature and the fourth feature, the frequency of each pixel point and the corresponding weight value are negatively correlated; and fusing the third feature and the fourth feature according to the third fusion weight and the fourth fusion weight to obtain an updated denoised image.
In this embodiment, the frequencies of the pixel points and the corresponding weight values are in a negative correlation relationship, that is, the higher the frequency value of the pixel point is, the lower the weight corresponding to the pixel point is, so as to reduce the noise carried in the high-frequency component, and obtain the denoised image.
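A sketch of this refinement fusion is given below, under the simplifying assumptions that the per-pixel "frequency" is approximated by a local gradient magnitude and that only the weight of the second fused image is modulated; the patent derives the weights from the third and fourth features, so both the approach and the names here are illustrative.

```python
import numpy as np

def refine(second_fused, denoised, alpha=1.0):
    """Fuse the detail-rich second fused image with the (possibly
    over-smoothed) denoised image. Pixels with higher frequency content
    get a lower weight, so high-frequency noise is suppressed.
    Assumes single-channel 2-D float arrays."""
    gy, gx = np.gradient(second_fused)     # crude local-frequency proxy
    freq = np.abs(gx) + np.abs(gy)
    w_detail = 1.0 / (1.0 + alpha * freq)  # weight negatively correlated with frequency
    return w_detail * second_fused + (1.0 - w_detail) * denoised
```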
In a possible implementation, before extracting the feature from the current frame and extracting the feature from the first fused image, the method may further include: and performing color transformation on the current frame through a color transformation matrix to obtain a first chrominance component and a first luminance component, wherein the first chrominance component and the first luminance component form a new current frame, performing color transformation on the first fused image to obtain a second chrominance component and a second luminance component, the second chrominance component and the second luminance component form a new first fused image, and the color transformation matrix is a preset matrix or is obtained by training at least one convolution kernel.
The aforementioned extracting features from the current frame to obtain the first features, and extracting features from the first fused image to obtain the second features may include: extracting features from the new current frame to obtain first features, and extracting features from the new first fused image to obtain second features.
Therefore, in the embodiment of the application, color decorrelation processing is also performed on the current frame, which is equivalent to a color decorrelation operation, so that the correlation among the color channels is reduced, the complexity of subsequent denoising is reduced, the denoising efficiency and effect are improved, and an image with better image quality is obtained.
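One way such a colour transform can look is sketched below; the BT.601-style matrix values are only an example (the patent allows the matrix to be preset or learned, e.g. as 1x1 convolution kernels), and the function names are illustrative. The inverse transform mentioned next is simply the inverse of the same matrix.

```python
import numpy as np

# Example RGB -> (luma, chroma, chroma) matrix (BT.601-like values);
# the patent allows this matrix to be preset or obtained by training.
COLOR_T = np.array([[ 0.299,  0.587,  0.114],   # luminance
                    [-0.169, -0.331,  0.500],   # chrominance 1
                    [ 0.500, -0.419, -0.081]])  # chrominance 2

def color_decorrelate(rgb):
    """rgb: H x W x 3 array -> H x W x 3 array of (Y, C1, C2)."""
    return np.einsum('hwc,dc->hwd', rgb, COLOR_T)

def color_restore(ycc):
    """Inverse colour transform using the inverse of the same matrix."""
    return np.einsum('hwc,dc->hwd', ycc, np.linalg.inv(COLOR_T))
```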
In a possible implementation, after denoising the second fused image, the method may further include: and carrying out color transformation on the denoised image through an inverse color transformation matrix to obtain an updated denoised image, wherein the inverse color transformation matrix is the inverse matrix of the color transformation matrix.
In the embodiment of the application, since a color decorrelation transform is applied before denoising, an inverse color transform may be applied afterwards to recover the colors in the image and obtain a color image.
In one possible implementation, before extracting the features from the current frame and extracting the features from the first fused image, the method further comprises: the method comprises the steps of performing wavelet transformation on a current frame by using wavelet coefficients to obtain a first low-frequency component and a first high-frequency component, forming a new current frame by the first low-frequency component and the first high-frequency component, performing wavelet transformation on a first fusion image to obtain a second low-frequency component and a second high-frequency component, forming a new first fusion image by the second low-frequency component and the second high-frequency component, and obtaining the wavelet coefficients by training at least one convolution kernel.
The wavelet transform can be understood as a decorrelation transform of the image pixels in the frequency dimension: in the frequency representation, the useful part of the image and the noise are generally separated into different frequency components, which makes denoising simpler. This is equivalent to spreading the pixel frequencies over a discrete distribution, so that a better denoising effect can be achieved subsequently.
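For illustration, a fixed one-level 2-D Haar transform (a minimal sketch; the patent's wavelet coefficients may instead be preset or learned convolution kernels) splits an image into one low-frequency and three high-frequency components and can be inverted exactly, which also covers the inverse wavelet step discussed next.

```python
import numpy as np

def haar_dwt2(img):
    """One level of a 2-D Haar wavelet transform on an even-sized,
    single-channel image: returns (LL, (LH, HL, HH))."""
    a = img[0::2, 0::2].astype(np.float32)
    b = img[0::2, 1::2].astype(np.float32)
    c = img[1::2, 0::2].astype(np.float32)
    d = img[1::2, 1::2].astype(np.float32)
    ll = (a + b + c + d) / 2.0   # low-frequency approximation
    lh = (a - b + c - d) / 2.0   # horizontal detail
    hl = (a + b - c - d) / 2.0   # vertical detail
    hh = (a - b - c + d) / 2.0   # diagonal detail
    return ll, (lh, hl, hh)

def haar_idwt2(ll, details):
    """Inverse of haar_dwt2; recovers the original image exactly."""
    lh, hl, hh = details
    out = np.empty((ll.shape[0] * 2, ll.shape[1] * 2), dtype=np.float32)
    out[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    out[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    out[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    out[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return out
```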
In a possible implementation, after denoising the second fused image, the method may further include: performing an inverse wavelet transform on the denoised image through inverse wavelet coefficients to obtain an updated denoised image, wherein the inverse wavelet coefficients are the inverse matrix of the wavelet coefficients. In this way, the inverse wavelet transform can accurately recover the high-frequency and low-frequency components, and a clearer denoised image is obtained.
In a possible implementation, determining the first fusion weight corresponding to the current frame according to the first feature and the second feature may include: calculating shot noise and read noise according to shooting parameters of the equipment used for shooting the video data; and determining the first fusion weight corresponding to the current frame by combining the shot noise and read noise with the first feature and the second feature.
Since the noise level is taken into account when calculating the first fusion weight, the method adapts to scenes with different noise levels, can denoise accurately under different noise levels, and has strong generalization capability.
In a possible implementation manner, determining the first fusion weight corresponding to the current frame by combining the shot noise and read noise with the first feature and the second feature may specifically include: calculating the noise variance of each pixel point in the current frame according to the shot noise and the read noise; extracting a fifth feature from the noise variance; and determining the first fusion weight corresponding to the current frame by combining the fifth feature, the first feature and the second feature. The noise variance can be used to accurately determine the noise level of the current frame, which facilitates accurate denoising subsequently and improves the denoising effect.
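A common Poisson-Gaussian sensor model gives one plausible reading of this per-pixel noise variance; the sketch below is an assumption, and the parameter values and the calibration routine named in the comment are placeholders rather than anything specified by the patent.

```python
import numpy as np

def noise_variance_map(frame, shot_noise, read_noise):
    """Per-pixel noise variance under a Poisson-Gaussian model:
    shot noise scales with the signal, read noise is constant."""
    return shot_noise * np.clip(frame, 0.0, None) + read_noise

# shot_noise / read_noise are typically derived from the capture parameters
# (e.g. analog/digital gain); the call below is hypothetical:
# shot_noise, read_noise = calibrate_from_gain(iso=3200)
sigma2 = noise_variance_map(np.random.rand(4, 4).astype(np.float32),
                            shot_noise=2.5e-3, read_noise=1.0e-4)
```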
In one possible implementation, denoising the second fused image may include: and denoising the second fusion image by combining the first fusion weight, the second fusion weight, the first fusion image and the current frame to obtain a denoised image.
Therefore, in the embodiment of the application, the second fusion image can be denoised by combining the first fusion weight, the second fusion weight, the first fusion image and the current frame, which is equivalent to providing more reference information for denoising, so that the denoising effect is improved, the calculated data can be repeatedly utilized, and the effective utilization rate of the data generated in each step is improved.
In a possible implementation manner, denoising the second fused image by combining the first fusion weight, the second fusion weight, the first fused image and the current frame specifically includes: calculating the variance of each pixel point in the second fused image by combining the first fusion weight, the second fusion weight, the first fused image and the current frame to obtain a fused-image variance; and then using the fused-image variance and the second fused image as inputs of a denoising model and outputting the denoised image.
In the process of denoising the video data, the fused-image variance decreases as the number of denoised frames increases, that is, the noise in the fused image becomes smaller and smaller, so the finally obtained denoised image carries less and less noise, and an image with less noise and better image quality is obtained.
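If the current frame and the previous fused image are assumed to be independent, the fused-image variance can be propagated as sketched below; the exact formula is not fixed by the patent, so this is only an assumption, and the commented model call is hypothetical.

```python
def propagate_fused_variance(w_current, w_fused, var_current, var_fused_prev):
    """Variance of w_current*x + w_fused*y for (assumed) independent x, y.

    Repeating this recursion frame after frame shrinks the variance, which
    matches the observation that the fused image gets cleaner as more
    frames are accumulated.
    """
    return w_current ** 2 * var_current + w_fused ** 2 * var_fused_prev

# The denoising model would then receive the fused image together with this
# variance map (and optionally the current frame), e.g.:
# denoised = denoise_model(current_frame, second_fused, var_fused)  # hypothetical
```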
In a possible implementation, the above-mentioned using the variance of the fused image and the second fused image as the input of the denoising model may include: and taking the current frame, the fusion image variance and the second fusion image as the input of a denoising model, and outputting a denoising image, wherein the denoising model is used for removing noise in the input image.
In the embodiment, the fusion image variance can be used as the input of the denoising model, so that the denoising model can determine the noise level through the fusion image variance, a better denoising effect is realized, and a denoising image with better image quality is obtained.
In a possible implementation, the method described above may further include: at least one down-sampling is carried out on the current frame and the first fused image to obtain at least one down-sampled frame and at least one down-sampled fused image, the scales of the images obtained by down-sampling at each time are different, at least one first down-sampled image is obtained by carrying out at least one down-sampling on the current frame, and at least one second down-sampled image is obtained by carrying out at least one down-sampling on the first fused image; denoising the at least one down-sampling frame and the at least one down-sampling fusion image to obtain a multi-scale fusion image; and fusing the denoised image and the multi-scale fusion image to obtain an updated denoised image.
In the embodiment of the application, the current frame and the first fused image can be sampled down once or for multiple times, and the down-sampled frames and the down-sampled fused images with different scales are subjected to iteration processing, so that the denoising processing is performed under different scales, and the denoising effect of the finally obtained denoising image is better.
In a possible embodiment, any one of the denoising processes, i.e., the denoising process at one scale, in the process of denoising the at least one downsampled frame and the at least one downsampled fused image may include: determining the weight corresponding to a first downsampled frame according to the features extracted from the first downsampled frame and the features extracted from a first downsampled fused image to obtain a fifth fusion weight, and determining the weight corresponding to the first downsampled fused image to obtain a sixth fusion weight, wherein the first downsampled frame is any one of the at least one downsampled frame, and the first downsampled fused image is the one of the at least one downsampled fused image with the same scale as the first downsampled frame; determining the weight corresponding to a second downsampled fused image according to the features extracted from the second downsampled fused image to obtain a seventh fusion weight, wherein the second downsampled fused image is fused with information of at least one frame of image with a smaller scale than the first downsampled frame; fusing the first downsampled frame, the first downsampled fused image and the second downsampled fused image according to the fifth fusion weight, the sixth fusion weight and the seventh fusion weight to obtain a third downsampled fused image, wherein the upsampled image of the third downsampled fused image is used for being fused with an image, among the at least one downsampled frame, whose scale is larger than that of the first downsampled frame; denoising the third downsampled fused image to obtain a first downsampled denoised image; and upsampling the first downsampled denoised image to obtain an upsampled denoised image, wherein the upsampled image is used for denoising, in combination with the fused image of the same scale as the upsampled image, to obtain an image with a scale larger than that of the first downsampled frame.
Therefore, in the embodiment of the application, in the process of denoising each scale, the processing results of other scales can be fused for iterative denoising, so that the denoising effect can be improved, and an image with less noise can be obtained.
In a possible implementation, the downsampling the current frame and the first fused image at least once as described above may include: and performing wavelet transformation on the current frame and the first fused image at least once to obtain at least one downsampling frame and at least one downsampling fused image.
Therefore, in the embodiment of the application, downsampling can be performed in a wavelet transform mode, and meanwhile, the frequency of the pixel points can be distributed in a frequency space in a discrete mode, so that the denoising difficulty is reduced, and the denoising effect is improved.
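A rough sketch of building such a multi-scale input pyramid with repeated wavelet-style (Haar LL) down-sampling is shown below; the per-scale fusion and denoising steps themselves are omitted, and all names are illustrative.

```python
import numpy as np

def ll_downsample(img):
    """Low-frequency (LL) band of one Haar level, i.e. a 2x down-sampling
    of an even-sized, single-channel image."""
    img = img.astype(np.float32)
    return (img[0::2, 0::2] + img[0::2, 1::2] +
            img[1::2, 0::2] + img[1::2, 1::2]) / 2.0

def build_pyramid(current_frame, first_fused, levels=2):
    """Down-sample the current frame and the fused image 'levels' times.

    Each scale would then be fused/denoised, and its result up-sampled and
    merged into the next larger scale (coarse-to-fine), as described above.
    """
    frames, fused = [current_frame], [first_fused]
    for _ in range(levels):
        frames.append(ll_downsample(frames[-1]))
        fused.append(ll_downsample(fused[-1]))
    return frames, fused
```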
In a second aspect, the present application provides a video denoising apparatus, which may include:
the device comprises an acquisition module and a fusion module, wherein the acquisition module is used for acquiring a current frame and a first fusion image, the current frame is any frame image behind a first frame arranged in the video data according to a preset sequence, and the first fusion image comprises information of at least one frame adjacent to the current frame in the video data according to the preset sequence;
the time domain fusion module is used for extracting features from the current frame to obtain first features and extracting features from the first fusion image to obtain second features;
the time domain fusion module is further used for determining and obtaining a first fusion weight and a second fusion weight according to the first characteristic and the second characteristic, the first fusion weight comprises a weight corresponding to a current frame, the second fusion weight comprises a weight corresponding to a first fusion image, the weight corresponding to a foreground in the current frame is not less than the weight corresponding to the foreground in the first fusion image, and the weight corresponding to a background in the current frame is not more than the weight corresponding to the background in the first fusion image;
the time domain fusion module is also used for fusing the current frame and the first fusion image according to the first fusion weight and the second fusion weight to obtain a second fusion image;
and the denoising module is used for denoising the second fusion image to obtain a denoised image.
In a possible implementation, the video denoising apparatus may further include:
and the refinement module is used for fusing the two fused images and the denoised image after denoising the second fused image to obtain an updated denoised image.
In a possible implementation, the refinement module is specifically configured to: extract features from the second fused image to obtain a third feature; extract features from the denoised image to obtain a fourth feature; determine a third fusion weight and a fourth fusion weight according to the third feature and the fourth feature, wherein the third fusion weight is a weight corresponding to the third feature, the fourth fusion weight is a weight corresponding to the fourth feature, and, in the third feature and the fourth feature, the frequency of each pixel point and the corresponding weight value are negatively correlated; and fuse the third feature and the fourth feature according to the third fusion weight and the fourth fusion weight to obtain an updated denoised image.
In a possible implementation, the video denoising apparatus may further include: the color-removing correlation transformation module is used for carrying out color transformation on the current frame through a color transformation matrix before extracting features from the current frame and extracting the features from the first fusion image to obtain a first chrominance component and a first luminance component, the first chrominance component and the first luminance component form a new current frame, and carrying out color transformation on the first fusion image to obtain a second chrominance component and a second luminance component, the second chrominance component and the second luminance component form a new first fusion image, and the color transformation matrix is a preset matrix or is obtained by training at least one convolution kernel;
and the time domain fusion module is specifically used for extracting features from the new current frame to obtain first features, and extracting features from the new first fusion image to obtain second features.
In a possible implementation, the video denoising apparatus may further include:
and the inverse color correlation transformation module is used for carrying out color transformation on the denoised image through an inverse color transformation matrix after denoising the second fused image to obtain an updated denoised image, wherein the inverse color transformation matrix is an inverse matrix of the color transformation matrix.
In a possible implementation, the video denoising apparatus may further include:
and the de-frequency correlation transformation module is used for performing wavelet transformation on the current frame by using a wavelet coefficient before the time domain fusion module extracts the characteristics from the current frame and the first fusion image to obtain a first low-frequency component and a first high-frequency component, the first low-frequency component and the first high-frequency component form a new current frame, and performing wavelet transformation on the first fusion image to obtain a second low-frequency component and a second high-frequency component, the second low-frequency component and the second high-frequency component form a new first fusion image, and the wavelet coefficient is a preset coefficient or is obtained by training at least one convolution kernel.
In a possible implementation, the video denoising apparatus may further include:
and the inverse frequency correlation transformation module is used for carrying out inverse wavelet transformation on the denoised image through an inverse wavelet coefficient after the denoising module denoises the second fused image to obtain an updated denoised image, wherein the inverse wavelet coefficient is an inverse matrix of the wavelet coefficient.
In a possible implementation manner, the time domain fusion module is specifically configured to: calculate shot noise and read noise according to shooting parameters of the equipment used for shooting the video data; and determine the first fusion weight corresponding to the current frame by combining the shot noise and read noise with the first feature and the second feature.
In a possible implementation manner, the time domain fusion module is specifically configured to: calculate the noise variance of each pixel point in the current frame according to the shot noise and the read noise; extract a fifth feature from the noise variance; and determine the first fusion weight corresponding to the current frame by combining the fifth feature, the first feature and the second feature.
In a possible implementation, the denoising module is specifically configured to: and denoising the second fusion image by combining the first fusion weight, the second fusion weight, the first fusion image and the current frame to obtain the denoised image.
In a possible implementation, the denoising module is specifically configured to: calculating the variance of each pixel point in the second fusion image by combining the first fusion weight, the second fusion weight, the first fusion image and the current frame to obtain a fusion image variance; and taking the variance of the fusion image and the second fusion image as the input of the denoising model, and outputting the denoising image.
In a possible implementation manner, the denoising module is specifically configured to take the current frame, the variance of the fusion image, and the second fusion image as inputs of a denoising model, and output the denoised image, where the denoising model is used to remove noise in the input image.
In a possible implementation, the video denoising apparatus may further include: a down-sampling module;
the down-sampling module is used for performing down-sampling on the current frame and the first fused image for at least one time to obtain at least one down-sampled frame and at least one down-sampled fused image, the scales of the images obtained by down-sampling for each time are different, at least one first down-sampled image is obtained by performing down-sampling for the current frame for at least one time, and at least one second down-sampled image is obtained by performing down-sampling for the first fused image for at least one time;
the denoising module is also used for denoising the at least one down-sampling frame and the at least one down-sampling fusion image to obtain a multi-scale fusion image;
and the denoising module is also used for fusing the denoised image and the multi-scale fusion image to obtain an updated denoised image.
In a possible implementation manner, any one of the denoising processes performed by the denoising module on the at least one downsampled frame and the at least one downsampled fused image may include:
determining the weight corresponding to a first downsampled frame according to the features extracted from the first downsampled frame and the features extracted from a first downsampled fused image to obtain a fifth fusion weight, and determining the weight corresponding to the first downsampled fused image to obtain a sixth fusion weight, wherein the first downsampled frame is any one of the at least one downsampled frame, and the first downsampled fused image is the one of the at least one downsampled fused image with the same scale as the first downsampled frame; determining the weight corresponding to a second downsampled fused image according to the features extracted from the second downsampled fused image to obtain a seventh fusion weight, wherein the second downsampled fused image is fused with information of at least one frame of image with a smaller scale than the first downsampled frame; fusing the first downsampled frame, the first downsampled fused image and the second downsampled fused image according to the fifth fusion weight, the sixth fusion weight and the seventh fusion weight to obtain a third downsampled fused image, wherein the upsampled image of the third downsampled fused image is used for being fused with an image, among the at least one downsampled frame, whose scale is larger than that of the first downsampled frame; denoising the third downsampled fused image to obtain a first downsampled denoised image; and upsampling the first downsampled denoised image to obtain an upsampled denoised image, wherein the upsampled image is used for denoising, in combination with the fused image of the same scale as the upsampled image, to obtain an image with a scale larger than that of the first downsampled frame.
In a possible embodiment, the down-sampling module is specifically configured to perform at least one wavelet transform on the current frame and the first fused image to obtain at least one down-sampled frame and at least one down-sampled fused image.
In a third aspect, an embodiment of the present application provides a video denoising device, including: a processor and a memory, wherein the processor and the memory are interconnected by a line, and the processor calls the program code in the memory to execute the processing-related functions of the video denoising method according to any one of the first aspect. Alternatively, the video denoising apparatus may be a chip.
In a fourth aspect, an embodiment of the present application provides a video denoising apparatus, which may also be referred to as a digital processing chip or a chip, where the chip includes a processing unit and a communication interface, the processing unit obtains program instructions through the communication interface, and the program instructions are executed by the processing unit, and the processing unit is configured to execute functions related to processing in the foregoing first aspect or any one of the optional implementations of the first aspect.
In a fifth aspect, the present application provides a video processing method, including: firstly, acquiring a current frame and a first fusion image, wherein the current frame is any frame image behind a first frame arranged in a preset sequence in video data, and the first fusion image comprises information of at least one frame adjacent to the current frame in the video data according to the preset sequence; extracting features from the current frame to obtain first features, and extracting features from the first fusion image to obtain second features; then determining to obtain a first fusion weight and a second fusion weight according to the first characteristic and the second characteristic, wherein the first fusion weight comprises the weight corresponding to the current frame, the second fusion weight comprises the weight corresponding to the first fusion image, the weight corresponding to the foreground in the current frame is not less than the weight corresponding to the foreground in the first fusion image, and the weight corresponding to the background in the current frame is not more than the weight corresponding to the background in the first fusion image; and fusing the current frame and the first fused image according to the first fusion weight and the second fusion weight to obtain a second fused image.
Therefore, in the embodiment of the present application, when fusing the current frame and the first fused image, the foreground portion in the current frame and the background portion in the first fused image may be relied upon, so that ghosting in the second fused image is reduced, and the noise contained in the current frame is reduced through the image fusion, so as to obtain a clearer image.
In a possible implementation, before extracting the features from the current frame and the features from the first fused image, the current frame may be further color-transformed by a color transformation matrix to obtain a first chrominance component and a first luminance component, where the first chrominance component and the first luminance component form a new current frame, and the first fused image is color-transformed to obtain a second chrominance component and a second luminance component, where the second chrominance component and the second luminance component form a new first fused image, and the color transformation matrix is a preset matrix or is obtained by training at least one convolution kernel; the extracting features from the current frame to obtain the first features and extracting features from the first fused image to obtain the second features as described above may include: extracting features from the new current frame to obtain first features, and extracting features from the new first fused image to obtain second features.
Therefore, in the embodiment of the application, color decorrelation processing is also performed on the current frame, which is equivalent to a color decorrelation operation, so that the correlation among the color channels is reduced; if denoising is required, the complexity of subsequent denoising can be reduced, the denoising efficiency and effect are improved, and an image with better image quality is obtained.
In a possible embodiment, the method may further include: and performing color transformation on the second fused image through an inverse color transformation matrix to obtain an updated second fused image, wherein the inverse color transformation matrix is an inverse matrix of the color transformation matrix.
In the embodiment of the present application, after the color decorrelation conversion is performed, inverse color conversion may be performed to restore the colors in the image, thereby obtaining a color image.
In one possible implementation, before extracting the features from the current frame and extracting the features from the first fused image, the method further includes: the method comprises the steps of performing wavelet transformation on a current frame by using wavelet coefficients to obtain a first low-frequency component and a first high-frequency component, forming a new current frame by the first low-frequency component and the first high-frequency component, performing wavelet transformation on a first fusion image to obtain a second low-frequency component and a second high-frequency component, forming a new first fusion image by the second low-frequency component and the second high-frequency component, and obtaining the wavelet coefficients by training at least one convolution kernel.
The wavelet transform can be understood as a decorrelation transform of the image pixels in the frequency dimension: in the frequency representation, the useful part of the image and the noise are generally separated into different frequency components, so that if denoising is needed subsequently it becomes simpler. This is equivalent to spreading the pixel frequencies over a discrete distribution, so that a better denoising effect can be achieved subsequently.
In a possible embodiment, the method may further include: performing an inverse wavelet transform on the second fused image through inverse wavelet coefficients to obtain an updated second fused image, wherein the inverse wavelet coefficients are the inverse matrix of the wavelet coefficients. In this way, the inverse wavelet transform can accurately recover the high-frequency and low-frequency components to obtain a clearer image.
In a possible implementation, determining the first fusion weight corresponding to the current frame according to the first feature and the second feature includes: calculating shot noise and read noise according to shooting parameters of the equipment used for shooting the video data; and determining the first fusion weight corresponding to the current frame by combining the shot noise and read noise with the first feature and the second feature.
Since the noise level is taken into account when calculating the first fusion weight, the method adapts to scenes with different noise levels, can denoise accurately under different noise levels, and has strong generalization capability.
In a possible implementation, determining the first fusion weight corresponding to the current frame by combining the shot noise and read noise with the first feature and the second feature may include: calculating the noise variance of each pixel point in the current frame according to the shot noise and the read noise; extracting a fifth feature from the noise variance; and determining the first fusion weight corresponding to the current frame by combining the fifth feature, the first feature and the second feature. The noise variance can be used to accurately determine the noise level of the current frame; if denoising is needed, accurate denoising can then be performed conveniently, so that the denoising effect is improved.
In a possible embodiment, the method may further include: at least one down-sampling is carried out on the current frame and the first fused image to obtain at least one down-sampled frame and at least one down-sampled fused image, the scales of the images obtained by down-sampling at each time are different, at least one first down-sampled image is obtained by carrying out at least one down-sampling on the current frame, and at least one second down-sampled image is obtained by carrying out at least one down-sampling on the first fused image; fusing each frame of at least one down-sampling frame with a down-sampling fusion image with the same scale to obtain a multi-scale fusion image; and fusing the second fused image and the multi-scale fused image to obtain an updated second fused image.
In the embodiment of the application, the current frame and the first fusion image can be downsampled once or for multiple times, and the downsampled frame and the downsampled fusion image with different scales are subjected to iteration processing, so that a clearer image can be obtained through fusion, and noise in the image is reduced.
In one possible implementation, fusing any one of the at least one downsampled frame with the downsampled fused image of the same scale may include: determining the weight corresponding to a first downsampled frame according to the features extracted from the first downsampled frame and the features extracted from a first downsampled fused image to obtain a fifth fusion weight, and determining the weight corresponding to the first downsampled fused image to obtain a sixth fusion weight, wherein the first downsampled frame is any one of the at least one downsampled frame, and the first downsampled fused image is the one of the at least one downsampled fused image with the same scale as the first downsampled frame; determining the weight corresponding to a second downsampled fused image according to the features extracted from the second downsampled fused image to obtain a seventh fusion weight, wherein the second downsampled fused image is fused with information of at least one frame of image with a smaller scale than the first downsampled frame; and fusing the first downsampled frame, the first downsampled fused image and the second downsampled fused image according to the fifth fusion weight, the sixth fusion weight and the seventh fusion weight to obtain a third downsampled fused image, wherein the upsampled image of the third downsampled fused image is used for being fused with an image, among the at least one downsampled frame, whose scale is larger than that of the first downsampled frame.
Therefore, in the embodiment of the application, in the process of fusing each scale, processing results of other scales can be fused for iterative fusion, so that the quality of the image obtained by fusion is improved, and an image with less noise is obtained.
In one possible embodiment, the downsampling the current frame and the first fused image at least once comprises: and performing wavelet transformation on the current frame and the first fused image at least once to obtain at least one downsampled frame and at least one downsampled fused image. Therefore, in the embodiment of the application, downsampling can be performed in a wavelet transform mode, and meanwhile, the frequency of the pixel points can be distributed in a frequency space in a discrete mode, so that the denoising effect is achieved.
In a sixth aspect, the present application provides a video processing apparatus comprising:
the device comprises an acquisition module and a fusion module, wherein the acquisition module is used for acquiring a current frame and a first fusion image, the current frame is any frame image behind a first frame arranged in the video data according to a preset sequence, and the first fusion image comprises information of at least one frame adjacent to the current frame in the video data according to the preset sequence;
the time domain fusion module is used for extracting features from the current frame to obtain first features and extracting features from the first fusion image to obtain second features;
the time domain fusion module is further used for determining and obtaining a first fusion weight and a second fusion weight according to the first characteristic and the second characteristic, the first fusion weight comprises a weight corresponding to a current frame, the second fusion weight comprises a weight corresponding to a first fusion image, the weight corresponding to a foreground in the current frame is not less than the weight corresponding to the foreground in the first fusion image, and the weight corresponding to a background in the current frame is not more than the weight corresponding to the background in the first fusion image;
and the time domain fusion module is also used for fusing the current frame and the first fusion image according to the first fusion weight and the second fusion weight to obtain a second fusion image.
In a possible implementation, the apparatus may further include:
the color-related transformation removing module is used for performing color transformation on the current frame through a color transformation matrix before the time-domain fusion module extracts the features from the current frame and the features from the first fusion image to obtain a first chrominance component and a first luminance component, the first chrominance component and the first luminance component form a new current frame, and performing color transformation on the first fusion image to obtain a second chrominance component and a second luminance component, the second chrominance component and the second luminance component form a new first fusion image, and the color transformation matrix is a preset matrix or is obtained by training at least one convolution kernel;
and the time domain fusion module is specifically used for extracting features from the new current frame to obtain first features, and extracting features from the new first fusion image to obtain second features.
In a possible embodiment, the apparatus may further include: and the inverse color correlation transformation module is used for carrying out color transformation on the second fusion image through an inverse color transformation matrix to obtain an updated second fusion image, and the inverse color transformation matrix is the inverse matrix of the color transformation matrix.
In a possible implementation, the apparatus may further include: and the de-frequency correlation transformation module is used for performing wavelet transformation on the current frame by using a wavelet coefficient before the time domain fusion module extracts the features from the current frame and extracts the features from the first fusion image to obtain a first low-frequency component and a first high-frequency component, the first low-frequency component and the first high-frequency component form a new current frame, and performing wavelet transformation on the first fusion image to obtain a second low-frequency component and a second high-frequency component, the second low-frequency component and the second high-frequency component form a new first fusion image, and the wavelet coefficient is a preset coefficient or is obtained by training at least one convolution kernel.
In a possible implementation, the apparatus may further include: and the inverse frequency correlation transformation module is used for performing inverse wavelet transformation on the second fusion image through an inverse wavelet coefficient to obtain an updated second fusion image, and the inverse wavelet coefficient is an inverse matrix of the wavelet coefficient.
In a possible implementation manner, the time domain fusion module is specifically configured to: calculating shot noise and read noise according to shooting parameters of equipment used for shooting video data; and determining a first fusion weight corresponding to the current frame by combining shot noise, reading noise, the first characteristic and the second characteristic.
In a possible implementation manner, the time domain fusion module is specifically configured to: calculating the noise variance of each pixel point in the current frame according to the shot noise and the read noise; extracting a fifth feature from the noise variance; in combination with the fifth feature, the first feature and the second feature determine a first fusion weight corresponding to the current frame.
In a possible implementation, the apparatus may further include: the down-sampling module is used for performing down-sampling on the current frame and the first fused image for at least one time to obtain at least one down-sampled frame and at least one down-sampled fused image, the scales of the images obtained by down-sampling for each time are different, at least one first down-sampled image is obtained by performing down-sampling for the current frame for at least one time, and at least one second down-sampled image is obtained by performing down-sampling for the first fused image for at least one time;
the time domain fusion module is also used for fusing each frame of the at least one down-sampling frame with the down-sampling fusion image with the same scale to obtain a multi-scale fusion image;
and the time domain fusion module is also used for fusing the second fusion image and the multi-scale fusion image to obtain an updated second fusion image.
In one possible embodiment, fusing, by the temporal fusion module, any one of the at least one downsampled frame with the downsampled fused image of the same scale may include: determining the weight corresponding to a first downsampled frame according to the features extracted from the first downsampled frame and the features extracted from a first downsampled fused image to obtain a fifth fusion weight, and determining the weight corresponding to the first downsampled fused image to obtain a sixth fusion weight, wherein the first downsampled frame is any one of the at least one downsampled frame, and the first downsampled fused image is the one of the at least one downsampled fused image with the same scale as the first downsampled frame; determining the weight corresponding to a second downsampled fused image according to the features extracted from the second downsampled fused image to obtain a seventh fusion weight, wherein the second downsampled fused image is fused with information of at least one frame of image with a smaller scale than the first downsampled frame; and fusing the first downsampled frame, the first downsampled fused image and the second downsampled fused image according to the fifth fusion weight, the sixth fusion weight and the seventh fusion weight to obtain a third downsampled fused image, wherein the upsampled image of the third downsampled fused image is used for being fused with an image, among the at least one downsampled frame, whose scale is larger than that of the first downsampled frame.
In a possible embodiment, the down-sampling module is specifically configured to perform at least one wavelet transform on the current frame and the first fused image to obtain at least one down-sampled frame and at least one down-sampled fused image.
In a seventh aspect, an embodiment of the present application provides a computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the method in any optional implementation manner of the first aspect or the fifth aspect.
In an eighth aspect, embodiments of the present application provide a computer program product containing instructions which, when run on a computer, cause the computer to perform the method in any of the optional embodiments of the first or fifth aspects.
Drawings
FIG. 1 is a schematic diagram of an artificial intelligence body framework for use in the present application;
FIG. 2 is a schematic diagram of a convolutional neural network structure provided in the present application;
FIG. 3 is a schematic diagram of another convolutional neural network structure provided in the present application;
FIG. 4 is a schematic diagram of an application scenario provided in the present application;
FIG. 5 is a schematic diagram of another application scenario provided herein;
FIG. 6 is a diagram illustrating a system architecture provided herein;
FIG. 7A is a schematic flow chart of a video processing method provided in the present application;
FIG. 7B is a schematic flowchart of a video denoising method provided in the present application;
FIG. 8 is a schematic flow chart of another video denoising method provided in the present application;
FIG. 9 is a schematic diagram of a de-color correlation transform provided herein;
FIG. 10 is a schematic diagram of a frequency decorrelation transform according to the present application;
FIG. 11 is a schematic diagram of a time domain fusion method provided in the present application;
FIG. 12 is a schematic diagram of a denoising method provided in the present application;
FIG. 13 is a schematic diagram of a refinement step provided herein;
FIG. 14 is a schematic diagram of an inverse color transform method provided in the present application;
FIG. 15 is a schematic diagram of an inverse wavelet transform process provided in the present application;
FIG. 16 is a schematic view of a multi-scale process provided herein;
fig. 17 is a schematic structural diagram of a video denoising apparatus provided in the present application;
FIG. 18 is a schematic structural diagram of a video processing apparatus provided in the present application;
FIG. 19 is a schematic structural diagram of another video denoising apparatus provided in the present application;
FIG. 20 is a schematic structural diagram of a video processing apparatus provided in the present application;
FIG. 21 is a schematic structural diagram of a chip provided in the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The general workflow of an artificial intelligence system will be described first. Referring to FIG. 1, which shows a schematic structural diagram of the main artificial intelligence framework, the framework is explained below from two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects a series of processes starting from data acquisition, for example the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a "data-information-knowledge-wisdom" refinement process. The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (providing and processing technology implementations) to the industrial ecology of the system.
(1) Infrastructure
The infrastructure provides computing power support for the artificial intelligence system, enables communication with the outside world, and is supported by a base platform. Communication with the outside is performed through sensors; computing power is provided by intelligent chips, such as a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other hardware acceleration chips; the base platform comprises related platform guarantees and support such as a distributed computing framework and networks, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors and external communication acquire data, and the data is provided to intelligent chips in the distributed computing system provided by the base platform for computation.
(2) Data
Data at the upper level of the infrastructure is used to represent the data source for the field of artificial intelligence. The data relates to graphs, images, voice and texts, and also relates to the data of the Internet of things of traditional equipment, including service data of the existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
The machine learning and the deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Inference refers to the process of simulating an intelligent human reasoning mode in a computer or intelligent system, in which the machine uses formalized information to think about and solve problems according to an inference control strategy; typical functions are searching and matching.
Decision-making refers to a process of making a decision after reasoning intelligent information, and generally provides functions of classification, sorting, prediction and the like.
(4) General capabilities
After the above-mentioned data processing, further general capabilities may be formed based on the results of the data processing, such as algorithms or a general system, for example, translation, analysis of text, computer vision processing, speech recognition, recognition of images, and so on.
(5) Intelligent product and industrial application
Intelligent products and industry applications refer to the products and applications of an artificial intelligence system in various fields. They are the encapsulation of the overall artificial intelligence solution, turning intelligent information decisions into commercial products and realizing practical applications. The application fields mainly include smart terminals, intelligent transportation, intelligent healthcare, autonomous driving, smart cities, and the like.
The embodiments of the present application relate to neural networks and their applications in the image field. In order to better understand the solutions of the embodiments of the present application, the following first introduces terms and concepts related to neural networks that may be involved in the embodiments of the present application.
(1) Neural network
A neural network may be composed of neural units. A neural unit may be an arithmetic unit that takes x_s and an intercept of 1 as inputs, and its output may be as shown in equation (1-1):

h_{W,b}(x) = f(W^T x) = f(∑_{s=1}^{n} W_s x_s + b)    (1-1)

where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function (activation function) of the neural unit, which introduces a nonlinear characteristic into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer, and the activation function may be, for example, a sigmoid function. A neural network is a network formed by joining together a plurality of the above single neural units, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit can be connected with the local receptive field of the previous layer to extract features of the local receptive field, and the local receptive field may be a region composed of several neural units.
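As an illustration only (not part of the claimed method), the computation in equation (1-1) for a single neural unit can be sketched in a few lines of Python, assuming a sigmoid activation and arbitrary example weights:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(x, w, b):
    # Output of a single neural unit: f(sum_s W_s * x_s + b), per equation (1-1)
    return sigmoid(np.dot(w, x) + b)

# Example: n = 3 inputs with arbitrary illustrative weights
x = np.array([0.5, -1.2, 2.0])
w = np.array([0.3, 0.8, -0.5])
b = 0.1
print(neural_unit(x, w, b))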
(2) Deep neural network
Deep Neural Networks (DNNs), also called multi-layer neural networks, can be understood as neural networks with multiple intermediate layers. The DNNs are divided according to the positions of different layers, and neural networks inside the DNNs can be divided into three categories: input layer, intermediate layer, output layer. Generally, the first layer is an input layer, the last layer is an output layer, and the middle layers are all middle layers. The layers are all connected, that is, any neuron of the ith layer is necessarily connected with any neuron of the (i + 1) th layer.
Although DNNs appear complex, the work of each layer is actually not complex. In simple terms, each layer performs the following linear relational expression:

y = α(W x + b)

where x is the input vector, y is the output vector, b is the bias (offset) vector, W is the weight matrix (also called coefficients), and α() is the activation function. Each layer simply performs this operation on the input vector x to obtain the output vector y. Because a DNN has many layers, the number of coefficient matrices W and bias vectors b is also large. These parameters are defined in the DNN as follows, taking the coefficient w as an example: assume that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as W^3_{24}, where the superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output third-layer index 2 and the input second-layer index 4.
In summary, the coefficient from the kth neuron at layer L-1 to the jth neuron at layer L is defined as W^L_{jk}.
Note that the input layer has no W parameter. In a deep neural network, more intermediate layers make the network better able to characterize complex situations in the real world. Theoretically, a model with more parameters has higher complexity and larger "capacity", which means that it can accomplish more complex learning tasks. Training a deep neural network is the process of learning the weight matrices, and its final goal is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors W of many layers).
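For illustration, a minimal NumPy sketch of such a stack of layers, each computing y = α(Wx + b); the layer sizes, the ReLU activation, and the random weights are assumptions made only for this example:

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def dnn_forward(x, weights, biases):
    # Each layer computes y = alpha(W x + b); the input layer has no W parameter.
    for W, b in zip(weights, biases):
        x = relu(W @ x + b)
    return x

rng = np.random.default_rng(0)
# A small 3-layer DNN: 4 inputs -> 5 hidden units -> 2 outputs
weights = [rng.standard_normal((5, 4)), rng.standard_normal((2, 5))]
biases = [np.zeros(5), np.zeros(2)]
x = rng.standard_normal(4)
print(dnn_forward(x, weights, biases))
# weights[1][j, k] plays the role of W^L_{jk}: the coefficient from the
# k-th neuron of the previous layer to the j-th neuron of layer L.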
(3) Convolutional neural network
A Convolutional Neural Network (CNN) is a deep neural network with a convolutional structure. The convolutional neural network comprises a feature extractor consisting of convolutional layers and sub-sampling layers, which can be regarded as a filter. The convolutional layer is a neuron layer for performing convolutional processing on an input signal in a convolutional neural network. In convolutional layers of convolutional neural networks, one neuron may be connected to only a portion of the neighbor neurons. In a convolutional layer, there are usually several characteristic planes, and each characteristic plane may be composed of several neural units arranged in a rectangular shape. The neural units of the same feature plane share weights, where the shared weights are convolution kernels. Sharing weights may be understood as the way image information is extracted is location independent. The convolution kernel may be initialized in the form of a matrix of random size, and may be learned to obtain reasonable weights during training of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing connections between layers of the convolutional neural network, while reducing the risk of overfitting.
(4) Loss function (loss function): also known as a cost function (cost function), a function used to measure the difference between the predicted output of a machine learning model for a sample and the true value of that sample. The loss function may generally be, for example, a mean square error, cross-entropy, logarithmic, or exponential loss function. For example, the mean square error can be used as a loss function, defined as

MSE = (1/N) ∑_{i=1}^{N} (y_i − ŷ_i)^2

where y_i is the true value of the i-th sample and ŷ_i is the predicted value. The specific loss function can be selected according to the actual application scenario.
(5) Gradient: the derivative vector of the loss function with respect to the parameter.
(6) Random gradient: the number of samples in machine learning is large, so the loss function calculated each time is calculated from data obtained by random sampling, and the corresponding gradient is called random gradient.
(7) Back Propagation (BP): an algorithm for calculating gradient of model parameters according to a loss function and updating the model parameters.
(8) Foreground and background
In general, the foreground may be understood as a subject included in an image, or an object that needs attention, or the like, or may also be referred to as an instance. The background is the other area of the image than the foreground. For example, if an image including a traffic light is captured, the foreground (or referred to as an example) in the image is the area where the traffic light is located, and the background is the area except the example in the image. For another example, if the vehicle captures an image of a road during driving, other vehicles, lane lines, traffic lights, road blocks, pedestrians, and the like in the image are examples, and the parts other than the examples are backgrounds.
(9) R (red), G (green), B (blue)
Where R represents red, G represents green, and B represents blue, each image may be represented by the color values of the three channels. For example, an RGB image means an image having three color channels, and an RGGB image means an image having four color channels, two of which are G.
(10)YUV
YUV is a color coding method often used in various video processing components. When encoding a photograph or video, YUV allows the bandwidth of the chrominance to be reduced in view of human perceptibility. "Y" represents luminance (i.e., the gray-scale value), and "U" and "V" represent chrominance (i.e., chroma), which describes the color and saturation of an image and is used to specify the color of a pixel.
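As a hedged illustration of a luminance-chrominance transform (the application does not mandate a particular YUV variant), the following sketch uses approximate BT.601 coefficients:

import numpy as np

# Approximate BT.601 RGB -> YUV conversion (illustrative choice of coefficients).
RGB2YUV = np.array([
    [ 0.299,    0.587,    0.114  ],   # Y: luminance
    [-0.14713, -0.28886,  0.436  ],   # U: blue-difference chrominance
    [ 0.615,   -0.51499, -0.10001],   # V: red-difference chrominance
])

def rgb_to_yuv(img):
    # img: H x W x 3 array with RGB values in [0, 1]
    return img @ RGB2YUV.T

img = np.random.rand(4, 4, 3)
yuv = rgb_to_yuv(img)
print(yuv.shape)  # (4, 4, 3): one luminance and two chrominance channels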
The CNN is a commonly used neural network, and as in the following embodiments of the present application, the CNN may be used to perform steps such as feature extraction or fusion. For ease of understanding, the structure of the convolutional neural network will be described below by way of example.
CNN is a deep neural network with a convolutional structure. CNN is a deep learning (deep learning) architecture, which refers to learning at multiple levels at different abstraction levels through a machine learning algorithm. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons respond to overlapping regions in an image input thereto. The convolutional neural network includes a feature extractor consisting of convolutional layers and sub-sampling layers. The feature extractor may be viewed as a filter and the convolution process may be viewed as convolving an input image or convolved feature plane (feature map) with a trainable filter. The convolutional layer is a neuron layer for performing convolutional processing on an input signal in a convolutional neural network. In convolutional layers of convolutional neural networks, one neuron may be connected to only a portion of the neighbor neurons. In a convolutional layer, there are usually several characteristic planes, and each characteristic plane may be composed of several neural units arranged in a rectangular shape. The neural units of the same feature plane share weights, where the shared weights are convolution kernels. Sharing weights may be understood as the way image information is extracted is location independent. The underlying principle among these is: the statistics of a certain part of the image are the same as the other parts. Meaning that image information learned in one part can also be used in another part. The same learned image information can be used for all positions on the image. In the same convolution layer, a plurality of convolution kernels can be used to extract different image information, and generally, the greater the number of convolution kernels, the more abundant the image information reflected by the convolution operation.
The convolution kernel can be initialized in the form of a matrix of random size, and can be learned to obtain reasonable weights in the training process of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing connections between layers of the convolutional neural network, while reducing the risk of overfitting.
The convolutional neural network can adopt a Back Propagation (BP) algorithm to correct the size of parameters in the initial super-resolution model in the training process, so that the reconstruction error loss of the super-resolution model is smaller and smaller. Specifically, error loss occurs when an input signal is transmitted in a forward direction until the input signal is output, and parameters in an initial super-resolution model are updated by reversely propagating error loss information, so that the error loss is converged. The back propagation algorithm is a back propagation motion with error loss as a dominant factor, aiming at obtaining the optimal parameters of the super-resolution model, such as a weight matrix.
The structure of CNN is described in detail below with reference to fig. 2. As described in the introduction of the basic concept, the convolutional neural network is a deep neural network with a convolutional structure, and is a deep learning (deep learning) architecture, and the deep learning architecture refers to performing multiple levels of learning at different abstraction levels through a machine learning algorithm. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons can respond to images input thereto.
As shown in FIG. 2, convolutional layer/pooling layer 120 may include, for example, 121-126 layers, in one implementation, 121 layers are convolutional layers, 122 layers are pooling layers, 123 layers are convolutional layers, 124 layers are pooling layers, 125 layers are convolutional layers, and 126 layers are pooling layers; in another implementation, 121, 122 are convolutional layers, 123 are pooling layers, 124, 125 are convolutional layers, and 126 are pooling layers. I.e., the output of a convolutional layer may be used as input to a subsequent pooling layer, or may be used as input to another convolutional layer to continue the convolution operation.
Taking convolutional layer 121 as an example, convolutional layer 121 may include a plurality of convolution operators, also called kernels, whose role in image processing is equivalent to a filter that extracts specific information from the input image matrix; a convolution operator may essentially be a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix usually slides over the input image pixel by pixel in the horizontal direction (or two pixels by two pixels, depending on the value of the stride), so as to extract a specific feature from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension (depth dimension) of the weight matrix is the same as the depth dimension of the input image, and the weight matrix extends to the entire depth of the input image during the convolution operation. Thus, convolving with a single weight matrix produces a convolved output with a single depth dimension, but in most cases a single weight matrix is not used; instead, a plurality of weight matrices of the same dimension are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image. Different weight matrices may be used to extract different features in the image, e.g., one weight matrix to extract image edge information, another weight matrix to extract a particular color of the image, yet another weight matrix to blur unwanted noise in the image, and so on. The dimensions of the multiple weight matrices are the same, so the dimensions of the feature maps extracted by these weight matrices are also the same, and the extracted feature maps of the same dimension are combined to form the output of the convolution operation.
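For illustration, the following sketch (assuming the PyTorch library, which is not required by the application) shows a convolutional layer whose several kernels each span the full input depth and whose stacked outputs form the depth dimension of the feature map:

import torch
import torch.nn as nn

# A convolution layer with several kernels: each weight matrix (kernel) spans the
# full input depth and produces one output channel; stacking the outputs of all
# kernels forms the depth dimension of the convolved feature map.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=1, padding=1)

x = torch.randn(1, 3, 64, 64)        # one RGB image
features = conv(x)
print(features.shape)                 # torch.Size([1, 8, 64, 64])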
Generally, the weight values in the weight matrix need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can extract information from the input image, thereby helping the convolutional neural network 100 to perform correct prediction.
When convolutional neural network 100 has multiple convolutional layers, the initial convolutional layer (e.g., 121) tends to extract more general features, which may also be referred to as low-level features; as the depth of the convolutional neural network 100 increases, the more convolutional layers (e.g., 126) that go further back extract more complex features, such as features with high levels of semantics, the more highly semantic features are more suitable for the problem to be solved.
A pooling layer:
Since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be periodically introduced after a convolutional layer. In the layers 121-126 illustrated by 120 in fig. 2, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain an image of smaller size. The average pooling operator may calculate the pixel values in the image over a particular range to produce an average value. The max pooling operator may take the pixel with the largest value in a particular range as the result of max pooling. In addition, just as the size of the weight matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average value or maximum value of the corresponding sub-region of the image input to the pooling layer.
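A brief illustrative sketch of the average and max pooling operators described above, again assuming PyTorch; the kernel size of 2 is an arbitrary example:

import torch
import torch.nn as nn

x = torch.randn(1, 8, 64, 64)

avg_pool = nn.AvgPool2d(kernel_size=2)   # average pooling operator
max_pool = nn.MaxPool2d(kernel_size=2)   # max pooling operator

print(avg_pool(x).shape)  # torch.Size([1, 8, 32, 32]) -- spatial size reduced
print(max_pool(x).shape)  # torch.Size([1, 8, 32, 32])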
The neural network layer 130:
after processing by convolutional layer/pooling layer 120, convolutional neural network 100 is not sufficient to output the required output information. Because, as previously described, the convolutional layer/pooling layer 120 only extracts features and reduces the parameters brought by the input image. However, to generate the final output information (class information or other relevant information as needed), the convolutional neural network 100 needs to generate one or a set of outputs of the number of classes as needed using the neural network layer 130. Accordingly, a plurality of hidden layers (131, 132 to 13n as shown in fig. 2) and an output layer 140 may be included in the neural network layer 130. In this application, the convolutional neural network is: and carrying out at least one deformation on the selected starting point network to obtain a serial network, and then obtaining the serial network according to the trained serial network. The convolutional neural network can be used for image recognition, image classification, image super-resolution reconstruction and the like.
After the hidden layers in the neural network layer 130, the last layer of the whole convolutional neural network 100 is the output layer 140. The output layer 140 has a loss function similar to categorical cross-entropy, which is specifically used to calculate the prediction error. Once the forward propagation of the whole convolutional neural network 100 is completed (the propagation from 110 to 140 in fig. 2 is the forward propagation), the backward propagation (the propagation from 140 to 110 in fig. 2 is the backward propagation) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 100 and the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.
It should be noted that the convolutional neural network 100 shown in fig. 2 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models, for example, as shown in fig. 3, a plurality of convolutional layers/pooling layers are parallel, and the features extracted respectively are all input to the overall neural network layer 130 for processing.
The CNN mentioned below in the present application may refer to the CNN shown in fig. 2 or fig. 3.
In the embodiment of the application, the video can be denoised, so that the display quality of the video is improved. The video denoising method provided by the application can be executed by a terminal and can also be executed by a server. For example, the method provided by the present application may be deployed in a user's mobile phone, a camera, a video monitor, a television, a server, an image signal processor (ISP), or the like, and is used for performing denoising processing on captured or received video data.
For example, the video denoising method provided by the present application may be applied to a smart city scene, as shown in fig. 4, low-quality video data acquired by each monitoring device may be acquired, and may be stored in a memory. When the video data are played, the video data can be denoised by the video denoising method provided by the application, so that clearer video data can be obtained, and the watching experience of a user is improved.
For another example, the video denoising method provided by the present application can be applied to various video shooting scenes. For example, a user may use the terminal to take a video and save it locally. Before the user uses the terminal to play the video, the stored video data can be denoised by the video denoising method provided by the application, so that the video data with higher quality can be obtained, and the watching experience of the user can be improved.
For example, the video denoising method provided by the present application may be applied to a live video scene, and as shown in fig. 5, a server may send a video stream to a client used by a user. After the client receives the data stream sent by the server, the data stream can be denoised by the video denoising method provided by the application, so that video data with higher image quality can be obtained, and the watching experience of a user is improved.
In addition, the video denoising method provided by the application can also be applied to scenes such as an automatic driving scene and image enhancement, and the description is omitted here.
The video denoising method provided by the embodiment of the application can be executed on a server and can also be executed on terminal equipment. The terminal device may be a mobile phone with an image processing function, a Tablet Personal Computer (TPC), a media player, a smart tv, a notebook computer (LC), a Personal Digital Assistant (PDA), a Personal Computer (PC), a camera, a camcorder, a smart watch, a Wearable Device (WD), an autonomous vehicle, or the like, which is not limited in the embodiment of the present application.
For example, the system architecture of the application of the video denoising method provided by the present application can be as shown in fig. 6. In the system architecture 400, the server cluster 410 is implemented by one or more servers, optionally in cooperation with other computing devices, such as: data storage, routers, load balancers, and the like. The server cluster 410 may use data in the data storage system 250 or call program code in the data storage system 250 to implement the steps of the video denoising method provided herein.
The user may operate respective user devices (e.g., local device 401 and local device 402) to interact with the server cluster 410. Each local device may represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set-top box, gaming console, and so forth.
Each user's local device may interact with the server cluster 410 via a communication network of any communication mechanism/communication standard, such as a wide area network, a local area network, a peer-to-peer connection, etc., or any combination thereof. In particular, the communication network may include a wireless network, a wired network, or a combination of a wireless network and a wired network, and the like. The wireless network includes but is not limited to: a fifth generation mobile communication technology (5th-Generation, 5G) system, a Long Term Evolution (LTE) system, a global system for mobile communication (GSM) or Code Division Multiple Access (CDMA) network, a Wideband Code Division Multiple Access (WCDMA) network, wireless fidelity (WiFi), Bluetooth (bluetooth), the Zigbee protocol (Zigbee), radio frequency identification technology (RFID), long range (LoRa) wireless communication, Near Field Communication (NFC), or a combination of any one or more of these. The wired network may include a fiber optic communication network or a network of coaxial cables, among others.
For example, in an application scenario, any one of the servers in the server cluster 410 may acquire video data from a data storage system or other devices, such as a terminal, a PC, and the like, and then perform denoising processing on the video data by using the video denoising method provided by the present application, so that each frame image in the video data is clearer, user experience is improved, and the denoised video data is sent to a local device.
However, even with advances in technology, digital images are always subject to inherent or external interference due to the randomness of the acquisition process and/or the presence of challenging sensing conditions, such as high noise in images captured in low-light scenes. These interference factors are random and can be modeled as a random variable whose random fluctuations are referred to as "noise". Noise in images acquired by common image sensors such as charge-coupled device (CCD) or complementary metal oxide semiconductor (CMOS) sensors approximately follows a combination of Poisson and Gaussian distributions, modeling signal-dependent and signal-independent noise sources, respectively. Denoising refers to the removal of such random noise from noisy data without introducing artifacts or modifying the original image structure.
In general, photographing and imaging on terminal equipment are limited by the hardware performance of its optical sensor, and due to imperfections of the acquisition process, the formation of a digital image is always affected by different forms of noise and degradation; an image restoration algorithm must be used to restore the degraded input to a high-quality output. There are many image restoration methods, such as denoising, demosaicing, super-resolution, etc. In particular, denoising is an essential task for the camera processing path (e.g., in a smartphone or video surveillance camera): as the first typical operation in the camera path, the denoising quality directly affects the output results of all subsequent tasks.
Some commonly used denoising algorithms are based on image processing techniques, which exploit the statistical properties of the input data. For example, denoising can be performed by an algorithm with a low computational complexity, such as a local correlation method, a non-local correlation method, or a sparse algorithm. While more complex algorithms such as NL-means or BM3D can produce higher quality outputs, these algorithms are very slow in reasoning on general purpose processors and need to be implemented on specific hardware. Generally, to better generalize to input images of arbitrary noise levels, multiple parameters need to be carefully and manually adjusted one by one to explicitly control these denoising algorithms to obtain better denoising effect. Therefore, the denoising methods need a large amount of manual participation and abundant manual debugging experience, and the denoising cost is high.
For video denoising, some common approaches are denoising based on non-local imaging or motion estimation. While these approaches work well, they typically require significant computational power to process multiple frame inputs simultaneously. Or the spatial-temporal fusion of the video information can be processed in a cyclic recursive manner, but the denoising quality is often unacceptable.
Typically, neural networks (e.g., CNNs) are based on a large number of trainable convolution kernels whose parameters are optimized in a supervised fashion by a task-specific penalty function. Given enough data, a large number of parameters can automatically map the degraded noisy input to the recovered denoised output during the training process. Another advantage of standard feed forward CNNs is that their inference time is fast, since their basic operations (i.e. convolution) can be easily optimized in hardware. CNN requires a large number of parameters to solve the problem efficiently and reliably, and once the complexity of CNN is reduced, the performance is also drastically reduced. Furthermore, the training cost is high due to the large amount of training data required for CNN.
Therefore, the present application provides a video processing method and a video denoising method, which realize lightweight denoising, improve the denoising effect, obtain clearer images, and thereby obtain video data with better image quality. The video processing method and the video denoising method can be applied to consumer-grade products such as mobile phones, video surveillance, or televisions, or to high-performance computing devices such as cloud products, to improve the imaging quality of videos. The denoising difficulty of noisy images is reduced through decorrelation transformation, and the noise in video images is removed through temporal fusion, spatial denoising, refinement, and the like, so as to obtain a clean output and enhance the imaging quality of the video images.
First, a flow of a video processing method will be described with reference to fig. 7A.
701. The current frame and the first fused image are acquired.
The current frame may be a non-first frame in the video data, that is, any one of frames arranged after the first frame in a preset order. The first fused image is fused with information of at least one frame image adjacent to the current frame in a preset order, such as one or more frame images arranged before or after the current frame.
The preset sequence may be a playing sequence of the video data, or a sequence arranged according to a time sequence, or a sequence opposite to the playing sequence, and may be specifically adjusted according to an actual application scenario, and is not limited herein.
For example, each frame of image in the video may be denoised according to the playing sequence of the video, so as to obtain a clearer image. In the process of denoising each frame of image, denoising each frame of image by selectively combining the fusion image of one or more frames of images adjacent to each frame of image, thereby combining the time domain information in the video to denoise the image, improving the video denoising effect and obtaining the video with better image quality.
It should be understood that if the current frame is a first frame arranged in a preset order in the video data, when the first frame is processed, the first fused image may not exist, and the step of fusing the current frame and the first fused image does not need to be performed. If the current frame is the second frame, the first fused image can be directly the first frame.
702. Features are extracted from the first fused image to obtain a second feature.
In this case, features may be extracted from the current frame and the first fused image, respectively, and for the sake of understanding, the features extracted from the current frame are referred to as first features, and the features extracted from the first fused image are referred to as second features.
In particular, features may be extracted from the image using a feature extraction network, which may be a CNN as described above, or other neural network that includes one or more convolutions, or the like.
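A minimal sketch of such a feature extraction network, assuming PyTorch; the number of layers and channels is illustrative and not specified by the application:

import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    # A small CNN used to extract features from the current frame or the fused
    # image; the number of layers and channels here is illustrative only.
    def __init__(self, in_channels=3, feat_channels=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)

extractor = FeatureExtractor()
current_frame = torch.randn(1, 3, 128, 128)
first_fused = torch.randn(1, 3, 128, 128)
first_feature = extractor(current_frame)    # features of the current frame
second_feature = extractor(first_fused)     # features of the first fused image
print(first_feature.shape, second_feature.shape)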
703. A first fusion weight and a second fusion weight are determined according to the first feature and the second feature.
After extracting the features from the current frame and the first fused image, the first feature and the second feature need to be fused, and before this, the weight occupied by each image when the current frame and the first fused image are fused can be respectively determined according to the features of the foreground and the background respectively included in the first feature and the second feature.
It is understood that the first feature may include features of a foreground portion and a background portion in the current frame, and may be used to identify a position of the foreground portion and a position of the background portion in the current frame; the second features may include features of the foreground portion and the background portion in the first fused image, and may be used to identify a location of the foreground portion and a location of the background portion in the first fused image. For example, in video data, the position of the foreground in each frame of image may be different with time, while the position of the foreground included in the current frame is generally more accurate, so that the foreground portion and the background portion in the current frame need to be distinguished for subsequent fusion. The first fused image is fused with information of at least one frame of image arranged before the current frame, so that the noise included in the first fused image is usually less, and the image quality is better.
In this embodiment, the foreground part and the background part in the first fused image may be distinguished by the feature extracted from the first fused image, so that when the current frame and the first fused image are fused subsequently, the weights respectively occupied by the foreground part and the background part may be determined more accurately, and thus the noise in the current frame is smoothed, and the noise included in the fused image is reduced.
The weight corresponding to the foreground in the current frame is not less than the weight of the foreground in the first fused image, and the weight of the background portion in the current frame is not greater than the weight of the background portion in the first fused image. Specifically, the foreground portion in the current frame may be determined from the first feature, and the foreground portion in the first fused image may be determined from the second feature, and the foreground portion in the current frame may then be given a higher weight; for example, the foreground portion in the current frame may be set to a weight of 0.8, and the foreground portion in the first fused image may be set to a weight of 0.2. The background portion in the current frame may also be determined from the first feature, and the background portion in the first fused image from the second feature, and the background portion in the current frame may then be given a lower weight; for example, a weight of 0.4 may be set for the background portion in the current frame and a weight of 0.6 for the background portion in the first fused image.
704. The current frame and the first fused image are fused according to the first fusion weight and the second fusion weight to obtain a second fused image.
After the first fusion weight and the second fusion weight are obtained, the current frame and the first fusion image can be fused according to the first fusion weight and the second fusion weight to obtain a second fusion image.
Generally, the position of the foreground part in the current frame is more accurate, and the noise in the first fused image is less, so that when the current frame and the first fused image are fused, the foreground in the current frame and the background in the first fused image can be referred to more, thereby reducing the ghost in the fused image and reducing the noise in the fused image.
In the embodiment of the application, the current frame and the first fused image are fused, which is equivalent to combining the time domain related information between adjacent frames in a scene, so that the noise in the current frame is smoothed, the noise in the image is reduced, and the second fused image with smaller noise is obtained. In addition, when the current frame and the first fused image are fused, more reference is made to the foreground part in the current frame and the background part in the first fused image, and because the position of an object in the video data can change along with time, in the embodiment of the application, ghost images in the second fused image can be greatly reduced through the foreground included in the current frame and the background included in the first fused image, and the quality of the finally obtained image is improved.
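The following sketch illustrates one possible realization of steps 703 and 704 under stated assumptions: the first fusion weight map is predicted by a small convolutional head with a sigmoid output, and the second fusion weight is taken as one minus the first; the application does not fix this particular structure.

import torch
import torch.nn as nn

class FusionWeightNet(nn.Module):
    # Predicts a per-pixel weight map for the current frame from the concatenated
    # first and second features; the first fused image implicitly receives 1 - w.
    # The structure and the use of a sigmoid here are illustrative assumptions.
    def __init__(self, feat_channels=16):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(2 * feat_channels, feat_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, 1, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, first_feature, second_feature):
        return self.head(torch.cat([first_feature, second_feature], dim=1))

def temporal_fusion(current_frame, first_fused, w_current):
    # Weighted fusion: foreground regions should get a large w_current (e.g. 0.8),
    # static background a small one (e.g. 0.4), so the background converges to a
    # multi-frame average while motion comes mainly from the current frame.
    return w_current * current_frame + (1.0 - w_current) * first_fused

weight_net = FusionWeightNet()
f1 = torch.randn(1, 16, 128, 128)          # first feature (current frame)
f2 = torch.randn(1, 16, 128, 128)          # second feature (first fused image)
w = weight_net(f1, f2)                      # first fusion weight map
current = torch.randn(1, 3, 128, 128)
fused_prev = torch.randn(1, 3, 128, 128)
second_fused = temporal_fusion(current, fused_prev, w)
print(second_fused.shape)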
Optionally, before the current frame and the first fused image are fused, the current frame may further be color-transformed by a color transformation matrix to obtain at least one chrominance component and at least one luminance component; for ease of distinction, these are referred to as the first chrominance component and the first luminance component, and the at least one first chrominance component and the at least one first luminance component may constitute a new current frame, that is, the color-transformed current frame. The first fused image is likewise color-transformed by the color transformation matrix to obtain at least one chrominance component and at least one luminance component, referred to here as the second chrominance component and the second luminance component for ease of distinction. The at least one second chrominance component and the at least one second luminance component may then constitute a new first fused image. When the current frame and the first fused image are fused, the new current frame and the new first fused image after the color transformation can be fused, so as to obtain the second fused image.
Performing color transformation on the current frame and the first fused image is equivalent to performing a color decorrelation transform on them, transforming the current frame and the first fused image from the pixel domain to a different domain for the subsequent denoising operation.
Specifically, the color matrix may be preset, or may be obtained by training at least one convolution kernel. For example, the method provided by the present application may be implemented by a neural network, and when the neural network is updated by using a large number of samples, the color transformation matrix may be updated at the same time, so as to obtain an updated color transformation matrix.
A difference from the video denoising method provided in this application is that, after the second fused image is obtained, inverse color transformation is performed on the second fused image through an inverse color transformation matrix to obtain an updated second fused image, where the inverse color transformation matrix is the inverse matrix of the color transformation matrix. In the embodiment of the present application, after the color decorrelation transform is performed, the inverse color transform may be performed to restore the colors in the image and obtain a color image.
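An illustrative sketch of the color transform and its inverse; the 3x3 matrix below is only an example of a luminance-chrominance decorrelation matrix, whereas in the application the matrix may be preset or learned:

import numpy as np

# An illustrative 3x3 color transformation matrix; in the application it may be
# preset or learned as convolution kernels, with the constraint that the product
# of the matrix and its inverse is the identity.
color_matrix = np.array([
    [ 0.299,  0.587,  0.114],
    [-0.169, -0.331,  0.5  ],
    [ 0.5,   -0.419, -0.081],
])
inverse_color_matrix = np.linalg.inv(color_matrix)

def color_transform(frame, matrix):
    # frame: H x W x 3; returns one luminance and two chrominance components
    return frame @ matrix.T

frame = np.random.rand(8, 8, 3)
transformed = color_transform(frame, color_matrix)
restored = color_transform(transformed, inverse_color_matrix)
print(np.allclose(frame, restored))  # True: the inverse transform restores the colors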
Optionally, before the current frame and the first fused image are fused, decorrelation transformation of spatial frequency may be further performed on the current frame and the first fused image, so as to perform subsequent image denoising. Specifically, the current frame and the first fused image may be subjected to wavelet transform, to obtain at least one low-frequency component and at least one high-frequency component of the current frame and at least one low-frequency component and at least one high-frequency component of the first fused image. For the convenience of distinction, the low frequency component and the high frequency component of the current frame are respectively referred to as a first low frequency component and a first high frequency component, and at least one first low frequency component and at least one first high frequency component constitute a new current frame; the low-frequency component and the high-frequency component of the first fusion image are respectively called as a second low-frequency component and a second high-frequency component, and at least one second low-frequency component and at least one second high-frequency component form a new first fusion image, so that the main structure and detail information in the image are separated from the dimensionality of the spatial frequency, the main structure and detail in the image are better denoised respectively, and the denoising effect is improved.
Specifically, a wavelet transform may be performed on a current frame using a wavelet coefficient to obtain a first low-frequency component and a first high-frequency component, where the wavelet coefficient is a preset coefficient or is obtained by training at least one convolution kernel, the first low-frequency component and the first high-frequency component constitute a new current frame, and a wavelet transform may be performed on a first fused image to obtain a second low-frequency component and a second high-frequency component, and the second low-frequency component and the second high-frequency component constitute a new first fused image.
If color transformation and wavelet transformation are required to be carried out on the current frame and the first fused image, the wavelet transformation can be carried out after the color transformation is carried out, so that the color-related transformation and the spatial frequency decorrelation of the current frame and the first fused image are realized, the subsequent denoising is convenient, and the image with better denoising effect is obtained.
A difference from the video denoising method provided in the present application is that, if the wavelet transform is performed, the inverse wavelet transform may be performed on the second fused image after it is obtained; for example, the inverse wavelet transform is performed on the second fused image through an inverse wavelet coefficient to obtain an updated second fused image, where the inverse wavelet coefficient is the inverse matrix of the wavelet coefficient. Therefore, the inverse wavelet transform can accurately restore the high-frequency components and the low-frequency components, which is equivalent to restoring the structure and details of the image, and a clearer second fused image is obtained.
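For illustration, a one-level 2D Haar wavelet transform (an assumed choice of wavelet) that splits an image into one low-frequency and three high-frequency components, together with its exact inverse:

import numpy as np

def haar_dwt2(img):
    # One level of a 2D Haar wavelet transform (an illustrative choice of wavelet):
    # returns one low-frequency component (LL) and three high-frequency components
    # (LH, HL, HH), each at half the input resolution.
    a = img[0::2, 0::2]
    b = img[0::2, 1::2]
    c = img[1::2, 0::2]
    d = img[1::2, 1::2]
    ll = (a + b + c + d) / 4.0
    lh = (a + b - c - d) / 4.0
    hl = (a - b + c - d) / 4.0
    hh = (a - b - c + d) / 4.0
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    # Inverse Haar transform: exactly restores the original image.
    h, w = ll.shape
    img = np.zeros((2 * h, 2 * w), dtype=ll.dtype)
    img[0::2, 0::2] = ll + lh + hl + hh
    img[0::2, 1::2] = ll + lh - hl - hh
    img[1::2, 0::2] = ll - lh + hl - hh
    img[1::2, 1::2] = ll - lh - hl + hh
    return img

x = np.random.rand(8, 8)
ll, lh, hl, hh = haar_dwt2(x)
print(np.allclose(haar_idwt2(ll, lh, hl, hh), x))  # True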
In addition, it can be understood that when the next frame of the current frame is processed, that is, the next frame is used as a new current frame, the second fused image can be used as a new first fused image to perform denoising processing on the new current frame, so that iterative denoising of video data is realized, the temporal correlation between adjacent frames in the video data is used for denoising, the denoising quality of the image is improved, and a denoised image with better image quality is obtained.
In order to further reduce noise in the second fused image, the second fused image may be further subjected to denoising processing, referring to fig. 7B, which is a schematic flow chart of a video denoising method provided by the present application.
It should be noted that, in the embodiment of the present application, the steps 701-704 may refer to the related description in fig. 7A, and are not described herein again.
705. The second fused image is denoised to obtain a denoised image.
After the current frame and the first fused image are fused, the noise of the obtained second fused image is reduced relative to the current frame, and at the moment, the second fused image can be denoised continuously, so that the noise in the second fused image is reduced, and the denoised image with less noise is obtained.
Specifically, the second fused image may be subjected to filtering processing, such as FIR (finite impulse response) filtering, median filtering, Wiener filtering, and the like, so as to reduce noise in the second fused image and obtain a denoised image with better image quality.
Optionally, when denoising the second fused image, denoising may be performed by combining the first fused weight, the second fused weight, the current frame, and the first fused image. Therefore, when denoising is carried out, information used for fusing the current frame and the first fused image can be multiplexed, the data utilization rate is improved, and the subsequent denoising effect is better.
Optionally, the variance of each pixel point in the second fusion image may be calculated by combining the first fusion weight, the first fusion image, and the current frame, so as to obtain a fusion image variance. And then fusing the variance of the fusion image and the second fusion image to obtain a denoised image after further denoising.
For example, the variance of each pixel point in the fusion image can be calculated through a cyclic recursion formula to obtain the variance of the fusion image, the variance of each pixel point in the second fusion image is calculated by repeatedly using the information of time domain fusion, and then the variance is combined with the second fusion image to perform denoising, so that a denoised image with better image quality is obtained.
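The application does not give the exact recursion, so the following sketch only illustrates one plausible cyclic-recursive variance update under the assumption that the fusion is a per-pixel linear combination of the current frame and the first fused image:

import numpy as np

def fused_variance(w_current, noise_variance_current, prev_fused_variance):
    # One possible cyclic-recursive variance update, assuming the second fused
    # image is a per-pixel linear combination of the current frame and the first
    # fused image: var_t = w^2 * var(noise_t) + (1 - w)^2 * var_{t-1}.
    # The exact recursion used by the application may differ.
    return (w_current ** 2) * noise_variance_current \
        + ((1.0 - w_current) ** 2) * prev_fused_variance

w = np.full((4, 4), 0.6)                   # first fusion weight (per pixel)
noise_var = np.full((4, 4), 0.01)          # noise variance of the current frame
prev_var = np.full((4, 4), 0.004)          # variance of the first fused image
print(fused_variance(w, noise_var, prev_var))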
More specifically, the feature may be extracted from the second fused image to obtain a third feature, the feature may be extracted from the variance of the fused image to obtain a sixth feature, and then the third feature and the sixth feature are fused to obtain the denoised image. In this embodiment, the noise in the second fusion image is further smoothed by fusing the variance of the fusion image, so as to obtain a denoised image with less noise.
Specifically, for example, the denoising step may be performed by a denoising CNN, where the denoising CNN includes one or more convolutional layers and rectified linear units (ReLU), and the denoised image may be output by using the second fused image and the fusion image variance as inputs of the denoising CNN. The denoising CNN can be obtained by training on a large number of samples, and can be used to extract features and fuse the extracted features.
In addition, denoising can be performed in combination with the current frame. Specifically, for example, when denoising is performed using the denoising CNN, the current frame may be used as an additional input of the denoising CNN, in addition to the fusion image variance and the second fused image, and the denoised image may be output.
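A hedged PyTorch sketch of such a denoising CNN: a stack of convolution and ReLU layers that takes the second fused image, the fusion image variance, and the current frame as inputs; the depth and width are illustrative only:

import torch
import torch.nn as nn

class DenoiseCNN(nn.Module):
    # A sketch of the denoising CNN described above: a stack of convolution + ReLU
    # layers taking the second fused image, the fusion image variance, and the
    # current frame as inputs; depth and width are illustrative assumptions.
    def __init__(self, img_channels=3, hidden=32, num_layers=4):
        super().__init__()
        in_channels = img_channels * 2 + 1      # fused image + current frame + variance map
        layers = [nn.Conv2d(in_channels, hidden, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(num_layers - 2):
            layers += [nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(hidden, img_channels, 3, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, second_fused, fused_variance, current_frame):
        x = torch.cat([second_fused, fused_variance, current_frame], dim=1)
        return self.net(x)

model = DenoiseCNN()
second_fused = torch.randn(1, 3, 128, 128)
variance = torch.rand(1, 1, 128, 128)
current = torch.randn(1, 3, 128, 128)
denoised = model(second_fused, variance, current)
print(denoised.shape)  # torch.Size([1, 3, 128, 128])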
Therefore, in the embodiment of the application, each frame in the video can be updated iteratively, which is equivalent to smoothing the noise in each frame of image by using the correlation between the images in the video in the time domain, so that the denoising effect of each frame in the video data is improved, thereby obtaining the video data with better image quality and improving the user experience.
Optionally, if color transformation is performed on the current frame and the first fused image by the color transformation matrix before they are fused, the denoised image may be subjected to inverse color transformation after the second fused image is denoised. Specifically, the denoised image may be inversely transformed by an inverse color transformation matrix, where the inverse color matrix is the inverse matrix of the color transformation matrix, that is, multiplying the color matrix and the inverse color matrix yields an identity matrix. When the at least one convolution kernel is updated, the constraint that the product of the color matrix and the inverse color matrix is the identity matrix can be applied to the update, so that the inverse color transformation can accurately restore the colors of the image, and a clearer denoised image is obtained.
It should be understood that if step 705 is not performed, the second fused image may be directly subjected to inverse color transformation in a manner similar to the above-mentioned transformation of the denoised image, so as to recover the color of the second fused image, thereby obtaining a clear second fused image.
Optionally, if the wavelet transform is performed on the current frame and the first fused image before the current frame and the first fused image are fused, after the second fused image is denoised, the denoised image may be subjected to inverse wavelet transform, so as to obtain a denoised image related to the dimension of the spatial frequency. Specifically, the denoised image may be subjected to inverse wavelet transform by an inverse wavelet coefficient, to obtain an updated denoised image, where the inverse wavelet coefficient is an inverse matrix of the wavelet coefficient.
It should be understood that if step 705 is not performed, the inverse wavelet transform may be directly performed on the second fused image in a similar manner to the above-mentioned transformation on the denoised image, so as to recover the structure and details of the second fused image and obtain a clear second fused image.
Therefore, in the embodiment of the application, the color, the high frequency and the low frequency of the image can be separated through color removing transformation and decorrelation transformation of space dimensionality, so that the noise can be filtered more accurately, and the de-noised image with better de-noising effect can be obtained.
706. The second fused image and the denoised image are fused to obtain an updated denoised image.
After the denoised image is obtained, in order to avoid excessive smoothness, the second fused image and the denoised image can be fused, so that details in the denoised image are enriched through information included in the second fused image, the denoised image including more details is obtained, and the quality of the denoised image is improved.
Optionally, the specific manner of fusing the second fused image and the denoised image may include: firstly, extracting features from a second fusion image to obtain third features; subsequently, extracting features from the de-noised image to obtain a fourth feature; determining a third fusion weight and a fourth fusion weight according to the third feature and the fourth feature, wherein the third fusion weight is a weight corresponding to the third feature, and the fourth fusion weight is a weight corresponding to the fourth feature, and in the third feature and the fourth feature, the frequency of each pixel point and the corresponding weight value are in a negative correlation relationship; and fusing the third characteristic and the fourth characteristic according to the third fusion weight and the fourth fusion weight to obtain an updated denoised image. Therefore, in the embodiment of the application, the weight value can be determined according to the frequency of the pixel point, and the higher the frequency is, the lower the corresponding weight value is, so that the noise included in the high-frequency information can be effectively smoothed, and the denoising effect is realized.
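As a rough illustration of the idea that the weight decreases as the pixel frequency increases, the sketch below estimates local frequency with a Laplacian high-pass filter; in the application the weights are instead determined from the learned third and fourth features:

import torch
import torch.nn.functional as F

def refine(second_fused, denoised):
    # A hedged sketch of the refinement step: estimate the local frequency of each
    # pixel with a Laplacian high-pass filter and give high-frequency (detail/noise)
    # pixels a lower weight for the noisy fused image, so its details are
    # reintroduced mainly in smooth regions. The actual weight prediction in the
    # application is learned from the third and fourth features.
    lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]).view(1, 1, 3, 3)
    lap = lap.repeat(second_fused.shape[1], 1, 1, 1)
    freq = F.conv2d(second_fused, lap, padding=1, groups=second_fused.shape[1]).abs()
    w_fused = 1.0 / (1.0 + freq)             # weight decreases as frequency grows
    return w_fused * second_fused + (1.0 - w_fused) * denoised

second_fused = torch.randn(1, 3, 128, 128)
denoised = torch.randn(1, 3, 128, 128)
print(refine(second_fused, denoised).shape)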
Therefore, in the embodiment of the application, after the second fused image is denoised to obtain the denoised image, the second fused image and the denoised image can be fused, so that the denoised image can be prevented from being excessively smooth, the second fused image is used for enriching details in the denoised image, the denoised image with richer details is obtained, the image quality of the denoised image is improved, and the user experience is improved.
The foregoing describes the flow of the method provided by the present application, and for convenience of understanding, the flow of the video denoising method provided by the present application is further described below based on a more detailed application scenario.
For example, the flow of another video denoising processing method provided by the present application can refer to fig. 8.
The embodiment of the application can be understood as providing a denoising method, which cyclically uses a multi-stage processing method to remove noise in a video, where an input video may be an image sequence acquired by any camera or sensor, and each frame of image may be identified by different time steps, as denoted by {0,1,2, …, t, … }.
First, the current frame is represented as a noise frame noise due to noise t Noise frame noise t Which may be any frame of video data. Noise frame noise t In the non-first frame, a Fused image Fused with the image of the previous frame, such as represented by Fused, can be obtained t-1
Then to Noisy t And Fused t-1 Decorrelation operations, such as color decorrelation, decorrelation of spatial frequencies, etc., are performed to facilitate subsequent noise smoothing. The input video image may be first converted to a color-decorrelated luminance-chrominance space using a color transform, and then frequency-decorrelated using a wavelet transform. The two kinds of transformation can reduce the denoising difficulty, and the wavelet transformation has the advantages of reducing the image resolution and the computational complexity.
Time domain fusion, refers to decorrelated noise at different time periods t And Fused t-1 Fusing to obtain Fused image Fused t Fused image Fused t The information of the time domain correlation between the current frame and the last fused frame is fused, so that the noise can be smoothed. When the next frame is processed, Fused is performed t And smoothing the noise to realize iterative denoising of the video data. The temporal fusion step can detect the object information of the motion between the frames of the video, and make the static background of the image gradually converge to the effect of multi-frame averaging by circularly and recursively using the fused image of the previous frame (t-1) and the noise input of the current frame, while the motion foreground comes from the current frame. Therefore, the noise in the input image is greatly reduced, the fusion image is obtained, and the denoising difficulty is greatly reduced.
After Fused_t is obtained, spatial denoising can be performed: Fused_t is filtered to obtain the denoised image Denoised_t.
Then, Denoised_t is refined in combination with Fused_t, i.e., the details contained in Fused_t are fused into Denoised_t, to prevent Denoised_t from losing detail due to over-smoothing.
Subsequently, an inverse transform, i.e., the inverse of the decorrelation transform, is applied to the refined Denoised_t to restore the color, scale, etc. corresponding to Noisy_t, yielding the output frame, e.g., denoted Output_t.
In the embodiment of the present application, the current frame at time t is temporally fused in the fusion stage with the image fused at the previous time step (t-1). The fusion stage exploits the temporal correlation between successive frames to reduce noise in an optimal way. The denoising stage then removes the residual noise from the fused image. The initial noise reduction of the fusion stage benefits the denoising stage, making the denoising task easier.
However, denoising may produce an imperfect output. Therefore, the current fused image and the denoised image are fused in a refinement stage. The purpose of refinement is to take image structure from the fused image, which is rich in detail but still noisy, and add it to the denoised image, which is noise-free but may be overly smooth, ultimately producing a higher-quality final output image.
It should be noted that, taking fig. 8 as an example, the video processing method provided in the present application follows a flow similar to that of the video denoising method; the difference is that in the video processing method an image with improved quality can be obtained without performing the denoising and refinement steps, and the input of the subsequent inverse transform step is then directly Fused_t. The present application exemplarily illustrates the overall flow of the video denoising method.
For ease of understanding, each step in the flow of the video denoising method provided in the present application is described in detail below. The flow can be divided into decorrelation transform, temporal fusion, denoising, refinement, inverse transform and other steps, which are each introduced in more detail.
First, noise that may be included in an image is exemplarily described for ease of understanding.
The noisy image of frame t (Noisy_t) can be regarded as the sum of a clean image (Clean_t) and random noise (Noise_t): Noisy_t = Clean_t + Noise_t, where the noise may be a random variable following a zero-mean Gaussian distribution with signal-dependent variance, for example:

Noise_t ~ Gaussian(mean = 0, variance = Shot_t · Clean_t + Read_t)
Here, Shot_t denotes shot noise and Read_t denotes read noise. Shot noise and read noise are generally determined by the performance of the device capturing the video, the capture parameters, or environmental information, for example determined from exposure, gain, sensitivity (ISO), and the like.
Specifically, the ISO corresponding to each frame image may differ; the ISO value and the noise level generally have a positive correlation, i.e., the larger the ISO value, the larger the noise. The noise may also be correlated with the ambient brightness, e.g., the higher the brightness, the greater the noise. The relationship between the noise parameters (such as shot noise and read noise) and the settings of the device capturing the video is called the Noise Level Function (NLF), and the NLF of a device can typically be obtained by a calibration procedure.
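For ease of understanding only, the following Python-style sketch illustrates how a per-pixel noise variance map could be estimated from shot noise and read noise under the Gaussian noise model above; the function names and the simple ISO-based noise level function are illustrative assumptions rather than part of the method, and the slopes would normally come from a per-device calibration.

```python
import numpy as np

def noise_level_function(iso, shot_slope=1e-6, read_slope=1e-7):
    # Illustrative NLF: shot and read noise assumed to grow with ISO.
    shot = shot_slope * iso
    read = read_slope * iso ** 2
    return shot, read

def noise_variance_map(noisy_frame, iso):
    # noisy_frame: H x W x C image; the variance is estimated from the
    # luminance (mean over color channels), since the clean image is unknown.
    shot, read = noise_level_function(iso)
    luminance = noisy_frame.mean(axis=-1)
    return shot * luminance + read   # per-pixel variance estimate
```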
By the video denoising method, shot noise or read noise and the like can be effectively filtered, so that a better denoising image which is wanted for denoising is obtained, and the detailed steps of denoising are introduced below.
One, decorrelation transform
The decorrelation transform comprises a color-decorrelation transform and/or a spatial-frequency-decorrelation transform, etc. The color-decorrelation transform and the spatial-frequency-decorrelation transform are described below, respectively.
1. Color-decorrelation transform
The color-decorrelation transform (color transform for short) can be understood as decorrelating the colors of the pixels of an image, so that the correlation between pixel values in the color dimension is reduced, which facilitates subsequent denoising.
In particular, color transformation and inverse color transformation may be understood as forward and backward linear decorrelation transformation of an image, transforming the image from a pixel domain to a different domain, such as a luminance domain, a chrominance domain, etc., for subsequent de-noising of the image.
More specifically, a color image, such as an RGB image, may be subjected to a color-decorrelation transform, such as a YUV transform, converting it into a luminance component and chrominance components. The luminance component, i.e., the brightness of Noisy_t, may be the average of the color values of the pixels in Noisy_t, and the chrominance components describe the colors of Noisy_t. For example, if the colors of the image are divided into three channels, each channel corresponds to one chrominance component, and the luminance component can be obtained by averaging the color values of all channels of each pixel.
As shown in FIG. 9, the input image (i.e., Noisy_t) can be represented by the four RGGB channels, i.e., each pixel has a color value in each channel dimension, and the color values of each channel form one plane of the input; the color transform matrix is then applied to this four-channel input by a point-wise convolution, yielding three chrominance components and one luminance component.
The color transform may be understood as a matrix multiplication whose dimension equals the number of input color channels; for example, with four input channels the color transform matrix is a 4 × 4 matrix, and the transform can be implemented by a point-wise (1 × 1) convolution.
In the embodiment of the present application, the input image may be converted, through the color-decorrelation transform, into a luminance component representing the luminance of the input image and at least one chrominance component describing its colors. Since the luminance is typically an average of color values, the averaging smooths the noise, thereby reducing it.
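Purely as an illustration of how such a learnable color transform could be realized, the sketch below applies a 1 × 1 convolution with a 4 × 4 matrix to a four-channel RGGB input; the specific matrix values shown are assumptions for illustration, and in practice the matrix would be preset or learned.

```python
import torch
import torch.nn.functional as F

def color_decorrelation(noisy, color_matrix):
    # noisy: (N, 4, H, W) tensor in RGGB layout.
    # color_matrix: (4, 4) matrix reshaped into a 1x1 convolution kernel so
    # the transform is applied point-wise to every pixel.
    kernel = color_matrix.view(4, 4, 1, 1)
    out = F.conv2d(noisy, kernel)            # (N, 4, H, W)
    luminance, chroma = out[:, :1], out[:, 1:]
    return luminance, chroma

# Example initialization: an averaging row for luminance and simple
# color-difference rows for chrominance (illustrative values only).
C_forward = torch.tensor([[0.25, 0.25, 0.25, 0.25],
                          [1.00, -0.50, -0.50, 0.00],
                          [0.00, -0.50, -0.50, 1.00],
                          [0.50, -0.50, 0.50, -0.50]], requires_grad=True)
```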
2. Spatial-frequency-decorrelation transform
The spatial-frequency-decorrelation transform can be understood as decorrelating the pixels of the image in the frequency dimension. Specifically, the input image may be high-pass and low-pass filtered, respectively, yielding high-frequency components and low-frequency components. In this way, the pixels of the input image are distributed over separated spatial frequencies, realizing spatial-frequency decorrelation of the input image; the useful part and the noise part of the image are separated into different frequency components, so that the subsequent denoising operation is both more effective and simpler.
Specifically, wavelet transform may be performed on the input image, thereby implementing a de-spatial frequency dependent transform. The wavelet transform may specifically use a high-pass filter and a low-pass filter to perform filtering in the vertical and horizontal directions of the input image, that is, to decompose each color channel of the input image into a low-frequency component and a high-frequency component (or referred to as a low-frequency subband and a high-frequency subband), and the size of each component may be equal to or smaller than the input image, for example, the size of each component is half of the input image. In general, a frequency-based representation separates the active portion of the image and the noise into different frequency components, making denoising simpler.
For example, a wavelet transform may be used to decompose an input image into four different frequency components, including one low frequency component and three high frequency components, each of which may be half the size of the input image, with the elements within each subband being referred to as coefficients. These four subbands represent low frequencies (as obtained by filtering the input image with a low-pass kernel) and high frequencies resulting from filtering in the vertical, horizontal, and diagonal directions (as obtained by filtering the image with a directional high-pass kernel).
If the input image includes a plurality of channels, each channel can be represented as a single-channel image, and the wavelet transform can be applied to each of these single-channel images to obtain one low-frequency component and three high-frequency components per channel. For example, if the input image is represented by four channels, wavelet transforming each channel yields 16 frequency components in total, including 4 low-frequency components and 12 high-frequency components.
As shown in fig. 10, the input image is an H × W × C image, where H denotes the height, W the width and C the number of channels. First, the wavelet coefficients are determined as two one-dimensional forward decomposition filters of a selected wavelet family (e.g., Haar wavelets). The outer products of pairs of these one-dimensional filters are then computed to obtain the wavelet kernels, e.g., four two-dimensional convolution kernels are generated from the outer products, and these kernels compute their output in a convolution with stride 2. The input image is then convolved with the wavelet kernels with stride 2, so that each channel yields one low-frequency component and three high-frequency components.
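As a hedged illustration of this construction, the sketch below builds four 2-D Haar kernels from outer products of the 1-D low-pass and high-pass filters and applies them to each channel independently with a stride-2 convolution; it assumes even image dimensions and uses the Haar family purely as an example wavelet.

```python
import torch
import torch.nn.functional as F

def haar_kernels():
    lo = torch.tensor([1.0, 1.0]) / 2 ** 0.5   # 1-D low-pass filter
    hi = torch.tensor([1.0, -1.0]) / 2 ** 0.5  # 1-D high-pass filter
    pairs = [(lo, lo), (lo, hi), (hi, lo), (hi, hi)]  # LL, LH, HL, HH
    kernels = torch.stack([torch.outer(a, b) for a, b in pairs])
    return kernels.unsqueeze(1)                # (4, 1, 2, 2)

def wavelet_transform(x):
    # x: (N, C, H, W) with H and W even.
    n, c, h, w = x.shape
    k = haar_kernels().to(x.dtype)
    # Apply the same 4 kernels to every channel independently (grouped conv).
    k = k.repeat(c, 1, 1, 1)                   # (4*C, 1, 2, 2)
    return F.conv2d(x, k, stride=2, groups=c)  # (N, 4*C, H/2, W/2)
```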
In general, the frequency characteristics of image structure and of noise differ greatly. After the frequency-decorrelation transform, the noise is mainly concentrated in the low-absolute-value coefficients of the high-frequency components. This is powerful prior information for denoising; for example, soft-threshold or hard-threshold denoising can set the low values of the high-frequency components to zero to filter out a large amount of noise and achieve a good denoising effect. In the embodiment of the present application, a nonlinear model, a CNN, can instead be used to process the frequency components obtained after the decorrelation transform, to achieve an even better denoising effect.
3. Combining the color-decorrelation transform and the spatial-frequency-decorrelation transform
The input image may be transformed by selecting one of the color-decorrelation transform and the spatial-frequency-decorrelation transform, or by performing both on the input image. For example, after performing the color-decorrelation transform to obtain one luminance component and three chrominance components, the luminance component and the three chrominance components (i.e., a four-channel image) are used as the input of the wavelet transform, which outputs one low-frequency component and three high-frequency components for each channel.
The color transform is a linear de-color-correlation transform similar to the YUV transform, while the wavelet transform is a de-spatial-frequency-correlation transform. The parameters of these transforms can be learned to obtain optimal transform parameters, and perfect reconstruction can be guaranteed through an invertibility constraint in the loss.
Therefore, in the embodiment of the present application, the chrominance components and the luminance component can be separated through the color-decorrelation transform, thereby smoothing the noise in the image and achieving a denoising effect. Alternatively, through the frequency-decorrelation transform, the high-frequency and low-frequency content of each pixel is separated into distinct components, which facilitates simpler subsequent filtering and a better noise filtering effect.
Two, time domain fusion
After the decorrelation transform, the input of the time-domain fusion may include the decorrelation-transformed Noisy_t and the fused image Fused_{t-1}. The time-domain fusion may be performed by the time-domain fusion CNN: specifically, features are extracted from the decorrelation-transformed Noisy_t and from the fused image Fused_{t-1} respectively, the weights corresponding to the decorrelation-transformed Noisy_t and Fused_{t-1} are determined from these features, and the decorrelation-transformed Noisy_t and Fused_{t-1} are then fused according to the weights. In general, since the foreground in the video data may be in motion, its position may differ from frame to frame, while the background does not vary much.
Thus, when fusing the decorrelation-transformed Noisy_t and Fused_{t-1}, a higher weight can be set for the foreground part of the decorrelation-transformed Noisy_t of the current frame; for example, the weight of the foreground part of the decorrelation-transformed Noisy_t is set to 0.8, and the weight of the corresponding foreground pixels of the decorrelation-transformed Fused_{t-1} is set to 0.2. Conversely, the weight of the background part of the decorrelation-transformed Noisy_t is not greater than the weight of the corresponding pixels of the decorrelation-transformed Fused_{t-1}; for example, the weight of the background part of the decorrelation-transformed Noisy_t can be set to 0.4, and the weight of the corresponding pixels of the decorrelation-transformed Fused_{t-1} to 0.6.
Therefore, in the embodiment of the present application, the fusion weight of each image can be determined by considering the foreground part of the current frame and the background part of the fused image, so that the foreground part of the fused result relies more on the foreground of the current frame and the background part relies more on the fused image of the previous frame. The fused image thus performs well in both the foreground and the background, and ghosting can be reduced.
In addition, the input of the time-domain fusion may also include a noise variance, which may be expressed as:

Var[Noisy_t] = Shot_t · Luminance(Noisy_t) + Read_t

where Luminance(Noisy_t) is the luminance of the decorrelated Noisy_t, e.g., the mean over the color channels, Shot_t denotes shot noise, and Read_t denotes read noise. The noise variance can be represented as a single-channel image and is an estimate of the noise variance of the input noisy frame. It allows the temporal fusion CNN to determine the noise level, so that the fusion task naturally adapts to different noise levels.
As shown in FIG. 11, the fusion CNN can be used to determine, in combination with the noise variance, the weights corresponding to the decorrelated Noisy_t and the decorrelated Fused_{t-1}. The fusion CNN is obtained by training on a large number of samples; the noise variance, the decorrelated Noisy_t and the decorrelated Fused_{t-1} are taken as its input, and it outputs the weights corresponding to the decorrelated Noisy_t and the decorrelated Fused_{t-1}. The noise variance enables the fusion CNN to determine the noise level and thus assign appropriate weights, i.e., the fusion weights Weights_t shown in FIG. 11. Using the weights output by the fusion CNN, the decorrelated Noisy_t and the decorrelated Fused_{t-1} are fused, and the fused image can be expressed as:

Fused_t = Fused_{t-1} · (1 − Weights_t) + Noisy_t · Weights_t
The fusion CNN may include multiple layers of convolution operators and ReLU nonlinear operators, with the final output being two channels activated by sigmoid or softmax to ensure the convexity of the image fusion equation. The goal of the fusion stage is to exploit the temporal correlation inherent in natural video to minimize the noise present in the image while avoiding temporal artifacts (e.g., ghosting), so as to preserve as much structure and detail as possible. In a static background the output fused image is a temporal average that best reduces the noise, and the predicted weight is close to zero; for the moving foreground the output weight is close to 1, and the fused image is closer to the current noisy frame while containing less noise. In the fusion result, the static background thus gradually converges to the ideal frame-averaged result, while the dynamic foreground comes directly from the current noisy frame. Fusion can therefore minimize noise while avoiding ghosting.
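For illustration only, the following sketch shows one possible form of such a fusion CNN together with the convex combination above. The layer widths and depth are assumptions, and a single sigmoid-activated weight channel is used here for brevity, whereas the text above also mentions a two-channel softmax variant.

```python
import torch
import torch.nn as nn

class FusionCNN(nn.Module):
    # Predicts per-pixel fusion weights from the noisy frame, the previous
    # fused frame and the noise-variance map (channel counts are illustrative).
    def __init__(self, channels, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels + 1, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 3, padding=1), nn.Sigmoid(),  # weights in [0, 1]
        )

    def forward(self, noisy_t, fused_prev, noise_var):
        weights = self.net(torch.cat([noisy_t, fused_prev, noise_var], dim=1))
        fused_t = fused_prev * (1 - weights) + noisy_t * weights
        return fused_t, weights
```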
Three, denoising
The denoising may be implemented by a filter, a CNN, or the like; that is, the denoising model mentioned in the embodiment of the present application may include a filter or a CNN, etc. For example, it may be implemented by a CNN; to facilitate distinction, the CNN used for denoising is referred to as the denoising CNN, which may include multiple convolutional layers, ReLU activation functions, and the like.
Specifically, after the decorrelation transform described above, the noise in the resulting fused image Fused_t has already decreased, and Fused_t can be used as the input of the denoising CNN, which outputs the denoised image Denoised_t.
Further, optionally, Noisy_t may also be used as an additional input of the denoising CNN, which then outputs Denoised_t.
Optionally, the variance of the fusion map can also be calculated, and the fusion map variance and Fused_t can both be taken as inputs of the denoising CNN, which outputs the denoised image.
For example, as shown in FIG. 12, the fusion map variance Var[Fused_t], Noisy_t and Fused_t are all taken as inputs of the denoising CNN, which outputs the denoised image Denoised_t. The aforementioned Weights_t, i.e., the weight corresponding to Noisy_t, and the weight corresponding to Fused_{t-1} can be used to calculate the variance of each pixel, yielding the fusion map variance. For example, the fusion map variance can be expressed as:
Var[Fused_t] = Weights_t² · Var[Noisy_t] + (1 − Weights_t)² · Var[Fused_{t-1}]
In this formula, the fusion weights Weights_t are less than 1, which makes the fused image variance decrease with the number of frames denoised, i.e., the noise in the fused image becomes smaller and smaller.
Therefore, in the embodiment of the present application, denoising can be performed by the denoising CNN, and by feeding the fusion map variance to the denoising CNN the network can determine the noise level, thereby achieving a better denoising effect.
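A minimal sketch of such a denoising CNN is given below, assuming a plain convolutional architecture that takes the fused image, the noisy frame and the fusion-map variance as input; the depth, width and residual formulation are illustrative assumptions rather than values prescribed by the present application.

```python
import torch
import torch.nn as nn

class DenoiseCNN(nn.Module):
    # Removes residual noise from the fused image, conditioned on the noisy
    # frame and the fusion-map variance (concatenated as extra channels).
    def __init__(self, channels, hidden=48, depth=4):
        super().__init__()
        layers = [nn.Conv2d(2 * channels + 1, hidden, 3, padding=1), nn.ReLU()]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(hidden, channels, 3, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, fused_t, noisy_t, fused_var):
        x = torch.cat([fused_t, noisy_t, fused_var], dim=1)
        # Predict a correction and add it to the fused image (residual learning).
        return fused_t + self.net(x)
```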
Four, refinement
Due to the limitations of the denoising operation itself, the spatial denoising stage is likely to cause excessive smoothing. This phenomenon is more severe in the case of a low signal-to-noise ratio, or when the constraints on the computational complexity of the algorithm are particularly strict. After the denoised image is obtained, in order to avoid information loss caused by excessive smoothing, the inherent limitation of the denoising algorithm can be overcome by fusing information of the fused image and the denoised image. For example, Fused_t and Denoised_t can be further fused by a fusion operation, so that the details in Fused_t are merged into Denoised_t, making the details contained in Denoised_t richer and improving the image quality.
For example, as shown in fig. 13, the refinement step can also be realized by direct fusion or by a CNN. Taking a CNN as an example, and calling it the refinement CNN for convenience, Var[Fused_t], Fused_t and Denoised_t are all taken as inputs of the refinement CNN, which outputs the weights corresponding to Fused_t and Denoised_t respectively. When determining these weights, the noise level indicated by the fusion map variance can be taken into account, and corresponding weights can be assigned to the high-frequency and low-frequency components.
In general, in the refinement step, the weight of the high frequencies is smaller and the weight of the low frequencies is higher. Low frequencies typically contain the main structure of the image while high frequencies contain its detail, so in order to prevent the denoised image from being overly smooth, the main structure contained in the low frequencies can be relied on more, making the structure in the image clearer. After the weights corresponding to Fused_t and Denoised_t are determined, Fused_t and Denoised_t can be fused. For example, the refined output may be expressed as:
Refined_t = Denoised_t · Weights_t^D + Fused_t · Weights_t^F

where Weights_t^D is the weight corresponding to Denoised_t and Weights_t^F is the weight corresponding to Fused_t.
It will be appreciated that the refinement weights are used to extract high-frequency information from the fused image, which is rich in detail but still somewhat noisy, and to add this high-frequency detail to the denoised image, which is noise-free but may be too smooth. As shown in fig. 13, this formulation can provide high-quality results even when the CNN has very low complexity.
The refinement CNN may take various forms; in the embodiment of the present application, it may be composed of multiple convolution layers and ReLU activation functions. In order to ensure that the weights of the refinement formula lie in [0, 1], the final predicted weights are activated by sigmoid or softmax.
Therefore, in the embodiments of the present application, Fused_t and Denoised_t are fused to prevent Denoised_t from being overly smooth, yielding a denoised image with richer details. Furthermore, higher frequencies are assigned lower weights and lower frequencies higher weights, so that the noise contained in the high-frequency information is smoothed and a denoised image with a better denoising effect is obtained.
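The refinement step could be sketched as follows, assuming the refinement CNN predicts a single sigmoid-activated weight map that blends the denoised and fused images; the architecture and the single-weight formulation (with the complementary weight applied to Fused_t) are illustrative assumptions rather than the definitive form of the refinement CNN.

```python
import torch
import torch.nn as nn

class RefineCNN(nn.Module):
    # Blends the fused image (detail-rich, slightly noisy) with the denoised
    # image (clean, possibly over-smoothed) using predicted per-pixel weights.
    def __init__(self, channels, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels + 1, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, fused_t, denoised_t, fused_var):
        w_denoised = self.net(torch.cat([fused_t, denoised_t, fused_var], dim=1))
        # The complementary weight goes to the fused image, keeping the sum at 1.
        return denoised_t * w_denoised + fused_t * (1 - w_denoised)
```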
Five, inverse transform
The inverse transform is the inverse of the decorrelation transform. If the color-decorrelation transform has been performed, an inverse color transform needs to be applied to the denoised image after it is obtained; if the wavelet transform has been performed, an inverse wavelet transform needs to be applied after the denoised image is obtained.
It should be understood that if no decorrelating transform is performed, no inverse transform need be performed.
In the video processing method provided by the present application, if the denoising step is not performed, the inverse transform can be applied directly to the obtained fused image Fused_t. The following description takes the inverse transform of the denoised image as an example; in some scenarios the denoised image may be replaced by Fused_t. The inverse transform mirrors the decorrelation transform, the only difference being the input image, and is not described in detail again here.
For example, as shown in fig. 14, before denoising, a color decorrelation operation is performed on the input image through a color transformation to obtain a luminance component and a chrominance component, and after denoising is subsequently performed, a point-by-point convolution operation is performed on the denoised image by using an inverse color transformation matrix to output a color image, that is, a new denoised image. Equivalently, in the decorrelation step, the color of each pixel point is separated, and the color of the denoised image is restored through inverse transformation, so that the denoised image with the color is obtained.
The color transform matrix and the inverse color transform matrix are inverse matrices of each other, i.e., their product is an identity matrix, so that the color of the denoised image can be accurately recovered. For example, the color matrix may be denoted C_forward; since the color transform is linear and defined by a matrix, the inverse transform matrix C_inverse can be derived by matrix inversion, and at the end of the algorithm flow the color information can be reconstructed by a point-wise convolution with the inverse transform matrix. Because the forward and inverse color transforms are linear convolution operators, the transform weights can be learned in a data-driven manner to adapt to a specific denoising task and input data. To ensure that invertibility is maintained during training, C_forward · C_inverse = I, with I the identity matrix, can be added as a constraint term, guaranteeing that the forward and inverse operators allow accurate reconstruction.
For another example, as shown in fig. 15, before denoising, the wavelet transform is performed on the input image using the wavelet coefficients, and after denoising, a deconvolution with the inverse wavelet kernels is applied to the denoised high-frequency and low-frequency components to obtain an output image in which low and high frequencies are fused, i.e., a new denoised image, so that the main structure and details of the denoised image are restored. The inverse wavelet kernels are obtained from the outer products of the inverse wavelet coefficients; the inverse wavelet coefficients and the wavelet coefficients are inverse matrices of each other, i.e., their product is an identity matrix, which ensures accurate recovery of the main structure and details of the image.
For example, the spatial-frequency-decorrelation transform is also a linear operation and can therefore be implemented with convolution operators. The wavelet coefficients may be denoted W_forward and the inverse wavelet coefficients W_inverse, where W_forward and W_inverse are inverse matrices of each other, i.e., W_forward · W_inverse = I with I the identity matrix. Similar to the color transform, the weights of the inverse of the spatial-frequency-decorrelation transform can also be learned in a data-driven manner. In the present embodiment, when updating W_inverse, W_forward · W_inverse = I is used as a constraint, so that the inverse wavelet transform can accurately recover the high-frequency and low-frequency components and a clearer denoised image is obtained.
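Purely as an illustration of such an invertibility constraint, the sketch below penalizes the deviation of C_forward · C_inverse (or W_forward · W_inverse) from the identity during training; the Frobenius-norm form and the loss weight lambda_inv are assumptions introduced for illustration.

```python
import torch

def invertibility_penalty(forward_matrix, inverse_matrix):
    # Penalizes deviation of forward * inverse from the identity so that the
    # learned decorrelation transform remains (approximately) invertible.
    identity = torch.eye(forward_matrix.shape[0], dtype=forward_matrix.dtype)
    return torch.linalg.norm(forward_matrix @ inverse_matrix - identity) ** 2

# Example usage during training (lambda_inv is an assumed hyper-parameter):
# loss = reconstruction_loss + lambda_inv * (
#     invertibility_penalty(C_forward, C_inverse)
#     + invertibility_penalty(W_forward, W_inverse))
```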
If both the color-related transform and the wavelet transform are performed before denoising, inverse color transform and inverse wavelet transform are required when inverse transform is performed, so that the denoised image can recover color and frequency.
Therefore, in the embodiment of the application, after the decorrelation transformation, the time domain fusion, the denoising and the refinement steps, a denoised image with an excellent denoising effect can be obtained. It can be understood that, the correlation between adjacent frames in the video data in the time dimension is utilized, so that the noise in the image is reduced, and meanwhile, the ghost can be avoided, and the image with better denoising effect is obtained.
In addition, in order to further improve the image quality, multi-scale processing may be performed on the current frame and the first fused image. That is, after the decorrelation operation is performed on the current frame and the first fused image, the decorrelated images are down-sampled one or more times to obtain images of multiple scales; the processing of steps 702-705, such as temporal fusion, denoising and up-sampling, is then performed on the image of each scale, so that iterative denoising is applied at every scale to obtain a denoised image with a better denoising effect. In each round of processing, the fused image or denoised image obtained at the previous (smaller) scale is combined into the current round, which is equivalent to fusing in the smaller-scale fused image and thus has the effect of smoothing noise.
Specifically, the process of the multi-scale processing may include: at least one down-sampling is carried out on the current frame and the first fused image to obtain at least one down-sampled frame and at least one down-sampled fused image, the scales of the images obtained by down-sampling at each time are different, at least one first down-sampled image is obtained by carrying out at least one down-sampling on the current frame, and at least one second down-sampled image is obtained by carrying out at least one down-sampling on the first fused image; denoising the at least one down-sampling frame and the at least one down-sampling fusion image to obtain a multi-scale fusion image; and fusing the denoised image and the multi-scale fusion image to obtain an updated denoised image.
Illustratively, take the denoising of an image at one scale as an example (for instance, a first down-sampled fused image and a first down-sampled frame of the same scale obtained by down-sampling the current frame and the first fused image once). The weight corresponding to the first down-sampled frame is determined according to the features extracted from the first down-sampled frame and the features extracted from the first down-sampled fused image, giving a fifth fusion weight, and the weight corresponding to the first down-sampled fused image is determined, giving a sixth fusion weight; the first down-sampled fused image and the first down-sampled frame have the same size, which can also be understood as having been down-sampled the same number of times. The fifth and sixth fusion weights are calculated in a manner similar to the first and second fusion weights, the only difference being the image scale, and this is not repeated here. Then, the weight corresponding to a second down-sampled fused image is determined according to the features extracted from it, giving a seventh fusion weight, where the second down-sampled fused image fuses information of images whose scale is smaller than that of the first down-sampled frame. The first down-sampled frame, the first down-sampled fused image and the second down-sampled fused image are fused according to the fifth, sixth and seventh fusion weights to obtain a third down-sampled fused image, whose up-sampled version is used for fusion with the down-sampled frames whose scale is larger than that of the first down-sampled frame. The third down-sampled fused image is denoised to obtain a first down-sampled denoised image, which is up-sampled to obtain an up-sampled denoised image; the up-sampled image is used, in combination with the fused image of the same scale, for denoising at a scale larger than that of the first down-sampled frame.
Specifically, for example, as shown in FIG. 16, the down-sampling may be performed with a wavelet transform: the decorrelated Noisy_t and Fused_{t-1} are down-sampled one or more times to obtain down-sampled images of multiple scales. The wavelet transform in the decorrelation operation down-samples an image of size H × W × C to an image of size H/2 × W/2 × 4C, i.e., the down-sampled frame or down-sampled fused image obtained by the first down-sampling shown in fig. 16; in the next down-sampling, the image of size H/2 × W/2 × 4C is down-sampled to an image of size H/4 × W/4 × 16C, i.e., the down-sampled frame or down-sampled fused image obtained by two down-samplings, and so on.
In the processing at one of the scales, as shown in fig. 16, temporal fusion may be performed on the image of that scale, i.e., the down-sampled images of the decorrelated Noisy_t and Fused_{t-1} are fused. This temporal fusion step is similar to the temporal fusion described in step two above and is not repeated here.
Meanwhile, when the image of each scale is temporally fused, the up-sampled result of the temporally fused image of the next (smaller) scale can also be fused in. As shown in fig. 16, in the processing of scale 2, the fusion result of the once-down-sampled Noisy_t and Fused_{t-1} can be up-sampled to obtain an up-sampled image; then, in the processing of scale 1, when temporal fusion is performed, in addition to fusing the decorrelation-transformed Noisy_t and Fused_{t-1}, the up-sampled image obtained in the processing of scale 2 can also be fused. For example, the up-sampled image may be fused with the decorrelation-transformed Noisy_t to obtain a new Noisy_t; or the up-sampled image may be used as an input of the fusion CNN, which outputs the weight corresponding to the up-sampled image, and the up-sampled image, Noisy_t and Fused_{t-1} are then fused according to the weights, and so on.
In the temporal fusion step of scale 2, in addition to fusing the once-down-sampled Noisy_t and Fused_{t-1}, the up-sampled image obtained at scale 3 can also be fused, and so on.
Denoising can also be performed in the processing of scale 2: after temporal fusion yields the scale-2 fused image, that fused image is denoised. The denoising process is similar to step three above, the only difference being the scale of the input images, and is not repeated here. After denoising, an inverse wavelet transform is applied to the denoised image, which is equivalent to up-sampling, and the resulting image is then used as an input of the denoising step at scale 1 to obtain the denoised image.
Similarly, the denoising process at scale 2 can also take as input the image obtained by the inverse wavelet transform after denoising at scale 3. Taking the denoising process at scale 1 as an example, the inputs may include the fused image Fused_t output by the temporal fusion step at scale 1, the up-sampled denoised image output after the inverse wavelet transform at scale 2, and Noisy_t; all of these are fed to the denoising CNN, which outputs the denoised image. This is equivalent to fusing the up-sampled denoised image, Fused_t and Noisy_t, so that the noise in the image is smoothed and a denoised image with less noise is obtained.
Therefore, in the embodiment of the present application, the current frame can be down-sampled multiple times to obtain images of multiple scales, and the images of these scales are then processed iteratively, each iteration including a temporal fusion step and a denoising step. In other words, the processing at each scale exploits the correlation between adjacent frames of the video data to realize denoising, so that an image with a good denoising effect is obtained at every scale; the overall denoising effect is improved, and a denoised image that is cleaner, clearer and has fewer ghosts is obtained. Moreover, most of the denoising is completed at lower resolutions, and separating the denoising task into several dedicated stages greatly simplifies the task, producing high-quality output with a minimal amount of computation. Through the dedicated temporal fusion, spatial denoising and refinement stages, the temporal, spatial and spatio-temporal correlations of the video are explicitly exploited. This design makes it possible to hardly impair the quality of the final output even when the complexity is greatly reduced.
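The multi-scale recursion described above could be sketched roughly as follows, assuming a wavelet-based down-sampling, processing the coarsest scale first and passing coarse-scale results upward through up-sampling. The per-scale networks, the down-/up-sampling callables and the simple averaging used to fold in coarse-scale results are all illustrative placeholders, not components defined by the present application.

```python
def multiscale_denoise(noisy_t, fused_prev, fuse_nets, denoise_nets,
                       downsample, upsample, num_scales=3):
    """Sketch of recursive multi-scale fusion and denoising.

    fuse_nets[s] / denoise_nets[s] are per-scale callables; downsample and
    upsample are e.g. the wavelet transform and its inverse.
    """
    noisy_pyr, fused_pyr = [noisy_t], [fused_prev]
    for _ in range(num_scales - 1):                  # index 0 = full resolution
        noisy_pyr.append(downsample(noisy_pyr[-1]))
        fused_pyr.append(downsample(fused_pyr[-1]))

    up_fused = up_denoised = None
    for s in reversed(range(num_scales)):            # coarsest scale first
        noisy_s = noisy_pyr[s]
        if up_fused is not None:
            # Fold the up-sampled fusion result of the smaller scale into the
            # noisy input of the current scale (simple average as a placeholder).
            noisy_s = 0.5 * (noisy_s + up_fused)
        fused_s = fuse_nets[s](noisy_s, fused_pyr[s])

        denoised_s = denoise_nets[s](fused_s, noisy_s)
        if up_denoised is not None:
            # Blend in the up-sampled denoised result of the smaller scale.
            denoised_s = 0.5 * (denoised_s + up_denoised)

        if s > 0:                                    # prepare inputs for the next larger scale
            up_fused = upsample(fused_s)
            up_denoised = upsample(denoised_s)

    return fused_s, denoised_s                       # full-resolution outputs
```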
The foregoing describes in detail the flow of the video denoising method provided in the present application, and the following describes the structure of the video denoising apparatus for executing the flow based on the flow of the video denoising method.
Referring to fig. 17, a schematic structural diagram of a video denoising apparatus provided in the present application is shown as follows.
An obtaining module 1701, configured to obtain a current frame and a first fused image, where the current frame is any one frame image after a first frame arranged in a preset order in video data, and the first fused image includes information of at least one frame adjacent to the current frame in the preset order in the video data;
a time domain fusion module 1702, configured to extract a feature from the current frame to obtain a first feature, and extract a feature from the first fusion image to obtain a second feature;
a time domain fusion module 1702, further configured to determine to obtain a first fusion weight and a second fusion weight according to the first feature and the second feature, where the first fusion weight includes a weight corresponding to the current frame, the second fusion weight includes a weight corresponding to the first fusion image, the weight corresponding to the foreground in the current frame is not less than the weight corresponding to the foreground in the first fusion image, and the weight corresponding to the background in the current frame is not greater than the weight corresponding to the background in the first fusion image;
the time domain fusion module 1702 is further configured to fuse the current frame and the first fusion image according to the first fusion weight and the second fusion weight to obtain a second fusion image;
and a denoising module 1703, configured to denoise the second fusion image to obtain a denoised image.
In a possible implementation, the video denoising apparatus may further include:
and a refinement module 1704, configured to fuse the second fused image and the denoised image after the second fused image is denoised, so as to obtain an updated denoised image.
In one possible implementation, the denoising module 1703 is specifically configured to: extracting features from the second fused image to obtain third features; extracting features from the de-noised image to obtain a fourth feature; determining a third fusion weight and a fourth fusion weight according to the third feature and the fourth feature, wherein the third fusion weight is a weight corresponding to the third feature, and the fourth fusion weight is a weight corresponding to the fourth feature, and in the third feature and the fourth feature, the frequency of each pixel point and the corresponding weight value are in a negative correlation relationship; and fusing the third characteristic and the fourth characteristic according to the third fusion weight and the fourth fusion weight to obtain an updated denoised image.
In a possible implementation, the video denoising apparatus may further include: a color-related transform removal module 1705, configured to perform color transform on the current frame through a color transform matrix before extracting features from the current frame and extracting features from the first fused image, to obtain a first chrominance component and a first luminance component, where the first chrominance component and the first luminance component form a new current frame, and perform color transform on the first fused image to obtain a second chrominance component and a second luminance component, where the second chrominance component and the second luminance component form a new first fused image, and the color transform matrix is a preset matrix or is obtained by training at least one convolution kernel;
the time domain fusion module 1702 is specifically configured to extract a feature from the new current frame to obtain a first feature, and extract a feature from the new first fusion image to obtain a second feature.
In a possible implementation, the video denoising apparatus may further include:
and an inverse color-related transformation module 1706, configured to perform color transformation on the denoised image through an inverse color transformation matrix after the second fused image is denoised, so as to obtain an updated denoised image, where the inverse color transformation matrix is an inverse matrix of the color transformation matrix.
In a possible implementation, the video denoising apparatus may further include:
a frequency-dependent transform module 1707, configured to perform wavelet transform on the current frame using a wavelet coefficient before the time-domain fusion module 1702 extracts features from the current frame and the first fused image, so as to obtain a first low-frequency component and a first high-frequency component, where the first low-frequency component and the first high-frequency component form a new current frame, and perform wavelet transform on the first fused image, so as to obtain a second low-frequency component and a second high-frequency component, where the second low-frequency component and the second high-frequency component form a new first fused image, and the wavelet coefficient is a preset coefficient or is obtained by training at least one convolution kernel.
In a possible implementation, the video denoising apparatus may further include:
and an inverse frequency-dependent transform module 1708, configured to perform inverse wavelet transform on the denoised image through an inverse wavelet coefficient after the denoising module 1703 denoises the second fused image, so as to obtain an updated denoised image, where the inverse wavelet coefficient is an inverse matrix of the wavelet coefficient.
In a possible implementation, the time domain fusion module 1702 is specifically configured to: calculating shot noise and read noise according to shooting parameters of equipment used for shooting video data; and determining a first fusion weight corresponding to the current frame by combining shot noise and read noise and the first characteristic and the second characteristic.
In a possible implementation, the time domain fusion module 1702 is specifically configured to: calculating the noise variance of each pixel point in the current frame according to the shot noise and the read noise; extracting a fifth feature from the noise variance; and combining the fifth characteristic, and determining the first fusion weight corresponding to the current frame by the first characteristic and the second characteristic.
In one possible implementation, the denoising module 1703 is specifically configured to: and denoising the second fusion image by combining the first fusion weight, the second fusion weight, the first fusion image and the current frame to obtain a denoised image.
In one possible implementation, the denoising module 1703 is specifically configured to: calculating the variance of each pixel point in the second fusion image by combining the first fusion weight, the second fusion weight, the first fusion image and the current frame to obtain a fusion image variance; and taking the fused image variance and the second fused image as the input of a denoising model, and outputting a denoised image, wherein the denoising model is used for removing noise in the input image.
In a possible implementation, the denoising module 1703 is specifically configured to take the current frame, the variance of the fusion image, and the second fusion image as inputs of the denoising model, and output the denoised image.
In a possible implementation, the video denoising apparatus may further include: a downsampling module 1709;
a down-sampling module 1709, configured to perform at least one down-sampling on the current frame and the first fused image to obtain at least one down-sampled frame and at least one down-sampled fused image, where scales of images obtained through each down-sampling are different, at least one first down-sampled image is obtained by performing at least one down-sampling on the current frame, and at least one second down-sampled image is obtained by performing at least one down-sampling on the first fused image;
the denoising module 1703 is further configured to denoise the at least one down-sampling frame and the at least one down-sampling fusion image to obtain a multi-scale fusion image;
the denoising module 1703 is further configured to fuse the denoised image and the multi-scale fusion image to obtain an updated denoised image.
In a possible implementation, the denoising module 1703 may perform any one of denoising processes on the at least one down-sampled frame and the at least one down-sampled fused frame, and may include: determining the weight corresponding to the first downsampled frame according to the features extracted from the first downsampled frame and the features extracted from the first downsampled fusion image to obtain a fifth fusion weight, and determining the weight corresponding to the first downsampled fusion image to obtain a sixth fusion weight, wherein the first downsampled frame is any one of at least one downsampled frame, and the first downsampled fusion image is one of the at least one downsampled fusion image with the same scale as the first downsampled frame; determining the weight corresponding to the second downsampling fusion image according to the features extracted from the second downsampling fusion image to obtain a seventh fusion weight, wherein the second downsampling fusion image is fused with information of at least one frame of image with the lower scale than the first downsampling frame; fusing the first downsampling frame, the first downsampling fusion image and the second fusion image according to the fifth fusion weight, the sixth fusion weight and the seventh fusion weight to obtain a third downsampling fusion image, wherein the image subjected to upsampling of the third downsampling fusion image is used for being fused with the image of which the mesoscale of at least one downsampling frame is larger than that of the first downsampling frame; denoising the third downsampling fusion image to obtain a first downsampling denoising image; and performing upsampling on the first downsampling denoised image to obtain an upsampling denoised image, wherein the upsampling image is used for denoising by combining the fused images with the same scale as the upsampling image to obtain an image with the scale larger than that of the first downsampling frame.
In a possible implementation, the down-sampling module 1709 is specifically configured to perform at least one wavelet transform on the current frame and the first fused image to obtain at least one down-sampled frame and at least one down-sampled fused image.
The present application further provides a video processing apparatus, configured to execute the method steps corresponding to fig. 7A. Referring to fig. 18, a schematic structural diagram of a video processing apparatus according to the present application is further provided as follows.
An obtaining module 1801, configured to obtain a current frame and a first fused image, where the current frame is any one frame image after a first frame arranged in the video data according to a preset order, and the first fused image includes information of at least one frame adjacent to the current frame in the video data according to the preset order;
a time domain fusion module 1802, configured to extract a feature from the current frame to obtain a first feature, and extract a feature from the first fusion image to obtain a second feature;
the time domain fusion module 1802 is further configured to determine to obtain a first fusion weight and a second fusion weight according to the first feature and the second feature, where the first fusion weight includes a weight corresponding to a current frame, the second fusion weight includes a weight corresponding to the first fusion image, the weight corresponding to a foreground in the current frame is not less than the weight corresponding to a foreground in the first fusion image, and the weight corresponding to a background in the current frame is not greater than the weight corresponding to a background in the first fusion image;
the time domain fusion module 1802 is further configured to fuse the current frame and the first fusion image according to the first fusion weight and the second fusion weight to obtain a second fusion image.
In a possible implementation, the apparatus may further include:
a color-related transform module 1803, configured to perform color transform on the current frame through a color transform matrix before the time-domain fusion module 1802 extracts features from the current frame and the first fused image, to obtain a first chrominance component and a first luminance component, where the first chrominance component and the first luminance component form a new current frame, and perform color transform on the first fused image to obtain a second chrominance component and a second luminance component, where the second chrominance component and the second luminance component form a new first fused image, and the color transform matrix is a preset matrix or is obtained by training at least one convolution kernel;
the time domain fusion module 1802 is specifically configured to extract a feature from the new current frame to obtain a first feature, and extract a feature from the new first fusion image to obtain a second feature.
In a possible implementation, the apparatus may further include: an inverse color correlation transformation module 1804, configured to perform color transformation on the second fused image through an inverse color transformation matrix to obtain an updated second fused image, where the inverse color transformation matrix is an inverse matrix of the color transformation matrix.
In a possible implementation, the apparatus may further include: a frequency-dependent transform module 1805, configured to perform wavelet transform on the current frame using a wavelet coefficient before the time-domain fusion module 1802 extracts features from the current frame and the first fused image, so as to obtain a first low-frequency component and a first high-frequency component, where the first low-frequency component and the first high-frequency component form a new current frame, and perform wavelet transform on the first fused image so as to obtain a second low-frequency component and a second high-frequency component, where the second low-frequency component and the second high-frequency component form a new first fused image, and the wavelet coefficient is a preset coefficient or is obtained by training at least one convolution kernel.
In a possible implementation, the apparatus may further include: an inverse frequency correlation transformation module 1806, configured to perform inverse wavelet transformation on the second fused image through an inverse wavelet coefficient to obtain an updated second fused image, where the inverse wavelet coefficient is an inverse matrix of the wavelet coefficient.
In a possible implementation, the time domain fusion module 1802 is specifically configured to: calculating shot noise and read noise according to shooting parameters of equipment used for shooting video data; and determining a first fusion weight corresponding to the current frame by combining shot noise, reading noise, the first characteristic and the second characteristic.
In a possible implementation, the time domain fusion module 1802 is specifically configured to: calculating the noise variance of each pixel point in the current frame according to the shot noise and the read noise; extracting a fifth feature from the noise variance; in combination with the fifth feature, the first feature and the second feature determine a first fusion weight corresponding to the current frame.
In a possible implementation, the apparatus may further include: a down-sampling module 1807, configured to perform at least one down-sampling on the current frame and the first fused image to obtain at least one down-sampled frame and at least one down-sampled fused image, where scales of images obtained through each down-sampling are different, at least one first down-sampled image is obtained by performing at least one down-sampling on the current frame, and at least one second down-sampled image is obtained by performing at least one down-sampling on the first fused image;
the time domain fusion module 1802 is further configured to fuse each frame of the at least one frame of downsampled frames with a downsampled fusion image of the same scale to obtain a multi-scale fusion image;
the time domain fusion module 1802 is further configured to fuse the second fusion image and the multi-scale fusion image to obtain an updated second fusion image.
In one possible implementation, the temporal fusion module 1802 fusing any one of the at least one down-sampled frames with the down-sampled fusion image of the same scale may include: determining the weight corresponding to the first downsampled frame according to the features extracted from the first downsampled frame and the features extracted from the first downsampled fusion image to obtain a fifth fusion weight, and determining the weight corresponding to the first downsampled fusion image to obtain a sixth fusion weight, wherein the first downsampled frame is any one of at least one downsampled frame, and the first downsampled fusion image is one of the at least one downsampled fusion image with the same scale as the first downsampled frame; determining the weight corresponding to the second downsampling fusion image according to the features extracted from the second downsampling fusion image to obtain a seventh fusion weight, wherein the second downsampling fusion image is fused with information of at least one frame of image with the lower scale than the first downsampling frame; and fusing the first downsampled frame, the first downsampled fused image and the second downsampled fused image according to the fifth fusion weight, the sixth fusion weight and the seventh fusion weight to obtain a third downsampled fused image, wherein the upsampled image of the third downsampled image is used for being fused with the image of which the mesoscale of at least one downsampled frame is larger than that of the first downsampled frame.
In a possible implementation, the down-sampling module 1807 is specifically configured to perform at least one wavelet transform on the current frame and the first fused image, so as to obtain at least one down-sampled frame and at least one down-sampled fused image.
Referring to fig. 19, a schematic structural diagram of another video denoising apparatus provided in the present application is shown as follows.
The video denoising apparatus may include a processor 1901 and a memory 1902. The processor 1901 and the memory 1902 are interconnected by a line. The memory 1902 has stored therein program instructions and data.
The memory 1902 stores program instructions and data corresponding to the steps of fig. 4-16 described above.
The processor 1901 is configured to perform the method steps performed by the video denoising apparatus shown in any one of the foregoing fig. 4-16.
Optionally, the video denoising apparatus may further include a transceiver 1903 for receiving or transmitting data.
Also provided in an embodiment of the present application is a computer-readable storage medium having stored therein a program which, when run on a computer, causes the computer to execute the steps of the method described in the foregoing embodiments shown in fig. 4 to 14.
Optionally, the video denoising device shown in fig. 19 is a chip.
Referring to fig. 20, a schematic structural diagram of another video processing apparatus provided in the present application is as follows.
The video processing apparatus may include a processor 2001 and a memory 2002. The processor 2001 and the memory 2002 are interconnected by wires. The memory 2002 has stored therein program instructions and data.
The memory 2002 stores program instructions and data corresponding to the steps of fig. 7A described above.
The processor 2001 is configured to perform the method steps performed by the video processing apparatus shown in any of the embodiments of fig. 7A described above.
Optionally, the video processing apparatus may further comprise a transceiver 2003 for receiving or transmitting data.
The embodiment of the present application also provides a computer-readable storage medium in which a program is stored, and when the program runs on a computer, the computer is caused to execute the steps in the method described in the embodiment shown in fig. 7A.
Alternatively, the aforementioned video processing apparatus shown in fig. 20 is a chip.
The embodiment of the present application further provides a video denoising device, which may also be referred to as a digital processing chip or a chip. The chip includes a processing unit and a communication interface. The processing unit obtains program instructions through the communication interface, and when the program instructions are executed by the processing unit, the processing unit executes the method steps executed by the video denoising device shown in any one of the foregoing fig. 4 to 14.
The embodiment of the present application further provides a video processing apparatus, which may also be referred to as a digital processing chip or a chip. The chip includes a processing unit and a communication interface. The processing unit obtains program instructions through the communication interface, and when the program instructions are executed by the processing unit, the processing unit executes the method steps executed by the video processing apparatus shown in the foregoing embodiment of fig. 7A.
The embodiment of the application also provides a digital processing chip. A circuit and one or more interfaces configured to implement the functions of the processor 1901/2001 described above are integrated in the digital processing chip. When a memory is integrated in the digital processing chip, the digital processing chip may perform the method steps of any one or more of the preceding embodiments. When no memory is integrated, the digital processing chip may be connected to an external memory through a communication interface, and implements, according to program code stored in the external memory, the actions performed by the video denoising device or the video processing device in the above embodiments.
The embodiment of the present application also provides a computer program product, which when running on a computer, causes the computer to execute the steps executed by the video denoising device or the video processing device in the method described in the foregoing embodiments shown in fig. 4-16.
The video denoising device or the video processing device provided by the embodiment of the application can be a chip, and the chip can include: a processing unit, which may be for example a processor, and a communication unit, which may be for example an input/output interface, a pin or a circuit, etc. The processing unit can execute the computer-executable instructions stored in the storage unit to make the chip in the server execute the video denoising method or the video processing method described in the embodiments shown in fig. 4 to fig. 16. Optionally, the storage unit is a storage unit in the chip, such as a register, a cache, and the like, and the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a Random Access Memory (RAM), and the like.
Specifically, the aforementioned processing unit or processor may be a central processing unit (CPU), a network processor (NPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or may be any conventional processor or the like.
Referring to fig. 21, fig. 21 is a schematic structural diagram of a chip according to an embodiment of the present application. The chip may be implemented as a neural network processor NPU 210. The NPU 210 is mounted to a host CPU as a coprocessor, and the host CPU allocates tasks to it. The core part of the NPU is the arithmetic circuit 2103, and the controller 2104 controls the arithmetic circuit 2103 to fetch matrix data from memory and perform multiplication.
In some implementations, the arithmetic circuit 2103 internally includes a plurality of processing units (PEs). In some implementations, the arithmetic circuit 2103 is a two-dimensional systolic array. The arithmetic circuit 2103 may alternatively be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 2103 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 2102 and buffers it in each PE of the arithmetic circuit. The arithmetic circuit takes the data of the matrix A from the input memory 2101, performs a matrix operation with the matrix B, and stores partial or final results of the obtained matrix in the accumulator 2108.
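As an illustration of the accumulation of partial results described above, the following sketch computes C = A x B tile by tile in plain Python/NumPy. It is a sequential stand-in for the parallel PE array; the tile size and the memory names mentioned in the comments are assumptions made for the example only.

```python
# Minimal sketch of tiled multiply-accumulate (illustrative, sequential stand-in
# for the PE array of the arithmetic circuit).
import numpy as np

def tiled_matmul(A, B, tile=2):
    """C = A @ B computed tile by tile, accumulating partial results."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n), dtype=A.dtype)  # plays the role of the accumulator 2108
    for p in range(0, k, tile):
        # Each pass consumes one K-tile of A (from the "input memory") and the
        # matching K-tile of B (buffered from the "weight memory").
        C += A[:, p:p + tile] @ B[p:p + tile, :]
    return C

A = np.random.rand(4, 6).astype(np.float32)
B = np.random.rand(6, 3).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B)
```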
The unified memory 2106 stores input data and output data. Weight data is transferred to the weight memory 2102 through a direct memory access controller (DMAC) 2105, and input data is also carried to the unified memory 2106 via the DMAC.
A bus interface unit (BIU) 2110 is configured for interaction between the AXI bus, the DMAC 2105, and the instruction fetch buffer (IFB) 2109.
The bus interface unit 2110 is configured to obtain instructions from an external memory for the instruction fetch buffer 2109, and is further configured to obtain the original data of the input matrix A or the weight matrix B from the external memory for the storage unit access controller 2105.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 2106, to transfer weight data to the weight memory 2102, or to transfer input data to the input memory 2101.
The vector calculation unit 2107 includes a plurality of operation processing units and, if necessary, performs further processing on the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, and magnitude comparison. It is mainly used for network calculation at non-convolutional/fully connected layers of the neural network, such as batch normalization, pixel-level summation, and up-sampling of feature maps.
In some implementations, the vector calculation unit 2107 can store the processed output vector to the unified memory 2106. For example, the vector calculation unit 2107 may apply a linear function and/or a non-linear function to the output of the arithmetic circuit 2103, for example, perform linear interpolation on a feature map extracted by a convolutional layer, or, for another example, apply the function to a vector of accumulated values to generate an activation value. In some implementations, the vector calculation unit 2107 generates normalized values, pixel-level summed values, or both. In some implementations, the vector of processed outputs can be used as an activation input to the arithmetic circuit 2103, for example, for use in subsequent layers of the neural network.
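The kind of element-wise post-processing attributed to the vector calculation unit can be sketched as follows; the batch-normalization, pixel-level summation, up-sampling and activation functions below are simplified stand-ins with illustrative parameters, not the unit's actual implementation.

```python
# Minimal sketch of element-wise post-processing steps (illustrative only).
import numpy as np

def batch_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def pixel_sum(a, b):
    return a + b                          # pixel-level summation of two feature maps

def upsample2x(x):
    return np.kron(x, np.ones((2, 2)))    # nearest-neighbour up-sampling of a feature map

def relu(x):
    return np.maximum(x, 0.0)             # activation applied to accumulated values

matmul_out = np.random.randn(4, 4).astype(np.float32)  # output of the arithmetic circuit
skip = np.random.randn(4, 4).astype(np.float32)
y = relu(upsample2x(pixel_sum(batch_norm(matmul_out), skip)))
print(y.shape)  # (8, 8): ready to be written back to the unified memory
```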
An instruction fetch buffer 2109 connected to the controller 2104 is configured to store instructions used by the controller 2104.
The unified memory 2106, the input memory 2101, the weight memory 2102, and the instruction fetch buffer 2109 are all on-chip memories. The external memory is private to the NPU hardware architecture.
The operation of each layer in the recurrent neural network can be performed by the operation circuit 2103 or the vector calculation unit 2107.
Any of the aforementioned processors may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits configured to control execution of the programs of the methods of fig. 4-16 described above.
It should be noted that the above-described embodiments of the apparatus are merely schematic, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the embodiments of the apparatus provided in the present application, the connection relationship between the modules indicates that there is a communication connection therebetween, which may be specifically implemented as one or more communication buses or signal lines.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus necessary general-purpose hardware, and certainly can also be implemented by special-purpose hardware including special-purpose integrated circuits, special-purpose CPUs, special-purpose memories, special-purpose components and the like. Generally, functions performed by computer programs can be easily implemented by corresponding hardware, and specific hardware structures for implementing the same functions may be various, such as analog circuits, digital circuits, or dedicated circuits. However, for the present application, the implementation of a software program is more preferable. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a readable storage medium, such as a floppy disk, a usb disk, a removable hard disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk of a computer, and includes instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments of the present application.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) manner. The computer-readable storage medium may be any usable medium that a computer can access, or a data storage device, such as a server or a data center, integrating one or more usable media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), a semiconductor medium (e.g., a solid state disk (SSD)), or the like.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Finally, it should be noted that: the above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (37)

1. A method for denoising a video, comprising:
acquiring a current frame and a first fusion image, wherein the current frame is any frame image behind a first frame arranged in a preset sequence in video data, and the first fusion image comprises information of at least one frame adjacent to the current frame in the video data according to the preset sequence;
extracting features from the current frame to obtain first features, and extracting features from the first fusion image to obtain second features;
determining to obtain a first fusion weight and a second fusion weight according to the first feature and the second feature, wherein the first fusion weight comprises a weight corresponding to the current frame, the second fusion weight comprises a weight corresponding to the first fusion image, the weight corresponding to the foreground in the current frame is not less than the weight corresponding to the foreground in the first fusion image, and the weight corresponding to the background in the current frame is not more than the weight corresponding to the background in the first fusion image;
fusing the current frame and the first fused image according to the first fusion weight and the second fusion weight to obtain a second fused image;
and denoising the second fusion image to obtain a denoised image.
2. The method of claim 1, wherein after said denoising the second fused image, the method further comprises:
and fusing the second fused image and the de-noised image to obtain an updated de-noised image.
3. The method of claim 2, wherein said fusing the second fused image and the denoised image comprises:
extracting features from the second fusion image to obtain third features;
extracting features from the de-noised image to obtain fourth features;
determining a third fusion weight and a fourth fusion weight according to the third feature and the fourth feature, wherein the third fusion weight is a weight corresponding to the third feature, and the fourth fusion weight is a weight corresponding to the fourth feature, and in the third feature and the fourth feature, the frequency of each pixel point and the corresponding weight value are in a negative correlation relationship;
and fusing the third feature and the fourth feature according to the third fusion weight and the fourth fusion weight to obtain an updated denoised image.
4. The method according to any of claims 1-3, wherein prior to said extracting features from the current frame and extracting features from the first fused image, the method further comprises:
performing color transformation on the current frame through a color transformation matrix to obtain a first chrominance component and a first luminance component, wherein the first chrominance component and the first luminance component form a new current frame, performing color transformation on the first fused image to obtain a second chrominance component and a second luminance component, wherein the second chrominance component and the second luminance component form a new first fused image, and the color transformation matrix is a preset matrix or is obtained by training at least one convolution kernel;
the extracting features from the current frame to obtain first features, and extracting features from the first fused image to obtain second features, includes:
and extracting features from the new current frame to obtain the first features, and extracting features from the new first fusion image to obtain the second features.
5. The method of claim 4, wherein after said denoising the second fused image, the method further comprises:
and carrying out color transformation on the denoised image through an inverse color transformation matrix to obtain an updated denoised image, wherein the inverse color transformation matrix is the inverse matrix of the color transformation matrix.
6. The method according to any of claims 1-5, wherein prior to said extracting features from the current frame and extracting features from the first fused image, the method further comprises:
performing wavelet transformation on the current frame by using a wavelet coefficient to obtain a first low-frequency component and a first high-frequency component, wherein the first low-frequency component and the first high-frequency component form a new current frame, performing wavelet transformation on the first fused image to obtain a second low-frequency component and a second high-frequency component, the second low-frequency component and the second high-frequency component form a new first fused image, and the wavelet coefficient is a preset coefficient or is obtained by training at least one convolution kernel.
7. The method of claim 6, wherein after said denoising said second fused image, said method further comprises:
and carrying out inverse wavelet transformation on the de-noised image through an inverse wavelet coefficient to obtain an updated de-noised image, wherein the inverse wavelet coefficient is an inverse matrix of the wavelet coefficient.
8. The method according to any of claims 1-7, wherein said determining a first fusion weight corresponding to the current frame according to the first feature and the second feature comprises:
calculating shot noise and read noise according to shooting parameters of equipment used for shooting the video data;
and determining a first fusion weight corresponding to the current frame by combining the shot noise, the read noise, the first feature and the second feature.
9. The method of claim 8, wherein said determining the first fusion weight corresponding to the current frame by combining the shot noise, the read noise, the first feature and the second feature comprises:
calculating the noise variance of each pixel point in the current frame according to the shot noise and the read noise;
extracting a fifth feature from the noise variance;
and determining the first fusion weight corresponding to the current frame by combining the fifth feature, the first feature and the second feature.
10. The method according to any one of claims 1-9, wherein said denoising the second fused image comprises:
and denoising the second fusion image by combining the first fusion weight, the second fusion weight, the first fusion image and the current frame to obtain the denoised image.
11. The method of claim 10, wherein denoising the second fused image in combination with the first fused weight, the second fused weight, the first fused image, and the current frame comprises:
calculating the variance of each pixel point in the second fusion image by combining the first fusion weight, the second fusion weight, the first fusion image and the current frame to obtain a fusion image variance;
and taking the fused image variance and the second fused image as the input of a denoising model, and outputting the denoising image, wherein the denoising model is used for removing noise in the input image.
12. The method according to claim 11, wherein said taking the fusion image variance and the second fused image as the input of the denoising model comprises:
and taking the current frame, the fusion image variance and the second fusion image as the input of a denoising model, and outputting the denoising image.
13. The method according to any one of claims 1-12, further comprising:
at least one down-sampling is carried out on the current frame and the first fused image to obtain at least one down-sampled frame and at least one down-sampled fused image, wherein the scales of the images obtained by each down-sampling are different, the at least one down-sampled frame is obtained by carrying out at least one down-sampling on the current frame, and the at least one down-sampled fused image is obtained by carrying out at least one down-sampling on the first fused image;
denoising the at least one down-sampling frame and the at least one down-sampling fusion image to obtain a multi-scale fusion image;
and fusing the de-noised image and the multi-scale fused image to obtain an updated de-noised image.
14. The method according to claim 13, wherein any one of the processes of denoising the at least one down-sampled frame and the at least one down-sampled fused image comprises:
determining a weight corresponding to a first downsampled frame according to the features extracted from the first downsampled frame and the features extracted from a first downsampled fusion image to obtain a fifth fusion weight, and determining a weight corresponding to the first downsampled fusion image to obtain a sixth fusion weight, wherein the first downsampled frame is any one of the at least one downsampled frame, and the first downsampled fusion image is the one of the at least one downsampled fusion image that has the same scale as the first downsampled frame;
determining a weight corresponding to a second downsampled fusion image according to the features extracted from the second downsampled fusion image to obtain a seventh fusion weight, wherein the second downsampled fusion image is fused with information of an image, of the at least one downsampled frame, whose scale is smaller than that of the first downsampled frame;
fusing the first downsampled frame, the first downsampled fusion image and the second downsampled fusion image according to the fifth fusion weight, the sixth fusion weight and the seventh fusion weight to obtain a third downsampled fusion image, wherein an upsampled image of the third downsampled fusion image is used for being fused with an image, of the at least one downsampled frame, whose scale is larger than that of the first downsampled frame;
denoising the third downsampled fusion image to obtain a first downsampled denoised image;
and performing upsampling on the first downsampled denoised image to obtain an upsampled denoised image, wherein the upsampled denoised image is used for denoising, in combination with a fusion image of the same scale as the upsampled denoised image, to obtain an image of a scale larger than that of the first downsampled frame.
15. The method according to claim 13 or 14, wherein said down-sampling said current frame and said first fused image at least once comprises:
and performing at least one wavelet transform on the current frame and the first fused image to obtain the at least one down-sampling frame and the at least one down-sampling fused image.
16. A video denoising apparatus, comprising:
an obtaining module, configured to obtain a current frame and a first fused image, where the current frame is any one frame image after a first frame arranged in a preset order in video data, and the first fused image includes information of at least one frame adjacent to the current frame in the video data according to the preset order;
the time domain fusion module is used for extracting features from the current frame to obtain first features and extracting features from the first fusion image to obtain second features;
the time domain fusion module is further configured to determine to obtain a first fusion weight and a second fusion weight according to the first feature and the second feature, where the first fusion weight includes a weight corresponding to the current frame, the second fusion weight includes a weight corresponding to the first fusion image, the weight corresponding to a foreground in the current frame is not less than the weight corresponding to the foreground in the first fusion image, and the weight corresponding to a background in the current frame is not greater than the weight corresponding to the background in the first fusion image;
the time domain fusion module is further configured to fuse the current frame and the first fusion image according to the first fusion weight and the second fusion weight to obtain a second fusion image;
and the denoising module is used for denoising the second fusion image to obtain a denoised image.
17. The apparatus of claim 16, further comprising:
and the refinement module is used for, after the second fused image is denoised, fusing the second fused image and the denoised image to obtain an updated denoised image.
18. The apparatus of claim 17, wherein the refinement module is specifically configured to:
extracting features from the second fusion image to obtain third features;
extracting features from the de-noised image to obtain fourth features;
determining a third fusion weight and a fourth fusion weight according to the third feature and the fourth feature, wherein the third fusion weight is a weight corresponding to the third feature, and the fourth fusion weight is a weight corresponding to the fourth feature, and in the third feature and the fourth feature, the frequency of each pixel point and the corresponding weight value are in a negative correlation relationship;
and fusing the third feature and the fourth feature according to the third fusion weight and the fourth fusion weight to obtain an updated denoised image.
19. The apparatus according to any one of claims 16-18,
the device further comprises: a color-related transform module, configured to perform color transform on the current frame through a color transform matrix before the feature is extracted from the current frame and the feature is extracted from the first fused image, so as to obtain a first chrominance component and a first luminance component, where the first chrominance component and the first luminance component form a new current frame, and perform color transform on the first fused image, so as to obtain a second chrominance component and a second luminance component, where the second chrominance component and the second luminance component form a new first fused image, and the color transform matrix is a preset matrix or is obtained by training at least one convolution kernel;
the time domain fusion module is specifically configured to extract features from the new current frame to obtain the first features, and extract features from the new first fusion image to obtain the second features.
20. The apparatus of claim 19, further comprising:
and the inverse color correlation transformation module is used for carrying out color transformation on the denoised image through an inverse color transformation matrix after the second fused image is denoised to obtain an updated denoised image, wherein the inverse color transformation matrix is an inverse matrix of the color transformation matrix.
21. The apparatus according to any one of claims 16-20, further comprising:
a frequency-decorrelation transform module, configured to, before the time-domain fusion module extracts features from the current frame and features from the first fusion image, perform wavelet transform on the current frame using a wavelet coefficient to obtain a first low-frequency component and a first high-frequency component, where the first low-frequency component and the first high-frequency component form a new current frame, and perform wavelet transform on the first fusion image to obtain a second low-frequency component and a second high-frequency component, where the second low-frequency component and the second high-frequency component form a new first fusion image, and the wavelet coefficient is a preset coefficient or is obtained by training at least one convolution kernel.
22. The apparatus of claim 21, further comprising:
and the inverse frequency correlation transformation module is used for carrying out inverse wavelet transformation on the denoised image through an inverse wavelet coefficient after the denoising module denoises the second fused image to obtain an updated denoised image, wherein the inverse wavelet coefficient is an inverse matrix of the wavelet coefficient.
23. The apparatus according to any one of claims 16 to 22, wherein the time domain fusion module is specifically configured to:
calculating shot noise and read noise according to shooting parameters of equipment used for shooting the video data;
and determining a first fusion weight corresponding to the current frame by combining the shot noise, the read noise, the first feature and the second feature.
24. The apparatus according to claim 23, wherein the time domain fusion module is specifically configured to:
calculating the noise variance of each pixel point in the current frame according to the shot noise and the read noise;
extracting a fifth feature from the noise variance;
and combining the fifth feature, the first feature and the second feature to determine a first fusion weight corresponding to the current frame.
25. The apparatus according to any of claims 16-24, wherein the denoising module is specifically configured to:
and denoising the second fusion image by combining the first fusion weight, the second fusion weight, the first fusion image and the current frame to obtain the denoised image.
26. The apparatus of claim 25, wherein the denoising module is specifically configured to:
calculating the variance of each pixel point in the second fusion image by combining the first fusion weight, the second fusion weight, the first fusion image and the current frame to obtain a fusion image variance;
and taking the fused image variance and the second fused image as the input of a denoising model, and outputting the denoising image, wherein the denoising model is used for removing noise in the input image.
27. The apparatus of claim 26,
the denoising module is specifically configured to take the current frame, the variance of the fusion image, and the second fusion image as inputs of a denoising model, and output the denoised image.
28. The apparatus according to any one of claims 16-27, further comprising: a down-sampling module;
the down-sampling module is used for performing at least one down-sampling on the current frame and the first fused image to obtain at least one down-sampled frame and at least one down-sampled fused image, wherein the scales of the images obtained by each down-sampling are different, the at least one down-sampled frame is obtained by performing at least one down-sampling on the current frame, and the at least one down-sampled fused image is obtained by performing at least one down-sampling on the first fused image;
the denoising module is further configured to perform denoising processing on the at least one down-sampling frame and the at least one down-sampling fusion image to obtain a multi-scale fusion image;
the denoising module is further used for fusing the denoised image and the multi-scale fusion image to obtain an updated denoised image.
29. The apparatus according to claim 28, wherein any one of the denoising processes performed by the denoising module on the at least one down-sampled frame and the at least one down-sampled fused image comprises:
determining a weight corresponding to a first downsampled frame according to the features extracted from the first downsampled frame and the features extracted from a first downsampled fusion image to obtain a fifth fusion weight, and determining a weight corresponding to the first downsampled fusion image to obtain a sixth fusion weight, wherein the first downsampled frame is any one of the at least one downsampled frame, and the first downsampled fusion image is the one of the at least one downsampled fusion image that has the same scale as the first downsampled frame;
determining a weight corresponding to a second downsampled fusion image according to the features extracted from the second downsampled fusion image to obtain a seventh fusion weight, wherein the second downsampled fusion image is fused with information of an image, of the at least one downsampled frame, whose scale is smaller than that of the first downsampled frame;
fusing the first downsampled frame, the first downsampled fusion image and the second downsampled fusion image according to the fifth fusion weight, the sixth fusion weight and the seventh fusion weight to obtain a third downsampled fusion image, wherein an upsampled image of the third downsampled fusion image is used for being fused with an image, of the at least one downsampled frame, whose scale is larger than that of the first downsampled frame;
denoising the third downsampling fusion image to obtain a first downsampling denoising image;
and performing upsampling on the first downsampled denoised image to obtain an upsampled denoised image, wherein the upsampled denoised image is used for denoising, in combination with a fusion image of the same scale as the upsampled denoised image, to obtain an image of a scale larger than that of the first downsampled frame.
30. The apparatus of claim 28 or 29,
the down-sampling module is specifically configured to perform at least one wavelet transform on the current frame and the first fused image to obtain the at least one down-sampling frame and the at least one down-sampling fused image.
31. A video processing method, comprising:
acquiring a current frame and a first fusion image, wherein the current frame is any one frame image behind a first frame arranged in a preset sequence in video data, and the first fusion image comprises information of at least one frame adjacent to the current frame in the video data according to the preset sequence;
extracting features from the current frame to obtain first features, and extracting features from the first fusion image to obtain second features;
determining to obtain a first fusion weight and a second fusion weight according to the first feature and the second feature, wherein the first fusion weight comprises a weight corresponding to the current frame, the second fusion weight comprises a weight corresponding to the first fusion image, the weight corresponding to the foreground in the current frame is not less than the weight corresponding to the foreground in the first fusion image, and the weight corresponding to the background in the current frame is not more than the weight corresponding to the background in the first fusion image;
and fusing the current frame and the first fused image according to the first fusion weight and the second fusion weight to obtain a second fused image.
32. A video processing apparatus, comprising:
an obtaining module, configured to obtain a current frame and a first fused image, where the current frame is any one frame image after a first frame arranged in a preset order in video data, and the first fused image includes information of at least one frame adjacent to the current frame in the video data according to the preset order;
the time domain fusion module is used for extracting features from the current frame to obtain first features and extracting features from the first fusion image to obtain second features;
the time domain fusion module is further configured to determine to obtain a first fusion weight and a second fusion weight according to the first feature and the second feature, where the first fusion weight includes a weight corresponding to the current frame, the second fusion weight includes a weight corresponding to the first fusion image, the weight corresponding to a foreground in the current frame is not less than the weight corresponding to the foreground in the first fusion image, and the weight corresponding to a background in the current frame is not greater than the weight corresponding to the background in the first fusion image;
and the time domain fusion module is further configured to fuse the current frame and the first fusion image according to the first fusion weight and the second fusion weight to obtain a second fusion image.
33. A video denoising apparatus, comprising a processor, wherein the processor is coupled to a memory, the memory stores a program, and when program instructions stored in the memory are executed by the processor, the method of any one of claims 1 to 15 is implemented.
34. A video processing apparatus, comprising a processor, wherein the processor is coupled to a memory, the memory stores a program, and when program instructions stored in the memory are executed by the processor, the method of claim 31 is implemented.
35. A computer readable storage medium comprising a program which, when executed by a processing unit, performs the method of any of claims 1 to 15 or 31.
36. A video denoising apparatus, comprising a processing unit and a communication interface, wherein the processing unit obtains program instructions through the communication interface, and when the program instructions are executed by the processing unit, the method according to any one of claims 1 to 15 is implemented.
37. A video processing apparatus comprising a processing unit and a communication interface, the processing unit obtaining program instructions through the communication interface, the program instructions when executed by the processing unit implementing the method of claim 31.
CN202110221439.7A 2021-02-27 2021-02-27 Video denoising method, video processing method and device Pending CN115063301A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110221439.7A CN115063301A (en) 2021-02-27 2021-02-27 Video denoising method, video processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110221439.7A CN115063301A (en) 2021-02-27 2021-02-27 Video denoising method, video processing method and device

Publications (1)

Publication Number Publication Date
CN115063301A true CN115063301A (en) 2022-09-16

Family

ID=83197467

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110221439.7A Pending CN115063301A (en) 2021-02-27 2021-02-27 Video denoising method, video processing method and device

Country Status (1)

Country Link
CN (1) CN115063301A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117132510A (en) * 2023-10-24 2023-11-28 临沂安保服务集团有限公司 Monitoring image enhancement method and system based on image processing
CN117132510B (en) * 2023-10-24 2024-01-26 临沂安保服务集团有限公司 Monitoring image enhancement method and system based on image processing

Similar Documents

Publication Publication Date Title
CN112233038B (en) True image denoising method based on multi-scale fusion and edge enhancement
CN115442515B (en) Image processing method and apparatus
CN111402146B (en) Image processing method and image processing apparatus
US20230080693A1 (en) Image processing method, electronic device and readable storage medium
WO2020199831A1 (en) Method for training image processing model, image processing method, network device, and storage medium
WO2022042049A1 (en) Image fusion method, and training method and apparatus for image fusion model
US10708525B2 (en) Systems and methods for processing low light images
CN112070664B (en) Image processing method and device
WO2021164731A1 (en) Image enhancement method and image enhancement apparatus
WO2020177607A1 (en) Image denoising method and apparatus
CN112164011B (en) Motion image deblurring method based on self-adaptive residual error and recursive cross attention
CN112348747A (en) Image enhancement method, device and storage medium
CN111914997A (en) Method for training neural network, image processing method and device
CN116051428B (en) Deep learning-based combined denoising and superdivision low-illumination image enhancement method
CN111951195A (en) Image enhancement method and device
CN114627034A (en) Image enhancement method, training method of image enhancement model and related equipment
CN113379613A (en) Image denoising system and method using deep convolutional network
CN113284055A (en) Image processing method and device
CN115131256A (en) Image processing model, and training method and device of image processing model
CN116547694A (en) Method and system for deblurring blurred images
CN115239581A (en) Image processing method and related device
CN110717864B (en) Image enhancement method, device, terminal equipment and computer readable medium
CN115063301A (en) Video denoising method, video processing method and device
CN111861877A (en) Method and apparatus for video hyper-resolution
CN115311149A (en) Image denoising method, model, computer-readable storage medium and terminal device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination