WO2022067653A1 - Image processing method and apparatus, device, video processing method, and storage medium - Google Patents

Image processing method and apparatus, device, video processing method, and storage medium

Info

Publication number
WO2022067653A1
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
image
output
input
video
Prior art date
Application number
PCT/CN2020/119363
Other languages
English (en)
French (fr)
Inventor
高艳
陈冠男
张丽杰
陈文彬
Original Assignee
京东方科技集团股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 京东方科技集团股份有限公司
Priority to US17/430,840 (published as US20220164934A1)
Priority to CN202080002197.6A (published as CN114586056A)
Priority to PCT/CN2020/119363 (published as WO2022067653A1)
Publication of WO2022067653A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4046Scaling of whole images or parts thereof, e.g. expanding or contracting using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/7747Organisation of the process, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/776Validation; Performance evaluation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/803Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of input or preprocessed data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/015High-definition television systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30168Image quality inspection

Definitions

  • Embodiments of the present disclosure relate to an image processing method and apparatus, device, video processing method, and storage medium.
  • Deep learning techniques based on artificial neural networks have made great progress in fields such as object classification, text processing, recommendation engines, image search, facial recognition, age and speech recognition, human-computer dialogue, and affective computing.
  • With the deepening of research on artificial neural network structures and the improvement of related algorithms, deep learning technology has made breakthroughs in the field of human-like data perception. For example, deep learning technology can be used to describe image content, identify objects in complex environments within images, and process images as needed.
  • At least one embodiment of the present disclosure provides an image processing method applicable to a convolutional neural network. The method includes: receiving an input image; and using the convolutional neural network to process the input image to obtain an output image, wherein the definition of the output image is higher than that of the input image. Using the convolutional neural network to process the input image to obtain the output image includes: performing feature extraction on the input image to obtain a plurality of first images; performing splicing processing on the input image and the plurality of first images to obtain a first image group, wherein the first image group includes the input image and the plurality of first images; performing feature extraction on the first image group to obtain a plurality of second images; fusing the plurality of second images and the plurality of first images to obtain a plurality of third images; performing splicing processing on the input image and the plurality of third images to obtain a second image group, wherein the second image group includes the input image and the plurality of third images; and performing feature extraction on the second image group to obtain the output image.
  • For example, the number of convolution kernels used in the convolutional neural network for performing feature extraction on the input image is N, where 12 ≤ N ≤ 20 and N is an integer; the number of convolution kernels used in the convolutional neural network for performing feature extraction on the first image group is M, where 12 ≤ M ≤ 20 and M is an integer; and the number of convolution kernels used in the convolutional neural network for performing feature extraction on the second image group is 3.
  • The method provided by an embodiment of the present disclosure further includes: training a second neural network to be trained based on a pre-trained first neural network to obtain a trained second neural network, thereby obtaining the convolutional neural network.
  • The first neural network has more parameters than the second neural network.
  • The pre-trained first neural network is configured to transform an original image with a first definition, input into the pre-trained first neural network, into a new image with a second definition, the second definition being greater than the first definition.
  • The trained second neural network is the convolutional neural network.
  • The network structure of the second neural network to be trained is the same as that of the convolutional neural network, and the parameters of the second neural network to be trained are different from the parameters of the convolutional neural network.
  • In some embodiments, training the second neural network to be trained based on the pre-trained first neural network to obtain the trained second neural network, thereby obtaining the convolutional neural network, includes: based on the pre-trained first neural network, the second neural network to be trained, and a discrimination network, alternately training the discrimination network and the second neural network to obtain the trained second neural network, thereby obtaining the convolutional neural network.
  • For example, training the discrimination network includes: inputting first sample data into the first neural network and the second neural network respectively, obtaining first data output from the first neural network and second data output from the second neural network; setting the first data to have a ground-truth label and inputting the first data with the ground-truth label into the discrimination network to obtain a first discrimination result, and setting the second data to have a false-value label and inputting the second data with the false-value label into the discrimination network to obtain a second discrimination result; calculating a first loss function based on the first discrimination result and the second discrimination result; and adjusting the parameters of the discrimination network according to the first loss function to obtain an updated discrimination network.
  • For example, training the second neural network includes: inputting second sample data into the first neural network and the second neural network respectively, obtaining third data output from the first neural network and fourth data output from the second neural network; inputting the fourth data into the discrimination network to obtain a third discrimination result output from the discrimination network; calculating an error function based on the third data and the fourth data, calculating a discrimination function based on the third discrimination result, and calculating a second loss function based on the error function and the discrimination function; and adjusting the parameters of the second neural network according to the second loss function to obtain an updated second neural network.
  • the second loss function is a weighted sum of the error function and the discrimination function.
  • the weight of the error function is 90-110, and the weight of the discrimination function is 0.5-2.
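  • As a rough illustration only (not the patent's exact formulation), such a weighted second loss function could be assembled as in the following PyTorch sketch. The choice of L1 distance for the error function and a binary cross-entropy term for the discrimination function is an assumption; the weights 100 and 1 simply fall inside the claimed ranges of 90-110 and 0.5-2.

```python
import torch
import torch.nn.functional as F

def second_loss(student_out, teacher_out, disc_score,
                error_weight=100.0, disc_weight=1.0):
    """Weighted sum of an error function and a discrimination function (sketch).

    Assumptions: the error function is the L1 distance between the second
    network's output and the first network's output, and the discrimination
    function is a BCE term that rewards the second network for being judged
    "real" by the discrimination network.
    """
    error_fn = F.l1_loss(student_out, teacher_out)
    disc_fn = F.binary_cross_entropy_with_logits(
        disc_score, torch.ones_like(disc_score))  # try to fool Disc
    return error_weight * error_fn + disc_weight * disc_fn
```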
  • the first sample data and the second sample data are image data obtained based on multiple videos with the same bit rate.
  • In some other embodiments, training the second neural network to be trained based on the pre-trained first neural network to obtain the trained second neural network, thereby obtaining the convolutional neural network, includes: inputting third sample data into the first neural network and the second neural network respectively, obtaining fifth data output from the first neural network and sixth data output from the second neural network; calculating a third loss function based on the fifth data and the sixth data; and adjusting the parameters of the second neural network according to the third loss function to obtain an updated second neural network.
  • For example, the first neural network includes multi-level downsampling units and corresponding multi-level upsampling units.
  • The output of each level of downsampling unit is used as the input of the next level of downsampling unit.
  • The input of each level of upsampling unit includes the output of the downsampling unit corresponding to that level of upsampling unit and the output of the upsampling unit of the previous level.
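  • The down/up-sampling structure described above resembles a U-Net-style encoder-decoder. The PyTorch sketch below is one plausible reading of that description (the number of levels, channel widths, and the use of strided and transposed convolutions are assumptions): each downsampling unit feeds the next, and each upsampling unit receives the output of its corresponding downsampling unit together with the output of the previous upsampling unit.

```python
import torch
import torch.nn as nn

class DownUpNet(nn.Module):
    """Sketch of the first (teacher) network: multi-level downsampling units
    and corresponding upsampling units with skip connections.
    Three levels and these channel widths are illustrative assumptions."""

    def __init__(self, in_ch=3, base=32):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(in_ch, base, 3, stride=2, padding=1), nn.ReLU())
        self.down2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU())
        self.down3 = nn.Sequential(nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1), nn.ReLU())
        self.up3 = nn.Sequential(nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2), nn.ReLU())
        self.up2 = nn.Sequential(nn.ConvTranspose2d(base * 4, base, 2, stride=2), nn.ReLU())
        self.up1 = nn.Sequential(nn.ConvTranspose2d(base * 2, in_ch, 2, stride=2), nn.ReLU())

    def forward(self, x):
        d1 = self.down1(x)                         # level-1 downsampling output
        d2 = self.down2(d1)                        # fed with the previous level's output
        d3 = self.down3(d2)
        u3 = self.up3(d3)                          # deepest upsampling unit
        u2 = self.up2(torch.cat([u3, d2], dim=1))  # skip from the matching down unit
        u1 = self.up1(torch.cat([u2, d1], dim=1))  # skip from the matching down unit
        return u1
```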
  • At least one embodiment of the present disclosure further provides a terminal device including a processor, wherein the processor is configured to: acquire an input video bit rate and an input video, the input video including a plurality of input image frames; and, according to the input video bit rate, select a video processing method corresponding to the input video bit rate to process at least one input image frame of the plurality of input image frames to obtain at least one output image frame, wherein the definition of the at least one output image frame is higher than the definition of the at least one input image frame, and different input video bit rates correspond to different video processing methods.
  • For example, the video processing method includes: processing the at least one input image frame based on a trained neural network to obtain the at least one output image frame. Processing the at least one input image frame based on the trained neural network to obtain the at least one output image frame includes: performing feature extraction on the at least one input image frame to obtain a plurality of first output images; performing splicing processing on the at least one input image frame and the plurality of first output images to obtain a first output image group, wherein the first output image group includes the at least one input image frame and the plurality of first output images; performing feature extraction on the first output image group to obtain a plurality of second output images; fusing the plurality of second output images and the plurality of first output images to obtain a plurality of third output images; performing splicing processing on the at least one input image frame and the plurality of third output images to obtain a second output image group, wherein the second output image group includes the at least one input image frame and the plurality of third output images; and performing feature extraction on the second output image group to obtain the at least one output image frame.
  • the trained neural networks corresponding to different video processing methods are different.
  • For example, the trained neural networks corresponding to different video processing methods are obtained by training with different sample data sets respectively; the different sample data sets are obtained based on different video sets, each video set includes multiple videos, the videos in the same video set have the same bit rate, and the videos in different video sets have different bit rates.
  • At least one embodiment of the present disclosure further provides a video processing method, including: acquiring an input video bit rate and an input video, wherein the input video includes a plurality of input image frames; and, according to the input video bit rate, selecting a video processing sub-method corresponding to the input video bit rate to process at least one input image frame of the plurality of input image frames to obtain at least one output image frame, wherein the definition of the at least one output image frame is higher than that of the at least one input image frame, and different input video bit rates correspond to different video processing sub-methods.
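  • Purely as an illustration of the bit-rate-dependent dispatch described above, a terminal device might select the trained network as follows; the bit-rate thresholds, model file names, and TorchScript loading are hypothetical, not part of the claims.

```python
import torch

# Hypothetical mapping from bit-rate ranges (kbps) to networks trained on
# sample data of the matching bit rate; thresholds and paths are assumptions.
BITRATE_MODELS = [
    (1000, "enhance_low_bitrate.pt"),
    (3000, "enhance_mid_bitrate.pt"),
    (float("inf"), "enhance_high_bitrate.pt"),
]

def select_model(input_bitrate_kbps):
    """Pick the trained network whose training bit rate matches the input video."""
    for upper, path in BITRATE_MODELS:
        if input_bitrate_kbps <= upper:
            return torch.jit.load(path)  # assumes models exported with TorchScript

def process_frames(frames, input_bitrate_kbps):
    model = select_model(input_bitrate_kbps)
    with torch.no_grad():
        return [model(f.unsqueeze(0)).squeeze(0) for f in frames]
```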
  • At least one embodiment of the present disclosure further provides an image processing apparatus, including: a processor; and a memory including one or more computer program modules, wherein the one or more computer program modules are stored in the memory and are configured to be executed by the processor, and the one or more computer program modules include instructions for implementing the image processing method described in any embodiment of the present disclosure.
  • At least one embodiment of the present disclosure further provides a storage medium for storing non-transitory computer-readable instructions which, when executed by a computer, can implement the image processing method described in any embodiment of the present disclosure.
  • FIG. 1 is a schematic diagram of a convolutional neural network;
  • FIG. 2 is a schematic flowchart of an image processing method provided by some embodiments of the present disclosure;
  • FIG. 3A is a schematic diagram of a convolutional neural network adopted by an image processing method provided by some embodiments of the present disclosure;
  • FIG. 3B is a schematic flowchart of step S20 in the image processing method shown in FIG. 2;
  • FIG. 4 is a schematic flowchart of another image processing method provided by some embodiments of the present disclosure;
  • FIG. 5A is a schematic flowchart of training a discrimination network in an image processing method provided by some embodiments of the present disclosure;
  • FIG. 5B is a schematic diagram of the scheme for training the discrimination network shown in FIG. 5A;
  • FIG. 5C is a schematic diagram of a discrimination network;
  • FIG. 6A is a schematic flowchart of training a second neural network in an image processing method provided by some embodiments of the present disclosure;
  • FIG. 6B is a schematic diagram of the scheme for training the second neural network shown in FIG. 6A;
  • FIG. 7A is a schematic diagram of a neural network with a denoising function;
  • FIG. 7B is a schematic diagram of a neural network with a deblurring function;
  • FIG. 8A is a schematic flowchart of training a second neural network in another image processing method provided by some embodiments of the present disclosure;
  • FIG. 8B is a schematic diagram of the scheme for training the second neural network shown in FIG. 8A;
  • FIG. 9A is a schematic block diagram of a terminal device provided by some embodiments of the present disclosure;
  • FIG. 9B is a schematic block diagram of another terminal device provided by some embodiments of the present disclosure;
  • FIG. 10A is a data flow diagram of a terminal device provided by some embodiments of the present disclosure;
  • FIG. 10B is an operation flowchart of a terminal device provided by some embodiments of the present disclosure;
  • FIG. 11A is a schematic diagram of a video picture;
  • FIG. 11B is an effect diagram of the picture shown in FIG. 11A after being processed by the terminal device provided by an embodiment of the present disclosure;
  • FIG. 12 is a schematic block diagram of an image processing apparatus provided by some embodiments of the present disclosure;
  • FIG. 13 is a schematic block diagram of another image processing apparatus provided by some embodiments of the present disclosure;
  • FIG. 14 is a schematic diagram of a storage medium provided by some embodiments of the present disclosure;
  • FIG. 15 is a schematic flowchart of a video processing method provided by some embodiments of the present disclosure.
  • Image enhancement techniques based on deep learning can be used for image processing, such as image denoising, image inpainting, image deblurring, image super-resolution enhancement, image deraining/dehazing, etc.
  • FIG. 1 is a schematic diagram of a convolutional neural network.
  • For example, a convolutional neural network can be used for image processing; it uses images as its input and output, and replaces scalar weights with convolution kernels.
  • FIG. 1 only shows a convolutional neural network with a 3-layer structure, which is not limited by the embodiments of the present disclosure.
  • the convolutional neural network includes an input layer a01, a hidden layer a02 and an output layer a03.
  • the input layer a01 has 4 inputs
  • the hidden layer a02 has 3 outputs
  • the output layer a03 has 2 outputs
  • the convolutional neural network finally outputs 2 images.
  • The hidden layer a02, also known as the intermediate layer, is mainly used to extract features; the neurons in it can take various forms and introduce a bias into the output result.
  • the four inputs of the input layer a01 can be four images, or four features of one image.
  • the three outputs of the hidden layer a02 can be the feature images of the image input through the input layer a01.
  • the hidden layer a02 includes a first convolutional layer b01 and a first activation layer b03.
  • The convolutional layers have weights w_ij^k and biases b_i^k: the weights represent the convolution kernels, and the bias is a scalar superimposed on the output of the convolutional layer, where k is a label representing the number of the input layer a01 or hidden layer a02, and i and j are the labels of the individual units in the two connected layers, respectively.
  • The first convolutional layer b01 includes a first set of convolution kernels (the w_ij^1) and a first set of biases (the b_i^1).
  • the output layer a03 includes a second convolutional layer b02 and a second activation layer b04.
  • The second convolutional layer b02 includes a second set of convolution kernels (the w_ij^2) and a second set of biases (the b_i^2).
  • each convolutional layer includes dozens or hundreds of convolution kernels, and if the convolutional neural network is a deep convolutional neural network, it may include at least five convolutional layers.
  • the activation layer includes an activation function, which is used to introduce nonlinear factors into the convolutional neural network, so that the convolutional neural network can better solve more complex problems.
  • The activation function may include a rectified linear unit function (ReLU function), a sigmoid function (Sigmoid function), a hyperbolic tangent function (tanh function), and the like.
  • the ReLU function is an unsaturated nonlinear function, and the Sigmoid function and the tanh function are saturated nonlinear functions.
  • the activation layer can be used alone as a layer of the convolutional neural network, or the activation layer can also be included in the convolutional layer.
  • In the first convolutional layer b01, first, several convolution kernels in the first set of convolution kernels and several biases in the first set of biases are applied to each input to obtain the output of the first convolutional layer b01; then, the output of the first convolutional layer b01 can be processed by the first activation layer b03 to obtain the output of the first activation layer b03.
  • In the second convolutional layer b02, first, several convolution kernels in the second set of convolution kernels and several biases in the second set of biases are applied to the output of the first activation layer b03 to obtain the output of the second convolutional layer b02; then, the output of the second convolutional layer b02 can be processed by the second activation layer b04 to obtain the output of the second activation layer b04.
  • For example, the output of the first convolutional layer b01 can be the result of applying a convolution kernel w_ij^1 to its input and then adding a bias b_i^1; the output of the second convolutional layer b02 can be the result of applying a convolution kernel w_ij^2 to the output of the first activation layer b03 and then adding a bias b_i^2.
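  • As a minimal numeric illustration of the relationship just described, using PyTorch's functional API (the tensor sizes and random values are arbitrary, chosen only for the demonstration):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 4, 8, 8)     # 4 inputs of the input layer a01 (8x8 images)
w1 = torch.randn(3, 4, 3, 3)    # first set of convolution kernels w_ij^1 (3 outputs)
b1 = torch.randn(3)             # first set of biases b_i^1

conv_out = F.conv2d(x, w1, bias=b1, padding=1)  # output of first convolutional layer b01
act_out = F.relu(conv_out)                      # output of first activation layer b03
print(act_out.shape)            # torch.Size([1, 3, 8, 8])
```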
  • Before using the convolutional neural network for image processing, the convolutional neural network needs to be trained. After training, the convolution kernels and biases of the convolutional neural network remain unchanged during image processing. During training, each convolution kernel and bias is adjusted through multiple sets of input/output example images and an optimization algorithm to obtain an optimized convolutional neural network model.
  • In some cases, the quality of the video played on a user's TV is poor.
  • The poor picture quality is mainly due to the following two reasons.
  • First, the resolution of the video image is low.
  • the resolution of TV channels or Internet videos is usually standard definition (360p or 720p).
  • Since the user's TV screen may have a 4K resolution (2160p), it is necessary to interpolate and enlarge the video image, resulting in a massive loss of detail information.
  • Second, the original video needs to be compressed at a low bit rate, and the compressed video also loses a lot of detail information and introduces a lot of noise.
  • A terminal video processor can operate the basic functions of the TV and can also provide functions such as voice control and program recommendation. With the continuous improvement of chip computing power, the terminal video processor can integrate video image enhancement functions to address the problem of poor video quality.
  • However, the video image enhancement function of a terminal video processor is still mainly based on traditional, simple algorithms, and its ability to improve image quality is limited. Deep learning algorithms (such as convolutional neural networks) require a large amount of computation, and it is difficult for terminal video processors to support the operations of common convolutional neural networks.
  • At least one embodiment of the present disclosure provides an image processing method and apparatus, a terminal device, a video processing method, and a storage medium.
  • the convolutional neural network required by the image processing method has a simple structure, can save the computing power of the device, can remove the noise of low-quality images and videos, improve the picture clarity, realize real-time picture quality enhancement, and is easy to deploy on terminal equipment.
  • the image processing method provided by at least one embodiment also has a better neural network training effect.
  • At least one embodiment of the present disclosure provides an image processing method, which is applicable to a convolutional neural network.
  • the image processing method includes: receiving an input image; using a convolutional neural network to process the input image to obtain an output image.
  • the clarity of the output image is higher than that of the input image.
  • Using a convolutional neural network to process an input image to obtain an output image includes: extracting features from the input image to obtain multiple first images; splicing the input image and multiple first images to obtain a first image group, the first The image group includes an input image and a plurality of first images; feature extraction is performed on the first image group to obtain a plurality of second images; a plurality of second images and a plurality of first images are fused to obtain a plurality of third images; The input image and a plurality of third images are spliced to obtain a second image group, and the second image group includes the input image and a plurality of third images; and feature extraction is performed on the second image group to obtain an output image.
  • Here, "definition" (sharpness) refers to, for example, the clarity of each detailed shadow pattern and its boundary in an image; the higher the definition, the better the perceptual effect for the human eye.
  • The definition of the output image being higher than that of the input image means, for example, that the input image is processed using the image processing method provided by the embodiments of the present disclosure, such as by denoising and/or deblurring, so that the output image obtained after processing is sharper than the input image.
  • FIG. 2 is a schematic flowchart of an image processing method provided by some embodiments of the present disclosure.
  • the image processing method is applicable to a convolutional neural network, that is, the image processing method utilizes a convolutional neural network to implement image processing.
  • the image processing method includes the following operations.
  • Step S10: receiving an input image;
  • Step S20: using a convolutional neural network to process the input image to obtain an output image, wherein the definition of the output image is higher than that of the input image.
  • the input image may be an image to be processed, such as an image with lower definition.
  • the image to be processed may be a video frame extracted from a video, or a picture downloaded through a network or captured by a camera, or an image obtained through other means, which is not limited in this embodiment of the present disclosure.
  • Such an image needs to be denoised and/or deblurred using the image processing method provided by the embodiments of the present disclosure, so as to improve its definition and enhance the image quality.
  • the input image when the input image is a color image, the input image may include a red (R) channel input image, a green (G) channel input image, and a blue (B) channel input image.
  • the input image is a color image, which includes 3 channels of RGB.
  • a convolutional neural network may be used to implement image processing, such as denoising and/or deblurring the input image, so that the clarity of the resulting output image is higher than that of the input image.
  • Correspondingly, the output image may include a red (R) channel output image, a green (G) channel output image, and a blue (B) channel output image.
  • the output image is a color image, which includes 3 channels of RGB.
  • FIG. 3A is a schematic diagram of a convolutional neural network adopted by an image processing method provided by some embodiments of the present disclosure.
  • the input image is input into the convolutional neural network shown in FIG. 3A , and the convolutional neural network processes the input image to obtain the output image, thereby completing the image processing , so that the clarity of the output image is higher than that of the input image.
  • For example, the convolutional neural network includes an input layer INP, a first convolutional layer C1, a first splicing layer P1, a second convolutional layer C2, a fusion layer Fu1, a second splicing layer P2, and an output layer OT.
  • the input layer INP is used to receive input images.
  • the input image may include one channel.
  • Alternatively, the input image may include three channels; that is, there is one input image, but it includes a red (R) channel input image, a green (G) channel input image, and a blue (B) channel input image.
  • the first convolution layer C1 is used to perform a convolution operation on the input image received by the input layer INP to realize feature extraction.
  • the first convolutional layer C1 includes multiple convolution kernels, multiple biases, and activation functions, whereby multiple feature images (also referred to as feature maps) can be obtained by calculation.
  • Activation functions are used to non-linearly map the results of convolution operations to assist in expressing complex features.
  • For example, the activation function in the first convolutional layer C1 adopts a rectified linear unit (ReLU) function; a network using the ReLU function converges more easily and has better prediction performance.
  • The embodiments of the present disclosure are not limited thereto; the activation function may also adopt a sigmoid function, a hyperbolic tangent (tanh) function, or any other suitable function, which may be determined according to actual requirements.
  • The first splicing layer P1 is used to provide the input image received by the input layer INP and the feature images output by the first convolutional layer C1 as the input of the second convolutional layer C2; that is, the first splicing layer P1 splices the input image received by the input layer INP with the feature images output by the first convolutional layer C1, and the feature images obtained after the splicing process serve as the input of the second convolutional layer C2.
  • the concat function does not change the content of each image itself, but only returns a spliced copy of multiple images.
  • For details of the concat function, please refer to conventional designs; it will not be described in detail here.
  • In this way, the information of the input image is linked with the information of the feature images output by the first convolutional layer C1, so that the feature images input to the second convolutional layer C2 contain the information of the input image (that is, the original image information).
  • the second convolution layer C2 is used to perform a convolution operation on the feature image output by the first stitching layer P1 to realize feature extraction.
  • the second convolution layer C2 includes multiple convolution kernels, multiple biases, and activation functions, so that multiple feature images can be calculated.
  • the activation function in the second convolutional layer C2 can also use the ReLU function.
  • the embodiments of the present disclosure are not limited thereto, and the activation function in the second convolutional layer C2 may also adopt a sigmoid function, a tanh function, or any other applicable function, which may be determined according to actual requirements.
  • the activation function in the first convolutional layer C1 and the activation function in the second convolutional layer C2 may be the same or different, which may be determined according to actual requirements, which are not limited in the embodiments of the present disclosure.
  • the fusion layer Fu1 is used to fuse the feature image output by the second convolution layer C2 with the feature image output by the first convolution layer C1.
  • the fusion layer Fu1 can fuse the features of different layers in the convolutional neural network.
  • these different layers can be two connected layers, that is, the output of one of them is used as the input of the other layer; these different layers can also be two separated layers, that is, the output of any one of them The output is not used as input to another layer.
  • any applicable fusion algorithm may be used to realize the fusion of feature images, which is not limited in the embodiments of the present disclosure.
  • The second splicing layer P2 is used to provide the input image received by the input layer INP and the feature images output by the fusion layer Fu1 as the input of the output layer OT; that is, the second splicing layer P2 splices the input image received by the input layer INP with the feature images output by the fusion layer Fu1.
  • the second splicing layer P2 can use the concat function, and the concat function does not change the content of each image itself, but only returns the spliced copies of multiple images.
  • the information of the input image can be linked with the information of the feature image output by the fusion layer Fu1, so that the feature image input to the output layer OT contains the information of the input image (that is, the original image information).
  • The output layer OT is used to perform a convolution operation on the feature images output by the second splicing layer P2, so as to convert the feature images into a multi-channel or single-channel image that meets the requirements, and the image obtained by this operation is output as the output image of the convolutional neural network. That is, the output layer OT can perform convolution operations for feature extraction and channel transformation.
  • the output layer OT includes multiple convolution kernels, multiple biases, and activation functions, from which feature images can be calculated as output images.
  • the activation function in the output layer OT can also use the ReLU function.
  • the embodiments of the present disclosure are not limited to this, and the activation function in the output layer OT may also use a sigmoid function, a tanh function, or any other applicable function, which may be determined according to actual requirements.
  • the output layer OT may further include a fully connected layer, which is used for nonlinearly combining the extracted features to obtain the output.
  • the above-mentioned convolutional neural network may further include more types of operation layers, for example, may also include pooling layers, etc., which are not limited in the embodiments of the present disclosure.
  • the convolutional layers are not limited to two layers (not limited to the first convolutional layer C1 and the second convolutional layer C2), and more convolutional layers such as the third convolutional layer, the fourth convolutional layer, etc. can be set.
  • the splicing layer is not limited to two layers (not limited to the first splicing layer P1 and the second splicing layer P2), and more splicing layers such as a third splicing layer, a fourth splicing layer, etc. may be provided.
  • the fusion layer is not limited to one layer, and more fusion layers can be set.
  • For example, the first convolutional layer C1, the second convolutional layer C2, and the output layer OT may also perform pixel padding before the convolution operation, so that the image input to a layer and the image output by that layer have the same size.
  • For details of pixel padding, please refer to conventional designs; it will not be described in detail here.
  • By means of the splicing layers, the feature image information in the convolutional neural network is linked with the information of the input image, so that the feature images in subsequent processing contain the information of the input image (that is, the original image information), giving the convolutional neural network a good detail restoration effect.
  • When the convolutional neural network is trained to have a denoising and/or deblurring function, its denoising and/or deblurring effect is very good, so the clarity of the image can be effectively improved and image enhancement can be achieved.
  • the convolutional neural network has a simple structure and is easy to implement, can effectively save computing power, improve computing efficiency, is easy to deploy on terminal equipment, and is suitable for real-time image enhancement on mobile terminals.
  • FIG. 3B is a schematic flowchart of step S20 in the image processing method shown in FIG. 2 .
  • step S20 shown in FIG. 2 may specifically include the following operations.
  • Step S21: performing feature extraction on the input image to obtain a plurality of first images;
  • Step S22: splicing the input image and the plurality of first images to obtain a first image group, wherein the first image group includes the input image and the plurality of first images;
  • Step S23: performing feature extraction on the first image group to obtain a plurality of second images;
  • Step S24: fusing the plurality of second images and the plurality of first images to obtain a plurality of third images;
  • Step S25: splicing the input image and the plurality of third images to obtain a second image group, wherein the second image group includes the input image and the plurality of third images;
  • Step S26: performing feature extraction on the second image group to obtain an output image.
  • steps S21-S26 may be implemented by the convolutional neural network shown in FIG. 3A, and the steps S21-S26 will be exemplarily described below with reference to the convolutional neural network shown in FIG. 3A.
  • the first convolution layer C1 is used to perform feature extraction on the input image to obtain a plurality of first images.
  • The input image may be an RGB color image that includes three channels; that is, there is one input image, but it includes a red (R) channel input image, a green (G) channel input image, and a blue (B) channel input image.
  • Each first image is a feature image (also referred to as a feature map) obtained after processing by the first convolutional layer C1.
  • For example, the number of convolution kernels used in the convolutional neural network for performing feature extraction on the input image (that is, the convolution kernels in the first convolutional layer C1) is N, where 12 ≤ N ≤ 20 and N is an integer.
  • For example, N = 16; that is, the first convolutional layer C1 has 16 convolution kernels, and in this case the number of first images calculated through the first convolutional layer C1 is also 16.
  • In step S22, the input image and the plurality of first images are spliced by using the first splicing layer P1 to obtain the first image group.
  • stitching processing refers to stitching multiple images using, for example, the concat function.
  • the concat function does not change the content of each image itself, but only returns the stitched copies of multiple images.
  • the first image group includes an input image and a plurality of first images.
  • For example, when the input image has three channels and the first convolutional layer C1 has 16 convolution kernels, the number of first images is 16, and the three channels of the input image are spliced with the 16 first images to obtain the first image group.
  • In this case, the first image group includes a total of 19 images, of which 16 are the feature images output by the first convolutional layer C1 (that is, the aforementioned first images) and 3 are the three channels of the input image.
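  • A minimal check of that channel arithmetic (16 feature images plus the 3 input channels), assuming PyTorch tensors in NCHW layout with an arbitrary 256x256 spatial size:

```python
import torch

input_image = torch.randn(1, 3, 256, 256)    # RGB input image, 3 channels
first_images = torch.randn(1, 16, 256, 256)  # 16 feature images from conv layer C1

# Splicing (concat) does not change the images themselves; it only stacks them.
first_image_group = torch.cat([input_image, first_images], dim=1)
print(first_image_group.shape)  # torch.Size([1, 19, 256, 256])
```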
  • the second convolution layer C2 is used to perform feature extraction on the first image group to obtain a plurality of second images.
  • the second image is a feature image (also referred to as a feature map (Feature Map)) obtained after being processed by the second convolution layer C2.
  • For example, the number of convolution kernels used in the convolutional neural network for performing feature extraction on the first image group (that is, the convolution kernels in the second convolutional layer C2) is M, where 12 ≤ M ≤ 20 and M is an integer.
  • M and N may be equal or unequal; that is, the number of convolution kernels in the second convolutional layer C2 and the number of convolution kernels in the first convolutional layer C1 may be equal or unequal, which may be determined according to actual needs and is not limited by the embodiments of the present disclosure. For example, in some examples, the number of convolution kernels in the first convolutional layer C1 and the number of convolution kernels in the second convolutional layer C2 are both 16, which takes into account the computational load a mobile terminal can undertake while still achieving a good image processing effect.
  • the fusion layer Fu1 may be used to process multiple second images to obtain multiple third images. That is, a plurality of second images and a plurality of first images are fused by using the fusion layer Fu1 to obtain a plurality of third images.
  • the first convolutional layer C1 has 16 convolution kernels and the second convolutional layer C2 also has 16 convolutional kernels
  • the first image is 16 and the second image is also 16
  • the first image and the second image may form a one-to-one correspondence.
  • the fusion process may be, for example, performing an addition operation on the corresponding pixels in the corresponding first image and the second image.
  • It should be noted that step S24 is not limited to using the fusion layer Fu1 to obtain the third images; the second images may also be processed by any other applicable operation layers, such as one or more convolutional layers, one or more splicing layers, or one or more pooling layers, to obtain the third images, which is not limited in the embodiments of the present disclosure.
  • In step S25, the input image and the plurality of third images are spliced by using the second splicing layer P2 to obtain the second image group.
  • stitching processing refers to stitching multiple images using, for example, the concat function.
  • the concat function does not change the content of each image itself, but only returns the stitched copies of multiple images.
  • the second image group includes the input image and a plurality of third images.
  • For example, the second image group includes a total of 19 images, of which 16 are the feature images output by the fusion layer Fu1 and 3 are the three channels of the input image.
  • the output layer OT is used to perform feature extraction on the second image group to obtain an output image.
  • the output layer OT is used to convert the second image group into a multi-channel or single-channel image that meets the requirements.
  • the output image is a feature image obtained after being processed by the output layer OT.
  • For example, when the input image is a three-channel RGB image, the number of convolution kernels used in the convolutional neural network for performing feature extraction on the second image group (that is, the convolution kernels in the output layer OT) is 3, and the number of output images calculated by the output layer OT is also 1, which includes three channels: a red (R) channel output image, a green (G) channel output image, and a blue (B) channel output image. It should be noted that when the input image has only one channel, the number of convolution kernels in the output layer OT is correspondingly 1, so that the obtained output image also has one channel.
  • For example, the sizes of the convolution kernels in the first convolutional layer C1, the second convolutional layer C2, and the output layer OT are all 3×3, which meets the requirements of feature extraction while keeping the computational cost in check.
  • The embodiments of the present disclosure are not limited to this; the convolution kernel size can also be any size such as 4×4 or 5×5, and the sizes of the convolution kernels in the first convolutional layer C1, the second convolutional layer C2, and the output layer OT may be the same or different, which may be determined according to actual requirements and is not limited by the embodiments of the present disclosure.
  • For example, when the input image has three channels (the red channel input image, the green channel input image, and the blue channel input image), the first convolutional layer C1 and the second convolutional layer C2 can each be set to include 16 convolution kernels, the output layer OT includes 3 convolution kernels, and the size of the convolution kernels is 3×3.
  • the image changes from 3 channels to 16 channels (1), and the width and height information remains unchanged.
  • multiple “channels” may refer to multiple feature images.
  • the first stitching layer P1 will link the original image information to the calculation result of the first convolutional layer C1 (2), and pass it as a common input to the second convolutional layer C2.
  • The fusion layer Fu1 fuses the output (3) of the second convolutional layer C2 with the output (4) of the first convolutional layer C1, then links the original image information (5) through the second splicing layer P2, and passes the result to the output layer OT.
  • the output layer OT learns the information of the three RGB channels, and the output of the output layer OT is the RGB image information (6).
  • the output of each layer can be represented as (B, H, W, F).
  • B represents batchsize
  • H represents image height
  • W represents image width
  • F represents the number of feature images.
  • The specific output results of the above stages are: (1) 1*H*W*16, (2) 1*H*W*19, (3) 1*H*W*16, (4) 1*H*W*16, (5) 1*H*W*19, (6) 1*H*W*3.
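  • Putting the stages together, one plausible PyTorch sketch of the network in FIG. 3A is shown below. The 16/16/3 kernel counts, 3×3 kernel sizes, ReLU activations, and element-wise addition for the fusion layer follow the description above; everything else (padding choices, layer naming, NCHW layout) is an assumption rather than a definitive reconstruction of the patented network.

```python
import torch
import torch.nn as nn

class EnhanceNet(nn.Module):
    """Sketch of the FIG. 3A network: C1 -> P1 (concat) -> C2 -> Fu1 (add) -> P2 (concat) -> OT."""

    def __init__(self):
        super().__init__()
        self.c1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())   # stage (1)
        self.c2 = nn.Sequential(nn.Conv2d(19, 16, 3, padding=1), nn.ReLU())  # stage (3)
        self.ot = nn.Sequential(nn.Conv2d(19, 3, 3, padding=1), nn.ReLU())   # stage (6)

    def forward(self, x):                   # x: (B, 3, H, W) RGB input image
        f1 = self.c1(x)                     # (B, 16, H, W) first images
        g1 = torch.cat([x, f1], dim=1)      # (B, 19, H, W) first image group, stage (2)
        f2 = self.c2(g1)                    # (B, 16, H, W) second images
        f3 = f2 + f1                        # (B, 16, H, W) fusion layer Fu1, stage (4)
        g2 = torch.cat([x, f3], dim=1)      # (B, 19, H, W) second image group, stage (5)
        return self.ot(g2)                  # (B, 3, H, W) output image
```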
  • the input image is processed by using the convolutional neural network and an output image is obtained, and the definition of the output image is higher than that of the input image, thereby realizing image enhancement.
  • FIG. 4 is a schematic flowchart of another image processing method provided by some embodiments of the present disclosure.
  • The image processing method provided by this embodiment is basically the same as the image processing method shown in FIG. 2, and the repeated parts are not described again here.
  • For example, before performing steps S10-S20, the image processing method further includes the following operation.
  • Step S30: training the second neural network to be trained based on the pre-trained first neural network to obtain a trained second neural network, thereby obtaining the convolutional neural network.
  • The first neural network has more parameters than the second neural network, and the pre-trained first neural network is configured to transform an original image with a first definition, input into the pre-trained first neural network, into a new image with a second definition, the second definition being greater than the first definition. That is, the first neural network has the function of improving the definition of an image.
  • For example, step S30 may include: based on the pre-trained first neural network, the second neural network to be trained, and a discrimination network, alternately training the discrimination network and the second neural network to obtain the trained second neural network, thereby obtaining the convolutional neural network.
  • the trained second neural network is the convolutional neural network described above.
  • the network structure of the second neural network to be trained is the same as the network structure of the convolutional neural network.
  • The network structure may refer to the number, configuration, and order of the layers in the network, and may also refer to the manner and method of data transmission within the network.
  • the parameters of the second neural network to be trained are different from the parameters of the convolutional neural network. That is, by training the second neural network to be trained and optimizing its parameters, the convolutional neural network described above can be obtained.
  • the parameters of the second neural network and the parameters of the convolutional neural network both include weight parameters in the respective convolutional layers. For example, the larger the absolute value of the weight parameter, the greater the contribution of the neuron corresponding to the weight parameter to the output of the neural network, and the more important it is to the neural network.
  • For example, the pre-trained first neural network can be trained in the following manner: first, obtain ground-truth samples (that is, a high-definition input video including multiple input image frames), and process the ground-truth samples (for example, by adding noise, blurring, etc.) to obtain a corresponding low-definition output video (including multiple output image frames), where the multiple output image frames correspond one-to-one to the multiple input image frames; then, train the untrained first neural network by using each corresponding pair of input image frame and output image frame as a set of training pairs to obtain the trained first neural network. For example, the multiple input image frames and/or multiple output image frames obtained at this time may be used as training samples for the subsequent training of the second neural network.
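  • A hedged sketch of how such low-definition counterparts might be generated from the ground-truth frames is shown below; the blur kernel size, noise level, and JPEG-style re-encoding are illustrative choices, not the patent's recipe.

```python
import cv2
import numpy as np

def degrade(frame, noise_sigma=5.0, blur_ksize=3, jpeg_quality=40):
    """Produce a low-definition counterpart of a ground-truth frame (sketch)."""
    low = cv2.GaussianBlur(frame, (blur_ksize, blur_ksize), 0)               # blurring
    noise = np.random.normal(0, noise_sigma, low.shape).astype(np.float32)
    low = np.clip(low.astype(np.float32) + noise, 0, 255).astype(np.uint8)   # add noise
    _, buf = cv2.imencode(".jpg", low, [cv2.IMWRITE_JPEG_QUALITY, jpeg_quality])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)                               # compression artifacts

# Each (degraded frame, ground-truth frame) pair then forms one training pair.
```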
  • For example, in step S30, sample data is input into the first neural network and the second neural network, and based on the output results of the first neural network and the second neural network, the discrimination network is trained first and then the second neural network is trained. Then, sample data is input into the first neural network and the second neural network again, and based on the output results of the first neural network and the second neural network, the discrimination network is trained first and then the second neural network, and so on alternately.
  • the sample data input in different training stages can be different.
  • the discrimination network and the second neural network are alternately trained, so that the discrimination network and the second neural network play and learn from each other in an adversarial manner, thereby producing better output results.
  • the first neural network may be a larger neural network that has already been trained.
  • the second neural network can be built based on the structure of the convolutional neural network described above, but the parameters still need to be trained.
  • the first neural network is used to train the second neural network, and the first neural network has more parameters than the second neural network.
  • the parameters of the neural network include the weight parameters of each convolutional layer in the neural network. The greater the absolute value of the weight parameter, the greater the contribution of the neuron corresponding to the weight parameter to the output of the neural network, and the more important it is to the neural network.
  • the more parameters a neural network has, the higher its complexity and the greater its "capacity", which means that the neural network can complete more complex learning tasks.
  • the second neural network is simplified, and the second neural network has fewer parameters and a simpler network structure, so that the second neural network occupies less resources (such as computing resources, storage resources, etc.), so it can be applied to lightweight terminals.
  • the second neural network can learn the reasoning ability of the first neural network, so that the second neural network has a simple structure and strong reasoning ability.
  • FIG. 5A is a schematic flowchart of training a discrimination network in an image processing method provided by some embodiments of the present disclosure
  • FIG. 5B is a schematic diagram of the scheme for training the discrimination network shown in FIG. 5A
  • the training scheme of the discrimination network is exemplarily described below with reference to FIG. 5A and FIG. 5B .
  • training the discrimination network includes the following operations.
  • Step S31 input the first sample data into the first neural network NE1 and the second neural network NE2 respectively, to obtain the first data output from the first neural network NE1 and the second data output from the second neural network NE2;
  • Step S32 the first data is set to have a true value label, and the first data with the true value label is input into the discrimination network Disc to obtain a first discrimination result; the second data is set to have a false value label, and the second data with the false value label is input into the discrimination network Disc to obtain a second discrimination result;
  • Step S33 calculate a first loss function based on the first discrimination result and the second discrimination result;
  • Step S34 adjust the parameters of the discrimination network Disc according to the first loss function to obtain an updated discrimination network Disc. A code sketch of one such discriminator update (steps S31-S34) is given below.
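A minimal PyTorch-style sketch of one update of the discrimination network Disc (steps S31-S34) might look as follows. The names `teacher` (for NE1), `student` (for NE2), `disc`, and `lq_batch` are assumptions, and binary cross-entropy is used as the first loss function as described above.

```python
import torch
import torch.nn.functional as F

def train_disc_step(teacher, student, disc, disc_opt, lq_batch):
    """One update of the discrimination network Disc (steps S31-S34)."""
    with torch.no_grad():                       # S31: both generators are frozen here
        first_data = teacher(lq_batch)          # output of NE1, labelled "true"
        second_data = student(lq_batch)         # output of NE2, labelled "false"

    pred_real = disc(first_data)                # S32: discriminate teacher output
    pred_fake = disc(second_data)               #      discriminate student output

    # S33: cross-entropy (BCE) loss against the labels 1 and 0.
    loss = F.binary_cross_entropy(pred_real, torch.ones_like(pred_real)) + \
           F.binary_cross_entropy(pred_fake, torch.zeros_like(pred_fake))

    disc_opt.zero_grad()                        # S34: update only Disc's parameters
    loss.backward()
    disc_opt.step()
    return loss.item()
```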
  • the first sample data may be image data obtained based on video, and both the first data and the second data are image data.
  • the first sample data may also be image data obtained in other ways, which is not limited in the embodiment of the present disclosure.
  • the first sample data is image data obtained based on a plurality of videos with the same bit rate. Therefore, the convolutional neural network trained with the first sample data can have better processing capability and a better processing effect on image data of videos with that bit rate, so that the convolutional neural network is strongly targeted at that bit rate.
  • the original video can be compressed into versions with different bit rates of video quality, and Gaussian noise and quantum noise can be randomly added to form a low-quality video; video frames can then be extracted from the low-quality video to obtain the first sample data, that is, the first sample data are video frames of the low-quality video.
  • the first neural network NE1 may be an already trained, larger neural network with the function of improving the definition.
  • the first neural network NE1 includes multiple levels of down-sampling units and corresponding multiple levels of up-sampling units; the output of each level of down-sampling unit is used as the input of the next-level down-sampling unit, and the input of each level of up-sampling unit includes the output of the down-sampling unit corresponding to that up-sampling unit and the output of the previous-level up-sampling unit.
  • the neural network performs multiple rounds of feature extraction on the image by using the multiple down-sampling units, performs multiple rounds of up-sampling on the image by using the multiple up-sampling units, and inputs the feature image output by each down-sampling unit into the corresponding up-sampling unit. In this way, the feature information in the image can be effectively captured and the clarity of the image can be improved.
  • the first neural network NE1 may adopt the network structure shown in FIG. 7A, which is described later and will not be repeated here. It should be noted that the first neural network NE1 may be any trained neural network, or any combination of neural networks, which may be determined according to actual requirements and is not limited by the embodiments of the present disclosure.
  • the first neural network NE1 is used to improve the clarity of the sample data, so that the trained convolutional neural network also has the function of improving the clarity.
  • improving sharpness may be embodied as denoising and/or deblurring to achieve image enhancement.
  • the second neural network NE2 is established based on the network structure of the convolutional neural network described above, that is, the second neural network NE2 has the same network structure as the convolutional neural network described above, but its parameters still need to be trained and corrected.
  • the true value label may be represented as [1] and the false value label may be represented as [0].
  • the first data with the true value label is input into the discrimination network Disc to obtain the first discrimination result, and the second data with the false value label is input into the discrimination network Disc to obtain the second discrimination result.
  • the discriminant network Disc may employ a convolutional neural network model in a stack mode.
  • FIG. 5C is a discriminant network, such as a convolutional neural network model in a cascade mode.
  • the discriminant network Disc includes multiple operation layers, each operation layer is composed of a convolution layer and an activation layer (using the ReLU function), and finally outputs the result through a fully connected layer FC.
  • the convolution kernel sizes of the convolution layers in the four operation layers are all 3×3, and the numbers of feature images output by the four operation layers are 32, 64, 128, and 192, respectively.
  • the final output of the discrimination network Disc is a binary-classification probability value, that is, the discrimination network Disc outputs a value between 0 and 1. The closer the value is to 1, the greater the probability that the input is judged to be true; conversely, the closer the value is to 0, the greater the probability that the input is judged to be false. A sketch of such a discriminator is given below.
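As an illustration, a discriminator matching the description above (four 3×3 conv+ReLU operation layers with 32/64/128/192 feature maps, followed by a fully connected layer and a sigmoid) could be sketched in PyTorch as follows; the stride and the global average pooling before the fully connected layer are assumptions, since they are not specified above.

```python
import torch
import torch.nn as nn

class Disc(nn.Module):
    """Cascade-mode CNN discriminator: four conv+ReLU operation layers
    (3x3 kernels, 32/64/128/192 feature maps) followed by a fully
    connected layer that outputs a probability between 0 and 1."""
    def __init__(self, in_channels: int = 3):
        super().__init__()
        chans = [in_channels, 32, 64, 128, 192]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            # Stride 2 is an assumption (not stated above); it shrinks the
            # feature maps so the final fully connected layer stays small.
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)   # assumption: keeps FC input-size independent
        self.fc = nn.Linear(192, 1)

    def forward(self, x):
        x = self.features(x)
        x = self.pool(x).flatten(1)
        return torch.sigmoid(self.fc(x))      # binary-classification probability
```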
  • a first loss function is calculated based on the first discrimination result and the second discrimination result, and the first loss function is the loss function of the discrimination network Disc.
  • the first loss function may employ a cross-entropy loss function that characterizes the difference between true sample labels and predicted probabilities.
  • the cross-entropy loss function can be written as: L = -(1/N) * sum_{i=1..N} [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ], where N is the number of data items (that is, the total number of the first data and the second data), y_i is the label corresponding to each data item (that is, 0 or 1), and p_i is the value predicted by the discrimination network for each data item.
  • the loss function of the discriminating network Disc can adopt any type of function, and is not limited to the cross-entropy loss function, which can be determined according to actual needs, and the embodiment of the present disclosure does not limit this.
  • step S34 the parameters of the discriminating network Disc are adjusted according to the first loss function to obtain an updated discriminating network Disc, which has better discriminating ability.
  • FIG. 6A is a schematic flowchart of training a second neural network in an image processing method provided by some embodiments of the present disclosure
  • FIG. 6B is a schematic diagram of a scheme for training the second neural network shown in FIG. 6A .
  • the training scheme of the second neural network is exemplarily described below with reference to FIG. 6A and FIG. 6B .
  • training the second neural network includes the following operations.
  • Step S35 input the second sample data into the first neural network NE1 and the second neural network NE2 respectively, to obtain the third data output from the first neural network NE1 and the fourth data output from the second neural network NE2;
  • Step S36 the fourth data is set to have a true value label, and the fourth data with the true value label is input into the updated discrimination network Disc to obtain a third discrimination result output from the discrimination network Disc;
  • Step S37 an error function is calculated based on the third data and the fourth data, a discrimination function is calculated based on the third discrimination result, and a second loss function is calculated based on the error function and the discrimination function;
  • Step S38 Adjust the parameters of the second neural network NE2 according to the second loss function to obtain the updated second neural network NE2.
  • steps S35-S38 may be performed after steps S31-S34 are performed.
  • the second sample data may be image data obtained based on video, and the third data and the fourth data are both image data.
  • the second sample data may also be image data obtained in other manners, which is not limited in the embodiment of the present disclosure.
  • the second sample data is image data obtained based on a plurality of videos having the same bit rate. Therefore, the convolutional neural network trained with the second sample data can have better processing capability and a better processing effect on image data of videos with that bit rate, so that the convolutional neural network is strongly targeted at that bit rate.
  • the original video can be compressed into versions with different bit rates of video quality, and Gaussian noise and quantum noise can be randomly added to form a low-quality video; video frames can then be extracted from the low-quality video to obtain the second sample data, that is, the second sample data are video frames of the low-quality video.
  • the first sample data and the second sample data may be the same or different.
  • in step S36, the fourth data is set to have a true value label (for example, represented as [1]), and the fourth data with the true value label is input into the discrimination network Disc to obtain the third discrimination result output from the discrimination network Disc. For example, the numerical range of the third discrimination result is 0-1. It should be noted that the discrimination network Disc at this time is the discrimination network updated after being trained in the above steps S31-S34.
  • an error function is calculated based on the third data and the fourth data.
  • the error function may take the mean absolute error (L1 loss).
  • the discrimination function D2 is calculated based on the third discrimination result. For example, any applicable method may be used to calculate the discrimination function D2, which is not limited in the embodiment of the present disclosure.
  • a second loss function is calculated based on the error function (for example, the mean absolute error) and the discrimination function D2, and the second loss function is the loss function of the second neural network NE2.
  • the weight W1 of the error function is 90 to 110 (for example, 100)
  • the weight of the discrimination function D2 is 0.5 to 2 (for example, 1).
  • in step S38, the parameters of the second neural network NE2 are adjusted according to the second loss function to obtain the updated second neural network NE2. A code sketch of one such update of the second neural network (steps S35-S38) is given below.
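A minimal sketch of one update of the second neural network NE2 (steps S35-S38) is given below, with the error-function weight 100 and discrimination-function weight 1 mentioned above. Treating the discrimination function D2 as a binary cross-entropy term against a true label is only one possible choice, since the exact form of D2 is left open above; the names `teacher`, `student`, and `disc` are assumptions.

```python
import torch
import torch.nn.functional as F

def train_student_step(teacher, student, disc, student_opt, lq_batch,
                       w_err: float = 100.0, w_adv: float = 1.0):
    """One update of the second neural network NE2 (steps S35-S38)."""
    with torch.no_grad():
        third_data = teacher(lq_batch)            # S35: output of NE1
    fourth_data = student(lq_batch)               #      output of NE2

    # S36: feed the student output (labelled "true") to the updated Disc.
    pred = disc(fourth_data)

    # S37: error function (L1 / mean absolute error) + discrimination function.
    err = F.l1_loss(fourth_data, third_data)
    adv = F.binary_cross_entropy(pred, torch.ones_like(pred))  # one choice for D2
    loss = w_err * err + w_adv * adv              # weighted sum, e.g. 100 and 1

    student_opt.zero_grad()                       # S38: update only NE2's parameters
    loss.backward()
    student_opt.step()
    return loss.item()
```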
  • in this way, one adversarial training of the discrimination network Disc and the second neural network NE2 can be completed in an alternating manner.
  • each alternate training is carried out on the basis of the update from the previous training; that is, when training the second neural network NE2, the discrimination network Disc updated in the previous training is used, and when training the discrimination network Disc, the second neural network NE2 updated in the previous training is used.
  • one or more adversarial trainings for the discriminating network Disc and the second neural network NE2 can be performed alternately. Through optimization and iteration, the image processing capability of the trained second neural network NE2 can meet the requirements.
  • a total of about 20 million alternate training iterations can be performed to obtain a second neural network NE2 that meets the requirements.
  • the parameters of the trained second neural network NE2 (such as weight parameters in the convolution layer) have been optimized and corrected, and the trained second neural network NE2 is the convolutional neural network described above.
  • in one adversarial training, the first sample data and the second sample data used may be the same, that is, the same sample data is used to complete one training of the discrimination network Disc and one training of the second neural network NE2.
  • in different adversarial trainings, the first sample data used may be different from the first sample data used in the previous adversarial training, and the second sample data used may be different from the second sample data used in the previous adversarial training. In this way, training efficiency can be improved, the training method can be simplified, and data utilization can be improved.
  • the convolutional neural network that meets the requirements can be quickly trained, the training effect is good, and the trained convolutional neural network can have a strong image processing capability and a better image processing effect.
  • the adopted first neural network NE1 has the function of denoising and/or deblurring
  • the convolutional neural network obtained by training also has the function of denoising and/or deblurring, and the denoising and/or deblurring effect is very good.
  • the image processing method may further include more steps, and the execution order of each step may be adjusted according to actual requirements, which is not limited by the embodiment of the present disclosure.
  • the first neural network NE1 may be any trained neural network; for example, a larger neural network with a denoising function or a larger neural network with a deblurring function may be used, or a combination of a larger neural network with a denoising function and a larger neural network with a deblurring function may be used.
  • FIG. 7A is a neural network with denoising function.
  • the neural network includes a multi-level down-sampling unit and a corresponding multi-level up-sampling unit, and the multi-level down-sampling unit is in one-to-one correspondence with the multi-level up-sampling unit.
  • the unit on the left is a down-sampling unit
  • the unit on the right is an up-sampling unit.
  • the output of each level of down-sampling unit is used as the input of the next-level down-sampling unit, and the input of each level of up-sampling unit includes the output of the down-sampling unit corresponding to that up-sampling unit and the output of the previous-level up-sampling unit. That is, the output of a down-sampling unit is provided not only to the adjacent next-level down-sampling unit, but also to the up-sampling unit corresponding to that down-sampling unit.
  • each down-sampling unit includes operation layers such as a Conv2d convolution layer, a ReLU activation function, and a Conv2d layer that down-samples by a factor of 2
  • each up-sampling unit includes operation layers such as a Conv2d convolution layer, a ReLU activation function, and a Conv2d layer that up-samples by a factor of 2.
  • the neural network has a U-shaped symmetric structure, the left side is mainly a downsampling unit, and the right side is mainly an upsampling unit.
  • in this neural network, the multiple down-sampling units and the multiple up-sampling units correspond to each other, and the feature image output by each down-sampling unit on the left is input into the corresponding up-sampling unit on the right, so that the feature maps obtained at each level are effectively used in subsequent calculations.
  • the 2× down-sampling Conv2d uses a (3, 3) convolution kernel with stride 2, that is, it down-samples by a factor of 2, and its output size is (B, H//2, W//2, F).
  • the 2× up-sampling Conv2d uses a (4, 4) convolution kernel with stride 2, that is, it up-samples by a factor of 2, and its output size is (B, H, W, F).
  • the parameters of the convolution kernel are shared, that is, the parameters of all the Conv2d convolution kernels in the figure are the same.
  • five images at different scales, F1, F2, F3, F4, and F5, are output respectively.
  • the sizes of these 5 images of different scales are: F1(H, W, 3), F2(H//2, W//2, 3), F3(H//4, W//4, 3), F4 (H//8, W//8, 3), F5 (H//16, W//16, 3).
  • the loss function is calculated from the outputs F1, F2, F3, F4, and F5 at the different scales together with the corresponding true values.
  • GT1 is the true value.
  • BICUBIC down-sampling is performed on GT1 to obtain GT2 (H//2, W//2, 3), GT3 (H//4, W//4, 3), GT4 (H//8, W//8, 3), and GT5 (H//16, W//16, 3).
  • when the neural network is used after training, only the parameter-sharing Conv2d convolution layer at the final output F1 is used, and F2, F3, F4, and F5 are no longer output; that is, the convolution layers at F2, F3, F4, and F5 are not used. A much-simplified sketch of this kind of down-sampling/up-sampling network with skip connections is given below.
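The sketch below is a two-level illustration only: a stride-2 Conv2d for down-sampling, a 4×4 stride-2 transposed convolution for up-sampling, and a skip connection feeding the down-path features into the up-sampling path. The real FIG. 7A network has more levels, shares the parameters of its output convolutions, and supervises the multi-scale outputs F1-F5 during training; those details are omitted here, and the class name and channel count are assumptions.

```python
import torch.nn as nn

class TinyUNet(nn.Module):
    """Two-level sketch of a FIG. 7A style network: stride-2 Conv2d for
    down-sampling, stride-2 ConvTranspose2d (4x4 kernel) for up-sampling,
    and a skip connection from the down path to the up path."""
    def __init__(self, ch: int = 32):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(True))
        self.down = nn.Sequential(nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(True))
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(True))
        self.up = nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1)
        self.out = nn.Conv2d(ch, 3, 3, padding=1)   # output convolution (F1 branch)

    def forward(self, x):
        f0 = self.head(x)               # features at full resolution
        f1 = self.body(self.down(f0))   # features at 1/2 resolution (assumes even H, W)
        u1 = self.up(f1)                # back to full resolution
        return self.out(u1 + f0)        # skip connection from the down path
```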
  • FIG. 7B is a neural network with deblurring function.
  • the neural network includes multiple functional layer groups with different numbers of convolution kernels, and each functional layer group includes one or more Conv convolution layers, DConv convolution layers, reblocking layers, and the like.
  • skip connections are established between different functional layer groups. By setting skip connections, the problem of gradient vanishing when the number of network layers is deep can be alleviated, and at the same time the back-propagation of gradients is helped, which speeds up the training process.
  • for this neural network, reference may be made to conventional designs of neural networks that apply skip connections and reblocking layers, which will not be described in detail here.
  • the convolutional neural network obtained by training can process the input image to improve the clarity of the image, but is not limited to having only denoising and/or deblurring functions.
  • the network may also have other arbitrary functions, and it is only necessary to use the first neural network NE1 with corresponding functions and use the first neural network NE1 to train the second neural network NE2.
  • FIG. 8A is a schematic flowchart of training a second neural network in another image processing method provided by some embodiments of the present disclosure
  • FIG. 8B is a schematic diagram of a scheme for training the second neural network shown in FIG. 8A
  • the training method shown in Figure 8A and Figure 8B can be used to train a convolutional neural network.
  • the discriminant network is no longer used in this example, And only the first neural network NE1 is used to train the second neural network NE2 to obtain the required convolutional neural network.
  • the training scheme of the second neural network will be exemplarily described below with reference to FIG. 8A and FIG. 8B .
  • training the second neural network includes the following operations.
  • Step S41 input the third sample data into the first neural network NE1 and the second neural network NE2, respectively, to obtain the fifth data output from the first neural network NE1 and the sixth data output from the second neural network NE2;
  • Step S42 Calculate a third loss function based on the fifth data and the sixth data
  • Step S43 Adjust the parameters of the second neural network NE2 according to the third loss function to obtain the updated second neural network NE2.
  • Video frames are extracted from the processed video data set, and the obtained video frames constitute sample data.
  • for a video data set with better definition, the AIM data set can be used; the AIM data set contains 240 videos of 1280*720, and each video has 100 frames.
  • the first neural network NE1 includes two larger neural networks NE1a and NE1b arranged in sequence, and NE1a and NE1b are, for example, the neural network shown in FIG. 7A and the neural network shown in FIG. 7B , respectively.
  • the order of setup of the two neural networks is not restricted.
  • the third sample data is input into the first neural network NE1; one neural network NE1a in the first neural network NE1 processes the third sample data and then inputs the processing result into the other neural network NE1b in the first neural network NE1, and the other neural network NE1b processes the received image and outputs the processing result as the output of the first neural network NE1.
  • the fifth data output by the first neural network NE1 is subjected to both denoising and deblurring processing, and the first neural network NE1 has both denoising and deblurring functions.
  • the fifth data is used as the ground truth image.
  • the third sample data is input into the second neural network NE2, and the sixth data output by the second neural network NE2 is used as a false value image.
  • a third loss function is calculated based on the fifth data and the sixth data, and the third loss function is used for back-propagation to adjust the parameters of the second neural network NE2, thereby obtaining the updated second neural network NE2, that is, obtaining the desired convolutional neural network.
  • the third loss function may take the weighted sum of the mean absolute error (L1 loss) and the Sobel error (Sobel loss).
  • the weight of the mean absolute error (L1 loss) may be set to 0.5-1.5 (e.g., 1), and the weight of the Sobel error (Sobel loss) may be set to 1.5-2.5 (e.g., 2), so that a better training effect can be obtained.
  • the Sobel error (Sobel loss) is based on the Sobel operator, which is one of the most important operators in pixel-level image edge detection.
  • by convolving the image with the horizontal and vertical Sobel kernels, approximations of the horizontal and vertical luminance differences can be obtained respectively. Gx and Gy represent the approximate values of the gray-level partial derivatives in the horizontal and vertical directions, respectively, and A represents the image; for the standard Sobel operator, Gx = [[-1, 0, +1], [-2, 0, +2], [-1, 0, +1]] * A and Gy = [[-1, -2, -1], [0, 0, 0], [+1, +2, +1]] * A, where * denotes convolution. From the gradients in the x and y directions, the estimated value of the gradient can be calculated as G = sqrt(Gx^2 + Gy^2).
  • a threshold Gmax can be defined: if G is smaller than Gmax, the point can be considered to be a boundary value, and the point is retained and set to white; otherwise, the point is set to black. From this, the image gradient information is obtained.
  • the fifth data and the sixth data are converted to the gray domain, their Sobel gradients are calculated respectively, and the mean difference between the gradient maps is calculated as the loss value for back-propagation; a code sketch of such a Sobel loss is given below.
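The sketch below combines the L1 term and the Sobel-gradient term with the example weights 1 and 2 mentioned above. Converting to the gray domain by simple channel averaging is an assumption; the text only states that the data are transferred to the gray domain.

```python
import torch
import torch.nn.functional as F

# Standard 3x3 Sobel kernels for the horizontal (Gx) and vertical (Gy) gradients.
_KX = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
_KY = torch.tensor([[-1., -2., -1.], [0., 0., 0.], [1., 2., 1.]]).view(1, 1, 3, 3)

def sobel_grad(gray):
    """gray: (B, 1, H, W) tensor in the gray domain; returns the gradient map G."""
    gx = F.conv2d(gray, _KX.to(gray), padding=1)
    gy = F.conv2d(gray, _KY.to(gray), padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-12)   # G = sqrt(Gx^2 + Gy^2)

def third_loss(fifth_data, sixth_data, w_l1: float = 1.0, w_sobel: float = 2.0):
    """Weighted sum of the L1 loss and the Sobel loss between NE1 and NE2 outputs."""
    l1 = F.l1_loss(sixth_data, fifth_data)
    # Convert RGB to a simple luminance (gray) image before taking gradients.
    to_gray = lambda x: x.mean(dim=1, keepdim=True)
    sobel = F.l1_loss(sobel_grad(to_gray(sixth_data)), sobel_grad(to_gray(fifth_data)))
    return w_l1 * l1 + w_sobel * sobel
```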
  • the trained convolutional neural network can learn denoising and deblurring functions at the same time, and can restore the clarity of video frames well, while maintaining image information.
  • the convolutional neural network that meets the requirements can be quickly trained with less sample data.
  • At least one embodiment of the present disclosure also provides a terminal device, which can process an input video based on a convolutional neural network to improve picture definition, realize real-time picture quality enhancement, and have good processing effect and high processing efficiency.
  • FIG. 9A is a schematic block diagram of a terminal device according to some embodiments of the present disclosure.
  • the terminal device 100 includes a processor 110 .
  • the processor 110 is configured to: acquire an input video bit rate and an input video, where the input video includes a plurality of input image frames; and select, according to the input video bit rate, a video processing method corresponding to the input video bit rate to process at least one input image frame, so as to obtain at least one output image frame.
  • the definition of at least one output image frame is higher than that of at least one input image frame, and different input video bit rates correspond to different video processing methods.
  • the above video processing method includes: processing at least one input image frame based on a trained neural network to obtain at least one output image frame.
  • for the trained neural network, for example, the convolutional neural network shown in FIG. 3A can be used to process the input image frame.
  • the neural network has denoising and/or deblurring functions, which can effectively improve the clarity of the image and has a good detail restoration effect, enabling image enhancement.
  • the image processing method provided by any embodiment of the present disclosure may be used to implement the processing of the input image frame.
  • processing at least one input image frame based on the trained neural network to obtain at least one output image frame may include the following operations: performing feature extraction on the at least one input image frame to obtain a plurality of first output images; performing splicing processing on the at least one input image frame and the plurality of first output images to obtain a first output image group, where the first output image group includes the at least one input image frame and the plurality of first output images; performing feature extraction on the first output image group to obtain a plurality of second output images; fusing the plurality of second output images and the plurality of first output images to obtain a plurality of third output images; performing splicing processing on the at least one input image frame and the plurality of third output images to obtain a second output image group, where the second output image group includes the at least one input image frame and the plurality of third output images; and performing feature extraction on the second output image group to obtain the at least one output image frame.
  • the above steps of processing the input image frame are basically the same as the steps shown in FIG. 3B , and the relevant description can refer to the foregoing content, which will not be repeated here.
  • the trained neural networks corresponding to different video processing methods are different. That is, different input video bit rates correspond to different video processing methods, different video processing methods correspond to different neural networks, and input videos with different input video bit rates are processed by different neural networks.
  • the neural network is bit rate specific. By distinguishing the bit rate, the neural network can have a better processing effect for the video corresponding to the bit rate.
  • input videos of different bit rates are processed by different video processing methods (or different neural networks), so that the output videos obtained after processing the input videos of different bit rates have more consistent definition, so that the processing effect is not Affected by the code rate, the performance stability and consistency of the terminal device 100 are improved.
  • the trained neural networks corresponding to different video processing methods are obtained by training with different sample data sets, and the different sample data sets are obtained based on different video sets; each video set includes multiple videos, videos in the same video set have the same bit rate, and videos in different video sets have different bit rates. That is, for the same neural network, the sample data sets used for training come from videos with the same bit rate; for different neural networks, the sample data sets used for training come from videos with different bit rates.
  • the bit rate of video refers to the number of bits of data transmitted per unit of time during data transmission, and the unit is usually kbps, that is, kilobits per second.
  • FIG. 9B is a schematic block diagram of another terminal device provided by some embodiments of the present disclosure.
  • the terminal device 200 is, for example, implemented as a terminal video processor 200, and includes a hardware development board 210 on which a software development package 211 is deployed, and the hardware development board 210 includes a central processing unit 212 and a Neural network processor 213 .
  • the software development kit 211 may be implemented as a collection of programs and related files with a general interface or a custom interface, that is, a software development kit (Software Development Kit, SDK).
  • the software development kit 211 can be deployed in the on-chip memory (ROM) of the hardware development board 210, for example, and read from the ROM at runtime.
  • the software development package 211 includes a plurality of neural network modules UN, the plurality of neural network modules UN are obtained respectively based on a plurality of convolutional neural networks, and the plurality of neural network modules UN are in one-to-one correspondence with the plurality of convolutional neural networks.
  • the convolutional neural network can be the convolutional neural network shown in FIG. 3A .
  • the convolutional neural network has the functions of denoising and/or deblurring, which can effectively improve the clarity of the image and has a good effect of repairing details. Image enhancement can be achieved.
  • multiple neural network modules UN are obtained based on parameter quantization of multiple convolutional neural networks.
  • the data type can be converted from a 32-bit floating point number (float32) to an 8-bit integer (int8), thereby realizing parameter quantization, so as to effectively save computing power, so that the terminal video processor 200 can support the operation of the neural network module UN.
  • a convolutional neural network can be obtained by training at float32 precision (for example, using the training method shown in FIGS. 5A-6B), and then the parameters of the trained convolutional neural network are quantized and the data type is converted to int8, thereby obtaining the neural network module UN; a minimal sketch of such weight quantization is given below.
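The sketch below only illustrates the basic float32-to-int8 idea with symmetric per-tensor quantization of the weights; real deployments typically also quantize activations and calibrate scales on sample data, and the helper names are assumptions.

```python
import numpy as np

def quantize_weights_int8(w: np.ndarray):
    """Symmetric per-tensor quantization of float32 weights to int8.

    Returns the int8 weights plus the scale needed to dequantize
    (w is approximately q * scale).
    """
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, np.float32(scale)

def dequantize(q: np.ndarray, scale: np.float32) -> np.ndarray:
    """Recover an approximate float32 tensor from the int8 weights."""
    return q.astype(np.float32) * scale

# w_fp32 = a trained convolution kernel, e.g. of shape (16, 3, 3, 3)
# q, s = quantize_weights_int8(w_fp32)
```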
  • the obtained neural network module UN has the same function as the convolutional neural network. Although the processing effect of the neural network module UN differs slightly from that of the convolutional neural network due to parameter quantization, the difference is difficult for the human eye to detect, and the quality loss can be ignored.
  • the amount of computation and the amount of data can be effectively reduced, so that the neural network module UN is suitable for running on the terminal video processor 200, and the amount of computing power and memory can be saved.
  • the output of the neural network module UN can still better maintain the image quality when the parameters are compressed by about 300 times, so the neural network module UN can be deployed to the terminal video processor 200
  • different neural network modules UN are used to process video data with different bit rates.
  • multiple convolutional neural networks as shown in Figure 3A can be pre-trained with the same structure but different parameters.
  • different convolutional neural networks are trained with different sample data sets, and different sample data sets are obtained based on different video sets.
  • each video set includes multiple videos; the videos in the same video set have the same bit rate, and the videos in different video sets have different bit rates. That is, for the same convolutional neural network, the sample data sets used for training come from videos with the same bit rate; for different convolutional neural networks, the sample data sets used for training come from videos with different bit rates.
  • the bit rate of video refers to the number of bits of data transmitted per unit of time during data transmission, and the unit is usually kbps, that is, kilobits per second.
  • the higher the bit rate, the smaller the compression ratio of the video, the smaller the loss of image quality, the smaller the noise in the image, and the closer it is to the original video.
  • the lower the bit rate, the noisier the image.
  • the denoising strength of the neural network corresponding to the low code rate is large, and the denoising strength of the neural network corresponding to the high code rate is small. It should be noted that "high” and "low” in the above description are relative, that is, high code rate and low code rate are relative.
  • the multiple convolutional neural networks obtained by training have bit-rate specificity and have a better processing effect for videos of the corresponding bit rates.
  • the multiple convolutional neural networks corresponding to several common video bit rates can be trained.
  • the parameters of these convolutional neural networks are quantized, and the obtained multiple neural network modules UN also have code rate specificity. Different neural network modules UN are used to process video data with different code rates to achieve better processing effects.
  • the central processing unit 212 is configured to: call the software development package 211, select one neural network module UN among the plurality of neural network modules UN according to the bit rate of the input video, and control the neural network processor 213 to process the input video based on the selected neural network module UN, so as to enhance the clarity of the input video.
  • a plurality of neural network modules UN are used to process video data with different bit rates, and each neural network module UN is used to process video data of a corresponding bit rate. According to the bit rate of the input video, the corresponding neural network module UN is selected and used to process the input video, thereby improving the clarity of the input video and realizing real-time image quality enhancement.
  • the hardware development board 210 may be an ARM development board, and accordingly, the central processing unit 212 may be a central processing unit (CPU) of an ARM architecture.
  • the neural network processor 213 may be a neural network processor (Neural-Network Processing Unit, NPU) suitable for an ARM development board, which adopts a data-driven parallel computing architecture and is suitable for processing massive multimedia data such as videos and images.
  • NPU neural network processor
  • the types and hardware structures of the hardware development board 210 , the central processing unit 212 and the neural network processor 213 are not limited, and any applicable hardware may be used, which may be determined according to actual needs , as long as the corresponding functions can be implemented, which is not limited by the embodiments of the present disclosure.
  • the hardware development board 210 may use an Android system or a Linux system for hardware integration, or may use an operating system in a smart TV or a TV set-top box for integration, which is not limited in the embodiments of the present disclosure.
  • the terminal video processor 200 may be implemented as a component in a TV terminal (such as a smart TV), as a TV set-top box, as a component in a video playback device, or as a component in any other form.
  • the terminal device 200 may further include more components, structures and modules, and is not limited to the situation shown in FIG. 9A and FIG. 9B , so as to achieve more comprehensive functions.
  • a selection module may also be deployed in the terminal device 200, and the selection module is configured to select the corresponding neural network module UN according to the bit rate of the input video, so as to use the selected neural network module UN to process the input video .
  • a video decoding module and a video encoding module may also be deployed in the terminal device 200 , the video decoding module is used for performing decoding operations, and the video encoding module is used for performing encoding operations.
  • At least one neural network module UN among the plurality of neural network modules UN is configured to perform the following processing: receiving an input image; processing the input image to obtain an output image.
  • processing the input image to obtain the output image includes: performing feature extraction on the input image to obtain a plurality of first images; performing splicing processing on the input image and the plurality of first images to obtain a first image group, wherein the first image The group includes an input image and a plurality of first images; feature extraction is performed on the first image group to obtain a plurality of second images; a plurality of second images and a plurality of first images are fused to obtain a plurality of third images; The input image and a plurality of third images are spliced to obtain a second image group, wherein the second image group includes the input image and a plurality of third images; and feature extraction is performed on the second image group to obtain an output image.
  • the neural network module UN is used to implement, for example, the processing described above. A minimal sketch of a network with this structure is given below.
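The sketch below follows the structure described above (N = M = 16 convolution kernels of size 3×3 with ReLU for the first two feature extractions, 3 kernels for the final extraction, and splicing of the input with the intermediate feature images). Treating the fusion of the second and first images as element-wise addition is an assumption, since the exact fusion operation is not spelled out here, and the class name is hypothetical.

```python
import torch
import torch.nn as nn

class EnhanceNet(nn.Module):
    """Sketch of the FIG. 3A style network: N = M = 16, 3x3 kernels, ReLU.
    The 'fusion' of the second and first feature images is assumed to be
    element-wise addition."""
    def __init__(self, n: int = 16):
        super().__init__()
        self.conv1 = nn.Conv2d(3, n, 3, padding=1)        # first feature extraction
        self.conv2 = nn.Conv2d(3 + n, n, 3, padding=1)    # works on input + first images
        self.conv3 = nn.Conv2d(3 + n, 3, 3, padding=1)    # final extraction, 3 kernels
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        first = self.relu(self.conv1(x))         # plurality of first images
        group1 = torch.cat([x, first], dim=1)    # first image group (splicing)
        second = self.relu(self.conv2(group1))   # plurality of second images
        third = second + first                   # fusion -> plurality of third images
        group2 = torch.cat([x, third], dim=1)    # second image group (splicing)
        return self.conv3(group2)                # output image
```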
  • FIG. 10A is a data flow diagram of a terminal device provided by some embodiments of the present disclosure
  • FIG. 10B is an operation flowchart of a terminal device provided by some embodiments of the present disclosure.
  • FIG. 10A and FIG. 10B after the software development kit 211 is started, multiple neural network modules UN can be imported first, and then the video frame and the bit rate information of the video can be read.
  • the neural network module UN corresponding to the bit rate is selected for processing, and the result is finally output.
  • the input video may be processed in the following manner.
  • the input video (eg, video file or video stream) is decoded to obtain video frames.
  • the terminal video processor 200 can realize image quality enhancement of offline video
  • the terminal video processor 200 can also realize image quality enhancement of real-time live video.
  • the video frame is read in, and the processing of the video frame is started.
  • the neural network module UN corresponding to the bit rate is selected among the plurality of neural network modules UN, and the selected neural network module UN is then used to perform model inference, that is, to perform image processing (such as denoising and/or deblurring) to obtain the processed video frame.
  • the definition of the processed video frame is improved.
  • the processed video frames are then encoded and output to a display, which displays the processed video; a minimal sketch of this decode-infer-encode loop is given below.
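The sketch below is illustrative only: the bit rate is passed in as a parameter rather than read from the container, `modules` is a hypothetical mapping from supported bit rates to neural network modules UN (callables on frames), and OpenCV is used for decoding and encoding.

```python
import cv2

def enhance_video(in_path: str, out_path: str, bitrate_kbps: int, modules: dict):
    """Decode -> select the bit-rate specific module UN -> infer -> encode."""
    # Pick the module trained for the closest supported bit rate.
    un = modules[min(modules, key=lambda r: abs(r - bitrate_kbps))]

    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))

    ok, frame = cap.read()
    while ok:
        writer.write(un(frame))       # model inference on the decoded frame
        ok, frame = cap.read()
    cap.release()
    writer.release()
```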
  • FIG. 11A is a schematic diagram of a video picture, and the video picture is a video frame extracted from a video of 1M bit rate and standard definition 360p.
  • FIG. 11B is an effect diagram after the picture shown in FIG. 11A is processed by applying the terminal device provided by the embodiments of the present disclosure. Comparing FIGS. 11A and 11B, it can be seen that, through the processing of the terminal device (for example, the terminal video processor 200), the picture definition is improved and the picture quality is enhanced; the terminal video processor 200 has a good picture quality enhancement capability and can realize real-time processing.
  • the terminal device can remove noise of low-quality images and videos, improve picture definition, and realize picture quality enhancement.
  • the structure of the neural network module UN deployed on the terminal video processor 200 is simple, so the computing power of the device can be saved, the processing efficiency can be improved, the module can be supported by the hardware capability of the terminal device, and the processing speed requirement of the video stream can be satisfied, so as to achieve real-time image quality enhancement.
  • At least one embodiment of the present disclosure further provides an image processing apparatus.
  • the convolutional neural network adopted by the image processing apparatus has a simple structure, can save equipment computing power, can remove noise from low-quality images and videos, and improve picture clarity. Real-time image quality enhancement is realized for easy application to terminal devices.
  • the image processing apparatus provided by at least one embodiment also has a better neural network training effect.
  • FIG. 12 is a schematic block diagram of an image processing apparatus according to some embodiments of the present disclosure.
  • the image processing apparatus 300 includes a processor 310 and a memory 320 .
  • Memory 320 is used to store non-transitory computer readable instructions (eg, one or more computer program modules).
  • the processor 310 is configured to execute non-transitory computer-readable instructions, and when the non-transitory computer-readable instructions are executed by the processor 310, one or more steps in the image processing method described above can be executed.
  • Memory 320 and processor 310 may be interconnected by a bus system and/or other form of connection mechanism (not shown).
  • processor 310 may be a central processing unit (CPU), digital signal processor (DSP), or other form of processing unit with data processing capabilities and/or program execution capabilities, such as a field programmable gate array (FPGA), etc.;
  • the central processing unit (CPU) may be an X86 or ARM architecture or the like.
  • the processor 310 may be a general-purpose processor or a special-purpose processor, and may control other components in the image processing apparatus 300 to perform desired functions.
  • memory 320 may include any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or nonvolatile memory.
  • Volatile memory may include, for example, random access memory (RAM) and/or cache memory, among others.
  • Non-volatile memory may include, for example, read only memory (ROM), hard disk, erasable programmable read only memory (EPROM), portable compact disk read only memory (CD-ROM), USB memory, flash memory, and the like.
  • One or more computer program modules may be stored on the computer-readable storage medium, and the processor 310 may execute the one or more computer program modules to implement various functions of the image processing apparatus 300 .
  • Various application programs and various data, various data used and/or generated by the application programs, and the like may also be stored in the computer-readable storage medium.
  • FIG. 13 is a schematic block diagram of another image processing apparatus provided by some embodiments of the present disclosure.
  • the image processing apparatus 400 is, for example, suitable for implementing the image processing method provided by the embodiments of the present disclosure.
  • the image processing apparatus 400 may be a user terminal or the like. It should be noted that the image processing apparatus 400 shown in FIG. 13 is only an example, which does not impose any limitations on the functions and scope of use of the embodiments of the present disclosure.
  • the image processing device 400 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 410, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 420 or a program loaded from a storage device 480 into a random access memory (RAM) 430.
  • in the RAM 430, various programs and data required for the operation of the image processing apparatus 400 are also stored.
  • the processing device 410, the ROM 420, and the RAM 430 are connected to each other through a bus 440.
  • An input/output (I/O) interface 450 is also connected to bus 440 .
  • the following devices may be connected to the I/O interface 450: input devices 460 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; output devices 470 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; storage devices 480 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 490.
  • the communication apparatus 490 may allow the image processing apparatus 400 to perform wireless or wired communication with other electronic devices to exchange data.
  • although FIG. 13 shows the image processing apparatus 400 having various devices, it should be understood that it is not required to implement or have all of the illustrated devices, and the image processing apparatus 400 may alternatively implement or have more or fewer devices.
  • the image processing method provided by the embodiments of the present disclosure may be implemented as a computer software program.
  • embodiments of the present disclosure include a computer program product including a computer program carried on a non-transitory computer-readable medium, the computer program including program code for executing the above-described image processing method.
  • the computer program may be downloaded and installed from the network via the communication device 490, or from the storage device 480, or from the ROM 420.
  • when the computer program is executed by the processing device 410, the image processing method provided by the embodiments of the present disclosure can be executed.
  • At least one embodiment of the present disclosure further provides a storage medium for storing non-transitory computer-readable instructions, and when the non-transitory computer-readable instructions are executed by a computer, the image processing method described in any embodiment of the present disclosure can be implemented.
  • image processing can be performed through a convolutional neural network.
  • the convolutional neural network has a simple structure, can save device computing power, can remove noise from low-quality images and videos, can improve image clarity, and can achieve real-time image quality enhancement, and is therefore easy to apply to terminal devices.
  • the storage medium provided by at least one embodiment also has a better neural network training effect.
  • FIG. 14 is a schematic diagram of a storage medium provided by some embodiments of the present disclosure.
  • storage medium 500 is used to store non-transitory computer readable instructions 510 .
  • the non-transitory computer readable instructions 510 may perform one or more steps in the image processing method according to the above when executed by a computer.
  • the storage medium 500 can be applied to the image processing apparatus 300 described above.
  • the storage medium 500 may be the memory 320 in the image processing apparatus 300 shown in FIG. 12 .
  • the relevant description of the storage medium 500 reference may be made to the corresponding description of the memory 320 in the image processing apparatus 300 shown in FIG. 12, and details are not repeated here.
  • At least one embodiment of the present disclosure also provides a video processing method, which can process an input video based on a convolutional neural network to improve picture definition, realize real-time picture quality enhancement, and has good processing effect and high processing efficiency.
  • FIG. 15 is a schematic flowchart of a video processing method provided by some embodiments of the present disclosure.
  • the video processing method may include the following operations.
  • Step S61 acquiring the input video bit rate and the input video, wherein the input video includes a plurality of input image frames;
  • Step S62 select, according to the input video bit rate, a video processing sub-method corresponding to the input video bit rate to process at least one input image frame of the plurality of input image frames to obtain at least one output image frame, wherein the definition of the at least one output image frame is higher than the definition of the at least one input image frame.
  • video processing sub-method may be an image processing method described in any embodiment of the present disclosure, or a method for processing video at a certain bit rate in a terminal device provided in any embodiment of the present disclosure.


Abstract

An image processing method and apparatus, a device, a video processing method, and a storage medium. The image processing method includes: receiving an input image; and processing the input image by using a convolutional neural network to obtain an output image. The definition of the output image is higher than the definition of the input image. Processing the input image by using the convolutional neural network to obtain the output image includes: performing feature extraction on the input image to obtain a plurality of first images; performing splicing processing on the input image and the plurality of first images to obtain a first image group; performing feature extraction on the first image group to obtain a plurality of second images; fusing the plurality of second images and the plurality of first images to obtain a plurality of third images; performing splicing processing on the input image and the plurality of third images to obtain a second image group; and performing feature extraction on the second image group to obtain the output image. The convolutional neural network required by the image processing method has a simple structure, can improve picture definition, and is easy to deploy on a terminal.

Description

Image processing method and apparatus, device, video processing method, and storage medium
Technical Field
Embodiments of the present disclosure relate to an image processing method and apparatus, a device, a video processing method, and a storage medium.
Background Art
At present, deep learning technology based on artificial neural networks has made great progress in fields such as object classification, text processing, recommendation engines, image search, facial recognition, age and speech recognition, human-machine dialogue, and affective computing. With the deepening of research on artificial neural network structures and the improvement of related algorithms, deep learning technology has made breakthrough progress in the field of human-like data perception; deep learning technology can be used to describe image content, identify objects in complex environments in images, process images as needed, and so on.
Summary
At least one embodiment of the present disclosure provides an image processing method applicable to a convolutional neural network, wherein the method includes: receiving an input image; and processing the input image by using the convolutional neural network to obtain an output image, wherein the definition of the output image is higher than the definition of the input image. Processing the input image by using the convolutional neural network to obtain the output image includes: performing feature extraction on the input image to obtain a plurality of first images; performing splicing processing on the input image and the plurality of first images to obtain a first image group, wherein the first image group includes the input image and the plurality of first images; performing feature extraction on the first image group to obtain a plurality of second images; fusing the plurality of second images and the plurality of first images to obtain a plurality of third images; performing splicing processing on the input image and the plurality of third images to obtain a second image group, wherein the second image group includes the input image and the plurality of third images; and performing feature extraction on the second image group to obtain the output image.
For example, in the method provided by an embodiment of the present disclosure, the number of convolution kernels used in the convolutional neural network for performing feature extraction on the input image is N, where 12≤N≤20 and N is an integer; the number of convolution kernels used in the convolutional neural network for performing feature extraction on the first image group is M, where 12≤M≤20 and M is an integer; and the number of convolution kernels used in the convolutional neural network for performing feature extraction on the second image group is 3.
For example, in the method provided by an embodiment of the present disclosure, N=M=16; the size of the convolution kernels used for performing feature extraction on the input image, the size of the convolution kernels used for performing feature extraction on the first image group, and the size of the convolution kernels used for performing feature extraction on the second image group are all 3×3; the activation function used for feature extraction in the convolutional neural network is y=max(0, x), where x represents the input of the activation function and y represents the output of the activation function; the input image includes a red-channel input image, a green-channel input image, and a blue-channel input image, and the output image includes a red-channel output image, a green-channel output image, and a blue-channel output image.
For example, the method provided by an embodiment of the present disclosure further includes: training a second neural network to be trained based on a pre-trained first neural network to obtain a trained second neural network, thereby obtaining the convolutional neural network; wherein the first neural network has more parameters than the second neural network, the pre-trained first neural network is configured to transform an original image with a first definition input into the pre-trained first neural network into a new image with a second definition, the second definition being greater than the first definition; the trained second neural network is the convolutional neural network, the network structure of the second neural network to be trained is the same as the network structure of the convolutional neural network, and the parameters of the second neural network to be trained are different from the parameters of the convolutional neural network.
For example, in the method provided by an embodiment of the present disclosure, training the second neural network to be trained based on the pre-trained first neural network to obtain the trained second neural network, thereby obtaining the convolutional neural network, includes: based on the pre-trained first neural network, the second neural network to be trained, and a discrimination network, alternately training the discrimination network and the second neural network to obtain the trained second neural network, thereby obtaining the convolutional neural network.
For example, in the method provided by an embodiment of the present disclosure, training the discrimination network includes: inputting first sample data into the first neural network and the second neural network respectively, to obtain first data output from the first neural network and second data output from the second neural network; setting the first data to have a true value label and inputting the first data with the true value label into the discrimination network to obtain a first discrimination result; setting the second data to have a false value label and inputting the second data with the false value label into the discrimination network to obtain a second discrimination result; calculating a first loss function based on the first discrimination result and the second discrimination result; and adjusting the parameters of the discrimination network according to the first loss function to obtain an updated discrimination network.
For example, in the method provided by an embodiment of the present disclosure, training the second neural network includes: inputting second sample data into the first neural network and the second neural network respectively, to obtain third data output from the first neural network and fourth data output from the second neural network; setting the fourth data to have a true value label and inputting the fourth data with the true value label into the updated discrimination network to obtain a third discrimination result output from the discrimination network; calculating an error function based on the third data and the fourth data, calculating a discrimination function based on the third discrimination result, and calculating a second loss function based on the error function and the discrimination function; and adjusting the parameters of the second neural network according to the second loss function to obtain an updated second neural network.
For example, in the method provided by an embodiment of the present disclosure, the second loss function is a weighted sum of the error function and the discrimination function.
For example, in the method provided by an embodiment of the present disclosure, the weight of the error function is 90 to 110, and the weight of the discrimination function is 0.5 to 2.
For example, in the method provided by an embodiment of the present disclosure, the first sample data and the second sample data are image data obtained based on a plurality of videos having the same bit rate.
For example, in the method provided by an embodiment of the present disclosure, training the second neural network to be trained based on the pre-trained first neural network to obtain the trained second neural network, thereby obtaining the convolutional neural network, includes: inputting third sample data into the first neural network and the second neural network respectively, to obtain fifth data output from the first neural network and sixth data output from the second neural network; calculating a third loss function based on the fifth data and the sixth data; and adjusting the parameters of the second neural network according to the third loss function to obtain an updated second neural network.
For example, in the method provided by an embodiment of the present disclosure, the first neural network includes multiple levels of down-sampling units and corresponding multiple levels of up-sampling units; the output of each level of down-sampling unit serves as the input of the next-level down-sampling unit, and the input of each level of up-sampling unit includes the output of the down-sampling unit corresponding to that up-sampling unit and the output of the previous-level up-sampling unit.
At least one embodiment of the present disclosure further provides a terminal device, including a processor, wherein the processor is configured to: acquire an input video bit rate and an input video, wherein the input video includes a plurality of input image frames; and select, according to the input video bit rate, a video processing method corresponding to the input video bit rate to process at least one input image frame of the plurality of input image frames to obtain at least one output image frame, wherein the definition of the at least one output image frame is higher than the definition of the at least one input image frame; wherein different input video bit rates correspond to different video processing methods.
For example, in the terminal device provided by an embodiment of the present disclosure, the video processing method includes: processing the at least one input image frame based on a trained neural network to obtain the at least one output image frame; wherein processing the at least one input image frame based on the trained neural network to obtain the at least one output image frame includes: performing feature extraction on the at least one input image frame to obtain a plurality of first output images; performing splicing processing on the at least one input image frame and the plurality of first output images to obtain a first output image group, wherein the first output image group includes the at least one input image frame and the plurality of first output images; performing feature extraction on the first output image group to obtain a plurality of second output images; fusing the plurality of second output images and the plurality of first output images to obtain a plurality of third output images; performing splicing processing on the at least one input image frame and the plurality of third output images to obtain a second output image group, wherein the second output image group includes the at least one input image frame and the plurality of third output images; and performing feature extraction on the second output image group to obtain the at least one output image frame.
For example, in the terminal device provided by an embodiment of the present disclosure, the trained neural networks corresponding to different video processing methods are different.
For example, in the terminal device provided by an embodiment of the present disclosure, the trained neural networks corresponding to different video processing methods are obtained by training with different sample data sets, and the different sample data sets are obtained based on different video sets; each video set includes a plurality of videos, videos in the same video set have the same bit rate, and videos in different video sets have different bit rates.
At least one embodiment of the present disclosure further provides a video processing method, including: acquiring an input video bit rate and an input video, wherein the input video includes a plurality of input image frames; and selecting, according to the input video bit rate, a video processing sub-method corresponding to the input video bit rate to process at least one input image frame of the plurality of input image frames to obtain at least one output image frame, wherein the definition of the at least one output image frame is higher than the definition of the at least one input image frame; wherein different input video bit rates correspond to different video processing sub-methods.
At least one embodiment of the present disclosure further provides an image processing apparatus, including: a processor; and a memory including one or more computer program modules; wherein the one or more computer program modules are stored in the memory and configured to be executed by the processor, and the one or more computer program modules include instructions for implementing the image processing method described in any embodiment of the present disclosure.
At least one embodiment of the present disclosure further provides a storage medium for storing non-transitory computer-readable instructions, wherein when the non-transitory computer-readable instructions are executed by a computer, the image processing method described in any embodiment of the present disclosure can be implemented.
Brief Description of the Drawings
In order to explain the technical solutions of the embodiments of the present disclosure more clearly, the drawings of the embodiments will be briefly introduced below. Obviously, the drawings described below relate only to some embodiments of the present disclosure and are not a limitation of the present disclosure.
FIG. 1 is a schematic diagram of a convolutional neural network;
FIG. 2 is a schematic flowchart of an image processing method provided by some embodiments of the present disclosure;
FIG. 3A is a schematic diagram of a convolutional neural network adopted by an image processing method provided by some embodiments of the present disclosure;
FIG. 3B is a schematic flowchart of step S20 in the image processing method shown in FIG. 2;
FIG. 4 is a schematic flowchart of another image processing method provided by some embodiments of the present disclosure;
FIG. 5A is a schematic flowchart of training a discrimination network in an image processing method provided by some embodiments of the present disclosure;
FIG. 5B is a schematic diagram of the scheme for training the discrimination network shown in FIG. 5A;
FIG. 5C shows a discrimination network;
FIG. 6A is a schematic flowchart of training a second neural network in an image processing method provided by some embodiments of the present disclosure;
FIG. 6B is a schematic diagram of the scheme for training the second neural network shown in FIG. 6A;
FIG. 7A shows a neural network with a denoising function;
FIG. 7B shows a neural network with a deblurring function;
FIG. 8A is a schematic flowchart of training a second neural network in another image processing method provided by some embodiments of the present disclosure;
FIG. 8B is a schematic diagram of the scheme for training the second neural network shown in FIG. 8A;
FIG. 9A is a schematic block diagram of a terminal device provided by some embodiments of the present disclosure;
FIG. 9B is a schematic block diagram of another terminal device provided by some embodiments of the present disclosure;
FIG. 10A is a data flow diagram of a terminal device provided by some embodiments of the present disclosure;
FIG. 10B is an operation flowchart of a terminal device provided by some embodiments of the present disclosure;
FIG. 11A is a schematic diagram of a video picture;
FIG. 11B is an effect diagram after the picture shown in FIG. 11A is processed by the terminal device provided by an embodiment of the present disclosure;
FIG. 12 is a schematic block diagram of an image processing apparatus provided by some embodiments of the present disclosure;
FIG. 13 is a schematic block diagram of another image processing apparatus provided by some embodiments of the present disclosure;
FIG. 14 is a schematic diagram of a storage medium provided by some embodiments of the present disclosure; and
FIG. 15 is a schematic flowchart of a video processing method provided by some embodiments of the present disclosure.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings of the embodiments of the present disclosure. Obviously, the described embodiments are some, but not all, of the embodiments of the present disclosure. Based on the described embodiments of the present disclosure, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present disclosure.
Unless otherwise defined, the technical or scientific terms used in the present disclosure shall have the ordinary meanings understood by persons of ordinary skill in the field to which the present disclosure belongs. The terms "first", "second", and similar words used in the present disclosure do not denote any order, quantity, or importance, but are only used to distinguish different components. Likewise, words such as "a", "an", or "the" do not denote a limitation of quantity, but rather denote the presence of at least one. Words such as "include" or "comprise" mean that the element or item preceding the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items. Words such as "connect" or "connected" are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Up", "down", "left", "right", and the like are only used to indicate relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may also change accordingly.
With the development of deep learning technology, deep learning technology is increasingly widely applied in the field of image enhancement. Image enhancement technology based on deep learning can be used for image processing, such as image denoising, image inpainting, image deblurring, image super-resolution enhancement, image rain/haze removal, and the like.
A convolutional neural network (CNN) is a commonly used neural network structure that can be used to process images. FIG. 1 is a schematic diagram of a convolutional neural network. For example, the convolutional neural network can be used for image processing; it uses images as input and output, and replaces scalar weights with convolution kernels. FIG. 1 only shows a convolutional neural network with a three-layer structure, which is not limited by the embodiments of the present disclosure. As shown in FIG. 1, the convolutional neural network includes an input layer a01, a hidden layer a02, and an output layer a03. The input layer a01 has 4 inputs, the hidden layer a02 has 3 outputs, and the output layer a03 has 2 outputs; the convolutional neural network finally outputs 2 images. For example, the hidden layer a02, also called an intermediate layer, is mainly used to extract features, and the neurons therein can take various forms to form a bias toward the output result.
For example, the 4 inputs of the input layer a01 may be 4 images, or four kinds of features of 1 image. The 3 outputs of the hidden layer a02 may be feature images of the image input through the input layer a01.
For example, as shown in FIG. 1, the hidden layer a02 includes a first convolution layer b01 and a first activation layer b03. In this example, a convolution layer has weights w_ij^k and biases b_i^k. The weights w_ij^k represent convolution kernels, and the biases b_i^k are scalars superimposed on the output of the convolution layer, where k is a label indicating the number of the input layer a01 or the hidden layer a02, and i and j are labels of the units in the two connected layers, respectively. For example, the first convolution layer b01 includes a first group of convolution kernels (w_ij^1 in FIG. 1) and a first group of biases (b_i^1 in FIG. 1). Similarly, the output layer a03 includes a second convolution layer b02 and a second activation layer b04. The second convolution layer b02 includes a second group of convolution kernels (w_ij^2 in FIG. 1) and a second group of biases (b_i^2 in FIG. 1). Generally, each convolution layer includes dozens or hundreds of convolution kernels; if the convolutional neural network is a deep convolutional neural network, it may include at least five convolution layers.
For example, as shown in FIG. 1, the first activation layer b03 is located after the first convolution layer b01, and the second activation layer b04 is located after the second convolution layer b02. An activation layer includes an activation function, which is used to introduce nonlinear factors into the convolutional neural network so that the convolutional neural network can better solve relatively complex problems. The activation function may include a rectified linear unit function (ReLU function), a sigmoid function, a hyperbolic tangent function (tanh function), or the like. The ReLU function is an unsaturated nonlinear function, and the sigmoid function and the tanh function are saturated nonlinear functions. For example, the activation layer may serve as a separate layer of the convolutional neural network, or the activation layer may be included in a convolution layer.
For example, in the first convolution layer b01, first, several convolution kernels w_ij^1 of the first group of convolution kernels and several biases b_i^1 of the first group of biases are applied to each input to obtain the output of the first convolution layer b01; then, the output of the first convolution layer b01 can be processed by the first activation layer b03 to obtain the output of the first activation layer b03. In the second convolution layer b02, first, several convolution kernels w_ij^2 of the second group of convolution kernels and several biases b_i^2 of the second group of biases are applied to the output of the first activation layer b03 to obtain the output of the second convolution layer b02; then, the output of the second convolution layer b02 can be processed by the second activation layer b04 to obtain the output of the second activation layer b04. For example, the output of the first convolution layer b01 may be the result of applying the convolution kernels w_ij^1 to its input and then adding the biases b_i^1, and the output of the second convolution layer b02 may be the result of applying the convolution kernels w_ij^2 to the output of the first activation layer b03 and then adding the biases b_i^2.
在利用卷积神经网络进行图像处理前,需要对卷积神经网络进行训练。经过训练之后,卷积神经网络的卷积核和偏置在图像处理期间保持不变。在训练过程中,各卷积核和偏置通过多组输入/输出示例图像以及优化算法进行调整,以获取优化后的卷积神经网络模型。
在视频播放领域，由于传输带宽的限制，用户使用的电视所播放的视频画质较差。画质较差主要有以下两个方面的原因。一方面，视频图像的分辨率较低，电视频道或互联网视频的分辨率通常为标清（360p或720p），当用户的电视屏幕为4K分辨率（2160p）时，需要对视频图像插值放大，导致细节信息大量丢失。另一方面，由于视频直播传输带宽和视频存储空间的要求，需要对原始视频进行低码率压缩，压缩后的视频也损失了大量细节信息，且会引入很多噪声。
终端视频处理器作为智能电视的中心控制部件,能够对电视的基本功能进行操作,并且还可以具有语音控制和节目推荐等功能。随着芯片算力的不断提升,针对视频画质较差的问题,终端视频处理器可以集成视频图像的增强功能。但是,终端视频处理器的视频图像增强功能目前仍然以传统简单的算法为主,画质提升能力有限。深度学习算法(例如卷积神经网络)的运算量较大,终端视频处理器很难支撑通常的卷积神经网络的运算。
因此,如何设计一种结构简单的卷积神经网络以及相关的训练方法以提升图像的清晰度,使得该卷积神经网络既可以具有较好的图像增强功能,又便于部署在终端设备上,成为了亟待解决的问题。
本公开至少一个实施例提供一种图像处理方法及装置、终端设备、视频处理方法及存储介质。该图像处理方法所需要的卷积神经网络的结构简单,可以节省设备算力,能够去除低质量图像和视频的噪声,提升画面清晰度,实现实时画质增强,便于部署在终端设备上。至少一个实施例提供的图像处理方法还具有较好的神经网络训练效果。
下面,将参考附图详细地说明本公开的实施例。应当注意的是,不同的附图中相同的附图标记将用于指代已描述的相同的元件。
本公开至少一个实施例提供一种图像处理方法，该图像处理方法适用于卷积神经网络。该图像处理方法包括：接收输入图像；利用卷积神经网络对输入图像进行处理得到输出图像。输出图像的清晰度高于输入图像的清晰度。利用卷积神经网络对输入图像进行处理得到输出图像包括：对输入图像进行特征提取，得到多个第一图像；对输入图像和多个第一图像进行拼接处理，得到第一图像组，第一图像组包括输入图像和多个第一图像；对第一图像组进行特征提取，得到多个第二图像；将多个第二图像和多个第一图像进行融合，得到多个第三图像；对输入图像和多个第三图像进行拼接处理，得到第二图像组，第二图像组包括输入图像和多个第三图像；对第二图像组进行特征提取，得到输出图像。
需要说明的是,本公开的实施例中,“清晰度”例如是指图像中各细部影纹及其边界的清晰程度,清晰度越高,人眼的感观效果越好。输出图像的清晰度高于输入图像的清晰度,例如是指采用本公开实施例提供的图像处理方法对输入图像进行处理,例如进行去噪和/或去模糊处理,从而使处理后得到的输出图像比输入图像更清晰。
图2为本公开一些实施例提供的一种图像处理方法的流程示意图。例如,该图像处理方法适用于卷积神经网络,也即是,该图像处理方法利用卷积神经网络来实现对图像的处理。例如,如图2所示,该图像处理方法包括如下操作。
步骤S10:接收输入图像;
步骤S20:利用卷积神经网络对输入图像进行处理得到输出图像,其中,输出图像的清晰度高于输入图像的清晰度。
例如,在步骤S10中,输入图像可以为待处理的图像,例如为清晰度较低的图像。待处理图像可以是从视频中提取的视频帧,也可以是通过网络下载或者通过相机拍摄的图片,还可以为通过其他途径获取的图像,本公开的实施例对此不作限制。输入图像中例如有很多噪声,且画质比较模糊,因此需要利用本公开实施例提供的图像处理方法来去噪和/或去模糊,从而提升清晰度,实现画质增强。例如,当输入图像为彩色图像时,输入图像可以包括红色(R)通道输入图像、绿色(G)通道输入图像和蓝色(B)通道输入图像。这里,输入图像为一张彩色图像,其包括RGB3个通道。
例如,在步骤S20中,可以利用卷积神经网络来实现图像的处理,例如对输入图像去噪和/或去模糊,以使得到的输出图像的清晰度高于输入图像的清晰度。例如,当输入图像包括红色(R)通道输入图像、绿色(G)通道输入图像和蓝色(B)通道输入图像时,输出图像可以包括红色(R)通道输出图像、绿色(G)通道输出图像和蓝色(B)通道输出图像。这里,输出图像为一张彩色图像,其包括RGB3个通道。
图3A为本公开一些实施例提供的一种图像处理方法所采用的卷积神经网络的示意图。例如,在本公开实施例提供的图像处理方法中,将输入图像输入到图3A所示的卷积神经网络中,该卷积神经网络对输入图像进行处理后得到输出图像,由此完成图像处理,使得输出图像的清晰度高于输入图像的清晰度。
例如，如图3A所示，该卷积神经网络包括依序设置的输入层INP、第一卷积层C1、第一拼接层P1、第二卷积层C2、融合层Fu1、第二拼接层P2和输出层OT。
例如,输入层INP用于接收输入图像。当该输入图像为黑白图像时,输入图像可以包括一个通道。当该输入图像为RGB彩色图像时,输入图像可以包括三个通道,也即是,输入图像为1张,但是包括红色(R)通道输入图像、绿色(G)通道输入图像和蓝色(B)通道输入图像。
第一卷积层C1用于对输入层INP接收的输入图像进行卷积运算,以实现特征提取。例如,第一卷积层C1包括多个卷积核、多个偏置以及激活函数,由此可以计算得到多个特征图像(也可以称为特征图(Feature Map))。激活函数用于对卷积运算结果进行非线性映射,以协助表达复杂特征。例如,第一卷积层C1中的激活函数采用线性整流函数(Rectified Linear Unit,ReLU),使用ReLU函数更容易收敛,并且预测性能更好。ReLU函数可以表示为:y=max(0,x),其中,x表示ReLU函数的输入,y表示ReLU函数的输出。当然,本公开的实施例不限于此,激活函数也可以采用Sigmoid函数、双曲正切函数(hyperbolic tangent,tanh)或其他任意适用的函数,这可以根据实际需求而定。
第一拼接层P1用于将输入层INP接收的输入图像和第一卷积层C1输出的特征图像共同作为第二卷积层C2的输入,也即是,第一拼接层P1用于对输入层INP接收的输入图像和第一卷积层C1输出的特征图像进行拼接处理,并将拼接处理之后得到的特征图像作为第二卷积层C2的输入。例如,第一拼接层P1可以采用concat函数,concat函数例如可以表示为:concat(str1,str2,...)=(str1,str2,...),其中,str1表示参与拼接的第一个图像,str2表示参与拼接的第二个图像,以此类推。concat函数不会改变各个图像自身的内容,只返回多个图像拼接后的副本。关于concat函数的具体说明可参考常规设计,此处不再详述。由此,可以将输入图像的信息与第一卷积层C1输出的特征图像的信息进行链接,使得输入到第二卷积层C2的特征图像包含了输入图像的信息(也即包含了原图信息)。
第二卷积层C2用于对第一拼接层P1输出的特征图像进行卷积运算,以实现特征提取。例如,第二卷积层C2包括多个卷积核、多个偏置以及激活函数,由此可以计算得到多个特征图像。例如,第二卷积层C2中的激活函数也可以采用ReLU函数。当然,本公开的实施例不限于此,第二卷积层C2中的激活函数也可以采用Sigmoid函数、tanh函数或其他任意适用的函数,这可以根据实际需求而定。第一卷积层C1中的激活函数与第二卷积层C2中的激活函数可以相同或不同,这可以根据实际需求而定,本公开的实施例对此不作限制。
融合层Fu1用于将第二卷积层C2输出的特征图像与第一卷积层C1输出的特征图像融合。融合层Fu1可以将卷积神经网络中不同层的特征进行融合。例如，这些不同层可以是相连接的两层，也即是，其中一层的输出作为另一层的输入；这些不同层也可以是间隔开的两层，也即是，其中任意一层的输出不作为另一层的输入。例如，可以采用任意适用的融合算法来实现特征图像的融合，本公开的实施例对此不作限制。
第二拼接层P2用于将输入层INP接收的输入图像和融合层Fu1输出的特征图像共同作为输出层OT的输入,也即是,第二拼接层P2用于对输入层INP接收的输入图像和融合层Fu1输出的特征图像进行拼接处理,并将拼接处理之后得到的特征图像作为输出层OT的输入。例如,第二拼接层P2可以采用concat函数,concat函数不会改变各个图像自身的内容,只返回多个图像拼接后的副本。由此,可以将输入图像的信息与融合层Fu1输出的特征图像的信息进行链接,使得输入到输出层OT的特征图像包含了输入图像的信息(也即包含了原图信息)。
输出层OT用于对第二拼接层P2输出的特征图像进行卷积运算,以将特征图像转换为满足要求的多通道或单通道图像,并将运算得到的图像输出,以作为该卷积神经网络的输出图像。例如,输出层OT可以进行卷积运算,以实现特征提取和通道转换。例如,输出层OT包括多个卷积核、多个偏置以及激活函数,由此可以计算得到特征图像以作为输出图像。例如,输出层OT中的激活函数也可以采用ReLU函数。当然,本公开的实施例不限于此,输出层OT中的激活函数也可以采用Sigmoid函数、tanh函数或其他任意适用的函数,这可以根据实际需求而定。例如,输出层OT还可以进一步包括全连接层,全连接层用于对提取的特征进行非线性组合以得到输出。
需要说明的是,本公开的实施例中,上述卷积神经网络还可以包括更多类型的运算层,例如还可以包括池化层等,本公开的实施例对此不作限制。例如,卷积层不限于两层(不限于第一卷积层C1和第二卷积层C2),还可以设置第三卷积层、第四卷积层等更多卷积层。类似地,拼接层不限于两层(不限于第一拼接层P1和第二拼接层P2),还可以设置第三拼接层、第四拼接层等更多拼接层。类似地,融合层也不限于1层,还可以设置更多融合层。
需要说明的是,本公开的实施例中,第一卷积层C1、第二卷积层C2和输出层OT还可以用于在卷积运算之前进行像素填充,以使输入该层的图像与该层输出的图像的尺寸相同。关于像素填充的详细说明可参考常规设计,此处不再详述。
在该卷积神经网络中，通过设置多个拼接层（例如第一拼接层P1和第二拼接层P2），使得卷积神经网络中的特征图像信息与输入图像的信息进行链接，从而使后续处理过程中的特征图像包含了输入图像的信息（也即包含了原图信息），由此使得该卷积神经网络具有很好的细节修复效果。当该卷积神经网络被训练以具有去噪和/或去模糊功能时，其去噪和/或去模糊效果非常好，从而可以有效提升图像的清晰度，实现图像增强。该卷积神经网络的结构简单，运算易于实现，能够有效节省算力，提高运算效率，便于部署在终端设备上，适用于移动端的实时图像增强。
图3B为图2所示的图像处理方法中步骤S20的流程示意图。例如,如图3B所示,在一些示例中,图2所示的步骤S20可以具体包括如下操作。
步骤S21:对输入图像进行特征提取,得到多个第一图像;
步骤S22:对输入图像和多个第一图像进行拼接处理,得到第一图像组,其中,第一图像组包括输入图像和多个第一图像;
步骤S23:对第一图像组进行特征提取,得到多个第二图像;
步骤S24:将多个第二图像和多个第一图像进行融合,得到多个第三图像;
步骤S25:对输入图像和多个第三图像进行拼接处理,得到第二图像组,其中,第二图像组包括输入图像和多个第三图像;
步骤S26:对第二图像组进行特征提取,得到输出图像。
例如,上述步骤S21-S26可以通过图3A所示的卷积神经网络实现,下面将结合图3A所示的卷积神经网络对步骤S21-S26进行示例性说明。
例如,在步骤S21中,利用第一卷积层C1对输入图像进行特征提取,得到多个第一图像。例如,输入图像可以为RGB彩色图像,输入图像包括三个通道,也即是,输入图像为1张,但是包括红色(R)通道输入图像、绿色(G)通道输入图像和蓝色(B)通道输入图像。第一图像为经过第一卷积层C1处理之后得到的特征图像(也可以称为特征图(Feature Map))。例如,该卷积神经网络中用于对输入图像进行特征提取的卷积核(也即,第一卷积层C1中的卷积核)的数量为N,12≤N≤20且N为整数。例如,在一些示例中,N=16,也即,第一卷积层C1具有16个卷积核,此时,经过第一卷积层C1计算得到的多个第一图像的数量也为16。
例如,在步骤S22中,利用第一拼接层P1对输入图像和多个第一图像进行拼接处理,得到第一图像组。这里,“拼接处理”是指利用例如concat函数进行多个图像的拼接,concat函数不会改变各个图像自身的内容,只返回多个图像拼接后的副本。例如,第一图像组包括输入图像和多个第一图像。例如,在一些示例中,当输入图像为三个通道,第一卷积层C1具有16个卷积核时,第一图像的数量为16,将输入图像的三个通道与16张第一图像进行拼接处理得到第一图像组,第一图像组共包括19张图像,其中16张为第一卷积层C1输出的特征图像(即前述的第一图像),其中3张为输入图像的三个通道。
例如，在步骤S23中，利用第二卷积层C2对第一图像组进行特征提取，得到多个第二图像。例如，第二图像为经过第二卷积层C2处理之后得到的特征图像（也可以称为特征图（Feature Map））。例如，该卷积神经网络中用于对第一图像组进行特征提取的卷积核（也即，第二卷积层C2中的卷积核）的数量为M，12≤M≤20且M为整数。例如，在一些示例中，M=16，也即，第二卷积层C2具有16个卷积核，此时，经过第二卷积层C2计算得到的多个第二图像的数量也为16。
需要说明的是,M与N可以相等或不相等,也即是,第二卷积层C2中的卷积核数量与第一卷积层C1中的卷积核数量可以相等或不相等,这可以根据实际需求而定,本公开的实施例对此不作限制。例如,在一些示例中,第一卷积层C1中的卷积核数量与第二卷积层C2中的卷积核数量均为16,可以既兼顾移动端可以承担的运算量级,又具有较好的图像处理效果。
例如,在步骤S24中,在一些示例中,可以利用融合层Fu1对多个第二图像进行处理,得到多个第三图像。也即是,利用融合层Fu1将多个第二图像和多个第一图像进行融合,得到多个第三图像。例如,在一些示例中,当第一卷积层C1具有16个卷积核且第二卷积层C2也具有16个卷积核时,第一图像为16张,第二图像也为16张,且第一图像与第二图像可以形成一一对应的关系,此时,融合处理例如可以是将对应的第一图像和第二图像中对应的像素点进行加法运算。
需要说明的是,本公开的实施例中,在步骤S24中,不限于采用融合层Fu1来得到第三图像,也可以采用一层或多层卷积层、一层或多层拼接层、一层或多层池化层等任意的运算层来对第二图像进行处理以得到第三图像,本公开的实施例对此不作限制。
例如,在步骤S25中,利用第二拼接层P2对输入图像和多个第三图像进行拼接处理,得到第二图像组。这里,“拼接处理”是指利用例如concat函数进行多个图像的拼接,concat函数不会改变各个图像自身的内容,只返回多个图像拼接后的副本。例如,第二图像组包括输入图像和多个第三图像。例如,在一些示例中,当输入图像为三个通道,第一卷积层C1和第二卷积层C2均具有16个卷积核时,第二图像组共包括19张图像,其中16张为融合层Fu1输出的特征图像,其中3张为输入图像的三个通道。
例如,在步骤S26中,利用输出层OT对第二图像组进行特征提取,得到输出图像。例如,利用输出层OT将第二图像组转换为满足要求的多通道或单通道图像。例如,输出图像为经过输出层OT处理之后得到的特征图像。例如,该卷积神经网络中用于对第二图像组进行特征提取的卷积核(也即,输出层OT中的卷积核)的数量为3,此时,经过输出层OT计算得到的输出图像的数量也为1,但是包括红色(R)通道输出图像、绿色(G)通道输出图像和蓝色(B)通道输出图像这三个通道的图像。需要说明的是,当输入图像仅为一个通道时,对应地,使输出层OT中的卷积核的数量也为1,从而得到的输出图像也为一个通道。
例如，在一些示例中，第一卷积层C1、第二卷积层C2和输出层OT中的卷积核的尺寸均为3×3，该尺寸可以满足特征提取的要求，同时也兼顾算力资源。当然，本公开的实施例不限于此，卷积核的尺寸也可以为4×4、5×5等任意尺寸，并且第一卷积层C1、第二卷积层C2和输出层OT中的卷积核的尺寸可以相同或不同，这可以根据实际需求而定，本公开的实施例对此不作限制。
例如,在一些示例中,输入图像为三个通道,也即红色通道输入图像、绿色通道输入图像和蓝色通道输入图像,可以设置第一卷积层C1和第二卷积层C2均包括16个卷积核,输出层OT包括3个卷积核,卷积核的尺寸均为3×3。在该示例中,设定网络的batchsize=1,卷积核步长stride=1,也即不改变特征图像的尺寸,由此使得运算和处理更简单。需要说明的是,batchsize和卷积核步长stride均不限于为1,也可以为其他数值,这可以根据实际需求而定。
如图3A所示,经过第一卷积层C1后,图像由3通道变成为16通道(①),宽高信息保持不变。这里,多个“通道”可以是指多张特征图像。在第二卷积层C2开始计算之前,第一拼接层P1会将原图信息链接在第一卷积层C1的计算结果后(②),作为共同输入传递给第二卷积层C2。融合层Fu1将第二卷积层C2的输出(③)和第一卷积层C1的输出进行融合(④),之后通过第二拼接层P2链接原图信息(⑤),再传递给输出层OT。输出层OT学习RGB三个通道的信息,输出层OT的输出即为RGB图像信息(⑥)。
例如,每一层的输出结果可以表示为(B,H,W,F)。其中,B表示batchsize,H表示图像高度,W表示图像宽度,F表示特征图像数量。上述各个阶段的输出结果具体为:①:1*H*W*16,②1*H*W*19,③1*H*W*16,④1*H*W*16,⑤1*H*W*19,⑥1*H*W*3。
经过上述步骤S21-S26的处理,利用卷积神经网络完成了对输入图像的处理并且得到了输出图像,该输出图像的清晰度高于输入图像的清晰度,从而实现了图像增强。
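为便于理解上述步骤S21-S26与图3A所示结构的对应关系，下面给出一个基于PyTorch的示意性代码草图（其中的类名、变量名以及padding等细节均为本文为说明而作的假设，并非对本公开实施例的限定；另外，PyTorch中张量按(B,C,H,W)排布，而上文的(B,H,W,F)为通道在后的记法）：

```python
import torch
import torch.nn as nn

class EnhanceNet(nn.Module):
    """图3A所示结构的示意性实现：3通道输入，N=M=16，卷积核均为3×3。"""
    def __init__(self, in_channels=3, num_features=16):
        super().__init__()
        # padding=1 对应上文提到的像素填充，使各层输入输出尺寸一致
        self.conv1 = nn.Conv2d(in_channels, num_features, 3, stride=1, padding=1)                # 第一卷积层C1
        self.conv2 = nn.Conv2d(in_channels + num_features, num_features, 3, stride=1, padding=1) # 第二卷积层C2
        self.out = nn.Conv2d(in_channels + num_features, in_channels, 3, stride=1, padding=1)    # 输出层OT
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                      # x: (B, 3, H, W)，即输入图像
        f1 = self.relu(self.conv1(x))          # ① (B, 16, H, W)，多个第一图像
        g1 = torch.cat([x, f1], dim=1)         # ② (B, 19, H, W)，第一拼接层P1：第一图像组
        f2 = self.relu(self.conv2(g1))         # ③ (B, 16, H, W)，多个第二图像
        f3 = f1 + f2                           # ④ (B, 16, H, W)，融合层Fu1：对应像素相加得到多个第三图像
        g2 = torch.cat([x, f3], dim=1)         # ⑤ (B, 19, H, W)，第二拼接层P2：第二图像组
        return self.relu(self.out(g2))         # ⑥ (B, 3, H, W)，输出图像
```

在该草图中，两次torch.cat分别对应第一拼接层P1和第二拼接层P2对原图信息的链接，融合层以对应像素相加的方式实现，各阶段通道数与上文①～⑥一致。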
图4为本公开一些实施例提供的另一种图像处理方法的流程示意图。例如,如图4所示,除了还进一步包括步骤S30之外,该实施例提供的图像处理方法与图2所示的图像处理方法基本上相同,相似的步骤S10-S20可以参考前述内容,此处不再赘述。
在该实施例中,在执行步骤S10-S20之前,该图像处理方法还包括如下操作。
步骤S30:基于预先训练好的第一神经网络对待训练的第二神经网络进行训练得到经过训练的第二神经网络,由此得到卷积神经网络。
例如，第一神经网络的参数多于第二神经网络的参数，预先训练好的第一神经网络被配置为将输入预先训练好的第一神经网络的具有第一清晰度的原始图像变换为具有第二清晰度的新建图像，第二清晰度大于第一清晰度。也即是，第一神经网络具有提升图像清晰度的功能。
例如,在一些示例中,步骤S30可以包括:基于预先训练好的第一神经网络、待训练的第二神经网络和鉴别网络,交替训练鉴别网络和第二神经网络,得到经过训练的第二神经网络,由此得到卷积神经网络。
例如,经过训练的第二神经网络即为上文描述的卷积神经网络。待训练的第二神经网络的网络结构与卷积神经网络的网络结构相同,这里,网络结构可以是指网络的各个层的数量、设置方式和设置顺序等,还可以指网络中数据传输的途径和方式等。待训练的第二神经网络的参数与卷积神经网络的参数不同。也即是,通过对待训练的第二神经网络进行训练,优化其参数,从而可以得到上文描述的卷积神经网络。例如,第二神经网络的参数和卷积神经网络的参数均包括各个卷积层中的权重参数。例如,权重参数的绝对值越大,则该权重参数对应的神经元对神经网络的输出的贡献越大,进而对神经网络来说越重要。
例如,预先训练好的第一神经网络可以采用如下方式训练。首先获得真值样本(即清晰度较高的输入视频,包含多个输入图像帧),对真值样本进行处理(例如增加噪声、模糊虚化等)得到对应的清晰度较低的输出视频(包含多个输出图像帧),多个输出图像帧与多个输入图像帧是一一对应的。然后,将对应的输入图像帧和输出图像帧作为一组训练对来对未训练的第一神经网络进行训练,以得到训练后的第一神经网络。例如,此时得到的多个输入图像帧和/或多个输出图像帧可以作为后续对第二神经网络进行训练的训练样本。
例如,在一些示例中,在步骤S30中,将样本数据输入第一神经网络和第二神经网络,基于第一神经网络和第二神经网络输出的结果,先训练鉴别网络,接着训练第二神经网络。然后,再次将样本数据输入第一神经网络和第二神经网络,基于第一神经网络和第二神经网络输出的结果,先训练鉴别网络,接着训练第二神经网络,以此类推。例如,在不同的训练阶段输入的样本数据可以不同。经过交替训练鉴别网络和第二神经网络,使鉴别网络和第二神经网络对抗式地互相博弈学习,从而产生较好的输出结果。
例如，第一神经网络可以为已经训练好的较大神经网络。第二神经网络可以为基于上文描述的卷积神经网络的结构建立的，但是参数还需要训练的网络。例如，利用第一神经网络来训练第二神经网络，第一神经网络的参数多于第二神经网络的参数。例如，神经网络的参数包括神经网络中各个卷积层的权重参数。权重参数的绝对值越大，则该权重参数对应的神经元对神经网络的输出的贡献越大，进而对该神经网络来说越重要。通常，参数越多的神经网络的复杂度越高，其“容量”也就越大，也就意味着该神经网络能完成更复杂的学习任务。相对于第一神经网络，第二神经网络得到了简化，第二神经网络具有更少的参数和更简单的网络结构，使得第二神经网络在其运行时占用较少的资源（例如计算资源、存储资源等），因而可以应用于轻量级的终端。采用上述训练的方式，可以使第二神经网络学习第一神经网络的推理能力，从而使第二神经网络在具备简单结构的同时具备较强的推理能力。
图5A为本公开一些实施例提供的一种图像处理方法中对鉴别网络进行训练的流程示意图,图5B为图5A所示的对鉴别网络进行训练的方案示意图。下面结合图5A和图5B对鉴别网络的训练方案进行示例性说明。
例如,在一些示例中,如图5A和图5B所示,对鉴别网络进行训练包括如下操作。
步骤S31:将第一样本数据分别输入第一神经网络NE1和第二神经网络NE2,得到从第一神经网络NE1输出的第一数据以及从第二神经网络NE2输出的第二数据;
步骤S32:将第一数据设置为具有真值标签,并将具有真值标签的第一数据输入鉴别网络Disc以得到第一鉴别结果,将第二数据设置为具有假值标签,并将具有假值标签的第二数据输入鉴别网络Disc以得到第二鉴别结果;
步骤S33:基于第一鉴别结果和第二鉴别结果计算第一损失函数;
步骤S34:根据第一损失函数调整鉴别网络Disc的参数以得到更新后的鉴别网络Disc。
例如,在步骤S31中,第一样本数据可以为基于视频得到的图像数据,第一数据和第二数据均为图像数据。当然,第一样本数据也可以为通过其他方式得到的图像数据,本公开的实施例对此不作限制。例如,在一些示例中,第一样本数据为基于多个具有相同码率的视频得到的图像数据。由此,利用该第一样本数据训练得到的卷积神经网络可以对该码率的视频的图像数据具有较好的处理能力和处理效果,使得该卷积神经网络对码率的针对性强。例如,在一些示例中,可以对原始视频进行不同码率的视频质量压缩,并随机添加高斯噪声和量子噪声,形成低画质视频,对低画质视频进行视频帧提取,从而可以得到第一样本数据,该第一样本数据即为低画质视频的视频帧。
例如，第一神经网络NE1可以为已经训练好的、较大的、具有提升清晰度功能的神经网络。例如，第一神经网络NE1包括多级下采样单元和对应的多级上采样单元，每级下采样单元的输出作为下一级下采样单元的输入，每级上采样单元的输入包含与该级上采样单元对应的下采样单元的输出和该级上采样单元的上一级上采样单元的输出。该神经网络通过利用多个下采样单元对图像进行多次特征提取，并利用多个上采样单元对图像进行多次上采样，每个下采样单元输出的特征图像被输入到对应的上采样单元中，从而可以有效捕捉图像中的特征信息，提升图像的清晰度。例如，第一神经网络NE1可以采用图7A所示的网络结构，相关说明将在后文描述，此处不再赘述。需要说明的是，第一神经网络NE1可以采用任意的已训练好的神经网络，也可以采用任意的神经网络的组合，这可以根据实际需求而定，本公开的实施例对此不作限制。例如，第一神经网络NE1用于提升样本数据的清晰度，以使得训练得到的卷积神经网络也具有提升清晰度的功能。例如，提升清晰度可以具体实现为去噪和/或去模糊，以实现图像增强。
例如,第二神经网络NE2基于上文描述的卷积神经网络的网络结构建立,也即是,第二神经网络NE2与上文描述的卷积神经网络的网络结构相同,但是其参数还需要训练和修正。
例如,在步骤S32中,将第一数据设置为具有真值标签,将第二数据设置为具有假值标签。例如,在一些示例中,真值标签可以表示为[1],假值标签可以表示为[0]。将具有真值标签的第一数据输入鉴别网络Disc以得到第一鉴别结果,将具有假值标签的第二数据输入鉴别网络Disc以得到第二鉴别结果。
例如,鉴别网络Disc可以采用层叠(stack)模式的卷积神经网络模型。图5C为一种鉴别网络,该鉴别网络例如采用层叠模式的卷积神经网络模型。例如,如图5C所示,该鉴别网络Disc包括多个运算层,每个运算层均由卷积层和激活层(采用ReLU函数)组成,最后通过全连接层FC输出结果。例如,在该示例中,4个运算层中的卷积层的卷积核尺寸均为3×3,4个运算层各自输出的特征图像的数量分别为32、64、128、192。鉴别网络Disc最后输出的结果为二分类的概率值,也即,鉴别网络Disc判别输出0-1之间的任意数值,该数值越靠近1,则代表判断输入为真的概率越大,反之则判断输入为假的概率越大。
需要说明的是,本公开的实施例不限于此,鉴别网络Disc也可以采用任意类型的鉴别网络,相关说明可参考常规设计,此处不再详述。
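作为参考，下面给出一个与图5C描述一致的示意性鉴别网络草图（基于PyTorch；其中的下采样步长、自适应池化以及Sigmoid输出等细节在上文中未作限定，均为此处的假设）：

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """图5C所示鉴别网络的示意性实现：4个“卷积+ReLU”运算层，特征图数量依次为32、64、128、192。"""
    def __init__(self, in_channels=3):
        super().__init__()
        layers, c_in = [], in_channels
        for c_out in (32, 64, 128, 192):
            # 卷积核尺寸为3×3；stride=2为此处假设，用于逐层缩小特征图
            layers += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.ReLU(inplace=True)]
            c_in = c_out
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)     # 假设：压缩为固定尺寸后再接全连接层FC
        self.fc = nn.Linear(192, 1)
        self.sigmoid = nn.Sigmoid()             # 假设：用Sigmoid把输出压缩到0~1之间的概率值

    def forward(self, x):
        h = self.pool(self.features(x)).flatten(1)
        return self.sigmoid(self.fc(h))         # 越接近1，判断输入为真的概率越大
```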
例如，如图5A和图5B所示，在步骤S33中，基于第一鉴别结果和第二鉴别结果计算第一损失函数，该第一损失函数为鉴别网络Disc的损失函数。例如，在一些示例中，第一损失函数可以采用交叉熵损失函数，交叉熵损失函数表征真实样本标签和预测概率之间的差值。例如，交叉熵损失函数的公式为：
Loss = -(1/N)·Σ_i [y_i·log(p_i) + (1-y_i)·log(1-p_i)]
其中，N为数据的个数（也即，第一数据与第二数据的总数量），y_i为每个数据对应的标签（也即，0或1），p_i为鉴别网络对每个数据的预测值。
需要说明的是，本公开的实施例中，鉴别网络Disc的损失函数可以采用任意类型的函数，不限于交叉熵损失函数，这可以根据实际需求而定，本公开的实施例对此不作限制。
例如,在步骤S34中,根据第一损失函数调整鉴别网络Disc的参数,以得到更新后的鉴别网络Disc,该更新后的鉴别网络Disc具有更好的鉴别能力。
通过上述步骤S31-S34,可以完成一次针对鉴别网络Disc的训练。
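下面给出步骤S31-S34的一个示意性实现草图（基于PyTorch；函数名与优化器的使用方式为此处的假设，交叉熵采用二分类交叉熵的常见写法）：

```python
import torch
import torch.nn.functional as F

def train_discriminator_step(ne1, ne2, disc, disc_optimizer, sample):
    """一次鉴别网络训练（对应步骤S31-S34）的示意性实现。"""
    # S31：将第一样本数据分别输入第一神经网络NE1与第二神经网络NE2
    with torch.no_grad():
        real = ne1(sample)   # 第一数据，设置为具有真值标签[1]
        fake = ne2(sample)   # 第二数据，设置为具有假值标签[0]
    # S32：分别输入鉴别网络，得到第一、第二鉴别结果
    pred_real = disc(real)
    pred_fake = disc(fake)
    # S33：基于两个鉴别结果按交叉熵计算第一损失函数
    loss = (F.binary_cross_entropy(pred_real, torch.ones_like(pred_real))
            + F.binary_cross_entropy(pred_fake, torch.zeros_like(pred_fake)))
    # S34：根据第一损失函数调整鉴别网络的参数
    disc_optimizer.zero_grad()
    loss.backward()
    disc_optimizer.step()
    return loss.item()
```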
图6A为本公开一些实施例提供的一种图像处理方法中对第二神经网络进行训练的流程示意图,图6B为图6A所示的对第二神经网络进行训练的方案示意图。
下面结合图6A和图6B对第二神经网络的训练方案进行示例性说明。
例如,在一些示例中,如图6A和图6B所示,对第二神经网络进行训练包括如下操作。
步骤S35:将第二样本数据分别输入第一神经网络NE1和第二神经网络NE2,得到从第一神经网络NE1输出的第三数据以及从第二神经网络NE2输出的第四数据;
步骤S36:将第四数据设置为具有真值标签,并将具有真值标签的第四数据输入到更新后的鉴别网络Disc,得到从鉴别网络Disc输出的第三鉴别结果;
步骤S37:基于第三数据和第四数据计算误差函数,基于第三鉴别结果计算鉴别函数,并基于误差函数和鉴别函数计算第二损失函数;
步骤S38:根据第二损失函数调整第二神经网络NE2的参数以得到更新后的第二神经网络NE2。
上述步骤S35-S38可以在执行完步骤S31-S36之后执行。
例如,在步骤S35中,第二样本数据可以为基于视频得到的图像数据,第三数据和第四数据均为图像数据。当然,第二样本数据也可以为通过其他方式得到的图像数据,本公开的实施例对此不作限制。例如,在一些示例中,第二样本数据为基于多个具有相同码率的视频得到的图像数据。由此,利用该第二样本数据训练得到的卷积神经网络可以对该码率的视频的图像数据具有较好的处理能力和处理效果,使得该卷积神经网络对码率的针对性强。例如,在一些示例中,可以对原始视频进行不同码率的视频质量压缩,并随机添加高斯噪声和量子噪声,形成低画质视频,对低画质视频进行视频帧提取,从而可以得到第二样本数据,该第二样本数据即为低画质视频的视频帧。例如,第一样本数据与第二样本数据可以相同或不同。
例如,在步骤S36中,将第四数据设置为具有真值标签(例如表示为[1]),并将具有真值标签的第四数据输入到鉴别网络Disc,得到从鉴别网络Disc输出的第三鉴别结果。例如,第三鉴别结果的数值范围为0~1。需要说明的是,此时的鉴别网络Disc为经过上述步骤S31-S34训练后所更新的鉴别网络。
例如，在步骤S37中，基于第三数据和第四数据计算误差函数。例如，在一些示例中，误差函数可以采用平均绝对误差（L1 loss）。例如，平均绝对误差的计算公式如下：L1 loss=mean(|X-Y|)，其中，X为第四数据，Y为第三数据。需要说明的是，误差函数不限于平均绝对误差（L1 loss），也可以为其他任意适用的误差函数，这可以根据实际需求而定，本公开的实施例对此不作限制。基于第三鉴别结果计算鉴别函数D2，例如可以采用任意适用的方法计算鉴别函数D2，本公开的实施例对此不作限制。
计算得到误差函数（例如平均绝对误差）和鉴别函数D2之后，基于误差函数（例如平均绝对误差）和鉴别函数D2计算第二损失函数，第二损失函数为第二神经网络NE2的损失函数。例如，第二损失函数为误差函数（例如平均绝对误差）与鉴别函数D2的加权和，可以表示为：NE2 loss=W1*L1 loss+W2*D2，其中，NE2 loss表示第二损失函数，W1表示误差函数的权重，W2表示鉴别函数D2的权重。例如，误差函数的权重W1为90~110（例如为100），鉴别函数D2的权重为0.5~2（例如为1）。
例如,在步骤S38中,根据第二损失函数调整第二神经网络NE2的参数以得到更新后的第二神经网络NE2。
通过上述步骤S35-S38,可以完成一次针对第二神经网络NE2的训练。
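与之对应，步骤S35-S38可以用如下示意性草图表示（基于PyTorch；鉴别函数D2的具体计算方式在上文中未作限定，此处假设取"使第三鉴别结果靠近真值标签"的交叉熵形式，权重取W1=100、W2=1作为示例）：

```python
import torch
import torch.nn.functional as F

def train_generator_step(ne1, ne2, disc, ne2_optimizer, sample, w1=100.0, w2=1.0):
    """一次第二神经网络训练（对应步骤S35-S38）的示意性实现。"""
    # S35：将第二样本数据分别输入NE1与NE2
    with torch.no_grad():
        target = ne1(sample)                    # 第三数据（作为学习目标）
    output = ne2(sample)                        # 第四数据
    # S36：将第四数据（标签设为真值）输入更新后的鉴别网络，得到第三鉴别结果
    pred = disc(output)
    # S37：误差函数取L1 loss；鉴别函数D2此处假设为“使第三鉴别结果靠近1”的交叉熵
    l1_loss = torch.mean(torch.abs(output - target))
    d2 = F.binary_cross_entropy(pred, torch.ones_like(pred))
    loss = w1 * l1_loss + w2 * d2               # 第二损失函数：误差函数与鉴别函数的加权和
    # S38：根据第二损失函数调整第二神经网络的参数（鉴别网络的参数不在此步更新）
    ne2_optimizer.zero_grad()
    loss.backward()
    ne2_optimizer.step()
    return loss.item()
```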
例如,通过上述步骤S31-S38,可以交替完成一次针对鉴别网络Disc和第二神经网络NE2的对抗式训练。需要说明的是,每次交替训练都是在上一次训练更新的基础上进行的,也即是,在训练第二神经网络NE2时,利用基于上一次训练更新的鉴别网络Disc进行训练,而在训练鉴别网络Disc时,利用基于上一次训练更新的第二神经网络NE2进行训练。根据需求,可以交替进行一次或多次针对鉴别网络Disc和第二神经网络NE2的对抗式训练,通过优化和迭代,使经过训练的第二神经网络NE2的图像处理能力满足需求。例如,在一些示例中,一共进行约2000万次的交替训练,可以得到满足需求的第二神经网络NE2。例如,经过训练的第二神经网络NE2的参数(例如卷积层中的权重参数)已经得到了优化和修正,经过训练的第二神经网络NE2即为上文描述的卷积神经网络。
例如,在一些示例中,在一次针对鉴别网络Disc和第二神经网络NE2的对抗式训练中,所采用的第一样本数据和第二样本数据可以是相同的,也即是,可以采用相同的样本数据完成一次针对鉴别网络Disc的训练和一次针对第二神经网络NE2的训练。例如,在进行多次对抗式训练的情形,对于同一次对抗式训练,所采用的第一样本数据和第二样本数据可以是相同的。例如,当完成一次对抗式训练后,在随后进行的第二次针对鉴别网络Disc和第二神经网络NE2的对抗式训练中,所采用的第一样本数据与前一次对抗式训练中所采用的第一样本数据不同,并且所采用的第二样本数据与前一次对抗式训练中所采用的第二样本数据不同。通过这种方式,可以提高训练效率,简化训练方式,提高数据利用率。
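在上述两个训练步骤的基础上，交替的对抗式训练可以示意为如下循环（假设沿用上文草图中的train_discriminator_step和train_generator_step两个函数，同一次迭代内第一样本数据与第二样本数据取同一批数据）：

```python
def adversarial_training(ne1, ne2, disc, ne2_opt, disc_opt, dataloader, num_rounds):
    """交替训练的示意性循环：每轮先训练鉴别网络，再训练第二神经网络。"""
    data_iter = iter(dataloader)
    for _ in range(num_rounds):
        try:
            sample = next(data_iter)
        except StopIteration:                  # 数据取完后重新开始，下一轮换用新的样本数据
            data_iter = iter(dataloader)
            sample = next(data_iter)
        train_discriminator_step(ne1, ne2, disc, disc_opt, sample)
        train_generator_step(ne1, ne2, disc, ne2_opt, sample)
```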
在本公开实施例提供的图像处理方法中,通过采用上述训练方式,可以快速训练得到满足需求的卷积神经网络,训练效果好,训练得到的卷积神经网络可以具有较强的图像处理能力和较好的图像处理效果。在所采用的第一神经网络NE1具有去噪和/或去模糊功能时,训练得到的卷积神经网络也具有去噪和/或去模糊功能,且去噪和/或去模糊效果很好。
需要说明的是,本公开的实施例中,该图像处理方法还可以包括更多的步骤,各个步骤的执行顺序可以根据实际需求而调整,本公开的实施例对此不作限制。
例如,第一神经网络NE1可以采用任意的训练好的神经网络,例如可以采用具有去噪功能的较大神经网络或具有去模糊功能的较大神经网络,或者,也可以采用具有去噪功能的较大神经网络与具有去模糊功能的较大神经网络的组合。
图7A为一种具有去噪功能的神经网络。如图7A所示,该神经网络包括多级下采样单元和对应的多级上采样单元,多级下采样单元与多级上采样单元一一对应。例如,在图7A中,左侧的单元为下采样单元,右侧的单元为上采样单元。每级下采样单元的输出作为下一级下采样单元的输入,每级上采样单元的输入包含与该级上采样单元对应的下采样单元的输出和该级上采样单元的上一级上采样单元的输出。也即是,下采样单元的输出不仅提供给相邻的下一级下采样单元,还提供给与该下采样单元对应的上采样单元。
例如,每个下采样单元包括Conv2d卷积层、Relu激活函数、Conv2d下采样2倍等运算层,每个上采样单元包括Conv2d卷积层、Relu激活函数、Conv2d上采样2倍等运算层。该神经网络同时输入连续的3帧图像,输出为输入的中间帧的结果。假设输入帧序列的尺寸为(H,W,C),若输入为RGB图像,则C=3,若输入为灰度图,则C=1。同时输入3帧图像,在C通道进行合并,则网络模型的输入尺寸为(H,W,C*3)。
该神经网络为U型对称结构,左侧主要是下采样单元,右侧主要是上采样单元。该神经网络中,多个下采样单元和多个上采样单元彼此对应,左侧每个下采样单元输出的特征图像被输入到右侧对应的上采样单元中,从而使每个层级得到的特征图都有效使用到后续计算中。
在该神经网络中,设定网络的batchsize=1(B=1),网络中每一层的特征层数为F。Conv2d卷积层选择(3,3)的卷积核,步长stride=1,即不改变特征图的尺寸,则输出尺寸为(B,H,W,F)。Conv2d下采样2倍选择(3,3)的卷积核,同时步长stride=2,即下采样两倍,输出尺寸为(B,H//2,W//2,F)。Conv2d上采样2倍选择(4,4)的卷积核,步长stride=2,即上采样2倍,输出尺寸为(B,H,W,F)。Conv2d特征转RGB图(参数共享)选择(3,3)卷积核,步长stride=1,主要目的是为了将特征层数由F变为3,得到输出的RGB图。
需要注意的是,该卷积核的参数是共享的,即图中所有Conv2d卷积核的参数都是一样的,在训练过程中,分别输出了F1,F2,F3,F4,F5这5个不同尺度的图像。例如,这5个不同尺度的图像的大小分别为:F1(H,W,3),F2(H//2,W//2,3),F3(H//4,W//4,3),F4(H//8,W//8,3),F5(H//16,W//16,3)。
在该神经网络的训练过程中,在不同尺度下输出F1、F2、F3、F4、F5分别和真值计算损失函数。例如,GT1为真值,为了得到其他尺度的真值图,对GT1进行BICUBIC下采样,分别得到GT2(H//2,W//2,3),GT3(H//4,W//4,3),GT4(H//8,W//8,3)和GT5(H//16,W//16,3)。在训练完成后使用该神经网络时,只在最终输出F1时使用参数共享的Conv2d卷积层,而不再输出F2、F3、F4、F5,即在F2、F3、F4、F5处不使用该卷积层。
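下面给出图7A中下采样单元和上采样单元的一个示意性草图（基于PyTorch；其中上采样2倍假设采用转置卷积实现，两路输入在通道维拼接的方式以及通道数均为此处的假设）：

```python
import torch
import torch.nn as nn

class DownUnit(nn.Module):
    """示意性下采样单元：卷积+ReLU后用stride=2的卷积把特征图缩小2倍。"""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, stride=2, padding=1),      # 下采样2倍
        )
    def forward(self, x):
        return self.body(x)

class UpUnit(nn.Module):
    """示意性上采样单元：输入包含上一级上采样单元的输出与对应下采样单元的输出。"""
    def __init__(self, c_in, c_out):
        # c_in为两路输入在通道维拼接后的总通道数（拼接方式为此处假设）
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(c_out, c_out, 4, stride=2, padding=1),  # (4,4)卷积核、stride=2的上采样2倍
        )
    def forward(self, x, skip):
        return self.body(torch.cat([x, skip], dim=1))
```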
图7B为一种具有去模糊功能的神经网络。如图7B所示,该神经网络包括多组具有不同卷积核数量的功能层组,每个功能层组包括一个或多个Conv卷积层、DConv卷积层、再阻塞层等。并且,在该网络中,不同的功能层组之间建立跳跃连接,通过设置跳跃连接,可以解决网络层数较深的情况下梯度消失的问题,同时有助于梯度的反向传播,加快训练过程。
关于该神经网络的说明可参考常规设计中应用跳跃连接和再阻塞层的神经网络,此处不再详述。
需要说明的是,本公开的实施例中,训练得到的卷积神经网络可以对输入图像进行处理以提升图像的清晰度,但不限于仅具有去噪和/或去模糊功能,该卷积神经网络还可以具有其他任意的功能,只需采用具有相应功能的第一神经网络NE1并利用该第一神经网络NE1训练第二神经网络NE2即可。
图8A为本公开一些实施例提供的另一种图像处理方法中对第二神经网络进行训练的流程示意图,图8B为图8A所示的对第二神经网络进行训练的方案示意图。例如,在该示例中,可以采用图8A和图8B中示出的训练方法来训练得到卷积神经网络,与图5A至图6B所示的训练方式不同,该示例中不再采用鉴别网络,而仅采用第一神经网络NE1来训练第二神经网络NE2以得到所需要的卷积神经网络。
下面结合图8A和图8B对第二神经网络的训练方案进行示例性说明。
例如,在一些示例中,如图8A和图8B所示,对第二神经网络进行训练包括如下操作。
步骤S41:将第三样本数据分别输入第一神经网络NE1和第二神经网络NE2,得到从第一神经网络NE1输出的第五数据以及从第二神经网络NE2输出的第六数据;
步骤S42:基于第五数据和第六数据计算第三损失函数;
步骤S43:根据第三损失函数调整第二神经网络NE2的参数以得到更新后的第二神经网络NE2。
例如,首先选择一组清晰度较好的视频数据集,加以一定的模糊和噪声,再压缩至1M的码率,以降低视频的质量,使得处理后的视频符合实际视频网站中的标清视频标准。对处理之后的视频数据集提取视频帧,得到的视频帧组成样本数据。例如,清晰度较好的视频数据集可以采用AIM数据集,该AIM数据集包含240个1280*720的视频,每个视频有100帧。
例如,在该示例中,第一神经网络NE1包括两个依序设置的较大神经网络NE1a和NE1b,NE1a和NE1b例如分别为图7A所示的神经网络和图7B所示的神经网络,这两个神经网络的设置顺序不受限制。
例如,在步骤S41中,将第三样本数据输入第一神经网络NE1,第一神经网络NE1中的一个神经网络NE1a对第三样本数据处理后将处理结果输入到第一神经网络NE1中的另一个神经网络NE1b,该另一个神经网络NE1b对接收到的图像进行处理并将处理结果输出,以作为该第一神经网络NE1的输出。由此,第一神经网络NE1输出的第五数据既进行了去噪处理,又进行了去模糊处理,第一神经网络NE1兼具去噪和去模糊的功能。第五数据作为真值图像。将第三样本数据输入第二神经网络NE2,第二神经网络NE2输出的第六数据作为假值图像。
例如,在步骤S42和S43中,基于第五数据和第六数据计算第三损失函数,并利用第三损失函数进行反向传播以调整第二神经网络NE2的参数,由此得到更新后的第二神经网络NE2,也即是,得到所需要的卷积神经网络。
例如,在该示例中,第三损失函数可以采用平均绝对误差(L1 loss)和索贝尔误差(Sobel loss)的加权和。平均绝对误差的计算公式如下:L1 loss=mean(|X-Y|),其中,X为第六数据,Y为第五数据。索贝尔误差的计算公式为:Sobel loss=mean(|sobel_edge(gray(X))-sobel_edge(gray(Y))|),其中,X为第六数据,Y为第五数据,sobel_edge()表示Sobel边缘,该公式将X和Y转至灰度域后提取Sobel边缘,接着做差后求均值即可。例如,在一些示例中,平均绝对误差(L1 loss)的权重可以设置为0.5~1.5(例如1),索贝尔误差(Sobel loss)的权重可以设置为1.5~2.5(例如2),从而可以取得较好的训练效果。
例如,索贝尔误差(也称为Sobel算子)是像素图像边缘检测中最重要的算子之一,该算子包含两组3*3的矩阵,分别为横向及纵向,将之与图像作卷积,即可分别得出横向及纵向的亮度差分近似值。例如,采用Gx和Gy分别表示在横向及纵向的灰度偏导的近似值,Sobel算子的计算公式如下:
Gx = [[-1, 0, +1], [-2, 0, +2], [-1, 0, +1]] * A
Gy = [[+1, +2, +1], [0, 0, 0], [-1, -2, -1]] * A
在上述Sobel算子中,A表示图像。对于每一个像素点,可以获得x、y两个方向的梯度,可以通过下述公式算出梯度的估计值:
G = sqrt(Gx^2 + Gy^2)
例如，可以定义一个阈值Gmax，如果G比Gmax大，可以认为该点是一个边界值，则保留该点，设置为白色，否则该点设置为黑色。由此就得到了图像梯度信息。在训练过程中，将第五数据和第六数据都转至灰度域，分别求Sobel梯度，并计算梯度图的均差作为损失值进行反向传播。
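下面给出第三损失函数（平均绝对误差与索贝尔误差的加权和）的一个示意性计算草图（基于PyTorch；灰度转换系数、数值稳定用的小量等细节为此处的假设，权重取1和2作为示例）：

```python
import torch
import torch.nn.functional as F

def to_gray(rgb):
    """将RGB图像转至灰度域（亮度系数为常用取值，此处为假设）。"""
    r, g, b = rgb[:, 0:1], rgb[:, 1:2], rgb[:, 2:3]
    return 0.299 * r + 0.587 * g + 0.114 * b

def sobel_edge(gray):
    """按上文Gx、Gy两组3*3矩阵与图像作卷积，返回梯度幅值G。"""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = torch.tensor([[1., 2., 1.], [0., 0., 0.], [-1., -2., -1.]]).view(1, 1, 3, 3)
    gx = F.conv2d(gray, kx.to(gray), padding=1)
    gy = F.conv2d(gray, ky.to(gray), padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-12)   # 加小量1e-12避免开方处的数值问题（假设）

def third_loss(x, y, w_l1=1.0, w_sobel=2.0):
    """第三损失函数：L1 loss与Sobel loss的加权和，x为第六数据，y为第五数据。"""
    l1 = torch.mean(torch.abs(x - y))
    sobel = torch.mean(torch.abs(sobel_edge(to_gray(x)) - sobel_edge(to_gray(y))))
    return w_l1 * l1 + w_sobel * sobel
```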
在该示例中,由于第一神经网络NE1兼具去噪和去模糊的功能,因此训练得到的卷积神经网络可以同时学习去噪和去模糊功能,能够很好地恢复视频帧的清晰度,同时保持图像信息。通过采用上述训练方式,使得仅需要较少的样本数据即可快速训练得到满足需求的卷积神经网络。
本公开至少一个实施例还提供一种终端设备,该终端设备可以基于卷积神经网络对输入视频进行处理以提升画面清晰度,实现实时画质增强,处理效果好,处理效率高。
图9A为本公开一些实施例提供的一种终端设备的示意框图。例如,如图9A所示,该终端设备100包括处理器110。
例如,处理器110被配置为:获取输入视频码率和输入视频,输入视频包括多个输入图像帧;根据输入视频码率选择与输入视频码率对应的视频处理方法对多个输入图像帧中的至少一个输入图像帧进行处理,得到至少一个输出图像帧。例如,至少一个输出图像帧的清晰度高于至少一个输入图像帧的清晰度,不同的输入视频码率对应于不同的视频处理方法。
例如,上述视频处理方法包括:基于训练好的神经网络对至少一个输入图像帧进行处理,得到至少一个输出图像帧。例如,可以采用图3A所示的卷积神经网络来实现对输入图像帧的处理,该神经网络具有去噪和/或去模糊功能,可以有效提升图像的清晰度,具有很好的细节修复效果,能够实现图像增强。例如,可以采用本公开任一实施例提供的图像处理方法来实现对输入图像帧的处理。
例如,基于训练好的神经网络对至少一个输入图像帧进行处理,得到至少一个输出图像帧,可以包括如下操作:
对至少一个输入图像帧进行特征提取,得到多个第一输出图像;
对至少一个输入图像帧和多个第一输出图像进行拼接处理,得到第一输出图像组,其中,第一输出图像组包括至少一个输入图像帧和多个第一输出图像;
对第一输出图像组进行特征提取,得到多个第二输出图像;
将多个第二输出图像和多个第一输出图像进行融合,得到多个第三输出图像;
对至少一个输入图像帧和多个第三输出图像进行拼接处理,得到第二输出图像组,其中,第二输出图像组包括至少一个输入图像帧和多个第三输出图像;
对第二输出图像组进行特征提取,得到至少一个输出图像帧。
例如,上述对输入图像帧进行处理的步骤与图3B所示的步骤基本上相同,相关说明可参考前述内容,此处不再赘述。
例如,不同的视频处理方法对应的训练好的神经网络是不同的。也即是,不同的输入视频码率对应于不同的视频处理方法,不同的视频处理方法对应于不同的神经网络,具有不同输入视频码率的输入视频采用不同的神经网络来进行处理。神经网络具***率针对性,通过对码率进行区分,可以使神经网络针对对应码率的视频具有更好的处理效果。而且,不同码率的输入视频采用不同的视频处理方法(或不同的神经网络)来处理,可以使不同码率的输入视频经过处理后获得的输出视频具有较为一致的清晰度,使得处理效果不受到码率的影响,从而提高该终端设备100的性能稳定性和一致性。
例如,不同的视频处理方法对应的训练好的神经网络分别利用不同的样本数据集训练得到,不同的样本数据集分别基于不同的视频集得到,每个视频集包括多个视频,同一个视频集中的视频具有相同的码率,不同视频集中的视频具有不同的码率。也即是,对于同一个神经网络,其训练所采用的样本数据集来自具有相同码率的视频;对于不同的神经网络,其训练所采用的样本数据集来自具有不同码率的视频。视频的码率是指数据传输时单位时间传送的数据位数,单位通常为kbps,即千位每秒。码率越高,视频被压缩的比例越小,画质损失就越小,图像的噪声越小,与原始视频越接近。码率越低,图像的噪声越大。相应地,对应于低码率的神经网络的去噪强度大,对应于高码率的神经网络的去噪强度小。需要注意的是,上述描述中的“高”和“低”都是相对的,即码率高和码率低是相对比来说的。
图9B为本公开一些实施例提供的另一种终端设备的示意框图。例如,如图9B所示,该终端设备200例如实现为终端视频处理器200,且包括硬件开发板210,硬件开发板210上部署有软件开发包211,硬件开发板210包括中央处理器212和神经网络处理器213。
例如,软件开发包211可以实现为具有通用接口或自定义接口的程序及相关文件的集合,也即软件开发工具包(Software Development Kit,SDK)。软件开发包211例如可以部署在硬件开发板210的片上内存(ROM)中,运行时从ROM中读取。
例如,软件开发包211包括多个神经网络模块UN,多个神经网络模块UN分别基于多个卷积神经网络得到,多个神经网络模块UN与多个卷积神经网络一一对应。例如,该卷积神经网络可以为图3A所示的卷积神经网络,该卷积神经网络具有去噪和/或去模糊功能,可以有效提升图像的清晰度,具有很好的细节修复效果,能够实现图像增强。
例如，多个神经网络模块UN基于对多个卷积神经网络进行参数量化得到。例如，可以将数据类型由32位浮点数（float32）转换为8位整数（int8），从而实现参数量化，以有效节省算力，使终端视频处理器200能够支持神经网络模块UN的运算。例如，可以在float32的精度上训练得到卷积神经网络（例如采用图5A-6B所示的训练方法进行训练），然后对训练好的卷积神经网络进行参数量化，将数据类型转换为int8，从而得到神经网络模块UN。得到的神经网络模块UN具有与卷积神经网络相同的功能，虽然由于参数量化导致神经网络模块UN的处理效果与卷积神经网络的处理效果有轻微差异，但是人眼难以察觉，该质量损失可以忽略不计。
通过参数量化,可以有效减少运算量和数据量,使得该神经网络模块UN适于在终端视频处理器200上运行,可以节省算力和内存量。例如,在本公开的实施例中,在参数量压缩约300倍的情况下,神经网络模块UN的输出仍然可以较好地保持图像质量,因此可以将神经网络模块UN部署到终端视频处理器200上,为实现实时图像增强提供了可能。
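参数量化的基本原理可以用如下示意性草图说明（仅演示最简单的对称量化思路，函数名为此处的假设；实际部署时通常使用推理框架自带的量化工具链）：

```python
import torch

def quantize_int8(weight):
    """示意性的对称量化：把float32权重映射为int8，并返回反量化所需的缩放因子。"""
    scale = weight.abs().max() / 127.0 + 1e-12   # 加小量避免全零权重时除零（假设）
    q = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    """推理时按缩放因子还原为浮点数（示意）。"""
    return q.to(torch.float32) * scale
```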
例如,不同神经网络模块UN用于处理具有不同码率的视频数据。例如,可以预先训练多个如图3A所示的卷积神经网络,这些卷积神经网络的结构相同,但是参数不同。例如,不同的卷积神经网络分别利用不同的样本数据集训练得到,不同的样本数据集分别基于不同的视频集得到,每个视频集包括多个视频,同一个视频集中的视频具有相同的码率,不同视频集中的视频具有不同的码率。也即是,对于同一个卷积神经网络,其训练所采用的样本数据集来自具有相同码率的视频;对于不同的卷积神经网络,其训练所采用的样本数据集来自具有不同码率的视频。视频的码率是指数据传输时单位时间传送的数据位数,单位通常为kbps,即千位每秒。码率越高,视频被压缩的比例越小,画质损失就越小,图像的噪声越小,与原始视频越接近。码率越低,图像的噪声越大。相应地,对应于低码率的神经网络的去噪强度大,对应于高码率的神经网络的去噪强度小。需要注意的是,上述描述中的“高”和“低”都是相对的,即码率高和码率低是相对比来说的。
通过对码率进行区分,使得训练得到的多个卷积神经网络具***率针对性,针对对应码率的视频具有更好的处理效果。例如,可以针对几种常见的视频码率训练得到对应的几个卷积神经网络。将这些卷积神经网络进行参数量化,得到的多个神经网络模块UN也具***率针对性,不同神经网络模块UN用于处理具有不同码率的视频数据,以达到更好的处理效果。
例如,中央处理器212配置为:调用软件开发包211并根据输入视频的码率选择多个神经网络模块UN中的一个神经网络模块UN,控制神经网络处理器213基于选择的神经网络模块UN对输入视频进行处理,以提升输入视频的清晰度。
例如，多个神经网络模块UN用于处理具有不同码率的视频数据，每个神经网络模块UN用于处理对应码率的视频数据。根据输入视频的码率，选择对应的神经网络模块UN，利用该神经网络模块UN对输入视频进行处理，从而可以提升输入视频的清晰度，实现实时画质增强。
例如,硬件开发板210可以为ARM开发板,相应地,中央处理器212可以为ARM架构的中央处理器(CPU)。例如,神经网络处理器213可以为适用于ARM开发板的神经网络处理器(Neural-Network Processing Unit,NPU),其采用数据驱动并行计算的架构,适于处理视频、图像类的海量多媒体数据。需要说明的是,本公开的实施例中,硬件开发板210、中央处理器212和神经网络处理器213的类型和硬件结构不受限制,可以采用任意适用的硬件,这可以根据实际需求而定,只需能实现相应功能即可,本公开的实施例对此不作限制。例如,硬件开发板210可以采用Android***或Linux***进行硬件集成,也可以利用智能电视或电视机顶盒中的操作***进行集成,本公开的实施例对此不作限制。
例如,终端视频处理器200可以实现为电视终端(例如智能电视)中的部件,也可以实现为电视机顶盒,还可以实现为视频播放设备中的部件或实现为其他任意形式,本公开的实施例对此不作限制。
需要说明的是,本公开的实施例中,终端设备200还可以包括更多的部件、结构和模块,而不限于图9A和图9B所示的情形,从而实现更加全面的功能,本公开的实施例对此不作限制。例如,在一些示例中,终端设备200中还可以部署有选择模块,该选择模块用于根据输入视频的码率选择对应的神经网络模块UN,以利用选择的神经网络模块UN对输入视频进行处理。例如,在另一些示例中,终端设备200中还可以部署有视频解码模块和视频编码模块,视频解码模块用于进行解码操作,视频编码模块用于进行编码操作。
例如,多个神经网络模块UN中至少一个神经网络模块UN用于进行如下处理:接收输入图像;对输入图像进行处理得到输出图像。例如,输出图像的清晰度高于输入图像的清晰度。例如,对输入图像进行处理得到输出图像包括:对输入图像进行特征提取,得到多个第一图像;对输入图像和多个第一图像进行拼接处理,得到第一图像组,其中,第一图像组包括输入图像和多个第一图像;对第一图像组进行特征提取,得到多个第二图像;将多个第二图像和多个第一图像进行融合,得到多个第三图像;对输入图像和多个第三图像进行拼接处理,得到第二图像组,其中,第二图像组包括输入图像和多个第三图像;对第二图像组进行特征提取,得到输出图像。例如,神经网络模块UN用于实现本公开实施例提供的图像处理方法,相关说明可参考前述内容,此处不再赘述。
图10A为本公开一些实施例提供的一种终端设备的数据流图,图10B为本公开一些实施例提供的一种终端设备的操作流程图。例如,在一些示例中, 如图10A和图10B所示,软件开发包211启动后,可以先导入多个神经网络模块UN,然后读入视频帧以及视频的码率信息,由于采用了模型与视频码率绑定的方式,当视频的码率确定后,则选择与码率对应的神经网络模块UN进行处理,最后输出结果。
具体地,可以采用如下方式对输入视频进行处理。
首先,初始化模型及配置参数。完成初始化后,对输入视频(例如视频文件或视频流)进行解码,以得到视频帧。例如,当输入视频为视频文件时,该终端设备(例如终端视频处理器200)可以实现离线视频的画质增强,当输入视频为视频流时,该终端视频处理器200可以实现直播视频的实时画质增强。
接着,读入视频帧,开始对视频帧进行处理。根据输入视频的码率信息选择多个神经网络模块UN中与该码率对应的神经网络模块UN,接着利用选择的神经网络模块UN进行模型推理,也即,进行图像处理(例如去噪和/或去模糊),从而得到处理之后的视频帧。处理之后的视频帧的清晰度得到提升。然后对处理之后的视频帧进行编码,之后输出给显示器,显示器显示处理之后的视频。
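上述"按码率选择神经网络模块并逐帧处理"的流程可以示意为如下草图（其中码率到模块的映射及选择策略均为此处的假设）：

```python
def select_module(bitrate_kbps, modules):
    """按输入视频码率选择神经网络模块；modules为假设的“码率->模块”映射，例如{1000: net_1m, 2000: net_2m}。"""
    key = min(modules, key=lambda b: abs(b - bitrate_kbps))   # 选择码率最接近的模块（策略为假设）
    return modules[key]

def enhance_frames(frames, bitrate_kbps, modules):
    """对解码得到的视频帧逐帧进行模型推理，返回清晰度提升后的帧序列。"""
    net = select_module(bitrate_kbps, modules)
    return [net(frame) for frame in frames]
```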
图11A为视频画面示意图,该视频画面为1M码率、标清360p的视频中提取的视频帧。图11B为应用本公开实施例提供的终端设备对图11A所示的画面进行处理之后的效果图。对比图11A和图11B可知,经过终端设备(例如终端视频处理器200)的处理,画面清晰度得到了提升,实现了画质增强,该终端视频处理器200具有较好的画质增强能力,且能够实现实时处理。
通过利用神经网络模块UN,本公开实施例提供的终端设备(例如终端视频处理器200)能够去除低质量图像和视频的噪声,提升画面清晰度,实现画质增强。并且,由于部署在终端视频处理器200上的神经网络模块UN的结构简单,因此可以节省设备算力,提高处理效率,能够得到终端设备的硬件能力的支持,能够满足对视频流的处理速度需求,实现实时画质增强。
本公开至少一个实施例还提供一种图像处理装置,该图像处理装置所采用的卷积神经网络的结构简单,可以节省设备算力,能够去除低质量图像和视频的噪声,提升画面清晰度,实现实时画质增强,便于应用到终端设备中。至少一个实施例提供的图像处理装置还具有较好的神经网络训练效果。
图12为本公开一些实施例提供的一种图像处理装置的示意框图。如图12所示,该图像处理装置300包括处理器310和存储器320。存储器320用于存储非暂时性计算机可读指令(例如一个或多个计算机程序模块)。处理器310用于运行非暂时性计算机可读指令,非暂时性计算机可读指令被处理器310运行时可以执行上文所述的图像处理方法中的一个或多个步骤。存储器320和处理器310可以通过总线***和/或其它形式的连接机构(未示出)互连。
例如,处理器310可以是中央处理单元(CPU)、数字信号处理器(DSP)或者具有数据处理能力和/或程序执行能力的其它形式的处理单元,例如现场可编程门阵列(FPGA)等;例如,中央处理单元(CPU)可以为X86或ARM架构等。处理器310可以为通用处理器或专用处理器,可以控制图像处理装置300中的其它组件以执行期望的功能。
例如,存储器320可以包括一个或多个计算机程序产品的任意组合,计算机程序产品可以包括各种形式的计算机可读存储介质,例如易失性存储器和/或非易失性存储器。易失性存储器例如可以包括随机存取存储器(RAM)和/或高速缓冲存储器(cache)等。非易失性存储器例如可以包括只读存储器(ROM)、硬盘、可擦除可编程只读存储器(EPROM)、便携式紧致盘只读存储器(CD-ROM)、USB存储器、闪存等。在计算机可读存储介质上可以存储一个或多个计算机程序模块,处理器310可以运行一个或多个计算机程序模块,以实现图像处理装置300的各种功能。在计算机可读存储介质中还可以存储各种应用程序和各种数据以及应用程序使用和/或产生的各种数据等。
需要说明的是,本公开的实施例中,图像处理装置300的具体功能和技术效果可以参考上文中关于图像处理方法的描述,此处不再赘述。
图13为本公开一些实施例提供的另一种图像处理装置的示意框图。该图像处理装置400例如适于用来实施本公开实施例提供的图像处理方法。图像处理装置400可以是用户终端等。需要注意的是,图13示出的图像处理装置400仅仅是一个示例,其不会对本公开实施例的功能和使用范围带来任何限制。
如图13所示,图像处理装置400可以包括处理装置(例如中央处理器、图形处理器等)410,其可以根据存储在只读存储器(ROM)420中的程序或者从存储装置480加载到随机访问存储器(RAM)430中的程序而执行各种适当的动作和处理。在RAM 430中,还存储有图像处理装置400操作所需的各种程序和数据。处理装置410、ROM 420以及RAM 430通过总线440彼此相连。输入/输出(I/O)接口450也连接至总线440。
通常,以下装置可以连接至I/O接口450:包括例如触摸屏、触摸板、键盘、鼠标、摄像头、麦克风、加速度计、陀螺仪等的输入装置460;包括例如液晶显示器(LCD)、扬声器、振动器等的输出装置470;包括例如磁带、硬盘等的存储装置480;以及通信装置490。通信装置490可以允许图像处理装置400与其他电子设备进行无线或有线通信以交换数据。虽然图13示出了具有各种装置的图像处理装置400,但应理解的是,并不要求实施或具备所有示出的装置,图像处理装置400可以替代地实施或具备更多或更少的装置。
例如，本公开实施例提供的图像处理方法可以被实现为计算机软件程序。例如，本公开的实施例包括一种计算机程序产品，其包括承载在非暂态计算机可读介质上的计算机程序，该计算机程序包括用于执行上述图像处理方法的程序代码。在这样的实施例中，该计算机程序可以通过通信装置490从网络上被下载和安装，或者从存储装置480安装，或者从ROM 420安装。在该计算机程序被处理装置410执行时，可以执行本公开实施例提供的图像处理方法。
本公开至少一个实施例还提供一种存储介质,用于存储非暂时性计算机可读指令,当该非暂时性计算机可读指令由计算机执行时可以实现本公开任一实施例所述的图像处理方法。利用该存储介质,可以通过卷积神经网络进行图像处理,该卷积神经网络的结构简单,可以节省设备算力,能够去除低质量图像和视频的噪声,提升画面清晰度,实现实时画质增强,便于应用到终端设备中。至少一个实施例提供的存储介质还具有较好的神经网络训练效果。
图14为本公开一些实施例提供的一种存储介质的示意图。如图14所示,存储介质500用于存储非暂时性计算机可读指令510。例如,当非暂时性计算机可读指令510由计算机执行时可以执行根据上文所述的图像处理方法中的一个或多个步骤。
例如,该存储介质500可以应用于上述图像处理装置300中。例如,存储介质500可以为图12所示的图像处理装置300中的存储器320。例如,关于存储介质500的相关说明可以参考图12所示的图像处理装置300中的存储器320的相应描述,此处不再赘述。
本公开至少一个实施例还提供一种视频处理方法,该视频处理方法可以基于卷积神经网络对输入视频进行处理以提升画面清晰度,实现实时画质增强,处理效果好,处理效率高。
图15为本公开一些实施例提供的一种视频处理方法的流程示意图。例如,如图15所示,该视频处理方法可以包括如下操作。
步骤S61:获取输入视频码率和输入视频,其中,输入视频包括多个输入图像帧;
步骤S62:根据输入视频码率选择与输入视频码率对应的视频处理子方法对多个输入图像帧中的至少一个输入图像帧进行处理,得到至少一个输出图像帧,其中,至少一个输出图像帧的清晰度高于至少一个输入图像帧的清晰度。
例如,不同的输入视频码率对应于不同的视频处理子方法。这里,“视频处理子方法”可以为本公开任一实施例所述的图像处理方法,也可以为本公开任一实施例提供的终端设备中对某一码率的视频进行处理的方法,相关说明可参考前述内容,此处不再赘述。
有以下几点需要说明:
(1)本公开实施例附图只涉及到本公开实施例涉及到的结构,其他结构可参考通常设计。
(2)在不冲突的情况下,本公开的实施例及实施例中的特征可以相互组合以得到新的实施例。
以上所述,仅为本公开的具体实施方式,但本公开的保护范围并不局限于此,本公开的保护范围应以所述权利要求的保护范围为准。

Claims (19)

  1. 一种图像处理方法,适用于卷积神经网络,其中,所述方法包括:
    接收输入图像;
    利用所述卷积神经网络对所述输入图像进行处理得到输出图像,其中,所述输出图像的清晰度高于所述输入图像的清晰度;
    利用所述卷积神经网络对所述输入图像进行处理得到所述输出图像包括:
    对所述输入图像进行特征提取,得到多个第一图像;
    对所述输入图像和所述多个第一图像进行拼接处理,得到第一图像组,其中,所述第一图像组包括所述输入图像和所述多个第一图像;
    对所述第一图像组进行特征提取,得到多个第二图像;
    将所述多个第二图像和所述多个第一图像进行融合,得到多个第三图像;
    对所述输入图像和所述多个第三图像进行拼接处理,得到第二图像组,其中,所述第二图像组包括所述输入图像和所述多个第三图像;
    对所述第二图像组进行特征提取,得到所述输出图像。
  2. 根据权利要求1所述的方法,其中,所述卷积神经网络中用于对所述输入图像进行特征提取的卷积核的数量为N,12≤N≤20且N为整数,
    所述卷积神经网络中用于对所述第一图像组进行特征提取的卷积核的数量为M,12≤M≤20且M为整数,
    所述卷积神经网络中用于对所述第二图像组进行特征提取的卷积核的数量为3。
  3. 根据权利要求2所述的方法,其中,N=M=16;
    用于对所述输入图像进行特征提取的所述卷积核的尺寸、用于对所述第一图像组进行特征提取的所述卷积核的尺寸以及用于对所述第二图像组进行特征提取的所述卷积核的尺寸均为3×3;
    所述卷积神经网络中用于特征提取的激活函数为:y=max(0,x),其中,x表示所述激活函数的输入,y表示所述激活函数的输出;
    所述输入图像包括红色通道输入图像、绿色通道输入图像和蓝色通道输入图像,所述输出图像包括红色通道输出图像、绿色通道输出图像和蓝色通道输出图像。
  4. 根据权利要求1-3任一所述的方法,还包括:
    基于预先训练好的第一神经网络对待训练的第二神经网络进行训练得到经过训练的第二神经网络,由此得到所述卷积神经网络;
    其中，所述第一神经网络的参数多于所述第二神经网络的参数，预先训练好的所述第一神经网络被配置为将输入预先训练好的所述第一神经网络的具有第一清晰度的原始图像变换为具有第二清晰度的新建图像，所述第二清晰度大于所述第一清晰度，经过训练的所述第二神经网络为所述卷积神经网络，待训练的所述第二神经网络的网络结构与所述卷积神经网络的网络结构相同，待训练的所述第二神经网络的参数与所述卷积神经网络的参数不同。
  5. 根据权利要求4所述的方法,其中,基于预先训练好的所述第一神经网络对待训练的所述第二神经网络进行训练得到经过训练的所述第二神经网络,由此得到所述卷积神经网络,包括:
    基于预先训练好的所述第一神经网络、待训练的所述第二神经网络和鉴别网络,交替训练所述鉴别网络和所述第二神经网络,得到经过训练的所述第二神经网络,由此得到所述卷积神经网络。
  6. 根据权利要求5所述的方法,其中,训练所述鉴别网络包括:
    将第一样本数据分别输入所述第一神经网络和所述第二神经网络,得到从所述第一神经网络输出的第一数据以及从所述第二神经网络输出的第二数据;
    将所述第一数据设置为具有真值标签,并将具有真值标签的所述第一数据输入所述鉴别网络以得到第一鉴别结果,将所述第二数据设置为具有假值标签,并将具有假值标签的所述第二数据输入所述鉴别网络以得到第二鉴别结果;
    基于所述第一鉴别结果和所述第二鉴别结果计算第一损失函数;
    根据所述第一损失函数调整所述鉴别网络的参数以得到更新后的鉴别网络。
  7. 根据权利要求6所述的方法,其中,训练所述第二神经网络包括:
    将第二样本数据分别输入所述第一神经网络和所述第二神经网络,得到从所述第一神经网络输出的第三数据以及从所述第二神经网络输出的第四数据;
    将所述第四数据设置为具有真值标签,并将具有真值标签的所述第四数据输入到更新后的所述鉴别网络,得到从所述鉴别网络输出的第三鉴别结果;
    基于所述第三数据和所述第四数据计算误差函数,基于所述第三鉴别结果计算鉴别函数,并基于所述误差函数和所述鉴别函数计算第二损失函数;
    根据所述第二损失函数调整所述第二神经网络的参数以得到更新后的第二神经网络。
  8. 根据权利要求7所述的方法,其中,所述第二损失函数为所述误差函数与所述鉴别函数的加权和。
  9. 根据权利要求8所述的方法,其中,所述误差函数的权重为90~110,所述鉴别函数的权重为0.5~2。
  10. 根据权利要求7所述的方法,其中,所述第一样本数据和所述第二样本数据为基于多个具有相同码率的视频得到的图像数据。
  11. 根据权利要求4所述的方法,其中,基于预先训练好的所述第一神经网络对待训练的所述第二神经网络进行训练得到经过训练的所述第二神经网络,由此得到所述卷积神经网络,包括:
    将第三样本数据分别输入所述第一神经网络和所述第二神经网络,得到从所述第一神经网络输出的第五数据以及从所述第二神经网络输出的第六数据;
    基于所述第五数据和所述第六数据计算第三损失函数;
    根据所述第三损失函数调整所述第二神经网络的参数以得到更新后的第二神经网络。
  12. 根据权利要求4-11任一所述的方法,其中,所述第一神经网络包括多级下采样单元和对应的多级上采样单元,每级下采样单元的输出作为下一级下采样单元的输入,每级上采样单元的输入包含与该级上采样单元对应的下采样单元的输出和该级上采样单元的上一级上采样单元的输出。
  13. 一种终端设备,包括处理器,
    其中,所述处理器被配置为:
    获取输入视频码率和输入视频,其中,所述输入视频包括多个输入图像帧;
    根据所述输入视频码率选择与所述输入视频码率对应的视频处理方法对所述多个输入图像帧中的至少一个输入图像帧进行处理,得到至少一个输出图像帧,其中,所述至少一个输出图像帧的清晰度高于所述至少一个输入图像帧的清晰度;
    其中,不同的输入视频码率对应于不同的视频处理方法。
  14. 根据权利要求13所述的终端设备,其中,所述视频处理方法包括:
    基于训练好的神经网络对所述至少一个输入图像帧进行处理,得到所述至少一个输出图像帧;
    其中,基于训练好的所述神经网络对所述至少一个输入图像帧进行处理,得到所述至少一个输出图像帧,包括:
    对所述至少一个输入图像帧进行特征提取,得到多个第一输出图像;
    对所述至少一个输入图像帧和所述多个第一输出图像进行拼接处理,得到第一输出图像组,其中,所述第一输出图像组包括所述至少一个输入图像帧和所述多个第一输出图像;
    对所述第一输出图像组进行特征提取,得到多个第二输出图像;
    将所述多个第二输出图像和所述多个第一输出图像进行融合,得到多个第三输出图像;
    对所述至少一个输入图像帧和所述多个第三输出图像进行拼接处理,得到第二输出图像组,其中,所述第二输出图像组包括所述至少一个输入图像帧和所述多个第三输出图像;
    对所述第二输出图像组进行特征提取,得到所述至少一个输出图像帧。
  15. 根据权利要求14所述的终端设备,其中,不同的视频处理方法对应的训练好的神经网络是不同的。
  16. 根据权利要求15所述的终端设备,其中,不同的视频处理方法对应的训练好的神经网络分别利用不同的样本数据集训练得到,不同的样本数据集分别基于不同的视频集得到,每个视频集包括多个视频,同一个视频集中的视频具有相同的码率,不同视频集中的视频具有不同的码率。
  17. 一种视频处理方法,包括:
    获取输入视频码率和输入视频,其中,所述输入视频包括多个输入图像帧;
    根据所述输入视频码率选择与所述输入视频码率对应的视频处理子方法对所述多个输入图像帧中的至少一个输入图像帧进行处理,得到至少一个输出图像帧,其中,所述至少一个输出图像帧的清晰度高于所述至少一个输入图像帧的清晰度;
    其中,不同的输入视频码率对应于不同的视频处理子方法。
  18. 一种图像处理装置,包括:
    处理器;
    存储器,包括一个或多个计算机程序模块;
    其中,所述一个或多个计算机程序模块被存储在所述存储器中并被配置为由所述处理器执行,所述一个或多个计算机程序模块包括用于实现权利要求1-12任一所述的图像处理方法。
  19. 一种存储介质,用于存储非暂时性计算机可读指令,当所述非暂时性计算机可读指令由计算机执行时可以实现权利要求1-12任一所述的图像处理方法。
PCT/CN2020/119363 2020-09-30 2020-09-30 图像处理方法及装置、设备、视频处理方法及存储介质 WO2022067653A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US17/430,840 US20220164934A1 (en) 2020-09-30 2020-09-30 Image processing method and apparatus, device, video processing method and storage medium
CN202080002197.6A CN114586056A (zh) 2020-09-30 2020-09-30 图像处理方法及装置、设备、视频处理方法及存储介质
PCT/CN2020/119363 WO2022067653A1 (zh) 2020-09-30 2020-09-30 图像处理方法及装置、设备、视频处理方法及存储介质

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/119363 WO2022067653A1 (zh) 2020-09-30 2020-09-30 图像处理方法及装置、设备、视频处理方法及存储介质

Publications (1)

Publication Number Publication Date
WO2022067653A1 true WO2022067653A1 (zh) 2022-04-07

Family

ID=80951120

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/119363 WO2022067653A1 (zh) 2020-09-30 2020-09-30 图像处理方法及装置、设备、视频处理方法及存储介质

Country Status (3)

Country Link
US (1) US20220164934A1 (zh)
CN (1) CN114586056A (zh)
WO (1) WO2022067653A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115034999B (zh) * 2022-07-06 2024-03-19 四川大学 一种基于雨雾分离处理和多尺度网络的图像去雨方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180075581A1 (en) * 2016-09-15 2018-03-15 Twitter, Inc. Super resolution using a generative adversarial network
CN108986050A (zh) * 2018-07-20 2018-12-11 北京航空航天大学 一种基于多分支卷积神经网络的图像和视频增强方法
CN109360171A (zh) * 2018-10-26 2019-02-19 北京理工大学 一种基于神经网络的视频图像实时去模糊方法
CN110072119A (zh) * 2019-04-11 2019-07-30 西安交通大学 一种基于深度学习网络的内容感知视频自适应传输方法

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7206453B2 (en) * 2001-05-03 2007-04-17 Microsoft Corporation Dynamic filtering for lossy compression
US11216917B2 (en) * 2019-05-03 2022-01-04 Amazon Technologies, Inc. Video enhancement using a neural network

Also Published As

Publication number Publication date
US20220164934A1 (en) 2022-05-26
CN114586056A (zh) 2022-06-03

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20955678; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 12/07/2023))
122 Ep: pct application non-entry in european phase (Ref document number: 20955678; Country of ref document: EP; Kind code of ref document: A1)