CN115496656A - Training method of image processing model, image processing method and device


Info

Publication number
CN115496656A
Authority
CN
China
Prior art keywords
video frame, super-resolution, sample, image processing
Legal status
Pending
Application number
CN202211118972.1A
Other languages
Chinese (zh)
Inventor
磯部駿
陶鑫
戴宇荣
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202211118972.1A
Publication of CN115496656A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 - Geometric image transformations in the plane of the image
    • G06T3/40 - Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053 - Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods


Abstract

The present disclosure relates to a training method of an image processing model, an image processing method, an apparatus, an electronic device, a storage medium, and a computer program product. The method comprises the following steps: obtaining a compressed sample video frame sequence; for each sample video frame other than the first sample video frame in the extended sample video frame sequence, inputting the sample video frame, the previous sample video frame of the sample video frame, and a target sample video frame in the extended sample video frame sequence into an image processing model to be trained, to obtain a predicted super-resolution video frame of the sample video frame; and training the image processing model to be trained according to the predicted super-resolution video frame of the sample video frame and the target super-resolution video frame of the sample video frame, to obtain a trained image processing model. The image processing model to be trained comprises a multi-stage convolutional network. With this method, the super-resolution effect on compressed video frames can be improved.

Description

Training method of image processing model, image processing method and device
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a training method for an image processing model, an image processing method, an image processing apparatus, an electronic device, a storage medium, and a computer program product.
Background
With the development of image processing technology, video compression technology has emerged: a video is compressed before transmission, so that transmission does not occupy excessive bandwidth and transmission efficiency is improved. However, compressed video frames tend to contain noise and lack image detail, so super-resolution processing needs to be performed on the compressed video frames.
In the related art, super-resolution processing of compressed video frames is mainly realized through a trained super-resolution model. When training such a model, a compressed video frame is typically input into the model, the model outputs a predicted super-resolution result for that frame, and the model is trained using the loss between the predicted super-resolution result and the real super-resolution result. However, because only the information of the video frame itself is considered during training, the super-resolution effect of the trained super-resolution model on compressed video frames is poor.
Disclosure of Invention
The present disclosure provides a training method of an image processing model, an image processing method, an apparatus, an electronic device, a storage medium, and a computer program product, to at least solve the problem in the related art that the super-resolution effect on compressed video frames is poor. The technical scheme of the disclosure is as follows:
according to a first aspect of an embodiment of the present disclosure, a method for training an image processing model is provided, including:
obtaining a compressed sample video frame sequence;
for each sample video frame other than the first sample video frame in the extended sample video frame sequence, inputting the sample video frame, the previous sample video frame of the sample video frame, and a target sample video frame in the extended sample video frame sequence into an image processing model to be trained, to obtain a predicted super-resolution video frame of the sample video frame; the expanded sample video frame sequence is obtained by pasting a second sample video frame in the compressed sample video frame sequence in front of a first sample video frame in the compressed sample video frame sequence; the target sample video frame is a sample video frame whose video frame type in the expanded sample video frame sequence is a preset video frame type;
training the image processing model to be trained according to the predicted super-resolution video frame of the sample video frame and the target super-resolution video frame of the sample video frame to obtain a trained image processing model;
the image processing model to be trained comprises a plurality of stages of convolution networks, wherein each stage of convolution network comprises a reconstruction layer and a super-resolution network; and the reconstruction layer and the super-resolution network in each stage of the convolutional network except the last stage of the convolutional network are connected with the next stage of the super-resolution network.
In an exemplary embodiment, the inputting the sample video frame, the previous sample video frame of the sample video frame, and the target sample video frame in the sequence of the extended sample video frames into an image processing model to be trained to obtain a predicted super-resolution video frame of the sample video frame includes:
splicing the sample video frame, a previous sample video frame of the sample video frame and a target sample video frame in the expanded sample video frame sequence to obtain a spliced video frame corresponding to the sample video frame;
performing deformable convolution processing on the spliced video frame corresponding to the sample video frame to obtain the image characteristics of the sample video frame;
and inputting the image characteristics of the sample video frame into an image processing model to be trained to obtain a predicted super-resolution video frame of the sample video frame.
In an exemplary embodiment, the inputting the image features of the sample video frame into an image processing model to be trained to obtain a predicted super-resolution video frame of the sample video frame includes:
inputting the image characteristics of the sample video frame into an image processing model to be trained, performing super-resolution processing on the image characteristics of the sample video frame through a first-stage super-resolution network in the image processing model to be trained, inputting the obtained first-stage super-resolution result into a first-stage reconstruction layer for reconstruction processing, inputting the obtained first-stage reconstruction result together with the first-stage super-resolution result into a second-stage super-resolution network for super-resolution processing, and so on, until a last-stage reconstruction result is output through the last-stage reconstruction layer;
and determining the last-stage reconstruction result as a predicted super-resolution video frame of the sample video frame.
In an exemplary embodiment, the training the image processing model to be trained according to the predicted super-resolution video frame of the sample video frame and the target super-resolution video frame of the sample video frame to obtain a trained image processing model includes:
obtaining a loss value corresponding to each sample video frame according to the predicted super-resolution video frame and the target super-resolution video frame of each sample video frame;
obtaining an average value of the loss values as a target loss value;
and training the image processing model to be trained according to the target loss value to obtain the trained image processing model.
In an exemplary embodiment, the compressed sample video frame sequences include N1 compressed first sample video frame sequences each including M1 sample video frames and N2 compressed second sample video frame sequences each including M2 sample video frames; N1, N2, M1 and M2 are positive integers, where N1 is greater than N2 and M1 is less than M2;
the training the image processing model to be trained according to the predicted super-resolution video frame of the sample video frame and the target super-resolution video frame of the sample video frame to obtain the trained image processing model comprises the following steps:
training the image processing model to be trained according to the predicted super-resolution video frame of the sample video frame and the target super-resolution video frame of the sample video frame in each first sample video frame sequence to obtain an initial image processing model;
and retraining the initial image processing model according to the predicted super-resolution video frame of the sample video frame in each second sample video frame sequence and the target super-resolution video frame of the sample video frame, to obtain a trained image processing model.
In an exemplary embodiment, the training the image processing model to be trained according to the predicted super-resolution video frame of the sample video frame and the target super-resolution video frame of the sample video frame in each first sample video frame sequence to obtain an initial image processing model includes:
for each first sample video frame sequence, training the image processing model to be trained according to a predicted super-resolution video frame and a target super-resolution video frame of a part of sample video frames in the first sample video frame sequence and an attenuation mode of a model parameter updating speed represented by a preset cosine attenuation strategy until a first preset time is reached;
and according to the predicted super-resolution video frames and the target super-resolution video frames of all the sample video frames in the first sample video frame sequence and the attenuation mode of the model parameter updating speed represented by the preset cosine attenuation strategy, retraining the trained image processing model reaching the first preset time until a second preset time is reached, and determining the trained image processing model reaching the second preset time as the initial image processing model.
According to a second aspect of the embodiments of the present disclosure, there is provided an image processing method including:
acquiring a compressed video frame sequence;
for each video frame other than the first video frame in the expanded video frame sequence, inputting the video frame, the previous video frame of the video frame, and a target video frame in the expanded video frame sequence into a trained image processing model, to obtain a super-resolution video frame of the video frame;
wherein the expanded video frame sequence is obtained by pasting a second video frame in the compressed video frame sequence to the front of a first video frame in the compressed video frame sequence; the target video frame is a video frame of which the corresponding video frame type in the expanded video frame sequence is a preset video frame type, and the trained image processing model is obtained by training according to the training method of the image processing model in any embodiment of the first aspect.
In an exemplary embodiment, the inputting the video frame, the previous video frame of the video frame, and the target video frame in the extended video frame sequence into the trained image processing model to obtain the super-resolution video frame of the video frame includes:
splicing the video frame, the last video frame of the video frame and a target video frame in the expanded video frame sequence to obtain a spliced video frame corresponding to the video frame;
performing deformable convolution processing on the spliced video frame corresponding to the video frame to obtain the image characteristics of the video frame;
and inputting the image characteristics of the video frame into the trained image processing model to obtain the super-resolution video frame of the video frame.
In an exemplary embodiment, the inputting the image features of the video frame into the trained image processing model to obtain the super-resolution video frame of the video frame includes:
inputting the image characteristics of the video frame into a trained image processing model, performing super-resolution processing on the image characteristics of the video frame through a first-stage super-resolution network in the trained image processing model, inputting the obtained first-stage super-resolution result into a first-stage reconstruction layer for reconstruction processing, inputting the obtained first-stage reconstruction result together with the first-stage super-resolution result into a second-stage super-resolution network for super-resolution processing, and so on, until a last-stage reconstruction result is output through the last-stage reconstruction layer;
and determining the last-stage reconstruction result as a super-resolution video frame of the video frame.
According to a third aspect of the embodiments of the present disclosure, there is provided a training apparatus for an image processing model, including:
a sample acquisition unit configured to perform acquisition of a compressed sample video frame sequence;
the sample processing unit is configured to, for each sample video frame other than the first sample video frame in the extended sample video frame sequence, input the sample video frame, the previous sample video frame of the sample video frame, and the target sample video frame in the extended sample video frame sequence into an image processing model to be trained, to obtain a predicted super-resolution video frame of the sample video frame; the expanded sample video frame sequence is obtained by pasting the second sample video frame in the compressed sample video frame sequence in front of the first sample video frame in the compressed sample video frame sequence; the target sample video frame is a sample video frame whose video frame type in the expanded sample video frame sequence is a preset video frame type;
the model training unit is configured to train the image processing model to be trained according to the predicted super-resolution video frame of the sample video frame and the target super-resolution video frame of the sample video frame, to obtain a trained image processing model;
the image processing model to be trained comprises a plurality of stages of convolution networks, wherein each stage of convolution network comprises a reconstruction layer and a super-resolution network; and the reconstruction layer and the super-resolution network in each level of convolutional network except the last level of convolutional network are connected with the next level of super-resolution network.
In an exemplary embodiment, the sample processing unit is further configured to perform a splicing process on the sample video frame, a previous sample video frame of the sample video frame, and a target sample video frame in the extended sample video frame sequence, so as to obtain a spliced video frame corresponding to the sample video frame; performing deformable convolution processing on the spliced video frame corresponding to the sample video frame to obtain the image characteristics of the sample video frame; and inputting the image characteristics of the sample video frame into an image processing model to be trained to obtain a predicted super-resolution video frame of the sample video frame.
In an exemplary embodiment, the sample processing unit is further configured to perform super-resolution processing on the image features of the sample video frame by using a first-level super-resolution network in the image processing model to be trained, input an obtained first-level super-resolution result into a first-level reconstruction layer for reconstruction processing, and input the obtained first-level reconstruction result and the first-level super-resolution result into a second-level super-resolution network for super-resolution processing until a last-level reconstruction result is output by a last-level reconstruction layer; and determining the last-stage reconstruction result as a predicted super-resolution video frame of the sample video frame.
In an exemplary embodiment, the model training unit is further configured to obtain a loss value corresponding to each sample video frame according to the predicted super-resolution video frame and the target super-resolution video frame of each sample video frame; obtain an average value of the loss values as a target loss value; and train the image processing model to be trained according to the target loss value to obtain the trained image processing model.
In an exemplary embodiment, the compressed sample video frame sequence includes N1 compressed first sample video frame sequences each including M1 sample video frames and N2 compressed second sample video frame sequences each including M2 sample video frames; n1, N2, M1 and M2 are positive integers, N1 is greater than N2, and M1 is less than M2;
the model training unit is further configured to train the image processing model to be trained according to the predicted super-resolution video frame of the sample video frame in each first sample video frame sequence and the target super-resolution video frame of the sample video frame, to obtain an initial image processing model; and retrain the initial image processing model according to the predicted super-resolution video frame of the sample video frame in each second sample video frame sequence and the target super-resolution video frame of the sample video frame, to obtain a trained image processing model.
In an exemplary embodiment, the model training unit is further configured to, for each first sample video frame sequence, train the image processing model to be trained according to the predicted super-resolution video frames and target super-resolution video frames of a part of the sample video frames in the first sample video frame sequence and the attenuation mode of the model parameter update speed represented by the preset cosine attenuation policy, until a first preset time is reached; and according to the predicted super-resolution video frames and the target super-resolution video frames of all the sample video frames in the first sample video frame sequence and the attenuation mode of the model parameter update speed represented by the preset cosine attenuation policy, retrain the trained image processing model reaching the first preset time until a second preset time is reached, and determine the trained image processing model reaching the second preset time as the initial image processing model.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an image processing apparatus comprising:
a video frame acquisition unit configured to perform acquisition of a compressed video frame sequence;
the video frame processing unit is configured to, for each video frame other than the first video frame in the expanded video frame sequence, input the video frame, the previous video frame of the video frame, and the target video frame in the expanded video frame sequence into a trained image processing model, to obtain a super-resolution video frame of the video frame;
wherein the expanded video frame sequence is obtained by pasting a second video frame in the compressed video frame sequence to the front of a first video frame in the compressed video frame sequence; the target video frame is a video frame of which the corresponding video frame type in the expanded video frame sequence is a preset video frame type, and the trained image processing model is obtained by training according to the training method of the image processing model in any embodiment of the first aspect.
In an exemplary embodiment, the video frame processing unit is further configured to perform a splicing process on the video frame, a previous video frame of the video frame, and a target video frame in the expanded video frame sequence, so as to obtain a spliced video frame corresponding to the video frame; performing deformable convolution processing on the spliced video frame corresponding to the video frame to obtain the image characteristics of the video frame; and inputting the image characteristics of the video frame into the trained image processing model to obtain the super-resolution video frame of the video frame.
In an exemplary embodiment, the video frame processing unit is further configured to input the image features of the video frame into the trained image processing model, perform super-resolution processing on the image features of the video frame through a first-level super-resolution network in the trained image processing model, input the obtained first-level super-resolution result into a first-level reconstruction layer for reconstruction processing, and input the obtained first-level reconstruction result and the first-level super-resolution result into a second-level super-resolution network for super-resolution processing, and so on, until a last-level reconstruction result is output through the last-level reconstruction layer; and determine the last-level reconstruction result as a super-resolution video frame of the video frame.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement a method of training an image processing model as described in any embodiment of the first aspect or a method of image processing as described in any embodiment of the second aspect.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions, when executed by a processor of an electronic device, enable the electronic device to perform a training method of an image processing model as described in any one of the embodiments of the first aspect, or an image processing method as described in any one of the embodiments of the second aspect.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product, which includes instructions that, when executed by a processor of an electronic device, enable the electronic device to perform a training method of an image processing model as described in any embodiment of the first aspect, or an image processing method as described in any embodiment of the second aspect.
The technical scheme provided by the embodiments of the present disclosure brings at least the following beneficial effects:
A compressed sample video frame sequence is obtained; then, for each sample video frame other than the first sample video frame in the extended sample video frame sequence, the sample video frame, the previous sample video frame of the sample video frame, and a target sample video frame in the extended sample video frame sequence are input into an image processing model to be trained, to obtain a predicted super-resolution video frame of the sample video frame, where the extended sample video frame sequence is obtained by pasting the second sample video frame in the compressed sample video frame sequence in front of the first sample video frame in the compressed sample video frame sequence, and the target sample video frame is a sample video frame whose video frame type in the extended sample video frame sequence is a preset video frame type; finally, the image processing model to be trained is trained according to the predicted super-resolution video frame of the sample video frame and the target super-resolution video frame of the sample video frame, to obtain a trained image processing model. The image processing model to be trained comprises a multi-stage convolutional network, each stage of which comprises a reconstruction layer and a super-resolution network, and in every stage except the last stage both the reconstruction layer and the super-resolution network are connected to the next-stage super-resolution network. Therefore, when the image processing model is trained, because the reconstruction layer and the super-resolution network in every stage except the last stage are connected to the next-stage super-resolution network, the back-propagated gradients affect both the reconstruction layer and the preceding super-resolution network, which ensures continuous information propagation and makes breaks in propagation unlikely; this avoids the situation where vanishing gradients during model training make the super-resolution result output by the final network stage worse instead of better, ensures the stability of model training, and improves the super-resolution effect of the trained image processing model on compressed video frames. Meanwhile, when each sample video frame other than the first sample video frame in the extended sample video frame sequence is processed, the target sample video frame (whose video frame type in the extended sample video frame sequence is the preset video frame type) and the previous sample video frame of the sample video frame are fully used to assist the reconstruction of the sample video frame, so that the structural information of the video itself is fully exploited and the extracted feature information relates to the picture content rather than to noise, which further improves the super-resolution effect of the trained image processing model on compressed video frames.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a flow diagram illustrating a method of training an image processing model according to an exemplary embodiment.
Fig. 2 is a schematic diagram of a sample video frame sequence shown in accordance with an example embodiment.
FIG. 3 is a block diagram illustrating a multi-stage cyclic convolutional network according to an exemplary embodiment.
FIG. 4 is a block diagram illustrating a three-stage circular convolution network in accordance with an exemplary embodiment.
Fig. 5 is a flowchart illustrating steps for deriving a predicted super-resolution video frame according to an exemplary embodiment.
FIG. 6 is a flow diagram illustrating a deformable convolution process in accordance with an exemplary embodiment.
FIG. 7 is a flowchart illustrating an image processing method according to an exemplary embodiment.
FIG. 8 is a block diagram illustrating an apparatus for training an image processing model according to an exemplary embodiment.
Fig. 9 is a block diagram illustrating an image processing apparatus according to an exemplary embodiment.
FIG. 10 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It should also be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for presentation, analyzed data, etc.) referred to in the present disclosure are both information and data that are authorized by the user or sufficiently authorized by various parties.
FIG. 1 is a flowchart illustrating a training method of an image processing model according to an exemplary embodiment. As shown in FIG. 1, the method is used in a terminal; it is understood that the method can also be applied to a server, or to a system comprising a terminal and a server, implemented through interaction between the terminal and the server. In the present exemplary embodiment, the method includes the following steps:
in step S110, a compressed sample video frame sequence is obtained.
There are a plurality of compressed sample video frame sequences, and each compressed sample video frame sequence includes a plurality of compressed sample video frames. A sample video frame refers to a video frame in a sample video.
Each compressed sample video frame is paired with a corresponding target super-resolution video frame, which is obtained by manual annotation. For example, a video frame is extracted from a captured high-definition video and used as the target super-resolution video frame, and that video frame is downsampled to obtain a low-definition video frame used as the sample video frame.
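The following is a minimal sketch of how such a sample/target pair can be constructed (assuming PyTorch and bicubic downsampling; the function name, scale factor, and downsampling operator are illustrative assumptions, not values given in the disclosure):

```python
import torch
import torch.nn.functional as F

def make_training_pair(hd_frame: torch.Tensor, scale: int = 4):
    """Build a (sample, target) pair from one high-definition frame.

    hd_frame: tensor of shape (C, H, W). The HD frame itself is used as the
    target super-resolution video frame; a downsampled copy is used as the
    low-definition sample video frame.
    """
    target = hd_frame
    sample = F.interpolate(hd_frame.unsqueeze(0), scale_factor=1.0 / scale,
                           mode="bicubic", align_corners=False).squeeze(0)
    return sample, target
```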
The compressed sample video frame sequence specifically refers to a sample video frame sequence encoded and compressed with LDB (Low Delay B) coding under H.265. It should be noted that, under this coding, the first frame is an I frame and the remaining frames are P frames; different frames contain different amounts of information. For example, I frames contain more texture information (i.e., the best picture quality) and less noise, whereas P frames have lower quality because they are compressed more heavily. Therefore, when multi-frame fusion is performed, the I frame can be used to assist the reconstruction of the other frames. Introducing the I frame has the following two benefits: (1) the network can learn how to reference information from good frames, making the fusion process more efficient; (2) when a multi-stage cyclic convolutional network is adopted, the introduced I frame guides propagation, so that accumulation of noise errors is avoided.
Specifically, the terminal acquires the compressed sample video frame sequence and the target super-resolution video frame corresponding to each sample video frame in the sample video frame sequence from the local database, so that the image processing model to be trained is conveniently trained subsequently according to the compressed sample video frame sequence and the target super-resolution video frame corresponding to each sample video frame in the sample video frame sequence.
In step S120, for each sample video frame in the extended sample video frame sequence except for the first sample video frame, the sample video frame, the previous sample video frame of the sample video frame, and the target sample video frame in the extended sample video frame sequence are input into the image processing model to be trained, so as to obtain the predicted super-resolution video frame of the sample video frame.
The expanded sample video frame sequence is obtained by pasting a second sample video frame in the compressed sample video frame sequence to the front of a first sample video frame in the compressed sample video frame sequence; the target sample video frame is a sample video frame of which the corresponding video frame type in the expanded sample video frame sequence is a preset video frame type.
The second sample video frame in the compressed sample video frame sequence is pasted in front of the first sample video frame in the compressed sample video frame sequence to obtain the extended sample video frame sequence; in the extended sample video frame sequence, the previous sample video frame of the second sample video frame (i.e., the original first sample video frame) is therefore the second sample video frame of the compressed sample video frame sequence. For example, to avoid errors caused by insufficient information accumulation for the first sample video frame in the compressed sample video frame sequence, the extended sample video frame sequence may be constructed in the manner shown in fig. 2: the compressed sample video frame sequence includes 6 sample video frames, the 2nd sample video frame is pasted in front of the 1st sample video frame, and the original sequence of 6 sample video frames is extended to a sequence of 7 sample video frames, which is the extended sample video frame sequence. It should be noted that, when calculating the loss function, the super-resolution results of the 6 sample video frames 1, 2, 3, 4, 5, 6 in the compressed sample video frame sequence are still used to calculate the loss values.
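A minimal sketch of this sequence extension (a plain Python list operation; the function and variable names are illustrative only):

```python
def extend_sequence(compressed_frames):
    """Paste the 2nd frame of the compressed sequence in front of its 1st frame.

    For a compressed sequence [f1, f2, f3, f4, f5, f6], the returned extended
    sequence is [f2, f1, f2, f3, f4, f5, f6] (7 frames). When computing the
    loss, only the super-resolution results of the original frames f1..f6
    (i.e. extended[1:]) are used.
    """
    assert len(compressed_frames) >= 2
    return [compressed_frames[1]] + list(compressed_frames)
```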
The preset video frame type is the I frame, and the target sample video frame is the sample video frame whose video frame type in the extended sample video frame sequence is the I frame. For example, for a sample video frame sequence compressed with LDB coding under H.265, the first sample video frame is an I frame and the remaining sample video frames are all P frames; accordingly, in the corresponding extended sample video frame sequence, the second sample video frame is an I frame and the remaining sample video frames are all P frames. Referring to fig. 2, for a sample video frame sequence compressed with LDB coding under H.265, assuming that the compressed sample video frame sequence includes 6 sample video frames, the 1st sample video frame is an I frame and the 2nd-6th sample video frames are all P frames; in the extended sample video frame sequence, the 2nd sample video frame is an I frame and the 1st and 3rd-7th sample video frames are all P frames. The image processing model to be trained is a super-resolution model to be trained; specifically, it is a multi-stage cyclic convolutional network as shown in fig. 3, which includes multiple stages of convolutional networks, each stage including a reconstruction layer and a super-resolution network, and in every stage except the last stage both the reconstruction layer and the super-resolution network are connected to the next-stage super-resolution network. For example, if the image processing model to be trained is a 3-stage cyclic convolutional network, the network structure is as shown in fig. 4. In this way, the features propagated from the reconstruction layer serve as input to the next-stage super-resolution network, so that the gradients back-propagated through the backbone network affect both the reconstruction layer and the preceding super-resolution network, ensuring continuous information propagation.
In the conventional multi-stage cyclic convolutional network, there is no connection between the reconstruction layer at the current stage and the super-resolution network at the next stage.
The predicted super-resolution video frame of the sample video frame refers to a super-resolution result output by a last reconstruction layer in the image processing model to be trained, such as a reconstruction result output by the last reconstruction layer.
Specifically, the terminal identifies the video frame type corresponding to each sample video frame in the extended sample video frame sequence, and selects from the extended sample video frame sequence the sample video frame whose video frame type is the preset video frame type as the target sample video frame in the extended sample video frame sequence. For each sample video frame other than the first sample video frame in the extended sample video frame sequence, the sample video frame, the previous sample video frame of the sample video frame, and the target sample video frame in the extended sample video frame sequence are spliced in the feature dimension and input into a deformable convolutional network for feature extraction; the extracted feature information is then input into the image processing model to be trained, and the predicted super-resolution video frame of the sample video frame is output through the image processing model to be trained.
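The per-frame processing described above can be sketched as follows (assuming PyTorch; `extract_features` and `model` are placeholders for the deformable convolutional network and the image processing model to be trained, and the frame types are assumed to be known for each frame):

```python
import torch

def predict_sequence(extended_frames, frame_types, extract_features, model):
    """Predict a super-resolution frame for every extended frame except the first.

    extended_frames: list of (C, H, W) tensors for the extended sequence.
    frame_types: list of strings ("I" or "P"), one per extended frame.
    extract_features: the deformable convolutional network (feature extractor).
    model: the image processing model to be trained.
    """
    # The target sample video frame is the frame whose type is the preset type (I frame).
    target_frame = extended_frames[frame_types.index("I")]

    predictions = []
    for t in range(1, len(extended_frames)):
        current, previous = extended_frames[t], extended_frames[t - 1]
        # Splice the current frame, its previous frame and the target frame
        # along the feature (channel) dimension.
        spliced = torch.cat([current, previous, target_frame], dim=0)
        features = extract_features(spliced.unsqueeze(0))  # add a batch dimension
        predictions.append(model(features))
    return predictions
```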
In step S130, the image processing model to be trained is trained according to the predicted super-resolution video frame of the sample video frame and the target super-resolution video frame of the sample video frame, so as to obtain a trained image processing model.
Specifically, the terminal acquires a target super-resolution video frame of a sample video frame, and trains an image processing model to be trained according to a loss value between a predicted super-resolution video frame of the sample video frame and the target super-resolution video frame of the sample video frame until a preset ending condition is met; and taking the trained image processing model meeting the preset ending condition as the trained image processing model.
It should be noted that the preset ending condition is that the current number of training iterations reaches a preset number of iterations, or that the current loss value is smaller than a preset threshold.
In the above training method of the image processing model, because the reconstruction layer and the super-resolution network in every stage of the convolutional network except the last stage are connected to the next-stage super-resolution network, the gradients transmitted during back-propagation affect both the reconstruction layer and the preceding super-resolution network, ensuring continuous information propagation and making breaks in propagation unlikely; this avoids the situation where vanishing gradients during model training make the super-resolution result output by the final network stage worse instead of better, ensures the stability of model training, and improves the super-resolution effect of the trained image processing model on compressed video frames. Meanwhile, when each sample video frame other than the first sample video frame in the extended sample video frame sequence is processed, the target sample video frame (whose video frame type in the extended sample video frame sequence is the preset video frame type) and the previous sample video frame of the sample video frame are fully used to assist the reconstruction of the sample video frame, so that the structural information of the video itself is fully exploited and the extracted feature information relates to the picture content rather than to noise, which further improves the super-resolution effect of the trained image processing model on compressed video frames.
In an exemplary embodiment, as shown in fig. 5, in step S120, the sample video frame, the previous sample video frame of the sample video frame, and the target sample video frame in the extended sample video frame sequence are input into the image processing model to be trained, so as to obtain the predicted super-resolution video frame of the sample video frame, which may specifically be implemented by the following steps:
in step S510, the sample video frame, the previous sample video frame of the sample video frame, and the target sample video frame in the extended sample video frame sequence are spliced to obtain a spliced video frame corresponding to the sample video frame.
The splicing processing refers to splicing processing performed on a characteristic dimension.
In step S520, the spliced video frame corresponding to the sample video frame is subjected to a deformable convolution process to obtain an image feature of the sample video frame.
In a deformable convolution, an additional offset (direction) parameter is learned for each element of the convolution kernel, so that the sampling positions of the kernel can spread over a larger range during training; this compensates for the target sample video frame being far from the other sample video frames in the extended sample video frame sequence.
It should be noted that the deformable convolution is different from the normal convolution.
In step S530, the image features of the sample video frame are input into the image processing model to be trained, so as to obtain the predicted super-resolution video frame of the sample video frame.
Specifically, for each sample video frame other than the first sample video frame in the extended sample video frame sequence, the terminal splices the sample video frame, the previous sample video frame of the sample video frame, and the target sample video frame in the extended sample video frame sequence in the feature dimension to obtain a spliced video frame corresponding to the sample video frame; inputs the spliced video frame corresponding to the sample video frame into a deformable convolutional network, and performs deformable convolution processing on the spliced video frame through the deformable convolutional network to obtain the image features of the sample video frame; and inputs the image features of the sample video frame into the image processing model to be trained, where super-resolution processing and reconstruction processing are performed through each stage of super-resolution network and reconstruction layer in the image processing model to be trained, to obtain the predicted super-resolution video frame of the sample video frame.
For example, referring to fig. 6, for a sample video frame sequence compressed with LDB coding under H.265, assume the 4th sample video frame is the current sample video frame; the terminal splices the 1st, 3rd and 4th sample video frames in the feature dimension and inputs them into the deformable convolutional network for feature extraction. Because the I frame (the 1st sample video frame) has the least noise and the best picture quality, using the picture quality of the I frame to guide feature extraction for the 3rd and 4th frames makes the extracted features more relevant to the picture content and less relevant to the noise. After this initial multi-frame fusion, the features obtained by the deformable convolution are input into the super-resolution network in the image processing model to be trained for further feature extraction, finally obtaining the predicted super-resolution video frame of the sample video frame.
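One possible realization of this concatenation-plus-deformable-convolution fusion, sketched with torchvision's DeformConv2d (the channel widths and kernel size are illustrative assumptions, not values given in the disclosure):

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformFusion(nn.Module):
    """Fuse the spliced frames (current + previous + I frame) with a deformable convolution.

    For three RGB frames concatenated on the channel axis, in_channels = 3 * 3 = 9.
    """
    def __init__(self, in_channels: int = 9, out_channels: int = 64, k: int = 3):
        super().__init__()
        # A regular convolution predicts the 2*k*k sampling offsets per position.
        self.offset_conv = nn.Conv2d(in_channels, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform_conv = DeformConv2d(in_channels, out_channels, kernel_size=k, padding=k // 2)

    def forward(self, spliced: torch.Tensor) -> torch.Tensor:
        # spliced: (N, 9, H, W) -> image features: (N, 64, H, W)
        offsets = self.offset_conv(spliced)
        return self.deform_conv(spliced, offsets)
```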
According to the technical scheme provided by this embodiment of the present disclosure, the sample video frame, the previous sample video frame of the sample video frame, and the target sample video frame in the extended sample video frame sequence are spliced in the feature dimension and then subjected to deformable convolution processing, and the obtained image features are input into the image processing model to be trained to obtain the predicted super-resolution video frame of the sample video frame. In this way, when each sample video frame other than the first sample video frame in the extended sample video frame sequence is processed, the target sample video frame (whose video frame type in the extended sample video frame sequence is the preset video frame type) and the previous sample video frame of the sample video frame are fully used to assist the reconstruction of the sample video frame, so that the structural information of the video itself is fully exploited, the extracted feature information relates to the picture content rather than to noise, and the obtained super-resolution video frame has more detail and less noise, further improving the super-resolution effect of the trained image processing model on compressed video frames.
In an exemplary embodiment, in step S530, the image features of the sample video frame are input into the image processing model to be trained to obtain the predicted super-resolution video frame of the sample video frame, which specifically includes the following: inputting the image features of the sample video frame into the image processing model to be trained, performing super-resolution processing on the image features of the sample video frame through a first-stage super-resolution network in the image processing model to be trained, inputting the obtained first-stage super-resolution result into a first-stage reconstruction layer for reconstruction processing, inputting the obtained first-stage reconstruction result together with the first-stage super-resolution result into a second-stage super-resolution network for super-resolution processing, and so on, until a last-stage reconstruction result is output through the last-stage reconstruction layer; and determining the last-stage reconstruction result as the predicted super-resolution video frame of the sample video frame.
The last-stage reconstruction result is the final super-resolution result output by the image processing model to be trained.
Specifically, referring to fig. 3, the terminal inputs image features of a sample video frame into an image processing model to be trained, performs super-resolution processing on the image features of the sample video frame through a first-stage super-resolution network in the image processing model to be trained, inputs an obtained first-stage super-resolution result into a first-stage reconstruction layer for reconstruction processing, inputs an obtained first-stage reconstruction result and the obtained first-stage super-resolution result into a second-stage super-resolution network for super-resolution processing, inputs an obtained second-stage super-resolution result into a second-stage reconstruction layer for reconstruction processing, inputs an obtained second-stage reconstruction result and the obtained second-stage super-resolution result into a third-stage super-resolution network for super-resolution processing, inputs an obtained third-stage reconstruction result and the obtained third-stage super-resolution result into a fourth-stage super-resolution network for super-resolution processing, and so on until a last-stage reconstruction result is output through a last-stage reconstruction layer; and determining the last-stage reconstruction result (namely the final output super-resolution result) as the predicted super-resolution video frame of the sample video frame.
For example, referring to fig. 4, a terminal inputs image features of a sample video frame into an image processing model to be trained, performs super-resolution processing on the image features of the sample video frame through a first-level super-resolution network in the image processing model to be trained, inputs an obtained first-level super-resolution result into a first-level reconstruction layer for reconstruction processing, inputs an obtained first-level reconstruction result and the obtained first-level super-resolution result into a second-level super-resolution network for super-resolution processing, inputs an obtained second-level super-resolution result into a second-level reconstruction layer for reconstruction processing, inputs the obtained second-level reconstruction result and the obtained second-level super-resolution result into a third-level super-resolution network for super-resolution processing, inputs the obtained third-level super-resolution result into a third-level reconstruction layer for reconstruction processing, and uses the obtained third-level reconstruction result as a predicted super-resolution video frame of the sample video frame.
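A structural sketch of such a multi-stage network in PyTorch is given below; the super-resolution blocks and reconstruction layers are simplified placeholders (plain convolutions), since the disclosure does not specify their internal composition, channel widths, or the upsampling operator:

```python
import torch
import torch.nn as nn

class MultiStageSRModel(nn.Module):
    """Each stage holds a super-resolution network and a reconstruction layer;
    in every stage except the last, both the reconstruction result and the
    super-resolution result are fed to the next stage's super-resolution network."""

    def __init__(self, num_stages: int = 3, channels: int = 64):
        super().__init__()
        self.sr_nets = nn.ModuleList()
        self.recon_layers = nn.ModuleList()
        for i in range(num_stages):
            # Later stages receive the previous reconstruction result and
            # super-resolution result concatenated together.
            in_ch = channels if i == 0 else 2 * channels
            self.sr_nets.append(nn.Sequential(
                nn.Conv2d(in_ch, channels, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1)))
            # In practice the reconstruction layer would also upsample to the
            # super-resolution image; a plain convolution stands in for it here.
            self.recon_layers.append(nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        sr = self.sr_nets[0](features)
        recon = self.recon_layers[0](sr)
        for sr_net, recon_layer in zip(self.sr_nets[1:], self.recon_layers[1:]):
            sr = sr_net(torch.cat([recon, sr], dim=1))  # both results feed forward
            recon = recon_layer(sr)
        return recon  # the last-stage reconstruction result is the prediction
```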
According to the technical scheme, the image characteristics of the sample video frame are input into the image processing model to be trained, super-resolution processing and reconstruction processing are carried out through the multistage super-resolution network and the multistage reconstruction layer in the image processing model to be trained, and the super-resolution result and the reconstruction result of each stage are input into the next stage of super-resolution network, so that the image quality of the obtained predicted super-resolution video frame is improved, the obtained predicted super-resolution video frame does not contain noise, and the denoising effect is achieved.
In an exemplary embodiment, in step S130, the image processing model to be trained is trained according to the predicted super-resolution video frame of the sample video frame and the target super-resolution video frame of the sample video frame, so as to obtain a trained image processing model, which specifically includes the following contents: obtaining a loss value corresponding to each sample video frame according to the predicted super-resolution video frame and the target super-resolution video frame of each sample video frame; obtaining an average value of the loss values as a target loss value; and training the image processing model to be trained according to the target loss value to obtain the trained image processing model.
The target loss value is the average of all calculated loss values; for example, if the calculated loss values are L1, L2, L3, L4, and L5, the target loss value is (L1+L2+L3+L4+L5)/5.
Specifically, the terminal obtains a loss value corresponding to each sample video frame according to a difference value between the predicted super-resolution video frame and the target super-resolution video frame of each sample video frame; adding the loss values corresponding to each sample video frame, and calculating an average value to obtain a target loss value; adjusting model parameters of the image processing model to be trained according to the target loss value to obtain the image processing model after model parameter adjustment; and repeating the process, and training the image processing model after the model parameters are adjusted again until the training end condition is reached, wherein the image processing model after the training which reaches the training end condition is used as the image processing model after the training is finished.
For example, when the target loss value is not smaller than the preset threshold, the terminal adjusts the model parameters of the image processing model to be trained according to the target loss value and trains the adjusted image processing model again, until the target loss value obtained with the trained image processing model is smaller than the preset threshold; the image processing model trained at that point is then taken as the trained image processing model.
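A minimal sketch of one training step along these lines (assuming PyTorch; an L1 per-frame loss is an assumption, since the disclosure does not fix the specific per-frame loss function):

```python
import torch
import torch.nn.functional as F

def training_step(optimizer, predictions, targets):
    """One parameter update: average the per-frame losses into the target loss.

    predictions / targets: lists of predicted and target super-resolution video
    frames for the sample video frames of one sequence.
    """
    frame_losses = [F.l1_loss(p, t) for p, t in zip(predictions, targets)]
    target_loss = torch.stack(frame_losses).mean()  # average of the per-frame loss values
    optimizer.zero_grad()
    target_loss.backward()
    optimizer.step()
    return target_loss.item()
```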
According to the technical scheme provided by the embodiment of the disclosure, the image processing model to be trained is trained for multiple times according to the predicted super-resolution video frame of the sample video frame and the target super-resolution video frame of the sample video frame to obtain the trained image processing model, so that the image effect of the super-resolution video frame obtained through the trained image processing model is improved, the image details in the obtained super-resolution video frame are more, and the super-resolution effect of the model on the compressed video frame is improved.
In an exemplary embodiment, the compressed sample video frame sequence includes N1 compressed first sample video frame sequences each including M1 sample video frames and N2 compressed second sample video frame sequences each including M2 sample video frames; n1, N2, M1 and M2 are positive integers, N1 is larger than N2, and M1 is smaller than M2. Then, in step S130, training the image processing model to be trained according to the predicted super-resolution video frame of the sample video frame and the target super-resolution video frame of the sample video frame, so as to obtain a trained image processing model, which specifically includes the following contents: training an image processing model to be trained according to the predicted super-resolution video frame of the sample video frame and the target super-resolution video frame of the sample video frame in each first sample video frame sequence to obtain an initial image processing model; and according to the predicted super-resolution video frame of the sample video frame in each second sample video frame sequence and the target super-resolution video frame of the sample video frame, the initial image processing model is trained again to obtain a trained image processing model.
The N1 compressed first sample video frame sequences form a first sample data set, and the N2 compressed second sample video frame sequences form a second sample data set; the format of each sample data set is (number of samples, number of frames, width, height, channels); the format of the first sample data set is (N1, M1, W, H, 3), such as (8, 10, W, H, 3), and the format of the second sample data set is (N2, M2, W, H, 3), such as (4, 20, W, H, 3). It should be noted that the number of samples corresponds to the batch size.
Specifically, the terminal calculates a first loss value according to the difference between the predicted super-resolution video frames and the target super-resolution video frames of the sample video frames in each first sample video frame sequence, and trains the image processing model to be trained according to the first loss value until a first preset number of training iterations is reached; the trained image processing model that reaches the first preset number of training iterations is taken as the initial image processing model. The terminal then calculates a second loss value according to the difference between the predicted super-resolution video frames and the target super-resolution video frames of the sample video frames in each second sample video frame sequence, trains the initial image processing model again according to the second loss value until a second preset number of training iterations is reached, and takes the trained image processing model that reaches the second preset number of training iterations as the trained image processing model.
For example, for sample data organized as 4 × 20 × W × H × 3, the terminal first trains the image processing model to be trained with a first sample data set of 8 × 10 × W × H × 3, and then trains the resulting model again with a second sample data set of 4 × 20 × W × H × 3 to obtain the trained image processing model.
According to the technical scheme provided by the embodiment of the disclosure, the image processing model to be trained is first trained according to the sample video frames in each first sample video frame sequence, and the resulting model is then trained again with the sample video frames in each second sample video frame sequence to obtain the trained image processing model. In this way, the model is first trained with a larger number of samples and fewer sample video frames, and then with a smaller number of samples and more sample video frames, which ensures the stability of the model during training and improves the image processing effect of the model on compressed video frames.
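A possible sketch of this two-stage schedule is shown below; train_one_epoch, the optimizer and the epoch counts are hypothetical placeholders, not the concrete values used in the disclosure.

```python
def two_stage_training(model, optimizer, first_sample_set, second_sample_set,
                       train_one_epoch, first_epochs=100, second_epochs=50):
    """train_one_epoch is a hypothetical helper that runs one pass over a sample set."""
    # Stage 1: more sequences, fewer frames each (e.g. 8 x 10 x W x H x 3).
    for _ in range(first_epochs):
        train_one_epoch(model, optimizer, first_sample_set)
    # At this point the model corresponds to the "initial image processing model".
    # Stage 2: fewer sequences, more frames each (e.g. 4 x 20 x W x H x 3).
    for _ in range(second_epochs):
        train_one_epoch(model, optimizer, second_sample_set)
    return model  # trained image processing model
```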
In an exemplary embodiment, training the image processing model to be trained according to the predicted super-resolution video frame and the target super-resolution video frame of the sample video frames in each first sample video frame sequence to obtain an initial image processing model specifically includes the following contents: for each first sample video frame sequence, training the image processing model to be trained according to the predicted super-resolution video frames and the target super-resolution video frames of a part of the sample video frames in the first sample video frame sequence and the attenuation mode of the model parameter update speed represented by a preset cosine attenuation strategy, until a first preset time is reached; training the trained image processing model that reaches the first preset time again according to the predicted super-resolution video frames and the target super-resolution video frames of all the sample video frames in the first sample video frame sequence and the attenuation mode of the model parameter update speed represented by the preset cosine attenuation strategy, until a second preset time is reached; and determining the trained image processing model that reaches the second preset time as the initial image processing model.
The first preset time is less than the second preset time.
The preset cosine attenuation strategy is used for describing an attenuation mode of a learning rate, and the learning rate is used for expressing the updating speed of the model parameters.
Specifically, for each first sample video frame sequence, the terminal calculates a third loss value according to the difference between the predicted super-resolution video frames and the target super-resolution video frames of a part of the sample video frames in the first sample video frame sequence; adjusts the model parameters of the image processing model to be trained according to the third loss value and the attenuation mode of the model parameter update speed represented by the preset cosine attenuation strategy, and trains the image processing model with adjusted parameters again until the first preset time is reached; then calculates a fourth loss value according to the difference between the predicted super-resolution video frames and the target super-resolution video frames of all the sample video frames in the first sample video frame sequence; adjusts the model parameters of the trained image processing model that reaches the first preset time according to the fourth loss value and the attenuation mode of the model parameter update speed represented by the preset cosine attenuation strategy, and trains the adjusted image processing model again until the second preset time is reached; and determines the trained image processing model that reaches the second preset time as the initial image processing model.
For example, assuming that a sample video frame sequence includes 20 sample video frames, the terminal first trains the image processing model to be trained with 5 sample video frames for a short period of time, then trains the resulting model with 10 sample video frames for a further period of time, and finally trains it with the whole sample video frame sequence to obtain the initial image processing model.
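Under the assumption that a training helper can run updates on truncated sequences, the progressive lengthening described above might be organized as in the following sketch; the frame counts and iteration counts are illustrative only.

```python
def progressive_training(model, optimizer, sequences, train_iteration,
                         schedule=((5, 2000), (10, 2000), (None, 6000))):
    """train_iteration is a hypothetical helper that runs one update on the given clips.

    schedule: pairs of (number of frames used, number of iterations); None means
    the full sample video frame sequence is used.
    """
    for num_frames, iters in schedule:
        for _ in range(iters):
            clips = [seq if num_frames is None else seq[:num_frames] for seq in sequences]
            train_iteration(model, optimizer, clips)
    return model
```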
Further, training the initial image processing model again according to the predicted super-resolution video frame of the sample video frame and the target super-resolution video frame of the sample video frame in each second sample video frame sequence to obtain the trained image processing model specifically includes the following contents: for each second sample video frame sequence, the terminal trains the initial image processing model according to the predicted super-resolution video frames and the target super-resolution video frames of a part of the sample video frames in the second sample video frame sequence and the attenuation mode of the model parameter update speed represented by the preset cosine attenuation strategy, until a third preset time is reached; trains the trained image processing model that reaches the third preset time again according to the predicted super-resolution video frames and the target super-resolution video frames of all the sample video frames in the second sample video frame sequence and the attenuation mode of the model parameter update speed represented by the preset cosine attenuation strategy, until a fourth preset time is reached; and determines the trained image processing model that reaches the fourth preset time as the trained image processing model.
According to the technical scheme provided by the embodiment of the disclosure, for each first sample video frame sequence, the image processing model to be trained is first trained according to a part of the sample video frames in the first sample video frame sequence and the attenuation mode of the model parameter update speed represented by the preset cosine attenuation strategy, and the trained image processing model is then trained again according to all the sample video frames in the first sample video frame sequence and the same attenuation mode. In this way, a few sample video frames are used to train the model for a period of time in the initial stage of training, and more sample video frames are used for a period of time in the later stage, which ensures the stability of the model during training and improves the image processing effect of the model on compressed video frames.
Fig. 7 is a flowchart illustrating an image processing method according to an exemplary embodiment. The method is used in a terminal and, as shown in fig. 7, includes the following steps:
in step S710, a compressed video frame sequence is acquired.
In step S720, for each video frame except the first video frame in the expanded video frame sequence, the video frame, the previous video frame of the video frame, and the target video frame in the expanded video frame sequence are input into the trained image processing model to obtain a super-resolution video frame of the video frame; the trained image processing model is obtained by training according to the above training method of the image processing model.
As in the training stage, the compressed video frame sequence refers to a video frame sequence that has been subjected to LDB coding.
The expanded video frame sequence is obtained by pasting a second video frame in the compressed video frame sequence to the front of a first video frame in the compressed video frame sequence; the target video frame is a video frame of which the corresponding video frame type in the expanded video frame sequence is a preset video frame type.
The second video frame in the compressed video frame sequence is pasted in front of the first video frame in the compressed video frame sequence to obtain the expanded video frame sequence; as a result, the previous video frame of the second video frame in the expanded video frame sequence is the second video frame of the compressed video frame sequence.
The preset video frame type is an I frame, and the target video frame is the video frame whose corresponding video frame type in the expanded video frame sequence is the I frame. For example, for a video frame sequence compressed by LDB coding under H.265, the first video frame is an I frame and the remaining video frames are all P frames; in the corresponding expanded video frame sequence, the second video frame is therefore the I frame and the remaining video frames are P frames.
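A simple sketch of how the expanded video frame sequence and its target (I) frame could be built, assuming the decoder exposes the frame type of each decoded frame; the helper and its inputs are hypothetical.

```python
def expand_sequence(frames, frame_types):
    """frames: decoded video frames in display order; frame_types: e.g. ['I', 'P', 'P', ...]."""
    # Paste a copy of the second frame in front of the first frame.
    expanded = [frames[1]] + list(frames)
    expanded_types = [frame_types[1]] + list(frame_types)
    # The target video frame is the one whose type is the preset type (an I frame);
    # for LDB coding under H.265 this is the second frame of the expanded sequence.
    target_index = expanded_types.index('I')
    return expanded, expanded[target_index]
```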
The trained image processing model likewise refers to a multi-stage cyclic convolution network, such as the network shown in fig. 3.
Specifically, the terminal identifies the video frame type corresponding to each video frame in the expanded video frame sequence, and screens out the video frame whose corresponding video frame type is the preset video frame type from the expanded video frame sequence as the target video frame of the expanded video frame sequence. For each video frame except the first video frame in the expanded video frame sequence, the video frame, the previous video frame of the video frame and the target video frame in the expanded video frame sequence are spliced in the feature dimension and input into a deformable convolution network for feature extraction; the extracted feature information is then input into the trained image processing model, and multiple rounds of super-resolution processing and reconstruction processing are performed by the trained image processing model to obtain the super-resolution video frame of the video frame.
In the above image processing method, when each video frame except the first video frame in the expanded video frame sequence is processed, the target video frame whose corresponding video frame type in the expanded video frame sequence is the preset video frame type and the previous video frame of the video frame are fully used to assist the reconstruction of the video frame, so the structural information of the video can be fully utilized, and the extracted feature information is related to the picture content and unrelated to noise, which further improves the super-resolution effect of the trained image processing model on compressed video frames. Meanwhile, the reconstruction layer and the super-resolution network in each level of convolutional network except the last level are connected to the next-level super-resolution network in the trained image processing model, which ensures the stability of information propagation, so that the obtained super-resolution video frame contains more image details and less noise, further improving the super-resolution effect of the model on compressed video frames.
In an exemplary embodiment, in step S720, the video frame, the previous video frame of the video frame, and the target video frame in the expanded video frame sequence are input into the trained image processing model, so as to obtain a super-resolution video frame of the video frame, which specifically includes the following contents: splicing the video frame, the previous video frame of the video frame and a target video frame in the expanded video frame sequence to obtain a spliced video frame corresponding to the video frame; performing deformable convolution processing on a spliced video frame corresponding to the video frame to obtain image characteristics of the video frame; and inputting the image characteristics of the video frame into the trained image processing model to obtain the super-resolution video frame of the video frame.
Specifically, for each video frame except the first video frame in the expanded video frame sequence, the terminal splices the video frame, the previous video frame of the video frame and the target video frame in the expanded video frame sequence in the feature dimension to obtain a spliced video frame corresponding to the video frame; the spliced video frame is input into a deformable convolution network, and deformable convolution processing is performed on it to obtain the image features of the video frame; the image features of the video frame are then input into the trained image processing model, and super-resolution processing and reconstruction processing are performed by each level of super-resolution network and each level of reconstruction layer in the trained image processing model to obtain the super-resolution video frame of the video frame.
According to the technical scheme provided by the embodiment of the disclosure, the video frame, the previous video frame of the video frame and the target video frame in the expanded video frame sequence are spliced in the feature dimension and then subjected to deformable convolution processing, and the obtained image features are input into the trained image processing model to obtain the super-resolution video frame of the video frame. In this way, when each video frame except the first video frame in the expanded video frame sequence is processed, the target video frame whose corresponding video frame type in the expanded video frame sequence is the preset video frame type and the previous video frame of the video frame are fully used to assist the reconstruction of the video frame, so the structural information of the video can be fully utilized, the extracted feature information is related to the picture content and unrelated to noise, the obtained super-resolution video frame contains more image details and less noise, and the super-resolution effect of the trained image processing model on compressed video frames is further improved.
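The splicing and deformable convolution step can be sketched with torchvision's DeformConv2d as below; the channel counts and the offset-prediction layer are assumptions for illustration and do not reproduce the exact network of fig. 3.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class FusionFeatureExtractor(nn.Module):
    """Fuses (current frame, previous frame, I frame) and extracts image features."""

    def __init__(self, in_channels=3, feat_channels=64, kernel_size=3):
        super().__init__()
        fused_channels = in_channels * 3  # three frames spliced on the channel (feature) dim
        # A plain convolution predicts the sampling offsets for the deformable convolution.
        self.offset_conv = nn.Conv2d(fused_channels, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=kernel_size // 2)
        self.deform_conv = DeformConv2d(fused_channels, feat_channels,
                                        kernel_size, padding=kernel_size // 2)

    def forward(self, cur_frame, prev_frame, i_frame):
        spliced = torch.cat([cur_frame, prev_frame, i_frame], dim=1)  # (B, 9, H, W)
        offsets = self.offset_conv(spliced)  # offsets let the kernel reach the misaligned I frame
        return self.deform_conv(spliced, offsets)
```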
In an exemplary embodiment, the image features of the video frame are input into the trained image processing model to obtain a super-resolution video frame of the video frame, which specifically includes the following contents: inputting the image characteristics of the video frame into a trained image processing model, performing super-resolution processing on the image characteristics of the video frame through a first-stage super-resolution network in the trained image processing model, inputting an obtained first-stage super-resolution result into a first-stage reconstruction layer for reconstruction processing, inputting the obtained first-stage reconstruction result and the first-stage super-resolution result into a second-stage super-resolution network for super-resolution processing, and outputting a final-stage reconstruction result through a final-stage reconstruction layer; and determining the last-stage reconstruction result as a super-resolution video frame of the video frame.
Specifically, referring to fig. 3, the terminal inputs the image features of the video frame into the trained image processing model and performs super-resolution processing on them through the first-stage super-resolution network in the trained image processing model; the obtained first-stage super-resolution result is input into the first-stage reconstruction layer for reconstruction processing; the obtained first-stage reconstruction result and the first-stage super-resolution result are input into the second-stage super-resolution network for super-resolution processing; the obtained second-stage super-resolution result is input into the second-stage reconstruction layer for reconstruction processing; the obtained second-stage reconstruction result and the second-stage super-resolution result are input into the third-stage super-resolution network for super-resolution processing; the obtained third-stage super-resolution result is input into the third-stage reconstruction layer for reconstruction processing; the obtained third-stage reconstruction result and the third-stage super-resolution result are input into the fourth-stage super-resolution network for super-resolution processing; and so on, until the last-stage reconstruction layer outputs the last-stage reconstruction result. The last-stage reconstruction result (namely the finally output super-resolution result) is determined as the super-resolution video frame of the video frame.
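A compact sketch of such a multi-stage structure, in which the super-resolution output and the reconstruction output of each stage both feed the next-stage super-resolution network, is given below; the internal block design (a single convolution per stage) and the omission of upsampling layers are simplifications for illustration only.

```python
import torch
import torch.nn as nn


class MultiStageSRNet(nn.Module):
    """Each stage holds a super-resolution block and a reconstruction layer; both outputs
    of stage k are concatenated and fed to the super-resolution block of stage k + 1."""

    def __init__(self, feat_channels=64, num_stages=4, out_channels=3):
        super().__init__()
        self.sr_blocks = nn.ModuleList()
        self.recon_layers = nn.ModuleList()
        for stage in range(num_stages):
            in_ch = feat_channels if stage == 0 else feat_channels + out_channels
            self.sr_blocks.append(nn.Sequential(
                nn.Conv2d(in_ch, feat_channels, 3, padding=1), nn.ReLU(inplace=True)))
            self.recon_layers.append(nn.Conv2d(feat_channels, out_channels, 3, padding=1))

    def forward(self, features):
        sr_out = self.sr_blocks[0](features)
        recon = self.recon_layers[0](sr_out)
        for stage in range(1, len(self.sr_blocks)):
            # previous reconstruction result + previous super-resolution result -> next stage
            sr_out = self.sr_blocks[stage](torch.cat([sr_out, recon], dim=1))
            recon = self.recon_layers[stage](sr_out)
        return recon  # last-stage reconstruction result = super-resolution video frame
```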
According to the technical scheme provided by the embodiment of the disclosure, the image features of the video frame are input into the trained image processing model, super-resolution processing and reconstruction processing are performed through the multi-level super-resolution networks and the multi-level reconstruction layers in the trained image processing model, and the super-resolution result and the reconstruction result of each level are input into the next-level super-resolution network, so that the image quality of the obtained super-resolution video frame is improved; meanwhile, the obtained super-resolution video frame contains less noise, achieving a denoising effect.
In order to describe the image processing method provided by the embodiments of the present disclosure more clearly, the image processing method is described below with a specific embodiment. In an embodiment, the present disclosure further provides a processing method based on a multi-stage cyclic convolution network under LDB coding compression, which can save training and computation overhead while achieving comparable performance by using the I frame information in the video to assist the reconstruction of the other frames, and specifically includes the following steps:
firstly, an I frame is introduced in the fusion stage of each super-resolution processing step. Taking the super-resolution of the fourth frame as an example, the current frame and its previous frame are used first, then the first frame (the I frame) is propagated to the current moment, and the three frames are spliced in the feature dimension and sent to the deformable convolution layer for feature extraction. Because the I frame has the least noise and the best image quality, the feature extraction for the third and fourth frames is guided by the image quality of the I frame, so that the extracted features are implicitly related to the picture content and unrelated to noise. Considering that the I frame is the first frame in the compressed video frame sequence and is far away from the remaining frames, deformable convolution is adopted instead of ordinary convolution. After this preliminary multi-frame fusion, the features obtained by the deformable convolution are sent to the super-resolution network for further feature extraction.
Secondly, an improved multi-stage cyclic convolution network is provided, in which the output of the super-resolution network and the output of the reconstruction layer of each stage are both connected to the input of the super-resolution network of the next stage. Because the feature output of the reconstruction layer is used as an input of the next-stage super-resolution network, the gradient back-propagation of the backbone network affects both the reconstruction layer and the super-resolution network of the previous stage, which ensures continuous information propagation and further improves the stability of model training.
Thirdly, in the training mode, a set of stabilizing strategies is adopted to solve the problem of unstable training caused by the deepening of the network. (1) In order to avoid errors caused by insufficient information accumulation at the first frame, the second frame is pasted in front of the first frame, so that the original n-frame sequence is expanded into an (n+1)-frame sequence; when the loss function is calculated, the results of the 1st, 2nd, 3rd, …, nth frames are used to calculate the loss value. (2) In the initial period of training, the whole sequence is not used as input; instead, 5 video frames are used to train the network for a short period, then 10 video frames are used for a further period, and finally the whole sequence is used to train the network (the lengths of these periods can be set according to the actual situation). (3) Training is first performed with a larger batch size and fewer video frames: for data organized as 4 × 20 × h × w × 3, the network is first trained with 8 × 10 × h × w × 3 and then trained with 4 × 20 × h × w × 3. (4) The learning rate decay is changed from the original strategy of decaying the learning rate to 1/10 of its value every fixed 30 training rounds to a cosine decay whose total decay period is the period used in training; cosine decay makes the training process of the cyclic convolution network more stable. It should be noted that there is no precedence relationship among the above (1) to (4); they can be combined and matched in any way.
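For strategy (4), a cosine learning-rate decay whose period spans the whole training run can be obtained with PyTorch's built-in scheduler, as sketched below; the model stand-in, learning rate and iteration count are placeholder values.

```python
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)  # stand-in for the image processing model
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
total_iterations = 300_000                   # the decay period spans the whole training run
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_iterations)

for step in range(total_iterations):
    optimizer.step()   # placeholder for a real forward/backward pass and update
    scheduler.step()   # learning rate follows a cosine curve down over the run
```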
In the above processing method based on a multi-stage cyclic convolution network under LDB coding compression, the I frame can be used to implicitly guide the feature extraction module to pay more attention to the picture content and ignore noise. Meanwhile, deformable convolution is introduced to alleviate the misalignment between the I frame and the other frames during fusion. In addition, the multi-stage cyclic convolution network and the set of targeted training strategies enable information to propagate more stably through the network.
It should be understood that, although the steps in the flowcharts related to the above embodiments are shown in sequence as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated otherwise, there is no strict order limitation on the execution of these steps, and they may be performed in other orders. Moreover, at least a part of the steps in the flowcharts related to the above embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different moments, and their execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a part of the sub-steps or stages of other steps.
It is understood that the same/similar parts between the embodiments of the method described above in this specification can be referred to each other, and each embodiment focuses on the differences from the other embodiments, and it is sufficient that the relevant points are referred to the descriptions of the other method embodiments.
Based on the same inventive concept, the embodiment of the present disclosure further provides a training apparatus for an image processing model, which is used for implementing the above-mentioned training method for an image processing model.
FIG. 8 is a block diagram illustrating an apparatus for training an image processing model according to an exemplary embodiment. Referring to fig. 8, the apparatus includes a sample acquiring unit 810, a sample processing unit 820, and a model training unit 830.
A sample obtaining unit 810 configured to perform obtaining the compressed sample video frame sequence.
A sample processing unit 820 configured to perform input of the sample video frame, a previous sample video frame of the sample video frame, and a target sample video frame of the extended sample video frame sequence into an image processing model to be trained for each sample video frame except for a first sample video frame of the extended sample video frame sequence, so as to obtain a predicted super-resolution video frame of the sample video frame; the expanded sample video frame sequence is obtained by pasting a second sample video frame in the compressed sample video frame sequence to the front of a first sample video frame in the compressed sample video frame sequence; the target sample video frame is a sample video frame of which the corresponding video frame type in the expanded sample video frame sequence is a preset video frame type.
A model training unit 830 configured to perform training of an image processing model to be trained according to the predicted super-resolution video frame of the sample video frame and the target super-resolution video frame of the sample video frame, resulting in a trained image processing model;
the image processing model to be trained comprises a plurality of stages of convolution networks, wherein each stage of convolution network comprises a reconstruction layer and a super-resolution network; and the reconstruction layer and the super-resolution network in each stage of the convolutional network except the last stage of the convolutional network are connected with the next stage of the super-resolution network.
In an exemplary embodiment, the sample processing unit 820 is further configured to perform a splicing process on the sample video frame, a previous sample video frame of the sample video frame, and a target sample video frame in the expanded sample video frame sequence, so as to obtain a spliced video frame corresponding to the sample video frame; performing deformable convolution processing on the spliced video frame corresponding to the sample video frame to obtain the image characteristics of the sample video frame; and inputting the image characteristics of the sample video frame into an image processing model to be trained to obtain a predicted super-resolution video frame of the sample video frame.
In an exemplary embodiment, the sample processing unit 820 is further configured to perform super-resolution processing on the image features of the sample video frame by using a first-level super-resolution network in the image processing model to be trained, input the obtained first-level super-resolution result into a first-level reconstruction layer for reconstruction processing, input the obtained first-level reconstruction result and the first-level super-resolution result into a second-level super-resolution network for super-resolution processing, and output the last-level reconstruction result by using a last-level reconstruction layer; and determining the last-stage reconstruction result as a predicted super-resolution video frame of the sample video frame.
In an exemplary embodiment, the model training unit 830 is further configured to obtain a loss value corresponding to each sample video frame according to the predicted super-resolution video frame and the target super-resolution video frame of each sample video frame; obtain an average value of the loss values as a target loss value; and train the image processing model to be trained according to the target loss value to obtain the trained image processing model.
In an exemplary embodiment, the compressed sample video frame sequence includes N1 compressed first sample video frame sequences each including M1 sample video frames and N2 compressed second sample video frame sequences each including M2 sample video frames; n1, N2, M1 and M2 are positive integers, N1 is greater than N2, and M1 is less than M2;
a model training unit 830, further configured to train the image processing model to be trained according to the predicted super-resolution video frame of the sample video frame and the target super-resolution video frame of the sample video frame in each first sample video frame sequence to obtain an initial image processing model; and train the initial image processing model again according to the predicted super-resolution video frame of the sample video frame in each second sample video frame sequence and the target super-resolution video frame of the sample video frame to obtain the trained image processing model.
In an exemplary embodiment, the model training unit 830 is further configured to, for each first sample video frame sequence, train the image processing model to be trained according to the predicted super-resolution video frames and the target super-resolution video frames of a part of the sample video frames in the first sample video frame sequence and the attenuation mode of the model parameter update speed represented by the preset cosine attenuation strategy until a first preset time is reached; train the trained image processing model that reaches the first preset time again according to the predicted super-resolution video frames and the target super-resolution video frames of all the sample video frames in the first sample video frame sequence and the attenuation mode of the model parameter update speed represented by the preset cosine attenuation strategy until a second preset time is reached; and determine the trained image processing model that reaches the second preset time as the initial image processing model.
Fig. 9 is a block diagram illustrating an image processing apparatus according to an exemplary embodiment. Referring to fig. 9, the apparatus includes a video frame acquisition unit 910 and a video frame processing unit 920.
A video frame acquisition unit 910 configured to perform acquiring a compressed video frame sequence.
A video frame processing unit 920 configured to, for each video frame except the first video frame in the expanded video frame sequence, input the video frame, the previous video frame of the video frame and the target video frame in the expanded video frame sequence into the trained image processing model to obtain a super-resolution video frame of the video frame; the expanded video frame sequence is obtained by pasting the second video frame in the compressed video frame sequence in front of the first video frame in the compressed video frame sequence; the target video frame is the video frame whose corresponding video frame type in the expanded video frame sequence is the preset video frame type, and the trained image processing model is obtained by training according to the above training method of the image processing model.
In an exemplary embodiment, the video frame processing unit 920 is further configured to perform a splicing process on the video frame, a previous video frame of the video frame, and a target video frame in the expanded video frame sequence, so as to obtain a spliced video frame corresponding to the video frame; performing deformable convolution processing on a spliced video frame corresponding to the video frame to obtain image characteristics of the video frame; and inputting the image characteristics of the video frame into the trained image processing model to obtain the super-resolution video frame of the video frame.
In an exemplary embodiment, the video frame processing unit 920 is further configured to perform inputting image features of the video frame into a trained image processing model, performing super-resolution processing on the image features of the video frame through a first-level super-resolution network in the trained image processing model, inputting an obtained first-level super-resolution result into a first-level reconstruction layer for reconstruction processing, inputting the obtained first-level reconstruction result and the obtained first-level super-resolution result into a second-level super-resolution network for super-resolution processing, and outputting a last-level reconstruction result through a last-level reconstruction layer; and determining the last-stage reconstruction result as a super-resolution video frame of the video frame.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The training device of the image processing model or each module in the image processing device may be wholly or partially implemented by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
Fig. 10 is a block diagram illustrating an electronic device 1000 for implementing a training method or an image processing method of an image processing model according to an exemplary embodiment. For example, the electronic device 1000 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a gaming console, a tablet device, a medical device, a fitness device, a personal digital assistant, and the like.
Referring to fig. 10, electronic device 1000 may include one or more of the following components: processing component 1002, memory 1004, power component 1006, multimedia component 1008, audio component 1010, interface to input/output (I/O) 1012, sensor component 1014, and communications component 1016.
The processing component 1002 generally controls the overall operation of the electronic device 1000, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 1002 may include one or more processors 1020 to execute instructions to perform all or a portion of the steps of the methods described above. Further, processing component 1002 may include one or more modules that facilitate interaction between processing component 1002 and other components. For example, the processing component 1002 may include a multimedia module to facilitate interaction between the multimedia component 1008 and the processing component 1002.
The memory 1004 is configured to store various types of data to support operations at the electronic device 1000. Examples of such data include instructions for any application or method operating on the electronic device 1000, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 1004 may be implemented by any type or combination of volatile or non-volatile storage devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, optical disk, or graphene memory.
The power supply component 1006 provides power to the various components of the electronic device 1000. The power components 1006 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 1000.
The multimedia component 1008 includes a screen that provides an output interface between the electronic device 1000 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 1008 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 1000 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 1010 is configured to output and/or input audio signals. For example, the audio component 1010 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 1000 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 1004 or transmitted via the communication component 1016. In some embodiments, audio component 1010 also includes a speaker for outputting audio signals.
I/O interface 1012 provides an interface between processing component 1002 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 1014 includes one or more sensors for providing various aspects of status assessment for the electronic device 1000. For example, the sensor assembly 1014 may detect an open/closed state of the electronic device 1000, the relative positioning of components, such as a display and keypad of the electronic device 1000, the sensor assembly 1014 may also detect a change in the position of the electronic device 1000 or components of the electronic device 1000, the presence or absence of user contact with the electronic device 1000, orientation or acceleration/deceleration of the device 1000, and a change in the temperature of the electronic device 1000. The sensor assembly 1014 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 1014 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 1014 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 1016 is configured to facilitate wired or wireless communication between the electronic device 1000 and other devices. The electronic device 1000 may access a wireless network based on a communication standard, such as WiFi, an operator network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 1016 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 1016 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 1000 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors, or other electronic components for performing the above-described methods.
In an exemplary embodiment, a computer-readable storage medium comprising instructions, such as the memory 1004 comprising instructions, executable by the processor 1020 of the electronic device 1000 to perform the above-described method is also provided. For example, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, which includes instructions executable by the processor 1020 of the electronic device 1000 to perform the above-described method.
It should be noted that the descriptions of the above-mentioned apparatus, the electronic device, the computer-readable storage medium, the computer program product, and the like according to the method embodiments may also include other embodiments, and specific implementations may refer to the descriptions of the related method embodiments, which are not described in detail herein.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (13)

1. A method for training an image processing model, comprising:
obtaining a compressed sample video frame sequence;
inputting a sample video frame, a last sample video frame of the sample video frame and a target sample video frame in the extended sample video frame sequence into an image processing model to be trained aiming at each sample video frame except for a first sample video frame in the extended sample video frame sequence to obtain a predicted super-resolution video frame of the sample video frame; the expanded sample video frame sequence is obtained by pasting a second sample video frame in the compressed sample video frame sequence to the front of a first sample video frame in the compressed sample video frame sequence; the target sample video frame is a sample video frame of which the corresponding video frame type in the expanded sample video frame sequence is a preset video frame type;
training the image processing model to be trained according to the predicted super-resolution video frame of the sample video frame and the target super-resolution video frame of the sample video frame to obtain a trained image processing model;
the image processing model to be trained comprises a plurality of stages of convolution networks, wherein each stage of convolution network comprises a reconstruction layer and a super-resolution network; and the reconstruction layer and the super-resolution network in each stage of the convolutional network except the last stage of the convolutional network are connected with the next stage of the super-resolution network.
2. The method of claim 1, wherein the inputting the sample video frame, the previous sample video frame of the sample video frame, and the target sample video frame in the sequence of the extended sample video frames into an image processing model to be trained to obtain the predicted super-resolution video frame of the sample video frame comprises:
splicing the sample video frame, a previous sample video frame of the sample video frame and a target sample video frame in the expanded sample video frame sequence to obtain a spliced video frame corresponding to the sample video frame;
performing deformable convolution processing on the spliced video frame corresponding to the sample video frame to obtain the image characteristics of the sample video frame;
and inputting the image characteristics of the sample video frame into an image processing model to be trained to obtain a predicted super-resolution video frame of the sample video frame.
3. The method of claim 2, wherein the inputting the image features of the sample video frame into an image processing model to be trained to obtain a predicted super-resolution video frame of the sample video frame comprises:
inputting the image characteristics of the sample video frame into an image processing model to be trained, performing super-resolution processing on the image characteristics of the sample video frame through a first-stage super-resolution network in the image processing model to be trained, inputting an obtained first-stage super-resolution result into a first-stage reconstruction layer for reconstruction processing, inputting the obtained first-stage reconstruction result and the obtained first-stage super-resolution result into a second-stage super-resolution network for super-resolution processing, and outputting a last-stage reconstruction result through a last-stage reconstruction layer;
and determining the last-stage reconstruction result as a predicted super-resolution video frame of the sample video frame.
4. The method of claim 1, wherein the training the image processing model to be trained according to the predicted super-resolution video frame of the sample video frame and the target super-resolution video frame of the sample video frame to obtain a trained image processing model comprises:
obtaining a loss value corresponding to each sample video frame according to the predicted super-resolution video frame and the target super-resolution video frame of each sample video frame;
obtaining an average value of the loss values as a target loss value;
and training the image processing model to be trained according to the target loss value to obtain the trained image processing model.
5. The method of claim 1, wherein the compressed sample video frame sequences comprise N1 compressed first sample video frame sequences each comprising M1 sample video frames and N2 compressed second sample video frame sequences each comprising M2 sample video frames; n1, N2, M1 and M2 are positive integers, N1 is greater than N2, and M1 is less than M2;
the training the image processing model to be trained according to the predicted super-resolution video frame of the sample video frame and the target super-resolution video frame of the sample video frame to obtain the trained image processing model comprises the following steps:
training the image processing model to be trained according to the predicted super-resolution video frame of the sample video frame and the target super-resolution video frame of the sample video frame in each first sample video frame sequence to obtain an initial image processing model;
and according to the predicted super-resolution video frame of the sample video frame in each second sample video frame sequence and the target super-resolution video frame of the sample video frame, re-training the initial image processing model to obtain a trained image processing model.
6. The method of claim 5, wherein the training the image processing model to be trained according to the predicted super-resolution video frame of the sample video frame and the target super-resolution video frame of the sample video frame in each first sample video frame sequence to obtain an initial image processing model comprises:
for each first sample video frame sequence, training the image processing model to be trained according to a predicted super-resolution video frame and a target super-resolution video frame of a part of sample video frames in the first sample video frame sequence and an attenuation mode of a model parameter updating speed represented by a preset cosine attenuation strategy until a first preset time is reached;
and according to the predicted super-resolution video frames and the target super-resolution video frames of all the sample video frames in the first sample video frame sequence and the attenuation mode of the model parameter updating speed represented by the preset cosine attenuation strategy, retraining the trained image processing model reaching the first preset time until reaching a second preset time, and determining the trained image processing model reaching the second preset time as the initial image processing model.
7. An image processing method, comprising:
acquiring a compressed video frame sequence;
inputting the video frame, the last video frame of the video frame and the target video frame in the expanded video frame sequence into a trained image processing model aiming at each video frame except the first video frame in the expanded video frame sequence to obtain a super-resolution video frame of the video frame;
wherein the expanded video frame sequence is obtained by pasting a second video frame in the compressed video frame sequence to the front of a first video frame in the compressed video frame sequence; the target video frame is a video frame of which the corresponding video frame type in the expanded video frame sequence is a preset video frame type, and the trained image processing model is obtained by training according to the method of any one of claims 1 to 6.
8. The method of claim 7, wherein the inputting the video frame, the previous video frame of the video frame, and the target video frame of the sequence of the extended video frames into the trained image processing model to obtain the super-resolution video frame of the video frame comprises:
splicing the video frame, the last video frame of the video frame and a target video frame in the expanded video frame sequence to obtain a spliced video frame corresponding to the video frame;
performing deformable convolution processing on the spliced video frame corresponding to the video frame to obtain the image characteristics of the video frame;
and inputting the image characteristics of the video frame into the trained image processing model to obtain the super-resolution video frame of the video frame.
9. The method according to claim 8, wherein the inputting the image features of the video frame into the trained image processing model to obtain the super-resolution video frame of the video frame comprises:
inputting the image characteristics of the video frame into a trained image processing model, performing super-resolution processing on the image characteristics of the video frame through a first-stage super-resolution network in the trained image processing model, inputting an obtained first-stage super-resolution result into a first-stage reconstruction layer for reconstruction processing, inputting the obtained first-stage reconstruction result and the first-stage super-resolution result into a second-stage super-resolution network for super-resolution processing, and outputting a last-stage reconstruction result through a last-stage reconstruction layer;
and determining the last-stage reconstruction result as a super-resolution video frame of the video frame.
10. An apparatus for training an image processing model, comprising:
a sample acquisition unit configured to perform acquisition of a compressed sample video frame sequence;
the sample processing unit is configured to input the sample video frame, a last sample video frame of the sample video frame and a target sample video frame in the extended sample video frame sequence into an image processing model to be trained for each sample video frame except for a first sample video frame in the extended sample video frame sequence to obtain a predicted super-resolution video frame of the sample video frame; the expanded sample video frame sequence is obtained by pasting a second sample video frame in the compressed sample video frame sequence to the front of a first sample video frame in the compressed sample video frame sequence; the target sample video frame is a sample video frame of which the corresponding video frame type in the expanded sample video frame sequence is a preset video frame type;
a model training unit configured to train the image processing model to be trained according to the predicted super-resolution video frame of the sample video frame and the target super-resolution video frame of the sample video frame, to obtain a trained image processing model;
the image processing model to be trained comprises a plurality of stages of convolution networks, wherein each stage of convolution network comprises a reconstruction layer and a super-resolution network; and the reconstruction layer and the super-resolution network in each level of convolutional network except the last level of convolutional network are connected with the next level of super-resolution network.
11. An image processing apparatus characterized by comprising:
a video frame acquisition unit configured to perform acquisition of a compressed video frame sequence;
the video frame processing unit is configured to input the video frame, the last video frame of the video frame and the target video frame in the expanded video frame sequence into a trained image processing model aiming at each video frame except the first video frame in the expanded video frame sequence to obtain a super-resolution video frame of the video frame;
wherein the expanded video frame sequence is obtained by pasting a second video frame in the compressed video frame sequence to the front of a first video frame in the compressed video frame sequence; the target video frame is a video frame of which the corresponding video frame type in the expanded video frame sequence is a preset video frame type, and the trained image processing model is obtained by training according to the method of any one of claims 1 to 6.
12. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of training an image processing model according to any one of claims 1 to 6, or the method of image processing according to any one of claims 7 to 9.
13. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of training an image processing model of any of claims 1 to 6, or the method of image processing of any of claims 7 to 9.