CN113011562A - Model training method and device - Google Patents

Model training method and device

Info

Publication number
CN113011562A
Authority
CN
China
Prior art keywords
feature map
intermediate feature
information
video
video processing
Prior art date
Legal status
Pending
Application number
CN202110292062.4A
Other languages
Chinese (zh)
Inventor
肖帅
宋风龙
熊志伟
肖泽宇
Current Assignee
University of Science and Technology of China USTC
Huawei Technologies Co Ltd
Original Assignee
University of Science and Technology of China USTC
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China USTC, Huawei Technologies Co Ltd filed Critical University of Science and Technology of China USTC
Priority to CN202110292062.4A
Publication of CN113011562A

Classifications

    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06T3/4053 Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G06T5/73 Deblurring; Sharpening
    • G06T2207/10016 Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a model training method that can be applied in the field of artificial intelligence. The method includes: acquiring a video sample, a first video processing network and a second video processing network; processing the video sample through the first video processing network and the second video processing network to obtain a first intermediate feature map output and a second intermediate feature map output, respectively; processing the first intermediate feature map output and the second intermediate feature map output through a recurrent neural network to obtain first inter-frame information and second inter-frame information, respectively; determining a target loss for knowledge distillation according to the first inter-frame information and the second inter-frame information; and performing knowledge distillation on the second video processing network based on the target loss. By adding inter-frame information to the target loss, the method and device improve the video quality of the video obtained when the student model performs a video processing task after knowledge distillation.

Description

Model training method and device
Technical Field
The application relates to the field of artificial intelligence, in particular to a model training method and device.
Background
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the capabilities of perception, reasoning, and decision-making.
Photographing and video capture on terminal devices such as smartphones have improved remarkably, but, limited by the hardware performance of the terminal device's optical sensor, the quality of the captured photos and videos is still not high enough, with problems such as high noise, low resolving power, loss of detail, and color cast. Meanwhile, the hardware area and power consumption of the image signal processor are limited, and it is very difficult for conventional image processing algorithms to overcome the above challenges. In order to improve the picture quality of an image or video, the video may be processed.
Deep learning has been a key driving force behind the development of artificial intelligence in recent years, with remarkable results on various computer vision tasks. In the field of video processing, video processing models based on deep learning also achieve the best performance in the industry, clearly outperforming traditional methods.
However, the computing power of mobile terminals is weak, while the structures of current video processing models are very complex and demand substantial hardware computing resources. This severely limits the application of such neural networks in scenarios with strict real-time requirements and makes them difficult to deploy on devices with weak computing power, such as mobile terminals.
Disclosure of Invention
In a first aspect, the present application provides a model training method, including:
the method comprises the steps of obtaining a video sample, a first video processing network and a second video processing network, wherein the first video processing network is a teacher model, and the second video processing network is a student model to be trained;
in one possible implementation, the video sample may include a plurality of image frames, and the first video processing network and the second video processing network are used for implementing a video enhancement task, which may be understood as a task for enhancing the quality of the video, for example, a video denoising task, a video defogging task, a super-resolution task, a high dynamic range task, or the like, and is not limited herein;
processing the video samples through the first video processing network and the second video processing network to obtain a first intermediate feature map output of the first video processing network and a second intermediate feature map output of the second video processing network, respectively;
wherein the first intermediate feature map output may be a feature map output of an intermediate network layer when the first video processing network processes the video sample, the second intermediate feature map output may be a feature map output of an intermediate network layer when the second video processing network processes the video sample, and a position of the network layer outputting the first intermediate feature map output in the first video processing network is the same as a position of the network layer outputting the second intermediate feature map output in the second video processing network;
the intermediate network layer may be a network layer for outputting a feature map in the first video processing network and the second video processing network, and as long as the output feature map can carry image features of the image frame, the embodiments of the present application do not limit the positions of the intermediate network layer in the first video processing network and the second video processing network and the types of the network layers;
respectively processing the first intermediate feature map output and the second intermediate feature map output to respectively obtain first inter-frame information and second inter-frame information, wherein the first inter-frame information and the second inter-frame information are used for representing the feature change relationship between the image frames of the video sample;
in one implementation, the first intermediate feature map output and the second intermediate feature map output may be processed separately by a recurrent neural network. The first inter-frame information and the second inter-frame information may represent the feature change relationship between the image frames of the video sample, since a recurrent neural network, when processing sequence data, memorizes previous information and applies it to the calculation of the current output. Specifically, the feature change relationship may refer to continuity and change information between frames, where the continuity information is the relationship between stationary regions across frames, and the change information is the relationship between moving objects across frames;
and determining a target loss according to the first interframe information and the second interframe information, and performing knowledge distillation on the second video processing network based on the target loss and the first video processing network to obtain a trained second video processing network, wherein the target loss is related to the difference between the first interframe information and the second interframe information.
In this way, without changing the structure of the model, inter-frame information is added to the target loss used for knowledge distillation, so that the teacher model's stronger ability to recognize inter-frame information and to use it for video processing is transferred to the student model, which improves the video quality of the video obtained when the student model performs video processing after knowledge distillation.
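As a purely illustrative sketch (not the claimed method itself), the following PyTorch-style code shows one way such an inter-frame distillation term could be assembled: per-frame intermediate feature maps from the teacher and the student are pooled, fed through a shared recurrent network (an LSTM is assumed here), and the loss penalizes the difference between the resulting inter-frame representations. All names, the pooling step, and the use of mean squared error are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def interframe_distillation_loss(teacher_feats, student_feats, rnn):
    # teacher_feats / student_feats: lists of per-frame intermediate feature
    # maps, each of shape (B, C, H, W); rnn consumes one vector per frame.
    def interframe_info(feats):
        seq = torch.stack([f.mean(dim=(2, 3)) for f in feats], dim=1)  # (B, T, C)
        hidden_seq, _ = rnn(seq)        # hidden states for every frame
        return hidden_seq               # used here as "inter-frame information"

    h_teacher = interframe_info(teacher_feats).detach()  # teacher side is not trained
    h_student = interframe_info(student_feats)
    return F.mse_loss(h_student, h_teacher)

# Usage sketch with arbitrary shapes: 5 frames, batch of 2, 64 channels.
rnn = nn.LSTM(input_size=64, hidden_size=64, batch_first=True)
teacher_feats = [torch.randn(2, 64, 32, 32) for _ in range(5)]
student_feats = [torch.randn(2, 64, 32, 32) for _ in range(5)]
loss = interframe_distillation_loss(teacher_feats, student_feats, rnn)
```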
In one possible implementation, the separately processing the first intermediate feature map output and the second intermediate feature map output includes:
and processing the first intermediate feature map output and the second intermediate feature map output respectively through a recurrent neural network.
In one possible implementation, the first interframe information and the second interframe information are hidden states (hidden states) of the recurrent neural network.
In one possible implementation, the video sample includes multiple frames of images, the first intermediate feature map output includes a first sub-intermediate feature map corresponding to each frame of image in the multiple frames of images, and the second intermediate feature map output includes a second sub-intermediate feature map corresponding to each frame of image in the multiple frames of images. The first inter-frame information includes M hidden states obtained by the recurrent neural network processing the first sub-intermediate feature maps corresponding to the last M frames of images among the multiple frames of images, and the second inter-frame information includes M hidden states obtained by the recurrent neural network processing the second sub-intermediate feature maps corresponding to the last M frames of images among the multiple frames of images.
In one implementation, the hidden states output by the LSTM may also be obtained; that is, the LSTM separately processes the first intermediate feature map output of the first video processing network and the second intermediate feature map output of the second video processing network to obtain a first hidden state and a second hidden state, where the first hidden state may be all of the hidden states, or a subset of the hidden states, produced by the LSTM when processing the first intermediate feature map output;
in one possible implementation, the recurrent neural network is a long-short term memory (LSTM) network, and the first inter-frame information and the second inter-frame information are cell states (cell states) output by the LSTM.
In one possible implementation, the video sample includes multiple frames of images, the first intermediate feature map output includes a first sub-intermediate feature map corresponding to each frame of image in the multiple frames of images, the second intermediate feature map output includes a second sub-intermediate feature map corresponding to each frame of image in the multiple frames of images, the first inter-frame information is a cell state obtained by processing a first sub-intermediate feature map corresponding to a last frame of image in the multiple frames of images by the LSTM network, and the second inter-frame information is a cell state obtained by processing a second sub-intermediate feature map corresponding to a last frame of image in the multiple frames of images by the LSTM network.
The LSTM network may sequentially process the input image frames to obtain a cell state corresponding to each image frame. The states obtained by processing later image frames generally carry more inter-frame information, so in order to reduce the amount of computation, the cell state obtained by the RNN processing the intermediate feature map corresponding to a later image frame among the multiple frames of images may be selected.
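For illustration only, and under the same assumptions as the sketch above (per-frame feature maps reduced to one vector per frame), the cell state after the last frame, or the hidden states of the last M frames, could be taken from a PyTorch LSTM as follows; the variable names and M = 3 are hypothetical.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=64, batch_first=True)

seq = torch.randn(2, 5, 64)              # (B, T, C): one pooled vector per frame
output, (h_n, c_n) = lstm(seq)

last_cell_state = c_n[-1]                # (B, hidden): cell state after the last frame
last_m_hidden = output[:, -3:]           # (B, 3, hidden): hidden states of the last M = 3 frames
```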
In one possible implementation, the video sample includes a plurality of frames of images, the first intermediate feature map output includes a first sub-intermediate feature map corresponding to each frame of image in the plurality of frames of images, the second intermediate feature map output includes a second sub-intermediate feature map corresponding to each frame of image in the plurality of frames of images, and the method further includes:
processing each first sub-intermediate feature map and each second sub-intermediate feature map to obtain first spatial information of each first sub-intermediate feature map and second spatial information of each second sub-intermediate feature map, where the first spatial information and the second spatial information are used to represent feature distribution of feature maps;
the determining a target loss according to the first inter-frame information and the second inter-frame information includes:
determining a target loss based on the first inter-frame information and the second inter-frame information, and the first spatial information and the second spatial information, the target loss being related to a difference between the first inter-frame information and the second inter-frame information, and a difference between the first spatial information and the second spatial information.
In one implementation, the target penalty may relate to a difference between spatial information of the first intermediate feature map output and the second intermediate feature map output in addition to a difference between the first inter-frame information and the second inter-frame information; the spatial information is used to represent a feature distribution of the feature map, and the feature distribution may include rich image content and represent image features of the corresponding image frame, such as frequency features, texture detail features, and the like.
In one possible implementation, the first spatial information is a first spatial attention diagram, the second spatial information is a second spatial attention diagram, and the performing information statistics on each of the first sub-intermediate feature maps and each of the second sub-intermediate feature maps includes:
mapping the first and second intermediate feature map outputs based on a spatial attention mechanism to obtain the first and second spatial attention maps, respectively.
In an alternative implementation, each first sub-intermediate feature map may be averaged over the channel dimension to obtain the first spatial information, and each second sub-intermediate feature map may be averaged over the channel dimension to obtain the second spatial information. When the information statistic is a channel-wise average, the spatial information may also be referred to as a spatial attention map.
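A minimal sketch of the channel-averaging statistic described above is given below; the softmax normalization over spatial positions and the function name are assumptions.

```python
import torch
import torch.nn.functional as F

def spatial_attention_map(feature_map: torch.Tensor) -> torch.Tensor:
    # feature_map: (B, C, H, W)
    b, c, h, w = feature_map.shape
    attn = feature_map.mean(dim=1)               # (B, H, W): channel-wise average
    attn = F.softmax(attn.view(b, -1), dim=1)    # assumed normalization over spatial positions
    return attn.view(b, h, w)
```

A spatial distillation term could then compare the teacher's and student's maps with, for example, an L1 or L2 distance.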
In one possible implementation, the processing the video samples by the first video processing network and the second video processing network includes:
processing the video sample through the first video processing network and the second video processing network to obtain a first intermediate feature map output of the first video processing network, a first video processing result (enhanced video) output by the first video processing network, and a second intermediate feature map output of the second video processing network, respectively;
the determining a target loss according to the first inter-frame information and the second inter-frame information includes:
acquiring a true value (ground truth) corresponding to the video sample;
determining a target loss based on the first and second inter-frame information and the first video processing result and the true value, the target loss being related to a difference between the first and second inter-frame information and a difference between the first video processing result and the true value.
In an implementation, in addition to the difference between the first inter-frame information and the second inter-frame information, the target loss may relate to the difference between a first video processing result and the true value (ground truth) corresponding to the video sample. As an example, where the first video processing network and the second video processing network are used to implement a video enhancement task, the true value (ground truth) corresponding to the video sample may be understood as a version of the video sample with improved video quality. In an implementation, the true value (ground truth) corresponding to the video sample may be preset, or obtained after the video sample is subjected to image enhancement by the first video processing network; this is not limited here. In one implementation, the target loss may be constructed based on the difference between the first inter-frame information and the second inter-frame information, the difference between the first spatial information and the second spatial information, and the difference between the first video processing result and the true value (ground truth) corresponding to the video sample.
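Purely as an illustration of how the three differences discussed above could be combined, the following sketch forms a weighted sum; the weights, the distance functions, and the argument names are assumptions, not values taken from this application.

```python
import torch.nn.functional as F

def target_loss(video_output, ground_truth,
                h_student, h_teacher,
                attn_student, attn_teacher,
                w_rec=1.0, w_frame=0.1, w_spatial=0.1):
    rec_loss = F.l1_loss(video_output, ground_truth)       # video processing result vs. ground truth
    frame_loss = F.mse_loss(h_student, h_teacher)          # inter-frame information difference
    spatial_loss = F.mse_loss(attn_student, attn_teacher)  # spatial information difference
    return w_rec * rec_loss + w_frame * frame_loss + w_spatial * spatial_loss
```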
In one possible implementation, the first video processing network and the second video processing network are used to implement video enhancement tasks.
In one possible implementation, the video enhancement task is a video de-noising task, a video de-fogging task, a super-resolution task, or a high dynamic range task.
In one possible implementation, before the separately processing the first intermediate feature map output and the second intermediate feature map output, the method further includes:
performing deblurring processing on the first intermediate feature map and the second intermediate feature map respectively to obtain the deblurred first intermediate feature map and the deblurred second intermediate feature map;
the processing the first intermediate feature map and the second intermediate feature map by the recurrent neural network respectively includes:
and respectively processing the first intermediate feature map after the deblurring processing and the second intermediate feature map after the deblurring processing through a recurrent neural network.
In a second aspect, the present application provides a model training apparatus, the apparatus comprising:
the system comprises an acquisition module, a training module and a training module, wherein the acquisition module is used for acquiring a video sample, a first video processing network and a second video processing network, the first video processing network is a teacher model, and the second video processing network is a student model to be trained;
a video processing module, configured to process the video sample through the first video processing network and the second video processing network to obtain a first intermediate feature map output of the first video processing network and a second intermediate feature map output of the second video processing network, respectively;
a feature map processing module, configured to process the first intermediate feature map output and the second intermediate feature map output respectively to obtain first inter-frame information and second inter-frame information, where the first inter-frame information and the second inter-frame information are used to represent a feature change relationship between image frames of the video sample;
a knowledge distillation module, configured to determine a target loss according to the first interframe information and the second interframe information, and perform knowledge distillation on the second video processing network based on the target loss and the first video processing network to obtain a trained second video processing network, where the target loss is related to a difference between the first interframe information and the second interframe information.
In one possible implementation, the feature map processing module is configured to process the first intermediate feature map output and the second intermediate feature map output through a recurrent neural network, respectively.
In one possible implementation, the first interframe information and the second interframe information are hidden states (hidden states) of the recurrent neural network.
In one possible implementation, the recurrent neural network is a long-short term memory (LSTM) network, and the first inter-frame information and the second inter-frame information are cell states output by the LSTM.
In one possible implementation, the video sample includes multiple frames of images, the first intermediate feature map output includes a first sub-intermediate feature map corresponding to each frame of image in the multiple frames of images, the second intermediate feature map output includes a second sub-intermediate feature map corresponding to each frame of image in the multiple frames of images, the first inter-frame information is a cell state obtained by processing a first sub-intermediate feature map corresponding to a last frame of image in the multiple frames of images by the LSTM network, and the second inter-frame information is a cell state obtained by processing a second sub-intermediate feature map corresponding to a last frame of image in the multiple frames of images by the LSTM network.
In one possible implementation, the video sample includes a plurality of frames of images, the first intermediate feature map output includes a first sub-intermediate feature map corresponding to each frame of image in the plurality of frames of images, the second intermediate feature map output includes a second sub-intermediate feature map corresponding to each frame of image in the plurality of frames of images, and the apparatus further includes:
an information statistics module, configured to process each of the first sub-intermediate feature maps and each of the second sub-intermediate feature maps to obtain first spatial information of each of the first sub-intermediate feature maps and second spatial information of each of the second sub-intermediate feature maps, where the first spatial information and the second spatial information are used to represent feature distribution of feature maps;
the knowledge distillation module is used for determining a target loss according to the first interframe information and the second interframe information and the first spatial information and the second spatial information, wherein the target loss is related to the difference between the first interframe information and the second interframe information and the difference between the first spatial information and the second spatial information.
In a possible implementation, the first spatial information is a first spatial attention map, the second spatial information is a second spatial attention map, and the information statistics module is configured to map the first intermediate feature map output and the second intermediate feature map output based on a spatial attention mechanism, so as to obtain the first spatial attention map and the second spatial attention map, respectively.
In one possible implementation, the video processing module is configured to process the video sample through the first video processing network and the second video processing network to obtain a first intermediate feature map output of the first video processing network, a first video processing result output by the first video processing network, and a second intermediate feature map output by the second video processing network, respectively;
the knowledge distillation module is used for acquiring a true value (ground truth) corresponding to the video sample; and determining a target loss based on the first and second inter-frame information and the first video processing result and the true value, the target loss being related to a difference between the first and second inter-frame information and a difference between the first video processing result and the true value.
In one possible implementation, the first video processing network and the second video processing network are used to implement video enhancement tasks.
In one possible implementation, the video enhancement task is a video de-noising task, a video de-fogging task, a super-resolution task, or a high dynamic range task.
In one possible implementation, the apparatus further comprises: a deblurring module, configured to perform deblurring processing on the first intermediate feature map and the second intermediate feature map respectively before the first intermediate feature map output and the second intermediate feature map output are processed respectively by a recurrent neural network, so as to obtain a deblurred first intermediate feature map and a deblurred second intermediate feature map;
the feature map processing module is configured to process the first intermediate feature map after the deblurring processing and the second intermediate feature map after the deblurring processing through a recurrent neural network, respectively.
In a third aspect, an embodiment of the present application provides a model training apparatus, which may include a memory, a processor, and a bus system, where the memory is used to store a program, and the processor is used to execute the program in the memory to perform any one of the methods described in the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, in which a computer program is stored, and when the computer program runs on a computer, the computer is caused to execute any one of the methods described in the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program comprising code that, when executed, is configured to implement any one of the optional methods of the first aspect.
In a sixth aspect, the present application provides a chip system, which includes a processor, configured to support an execution device or a training device in implementing the functions recited in the above aspects, for example, transmitting or processing the data and/or information recited in the above methods. In one possible design, the chip system further includes a memory for storing the program instructions and data necessary for the execution device or the training device. The chip system may consist of a chip, or may include a chip and other discrete devices.
An embodiment of the present application provides a model training method, including: acquiring a video sample, a first video processing network and a second video processing network, where the first video processing network is a teacher model and the second video processing network is a student model to be trained; processing the video sample through the first video processing network and the second video processing network to obtain a first intermediate feature map output of the first video processing network and a second intermediate feature map output of the second video processing network, respectively; processing the first intermediate feature map output and the second intermediate feature map output through a recurrent neural network to obtain first inter-frame information and second inter-frame information, respectively, where the first inter-frame information and the second inter-frame information are used to represent the feature change relationship between the image frames of the video sample; and determining a target loss according to the first inter-frame information and the second inter-frame information, and performing knowledge distillation on the second video processing network based on the target loss and the first video processing network to obtain a trained second video processing network, where the target loss is related to the difference between the first inter-frame information and the second inter-frame information. In this way, without changing the structure of the model, inter-frame information is added to the target loss used for knowledge distillation, so that the teacher model's stronger ability to recognize inter-frame information and to use it for video processing is transferred to the student model, which improves the video quality of the video obtained when the student model performs video processing after knowledge distillation.
Drawings
FIG. 1 is a schematic structural diagram of an artificial intelligence body framework;
fig. 2 is a schematic diagram of an application scenario provided in an embodiment of the present application;
fig. 3 is a schematic diagram of an application scenario provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of a convolutional neural network provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of a convolutional neural network provided in an embodiment of the present application;
FIG. 6 is a block diagram of a system according to an embodiment of the present disclosure;
fig. 7 is a structural schematic diagram of a chip provided in an embodiment of the present application;
FIG. 8 is a schematic diagram of a model training method provided in an embodiment of the present application;
fig. 9 is a schematic diagram of a video enhancement network provided in an embodiment of the present application;
fig. 10 is a schematic diagram of a super-resolution network provided in an embodiment of the present application;
FIG. 11 is a schematic diagram of an RNN according to an embodiment of the present application;
FIG. 12 is a schematic diagram of an RNN according to an embodiment of the present application;
FIG. 13 is a schematic diagram of an RNN according to an embodiment of the present application;
FIG. 14 is a schematic diagram of a model training method provided in an embodiment of the present application;
FIG. 15 is a schematic diagram of a model training method provided in an embodiment of the present application;
fig. 16 to fig. 19 are schematic diagrams illustrating an effect of a model training method provided in an embodiment of the present application;
FIG. 20 is a schematic diagram of a model training apparatus according to an embodiment of the present application;
fig. 21 is a schematic structural diagram of an execution device according to an embodiment of the present application;
fig. 22 is a schematic structural diagram of a training apparatus according to an embodiment of the present application.
Detailed Description
The embodiments of the present invention will be described below with reference to the drawings. The terminology used in the description of the embodiments of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Embodiments of the present application are described below with reference to the accompanying drawings. As can be known to those skilled in the art, with the development of technology and the emergence of new scenarios, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely descriptive of the various embodiments of the application and how objects of the same nature can be distinguished. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The general workflow of an artificial intelligence system will be described first. Referring to fig. 1, which shows a schematic structural diagram of an artificial intelligence main framework, the framework is explained below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects the series of processes from data acquisition onward, for example the general processes of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a refinement process of "data, information, knowledge, wisdom". The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure and information (technologies for providing and processing information) of artificial intelligence up to the industrial ecology of the system.
(1) Infrastructure
The infrastructure provides computing power support for the artificial intelligence system, realizes communication with the external world, and provides support through a base platform. It communicates with the outside through sensors; computing power is provided by intelligent chips, such as a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other hardware acceleration chips; the base platform includes related platform guarantees and support such as a distributed computing framework and networks, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside to acquire data, and the data is provided to intelligent chips in the distributed computing system provided by the base platform for computation.
(2) Data
Data at the upper level of the infrastructure is used to represent the data source for the field of artificial intelligence. The data relates to graphs, images, voice and texts, and also relates to the data of the Internet of things of traditional equipment, including service data of the existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
The machine learning and the deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Inference means a process of simulating an intelligent human inference mode in a computer or an intelligent system, using formalized information to think about and solve a problem by a machine according to an inference control strategy, and a typical function is searching and matching.
The decision-making refers to a process of making a decision after reasoning intelligent information, and generally provides functions of classification, sequencing, prediction and the like.
(4) General capabilities
After the above-mentioned data processing, further based on the result of the data processing, some general capabilities may be formed, such as algorithms or a general system, e.g. translation, analysis of text, computer vision processing, speech recognition, recognition of images, etc.
(5) Intelligent product and industrial application
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they are the packaging of the overall artificial intelligence solution, turning intelligent information decision-making into products and realizing practical applications. The application fields mainly include: intelligent terminals, intelligent transportation, intelligent healthcare, autonomous driving, smart cities, and so on.
The model training method provided by the embodiments of the present application can be applied to data processing methods such as data training, machine learning, and deep learning; it performs symbolic and formalized intelligent information modeling, extraction, preprocessing, and training on training data, and finally obtains a trained neural network model (such as the trained second video processing network in the embodiments of the present application). The trained second video processing network can then be used for model inference; specifically, a video can be input into the trained second video processing network to obtain a video processing result.
The trained second video processing network can be applied to intelligent vehicles for assisting driving and automatic driving, and can also be applied to the fields of needing to carry out video enhancement in the computer vision fields such as smart cities and intelligent terminals. For example, the technical solution of the present application can be applied to a video streaming scene and a video monitoring scene. A brief description of a video streaming scenario and a video surveillance scenario is provided below in conjunction with fig. 2 and 3, respectively.
Video streaming scenario:
for example, when a client using a smart terminal (e.g., in a cell phone, car, robot, tablet, desktop, smart watch, virtual reality VR, augmented reality AR device, etc.) plays a video, to reduce the bandwidth requirement of the video stream, the server may transmit a downsampled, lower resolution, low quality video stream over the network to the client. The client may then enhance the images in the low-quality video stream using the trained second video processing network. For example, the images in the video are subjected to super-resolution, noise reduction and other operations, and finally, high-quality images are presented to the user.
Video monitoring scene:
In the security field, constrained by adverse conditions such as the installation position of the surveillance camera and limited storage space, the image quality of some surveillance video is poor, which affects the accuracy with which people or recognition algorithms identify targets. Therefore, the trained second video processing network provided by the embodiments of the present application can be used to convert low-quality surveillance video into high-quality, high-definition video, effectively recovering a large amount of detail in the monitored images and providing more effective and richer information for subsequent target recognition tasks.
Since the embodiments of the present application relate to the application of a large number of neural networks, for the convenience of understanding, the related terms and related concepts such as neural networks related to the embodiments of the present application will be described below.
(1) Neural network
The neural network may be composed of neural units. A neural unit may be an operation unit that takes x_s (i.e., input data) and an intercept of 1 as inputs, and the output of the operation unit may be:

f( Σ_{s=1}^{n} W_s x_s + b )

where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, which is used to introduce a nonlinear characteristic into the neural network and convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by joining many such single neural units together, i.e., the output of one neural unit may be the input of another neural unit. The input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of that local receptive field, and the local receptive field may be a region composed of several neural units.
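As a small numerical illustration of the neural-unit formula above (the values and the choice of sigmoid are arbitrary):

```python
import torch

x = torch.tensor([0.5, -1.0, 2.0])   # inputs x_s
w = torch.tensor([0.3, 0.8, -0.1])   # weights W_s
b = torch.tensor(1.0)                # bias of the neural unit
output = torch.sigmoid(w @ x + b)    # f is the sigmoid activation function here
```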
(2) A Convolutional Neural Network (CNN) is a deep neural network with a convolutional structure. The convolutional neural network comprises a feature extractor consisting of convolutional layers and sub-sampling layers, which can be regarded as a filter. The convolutional layer is a neuron layer for performing convolutional processing on an input signal in a convolutional neural network. In convolutional layers of convolutional neural networks, one neuron may be connected to only a portion of the neighbor neurons. In a convolutional layer, there are usually several characteristic planes, and each characteristic plane may be composed of several neural units arranged in a rectangular shape. The neural units of the same feature plane share weights, where the shared weights are convolution kernels. Sharing weights may be understood as the way features are extracted is location independent. The convolution kernel may be formalized as a matrix of random size, and may be learned to obtain reasonable weights during the training of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing connections between layers of the convolutional neural network, while reducing the risk of overfitting.
CNN is a very common neural network, and the structure of CNN will be described in detail below with reference to fig. 4. As described in the introduction of the basic concept, the convolutional neural network is a deep neural network with a convolutional structure, and is a deep learning (deep learning) architecture, and the deep learning architecture refers to performing multiple levels of learning at different abstraction levels through a machine learning algorithm. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons can respond to images input thereto.
As shown in fig. 4, the Convolutional Neural Network (CNN)200 may include an input layer 210, a convolutional/pooling layer 220 (where the pooling layer is optional), and a fully connected layer 230.
Convolutional layer/pooling layer 220:
Convolutional layer:
the convolutional layer/pooling layer 220 shown in fig. 4 may include layers such as 221 and 226, for example: in one implementation, 221 is a convolutional layer, 222 is a pooling layer, 223 is a convolutional layer, 224 is a pooling layer, 225 is a convolutional layer, 226 is a pooling layer; in another implementation, 221, 222 are convolutional layers, 223 is a pooling layer, 224, 225 are convolutional layers, and 226 is a pooling layer. I.e., the output of a convolutional layer may be used as input to a subsequent pooling layer, or may be used as input to another convolutional layer to continue the convolution operation.
The inner working principle of a convolutional layer will be described below by taking convolutional layer 221 as an example.
Convolutional layer 221 may include many convolution operators, also called kernels, whose role in image processing is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. During the convolution operation on the image, the weight matrix is usually moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride), so as to complete the task of extracting specific features from the image. The size of the weight matrix should be related to the size of the image. Note that the depth dimension of the weight matrix is the same as the depth dimension of the input image, and the weight matrix extends to the entire depth of the input image during the convolution operation. Thus, convolving with a single weight matrix produces a convolved output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same size (rows × columns), i.e., multiple matrices of the same type, are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image, where the depth can be understood as being determined by the "multiple" described above. Different weight matrices may be used to extract different features from the image, e.g., one weight matrix to extract image edge information, another weight matrix to extract a particular color of the image, yet another weight matrix to blur unwanted noise in the image, and so on. The multiple weight matrices have the same size (rows × columns), so the feature maps extracted by them also have the same size, and the extracted feature maps of the same size are combined to form the output of the convolution operation.
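For instance, the following brief sketch (with arbitrary sizes) shows how applying 16 kernels of the same size to a 3-channel input stacks their outputs into a feature map whose depth dimension equals the number of kernels:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
image = torch.randn(1, 3, 224, 224)   # one RGB input image
feature_map = conv(image)             # 16 weight matrices (kernels) applied in parallel
print(feature_map.shape)              # torch.Size([1, 16, 224, 224])
```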
The weight values in these weight matrices need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 200 can make correct prediction.
When convolutional neural network 200 has multiple convolutional layers, the initial convolutional layers (e.g., 221) tend to extract more general features, which may also be referred to as low-level features; as the depth of convolutional neural network 200 increases, the later convolutional layers (e.g., 226) extract increasingly complex features, such as features with high-level semantics, and features with higher-level semantics are more applicable to the problem to be solved.
A pooling layer:
since it is often desirable to reduce the number of training parameters, it is often desirable to periodically introduce pooling layers after the convolutional layer, where the layers 221-226, as illustrated by 220 in fig. 4, may be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to smaller sized images. The average pooling operator may calculate pixel values in the image over a certain range to produce an average as a result of the average pooling. The max pooling operator may take the pixel with the largest value in a particular range as the result of the max pooling. In addition, just as the size of the weighting matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after the processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel point in the image output by the pooling layer represents an average value or a maximum value of a corresponding sub-region of the image input to the pooling layer.
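A brief sketch of average and max pooling reducing the spatial size of a feature map (sizes are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 224, 224)
avg = nn.AvgPool2d(kernel_size=2)(x)   # each output pixel is the average of a 2x2 region
mx = nn.MaxPool2d(kernel_size=2)(x)    # each output pixel is the maximum of a 2x2 region
print(avg.shape, mx.shape)             # both torch.Size([1, 16, 112, 112])
```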
Fully connected layer 230:
after processing by convolutional layer/pooling layer 220, convolutional neural network 200 is not sufficient to output the required output information. Because, as previously described, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. However, to generate the final output information (required class information or other relevant information), the convolutional neural network 200 needs to generate one or a set of the required number of classes of output using the fully-connected layer 230. Thus, multiple hidden layers (231, 232, to 23n as shown in fig. 4) may be included in the fully-connected layer 230, and parameters included in the multiple hidden layers may be pre-trained according to the associated training data of a specific task type, for example, the task type may include … … for image recognition, image classification, image super-resolution reconstruction, and so on
After the hidden layers of the fully-connected layer 230 comes the output layer 240, the last layer of the whole convolutional neural network 200. The output layer 240 has a loss function similar to categorical cross-entropy, specifically used for calculating the prediction error. Once the forward propagation of the whole convolutional neural network 200 (propagation in the direction from 210 to 240 in fig. 4) is completed, backward propagation (propagation in the direction from 240 to 210 in fig. 4) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200, i.e., the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
It should be noted that the convolutional neural network 200 shown in fig. 4 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models, for example, only includes a part of the network structure shown in fig. 4, for example, the convolutional neural network employed in the embodiment of the present application may only include the input layer 210, the convolutional layer/pooling layer 220, and the output layer 240.
It should be noted that the convolutional neural network 200 shown in fig. 4 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models; for example, as shown in fig. 5, multiple convolutional layers/pooling layers are arranged in parallel, and the features they extract are all input to the fully-connected layer 230 for processing.
(3) Deep neural network
Deep Neural Networks (DNNs), also known as multi-layer neural networks, can be understood as neural networks having many hidden layers, where "many" has no particular threshold. Dividing a DNN by the position of its layers, the neural network inside a DNN can be divided into three categories: input layer, hidden layers, and output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers. The layers are fully connected, that is, any neuron of the i-th layer is necessarily connected with any neuron of the (i+1)-th layer. Although a DNN appears complex, the work of each layer is actually not complex; it is simply the following linear relational expression:

y = α(W·x + b)

where x is the input vector, y is the output vector, b is an offset vector, W is a weight matrix (also called coefficients), and α() is an activation function. Each layer simply performs this operation on the input vector x to obtain the output vector y. Because a DNN has many layers, the number of coefficient matrices W and offset vectors b is large. These parameters are defined in the DNN as follows, taking the coefficient W as an example: assume that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as W^3_{24}, where the superscript 3 denotes the layer in which the coefficient W is located, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer. In summary: the coefficient from the k-th neuron of layer L-1 to the j-th neuron of layer L is defined as W^L_{jk}. Note that the input layer has no W parameters. In deep neural networks, more hidden layers make the network better able to model complex situations in the real world. In theory, a model with more parameters has higher complexity and larger "capacity", which means that it can accomplish more complex learning tasks. Training the deep neural network is the process of learning the weight matrices, and its final goal is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors W of many layers).
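As a minimal sketch of the per-layer operation y = α(W·x + b) described above, with arbitrary layer sizes and ReLU assumed as the activation function:

```python
import torch
import torch.nn as nn

layer = nn.Linear(in_features=4, out_features=2)   # holds the weight matrix W and offset vector b
x = torch.randn(1, 4)
y = torch.relu(layer(x))                           # α is ReLU in this example
# layer.weight[j, k] plays the role of the coefficient from input neuron k to
# output neuron j, analogous to W^L_jk in the text above.
```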
(4) Loss function
In the process of training a deep neural network, because the output of the deep neural network is expected to be as close as possible to the value that is actually desired, the weight vector of each layer of the neural network can be updated according to the difference between the predicted value of the current network and the actually desired target value (of course, there is usually an initialization process before the first update, i.e., parameters are preconfigured for each layer of the deep neural network). For example, if the network's predicted value is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the deep neural network can predict the actually desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the purpose of the loss function or objective function, which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes the process of reducing this loss as much as possible.
(5) Back propagation algorithm
A convolutional neural network can use the error back propagation (BP) algorithm to adjust the values of the parameters in the initial super-resolution model during training, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, the input signal is passed forward until the output produces an error loss, and the parameters in the initial super-resolution model are updated by back-propagating the error-loss information, so that the error loss converges. The back propagation algorithm is a backward pass dominated by the error loss, and aims to obtain the optimal parameters of the super-resolution model, such as the weight matrices.
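The forward-pass / back-propagation / parameter-update cycle can be sketched with a generic gradient-descent training step; the small network, loss and learning rate below are assumptions made only for illustration and are not the super-resolution model of the embodiments:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

x, y = torch.randn(32, 8), torch.randn(32, 1)
pred = model(x)            # forward: the input signal is passed until the output
loss = loss_fn(pred, y)    # error loss between prediction and target
loss.backward()            # back-propagate the error-loss information
optimizer.step()           # update parameters so that the error loss converges
optimizer.zero_grad()
```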
(6) Recurrent Neural Networks (RNNs) are used to process sequence data. In a traditional neural network model, from the input layer to the hidden layer to the output layer, the layers are fully connected, while the nodes within each layer are unconnected. Although such ordinary neural networks solve many problems, they remain powerless for many others. For example, to predict the next word in a sentence, the previous words are generally needed, because the words in a sentence are not independent of one another. An RNN is called a recurrent neural network because the current output of a sequence is also related to the previous outputs. Concretely, the network memorizes the previous information and applies it to the calculation of the current output; that is, the nodes of the hidden layer are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment. In theory, an RNN can process sequence data of any length. Training an RNN is the same as training a conventional CNN or DNN: the error back propagation algorithm is also used, but with one difference: if the RNN is unrolled, the parameters in it, such as W, are shared, which is not the case for the conventional neural networks exemplified above. Moreover, when using the gradient descent algorithm, the output of each step depends not only on the network of the current step but also on the states of the networks of the previous steps. This learning algorithm is referred to as Back Propagation Through Time (BPTT).
Why is a recurrent neural network needed when convolutional neural networks already exist? The reason is simple: a convolutional neural network relies on the precondition that its elements are independent of each other, as are its inputs and outputs, such as images of cats and dogs. In the real world, however, many elements are interconnected, such as stock prices changing over time; or, for example, a person says: "I like travelling, and my favourite place is Yunnan; in the future, when I have the chance, I will go to ___." Here, to fill in the blank, humans all know to fill in "Yunnan", because humans infer from the context. But how can a machine do this? The RNN was created for this purpose: it aims to give machines a memory like that of humans. Therefore, the output of an RNN needs to depend on the current input information and the historical memory information.
(7) Pixel value
The pixel value of an image may be a red-green-blue (RGB) color value, and the pixel value may be a long integer representing a color. For example, the pixel value is 256×Red + 100×Green + 76×Blue, where Blue represents the blue component, Green represents the green component, and Red represents the red component. For each color component, a smaller value corresponds to lower luminance and a larger value to higher luminance. For a grayscale image, the pixel value may be a grayscale value.
(8) Super-resolution
Super resolution (SR) is an image enhancement technique: given one or a group of low-resolution images, the high-frequency detail information of the images is restored by means such as learning prior knowledge about images, exploiting the self-similarity of the images, and complementing information across multiple frames, so as to generate a target image with higher resolution. In terms of application, super-resolution can be divided into single-frame image super-resolution and video super-resolution according to the number of input images. Super-resolution has important application value in fields such as high-definition television, surveillance equipment, satellite imagery, and medical imaging.
(9) Video super-resolution
Video Super Resolution (VSR) is an enhancement technique in video processing that aims to convert low-resolution video into high-quality, high-resolution video. Video super-resolution can be divided into multi-frame video super-resolution and recurrent video super-resolution according to the number of input frames.
The image processing method provided by this application can be applied to live video streaming, video calls, album management, smart cities, human-computer interaction, and other scenarios that involve video data.
(10) Noise reduction
Images are often affected by the imaging device and the external environment during digitization and transmission, resulting in images that contain noise. The process of reducing the noise in an image is referred to as image noise reduction, and is sometimes also called image denoising.
(11) Image features
The image features mainly include color features, texture features, shape features, spatial relationship features and the like of the image.
The color feature is a global feature describing surface properties of a scene corresponding to an image or an image area; the general color features are based on the characteristics of the pixel points, and all pixels belonging to the image or the image area have respective contributions. Since color is not sensitive to changes in the orientation, size, etc. of an image or image region, color features do not capture local features of objects in an image well.
Texture features are also global features that also describe the surface properties of the scene corresponding to the image or image area; however, since texture is only a characteristic of the surface of an object and does not completely reflect the essential attributes of the object, high-level image content cannot be obtained by using texture features alone. Unlike color features, texture features are not based on the characteristics of the pixel points, which requires statistical calculations in regions containing multiple pixel points.
The shape features are represented in two types, one is outline features, the other is region features, the outline features of the image mainly aim at the outer boundary of the object, and the region features of the image are related to the whole shape region.
The spatial relationship feature refers to the mutual spatial positions or relative directional relationships among multiple targets segmented from an image; these relationships can be divided into connection/adjacency relationships, overlap/occlusion relationships, inclusion/containment relationships, and the like. In general, spatial position information can be divided into two categories: relative spatial position information and absolute spatial position information. The former emphasizes the relative arrangement between targets, such as above, below, left and right relations, while the latter emphasizes the distance and orientation between targets.
It should be noted that the above listed image features can be taken as some examples of features in the image, and the image can also have other features, such as features of higher levels: semantic features, which are not expanded here.
(12) Image/video enhancement
Image/video enhancement refers to actions on images/video that can improve the imaging quality. For example, enhancement processing includes super-resolution, noise reduction, sharpening, or demosaicing, among others.
The system architecture provided by the embodiment of the present application is described in detail below with reference to fig. 6. Fig. 6 is a schematic diagram of a system architecture according to an embodiment of the present application. As shown in FIG. 6, the system architecture 500 includes an execution device 510, a training device 520, a database 530, a client device 540, a data storage system 550, and a data collection system 560.
The execution device 510 includes a computation module 511, an I/O interface 512, a pre-processing module 513, and a pre-processing module 514. The target model/rule 501 may be included in the calculation module 511, with the pre-processing module 513 and the pre-processing module 514 being optional.
The data acquisition device 560 is used to acquire training data. The training data in this embodiment of the application includes video samples and supervision videos (also called ground truths). A video sample may be a low-quality video, and the supervision video is a corresponding high-quality video acquired in advance before model training. For example, the video sample may be a low-resolution video and the supervision video a high-resolution video; alternatively, the video sample may be a video containing fog or noise and the supervision video the video with the fog or noise removed. After the training data is collected, the data acquisition device 560 stores the training data in the database 530, and the training device 520 trains the target model/rule 501 based on the training data maintained in the database 530.
In this embodiment, the training device 520 performs knowledge distillation on the student model (e.g., the second video processing model in this embodiment) based on the training data maintained in the database 530 and the teacher model (e.g., the first video processing model in this embodiment) to obtain the target model/rule 501 (e.g., the trained second video processing model in this embodiment).
The target model/rule 501 can be used to implement a video enhancement task, that is, a video to be processed is input into the target model/rule 501, and a processed enhanced video can be obtained. It should be noted that, in practical applications, the training data maintained in the database 530 may not necessarily all come from the collection of the data collection device 560, and may also be received from other devices. It should be noted that, the training device 520 does not necessarily perform the training of the target model/rule 501 based on the training data maintained by the database 530, and may also obtain the training data from the cloud or other places to perform the model training, and the above description should not be taken as a limitation to the embodiments of the present application.
The target model/rule 501 obtained through training by the training device 520 may be applied to different systems or devices, for example, the execution device 510 shown in fig. 6. The execution device 510 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (AR)/virtual reality (VR) device or a vehicle-mounted terminal, or may be a server or a cloud. In fig. 6, the execution device 510 is configured with an input/output (I/O) interface 512 for data interaction with external devices, and a user may input data to the I/O interface 512 through a client device 540, where the input data may include the video to be processed input by the client device.
The pre-processing module 513 and the pre-processing module 514 are configured to perform pre-processing according to input data (e.g., video to be processed) received by the I/O interface 512. It should be understood that there may be no pre-processing module 513 and pre-processing module 514 or only one pre-processing module. When the pre-processing module 513 and the pre-processing module 514 are not present, the input data may be processed directly using the calculation module 511.
During the process of preprocessing the input data by the execution device 510 or performing the calculation and other related processes by the calculation module 511 of the execution device 510, the execution device 510 may call the data, codes and the like in the data storage system 550 for corresponding processes, or store the data, instructions and the like obtained by corresponding processes in the data storage system 550.
Finally, the I/O interface 512 presents the processing results, such as the enhanced video resulting from the processing, to the client device 540 for presentation to the user.
It is worth noting that the training device 520 may generate corresponding target models/rules 501 for different targets or different tasks based on different training data, and the corresponding target models/rules 501 may be used to implement the video enhancement task, so as to provide the user with the required results.
In the case shown in fig. 6, the user may manually give the input data (the input data may be a video to be processed), and this manual input may be performed through an interface provided by the I/O interface 512. Alternatively, the client device 540 may automatically send the input data to the I/O interface 512; if the client device 540 is required to obtain the user's authorization before automatically sending the input data, the user may set the corresponding permission in the client device 540. The user can view the result output by the execution device 510 at the client device 540, and the specific presentation form may be display, sound, action, and the like. The client device 540 may also serve as a data collection terminal, collecting the input data of the input I/O interface 512 and the output result of the output I/O interface 512 as new sample data, as shown in the figure, and storing the new sample data in the database 530. Of course, the input data input to the I/O interface 512 and the output result output from the I/O interface 512, as shown in the figure, may also be stored directly into the database 530 as new sample data by the I/O interface 512 without being collected by the client device 540.
It should be noted that fig. 6 is only a schematic diagram of a system architecture provided in the embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 6, the data storage system 550 is an external memory with respect to the execution device 510, and in other cases, the data storage system 550 may be disposed in the execution device 510.
A hardware structure of a chip provided in an embodiment of the present application is described below.
Fig. 7 is a hardware structure diagram of a chip provided in an embodiment of the present application, where the chip includes a neural network processor 700. The chip may be disposed in the execution device 510 as shown in fig. 6 to complete the calculation work of the calculation module 511. The chip may also be disposed in a training apparatus 520 as shown in fig. 6 to complete the training work of the training apparatus 520 and output the target model/rule 501. The algorithms for the various layers of the video processing network shown in fig. 6 may be implemented in a chip as shown in fig. 7.
A neural network processor (NPU) 700 is mounted as a coprocessor on a host central processing unit (host CPU), and the host CPU allocates tasks. The core part of the NPU is the arithmetic circuit 703; the controller 704 controls the arithmetic circuit 703 to extract data from a memory (the weight memory 702 or the input memory 701) and perform operations.
In some implementations, the arithmetic circuit 703 includes a plurality of processing units (PEs) therein. In some implementations, the operational circuit 703 is a two-dimensional systolic array. The arithmetic circuit 703 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry 703 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit 703 fetches the data corresponding to the matrix B from the weight memory 702 and buffers it in each PE in the arithmetic circuit 703. The arithmetic circuit 703 takes the matrix a data from the input memory 701 and performs matrix arithmetic with the matrix B, and stores a partial result or a final result of the matrix in an accumulator (accumulator) 708.
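As a software analogue (an assumption made only for illustration, not a description of the circuit itself), the computation performed on matrices A and B with partial results collected in an accumulator can be sketched as:

```python
import numpy as np

def matmul_accumulate(A, B):
    # C[i, j] accumulates the partial products A[i, p] * B[p, j],
    # mirroring how partial results are collected in an accumulator.
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

A, B = np.random.rand(2, 3), np.random.rand(3, 4)
assert np.allclose(matmul_accumulate(A, B), A @ B)
```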
The vector calculation unit 707 may further process the output of the operation circuit 703, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, the vector calculation unit 707 may be used for network calculations of non-convolution/non-FC layers in a neural network, such as pooling (pooling), batch normalization (batch normalization), local response normalization (local response normalization), and the like.
In some implementations, the vector calculation unit 707 can store the processed output vector to the unified memory 706. For example, the vector calculation unit 707 may apply a non-linear function to the output of the arithmetic circuit 703, such as a vector of accumulated values, to generate the activation value. In some implementations, the vector calculation unit 707 generates normalized values, combined values, or both. In some implementations, the vector of processed outputs can be used as activation inputs to the arithmetic circuitry 703, for example for use in subsequent layers in a neural network.
The unified memory 706 is used to store input data as well as output data.
A direct memory access controller (DMAC) 705 is used to transfer input data in the external memory to the input memory 701 and/or the unified memory 706, to store weight data from the external memory into the weight memory 702, and to store data in the unified memory 706 into the external memory.
A Bus Interface Unit (BIU) 710, configured to implement interaction between the main CPU, the DMAC, and the instruction fetch memory 709 through a bus.
An instruction fetch buffer 709 connected to the controller 704 for storing instructions used by the controller 704.
The controller 704 is configured to call the instruction cached in the instruction fetch memory 709, so as to control the working process of the operation accelerator.
Generally, the unified memory 706, the input memory 701, the weight memory 702, and the instruction fetch memory 709 are all on-chip memories, the external memory is a memory outside the NPU, and the external memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a High Bandwidth Memory (HBM), or other readable and writable memories.
Referring to fig. 8, fig. 8 is a schematic diagram of an embodiment of a model training method provided in an embodiment of the present application, and as shown in fig. 8, the model training method provided in the embodiment of the present application includes:
801. Obtaining a video sample, a first video processing network and a second video processing network, where the first video processing network is a teacher model and the second video processing network is a student model to be trained.
In an embodiment of the present application, a video sample may include a plurality of image frames.
In the embodiment of the present application, the first video processing network and the second video processing network may be configured to implement a video enhancement task, and the video enhancement task may be understood as a task for enhancing the quality of a video, for example, the video enhancement task may be a video denoising task, a video defogging task, a super-resolution task, or a high dynamic range task, and is not limited herein.
It should be understood that the first video processing network and the second video processing network are models for implementing the same video processing task, and the application does not limit the specific type of video processing task. Taking the video enhancement task as the super-resolution task as an example, an illustration of the network structure of the first video processing network and the second video processing network is described next.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a video processing network. As shown in fig. 9, the image to be processed may be a low-resolution (LR) image. The low-resolution image frame may first be processed by a feature extraction module to obtain image features, and the resulting feature map may then be processed by a plurality of basic units. A basic unit may be a network structure obtained by connecting basic modules through basic operations of a neural network; the network structure may include preset basic operations or combinations of basic operations in a convolutional neural network, and these basic operations or combinations of basic operations may be collectively referred to as basic operations. For example, a basic operation may be a convolution operation, a pooling operation, a residual connection, and the like, and the basic operations may be used to connect the basic modules so as to obtain the network structure of the basic unit. The nonlinear transformation part is used to transform the image features of the input image and map them to a high-dimensional feature space; under normal circumstances, a super-resolved image is easier to reconstruct in the mapped high-dimensional space. The reconstruction part is used to perform up-sampling and convolution processing on the image features output by the nonlinear transformation part to obtain a super-resolution (high-resolution, HR) image corresponding to the input image.
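A hedged sketch of such a feature-extraction / nonlinear-transformation / reconstruction structure is given below; the channel counts, the number of residual blocks and the PixelShuffle up-sampling are illustrative assumptions rather than the exact network of fig. 9:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)        # basic unit: convolutions joined by a residual connection

class SimpleSRNet(nn.Module):
    def __init__(self, channels=64, num_blocks=4, scale=4):
        super().__init__()
        self.extract = nn.Conv2d(3, channels, 3, padding=1)            # feature extraction module
        self.transform = nn.Sequential(                                 # nonlinear transformation part
            *[ResidualBlock(channels) for _ in range(num_blocks)])
        self.reconstruct = nn.Sequential(                               # reconstruction part: conv + up-sampling
            nn.Conv2d(channels, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale))

    def forward(self, lr):             # lr: (B, 3, H, W) low-resolution frame
        return self.reconstruct(self.transform(self.extract(lr)))

hr = SimpleSRNet()(torch.randn(1, 3, 32, 32))   # -> torch.Size([1, 3, 128, 128])
```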
Taking a super-resolution task model as an example, as shown in fig. 10, a low-resolution video (including a plurality of low-resolution image frames) is input, and feature extraction may then be performed on the low-resolution image frames to obtain a low-scale feature map. Here, a low-scale feature map refers to a feature map containing more low-frequency information or, in other words, less texture detail information. In the super-resolution task model, the low-scale feature map may have a pyramid structure obtained by convolutional layers with a stride of 2, and each layer of the pyramid extracts features through a plurality of residual blocks. The super-resolution task model pre-deblurs the input image frames by using the pyramid structure.
And then, aligning and/or fusing the low-scale characteristic map to obtain the denoised low-scale characteristic map.
Specifically, deformable convolution (deformable conv) can be applied to the low-scale feature maps to align the images, which effectively avoids the need to explicitly or implicitly calculate/estimate the optical flow of the images as in conventional alignment methods. The input low-scale feature map can be convolved by convolutional layers with a stride of 2 to obtain an L-layer pyramid. For the reference frame t and any adjacent frame t + i, a similar operation is performed on each layer of the pyramid: the two feature maps are concatenated and convolved to obtain the operation parameters of the deformable convolution (called offsets in the super-resolution task model), the feature map at time t + i is input into the deformable conv, and a new feature map for time t + i is output by the deformable conv. In addition, the offsets of the lower pyramid layer are used as input to the offset conv of the upper layer, which is used to estimate the offsets more accurately, and the feature map output by the deformable conv is also up-sampled and then fused with the corresponding features of the upper layer. At the first layer of the pyramid, the feature map output by the deformable conv and fused with the bottom layer is concatenated with the feature map of the reference frame and used as the input for the offsets of a new deformable conv, so that the final aligned feature map at time t + i can be predicted.
In addition, different image frames may exhibit different degrees of blur due to unavoidable factors such as hand shake or object motion, so different adjacent frames contribute differently to enhancing the reference frame. Traditional methods generally treat them as equally important, which is not the case. Therefore, the super-resolution task model introduces an attention mechanism in the fusion process, giving different feature maps different weights in the spatial and temporal dimensions.
Specifically, based on the aligned feature maps, the reference frame and the adjacent frames are passed through further convolutional layers to extract features again (with parameters shared across the adjacent frames), and the similarity between an adjacent frame and the reference frame is calculated and defined as the temporal attention map at that moment. This operation is performed between the feature map at each moment (including the reference frame itself) and the feature map of the reference frame, so that a temporal attention map is obtained for every moment; multiplying the temporal attention map with the aligned feature map along the spatial dimensions is equivalent to adjusting the proportion that the feature maps at different moments contribute to the restoration/enhancement task. All feature maps are then convolved, i.e., a feature fusion operation is performed. Furthermore, a spatial attention map is obtained through a pyramid structure, and a new feature map is obtained after up-sampling.
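A greatly simplified sketch of such attention-weighted fusion of aligned per-frame feature maps is shown below; the deformable-convolution alignment and the pyramid-based spatial attention are omitted, and all module names and shapes are assumptions:

```python
import torch
import torch.nn as nn

class TemporalAttentionFusion(nn.Module):
    def __init__(self, channels, num_frames):
        super().__init__()
        self.embed_ref = nn.Conv2d(channels, channels, 3, padding=1)
        self.embed_nbr = nn.Conv2d(channels, channels, 3, padding=1)   # parameters shared across adjacent frames
        self.fuse = nn.Conv2d(channels * num_frames, channels, 1)      # convolutional feature fusion

    def forward(self, aligned):                    # aligned: (B, T, C, H, W), reference frame at index T // 2
        b, t, c, h, w = aligned.shape
        ref = self.embed_ref(aligned[:, t // 2])
        weighted = []
        for i in range(t):
            nbr = self.embed_nbr(aligned[:, i])
            # temporal attention: per-pixel similarity between frame i and the reference frame
            attn = torch.sigmoid((nbr * ref).sum(dim=1, keepdim=True))
            weighted.append(aligned[:, i] * attn)  # adjust each moment's contribution
        return self.fuse(torch.cat(weighted, dim=1))

fused = TemporalAttentionFusion(16, 5)(torch.randn(2, 5, 16, 32, 32))   # -> (2, 16, 32, 32)
```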
After obtaining the new low-scale feature map, the high-scale feature map may be obtained through a reconstruction process (for example, a reconstruction may be performed through several residual blocks), and finally, the final high-resolution image frame may be obtained through a convolution operation. The high-scale feature map before convolution change may be a multi-channel feature map, and the result after the convolution operation may represent a high-resolution image frame, for example, the result after the convolution operation may be a three-channel image (e.g., an RGB image).
In the embodiment of the application, a first video processing network and a second video processing network are used for realizing a video enhancement task, the first video processing network is a teacher model, the second video processing network is a student model, and the first video processing network is used as the teacher model to perform knowledge distillation on the second video processing network.
The teacher model may also be referred to as a teacher model, a guidance model, and the like, and is not limited herein.
In knowledge distillation, another simple network (the second video processing network) can be trained by using a pre-trained complex network (the first video processing network), so that the simple network (the second video processing network) can have the same or similar data processing capability as the complex network (the first video processing network). Knowledge distillation is to transfer the "knowledge" of a trained complex network to a network with a simpler structure. Wherein the simple network may have a smaller number of parameters than a complex network.
It should be noted that the same or similar data processing capability is understood that the processed results of the student model and the teacher model after the knowledge distillation are the same or similar when the same data to be processed is processed.
In knowledge distillation, a loss needs to be constructed based on the output of the teacher model and the output of the student model. The model output used to construct the loss may be the output of the model's output layer, an intermediate feature map output by an intermediate network layer, or a result obtained by processing the output of the output layer and/or the intermediate feature map output by the intermediate network layer. In a conventional implementation, the model output used to construct the loss is spatial information representing the feature distribution of the feature map, obtained by computing statistics over the intermediate output of the intermediate network layer for each image frame in the video. However, in a video enhancement scenario, the spatial information of an image frame can only represent the feature distribution of that frame's feature map and does not carry inter-frame information. Here, the inter-frame information may be the continuity and change information between frames: a region that is static across frames corresponds to continuity information, and an object that moves between frames corresponds to change information.
The teacher model has a large number of parameters and strong data processing capability, and can handle continuity and change information between frames well; that is, the teacher model can better recognize inter-frame information and use it for video enhancement, so the video quality of the enhanced video is high. If the loss is related only to the spatial information of each image frame, the student model cannot learn the teacher model's ability to process inter-frame information, and the video enhancement effect of the distilled student model is limited.
In the embodiment of the present application, when constructing the target loss for knowledge distillation, inter-frame information is also considered, and how to acquire the inter-frame information and how to construct the target loss based on the inter-frame information are described in detail below.
802. Processing the video samples through the first video processing network and the second video processing network to obtain a first intermediate feature map output of the first video processing network and a second intermediate feature map output of the second video processing network, respectively.
In the embodiment of the application, in the knowledge distillation process, a teacher model and a student model need to process a video sample, that is, a feed-forward process of the models is performed, after the video sample is processed by the first video processing network and the second video processing network, an enhanced video can be obtained, and in addition, a first intermediate feature map output of the first video processing network and a second intermediate feature map output of the second video processing network can be obtained. The first intermediate feature map output of the first video processing network and the second intermediate feature map output of the second video processing network are described next:
in an embodiment of the present application, the first intermediate feature map output may be a feature map output of an intermediate network layer when the first video processing network processes the video sample, the second intermediate feature map output may be a feature map output of an intermediate network layer when the second video processing network processes the video sample, and a location of the network layer that outputs the first intermediate feature map output in the first video processing network is the same as a location of the network layer that outputs the second intermediate feature map output in the second video processing network.
The intermediate network layer may be a network layer in the first video processing network and the second video processing network for outputting the feature map, and as long as the output feature map can carry the image features of the image frame, the embodiment of the present application does not limit the positions of the intermediate network layer in the first video processing network and the second video processing network and the type of the network layer.
Taking the video enhancement task as a super-resolution task as an example, the first intermediate feature map output and the second intermediate feature map output may be obtained by performing feature extraction on the video sample, or may be obtained by further processing the feature maps obtained by feature extraction, for example, the deblurred first intermediate feature map and the deblurred second intermediate feature map obtained by deblurring the first intermediate feature map output and the second intermediate feature map output obtained by feature extraction. Taking the super-resolution task model as an example, the first intermediate feature map output and the second intermediate feature map output may be the low-scale feature maps obtained by the alignment and/or fusion operations, or the high-scale feature maps obtained by reconstruction.
803. And respectively processing the first intermediate feature map output and the second intermediate feature map output to respectively obtain first inter-frame information and second inter-frame information, wherein the first inter-frame information and the second inter-frame information are used for representing the feature change relationship between the image frames of the video sample.
In one implementation, the first intermediate feature map output and the second intermediate feature map output may be processed by a recurrent neural network, so as to obtain first inter-frame information and second inter-frame information, where the first inter-frame information and the second inter-frame information are used to represent a feature change relationship between image frames of the video sample.
It should be understood that the first intermediate feature map output and the second intermediate feature map output may be processed by other network or function mapping capable of determining inter-frame information between image frames in a video, and is not limited herein.
An RNN is called a recurrent neural network because the current output of a sequence is also related to the previous outputs. Concretely, the network memorizes the previous information and applies it to the calculation of the current output; that is, the nodes of the hidden layer are connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment. In theory, an RNN can process sequence data of any length. Training an RNN is the same as training a conventional CNN or DNN: the error back propagation algorithm is also used, but with one difference: if the RNN is unrolled, the parameters in it, such as W, are shared. Moreover, when using the gradient descent algorithm, the output of each step depends not only on the network of the current step but also on the states of the networks of the previous steps. This learning algorithm is called back propagation through time (BPTT).
FIG. 11 is a schematic diagram of an RNN, in which each circle can be viewed as a unit and each unit does the same thing, so the diagram can be folded into the form shown in the left half. An RNN is a sequence-to-sequence model. In FIG. 12, $X_t$ denotes the input at time t, $o_t$ denotes the output at time t, and $S_t$ denotes the memory at time t; U is the weight matrix from the input layer, W is the recurrent weight matrix, and V is the weight matrix to the output layer. The output at the current time is determined by the memory and the input at the current time, where $S_t = f(U \cdot X_t + W \cdot S_{t-1})$ and $f(\cdot)$ is an activation function in the neural network.
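A minimal sketch of this recurrence (assuming tanh as the activation f) is:

```python
import numpy as np

def rnn_forward(xs, U, W, V, s0):
    # S_t = f(U @ X_t + W @ S_{t-1}), O_t = V @ S_t; f is taken as tanh here.
    S, outputs = s0, []
    for x in xs:                       # xs: per-time-step input vectors
        S = np.tanh(U @ x + W @ S)     # current memory depends on the current input and the previous memory
        outputs.append(V @ S)
    return outputs, S

U, W, V = np.zeros((8, 4)), np.eye(8), np.zeros((2, 8))
outs, last_state = rnn_forward([np.ones(4)] * 3, U, W, V, np.zeros(8))
```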
In this embodiment, the video sample includes multiple frames of images, the first intermediate feature map output includes a first sub-intermediate feature map corresponding to each frame of image in the multiple frames, and the second intermediate feature map output includes a second sub-intermediate feature map corresponding to each frame of image in the multiple frames. Each first sub-intermediate feature map and each second sub-intermediate feature map are processed by a recurrent neural network to obtain the first inter-frame information and the second inter-frame information. Because the recurrent neural network memorizes previous information and applies it to the calculation of the current output when processing sequence data, the first inter-frame information and the second inter-frame information can represent the feature change relationship between the image frames of the video sample. Specifically, the feature change relationship may refer to the continuity and change information between frames, where the continuity information is the relationship between regions that are static across frames and the change information is the relationship between objects that move between frames.
It should be understood that the interframe information in the embodiment of the present application may also be referred to as timing information (Temporal Context).
In one implementation, the first inter-frame information and the second inter-frame information may be hidden states output by the recurrent neural network. A hidden state may be the output of a hidden layer in the RNN. When processing the intermediate feature map outputs of a plurality of image frames in a video, the RNN obtains a plurality of hidden states (each hidden state corresponding to one image frame), and all or some of the hidden states obtained by the RNN when processing the intermediate feature map output may be taken as the inter-frame information.
Generally, the hidden layer outputs that the RNN obtains when processing later image frames carry more inter-frame information, so in order to reduce the amount of calculation, the hidden states obtained by the RNN when processing the intermediate feature maps corresponding to the later image frames in the multi-frame images can be selected.
Specifically, the video sample includes multiple frames of images, the first intermediate feature map output includes a first sub-intermediate feature map corresponding to each frame of image in the multiple frames of images, the second intermediate feature map output includes a second sub-intermediate feature map corresponding to each frame of image in the multiple frames of images, the first inter-frame information includes M hidden states obtained by processing a first sub-intermediate feature map corresponding to a later M-frame of images in the multiple frames of images by the recurrent neural network, and the second inter-frame information is M hidden states obtained by processing a second sub-intermediate feature map corresponding to the later M-frame of images in the multiple frames of images by the recurrent neural network. It should be understood that M here can be chosen flexibly, and the application is not limited thereto.
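A hedged sketch of selecting the hidden states of the later M frames from a recurrent network run over per-frame intermediate feature maps follows; pooling each feature map to a vector and the use of torch.nn.LSTM are assumptions made only for illustration:

```python
import torch
import torch.nn as nn

def interframe_info(feature_maps, rnn, M=2):
    # feature_maps: (B, T, C, H, W) sub-intermediate feature maps, one per frame
    pooled = feature_maps.mean(dim=(3, 4))     # (B, T, C): global average pooling (an assumption)
    hidden_seq, _ = rnn(pooled)                # (B, T, hidden): one hidden state per frame
    return hidden_seq[:, -M:]                  # hidden states of the later M frames

rnn = nn.LSTM(input_size=64, hidden_size=64, batch_first=True)
info = interframe_info(torch.randn(2, 5, 64, 8, 8), rnn, M=2)   # -> (2, 2, 64)
```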
Taking the recurrent neural network being a long short-term memory (LSTM) network as an example, the first inter-frame information and the second inter-frame information may be the cell states output by the LSTM.
Next, LSTM in the embodiment of the present application is described.
Referring to fig. 13, fig. 13 is a schematic structural diagram of an LSTM. Each image frame in a video sample can be sequentially input into the LSTM unit shown in fig. 13; the LSTM processes each image frame to obtain a cell state and a hidden state, and passes them on to the processing of the adjacent next image frame. As shown in fig. 13, Ct-1 is the cell state output when the LSTM processes the intermediate feature map of the previous image frame. The cell state is passed along like a conveyor belt: it runs along the whole chain of the LSTM with only small linear operations applied to it. The LSTM has the ability to delete information from or add information to the cell state, which is provided by structures called gates. A gate is a way of optionally letting information through. Illustratively, a gate consists of a Sigmoid neural network layer and a pointwise multiplication; the Sigmoid layer outputs a number between 0 and 1 that describes how much of each component is let through, where 0 means letting nothing through and 1 means letting everything through. An LSTM may have three gates to protect and control the cell state.
The first step of the LSTM is to decide what information to discard from the cell state. This decision is made by a Sigmoid layer called the "forget gate": it looks at Ht-1 (the previous hidden state) and Ft (the current input) and outputs, for each number in the cell state Ct-1 (the previous state), a number between 0 and 1, where 1 represents complete retention and 0 represents complete deletion. The next step is to decide what new information to store in the cell state. Specifically, a Sigmoid layer in the input gate decides which values need to be updated, and a tanh layer creates a candidate vector $\tilde{C}_t$ that will be added to the cell state. The two are then combined to create the update: the previous state value is multiplied by $f_t$ to express the part expected to be forgotten, and the result is added to $i_t \odot \tilde{C}_t$ to obtain the cell state $C_t$ of the current image frame. Finally, it is necessary to decide what to output as the hidden state; the output is based on the cell state, and a Sigmoid layer decides which parts of the cell state are to be output. The cell state is then passed through tanh (normalizing the values to between -1 and 1) and multiplied by the output of the Sigmoid gate. The specific process can refer to the following formulas:
$f_t = \sigma(W_f \cdot [H_{t-1}, F_t] + b_f)$
$i_t = \sigma(W_i \cdot [H_{t-1}, F_t] + b_i)$
$\tilde{C}_t = \tanh(W_C \cdot [H_{t-1}, F_t] + b_C)$
$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
$o_t = \sigma(W_o \cdot [H_{t-1}, F_t] + b_o), \quad H_t = o_t \odot \tanh(C_t)$
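The gate computations above can be sketched directly (a plain NumPy illustration of the formulas; the weight shapes are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(F_t, H_prev, C_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    z = np.concatenate([H_prev, F_t])          # [H_{t-1}, F_t]
    f_t = sigmoid(Wf @ z + bf)                 # forget gate
    i_t = sigmoid(Wi @ z + bi)                 # input gate
    c_hat = np.tanh(Wc @ z + bc)               # candidate vector
    C_t = f_t * C_prev + i_t * c_hat           # new cell state
    o_t = sigmoid(Wo @ z + bo)                 # output gate
    H_t = o_t * np.tanh(C_t)                   # new hidden state
    return H_t, C_t
```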
in the embodiment of the application, the LSTM network may sequentially process the input image frames to obtain a cell state corresponding to each image frame, and the LSTM network may generally carry more inter-frame information in hidden layer output obtained by processing a later image frame, so that in order to reduce the amount of computation, the LSTM network may select a cell state obtained by processing an intermediate feature map corresponding to a later image frame in a multi-frame image.
In one implementation, the cell state obtained by processing the intermediate feature map corresponding to the last image frame in the multi-frame image by the LSTM network may be directly selected, specifically, for the first intermediate feature map output by the teacher network, the first intermediate feature map output may include a plurality of first sub-intermediate feature maps, where each first sub-intermediate feature map may correspond to one image frame, the first sub-intermediate feature map corresponding to the last image frame in the video may be obtained, and the cell state obtained by processing the first sub-intermediate feature map corresponding to the last image frame by the LSTM network is determined as the first inter-frame information; similarly, for a second intermediate feature map output of the student network, the second intermediate feature map output may include a plurality of second sub-intermediate feature maps, where each second sub-intermediate feature map may correspond to an image frame, and then a second sub-intermediate feature map corresponding to a last image frame in the video may be obtained, and it is determined that the cell state obtained by processing the second sub-intermediate feature map corresponding to the last image frame by the LSTM network is second inter-frame information.
It should be understood that, in addition to the cell state, the first inter-frame information and the second inter-frame information may be determined according to the hidden layer state obtained by processing the first intermediate feature map and the second intermediate feature map by the LSTM network, and specifically, the first hidden layer state and the second hidden layer state may be obtained by processing the first intermediate feature map output by the first video processing network and the second intermediate feature map output by the second video processing network by the LSTM network, respectively, and may be used as the first inter-frame information and the second inter-frame information, respectively.
804. And determining a target loss according to the first interframe information and the second interframe information, and performing knowledge distillation on the second video processing network based on the target loss and the first video processing network to obtain a trained second video processing network, wherein the target loss is related to the difference between the first interframe information and the second interframe information.
In this embodiment of the application, after obtaining the first inter-frame information and the second inter-frame information, a target loss for performing knowledge distillation may be constructed based on a difference between the first inter-frame information and the second inter-frame information, and based on the target loss and the first video processing network, the second video processing network may be subjected to knowledge distillation to obtain a trained second video processing network.
Specifically, by constraining first inter-frame information corresponding to a teacher model (a first video processing network) and second inter-frame information corresponding to a student model (a second video processing network), that is:
$L_{TD} = L_d(C^T, C^S)$
where $L_d$ represents a norm constraint, which may be, but is not limited to, the L2-norm distance; $L_{TD}$ is the target loss; $C^T$ is the first inter-frame information; and $C^S$ is the second inter-frame information.
It should be understood that the loss determined based on the difference between the first inter-frame information and the second inter-frame information may also be referred to as a temporal loss.
In one implementation, the target penalty may relate to a difference between spatial information of the first intermediate feature map output and the second intermediate feature map output in addition to a difference between the first inter-frame information and the second inter-frame information, as explained in detail below:
in this embodiment of the application, the video sample may include multiple frames of images, the first intermediate feature map output includes a first sub-intermediate feature map corresponding to each frame of image in the multiple frames of images, the second intermediate feature map output includes a second sub-intermediate feature map corresponding to each frame of image in the multiple frames of images, and each of the first sub-intermediate feature maps and each of the second sub-intermediate feature maps are processed to obtain first spatial information of each of the first sub-intermediate feature maps and second spatial information of each of the second sub-intermediate feature maps, where the first spatial information and the second spatial information are used to represent feature distribution of feature maps; determining a target loss based on the first inter-frame information and the second inter-frame information, and the first spatial information and the second spatial information, the target loss being related to a difference between the first inter-frame information and the second inter-frame information, and a difference between the first spatial information and the second spatial information.
The spatial information is used to represent a feature distribution of the feature map, and the feature distribution may include rich image content and represent image features of the corresponding image frame, such as frequency features, texture detail features, and the like.
In an alternative implementation, the first spatial information may be obtained by summing the squares of each first sub-intermediate feature map over the channel dimension, and the second spatial information may be obtained by summing the squares of each second sub-intermediate feature map over the channel dimension. Because the information statistics are computed as a channel-wise sum of squares, the spatial information may also be referred to as a spatial attention map.
It should be understood that the above-mentioned square sum operation is only an illustration, and in practical applications, the first spatial information and the second spatial information may also be calculated by other operations, and is not limited herein.
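A minimal sketch of the channel-wise sum-of-squares statistic (only one of several possible choices, as noted above) is:

```python
import torch

def spatial_attention_map(feature_map):
    # feature_map: (B, C, H, W) sub-intermediate feature map of one frame
    # -> (B, 1, H, W) spatial information / spatial attention map
    return (feature_map ** 2).sum(dim=1, keepdim=True)
```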
By the method, first spatial information corresponding to the first video processing model and second spatial information corresponding to the second video processing model can be obtained, the first spatial information and the second spatial information can be constrained, and the target loss can include a loss determined based on a difference between the first spatial information and the second spatial information. It should be understood that the loss determined based on the difference between the first spatial information and the second spatial information may also be referred to as spatial loss. Reference may be made to fig. 14 for the calculation of the spatial and temporal losses.
In one implementation, in addition to the difference between the first inter-frame information and the second inter-frame information, the target loss may relate to the difference between a first video processing result produced by the first video processing model and the true value (ground truth) corresponding to the video sample, as explained in detail below:
in this embodiment of the application, the video sample may be processed by the first video processing network and the second video processing network to obtain the first intermediate feature map output of the first video processing network, a first video processing result output by the first video processing network, and the second intermediate feature map output of the second video processing network, respectively, and the true value (ground truth) corresponding to the video sample is obtained. The target loss is determined based on the first inter-frame information and the second inter-frame information, as well as the first video processing result and the true value, and the target loss is related to the difference between the first inter-frame information and the second inter-frame information and the difference between the first video processing result and the true value. For example, the first video processing network and the second video processing network are used to implement a video enhancement task, so the true value (ground truth) corresponding to a video sample may be understood as the video sample with improved video quality. In one implementation, the true value (ground truth) corresponding to the video sample may be preset, or may be obtained by performing image enhancement on the video sample through the first video processing network, which is not limited here.
The first video processing result output by the first video processing model may be obtained in the foregoing manner, and the first video processing result and the true value (ground truth) corresponding to the video sample may then be constrained; the target loss may include a loss determined based on the difference between the first video processing result and the true value (ground truth) corresponding to the video sample. It should be understood that the loss determined based on the difference between the first video processing result and the true value (ground truth) corresponding to the video sample may also be referred to as a reconstruction loss.
Referring to fig. 15, in one implementation, the target loss may be constructed based on the difference between the first inter-frame information and the second inter-frame information, the difference between the first spatial information and the second spatial information, and the difference between the first video processing result and the true value (ground truth) corresponding to the video sample; specifically, reference may be made to the following formula:
$L = L_{rec} + \lambda_1 L_{SD} + \lambda_2 L_{TD}$

where $L$ is the target loss, $\lambda_1$ and $\lambda_2$ are hyper-parameters, $L_{SD}$ is the spatial-domain loss, $L_{TD}$ is the temporal-domain loss, and $L_{rec}$ is the reconstruction loss.
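A hedged end-to-end sketch of this target loss is given below. The global average pooling, the use of torch.nn.LSTM to produce the cell states, the L2 constraints and the choice of computing the reconstruction term on the student output against the ground truth are all illustrative assumptions, not a definitive implementation of the embodiments:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def target_loss(student_out, ground_truth, s_feats, t_feats, rnn, lam1=1.0, lam2=1.0):
    # s_feats / t_feats: (B, T, C, H, W) per-frame intermediate feature maps of the
    # student and teacher networks; student_out / ground_truth: enhanced video frames.

    # reconstruction loss L_rec (assumed here to be on the student output)
    l_rec = F.mse_loss(student_out, ground_truth)

    # spatial-domain loss L_SD: channel-wise sum of squares per frame
    l_sd = F.mse_loss((s_feats ** 2).sum(dim=2), (t_feats ** 2).sum(dim=2).detach())

    # temporal-domain loss L_TD: constrain the recurrent cell states summarising
    # the inter-frame information of the student and teacher feature maps
    _, (_, c_s) = rnn(s_feats.mean(dim=(3, 4)))   # last cell state over student features
    _, (_, c_t) = rnn(t_feats.mean(dim=(3, 4)))   # last cell state over teacher features
    l_td = F.mse_loss(c_s, c_t.detach())          # the teacher side is not updated

    return l_rec + lam1 * l_sd + lam2 * l_td

rnn = nn.LSTM(input_size=64, hidden_size=64, batch_first=True)
loss = target_loss(torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64),
                   torch.randn(2, 5, 64, 16, 16), torch.randn(2, 5, 64, 16, 16), rnn)
loss.backward()
```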
The embodiment of the application provides a model training method, including: obtaining a video sample, a first video processing network and a second video processing network, where the first video processing network is a teacher model and the second video processing network is a student model to be trained; processing the video sample through the first video processing network and the second video processing network to obtain a first intermediate feature map output of the first video processing network and a second intermediate feature map output of the second video processing network, respectively; processing the first intermediate feature map output and the second intermediate feature map output through a recurrent neural network to obtain first inter-frame information and second inter-frame information, respectively, where the first inter-frame information and the second inter-frame information are used to represent the feature change relationship between the image frames of the video sample; and determining a target loss according to the first inter-frame information and the second inter-frame information, and performing knowledge distillation on the second video processing network based on the target loss and the first video processing network to obtain a trained second video processing network, where the target loss is related to the difference between the first inter-frame information and the second inter-frame information. In this way, without changing the structure of the models, inter-frame information is added to the target loss used for knowledge distillation, so that the teacher model's ability to better recognize inter-frame information and use it for video enhancement is transferred to the student model, thereby improving the video quality of the enhanced video obtained when the student model after knowledge distillation performs the video enhancement task.
The advantageous effects of the embodiments of the present application are described next based on the experimental results.
Following the process provided in the embodiments of the application, the public standard Vimeo90K dataset is used as the training set, and Vimeo90K-Test and the Vid4 test set are used as test sets. Specifically, using the process provided in the embodiments of the application, EDVR is used as the teacher model, and the VDSR, VESPCN, VSRNet and FastDVDnet models are tested on the Vid4 and Vimeo90K-Test video datasets, respectively. Referring to tables 1 and 2, which show the quantitative evaluation results, it can be seen from the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) indices that the results of the embodiments of the application are improved to some extent compared with those without distillation. The entries marked with the plum-blossom symbol represent spatial distillation based on an autoencoder and statistics; the embodiments of the application show improvements of 0.17 dB and 0.55 dB over that method on Vid4 and Vimeo90K-Test, respectively.
TABLE 1 PSNR quantitative results (the marked entries represent the method provided in the embodiments of the present application)
TABLE 2 SSIM quantitative results (the marked entries represent the method provided in the embodiments of the present application)
As shown in fig. 16 and fig. 17, the model obtained by the model training method provided in the embodiments of the application (the trained second video processing model) has better detail and texture recovery capability, for example in the recovery of the building windows and the tablecloth grid in the figures. As shown in fig. 18, fig. 18 is a comparison of inter-frame consistency, where STD is the processing result of the model obtained by the model training method provided in the embodiments of the application. As shown in fig. 19, the distillation effect of the model training method provided in the embodiments of the application is tested on models of different computation/parameter magnitudes, where the computation amount/number of parameters is reduced by modifying the number of convolution channels of the student model.
Referring to fig. 20, fig. 20 is a schematic diagram of a model training apparatus 2000 according to an embodiment of the present application, and as shown in fig. 20, the model training apparatus 2000 according to the present application includes:
an obtaining module 2001, configured to obtain a video sample, a first video processing network and a second video processing network, where the first video processing network is a teacher model, and the second video processing network is a student model to be trained;
for a detailed description of the obtaining module 2001, reference may be made to the description of step 801, which is not described herein again.
A video processing module 2002, configured to process the video sample through the first video processing network and the second video processing network to obtain a first intermediate feature map output of the first video processing network and a second intermediate feature map output of the second video processing network, respectively;
for a detailed description of the video processing module 2002, reference may be made to the description of step 802, which is not described herein again.
A feature map processing module 2003, configured to process the first intermediate feature map output and the second intermediate feature map output respectively to obtain first inter-frame information and second inter-frame information, where the first inter-frame information and the second inter-frame information are used to represent a feature change relationship between image frames of the video sample;
for a detailed description of the feature map processing module 2003, reference may be made to the description of step 803, which is not described herein again.
A knowledge distillation module 2004 configured to determine a target loss according to the first interframe information and the second interframe information, and perform knowledge distillation on the second video processing network based on the target loss and the first video processing network to obtain a trained second video processing network, where the target loss is related to a difference between the first interframe information and the second interframe information.
For a detailed description of the knowledge distillation module 2004, reference may be made to the description of step 804, which is not repeated here.
In one possible implementation, the feature map processing module is configured to process the first intermediate feature map output and the second intermediate feature map output through a recurrent neural network, respectively.
In one possible implementation, the first interframe information and the second interframe information are hidden states of the recurrent neural network.
In one possible implementation, the video sample includes multiple frames of images, the first intermediate feature map output includes a first sub-intermediate feature map corresponding to each frame of image in the multiple frames of images, the second intermediate feature map output includes a second sub-intermediate feature map corresponding to each frame of image in the multiple frames of images, the first inter-frame information includes M hidden states obtained by the recurrent neural network processing the first sub-intermediate feature maps corresponding to the last M frames of images in the multiple frames of images, and the second inter-frame information includes M hidden states obtained by the recurrent neural network processing the second sub-intermediate feature maps corresponding to the last M frames of images in the multiple frames of images.
In one possible implementation, the recurrent neural network is a long-short term memory (LSTM) network, and the first inter-frame information and the second inter-frame information are cell states output by the LSTM.
In one possible implementation, the video sample includes multiple frames of images, the first intermediate feature map output includes a first sub-intermediate feature map corresponding to each frame of image in the multiple frames of images, the second intermediate feature map output includes a second sub-intermediate feature map corresponding to each frame of image in the multiple frames of images, the first inter-frame information is a cell state obtained by the LSTM network processing the first sub-intermediate feature map corresponding to the last frame of image in the multiple frames of images, and the second inter-frame information is a cell state obtained by the LSTM network processing the second sub-intermediate feature map corresponding to the last frame of image in the multiple frames of images.
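As a concrete illustration of the two variants above (hidden states for the last M frames, or the cell state after the last frame), the sketch below uses torch.nn.LSTM. Pooling each sub-intermediate feature map to a vector, the 64-channel feature maps, and the hidden size of 128 are assumptions for the example only.

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)

def inter_frame_info_lstm(sub_feature_maps, m=3):
    # sub_feature_maps: (B, T, 64, H, W), one sub-intermediate feature map per frame.
    seq = sub_feature_maps.mean(dim=(3, 4))     # (B, T, 64): one vector per frame
    hidden_seq, (h_n, c_n) = lstm(seq)          # hidden state per frame, final (h, c)
    last_m_hidden = hidden_seq[:, -m:, :]       # hidden states of the last M frames
    last_cell = c_n[-1]                         # cell state after the last frame, (B, 128)
    return last_m_hidden, last_cell

# Example call on random features for a 7-frame clip:
info_m, info_last = inter_frame_info_lstm(torch.rand(2, 7, 64, 16, 16))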
In one possible implementation, the video sample includes a plurality of frames of images, the first intermediate feature map output includes a first sub-intermediate feature map corresponding to each frame of image in the plurality of frames of images, the second intermediate feature map output includes a second sub-intermediate feature map corresponding to each frame of image in the plurality of frames of images, and the apparatus further includes:
an information statistics module, configured to process each of the first sub-intermediate feature maps and each of the second sub-intermediate feature maps to obtain first spatial information of each of the first sub-intermediate feature maps and second spatial information of each of the second sub-intermediate feature maps, where the first spatial information and the second spatial information are used to represent feature distribution of feature maps;
the knowledge distillation module is used for determining a target loss according to the first interframe information and the second interframe information and the first spatial information and the second spatial information, wherein the target loss is related to the difference between the first interframe information and the second interframe information and the difference between the first spatial information and the second spatial information.
In a possible implementation, the first spatial information is a first spatial attention map, the second spatial information is a second spatial attention map, and the information statistics module is configured to map the first intermediate feature map output and the second intermediate feature map output based on a spatial attention mechanism, so as to obtain the first spatial attention map and the second spatial attention map, respectively.
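The embodiments do not fix a particular formula for the spatial attention map; as one plausible sketch, the channel-wise mean of absolute activations, L2-normalized over spatial positions, is a common choice in attention-based distillation, and a spatial term could then compare the teacher's and student's maps.

import torch
import torch.nn.functional as F

def spatial_attention_map(feature_map):
    # feature_map: (B, C, H, W) -> (B, H*W) map describing where the features concentrate.
    attn = feature_map.abs().mean(dim=1)        # collapse the channel dimension: (B, H, W)
    attn = attn.flatten(start_dim=1)            # (B, H*W)
    return F.normalize(attn, p=2, dim=1)        # L2-normalize so maps are comparable

def spatial_info_loss(student_feat, teacher_feat):
    return F.mse_loss(spatial_attention_map(student_feat),
                      spatial_attention_map(teacher_feat))

# Example call on random feature maps:
loss_spatial = spatial_info_loss(torch.rand(2, 64, 16, 16), torch.rand(2, 64, 16, 16))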
In one possible implementation, the video processing module is configured to process the video sample through the first video processing network and the second video processing network to obtain a first intermediate feature map output of the first video processing network, a first video processing result output by the first video processing network, and a second intermediate feature map output by the second video processing network, respectively;
the knowledge distillation module is configured to determine a target loss based on the first interframe information and the second interframe information, and the first video processing result and the truth value, the target loss being related to a difference between the first interframe information and the second interframe information, and a difference between the first video processing result and the truth value.
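Combining the pieces handled by the video processing module and the knowledge distillation module, one hedged sketch of the target loss is given below. Following the wording above, the result compared against the ground truth is the first video processing result; the choice of L1/MSE terms and the weights lambda_inter and lambda_spatial are assumptions, not values disclosed by the embodiments.

import torch.nn.functional as F

def target_loss(info_first, info_second, result_first, truth,
                spatial_first=None, spatial_second=None,
                lambda_inter=0.1, lambda_spatial=0.1):
    loss = F.l1_loss(result_first, truth)                          # video processing result vs. truth
    loss = loss + lambda_inter * F.mse_loss(info_second, info_first)   # inter-frame information gap
    if spatial_first is not None and spatial_second is not None:
        loss = loss + lambda_spatial * F.mse_loss(spatial_second, spatial_first)  # spatial information gap
    return loss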
In one possible implementation, the first video processing network and the second video processing network are used to implement video enhancement tasks.
In one possible implementation, the video enhancement task is a video de-noising task, a video de-fogging task, a super-resolution task, or a high dynamic range task.
In one possible implementation, the apparatus further comprises: a deblurring module, configured to perform deblurring processing on the first intermediate feature map and the second intermediate feature map respectively before the first intermediate feature map output and the second intermediate feature map output are processed respectively by a recurrent neural network, so as to obtain a deblurred first intermediate feature map and a deblurred second intermediate feature map;
the feature map processing module is configured to process the first intermediate feature map after the deblurring processing and the second intermediate feature map after the deblurring processing through a recurrent neural network, respectively.
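The deblurring operator applied to the intermediate feature maps is not spelled out here; purely as an assumption, a small shared convolutional block applied frame by frame before the recurrent neural network could look as follows.

import torch
import torch.nn as nn

deblur = nn.Sequential(                      # hypothetical lightweight feature-space deblurring module
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
)

def deblur_per_frame(feats):
    # feats: (B, T, 64, H, W) intermediate feature maps; apply the same module to every frame.
    b, t, c, h, w = feats.shape
    return deblur(feats.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)

# Example call: deblur the feature maps of a 4-frame clip before the recurrent network.
deblurred = deblur_per_frame(torch.rand(1, 4, 64, 32, 32))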
Referring to fig. 21, fig. 21 is a schematic structural diagram of an execution device provided in the embodiment of the present application. The execution device 2100 may be embodied as a mobile phone, a tablet, a notebook computer, an intelligent wearable device, a server, or the like, which is not limited herein. The execution device 2100 may run the trained second video processing network obtained through the embodiment corresponding to fig. 8. Specifically, the execution device 2100 includes: a receiver 2101, a transmitter 2102, a processor 2103, and a memory 2104 (the number of processors 2103 in the execution device 2100 may be one or more; one processor is taken as an example in fig. 21), where the processor 2103 may include an application processor 21031 and a communication processor 21032. In some embodiments of the present application, the receiver 2101, the transmitter 2102, the processor 2103, and the memory 2104 may be connected by a bus or in another manner.
Memory 2104 may include read-only memory and random access memory, and provides instructions and data to processor 2103. A portion of memory 2104 may also include non-volatile random access memory (NVRAM). The memory 2104 stores a processor and operating instructions, executable modules or data structures, or a subset or an expanded set thereof, wherein the operating instructions may include various operating instructions for implementing various operations.
The processor 2103 controls the operation of the execution apparatus. In a particular application, the various components of the execution device are coupled together by a bus system that may include a power bus, a control bus, a status signal bus, etc., in addition to a data bus. For clarity of illustration, the various buses are referred to in the figures as a bus system.
The method disclosed in the embodiments of the present application may be applied to the processor 2103 or implemented by the processor 2103. The processor 2103 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 2103 or by instructions in the form of software. The processor 2103 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor or a microcontroller, a vision processor (VPU), a tensor processing unit (TPU), or another processor suitable for AI computation, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor 2103 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable read-only memory, or a register. The storage medium is located in the memory 2104, and the processor 2103 reads the information in the memory 2104 and completes the steps of the above method in combination with its hardware.
The receiver 2101 may be configured to receive input numeric or character information and to generate signal inputs related to the settings and function control of the execution device. The transmitter 2102 may be configured to output numeric or character information through a first interface; the transmitter 2102 may also be configured to send instructions to a disk group through the first interface to modify data in the disk group; and the transmitter 2102 may further include a display device such as a display screen.
The execution device may obtain the trained second video processing network obtained by training through the model training method in the embodiment corresponding to fig. 8, and perform model inference.
Referring to fig. 22, fig. 22 is a schematic structural diagram of a training device provided in an embodiment of the present application. Specifically, the training device 2200 is implemented by one or more servers and may vary considerably depending on its configuration or performance; it may include one or more central processing units (CPUs) 2219 (for example, one or more processors), a memory 2232, and one or more storage media 2230 (for example, one or more mass storage devices) storing an application 2242 or data 2244. The memory 2232 and the storage medium 2230 may be transient storage or persistent storage. The program stored on the storage medium 2230 may include one or more modules (not shown), and each module may include a series of instruction operations for the training device. Further, the central processing unit 2219 may be configured to communicate with the storage medium 2230 and to perform, on the training device 2200, the series of instruction operations in the storage medium 2230.
The training device 2200 may also include one or more power supplies 2226, one or more wired or wireless network interfaces 2250, one or more input/output interfaces 2258, and/or one or more operating systems 2241, such as Windows Server™, Mac OS X™, Unix™, Linux™, or FreeBSD™.
Specifically, the training apparatus may execute the model training method in the embodiment corresponding to fig. 8.
The model training apparatus 2000 depicted in fig. 20 may be a module in the training device 2200, and a processor in the training device 2200 may execute the model training method executed by the model training apparatus 2000.
Embodiments of the present application also provide a computer program product, which when executed on a computer causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
Also provided in an embodiment of the present application is a computer-readable storage medium, in which a program for signal processing is stored, and when the program is run on a computer, the program causes the computer to execute the steps executed by the aforementioned execution device, or causes the computer to execute the steps executed by the aforementioned training device.
The execution device, the training device, or the terminal device provided in the embodiment of the present application may specifically be a chip, where the chip includes: a processing unit, which may be for example a processor, and a communication unit, which may be for example an input/output interface, a pin or a circuit, etc. The processing unit may execute the computer execution instructions stored by the storage unit to cause the chip in the execution device to execute the data processing method described in the above embodiment, or to cause the chip in the training device to execute the data processing method described in the above embodiment. Optionally, the storage unit is a storage unit in the chip, such as a register, a cache, and the like, and the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a Random Access Memory (RAM), and the like.
It should be noted that the above-described embodiments of the apparatus are merely schematic, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiments of the apparatus provided in the present application, the connection relationship between the modules indicates that there is a communication connection therebetween, and may be implemented as one or more communication buses or signal lines.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus the necessary general-purpose hardware, and certainly can also be implemented by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. Generally, any function performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structure used to implement the same function may take various forms, such as an analog circuit, a digital circuit, or a dedicated circuit. For the present application, however, implementation by a software program is usually the preferable implementation. Based on such an understanding, the technical solutions of the present application may be embodied essentially in the form of a software product, which is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a training device, or a network device) to execute the methods described in the embodiments of the present application.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a training device or a data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)), or the like.

Claims (25)

1. A method of model training, the method comprising:
the method comprises the steps of obtaining a video sample, a first video processing network and a second video processing network, wherein the first video processing network is a teacher model, and the second video processing network is a student model to be trained;
processing the video samples through the first video processing network and the second video processing network to obtain a first intermediate feature map output of the first video processing network and a second intermediate feature map output of the second video processing network, respectively;
respectively processing the first intermediate feature map output and the second intermediate feature map output to respectively obtain first inter-frame information and second inter-frame information, wherein the first inter-frame information and the second inter-frame information are used for representing the feature change relationship between each image frame of the video sample;
and determining a target loss according to the first interframe information and the second interframe information, and performing knowledge distillation on the second video processing network based on the target loss and the first video processing network to obtain a trained second video processing network, wherein the target loss is related to the difference between the first interframe information and the second interframe information.
2. The method of claim 1, wherein the separately processing the first and second intermediate feature map outputs comprises:
and processing the first intermediate feature map output and the second intermediate feature map output respectively through a recurrent neural network.
3. The method of claim 2, wherein the first inter-frame information and the second inter-frame information are hidden states of the recurrent neural network.
4. The method of claim 2 or 3, wherein the recurrent neural network is a long-short term memory (LSTM) network, and the first inter-frame information and the second inter-frame information are cell states output by the LSTM network.
5. The method according to claim 4, wherein the video sample comprises a plurality of frames of images, the first intermediate feature map output comprises a first sub-intermediate feature map corresponding to each frame of image in the plurality of frames of images, the second intermediate feature map output comprises a second sub-intermediate feature map corresponding to each frame of image in the plurality of frames of images, the first inter-frame information is a cell state obtained by the LSTM network processing a first sub-intermediate feature map corresponding to a last frame of image in the plurality of frames of images, and the second inter-frame information is a cell state obtained by the LSTM network processing a second sub-intermediate feature map corresponding to a last frame of image in the plurality of frames of images.
6. The method according to claim 4 or 5, wherein the video sample includes a plurality of frames of images, the first intermediate feature map output includes a first sub-intermediate feature map corresponding to each frame of image in the plurality of frames of images, the second intermediate feature map output includes a second sub-intermediate feature map corresponding to each frame of image in the plurality of frames of images, the first inter-frame information includes M hidden states obtained by the recurrent neural network processing the first sub-intermediate feature maps corresponding to the last M frames of images in the plurality of frames of images, and the second inter-frame information includes M hidden states obtained by the recurrent neural network processing the second sub-intermediate feature maps corresponding to the last M frames of images in the plurality of frames of images.
7. The method of any of claims 1 to 6, wherein the video sample comprises a plurality of frames of images, wherein the first intermediate feature map output comprises a first sub-intermediate feature map corresponding to each frame of image in the plurality of frames of images, wherein the second intermediate feature map output comprises a second sub-intermediate feature map corresponding to each frame of image in the plurality of frames of images, and wherein the method further comprises:
processing each first sub-intermediate feature map and each second sub-intermediate feature map to obtain first spatial information of each first sub-intermediate feature map and second spatial information of each second sub-intermediate feature map, where the first spatial information and the second spatial information are used to represent feature distribution of feature maps;
the determining a target loss according to the first inter-frame information and the second inter-frame information includes:
determining a target loss based on the first inter-frame information and the second inter-frame information, and the first spatial information and the second spatial information, the target loss being related to a difference between the first inter-frame information and the second inter-frame information, and a difference between the first spatial information and the second spatial information.
8. The method of claim 7, wherein the first spatial information is a first spatial attention map, the second spatial information is a second spatial attention map, and the performing information statistics on each of the first sub-intermediate feature maps and each of the second sub-intermediate feature maps comprises:
mapping the first and second intermediate feature map outputs based on a spatial attention mechanism to obtain the first and second spatial attention maps, respectively.
9. The method of any of claims 1 to 8, wherein said processing said video samples over said first video processing network and said second video processing network comprises:
processing the video sample through the first video processing network and the second video processing network to obtain a first intermediate feature map output of the first video processing network, a first video processing result output by the first video processing network, and a second intermediate feature map output by the second video processing network, respectively;
the determining a target loss according to the first inter-frame information and the second inter-frame information includes:
acquiring a true value (ground truth) corresponding to the video sample;
determining a target loss based on the first and second inter-frame information and the first video processing result and the true value, the target loss being related to a difference between the first and second inter-frame information and a difference between the first video processing result and the true value.
10. The method of any of claims 1 to 9, wherein the first video processing network and the second video processing network are used to implement video enhancement tasks.
11. The method of any of claims 1 to 10, wherein prior to processing the first and second intermediate feature map outputs separately, the method further comprises:
performing deblurring processing on the first intermediate feature map and the second intermediate feature map respectively to obtain the deblurred first intermediate feature map and the deblurred second intermediate feature map;
the processing the first intermediate feature map and the second intermediate feature map by the recurrent neural network respectively includes:
and respectively processing the first intermediate feature map after the deblurring processing and the second intermediate feature map after the deblurring processing.
12. A model training apparatus, the apparatus comprising:
the system comprises an acquisition module, a training module and a training module, wherein the acquisition module is used for acquiring a video sample, a first video processing network and a second video processing network, the first video processing network is a teacher model, and the second video processing network is a student model to be trained;
a video processing module, configured to process the video sample through the first video processing network and the second video processing network to obtain a first intermediate feature map output of the first video processing network and a second intermediate feature map output of the second video processing network, respectively;
a feature map processing module, configured to process the first intermediate feature map output and the second intermediate feature map output respectively to obtain first inter-frame information and second inter-frame information, where the first inter-frame information and the second inter-frame information are used to represent a feature change relationship between image frames of the video sample;
a knowledge distillation module, configured to determine a target loss according to the first interframe information and the second interframe information, and perform knowledge distillation on the second video processing network based on the target loss and the first video processing network to obtain a trained second video processing network, where the target loss is related to a difference between the first interframe information and the second interframe information.
13. The apparatus of claim 12, wherein the feature map processing module is configured to process the first and second intermediate feature map outputs separately through a recurrent neural network.
14. The apparatus of claim 13, wherein the first inter-frame information and the second inter-frame information are hidden states of the recurrent neural network.
15. The apparatus of claim 13 or 14, wherein the recurrent neural network is a long-short term memory (LSTM) network, and the first inter-frame information and the second inter-frame information are cell states of the LSTM output.
16. The apparatus of claim 15, wherein the video sample comprises a plurality of frames of images, the first intermediate feature map output comprises a first sub-intermediate feature map corresponding to each frame of image in the plurality of frames of images, the second intermediate feature map output comprises a second sub-intermediate feature map corresponding to each frame of image in the plurality of frames of images, the first inter-frame information is a cell state obtained by the LSTM network processing a first sub-intermediate feature map corresponding to a last frame of image in the plurality of frames of images, and the second inter-frame information is a cell state obtained by the LSTM network processing a second sub-intermediate feature map corresponding to a last frame of image in the plurality of frames of images.
17. The apparatus according to claim 15 or 16, wherein the video sample includes a plurality of frames of images, the first intermediate feature map output includes a first sub-intermediate feature map corresponding to each frame of image in the plurality of frames of images, the second intermediate feature map output includes a second sub-intermediate feature map corresponding to each frame of image in the plurality of frames of images, the first inter-frame information includes M hidden states obtained by the recurrent neural network processing the first sub-intermediate feature maps corresponding to the last M frames of images in the plurality of frames of images, and the second inter-frame information includes M hidden states obtained by the recurrent neural network processing the second sub-intermediate feature maps corresponding to the last M frames of images in the plurality of frames of images.
18. The apparatus of any of claims 12 to 17, wherein the video samples comprise a plurality of frames of images, wherein the first intermediate feature map output comprises a first sub-intermediate feature map corresponding to each frame of image in the plurality of frames of images, wherein the second intermediate feature map output comprises a second sub-intermediate feature map corresponding to each frame of image in the plurality of frames of images, and wherein the apparatus further comprises:
an information statistics module, configured to process each of the first sub-intermediate feature maps and each of the second sub-intermediate feature maps to obtain first spatial information of each of the first sub-intermediate feature maps and second spatial information of each of the second sub-intermediate feature maps, where the first spatial information and the second spatial information are used to represent feature distribution of feature maps;
the knowledge distillation module is used for determining a target loss according to the first interframe information and the second interframe information and the first spatial information and the second spatial information, wherein the target loss is related to the difference between the first interframe information and the second interframe information and the difference between the first spatial information and the second spatial information.
19. The apparatus of claim 18, wherein the first spatial information is a first spatial attention map, the second spatial information is a second spatial attention map, and the information statistics module is configured to map the first intermediate feature map output and the second intermediate feature map output based on a spatial attention mechanism, respectively, to obtain the first spatial attention map and the second spatial attention map, respectively.
20. The apparatus according to any one of claims 12 to 19, wherein the video processing module is configured to process the video samples through the first video processing network and the second video processing network to obtain a first intermediate feature map output of the first video processing network, a first video processing result output of the first video processing network, and a second intermediate feature map output of the second video processing network, respectively;
the knowledge distillation module is used for acquiring a true value (ground true) corresponding to the video sample; determining a target loss based on the first and second inter-frame information and the first video processing result and the true value, the target loss being related to a difference between the first and second inter-frame information and a difference between the first video processing result and the true value.
21. The apparatus of any of claims 12 to 20, wherein the first video processing network and the second video processing network are configured to perform video enhancement tasks.
22. The apparatus of any one of claims 12 to 21, further comprising: a deblurring module, configured to perform deblurring processing on the first intermediate feature map and the second intermediate feature map respectively before the first intermediate feature map output and the second intermediate feature map output are processed respectively by a recurrent neural network, so as to obtain a deblurred first intermediate feature map and a deblurred second intermediate feature map;
the feature map processing module is configured to process the first intermediate feature map after the deblurring processing and the second intermediate feature map after the deblurring processing, respectively.
23. A model training apparatus, the apparatus comprising a memory and a processor; the memory stores code, and the processor is configured to retrieve the code and perform the method of any of claims 1 to 11.
24. A computer storage medium, characterized in that the computer storage medium stores one or more instructions that, when executed by one or more computers, cause the one or more computers to implement the method of any of claims 1 to 11.
25. A computer program product comprising code which, when executed, implements the method of any one of claims 1 to 11.
CN202110292062.4A 2021-03-18 2021-03-18 Model training method and device Pending CN113011562A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110292062.4A CN113011562A (en) 2021-03-18 2021-03-18 Model training method and device


Publications (1)

Publication Number Publication Date
CN113011562A true CN113011562A (en) 2021-06-22

Family

ID=76402492

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110292062.4A Pending CN113011562A (en) 2021-03-18 2021-03-18 Model training method and device

Country Status (1)

Country Link
CN (1) CN113011562A (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200302295A1 (en) * 2019-03-22 2020-09-24 Royal Bank Of Canada System and method for knowledge distillation between neural networks
CN110070067A (en) * 2019-04-29 2019-07-30 北京金山云网络技术有限公司 The training method of video classification methods and its model, device and electronic equipment
CN111401406A (en) * 2020-02-21 2020-07-10 华为技术有限公司 Neural network training method, video frame processing method and related equipment
CN111598213A (en) * 2020-04-01 2020-08-28 北京迈格威科技有限公司 Network training method, data identification method, device, equipment and medium
CN111882031A (en) * 2020-06-30 2020-11-03 华为技术有限公司 Neural network distillation method and device
CN112070677A (en) * 2020-09-18 2020-12-11 中国科学技术大学 Video space-time super-resolution enhancement method based on time slicing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HAOCHEN ZHANG, ET AL.: "Convolutional Neural Network-Based Video Super-Resolution for Action Recognition", 2018 13TH IEEE INTERNATIONAL CONFERENCE ON AUTOMATIC FACE & GESTURE RECOGNITION (FG 2018), pages 746 - 749 *
LI HUIQUN: "Deep learning reconstruction from color *** to hyperspectral video", CHINA MASTER'S THESES FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY SERIES (MONTHLY), 2019, no. 08, pages 138 - 631 *
LONG JIANWU; PENG LANG; AN YONG: "Lane line detection based on target feature distillation", JOURNAL OF CHONGQING UNIVERSITY OF TECHNOLOGY (NATURAL SCIENCE), vol. 34, no. 09, pages 198 - 206 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113821673A (en) * 2021-10-09 2021-12-21 成都统信软件技术有限公司 Picture processing method, computing device and readable storage medium
CN113821673B (en) * 2021-10-09 2023-05-05 成都统信软件技术有限公司 Picture processing method, computing device and readable storage medium
CN113963176A (en) * 2021-10-28 2022-01-21 北京百度网讯科技有限公司 Model distillation method and device, electronic equipment and storage medium
CN113963176B (en) * 2021-10-28 2023-07-07 北京百度网讯科技有限公司 Model distillation method and device, electronic equipment and storage medium
CN114339409A (en) * 2021-12-09 2022-04-12 腾讯科技(上海)有限公司 Video processing method, video processing device, computer equipment and storage medium
CN114648723A (en) * 2022-04-28 2022-06-21 之江实验室 Action normative detection method and device based on time consistency comparison learning
TWI843109B (en) 2022-05-24 2024-05-21 鴻海精密工業股份有限公司 Method for identifying medical image, computer device and computer readable storage medium
CN117274869A (en) * 2023-09-25 2023-12-22 北方工业大学 Cell deformation dynamic classification method and system based on deformation field extraction
CN117274869B (en) * 2023-09-25 2024-03-26 北方工业大学 Cell deformation dynamic classification method and system based on deformation field extraction
CN117892096A (en) * 2024-03-14 2024-04-16 中国海洋大学 Small sample ocean sound velocity profile forecasting method based on transfer learning
CN117892096B (en) * 2024-03-14 2024-05-14 中国海洋大学 Small sample ocean sound velocity profile forecasting method based on transfer learning

Similar Documents

Publication Publication Date Title
CN110532871B (en) Image processing method and device
WO2021018163A1 (en) Neural network search method and apparatus
CN110222717B (en) Image processing method and device
WO2021043112A1 (en) Image classification method and apparatus
WO2021043273A1 (en) Image enhancement method and apparatus
WO2020177607A1 (en) Image denoising method and apparatus
WO2022042713A1 (en) Deep learning training method and apparatus for use in computing device
CN111402130B (en) Data processing method and data processing device
CN113011562A (en) Model training method and device
CN112446270A (en) Training method of pedestrian re-identification network, and pedestrian re-identification method and device
CN111914997B (en) Method for training neural network, image processing method and device
WO2022001805A1 (en) Neural network distillation method and device
CN113284054A (en) Image enhancement method and image enhancement device
CN112639828A (en) Data processing method, method and equipment for training neural network model
CN113705769A (en) Neural network training method and device
CN113066017B (en) Image enhancement method, model training method and equipment
WO2021018251A1 (en) Image classification method and device
CN113191489B (en) Training method of binary neural network model, image processing method and device
CN110222718A (en) The method and device of image procossing
CN111797882A (en) Image classification method and device
WO2024002211A1 (en) Image processing method and related apparatus
CN112070664A (en) Image processing method and device
CN112464930A (en) Target detection network construction method, target detection method, device and storage medium
CN111695673A (en) Method for training neural network predictor, image processing method and device
CN114359289A (en) Image processing method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination