CN111667399A - Method for training style migration model, method and device for video style migration


Info

Publication number
CN111667399A
CN111667399A
Authority
CN
China
Prior art keywords
model
image
frames
images
loss function
Prior art date
Legal status
Granted
Application number
CN202010409043.0A
Other languages
Chinese (zh)
Other versions
CN111667399B
Inventor
张依曼
陈醒濠
王云鹤
许春景
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202010409043.0A
Publication of CN111667399A
Application granted
Publication of CN111667399B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/04 Context-preserving transformations, e.g. by using an importance map
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a method for training a style migration model, a method for video style migration and a device thereof in the field of artificial intelligence, comprising the following steps: acquiring training data; carrying out image style migration processing on the N frames of sample content images according to the sample style images through a neural network model to obtain N frames of predicted synthetic images; determining parameters of a neural network model according to an image loss function between the N frames of sample content images and the N frames of predicted synthetic images, wherein the image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on the N frames of sample content images and optical flow information, the second low-rank matrix is obtained based on the N frames of predicted synthetic images and the optical flow information, and the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frames of images in the N frames of sample content images. According to the technical scheme, the stability of the video after the style migration processing can be improved.

Description

Method for training style migration model, method and device for video style migration
Technical Field
The application relates to the field of artificial intelligence, in particular to a method for training a style migration model, a method for video style migration and a device for video style migration in the field of computer vision.
Background
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Image rendering tasks such as image style migration have wide application scenarios on terminal devices. With the rapid improvement of terminal device performance and network performance, the entertainment demands on terminal devices are gradually shifting from the image level to the video level, that is, from image style migration processing of a single image to image style migration processing of a video. Compared with the image style migration task, the video style migration task needs to consider not only the stylization effect of the images, but also the stability among the multiple frames of images included in the video, so as to ensure the fluency of the video after image style migration processing.
Therefore, how to improve the stability of a video after image style migration processing has become an urgent problem to be solved.
Disclosure of Invention
The application provides a training method of a style migration model, a video style migration method and a video style migration device.
In a first aspect, a method for training a style migration model is provided, including: acquiring training data, wherein the training data comprises N frames of sample content images, sample style images and N frames of synthetic images, the N frames of synthetic images are images obtained by performing image style migration processing on the N frames of sample content images according to the sample style images, and N is an integer greater than or equal to 2; carrying out image style migration processing on the N frames of sample content images according to the sample style images through a neural network model to obtain N frames of predicted synthetic images; determining parameters of the neural network model according to an image loss function between the N frames of sample content images and the N frames of predictive synthetic images,
wherein the image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on the N frames of sample content images and optical flow information, the second low-rank matrix is obtained based on the N frames of predicted synthesized images and the optical flow information, and the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frames of images in the N frames of sample content images.
It should be understood that, for a matrix formed from multiple frames of images, a low-rank matrix may be used to represent regions that are not motion boundaries and that are present in all N frames of images. A sparse matrix may be used to represent regions that appear intermittently in the N frames of images; for example, the sparse matrix may correspond to a region that newly appears or disappears at the boundary of an image due to camera movement, or to the boundary region of a moving object.
In the embodiment of the application, a low-rank loss function is introduced when training the target style migration model used for video style migration processing. The low-rank loss function encourages the regions that are not motion boundaries and that appear in adjacent multi-frame images of the video to be processed to remain the same after the video undergoes style migration processing; that is, the rank of such a region in the video after style migration processing is close to the rank of that region in the video to be processed, so the stability of the video after style migration processing can be improved.
It should be understood that image style migration processing is processing that fuses the image content of a content image A, which is the image to be style-migrated, with the image style of a style image B, thereby generating a composite image C having the content of image A and the style of image B, which may also be called a fused image C.
The style image may refer to a reference image for the style migration processing, and the style of an image may include the texture features of the image and the artistic expression form of the image; for example, the style of a famous painting, and the artistic expression forms of an image may include cartoon, oil painting, watercolor, ink-and-wash and other styles. The content image may refer to the image that needs to undergo style migration, and the content of an image may refer to the semantic information in the image, that is, it may include the high-frequency information, low-frequency information and the like in the content image.
In one possible implementation, the first low-rank matrix is obtained based on the N frames of sample content images and the optical flow information. For example, the first low-rank matrix may be obtained by calculating the optical flow information between adjacent image frames in the N frames of sample content images; mask information is obtained according to the optical flow information, where the optical flow information is used to represent the motion information of corresponding pixel points between adjacent frame images, and the mask information may be used to represent the changed region between two consecutive frames of images obtained according to the optical flow information; furthermore, the N frames of sample content images are mapped onto a fixed frame of image according to the optical flow information and the mask information, each mapped sample content image is flattened into a vector, and the vectors are combined column-wise into a matrix, which is the first low-rank matrix. Similarly, the second low-rank matrix may be obtained based on the N frames of predicted synthesized images and the optical flow information: the N frames of predicted synthesized images are mapped onto a fixed frame of image according to the optical flow information and the mask information, each mapped predicted synthesized image is flattened into a vector, and the vectors are combined column-wise into a matrix, which is the second low-rank matrix, where the optical flow information is used to represent the position difference of corresponding pixel points between two adjacent frames of images in the N frames of sample content images.
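For illustration only, the following PyTorch-style sketch shows one way such low-rank matrices and a low-rank loss could be assembled. The warp_to_ref helper (which maps a frame onto the fixed reference frame using the pre-computed optical flow) and the masks are hypothetical inputs, and the nuclear norm is used only as a common differentiable surrogate for the rank; none of these choices are prescribed by this application.

```python
import torch

def build_aligned_matrix(frames, warp_to_ref, masks):
    # frames: list of N tensors, each of shape (C, H, W)
    # warp_to_ref: hypothetical helper mapping a frame onto the fixed
    #              reference frame using the pre-computed optical flow
    # masks: list of N binary tensors derived from the optical flow
    columns = []
    for frame, mask in zip(frames, masks):
        aligned = warp_to_ref(frame) * mask      # map onto the fixed frame
        columns.append(aligned.reshape(-1))      # flatten into a vector
    return torch.stack(columns, dim=1)           # column-wise (C*H*W, N) matrix

def low_rank_loss(content_frames, predicted_frames, warp_to_ref, masks):
    m_first = build_aligned_matrix(content_frames, warp_to_ref, masks)
    m_second = build_aligned_matrix(predicted_frames, warp_to_ref, masks)
    # Nuclear norm (sum of singular values) as a differentiable surrogate
    # for the rank of each matrix; this choice is an assumption of the sketch.
    return torch.abs(torch.linalg.matrix_norm(m_second, ord='nuc')
                     - torch.linalg.matrix_norm(m_first, ord='nuc'))
```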
With reference to the first aspect, in certain implementations of the first aspect, the image loss function further includes a residual loss function obtained according to a difference between a first sample synthesized image and a second sample synthesized image, where the first sample synthesized image is an image obtained by performing image style migration processing on the N frames of sample content images through a first model, the second sample synthesized image is an image obtained by performing image style migration processing on the N frames of sample content images through a second model, the first model and the second model are image style migration models trained in advance according to the sample style images, and the second model includes an optical flow module, the first model does not include the optical flow module, and the optical flow module is configured to determine the optical flow information.
In the embodiment of the application, the residual loss function is introduced when the target style migration model is trained, so that the neural network model can learn the difference between the synthesized image output by the style migration model including the optical flow module and the synthesized image output by the style migration model not including the optical flow module in the training process, and the stability of the video after style migration processing obtained by the target migration model can be improved.
It should be understood that the difference between the first sample composite image and the second sample composite image may refer to a difference between corresponding pixel values of the first sample composite image and the second sample composite image.
In one possible implementation, the first model and the second model may use the same sample content image and sample style image in the training stage; for example, the first model and the second model may refer to the same model during the training phase; however, in the testing stage, the second model also needs to calculate optical flow information among the multi-frame sample content images; the first model does not need to calculate optical flow information among a plurality of frames of images.
With reference to the first aspect, in certain implementation manners of the first aspect, the first model and the second model are pre-trained teacher models, and the target style migration model is a target student model obtained by training a student model to be trained according to the residual loss function and the knowledge distillation algorithm.
In a possible implementation manner, the target style migration model may refer to a target student model, and when the target student model is trained, a student model to be trained may be trained according to a pre-trained first teacher model (excluding an optical flow module), a pre-trained second teacher model (including an optical flow module), and a pre-trained basic model, so as to obtain the target student model; the network structures of the student model to be trained, the pre-trained basic model and the target student model are the same, and the student model to be trained is trained through the low-rank loss function, the residual loss function and the perception loss function, so that the target student model is obtained.
The pre-trained basic model may be a style migration model that is pre-trained with a perceptual loss function and does not include an optical flow module in the testing stage; alternatively, the pre-trained basic model may refer to a style migration model that does not include an optical flow module in the testing stage and is pre-trained with a perceptual loss function and an optical flow loss function. The perceptual loss function is used for representing the content loss between the composite image and the content image and the style loss between the composite image and the style image; the optical flow loss function is used for representing the difference between corresponding pixel points of adjacent frame composite images.
In a possible implementation manner, in the process of training the student model to be trained, the residual loss function makes the difference between the migration results (also called synthesized images) output by the student model to be trained and by the pre-trained basic model continuously approach the difference between the migration results output by the second model and by the first model.
In the embodiment of the application, the target style migration model may refer to a target student model. A knowledge distillation method based on teacher-student learning is adopted, so that the difference between the style migration results output by the student model to be trained and by the pre-trained basic model continuously approaches the difference between the style migration results output by the teacher model including the optical flow module and by the teacher model not including the optical flow module; this training method can effectively avoid the ghosting phenomenon caused by inconsistent styles between the teacher models and the student model.
With reference to the first aspect, in certain implementations of the first aspect, the residual loss function is derived according to the following equation:

L_res = Σ_i || ( N_T(x_i) - N'_T(x_i) ) - ( N_S(x_i) - N'_S(x_i) ) ||

wherein L_res represents the residual loss function; N_T represents the second model; N'_T represents the first model; N_S represents the student model to be trained; N'_S represents a pre-trained basic model, and the pre-trained basic model has the same network structure as the student model to be trained; x_i represents the i-th frame sample content image included in the sample video, and i is a positive integer.
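A minimal PyTorch-style sketch of this residual loss is given below, under the simplifying assumptions that all four models can be called frame by frame (in practice the second model N_T also consumes optical flow between adjacent frames, which is omitted here) and that a mean-squared norm is used; the equation above leaves the norm unspecified.

```python
import torch

def residual_loss(n_t, n_t_prime, n_s, n_s_prime, frames):
    # n_t:       second model N_T (teacher with an optical flow module)
    # n_t_prime: first model N'_T (teacher without the optical flow module)
    # n_s:       student model N_S being trained
    # n_s_prime: pre-trained basic model N'_S, same structure as the student
    # frames:    iterable of sample content images x_i
    loss = 0.0
    for x in frames:
        with torch.no_grad():
            teacher_residual = n_t(x) - n_t_prime(x)
        student_residual = n_s(x) - n_s_prime(x)
        # Mean-squared norm chosen for illustration only.
        loss = loss + torch.mean((student_residual - teacher_residual) ** 2)
    return loss
```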
With reference to the first aspect, in certain implementations of the first aspect, the image loss function further includes a perceptual loss function, where the perceptual loss function includes a content loss and a style loss, the content loss is used to represent an image content difference between the N-frame predicted synthesized image and the N-frame sample content image corresponding thereto, and the style loss is used to represent an image style difference between the N-frame predicted synthesized image and the sample style image.
With reference to the first aspect, in certain implementations of the first aspect, the image loss function is obtained by weighting the low-rank loss function, the residual loss function, and the perceptual loss function.
With reference to the first aspect, in certain implementations of the first aspect, the parameters of the target style migration model are obtained through multiple iterations of a back propagation algorithm based on the image loss function.
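For illustration, one training iteration combining the weighted losses and a single back-propagation update might look like the following sketch; the loss callables, weighting coefficients and optimizer choice are assumptions rather than values given by this application.

```python
import torch

def training_step(optimizer, batch, losses, weights=(1.0, 1.0, 1.0)):
    # losses: dict of callables returning the perceptual, low-rank and
    #         residual losses for the current batch of sample frames.
    # weights: illustrative placeholders for the weighting coefficients.
    w_per, w_rank, w_res = weights
    loss = (w_per * losses['perceptual'](batch)
            + w_rank * losses['low_rank'](batch)
            + w_res * losses['residual'](batch))
    optimizer.zero_grad()
    loss.backward()    # back propagation of the image loss
    optimizer.step()   # one iteration of the parameter update
    return loss.item()

# Example: optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
```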
In a second aspect, a method of video style migration, comprising: acquiring a video to be processed, wherein the video to be processed comprises N frames of content images to be processed, and N is an integer greater than or equal to 2; carrying out image style migration processing on the N frames of content images to be processed according to a target style migration model to obtain N frames of synthetic images; obtaining the video after the style migration processing corresponding to the video to be processed according to the N frames of synthetic images,
wherein the parameters of the target style migration model are determined according to an image loss function used when the target style migration model performs style migration processing on N frames of sample content images, the image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on the N frames of sample content images and optical flow information, the second low-rank matrix is obtained based on N frames of predicted synthesized images and the optical flow information, the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frames of images in the N frames of sample content images, and the N frames of predicted synthesized images are images obtained by performing image style migration processing on the N frames of sample content images according to the sample style images through the target style migration model.
The image style migration is to fuse the image content in one content image a and the image style of one style image B together to generate one composite image C with the image content of a and the image style of B; the style in the image may include information such as texture features of the image; the content in the image may refer to semantic information in the image, i.e., may include high frequency information, low frequency information, etc. in the content image.
It should be understood that, for a matrix formed from multiple frames of images, a low-rank matrix may be used to represent regions that are not motion boundaries and that are present in all N frames of images. A sparse matrix may be used to represent regions that appear intermittently in the N frames of images; for example, the sparse matrix may correspond to a region that newly appears or disappears at the boundary of an image due to camera movement, or to the boundary region of a moving object.
In the embodiment of the application, a low-rank loss function is introduced when training the target style migration model used for video style migration processing. The low-rank loss function encourages the regions that are not motion boundaries and that appear in adjacent multi-frame images of the video to be processed to remain the same after the video undergoes style migration processing; that is, the rank of such a region in the video after style migration processing is close to the rank of that region in the video to be processed, so the stability of the video after style migration processing can be improved.
On the other hand, in the process of performing style migration processing on the video to be processed, the target style migration model provided in the embodiment of the present application does not need to calculate optical flow information between multiple frames of images included in the video to be processed, so that the target style migration model provided in the embodiment of the present application can improve stability, and at the same time, can shorten the time for style migration processing of the model, and improve the operating efficiency of the target style migration model.
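A minimal sketch of this inference path, assuming the trained target style migration model can simply be applied to each frame of the video to be processed:

```python
import torch

@torch.no_grad()
def stylize_video(target_model, frames):
    # frames: list of N to-be-processed content frames, each (1, C, H, W).
    # No optical flow is computed at inference time; each frame is passed
    # through the target style migration model independently.
    target_model.eval()
    return [target_model(frame) for frame in frames]
```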
In one possible implementation manner, the to-be-processed video may be a video captured by the electronic device through a camera, or the to-be-processed video may be a video obtained from inside the electronic device (for example, a video stored in an album of the electronic device, or a video obtained by the electronic device from a cloud).
It should be understood that the above-mentioned video to be processed may be a video with a style migration requirement, and the present application does not set any limit to the source of the video to be processed.
With reference to the second aspect, in certain implementations of the second aspect, the image loss function further includes a residual loss function obtained according to a difference between a first sample synthesized image and a second sample synthesized image, where the first sample synthesized image is an image obtained by performing an image style migration process on the N frames of sample content images through a first model, the second sample synthesized image is an image obtained by performing an image style migration process on the N frames of sample content images through a second model, the first model and the second model are image style migration models trained in advance according to the sample style images, and the second model includes an optical flow module, the first model does not include the optical flow module, and the optical flow module is configured to determine the optical flow information.
In the embodiment of the application, the residual loss function is introduced when the target style migration model is trained, so that the neural network model can learn the difference between the synthesized image output by the style migration model including the optical flow module and the synthesized image output by the style migration model not including the optical flow module in the training process, and the stability of the video after style migration processing obtained by the target migration model can be improved.
It should be understood that the difference between the first sample composite image and the second sample composite image may refer to a difference between corresponding pixel values of the first sample composite image and the second sample composite image. In one possible implementation, the first model and the second model may use the same sample content image and sample style image in the training stage; for example, the first model and the second model may refer to the same model during the training phase; however, in the testing stage, the second model also needs to calculate optical flow information among the multi-frame sample content images; the first model does not need to calculate optical flow information among a plurality of frames of images.
With reference to the second aspect, in some implementation manners of the second aspect, the first model and the second model are pre-trained teacher models, and the target style migration model is a target student model obtained by training a student model to be trained according to the residual loss function and the knowledge distillation algorithm.
It should be noted that the network structures of the above-mentioned student model and the target student model may be the same; that is, the student model may refer to a pre-trained style migration model that does not need optical flow information to be input in the testing stage, and the target student model is obtained by further training on the basis of the student model through the residual loss function and the low-rank loss function.
In one possible implementation, the pre-trained student model may be a student model pre-trained with a perceptual loss function, where the perceptual loss function is used to represent the video stylization effect, that is, it may be used to represent the content difference between the sample composite image and the sample content image and the style difference between the sample composite image and the sample style image.
In a possible implementation manner, the pre-trained student model may be a student model obtained by pre-training with a perceptual loss function and an optical flow loss function, where the optical flow loss function is used to represent the difference between corresponding pixel points of adjacent frame composite images.
With reference to the second aspect, in certain implementations of the second aspect, the residual loss function is derived according to the following equation:

L_res = Σ_i || ( N_T(x_i) - N'_T(x_i) ) - ( N_S(x_i) - N'_S(x_i) ) ||

wherein L_res represents the residual loss function; N_T represents the second model; N'_T represents the first model; N_S represents the student model to be trained; N'_S represents a pre-trained basic model, and the pre-trained basic model has the same network structure as the student model to be trained; x_i represents the i-th frame sample content image included in the sample video, and i is a positive integer.
In a possible implementation manner, the target style migration model may refer to a target student model, and when the target student model is trained, a student model to be trained may be trained according to a pre-trained first teacher model (excluding an optical flow module), a pre-trained second teacher model (including an optical flow module), and a pre-trained basic model, so as to obtain the target student model; the network structures of the student model to be trained, the pre-trained basic model and the target student model are the same, and the student model to be trained is trained through the low-rank loss function, the residual loss function and the perception loss function, so that the target student model is obtained.
The pre-trained basic model may be a style migration model that is pre-trained with a perceptual loss function and does not include an optical flow module in the testing stage; alternatively, the pre-trained basic model may refer to a style migration model that does not include an optical flow module in the testing stage and is pre-trained with a perceptual loss function and an optical flow loss function. The perceptual loss function is used for representing the content loss between the composite image and the content image and the style loss between the composite image and the style image; the optical flow loss function is used for representing the difference between corresponding pixel points of adjacent frame composite images.
In a possible implementation manner, in the process of training the student model to be trained, the residual loss function makes the difference between the migration results (also called synthesized images) output by the student model to be trained and by the pre-trained basic model continuously approach the difference between the migration results output by the second model and by the first model.
In the embodiment of the application, the target style migration model may refer to a target student model. A knowledge distillation method based on teacher-student learning is adopted, so that the difference between the style migration results output by the student model to be trained and by the pre-trained basic model continuously approaches the difference between the style migration results output by the teacher model including the optical flow module and by the teacher model not including the optical flow module; this training method can effectively avoid the ghosting phenomenon caused by inconsistent styles between the teacher models and the student model.
With reference to the second aspect, in certain implementations of the second aspect, the image loss function further includes a perceptual loss function, where the perceptual loss function includes a content loss and a style loss, the content loss is used to represent an image content difference between the N-frame predicted synthesized image and the N-frame sample content image corresponding thereto, and the style loss is used to represent an image style difference between the N-frame predicted synthesized image and the sample style image.
With reference to the second aspect, in certain implementations of the second aspect, the image loss function is obtained by weighting the low-rank loss function, the residual loss function, and the perceptual loss function.
With reference to the second aspect, in some implementations of the second aspect, the parameters of the target style migration model are obtained through multiple iterations of a back propagation algorithm based on the image loss function.
In a third aspect, a training apparatus for a style migration model is provided, including: the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring training data, the training data comprises N frames of sample content images, sample style images and N frames of synthetic images, the N frames of synthetic images are images obtained after image style migration processing is carried out on the N frames of sample content images according to the sample style images, and N is an integer greater than or equal to 2; the processing unit is used for carrying out image style migration processing on the N frames of sample content images according to the sample style images through a neural network model to obtain N frames of predicted synthetic images; determining parameters of the neural network model according to an image loss function between the N frames of sample content images and the N frames of predicted synthetic images, wherein the image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on the N frames of sample content images and optical flow information, the second low-rank matrix is obtained based on the N frames of predicted synthetic images and the optical flow information, and the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frames of images in the N frames of sample content images.
In a possible implementation manner, the functional unit/module included in the training apparatus is further configured to perform the method in any one of the first aspect and the first aspect.
It will be appreciated that the extensions, definitions, explanations and descriptions of the relevant content in the first aspect above also apply to the same content in the third aspect.
In a fourth aspect, an apparatus for video style migration is provided, comprising: the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a video to be processed, the video to be processed comprises N frames of content images to be processed, and N is an integer greater than or equal to 2; the processing unit is used for carrying out image style migration processing on the N frames of content images to be processed according to the target style migration model to obtain N frames of synthetic images; obtaining the video after the style migration processing corresponding to the video to be processed according to the N frames of synthetic images,
wherein the parameters of the target style migration model are determined according to an image loss function used when the target style migration model performs style migration processing on N frames of sample content images, the image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on the N frames of sample content images and optical flow information, the second low-rank matrix is obtained based on N frames of predicted synthesized images and the optical flow information, the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frames of images in the N frames of sample content images, and the N frames of predicted synthesized images are images obtained by performing image style migration processing on the N frames of sample content images according to the sample style images through the target style migration model.
In a possible implementation manner, the functional units/modules included in the apparatus are further configured to perform the method in any one of the second aspect and the second aspect.
It is to be understood that the extensions, definitions, explanations and descriptions of the relevant content in the second aspect above also apply to the same content in the fourth aspect.
In a fifth aspect, a training apparatus for a style migration model is provided, including: a memory for storing a program; a processor for executing the memory-stored program, the processor for performing, when the memory-stored program is executed: acquiring training data, wherein the training data comprises N frames of sample content images, sample style images and N frames of synthetic images, the N frames of synthetic images are images obtained by performing image style migration processing on the N frames of sample content images according to the sample style images, and N is an integer greater than or equal to 2; carrying out image style migration processing on the N frames of sample content images according to the sample style images through a neural network model to obtain N frames of predicted synthetic images; determining parameters of the neural network model according to an image loss function between the N frames of sample content images and the N frames of predicted synthetic images, wherein the image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on the N frames of sample content images and optical flow information, the second low-rank matrix is obtained based on the N frames of predicted synthetic images and the optical flow information, and the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frames of images in the N frames of sample content images.
In a possible implementation manner, the training apparatus includes a processor, and is further configured to perform the method in any one of the first aspect and the first implementation manner.
It will be appreciated that the extensions, definitions, explanations and descriptions of the relevant content in the first aspect above also apply to the same content in the fifth aspect.
In a sixth aspect, an apparatus for video style migration is provided, including: a memory for storing a program; a processor for executing the memory-stored program, the processor being configured, when the memory-stored program is executed, to: acquire a video to be processed, wherein the video to be processed comprises N frames of content images to be processed, and N is an integer greater than or equal to 2; carry out image style migration processing on the N frames of content images to be processed according to a target style migration model to obtain N frames of synthetic images; and obtain a video after style migration processing corresponding to the video to be processed according to the N frames of synthetic images, wherein parameters of the target style migration model are determined according to an image loss function used when the target style migration model performs style migration processing on N frames of sample content images, the image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on the N frames of sample content images and optical flow information, the second low-rank matrix is obtained based on N frames of predicted synthetic images and the optical flow information, the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frames of images in the N frames of sample content images, and the N frames of predicted synthetic images are images obtained by performing image style migration processing on the N frames of sample content images according to the sample style images through the target style migration model.
In a possible implementation manner, the processor included in the apparatus is further configured to perform the method in any one of the implementation manners of the second aspect.
It will be appreciated that the extensions, definitions, explanations and descriptions of the relevant content in the second aspect above also apply to the same content in the sixth aspect.
In a seventh aspect, a computer-readable medium is provided, which stores program code for execution by a device, the program code including instructions for performing the method in any one of the implementations of the first to second aspects and the first to second aspects.
In an eighth aspect, a computer program product containing instructions is provided, which when run on a computer causes the computer to perform the method in any one of the implementations of the first to second aspects and the first to second aspects.
In a ninth aspect, a chip is provided, where the chip includes a processor and a data interface, and the processor reads instructions stored in a memory through the data interface and executes the method in any implementation manner of the first aspect to the second aspect and the first aspect to the second aspect.
Optionally, as an implementation manner, the chip may further include a memory, where instructions are stored in the memory, and the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to execute the method in any one implementation manner of the first aspect to the second aspect and the first aspect to the second aspect.
Drawings
FIG. 1 is a schematic diagram of an artificial intelligence agent framework provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of an application scenario provided by an embodiment of the present application;
FIG. 3 is a system architecture provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a convolutional neural network provided in an embodiment of the present application;
fig. 5 is a schematic diagram of a chip hardware structure according to an embodiment of the present disclosure;
FIG. 6 is a system architecture provided by an embodiment of the present application;
FIG. 7 is a schematic flow chart diagram of a training method of a style migration model provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a training process of a style migration model provided by an embodiment of the present application;
FIG. 9 is a schematic flow chart diagram of a method for video style migration provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of a training phase and a testing phase provided by an embodiment of the present application;
FIG. 11 is a schematic block diagram of an apparatus for video style migration provided by an embodiment of the present application;
FIG. 12 is a schematic block diagram of a training apparatus for a style migration model provided by an embodiment of the present application;
FIG. 13 is a schematic block diagram of an apparatus for video style migration provided by an embodiment of the present application;
fig. 14 is a schematic block diagram of a training apparatus for a style migration model provided in an embodiment of the present application.
Detailed Description
The technical solution in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application; it is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments obtained by a person of ordinary skill in the art without any inventive work according to the embodiments of the present application are within the scope of the present application.
FIG. 1 shows a schematic diagram of an artificial intelligence body framework that describes the overall workflow of an artificial intelligence system, applicable to the general artificial intelligence field requirements.
The artificial intelligence theme framework 100 described above is described in detail below in terms of two dimensions, the "intelligent information chain" (horizontal axis) and the "Information Technology (IT) value chain" (vertical axis).
The "smart information chain" reflects a list of processes processed from the acquisition of data. For example, the general processes of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision making and intelligent execution and output can be realized. In this process, the data undergoes a "data-information-knowledge-wisdom" refinement process.
The 'IT value chain' reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (technologies for providing and processing information) up through the industrial ecology of the system.
(1) Infrastructure 110
The infrastructure can provide computing power support for the artificial intelligent system, realize communication with the outside world, and realize support through the basic platform.
The infrastructure may communicate with the outside through sensors, and the computing power of the infrastructure may be provided by a smart chip.
The smart chip may be a hardware acceleration chip such as a Central Processing Unit (CPU), a neural-Network Processing Unit (NPU), a Graphic Processing Unit (GPU), an Application Specific Integrated Circuit (ASIC), and a Field Programmable Gate Array (FPGA).
The infrastructure platform may include distributed computing framework and network, and may include cloud storage and computing, interworking network, and the like.
For example, for an infrastructure, data may be obtained through sensors and external communications and then provided to an intelligent chip in a distributed computing system provided by the base platform for computation.
(2) Data 120
Data at the upper level of the infrastructure is used to represent the data source for the field of artificial intelligence. The data relates to graphics, images, voice and text, and also relates to internet of things data of traditional equipment, including service data of an existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.
(3) Data processing 130
The data processing generally includes processing modes such as data training, machine learning, deep learning, searching, reasoning, decision making and the like.
The machine learning and the deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Inference means a process of simulating an intelligent human inference mode in a computer or an intelligent system, using formalized information to think about and solve a problem by a machine according to an inference control strategy, and a typical function is searching and matching.
The decision-making refers to a process of making a decision after reasoning intelligent information, and generally provides functions of classification, sequencing, prediction and the like.
(4) General capabilities 140
After the above-mentioned data processing, further based on the result of the data processing, some general-purpose capabilities may be formed, such as algorithms or a general-purpose system, for example, translation, analysis of text, computer vision processing, speech recognition, recognition of images, etc.
(5) Intelligent product and industry applications 150
The intelligent product and industry application refers to the product and application of an artificial intelligence system in various fields, and is the encapsulation of an artificial intelligence integral solution, the intelligent information decision is commercialized, and the landing application is realized, and the application field mainly comprises: intelligent manufacturing, intelligent transportation, intelligent home, intelligent medical treatment, intelligent security, automatic driving, safe city, intelligent terminal and the like.
Fig. 2 is a schematic diagram of an application scenario provided in an embodiment of the present application.
As shown in fig. 2, the method for video style migration according to the embodiment of the present application may be applied to an intelligent terminal; for example, a to-be-processed video captured by a camera of the intelligent terminal, or a to-be-processed video stored in an album of the intelligent terminal, is input into the target style migration model provided in the embodiment of the present application to obtain a video after style migration processing. By adopting the target style migration model provided by the embodiment of the application, the stability of the video after style migration processing can be ensured, that is, the fluency of the video obtained after style migration processing is ensured.
In one example, the method for video style migration provided by the embodiment of the present application may be applied to an offline scene.
For example, a video to be processed is acquired and input into the target style migration model, so that a video after style migration processing, that is, an output stable stylized video, is obtained.
In one example, the method for video style migration provided by the embodiment of the application can be applied to an online scene.
For example, a video recorded in real time by the intelligent terminal is obtained, and the video recorded in real time is input into the target style migration model, so as to obtain a video after style migration processing that is output in real time; this can be used, for example, for scenarios such as real-time display on exhibition stands.
For example, when an online video call is performed through the intelligent terminal, a user video shot by a camera in real time can be input to the target style migration model, so that an output video after style migration processing is obtained. For example, a stable stylized video can be displayed to others in real time, and interestingness is improved.
The target style migration model is a pre-trained model obtained by training through the training method of the style migration model provided by the embodiment of the application.
Illustratively, the intelligent terminal may be mobile or fixed; for example, the smart terminal may be a mobile phone having an image processing function, a Tablet Personal Computer (TPC), a media player, a smart tv, a notebook computer (LC), a Personal Digital Assistant (PDA), a Personal Computer (PC), a camera, a camcorder, a smart watch, a Wearable Device (WD), an autonomous vehicle, or the like, which is not limited in the embodiment of the present application.
It should be understood that the above description is illustrative of the application scenario and does not limit the application scenario of the present application in any way.
Since the embodiments of the present application relate to the application of a large number of neural networks, for the sake of understanding, the following description will be made first of all with respect to terms and concepts of the neural networks to which the embodiments of the present application may relate.
1. Neural network
A neural network may be composed of neural units. A neural unit may be an arithmetic unit that takes x_s and an intercept of 1 as inputs, and its output may be:

h_{W,b}(x) = f(W^T x + b) = f( Σ_{s=1}^{n} W_s x_s + b )

wherein s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, which is used to introduce a nonlinear characteristic into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer, and the activation function may be, for example, a sigmoid function. A neural network is a network formed by connecting a plurality of single neural units together, that is, the output of one neural unit can be the input of another neural unit. The input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be a region composed of several neural units.
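For example, a single neural unit of this form can be written as the following sketch, with sigmoid playing the role of the activation function f:

```python
import torch

def neural_unit(x, w, b):
    # x: input vector (x_1, ..., x_n); w: weights W_s; b: bias (intercept).
    # The sigmoid plays the role of the activation function f.
    return torch.sigmoid(torch.dot(w, x) + b)
```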
2. Deep neural network
Deep Neural Networks (DNNs), also called multi-layer neural networks, can be understood as neural networks with multiple hidden layers. The DNNs are divided according to the positions of different layers, and the neural networks inside the DNNs can be divided into three categories: input layer, hidden layer, output layer. Generally, the first layer is an input layer, the last layer is an output layer, and the middle layers are hidden layers. The layers are all connected, that is, any neuron of the ith layer is necessarily connected with any neuron of the (i + 1) th layer.
Although a DNN appears complex, the work of each layer is not complex; it is simply the following linear relational expression: y = α(W x + b), where x is the input vector, y is the output vector, b is an offset (bias) vector, W is a weight matrix (also called coefficients), and α() is an activation function. Each layer simply performs this operation on the input vector x to obtain the output vector y. Because a DNN has many layers, the number of coefficients W and offset vectors b is also large. These parameters are defined in the DNN as follows. Taking the coefficient W as an example: assume that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as W^3_{24}. The superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer.
In summary, the coefficient from the k-th neuron at layer L-1 to the j-th neuron at layer L is defined as W^L_{jk}. Note that the input layer has no W parameter. In deep neural networks, more hidden layers make the network better able to depict complex situations in the real world. Theoretically, a model with more parameters has higher complexity and a larger "capacity", which means that it can accomplish more complex learning tasks. Training the deep neural network is the process of learning the weight matrices, and its final goal is to obtain the weight matrices (formed by the vectors W of many layers) of all layers of the trained deep neural network.
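As a sketch of the layer-by-layer computation described above, where weights[l] is the matrix whose entry (j, k) is the coefficient W^L_{jk} from the k-th neuron of layer L-1 to the j-th neuron of layer L:

```python
import torch

def dnn_forward(x, weights, biases):
    # weights[l][j][k] corresponds to W^{L}_{jk}: the coefficient from the
    # k-th neuron of layer L-1 to the j-th neuron of layer L.
    y = x
    for W, b in zip(weights, biases):
        y = torch.sigmoid(W @ y + b)   # y = alpha(W x + b) for each layer
    return y
```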
3. Convolutional neural network
A Convolutional Neural Network (CNN) is a deep neural network with a convolutional structure. The convolutional neural network comprises a feature extractor consisting of convolutional layers and sub-sampling layers, which can be regarded as a filter. The convolutional layer is a neuron layer for performing convolutional processing on an input signal in a convolutional neural network. In convolutional layers of convolutional neural networks, one neuron may be connected to only a portion of the neighbor neurons. In a convolutional layer, there are usually several characteristic planes, and each characteristic plane may be composed of several neural units arranged in a rectangular shape. The neural units of the same feature plane share weights, where the shared weights are convolution kernels. Sharing weights may be understood as the way image information is extracted is location independent. The convolution kernel can be initialized in the form of a matrix of random size, and can be learned to obtain reasonable weights in the training process of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing connections between layers of the convolutional neural network, while reducing the risk of overfitting.
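For example, a single convolutional layer with shared weights can be written as the following PyTorch sketch; the channel counts and image size are arbitrary placeholders:

```python
import torch
import torch.nn as nn

# One convolutional layer: the same 3x3 kernels (shared weights) are slid
# over every spatial location, so feature extraction is position independent.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
feature_planes = conv(torch.randn(1, 3, 224, 224))   # 16 feature planes, 224x224
```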
4. Loss function
In the process of training a deep neural network, because the output of the deep neural network is expected to be as close as possible to the value that is really desired to be predicted, the weight vector of each layer of the neural network can be updated according to the difference between the predicted value of the current network and the really desired target value (of course, an initialization process is usually performed before the first update, that is, parameters are pre-configured for each layer of the deep neural network). For example, if the predicted value of the network is high, the weight vectors are adjusted to make the predicted value lower, and the adjustment continues until the deep neural network can predict the really desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the role of loss functions or objective functions, which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing this loss as much as possible.
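As a simple worked example, a mean-squared-error loss measures this difference between predicted values and target values:

```python
import torch

prediction = torch.tensor([2.5, 0.0, 2.0])
target = torch.tensor([3.0, -0.5, 2.0])
# Mean-squared error as one concrete loss function: the larger its output
# value (loss), the further the prediction is from the desired target.
loss = torch.mean((prediction - target) ** 2)   # tensor(0.1667)
```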
5. Back propagation algorithm
The neural network can adopt a Back Propagation (BP) algorithm to correct the size of parameters in the initial neural network model in the training process, so that the reconstruction error loss of the neural network model is smaller and smaller.
Specifically, the error loss is generated by transmitting the input signal in the forward direction until the output, and the parameters in the initial neural network model are updated by reversely propagating the error loss information, so that the error loss is converged. The back propagation algorithm is a back propagation motion with error loss as a dominant factor, aiming at obtaining the optimal parameters of the neural network model, such as a weight matrix.
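Exemplarily, one training step combining the forward pass, the back propagation of the error loss, and the parameter update may be sketched as follows; the model, optimizer, and learning rate are assumptions for illustration.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)                                  # stand-in for a neural network model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

x, y = torch.randn(8, 4), torch.randn(8, 1)

prediction = model(x)          # forward: transmit the input signal until the output
loss = loss_fn(prediction, y)  # error loss at the output

optimizer.zero_grad()
loss.backward()                # back propagation of the error loss information
optimizer.step()               # update parameters so that the error loss converges
```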
6. Image style migration
Image style migration fuses the image content of a content image A with the image style of a style image B, so as to generate a composite image C that has the image content of A and the image style of B.
Exemplarily, performing image style migration on the content image 1 according to the style image 1 to obtain a composite image 1, wherein the composite image 1 includes the content in the content image 1 and the style in the style image 1; similarly, the content image 1 is subjected to image style migration according to the style image 2, and a composite image 2 can be obtained, wherein the composite image 2 comprises the content in the content image 1 and the style in the style image 2.
The style image may be a reference image for style migration, and the style of an image may include the texture features of the image and its artistic expression form; for example, the style of a famous painting, where the artistic expression form may include cartoon, oil painting, watercolor, ink-and-wash and other styles. The content image may refer to the image that needs to undergo style migration, and the content of an image may refer to the semantic information in the image, which may include high-frequency information, low-frequency information, and the like in the content image.
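Exemplarily, applying an already trained image style migration model to a content image may be sketched as follows; the model file name, its interface, and the tensor shapes are assumptions for illustration.

```python
import torch

# Assumed: a style migration model trained for one particular style image
# (e.g., an oil-painting style), exported beforehand as a TorchScript file.
style_model = torch.jit.load("style_model_oil_painting.pt")
style_model.eval()

content_image = torch.rand(1, 3, 512, 512)         # content image A
with torch.no_grad():
    composite_image = style_model(content_image)   # composite image C: content of A, style of B
```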
7. Optical flow information
Optical flow (optical flow) is used to represent the instantaneous speed of the pixel motion of a spatially moving object on the observation imaging plane. Optical flow methods use the change of pixels in an image sequence in the time domain and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, and thereby calculate the motion information of objects between adjacent frames.
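Exemplarily, optical flow can be used to warp (align) one frame onto an adjacent frame, as in the sketch below; the flow convention (per-pixel displacements) and the use of grid_sample are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Warp `frame` (N, C, H, W) using per-pixel displacements `flow` (N, 2, H, W)."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0)  # base pixel coordinates, (1, 2, H, W)
    coords = grid + flow                                      # coordinates shifted by the flow
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    return F.grid_sample(frame, torch.stack((gx, gy), dim=-1), align_corners=True)

prev_frame = torch.rand(1, 3, 240, 320)
flow = torch.zeros(1, 2, 240, 320)   # zero flow: the warp is the identity mapping
aligned = warp(prev_frame, flow)     # previous frame aligned to the current frame
```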
8. Knowledge distillation
Knowledge distillation refers to a key technology for miniaturizing a deep learning model so that it meets the deployment requirements of terminal equipment. Compared with compression technologies such as quantization and sparsification, it can compress a model without requiring specific hardware support. The knowledge distillation technology adopts a teacher-student model learning strategy: the teacher model generally has a large number of parameters and cannot meet the deployment requirement, while the student model has a small number of parameters and can be deployed directly. By designing an effective knowledge distillation algorithm, the student model learns and imitates the behavior of the teacher model, effective knowledge transfer is carried out, and the student model can finally exhibit the same processing capability as the teacher model.
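Exemplarily, the simplest pixel-level form of knowledge distillation may be sketched as follows; the network sizes are assumptions, and the residual distillation described later in this application differs from this pixel-level form.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(64, 3, 3, padding=1))   # larger model, hard to deploy
student = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(8, 3, 3, padding=1))    # small model, directly deployable

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
x = torch.rand(4, 3, 128, 128)

with torch.no_grad():
    teacher_out = teacher(x)     # behavior that the student should imitate
student_out = student(x)
distill_loss = F.mse_loss(student_out, teacher_out)

optimizer.zero_grad()
distill_loss.backward()
optimizer.step()
```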
First, a system architecture of a method for transferring a video style and a training method for a style transfer model provided in the embodiments of the present application is introduced.
Fig. 3 provides a system architecture 200 according to an embodiment of the present application.
As shown in the system architecture 200 in fig. 3, a data collection device 260 is used to collect training data. For the training method of the style migration model in the embodiment of the present application, the target style migration model may be further trained through training data, that is, the training data collected by the data collecting device 260.
For example, in the embodiment of the present application, the training data for training the target style migration model may be N frames of sample content images, sample style images, and N frames of sample composite images, where the N frames of sample composite images are images obtained by performing image style migration processing on the N frames of sample content images according to the sample style images, and N is an integer greater than or equal to 2.
After the training data is collected, the data collection device 260 stores the training data in the database 230, and the training device 220 trains the target model/rule 201 (i.e., the target style migration model in the embodiment of the present application) according to the training data maintained in the database 230. The training device 220 inputs training data into the target style migration model until a difference between output data of the training target style migration model and sample data meets a preset condition (e.g., a difference between predicted data and sample data is less than a certain threshold, or a difference between predicted data and sample data remains unchanged or no longer decreases), thereby completing training of the target model/rule 201.
Wherein, the output data may refer to N frames of predicted composite images output by the target style migration model; the sample data may refer to an N-frame sample composite image.
In the embodiment provided by the present application, the target model/rule 201 is obtained by training a target style migration model, and the target style migration model may be used to perform style migration processing on a video to be processed. It should be noted that, in practical applications, the training data maintained in the database 230 may not necessarily all come from the collection of the data collection device 260, and may also be received from other devices.
It should be noted that, the training device 220 may not necessarily perform the training of the target model/rule 201 completely according to the training data maintained by the database 230, and may also obtain the training data from the cloud or other places for performing the model training, and the above description should not be taken as a limitation to the embodiments of the present application.
It should also be noted that at least a portion of the training data maintained in the database 230 may also be used in the process in which the execution device 210 processes the video to be processed.
The target model/rule 201 obtained by training according to the training device 220 may be applied to different systems or devices, for example, the execution device 210 shown in fig. 3, where the execution device 210 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an AR/VR, a vehicle-mounted terminal, and may also be a server or a cloud.
In fig. 3, the execution device 210 configures an input/output (I/O) interface 212 for data interaction with an external device, and a user may input data to the I/O interface 212 through the client device 240, where the input data may include: the client device inputs the video to be processed.
The preprocessing module 213 and the preprocessing module 214 are configured to perform preprocessing according to input data (e.g., video to be processed) received by the I/O interface 212. In the embodiment of the present application, the input data may be processed directly by the calculation module 211 without the preprocessing module 213 and the preprocessing module 214 (or only one of them may be used).
In the process that the execution device 210 preprocesses the input data, or in the process that the calculation module 211 of the execution device 210 performs the calculation or other related processes, the execution device 210 may call the data, the code, and the like in the data storage system 250 for corresponding processes, or store the data, the instruction, and the like obtained by corresponding processes in the data storage system 250.
Finally, the I/O interface 212 returns the processing result, i.e., the video obtained after the style migration processing described above (corresponding to the video to be processed), to the client device 240, so as to provide it to the user.
It should be noted that the training device 220 may generate corresponding target models/rules 201 according to different training data for different targets or different tasks, and the corresponding target models/rules 201 may be used to achieve the targets or complete the tasks, so as to provide the user with the required results.
In the case shown in fig. 3, in one case, the user may manually give the input data, which may be operated through an interface provided by the I/O interface 212.
Alternatively, the client device 240 may automatically send the input data to the I/O interface 212, and if the client device 240 is required to automatically send the input data to obtain authorization from the user, the user may set the corresponding permissions in the client device 240. The user can view the result output by the execution device 210 at the client device 240, and the specific presentation form can be display, sound, action, and the like. The client device 240 may also serve as a data collection terminal, collecting input data of the input I/O interface 212 and output results of the output I/O interface 212 as new sample data, and storing the new sample data in the database 230. Of course, the input data input to the I/O interface 212 and the output result output from the I/O interface 212 as shown in the figure may be directly stored in the database 230 as new sample data by the I/O interface 212 without being collected by the client device 240.
It should be noted that fig. 3 is only a schematic diagram of a system architecture provided in the embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation. For example, in FIG. 3, the data storage system 250 is an external memory with respect to the execution device 210, in other cases, the data storage system 250 may be disposed in the execution device 210.
As shown in fig. 3, a target model/rule 201 is obtained by training according to a training device 220, where the target model/rule 201 may be a target style migration model in this embodiment, specifically, the target style migration model provided in this embodiment may be a deep neural network, a convolutional neural network, or may be a deep convolutional neural network.
The following description focuses on the structure of the convolutional neural network with reference to fig. 4. As described in the introduction of the basic concept above, the convolutional neural network is a deep neural network with a convolutional structure, and is a deep learning (deep learning) architecture, and the deep learning architecture refers to performing multiple levels of learning at different abstraction levels through a machine learning algorithm. As a deep learning architecture, a convolutional neural network is a feed-forward artificial neural network in which individual neurons can respond to an image input thereto.
The network structure of the style migration model in the embodiment of the present application may be as shown in fig. 4. In fig. 4, convolutional neural network 300 may include an input layer 310, a convolutional/pooling layer 320 (where the pooling layer is optional), and a neural network layer 330. The input layer 310 may obtain an image to be processed, and deliver the obtained image to be processed to the convolutional layer/pooling layer 320 and the following neural network layer 330 for processing, so as to obtain a processing result of the image. The following describes the internal layer structure in CNN300 in fig. 4 in detail.
Convolutional layer/pooling layer 320:
the convolutional layer/pooling layer 320 as shown in FIG. 4 may include layers such as examples 321-326; for example: in one implementation, 321 layers are convolutional layers, 322 layers are pooling layers, 323 layers are convolutional layers, 324 layers are pooling layers, 325 layers are convolutional layers, 326 layers are pooling layers; in another implementation, 321, 322 are convolutional layers, 323 is a pooling layer, 324, 325 are convolutional layers, and 326 is a pooling layer, i.e., the output of a convolutional layer can be used as the input of a subsequent pooling layer, or can be used as the input of another convolutional layer to continue the convolution operation.
The inner working principle of one convolution layer will be described below by taking convolution layer 321 as an example.
Convolution layer 321 may include a plurality of convolution operators, also called kernels, whose role in image processing is equivalent to a filter for extracting specific information from the input image matrix, and the convolution operator may be essentially a weight matrix, which is usually predefined, and during the convolution operation on the image, the weight matrix is usually processed on the input image pixel by pixel (or two pixels by two pixels, etc., depending on the value of step size stride) in the horizontal direction, so as to complete the task of extracting specific features from the image. The size of the weight matrix should be related to the size of the image, and it should be noted that the depth dimension (depth dimension) of the weight matrix is the same as the depth dimension of the input image, and the weight matrix extends to the entire depth of the input image during the convolution operation. Thus, convolving with a single weight matrix will produce a single depth dimension of the convolved output, but in most cases not a single weight matrix is used, but a plurality of weight matrices of the same size (row by column), i.e. a plurality of matrices of the same type, are applied. The outputs of each weight matrix are stacked to form the depth dimension of the convolved image, where the dimension is understood to be determined by "plurality" as described above.
Different weight matrices may be used to extract different features in the image, e.g., one weight matrix to extract image edge information, another weight matrix to extract a particular color of the image, yet another weight matrix to blur unwanted noise in the image, etc. The plurality of weight matrices have the same size (row × column), the sizes of the convolution feature maps extracted by the plurality of weight matrices having the same size are also the same, and the extracted plurality of convolution feature maps having the same size are combined to form the output of the convolution operation.
The weight values in these weight matrices need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 300 can make correct prediction.
When convolutional neural network 300 has multiple convolutional layers, the initial convolutional layers (e.g., 321) tend to extract more general features, which may also be referred to as low-level features; as the depth of convolutional neural network 300 increases, the later convolutional layers (e.g., 326) extract increasingly complex features, such as features with high-level semantics, and features with higher semantics are more suitable for the problem to be solved.
Since it is often desirable to reduce the number of training parameters, it is often desirable to periodically introduce pooling layers after the convolutional layer, where the layers 321-326 as illustrated by 320 in fig. 4 may be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers. The purpose of the pooling layer is to reduce the spatial size of the image during image processing. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to smaller sized images. The average pooling operator may calculate pixel values in the image over a certain range to produce an average as a result of the average pooling. The max pooling operator may take the pixel with the largest value in a particular range as the result of the max pooling. In addition, just as the size of the weighting matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after the processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel point in the image output by the pooling layer represents an average value or a maximum value of a corresponding sub-region of the image input to the pooling layer.
The neural network layer 330:
after processing by convolutional layer/pooling layer 320, convolutional neural network 300 is not sufficient to output the required output information. Because, as previously described, the convolutional layer/pooling layer 320 only extracts features and reduces the parameters brought by the input image. However, to generate the final output information (class information required or other relevant information), the convolutional neural network 300 needs to generate one or a set of the number of required classes of output using the neural network layer 330. Therefore, a plurality of hidden layers (331, 332 to 33n shown in fig. 4) and an output layer 340 may be included in the neural network layer 330, and parameters included in the plurality of hidden layers may be obtained by pre-training according to related training data of a specific task type, for example, the task type may include image recognition, image classification, image detection, image super-resolution reconstruction, and the like.
After the hidden layers in the neural network layer 330, the last layer of the whole convolutional neural network 300 is the output layer 340. The output layer 340 has a loss function similar to the classification cross entropy, which is specifically used for calculating the prediction error. Once the forward propagation of the whole convolutional neural network 300 (i.e., the propagation from 310 to 340 in fig. 4) is completed, the backward propagation (i.e., the propagation from 340 to 310 in fig. 4) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 300, i.e., the error between the result output by the convolutional neural network 300 through the output layer and the ideal result.
It should be noted that the convolutional neural network shown in fig. 4 is only an example of a structure of a target style migration model in the embodiment of the present application, and in a specific application, a style migration model adopted by the method for video style migration in the embodiment of the present application may also exist in the form of other network models.
Fig. 5 is a hardware structure of a chip provided in an embodiment of the present application, where the chip includes a neural-Network Processing Unit (NPU) 400. The chip may be provided in the execution device 210 shown in fig. 3 to complete the calculation work of the calculation module 211. The chip may also be disposed in the training device 220 as shown in fig. 3 to complete the training work of the training device 220 and output the target model/rule 201. The algorithm for each layer in the convolutional neural network shown in fig. 4 can be implemented in a chip as shown in fig. 5.
The NPU 400 is mounted as a coprocessor on a main processing unit (CPU), and tasks are allocated by the main CPU. The core portion of the NPU 400 is an arithmetic circuit 403, and the controller 404 controls the arithmetic circuit 403 to extract data in a memory (weight memory or input memory) and perform arithmetic.
In some implementations, the arithmetic circuit 403 includes a plurality of processing units (PEs) therein. In some implementations, the operational circuitry 403 is a two-dimensional systolic array. The arithmetic circuit 403 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry 403 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit 403 fetches the data corresponding to the matrix B from the weight memory 402 and buffers it in each PE in the arithmetic circuit 403. The arithmetic circuit 403 takes the matrix a data from the input memory 401 and performs matrix operation with the matrix B, and partial or final results of the obtained matrix are stored in an accumulator 408 (accumulator).
The vector calculation unit 407 may further process the output of the operation circuit 403, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, the vector calculation unit 407 may be used for network calculations of non-convolution/non-FC layers in a neural network, such as pooling (pooling), batch normalization (batch normalization), local response normalization (local response normalization), and the like.
In some implementations, the vector calculation unit 407 can store the processed output vector to the unified memory 406. For example, the vector calculation unit 407 may apply a non-linear function to the output of the arithmetic circuit 403, such as a vector of accumulated values, to generate the activation value. In some implementations, the vector calculation unit 407 generates normalized values, combined values, or both.
In some implementations, the vector of processed outputs can be used as activation inputs to the arithmetic circuitry 403, for example for use in subsequent layers in a neural network.
The unified memory 406 is used to store input data as well as output data.
A memory unit access controller 405 (DMAC) is used to transfer the input data in the external memory into the input memory 401 and/or the unified memory 406, to transfer the weight data in the external memory into the weight memory 402, and to transfer the data in the unified memory 406 into the external memory.
A bus interface unit 410 (BIU) for implementing interaction between the main CPU, the DMAC, and the instruction fetch memory 409 through a bus.
An instruction fetch buffer 409 (instruction fetch buffer) connected to the controller 404 is used for storing instructions used by the controller 404.
And the controller 404 is configured to call the instruction cached in the instruction fetch memory 409 to implement controlling of the working process of the operation accelerator.
Generally, the unified memory 406, the input memory 401, the weight memory 402, and the instruction fetch memory 409 are On-Chip (On-Chip) memories, and the external memory is a memory external to the NPU, and the external memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a High Bandwidth Memory (HBM), or other readable and writable memories.
The operation of each layer in the convolutional neural network shown in fig. 4 may be performed by the operation circuit 403 or the vector calculation unit 407.
The execution device 210 in fig. 3 described above is capable of executing the steps of the method for video style migration in the embodiment of the present application, and the CNN model shown in fig. 4 and the chip shown in fig. 5 may also be used for executing the steps of the method for video style migration in the embodiment of the present application.
Fig. 6 illustrates a system architecture 500 according to an embodiment of the present application. The system architecture includes a local device 520, a local device 530, and an execution device 510 and a data storage system 550, wherein the local device 520 and the local device 530 are connected to the execution device 510 through a communication network.
The execution device 510 may be implemented by one or more servers. Optionally, the execution device 510 may be used with other computing devices, such as: data storage, routers, load balancers, and the like. The enforcement devices 510 may be disposed on one physical site or distributed across multiple physical sites. The execution device 510 may use data in the data storage system 550 or call program code in the data storage system 550 to implement the method of video style migration of the embodiments of the present application.
It should be noted that the execution device 510 may also be referred to as a cloud device, and at this time, the execution device 510 may be deployed in the cloud.
Specifically, the execution device 510 may perform the following process:
acquiring training data, wherein the training data comprises N frames of sample content images, sample style images and N frames of synthetic images, the N frames of synthetic images are images obtained by performing image style migration processing on the N frames of sample content images according to the sample style images, and N is an integer greater than or equal to 2; carrying out image style migration processing on the N frames of sample content images according to the sample style images through a neural network model to obtain N frames of predicted synthetic images; determining parameters of the neural network model according to an image loss function between the N frames of sample content images and the N frames of predicted synthetic images, wherein the image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on the N frames of sample content images and optical flow information, the second low-rank matrix is obtained based on the N frames of predicted synthetic images and the optical flow information, and the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frames of images in the N frames of sample content images.
Alternatively, the execution device 510 may perform the following process:
acquiring a video to be processed, wherein the video to be processed comprises N frames of content images to be processed, and N is an integer greater than or equal to 2; carrying out image style migration processing on the N frames of content images to be processed according to a target style migration model to obtain N frames of synthetic images; obtaining a video after style migration processing corresponding to the video to be processed according to the N frames of synthetic images, wherein parameters of the target style migration model are determined according to an image loss function for performing style migration processing on the N frames of sample content images by the target style migration model, the image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on the N frames of sample content images and optical flow information, the second low-rank matrix is obtained based on N frames of predicted synthetic images and the optical flow information, the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frames of images in the N frames of sample content images, and the N frames of predicted synthetic images are images obtained by performing image style migration processing on the N frames of sample content images according to the sample style images through the target style migration model.
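Exemplarily, the second process above may be sketched as follows; the function and variable names are assumptions for illustration, and no optical flow information is computed at this stage.

```python
import torch

def stylize_video(frames, target_style_model):
    """Apply the trained target style migration model to a video frame by frame.

    `frames` is a sequence of N content images to be processed; optical flow is
    only used while training the model, so none is computed here.
    """
    target_style_model.eval()
    styled_frames = []
    with torch.no_grad():
        for frame in frames:                               # frame: tensor of shape (3, H, W)
            out = target_style_model(frame.unsqueeze(0))   # one composite image
            styled_frames.append(out.squeeze(0))
    return styled_frames   # N composite images forming the style-migrated video
```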
The user may operate respective user devices (e.g., local device 520 and local device 530) to interact with the execution device 510. Each local device may represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set-top box, game console, and so forth.
The local devices of each user may interact with the enforcement device 510 via a communication network of any communication mechanism/standard, such as a wide area network, a local area network, a peer-to-peer connection, etc., or any combination thereof.
In one implementation, the local device 520 and the local device 530 may obtain relevant parameters of the target style migration model from the execution device 510, deploy the target style migration model on the local device 520 and the local device 530, and perform video style migration processing and the like by using the target style migration model.
In another implementation, the execution device 510 may directly deploy the target style migration model, and the execution device 510 obtains the to-be-processed video from the local device 520 and the local device 530, and performs style migration processing on the to-be-processed video according to the target style migration model, and the like.
At present, video style migration models that use optical flow information to stabilize the stylized video mainly include two types: the first type uses optical flow in the training process of the style migration model, but does not introduce optical flow information in the testing stage; the second type blends an optical flow module into the structure of the style migration model. However, with the first method, the computational efficiency of the style migration model in the testing stage can be ensured, but the stability of the obtained video after the style migration processing is poor; the second method can ensure the stability of the output video after the style migration processing, but because an optical flow module is introduced, the optical flow information between the image frames included in the video needs to be calculated in the testing stage, so the computational efficiency of the style migration model in the testing stage cannot be ensured.
In view of this, the embodiment of the application provides a training method for a style migration model and a method for video style migration, a low-rank loss function is introduced in a process of training the style migration model for videos, and through learning of low-rank information, stability of videos after the style migration and original videos can be synchronized, so that stability of videos after the style migration processing obtained by a target migration model can be improved; in addition, the style migration model provided by the embodiment of the application does not need to calculate optical flow information among multiple frames of images included in the video in a testing stage, namely, in the process of carrying out style migration processing on the video to be processed, so that the target style migration model provided by the embodiment of the application can improve stability, shorten the time of style migration processing of the model and improve the operating efficiency of the target style migration model.
FIG. 7 is a schematic flow chart diagram illustrating a method 600 for training a style migration model provided by an embodiment of the present application, which may be performed by an apparatus capable of image style migration; for example, the training method may be performed by the execution device 510 in fig. 6, or may be performed by the local device 520. The training method 600 includes steps S610 to S630, which are described in detail below.
And S610, acquiring training data.
The training data may include N frames of sample content images, sample style images, and N frames of synthetic images, where the N frames of synthetic images are images obtained by performing image style migration processing on the N frames of sample content images according to the sample style images, and N is an integer greater than or equal to 2.
Illustratively, the N-frame sample content image may refer to N consecutive sample content images included in the sample video; the N-frame composite image may be a composite image including N consecutive frames included in a video obtained by performing the style migration process on the sample video according to the sample style image.
It should be understood that the style migration processing for a single frame image, i.e., the image style migration, only needs to consider the content in the content image and the style in the style image; however, for a video, because a plurality of frames of continuous videos are included in the video, the style migration of the video needs to consider not only the stylizing effect of the images but also the stability among the images of the plurality of frames; namely, the fluency of the video after the style migration processing needs to be ensured, and noises such as screen flashing, artifacts and the like are avoided.
It should be noted that the N frames of sample content images refer to N frames of adjacent images in the video; the N-frame composite image is an image corresponding to the N-frame sample content image.
S620, carrying out image style migration processing on the N frames of sample content images according to the sample style images through the neural network model to obtain N frames of predicted synthetic images.
Illustratively, N frames of sample content images and N frames of sample style images included in the sample video may be input to the neural network model.
For example, N frames of sample content images may be respectively input into the neural network model one frame by one frame, and the neural network model may perform image style migration processing on the one frame of sample content image according to the sample style image, so as to obtain one frame of predicted composite image corresponding to the one frame of sample content image; after the above process is performed N times, N frames of predicted composite images corresponding to the N frames of sample content images can be obtained.
For example, a plurality of images of N frames of sample content images may be input to the neural network model once, and the neural network model may perform image style migration processing on the plurality of sample content images according to the sample style images.
S630, determining parameters of the neural network model according to an image loss function between the N frames of sample content images and the N frames of predicted synthetic images.
The image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on N frames of sample content images and optical flow information, the second low-rank matrix is obtained based on N frames of predicted synthesized images and the optical flow information, and the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frames of images in the N frames of sample content images.
It should be noted that, in the matrix formed by the multiple frames of images, the low-rank matrix may be used to represent the regions that are not motion boundaries and all appear in the N frames of images. The sparse matrix can be used for representing the intermittent occurrence region of the N frames of images; for example, the sparse matrix may refer to a region newly appearing or disappearing at the boundary of an image due to camera movement, or a boundary region of a moving object.
For example, the N frames of sample content images may refer to images in which a user is moving, and a low-rank matrix formed by the N frames of sample content images may be used to represent regions that are not motion boundaries and all appear in the N frames of sample content images; for example, a low rank matrix composed of N frames of sample content may be used to represent a background region where no motion occurs, or a region where users are present in each of the N frames of sample content images and are not motion boundaries.
Exemplarily, it is assumed that a video includes consecutive 5 frames of images, i.e., consecutive 5 frames of sample content images; obtaining 5 frames of style migration results, namely 5 frames of sample synthetic images, of the 5 frames of sample content images subjected to style migration processing; the low rank matrix may be calculated as follows:
step 1, calculating optical flow information between image frames according to 5 frames of images in a video, namely 5 frames of sample content images;
step 2, calculating mask information according to the optical flow information, wherein the mask information can be used for representing a change area in two continuous frames of images obtained according to the optical flow information;
step 3, calculating the low-rank part after aligning the 1st, 2nd, 4th and 5th frame images with the 3rd frame image according to the optical flow information and the mask information, and setting the sparse part to 0;
and 4, respectively generating vectors from the aligned 5 frames of images and combining the vectors into a matrix according to the column, wherein the matrix can be a low-rank matrix.
It should be understood that, when computing the low-rank loss function, the goal is to make the rank of the low-rank part of the image matrix formed by the 5 frames of sample content images approximate the rank of the low-rank part of the image matrix formed by the 5 frames of sample composite images; the rank of the low-rank part can be continuously optimized by optimizing the nuclear norm, and the nuclear norm is obtained by performing singular value decomposition on the matrix.
Illustratively, consider K successive frame images x_1, ..., x_K, the optical flow information f corresponding to them (which may include, for example, forward optical flow information and reverse optical flow information), and the composite images N_S(x_1), ..., N_S(x_K) output by the student model. First, the K frames of composite images may be mapped to a fixed frame, typically frame τ = K/2, according to the forward optical flow information, the reverse optical flow information, and the mask information; that is, for the t-th frame composite image, after mapping it to the τ-th frame composite image, its low-rank matrix can be expressed as:
R_t = M_{t-τ} ⊙ W[N_S(x_t), f_{t-τ}];
wherein M_{t-τ} represents the mask information for the K frame images, obtained by calculation from the forward optical flow information and the reverse optical flow information; W is used to represent a mapping operation (warp).
According to step 4, after vectorization and combination by columns, a matrix X = [vec(R_1), ..., vec(R_K)]^T ∈ R^{K×L} can be obtained, where L = H × W × 3; K is used to indicate the number of rows of the matrix X, i.e., the number of image frames; L is used to represent the number of columns of the matrix X; H is used to represent the height of each frame of image; W is used to represent the width of each frame of image.
Performing singular value decomposition on X yields the desired nuclear norm. The decomposition is X = UΣV^T, where the matrix X has size K × L, U ∈ R^{K×K}, and V ∈ R^{L×L}; the nuclear norm is ||X||_* = tr(Σ), where tr is used to represent the trace of a matrix, e.g., the sum of the elements on the main diagonal (the diagonal from top left to bottom right) of an n × n matrix A is called the trace of matrix A, denoted tr(A).
The low rank loss function is:
L = (||X_input||_* - ||X_s||_*)^2;
wherein X_input represents the vectorized matrix obtained from the K frames of input images; X_s represents the vectorized matrix obtained from the K frames of composite images output by the target student network.
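Exemplarily, the low-rank loss may be computed as in the following sketch; it assumes that the K frames have already been warped onto frame τ and masked as in steps 1 to 3 above, and all function names are illustrative.

```python
import torch

def nuclear_norm(X):
    # ||X||_* = tr(Sigma): the sum of the singular values of X.
    return torch.linalg.svdvals(X).sum()

def build_matrix(aligned_frames, masks):
    """Vectorize K frames (already warped onto the fixed frame tau using the optical
    flow) and stack them as the rows of X in R^{K x L}, with L = H * W * 3.
    The masks zero out the sparse part, i.e. the regions that change between frames."""
    rows = [(m * f).reshape(-1) for f, m in zip(aligned_frames, masks)]
    return torch.stack(rows, dim=0)

def low_rank_loss(input_frames, styled_frames, masks):
    X_input = build_matrix(input_frames, masks)    # from the K input content frames
    X_s = build_matrix(styled_frames, masks)       # from the K stylized frames
    # L = (||X_input||_* - ||X_s||_*)^2
    return (nuclear_norm(X_input) - nuclear_norm(X_s)) ** 2
```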
In the embodiment of the application, the low-rank loss function is introduced when the style migration model is trained. Its target is that the regions which are not motion boundaries and appear in the adjacent multi-frame content images of the original video remain the same after the style migration processing; that is, the rank of such a region in the video after the style migration processing is close to its rank in the original video, so that the stability of the video after the style migration processing can be improved.
Further, in an embodiment of the present application, the image loss function further includes a residual loss function, where the residual loss function is obtained according to a difference between a first sample synthesized image and a second sample synthesized image, where the first sample synthesized image is obtained by performing image style migration processing on the N-frame sample content images through a first model, the second sample synthesized image is obtained by performing image style migration processing on the N-frame sample content images through a second model, the first model and the second model are image style migration models trained in advance according to the sample style images, the second model includes an optical flow module, the first model does not include the optical flow module, and the optical flow module is configured to determine optical flow information of the N-frame sample content images.
The first model and the second model are style transition models trained by the same style image, wherein the first model and the second model are different in that the first model does not include an optical flow module; the second model comprises an optical flow module; that is, the first model and the second model may adopt the same sample content image and sample style image during training, for example, the first model and the second model may refer to the same model during the training phase; however, in the testing stage, the second model also needs to calculate optical flow information among the multi-frame sample content images; the first model does not need to calculate optical flow information among a plurality of frames of images.
In the embodiment of the application, the residual loss function is introduced when the target style migration model is trained, so that the neural network model can learn the difference between the synthesized image output by the style migration model including the optical flow module and the synthesized image output by the style migration model not including the optical flow module in the training process, and the stability of the video after style migration processing obtained by the target migration model can be improved.
Further, in the embodiment of the present application, in order to meet the deployment requirement of the mobile terminal, a teacher-student model learning strategy may be adopted, that is, the trained style migration model may be a target student model; in the training process, parameters of the student model can be updated through the image loss function, and therefore the target student model is obtained.
It should be noted that the network structure of the student model is the same as that of the target student model, and the student model may refer to a style migration model which is trained in advance and does not need to input optical flow information in a testing stage.
Optionally, in a possible implementation manner, the first model and the second model may be pre-trained teacher models, and the target style migration model refers to a target student model obtained by training a student model to be trained according to a residual loss function and a knowledge distillation algorithm.
It should be understood that knowledge distillation refers to a key technology for miniaturizing a deep learning model so that it meets the deployment requirements of terminal equipment. Compared with compression technologies such as quantization and sparsification, it can compress a model without requiring specific hardware support. The knowledge distillation technology adopts a teacher-student model learning strategy: the teacher model generally has a large number of parameters and cannot meet the deployment requirement, while the student model has a small number of parameters and can be deployed directly. By designing an effective knowledge distillation algorithm, the student model learns and imitates the behavior of the teacher model, effective knowledge transfer is carried out, and the student model can finally exhibit the same processing capability as the teacher model.
In embodiments of the present application, the model without the optical flow module at test can be distilled by using a model including the optical flow module at test; in the process of transferring the style of the video, due to the fact that the teacher model and the student model are different in structure and training modes, the stylized effects of the student model and the teacher model may not be completely the same; if the student model learns the output information of the teacher model directly at the pixel level, the output of the student model may be ghosted or blurred. In the embodiment of the application, the target style migration model may refer to a target student model, and the knowledge distillation method for teacher-student model learning is adopted to enable the difference between the style migration results output by the student model to be trained and the base model trained in advance to continuously approach the difference between the style migration results output by the teacher model including the optical flow module and the style migration results output by the teacher model not including the optical flow module, so that the ghost phenomenon caused by the non-uniform styles of the teacher model and the student model can be effectively avoided through the training method.
Alternatively, in one possible implementation, the residual loss function is derived according to the following equation,
L_res = Σ_i ||(N_T(x_i) - N'_T(x_i)) - (N_S(x_i) - N'_S(x_i))||^2;
wherein L_res represents the residual loss function; N_T represents the second model; N'_T represents the first model; N_S represents the student model to be trained; N'_S represents a pre-trained basic model, where the pre-trained basic model has the same network structure as the student model to be trained; x_i represents the i-th frame sample content image included in the sample video, and i is a positive integer.
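Exemplarily, the residual loss for a single sample content image may be sketched as follows; the squared-error form and the function names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def residual_loss(x, teacher_with_flow, teacher_without_flow, student, base_student):
    """Residual distillation loss for one content frame x.

    teacher_with_flow    : the second model N_T (teacher including the optical flow module)
    teacher_without_flow : the first model N'_T (teacher without the optical flow module)
    student              : the student model N_S being trained
    base_student         : the pre-trained basic model N'_S with the student's structure
    The student/base residual is driven toward the teacher-with-flow/teacher-without-flow residual.
    """
    with torch.no_grad():
        teacher_residual = teacher_with_flow(x) - teacher_without_flow(x)
        base_out = base_student(x)
    student_residual = student(x) - base_out
    return F.mse_loss(student_residual, teacher_residual)
```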
In one example, the target style migration model may refer to a target student model, and when training the target student model, a student model to be trained may be trained according to a pre-trained first teacher model (excluding an optical flow module), a pre-trained second teacher model (including an optical flow module), and a pre-trained basic model, so as to obtain the target student model; the network structures of the student model to be trained, the pre-trained basic model and the target student model are the same, and the student model to be trained is trained through the low-rank loss function, the residual loss function and the perception loss function, so that the target student model is obtained.
The pre-trained basic model may be a style transition model which is obtained by pre-training a perceptual loss function and does not include an optical flow module in a testing stage; or, the style migration model trained in advance may refer to a style migration model not including an optical flow module in a test stage, which is trained in advance through a perception loss function and an optical flow loss function; the perception loss function is used for representing content loss between the synthetic image and the content image and style loss between the synthetic image and the style image; and the optical flow loss function is used for expressing the difference between corresponding pixel points of the adjacent frame synthetic images.
In a possible implementation manner, in the process of training the student model to be trained, the difference of the migration result (also called as a synthetic image) output between the student model to be trained and the base model trained in advance is made to continuously approximate the difference of the migration result output between the second model and the first model through the residual loss function.
In the embodiment of the application, the target style migration model may refer to a target student model, and the knowledge distillation method for teacher-student model learning is adopted to enable the difference between the style migration results output by the student model to be trained and the base model trained in advance to continuously approach the difference between the style migration results output by the teacher model including the optical flow module and the style migration results output by the teacher model not including the optical flow module, so that the ghost phenomenon caused by the non-uniform styles of the teacher model and the student model can be effectively avoided through the training method.
In one example, parameters of the neural network model are determined from an image loss function between the N frames of sample-synthesized images and the N frames of predicted-synthesized images, wherein the image loss function includes the above-mentioned low-rank loss function for representing a difference between a low-rank matrix composed of the N frames of sample-content images and a low-rank matrix composed of the N frames of sample-synthesized images.
In one example, parameters of the neural network model are determined according to an image loss function between the N frames of sample synthetic images and the N frames of predicted synthetic images, wherein the image loss function comprises the low-rank loss function and the residual loss function, and the low-rank loss function is used for representing the difference between a low-rank matrix formed by the N frames of sample content images and a low-rank matrix formed by the N frames of sample synthetic images; the residual loss function is derived from a difference between the first sample composite image and the second sample composite image; the first sample synthetic image is an image obtained by performing image style migration processing on the N frames of sample content images through a first model, the second sample synthetic image is an image obtained by performing image style migration processing on the N frames of sample content images through a second model, the first model and the second model are image style migration models trained in advance according to the same sample style images, the second model comprises an optical flow module, the first model does not comprise the optical flow module, and the optical flow module is used for determining optical flow information of the N frames of sample content images.
Optionally, in a possible implementation manner, the image loss function further includes a perceptual loss function, where the perceptual loss function includes a content loss and a style loss, the content loss is used to represent an image content difference between the N-frame predicted composite image and the corresponding N-frame sample content image, and the style loss is used to represent an image style difference between the N-frame predicted composite image and the sample style image.
Wherein the perceptual loss function may be used to represent content similarity between the sample content image and the corresponding composite image; and for representing stylistic similarities between the sample stylistic image and the corresponding composite image.
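Exemplarily, a perceptual loss built from pre-trained VGG-16 features and Gram matrices may be sketched as follows; the choice of VGG-16 and of the specific feature layers is an assumption, since the feature extractor is not fixed here.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

LAYERS = (3, 8, 15, 22)   # relu1_2, relu2_2, relu3_3, relu4_3 (a common, assumed choice)

def extract(x):
    feats, out = [], x
    for i, layer in enumerate(vgg):
        out = layer(out)
        if i in LAYERS:
            feats.append(out)
    return feats

def gram(f):
    # Gram matrix of a feature map: captures style (texture) statistics.
    n, c, h, w = f.shape
    f = f.reshape(n, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def perceptual_loss(composite, content, style):
    fc, fx, fs = extract(composite), extract(content), extract(style)
    content_loss = F.mse_loss(fc[2], fx[2])   # compare one deeper layer for content
    style_loss = sum(F.mse_loss(gram(a), gram(b)) for a, b in zip(fc, fs))
    return content_loss, style_loss
```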
In one example, the parameters of the neural network model are determined according to an image loss function between the N frames of sample synthetic images and the N frames of predicted synthetic images, wherein the image loss function includes the low rank loss function, the residual loss function, and the perceptual loss function.
For example, the image loss is obtained by weighting the low rank loss function, the residual loss function, and the perceptual loss function.
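Exemplarily, such a weighting may be sketched as follows; the stand-in loss values and the weighting coefficients are assumptions, since their actual values are a design choice.

```python
import torch

# Stand-in values for the three loss terms computed as sketched above.
rank_loss, res_loss, percep_loss = torch.tensor(0.12), torch.tensor(0.40), torch.tensor(1.30)

# Illustrative weights; the embodiment does not fix them here.
lambda_rank, lambda_res, lambda_percep = 1.0, 1.0, 1.0
image_loss = (lambda_rank * rank_loss
              + lambda_res * res_loss
              + lambda_percep * percep_loss)
```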
Optionally, in a possible implementation manner, the parameters of the target style migration model are obtained by performing multiple iterations through a back propagation algorithm based on the image loss function.
Illustratively, fig. 8 is a schematic diagram of a training process of a style migration model provided in an embodiment of the present application.
As shown in fig. 8, the first teacher model may refer to the second model, i.e. a style transition model including an optical flow module, which is trained in advance; the second teacher model may refer to the first model, i.e. a style migration model trained in advance and not including an optical flow module; the network structures of the pre-trained basic model, the student model to be trained and the target student model are the same; the input data of the first teacher model may include a content image of a T-th frame, a synthesized image of the T-1 th frame processed by optical flow information, and change information calculated by the optical flow information, where the change information may refer to a different region in two content images obtained according to the content image of the T-1 th frame and the content image of the T-th frame; the optical flow information may refer to motion information of corresponding pixels in the T-1 th content image and the T-th frame content image, and the output data of the first teacher model is the T-th frame composite image (# 1). For the second teacher model, since the model does not include the optical flow module, the change information in the input data of the second teacher model may be set to 1; the T-1 frame composite images processed with the optical flow information may all be set to 0, and the output data of the second teacher model is the T-th frame composite image (# 2); the input data of the student model to be trained is a T-th frame content image, and the output data is a T-th frame synthetic image (# 3); in the training process, the input data of the base model trained in advance can be a T-th frame content image, and the output data is a predicted T-th frame synthetic image (# 4); and sequentially inputting the T-th frame to the T-T + N-1 th frame to obtain N frames of sample content images, obtaining the T-T + N-1 th frame to obtain N frames of predicted synthetic images by the pre-trained basic model, and continuously updating the parameters of the student model to be trained through a back propagation algorithm according to the image loss function, namely the low-rank loss function, the residual loss function and the perception loss function, thereby obtaining the trained target student model.
The pre-trained basic model may be a style transition model which is obtained by pre-training a perceptual loss function and does not include an optical flow module in a testing stage; or, the style migration model trained in advance may refer to a style migration model not including an optical flow module in a test stage, which is trained in advance through a perception loss function and an optical flow loss function; the perception loss function is used for representing content loss between the synthetic image and the content image and style loss between the synthetic image and the style image; and the optical flow loss function is used for expressing the difference between corresponding pixel points of the adjacent frame synthetic images.
In the embodiment of the application, a low-rank loss function is introduced in the process of training the style migration model for the video, and the stability of the video subjected to style migration and the stability of the original video can be synchronized through the learning of low-rank information, so that the stability of the video subjected to style migration processing and obtained by the target migration model can be improved.
In addition, the style migration model for the video trained in the embodiment of the application can be a target student model obtained by adopting a teacher-learning model learning strategy, so that the requirement of deploying the style migration model by the mobile device can be met on one hand; on the other hand, when the target student model is trained, the learning model learns the difference between output information of the teacher model including the optical flow module and output information of the teacher model not including the optical flow module, so that a ghost phenomenon caused by non-uniform styles of the teacher model and the student model can be effectively avoided, and the stability of the video after style migration processing obtained by the target migration model can be improved.
FIG. 9 is a schematic flowchart of a method 700 for video style migration according to an embodiment of the present application; the method may be performed by an apparatus capable of image style migration, for example, by the execution device 510 in fig. 6 or by the local device 520. The method 700 includes steps S710 to S730, which are described in detail below.
S710, acquiring a video to be processed.
The video to be processed comprises N frames of content images to be processed, wherein N is an integer greater than or equal to 2.
For example, the to-be-processed video may be a video shot by the electronic device through a camera, or the to-be-processed video may also be a video obtained from the inside of the electronic device (for example, a video stored in an album of the electronic device, or a video obtained by the electronic device from a cloud).
It should be understood that the above-mentioned video to be processed may be a video with a style migration requirement, and the present application does not set any limit to the source of the video to be processed.
S720, carrying out image style migration processing on the N frames of content images to be processed according to the target style migration model to obtain N frames of synthetic images.
S730, obtaining the video after style migration processing corresponding to the video to be processed according to the N frames of synthetic images.
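As an illustration of steps S710 to S730, the following is a minimal PyTorch-style sketch, assuming the target style migration model is an ordinary frame-to-frame network; the function name, the list-of-tensors representation of the video and the batch handling are illustrative only and are not prescribed by the embodiments above.

```python
import torch

def video_style_migration(frames, target_model):
    """S710-S730: take a to-be-processed video given as a list of N content
    image tensors (N >= 2), run each frame through the trained target style
    migration model (no optical flow is computed at this stage), and return
    the N synthetic images that make up the style-migrated video."""
    assert len(frames) >= 2
    with torch.no_grad():
        synthetic = [target_model(f.unsqueeze(0)).squeeze(0) for f in frames]
    return synthetic
```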
The parameters of the target style migration model are determined according to an image loss function of the target style migration model for performing style migration processing on the N frames of sample content images, the image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on the N frames of sample content images and optical flow information, the second low-rank matrix is obtained based on N frames of predicted synthetic images and the optical flow information, the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frames of images in the N frames of sample content images, and the N frames of predicted synthetic images are images obtained by performing image style migration processing on the N frames of sample content images according to the sample style images through the target style migration model.
It should be noted that the N frames of sample content images refer to N adjacent frames in a video; the N frames of synthetic images are the images corresponding to the N frames of sample content images; and the target style migration model may refer to a pre-trained style migration model obtained by the training method shown in fig. 7.
It should be understood that the style migration processing for a single frame image, i.e. image style migration, only needs to consider the content in the content image and the style in the style image; however, because a video includes a plurality of consecutive frames, style migration for a video needs to consider not only the stylization effect of each image but also the stability among the plurality of frames; that is, the fluency of the video after style migration processing needs to be ensured, and noise such as flicker and artifacts needs to be avoided.
For example, assume that a video includes 5 consecutive frames of images, i.e. 5 frames of sample content images, and that 5 frames of style migration results, i.e. 5 frames of sample synthetic images, are obtained after the 5 frames of sample content images are subjected to style migration processing; the low-rank matrix may then be calculated as follows:
step 1, calculating optical flow information between image frames according to 5 frames of images in a video, namely 5 frames of sample content images;
step 2, calculating mask information according to the optical flow information, wherein the mask information can be used for representing a change area in two continuous frames of images obtained according to the optical flow information;
step 3, calculating, according to the optical flow information and the mask information, the low-rank part after aligning the 1st, 2nd, 4th and 5th frame images with the 3rd frame image, and setting the sparse part to 0;
step 4, flattening each of the aligned 5 frames of images into a vector and combining the vectors column by column into a matrix, where the matrix may be the low-rank matrix.
It should be understood that, when computing the low-rank loss function, the goal is to make the rank of the low-rank part of the image matrix of the 5 frames of sample content images approximate the rank of the low-rank part of the image matrix of the 5 frames of sample synthetic images; the rank of the low-rank part can be continuously optimized by optimizing the nuclear norm, and the nuclear norm is obtained by performing singular value decomposition on the matrix.
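The following PyTorch-style sketch shows one way the aligned-frame matrix of steps 1 to 4 and the nuclear-norm comparison described above could be implemented; taking already-warped frames and masks as inputs, and using the absolute difference of the two nuclear norms as the loss value, are assumptions rather than details fixed by the embodiments above.

```python
import torch

def aligned_frame_matrix(aligned_frames, masks):
    """Step 3/4: zero out the sparse (changed) part with the mask, flatten
    each aligned frame into a vector and stack the vectors as columns."""
    columns = [(f * m).reshape(-1) for f, m in zip(aligned_frames, masks)]
    return torch.stack(columns, dim=1)            # shape: (C*H*W, N)

def low_rank_loss(aligned_content, aligned_stylized, masks):
    """Compare the ranks of the two matrices through their nuclear norms
    (sum of singular values), the usual convex surrogate for the rank."""
    m_c = aligned_frame_matrix(aligned_content, masks)
    m_s = aligned_frame_matrix(aligned_stylized, masks)
    nuc_c = torch.linalg.svdvals(m_c).sum()
    nuc_s = torch.linalg.svdvals(m_s).sum()
    # Drive the rank of the stylized matrix towards that of the content matrix.
    return (nuc_s - nuc_c).abs()
```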
In the embodiment of the application, the low-rank loss function is introduced when the style migration model is trained; the goal is that regions which are not motion boundaries and which remain the same across adjacent frames of the original video still remain the same after the style migration processing, that is, the rank of such a region in the video after style migration processing is close to its rank in the original video, so that the stability of the video after style migration processing can be improved.
Further, in an embodiment of the present application, the image loss function further includes a residual loss function, where the residual loss function is obtained according to a difference between a first sample synthesized image and a second sample synthesized image, where the first sample synthesized image is obtained by performing image style migration processing on the N-frame sample content images through a first model, the second sample synthesized image is obtained by performing image style migration processing on the N-frame sample content images through a second model, the first model and the second model are image style migration models trained in advance according to the sample style images, and the second model includes an optical flow module, and the optical flow module is configured to determine optical flow information of the N-frame sample content images.
It should be understood that the difference between the first sample composite image and the second sample composite image may refer to the difference between corresponding pixel values of the first sample composite image and the second sample composite image.
The first model and the second model are style migration models trained with the same style image, and differ in that the first model does not include an optical flow module while the second model includes an optical flow module; that is, the first model and the second model may adopt the same sample content images and sample style images during training; for example, the first model and the second model may refer to the same model during the training phase; however, in the testing stage, the second model also needs to calculate optical flow information among the multi-frame sample content images, whereas the first model does not need to calculate optical flow information among the plurality of frames of images.
In the embodiment of the application, the residual loss function is introduced when the target style migration model is trained, so that the neural network model can learn, in the training process, the difference between the synthetic image output by the style migration model including the optical flow module and the synthetic image output by the style migration model not including the optical flow module, and therefore the stability of the video after style migration processing obtained by the target style migration model can be improved.
Further, in the embodiment of the present application, in order to meet the deployment requirement of the mobile terminal, a teacher-student model learning strategy may be adopted, that is, the trained style migration model may be a target student model; in the training process, parameters of the student model can be updated through the image loss function, and therefore the target student model is obtained.
It should be noted that the network structure of the student model to be trained is the same as that of the target student model, and the target student model may be a style migration model that does not need optical flow information as input in the testing stage.
Optionally, in a possible implementation manner, the first model and the second model may be pre-trained teacher models, and the target style migration model refers to a target student model obtained by training a student model to be trained according to a residual loss function and a knowledge distillation algorithm.
It should be understood that knowledge distillation is a key technology for miniaturizing deep learning models so that they meet the deployment requirements of terminal devices. Compared with compression techniques such as quantization and sparsification, it can compress a model without specific hardware support. The knowledge distillation technique adopts a teacher-student model learning strategy, where the teacher model generally has a large number of parameters and cannot meet the deployment requirement, while the student model has a small number of parameters and can be deployed directly. By designing an effective knowledge distillation algorithm, the student model learns and imitates the behavior of the teacher model, effective knowledge transfer is carried out, and the student model can finally exhibit the same processing capability as the teacher model.
In the embodiments of the present application, the model that does not use an optical flow module at test time can be distilled by using a model that includes an optical flow module at test time. In the process of video style migration, because the teacher model and the student model differ in structure and training manner, the stylization effects of the student model and the teacher model may not be exactly the same; if the student model directly learns the output information of the teacher model at the pixel level, the output of the student model may be ghosted or blurred. In the embodiment of the application, by learning the difference between the style migration results output by the teacher model that includes an optical flow module at test time and the teacher model that does not include an optical flow module at test time, the ghosting phenomenon caused by the non-uniform styles of the teacher model and the student model can be effectively avoided, and the stability of the video after style migration processing obtained by the target style migration model can be improved.
Alternatively, in one possible implementation, the residual loss function is derived according to the following equation,
L_res = Σ_i ‖ (N_T(x_i) − Ñ_T(x_i)) − (N_S(x_i) − Ñ_S(x_i)) ‖,

wherein L_res represents the residual loss function; N_T represents the second model; Ñ_T represents the first model; N_S represents the student model to be trained; Ñ_S represents the pre-trained basic model, and the pre-trained basic model has the same network structure as the student model to be trained; and x_i represents the i-th frame sample content image included in the sample video, i being a positive integer.
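A PyTorch-style sketch of this residual loss is given below; the squared L2 norm and the frame-wise averaging are implementation assumptions, as the equation above fixes only the residual structure and the symbols.

```python
import torch

def residual_loss(out_teacher_flow, out_teacher_noflow, out_student, out_base):
    """Residual distillation: the student should reproduce the *difference*
    between the two teachers rather than the raw teacher output. The four
    arguments are lists of per-frame synthetic images produced by the second
    model N_T, the first model, the student N_S and the pre-trained basic
    model, respectively."""
    loss = 0.0
    for o_t, o_t0, o_s, o_s0 in zip(out_teacher_flow, out_teacher_noflow,
                                    out_student, out_base):
        teacher_residual = o_t - o_t0
        student_residual = o_s - o_s0
        loss = loss + torch.mean((student_residual - teacher_residual) ** 2)
    return loss
```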
In one example, parameters of the neural network model are determined from an image loss function between the N frames of sample-synthesized images and the N frames of predicted-synthesized images, wherein the image loss function includes the above-mentioned low-rank loss function for representing a difference between a low-rank matrix composed of the N frames of sample-content images and a low-rank matrix composed of the N frames of sample-synthesized images.
In one example, parameters of the neural network model are determined according to an image loss function between the N frames of sample synthetic images and the N frames of predicted synthetic images, wherein the image loss function comprises the low-rank loss function and the residual loss function, and the low-rank loss function is used for representing the difference between a low-rank matrix formed by the N frames of sample content images and a low-rank matrix formed by the N frames of sample synthetic images; the residual loss function is derived from a difference between the first sample composite image and the second sample composite image; the first sample synthetic image is an image obtained by performing image style migration processing on the N frames of sample content images through a first model, the second sample synthetic image is an image obtained by performing image style migration processing on the N frames of sample content images through a second model, the first model and the second model are image style migration models trained in advance according to the same sample style images, the second model comprises an optical flow module, the first model does not comprise the optical flow module, and the optical flow module can be used for determining optical flow information of the N frames of sample content images.
Optionally, in a possible implementation manner, the image loss function further includes a perceptual loss function, where the perceptual loss function includes a content loss and a style loss, the content loss is used to represent an image content difference between the N-frame predicted composite image and the corresponding N-frame sample content image, and the style loss is used to represent an image style difference between the N-frame predicted composite image and the sample style image.
Wherein the perceptual loss function may be used to represent content similarity between the sample content image and the corresponding composite image; and for representing stylistic similarities between the sample stylistic image and the corresponding composite image.
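A perceptual loss of this kind is commonly implemented on the features of a frozen VGG network, with Gram matrices for the style term; the sketch below illustrates that common construction. The VGG-16 backbone and the particular layer indices are assumptions and are not specified in the embodiments above.

```python
import torch
import torchvision

class PerceptualLoss(torch.nn.Module):
    """Content loss + style loss computed on frozen VGG-16 features."""
    def __init__(self, content_layer=8, style_layers=(3, 8, 15, 22)):
        super().__init__()
        vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg
        self.content_layer = content_layer
        self.style_layers = set(style_layers)

    def _features(self, x):
        feats = {}
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i == self.content_layer or i in self.style_layers:
                feats[i] = x
        return feats

    @staticmethod
    def _gram(f):
        b, c, h, w = f.shape
        f = f.reshape(b, c, h * w)
        return f @ f.transpose(1, 2) / (c * h * w)

    def forward(self, predicted, content, style):
        fp = self._features(predicted)
        fc = self._features(content)
        fs = self._features(style)
        content_loss = torch.mean((fp[self.content_layer] - fc[self.content_layer]) ** 2)
        style_loss = sum(torch.mean((self._gram(fp[i]) - self._gram(fs[i])) ** 2)
                         for i in self.style_layers)
        return content_loss, style_loss
```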
In one example, the parameters of the neural network model are determined according to an image loss function between the N frames of sample synthetic images and the N frames of predicted synthetic images, wherein the image loss function includes the low rank loss function, the residual loss function, and the perceptual loss function.
For example, the image loss is obtained by weighting the low rank loss function, the residual loss function, and the perceptual loss function.
Optionally, in a possible implementation manner, the parameters of the target style migration model are obtained by performing multiple iterations through a back propagation algorithm based on the image loss function.
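A minimal sketch of one such iteration is given below, i.e. combining the three loss terms into the image loss by weighting and applying one back propagation update; the weight values are illustrative placeholders only.

```python
import torch

# Illustrative weights; concrete values are not disclosed above.
W_LOW_RANK, W_RESIDUAL, W_PERCEPTUAL = 1.0, 1.0, 1.0

def distillation_update(optimizer, low_rank_l, residual_l, perceptual_l):
    """One back propagation iteration on the weighted image loss; the three
    loss terms are assumed to have been computed for the current N-frame clip."""
    image_loss = (W_LOW_RANK * low_rank_l
                  + W_RESIDUAL * residual_l
                  + W_PERCEPTUAL * perceptual_l)
    optimizer.zero_grad()
    image_loss.backward()   # back propagation through the student model
    optimizer.step()        # one iteration of the parameter update
    return float(image_loss)
```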
In the embodiment of the application, a low-rank loss function is introduced in the process of training the target style migration model; through the learning of low-rank information, the stability of the stylized video can be kept consistent with the stability of the original video, so that the stability of the video obtained by the target style migration model after style migration processing can be improved.
In addition, the target style migration model in the embodiment of the application can be a target student model obtained by adopting a teacher-student model learning strategy, so that on one hand the requirement of deploying the style migration model on a mobile device can be met; on the other hand, when the target student model is trained, the student model learns the difference between the output information of the teacher model including the optical flow module and the output information of the teacher model not including the optical flow module, so that the ghosting phenomenon caused by the non-uniform styles of the teacher model and the student model can be effectively avoided, and the stability of the video after style migration processing obtained by the target style migration model can be improved.
Further, in a test stage, that is, in a process of performing style migration processing on a video to be processed, the target style migration model provided in the embodiment of the present application does not need to calculate optical flow information between multiple frames of images included in the video, so that the target style migration model provided in the embodiment of the present application can improve stability, and at the same time, can shorten the time for style migration processing of the model, and improve the operating efficiency of the target style migration model.
Illustratively, fig. 10 is a schematic diagram of a training phase and a testing phase provided in an embodiment of the present application.
A training stage:
for example, in the embodiment of the present application, the FlowNet2 network may be used together with a dataset, built from the Hollywood2 dataset, for which optical flow data has been generated, and the network may be trained by using the training method shown in the embodiments of the present application.
For example, the specific implementation steps include: firstly, a style migration model, i.e. the pre-trained basic model Ñ_S, is trained by adopting only a perceptual loss function, or by adopting a perceptual loss function and an optical flow loss function; a teacher model N_T comprising an optical flow module and a teacher model Ñ_T not comprising an optical flow module are trained by using video data and optical flow data; then, the student model N_S to be trained is trained according to Ñ_S, N_T and Ñ_T through the low-rank loss function and the residual loss function, and a trained target student model is finally obtained, wherein the network structures of the pre-trained basic model, the student model to be trained and the target student model are the same.
It should be noted that, the specific implementation manner of the training phase may refer to the descriptions in fig. 7 and fig. 8, and is not described herein again.
A testing stage:
for example, in the test, test data can be input into the target student model, and the test result, namely the data after the style migration processing, can be obtained through the target student model.
It should be noted that, the specific implementation manner of the test phase may refer to the description in fig. 9, and is not described herein again.
TABLE 1
The teacher model in table 1 may refer to the second model in the above embodiments, that is, the style migration model including an optical flow module in the testing stage; the first class of student models may refer to pre-trained student models obtained by training with a perceptual loss function; the second class of student models may refer to pre-trained student models obtained by training with a perceptual loss function and an optical flow loss function; loss function 1 may refer to the residual loss function in this application; loss function 2 may refer to the low-rank loss function in this application; Alley_2, Ambush_5, Bandage_2, Market_6 and Temple_2 respectively represent the names of five video sequences in the MPI-Sintel dataset, and All represents the five sequences taken together. The results of testing the stability of the different models on the MPI-Sintel dataset are shown in Table 1; the stability index may be calculated by the following formula:
e_stab = sqrt( (1/(T−1)) Σ_{t=2}^{T} (1/D) ‖ M_t ⊙ (O_t − W_t(O_{t−1})) ‖² ),

wherein T represents the number of image frames included in the video; D = c × w × d; M_t ∈ R^(w×d) represents the mask information; O_t represents the style migration result of the t-th frame; O_{t−1} represents the style migration result of the (t−1)-th frame; W_t represents the optical flow information from the (t−1)-th frame to the t-th frame; and W_t(O_{t−1}) represents the style migration result of the (t−1)-th frame aligned to the style migration result of the t-th frame.
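A PyTorch-style sketch of this stability index is shown below; the per-frame normalisation by D and the placement of the square root follow the formula as reconstructed above and may differ in detail from the original drawing.

```python
import torch

def stability_index(outputs, warped_prev, masks):
    """outputs[t] is O_t, warped_prev[t] is W_t(O_{t-1}) (the (t-1)-th result
    warped onto frame t) and masks[t] is M_t; index 0 of warped_prev and
    masks is unused. Returns the stability index of the stylized video."""
    T = len(outputs)
    total = 0.0
    for t in range(1, T):
        D = outputs[t].numel()                    # D = c * w * d
        diff = masks[t] * (outputs[t] - warped_prev[t])
        total = total + (diff ** 2).sum() / D
    return torch.sqrt(total / (T - 1))
```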
As shown in table 1, a smaller stability index indicates better stability of the output data of a model after style migration processing; from the test results shown in table 1, it can be seen that the stability of the output data of the target style migration model provided in the embodiment of the present application after style migration processing is significantly better than that of the other models.
It is to be understood that the above description is intended to assist those skilled in the art in understanding the embodiments of the present application and is not intended to limit the embodiments of the present application to the particular values or particular scenarios illustrated. It will be apparent to those skilled in the art from the foregoing description that various equivalent modifications or changes may be made, and such modifications or changes are intended to fall within the scope of the embodiments of the present application.
The training method for style migration and the method for video style migration provided by the embodiments of the present application are described in detail above with reference to fig. 1 to 10; the device embodiments of the present application will be described in detail below with reference to fig. 11 to 14. It should be understood that the image processing apparatus in the embodiments of the present application may execute the various methods described above in the embodiments of the present application; that is, for the specific working processes of the following products, reference may be made to the corresponding processes in the foregoing method embodiments.
Fig. 11 is a schematic block diagram of an apparatus for video style migration provided by an embodiment of the present application.
It should be understood that the apparatus 800 for video style migration may perform the method shown in fig. 9, or alternatively, the method of the testing phase shown in fig. 10. The apparatus 800 comprises: an acquisition unit 810 and a processing unit 820.
The acquiring unit 810 is configured to acquire a to-be-processed video, where the to-be-processed video includes N frames of to-be-processed content images, and N is an integer greater than or equal to 2; the processing unit 820 is configured to perform image style migration processing on the N frames of content images to be processed according to the target style migration model to obtain N frames of synthetic images; obtaining the video after the style migration processing corresponding to the video to be processed according to the N frames of synthetic images,
wherein the parameters of the target style migration model are determined according to an image loss function of the target style migration model for performing style migration processing on the N frames of sample content images, the image loss function comprises a low rank loss function representing a difference between a first low rank matrix and a second low rank matrix, the first low-rank matrix is derived based on N frames of sample content images and optical flow information, the second low-rank matrix is derived based on N frames of predictive synthetic images and the optical flow information, the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frames of images in the N frames of sample content images, the N frames of predicted synthetic images are obtained by performing image style migration processing on the N frames of sample content images according to the sample style images through the target style migration model.
Optionally, as an embodiment, the image loss function further comprises a residual loss function, the residual loss function being derived from a difference between the first sample synthetic image and the second sample synthetic image,
the first sample synthetic image is an image obtained by performing image style migration processing on the N frames of sample content images through a first model, the second sample synthetic image is an image obtained by performing image style migration processing on the N frames of sample content images through a second model, the first model and the second model are image style migration models trained in advance according to the sample style images, the second model comprises an optical flow module, the first model does not comprise the optical flow module, and the optical flow module is used for determining optical flow information of the N frames of sample content images.
Optionally, as an embodiment, the first model and the second model are pre-trained teacher models, and the target style migration model is a target student model obtained by training a student model to be trained according to the residual loss function and the knowledge distillation algorithm.
Alternatively, as an embodiment, the residual loss function is derived according to the following equation,
L_res = Σ_i ‖ (N_T(x_i) − Ñ_T(x_i)) − (N_S(x_i) − Ñ_S(x_i)) ‖,

wherein L_res represents the residual loss function; N_T represents the second model; Ñ_T represents the first model; N_S represents the student model to be trained; Ñ_S represents a pre-trained basic model, and the pre-trained basic model has the same network structure as the student model to be trained; and x_i represents the i-th frame sample content image included in the sample video, i being a positive integer.
Optionally, as an embodiment, the image loss function further includes a perceptual loss function, where the perceptual loss function includes a content loss and a style loss, the content loss is used to represent an image content difference between the N frames of predicted composite images and the N frames of sample content images corresponding to the N frames of predicted composite images, and the style loss is used to represent an image style difference between the N frames of predicted composite images and the sample style image.
Optionally, as an embodiment, the image loss function is obtained by weighting the low rank loss function, the residual loss function, and the perceptual loss function.
Optionally, as an embodiment, the parameters of the target style migration model are obtained by performing multiple iterations through a back propagation algorithm based on the image loss function.
Fig. 12 is a schematic block diagram of a training apparatus for a style migration model provided in an embodiment of the present application.
It should be understood that the training apparatus 900 may perform the training method of the style transition model shown in fig. 7, fig. 8, or fig. 10. The training apparatus 900 includes: an acquisition unit 910 and a processing unit 920.
The acquiring unit 910 is configured to acquire training data, where the training data includes N frames of sample content images, sample style images, and N frames of synthetic images, where the N frames of synthetic images are images obtained by performing image style migration processing on the N frames of sample content images according to the sample style images, and N is an integer greater than or equal to 2; the processing unit 920 is configured to perform image style migration processing on the N frames of sample content images according to the sample style images through a neural network model, so as to obtain N frames of predicted composite images; determining parameters of the neural network model according to an image loss function between the N frames of sample content images and the N frames of predictive synthetic images,
the image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on N frames of sample content images and optical flow information, the second low-rank matrix is obtained based on N frames of predicted synthesized images and the optical flow information, and the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frames of images in the N frames of sample content images.
Optionally, as an embodiment, the image loss function further comprises a residual loss function, the residual loss function being derived from a difference between the first sample synthetic image and the second sample synthetic image,
the first sample synthetic image is an image obtained by performing image style migration processing on the N frames of sample content images through a first model, the second sample synthetic image is an image obtained by performing image style migration processing on the N frames of sample content images through a second model, the first model and the second model are image style migration models trained in advance according to the sample style images, the second model comprises an optical flow module, the first model does not comprise the optical flow module, and the optical flow module is used for determining optical flow information of the N frames of sample content images.
Optionally, as an embodiment, the first model and the second model are pre-trained teacher models, and the target style migration model is a target student model obtained by training a student model to be trained according to the residual loss function and the knowledge distillation algorithm.
Alternatively, as an embodiment, the residual loss function is derived according to the following equation,
L_res = Σ_i ‖ (N_T(x_i) − Ñ_T(x_i)) − (N_S(x_i) − Ñ_S(x_i)) ‖,

wherein L_res represents the residual loss function; N_T represents the second model; Ñ_T represents the first model; N_S represents the student model to be trained; Ñ_S represents a pre-trained basic model, and the pre-trained basic model has the same network structure as the student model to be trained; and x_i represents the i-th frame sample content image included in the sample video, i being a positive integer.
Optionally, as an embodiment, the image loss function further includes a perceptual loss function, where the perceptual loss function includes a content loss and a style loss, the content loss is used to represent an image content difference between the N-frame predicted composite image and the N-frame sample content image corresponding to the N-frame predicted composite image, and the style loss is used to represent an image style difference between the N-frame predicted composite image and the sample style image.
Optionally, as an embodiment, the image loss function is obtained by weighting the low rank loss function, the residual loss function, and the perceptual loss function.
Optionally, as an embodiment, the parameters of the target style migration model are obtained by performing multiple iterations through a back propagation algorithm based on the image loss function.
The apparatus 800 and the training apparatus 900 are embodied as functional units. The term "unit" herein may be implemented in software and/or hardware, and is not particularly limited thereto.
For example, a "unit" may be a software program, a hardware circuit, or a combination of both that implement the above-described functions. The hardware circuitry may include an Application Specific Integrated Circuit (ASIC), an electronic circuit, a processor (e.g., a shared processor, a dedicated processor, or a group of processors) and memory that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that support the described functionality.
Accordingly, the units of the respective examples described in the embodiments of the present application can be realized in electronic hardware, or a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Fig. 13 is a hardware configuration diagram of an apparatus for video style migration according to an embodiment of the present application. The apparatus 1000 shown in fig. 13 (the apparatus 1000 may specifically be a computer device) includes a memory 1010, a processor 1020, a communication interface 1030, and a bus 1040. The memory 1010, the processor 1020, and the communication interface 1030 are communicatively connected to each other via a bus 1040.
The memory 1010 may be a Read Only Memory (ROM), a static memory device, a dynamic memory device, or a Random Access Memory (RAM). The memory 1010 may store a program, and the processor 1020 is configured to perform the steps of the method for video style migration of the embodiments of the present application when the program stored in the memory 1010 is executed by the processor 1020; for example, the respective steps shown in fig. 9 are performed.
It should be understood that the device for migrating the video style shown in the embodiment of the present application may be a server, for example, a server in a cloud, or may also be a chip configured in the server in the cloud.
The processor 1020 may be a general-purpose Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), a Graphics Processing Unit (GPU), or one or more integrated circuits, and is configured to execute related programs to implement the method for video style migration according to the embodiments of the present application.
The processor 1020 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the method for video style migration of the present application may be implemented by integrated logic circuits of hardware or instructions in the form of software in the processor 1020.
The processor 1020 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in the memory 1010, and the processor 1020 reads the information in the memory 1010, and completes the functions required to be performed by the units included in the apparatus shown in fig. 11 in the embodiment of the present application in combination with the hardware thereof, or performs the method for video style migration shown in fig. 9 in the embodiment of the method of the present application.
The communication interface 1030 enables communication between the apparatus 1000 and other devices or communication networks using transceiver means such as, but not limited to, transceivers.
The bus 1040 may include a pathway to transfer information between various components of the device 1000 (e.g., memory 1010, processor 1020, communication interface 1030).
Fig. 14 is a schematic hardware configuration diagram of a training apparatus for a style transition model according to an embodiment of the present application. The training apparatus 1100 shown in fig. 14 (the training apparatus 1100 may be a computer device) includes a memory 1110, a processor 1120, a communication interface 1130, and a bus 1140. The memory 1110, the processor 1120, and the communication interface 1130 are communicatively connected to each other via a bus 1140.
The memory 1110 may be a Read Only Memory (ROM), a static memory device, a dynamic memory device, or a Random Access Memory (RAM). The memory 1110 may store a program, and when the program stored in the memory 1110 is executed by the processor 1120, the processor 1120 is configured to perform the steps of the training method of the style migration model according to the embodiment of the present application; for example, the respective steps shown in fig. 7 or fig. 8 are performed.
It should be understood that the training device shown in the embodiment of the present application may be a server, for example, a server in a cloud, or may also be a chip configured in the server in the cloud.
For example, the processor 1120 may employ a general-purpose Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), a Graphics Processing Unit (GPU) or one or more integrated circuits, and is configured to execute the related programs to implement the training method of the style migration model according to the embodiments of the present application.
Illustratively, the processor 1120 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the training method of the style migration model of the present application may be implemented by integrated logic circuits of hardware or instructions in the form of software in the processor 1120.
The processor 1120 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in the memory 1110, and the processor 1120 reads the information in the memory 1110, and in combination with the hardware thereof, performs the functions required to be performed by the units included in the training apparatus shown in fig. 12, or performs the training method of the style migration model shown in fig. 7 or fig. 8 according to the embodiment of the method of the present application.
Communication interface 1130 enables communication between exercise device 1100 and other devices or communication networks using transceiver devices such as, but not limited to, transceivers.
Bus 1140 may include a path to transfer information between various components of exercise device 1100 (e.g., memory 1110, processor 1120, communication interface 1130).
It should be noted that although the above-described apparatus 1000 and training apparatus 1100 show only memories, processors, and communication interfaces, in particular implementations, those skilled in the art will appreciate that the apparatus 1000 and training apparatus 1100 may also include other devices necessary to achieve proper operation. Also, those skilled in the art will appreciate that the apparatus 1000 and the exercise apparatus 1100 described above may also include hardware components to implement other additional functions, according to particular needs. Furthermore, it should be understood by those skilled in the art that the apparatus 1000 and the training apparatus 1100 described above may also include only those components necessary to implement the embodiments of the present application, and need not include all of the components shown in FIG. 13 or FIG. 14.
Illustratively, the embodiment of the present application further provides a chip, which includes a transceiver unit and a processing unit. The transceiver unit can be an input/output circuit and a communication interface; the processing unit is a processor or a microprocessor or an integrated circuit integrated on the chip; the chip can execute the method for video style migration in the above method embodiment.
Illustratively, the embodiment of the present application further provides a chip, which includes a transceiver unit and a processing unit. The transceiver unit can be an input/output circuit and a communication interface; the processing unit is a processor or a microprocessor or an integrated circuit integrated on the chip; the chip can execute the training method of the style migration model in the above method embodiment.
Illustratively, the embodiment of the present application further provides a computer-readable storage medium, on which instructions are stored, and the instructions, when executed, perform the method for video style migration in the above method embodiment.
Illustratively, the present application further provides a computer-readable storage medium, on which instructions are stored, and the instructions, when executed, perform the training method of the style migration model in the above method embodiment.
Illustratively, the present application further provides a computer program product containing instructions, which when executed, perform the method for video style migration in the above method embodiments.
Illustratively, the present application further provides a computer program product containing instructions, which when executed, perform the training method of the style migration model in the above method embodiments.
It should be understood that the processor in the embodiments of the present application may be a Central Processing Unit (CPU), and the processor may also be other general-purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
It will also be appreciated that the memory in the embodiments of the subject application can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. The non-volatile memory may be a read-only memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an electrically Erasable EPROM (EEPROM), or a flash memory. Volatile memory can be Random Access Memory (RAM), which acts as external cache memory. By way of example, but not limitation, many forms of Random Access Memory (RAM) are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), synchronous DRAM (SLDRAM), and direct bus RAM (DR RAM).
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. The procedures or functions according to the embodiments of the present application are wholly or partially generated when the computer instructions or the computer program are loaded or executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains one or more collections of available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a solid state disk.
It should be understood that the term "and/or" herein is merely one type of association relationship that describes an associated object, meaning that three relationships may exist, e.g., a and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone, wherein A and B can be singular or plural. In addition, the "/" in this document generally indicates that the former and latter associated objects are in an "or" relationship, but may also indicate an "and/or" relationship, which may be understood with particular reference to the former and latter text.
In the present application, "at least one" means one or more, "a plurality" means two or more. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of the singular or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or multiple.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (30)

1. A method for training a style migration model, comprising:
acquiring training data, wherein the training data comprises N frames of sample content images, sample style images and N frames of synthetic images, the N frames of synthetic images are images obtained by performing image style migration processing on the N frames of sample content images according to the sample style images, and N is an integer greater than or equal to 2;
carrying out image style migration processing on the N frames of sample content images according to the sample style images through a neural network model to obtain N frames of predicted synthetic images;
determining parameters of the neural network model according to an image loss function between the N frames of sample content images and the N frames of predictive synthetic images,
wherein the image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on the N frames of sample content images and optical flow information, the second low-rank matrix is obtained based on the N frames of predicted synthesized images and the optical flow information, and the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frames of images in the N frames of sample content images.
2. The training method of claim 1, wherein the image loss function further comprises a residual loss function derived from a difference between the first sample composite image and the second sample composite image,
the first sample synthetic image is an image obtained by performing image style migration processing on the N frames of sample content images through a first model, the second sample synthetic image is an image obtained by performing image style migration processing on the N frames of sample content images through a second model, the first model and the second model are image style migration models trained in advance according to the sample style images, the second model comprises an optical flow module, the first model does not comprise the optical flow module, and the optical flow module is used for determining the optical flow information.
3. The training method of claim 2, wherein the first model and the second model are pre-trained teacher models, and the target style migration model is a target student model obtained by training a student model to be trained according to the residual loss function and the knowledge distillation algorithm.
4. The training method of claim 3, wherein the residual loss function is obtained according to the following equation,
L_res = Σ_i ‖ (N_T(x_i) − Ñ_T(x_i)) − (N_S(x_i) − Ñ_S(x_i)) ‖,

wherein L_res represents the residual loss function; N_T represents the second model; Ñ_T represents the first model; N_S represents the student model to be trained; Ñ_S represents a pre-trained basic model, and the pre-trained basic model has the same network structure as the student model to be trained; and x_i represents the i-th frame sample content image included in the sample video, i being a positive integer.
5. The training method of any one of claims 1 to 4, wherein the image loss function further comprises a perceptual loss function, wherein the perceptual loss function comprises a content loss representing an image content difference between the N frames of predicted composite images and their corresponding N frames of sample content images and a style loss representing an image style difference between the N frames of predicted composite images and the sample style images.
6. The training method of claim 5, wherein the image loss function is obtained by weighting the low rank loss function, the residual loss function, and the perceptual loss function.
7. A training method as claimed in any one of claims 1 to 6, characterized in that the parameters of the target style migration model are obtained by a back propagation algorithm for a number of iterations based on the image loss function.
8. A method of video style migration, comprising:
acquiring a video to be processed, wherein the video to be processed comprises N frames of content images to be processed, and N is an integer greater than or equal to 2;
carrying out image style migration processing on the N frames of content images to be processed according to a target style migration model to obtain N frames of synthetic images;
obtaining the video after the style migration processing corresponding to the video to be processed according to the N frames of synthetic images,
wherein the parameters of the target style migration model are determined according to an image loss function of the target style migration model for performing style migration processing on the N frames of sample content images, the image loss function comprises a low rank loss function representing a difference between a first low rank matrix and a second low rank matrix, the first low-rank matrix is derived based on the N frames of sample-content images and optical flow information, the second low-rank matrix is derived based on an N frames of predictive-synthesized images and the optical flow information, the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frames of images in the N frames of sample content images, the N frames of predicted synthetic images are obtained by performing image style migration processing on the N frames of sample content images according to the sample style images through the target style migration model.
9. The method of claim 8, wherein the image loss functions further comprise a residual loss function derived from a difference between the first sample composite image and the second sample composite image,
the first sample synthetic image is an image obtained by performing image style migration processing on the N frames of sample content images through a first model, the second sample synthetic image is an image obtained by performing image style migration processing on the N frames of sample content images through a second model, the first model and the second model are image style migration models trained in advance according to the sample style images, the second model comprises an optical flow module, the first model does not comprise the optical flow module, and the optical flow module is used for determining the optical flow information.
10. The method of claim 9, wherein the first model and the second model are pre-trained teacher models, and the target style migration model is a target student model obtained by training a student model to be trained according to the residual loss function and a knowledge distillation algorithm.
11. The method of claim 10, wherein the residual loss function is derived according to the following equation:

L_res = Σ_i || (N_T(x_i) - N'_T(x_i)) - (N_S(x_i) - N'_S(x_i)) ||

wherein L_res represents the residual loss function; N_T represents the second model; N'_T represents the first model; N_S represents the student model to be trained; N'_S represents a pre-trained style migration model having the same network structure as the student model to be trained; and x_i represents the i-th frame of sample content image included in the sample video, i being a positive integer.
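Read literally, the equation distils the residual of the teacher pair (second model minus first model) into the residual of the student pair (student minus pre-trained model). Below is a minimal PyTorch sketch of that reading; the L1 norm and the frozen treatment of the teachers and the base model are assumptions, since the claim does not fix the norm.

```python
def residual_loss(student, frames, teacher_flow, teacher_plain, student_base):
    """L_res = sum_i || (N_T(x_i) - N'_T(x_i)) - (N_S(x_i) - N'_S(x_i)) ||  (claim 11).

    teacher_flow  : the second model N_T (teacher with an optical flow module)
    teacher_plain : the first model N'_T (teacher without the optical flow module)
    student       : the student model N_S being trained
    student_base  : the pre-trained model N'_S sharing the student's network structure
    """
    import torch
    loss = 0.0
    for x in frames:                               # x_i: i-th sample content frame
        with torch.no_grad():                      # teachers and base model stay fixed
            teacher_res = teacher_flow(x) - teacher_plain(x)    # N_T(x_i) - N'_T(x_i)
            base_out = student_base(x)                          # N'_S(x_i)
        student_res = student(x) - base_out                     # N_S(x_i) - N'_S(x_i)
        loss = loss + (teacher_res - student_res).abs().mean()  # L1 norm (assumption)
    return loss
```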
12. The method according to any one of claims 8 to 11, wherein the image loss function further comprises a perceptual loss function, the perceptual loss function comprising a content loss and a style loss, wherein the content loss represents the image content difference between the N frames of predicted synthetic images and the corresponding N frames of sample content images, and the style loss represents the image style difference between the N frames of predicted synthetic images and the sample style images.
13. The method of claim 12, wherein the image loss function is obtained by weighting the low rank loss function, the residual loss function, and the perceptual loss function.
14. The method according to any one of claims 8 to 13, wherein the parameters of the target style migration model are obtained by performing a plurality of iterations of a back-propagation algorithm based on the image loss function.
15. A training apparatus for a style migration model, comprising:
the device comprises an acquisition unit and a processing unit, wherein the acquisition unit is used for acquiring training data, the training data comprises N frames of sample content images, sample style images and N frames of synthetic images, the N frames of synthetic images are images obtained after image style migration processing is carried out on the N frames of sample content images according to the sample style images, and N is an integer greater than or equal to 2;
the processing unit is used for carrying out image style migration processing on the N frames of sample content images according to the sample style images through a neural network model to obtain N frames of predicted synthetic images, and for determining parameters of the neural network model according to an image loss function between the N frames of sample content images and the N frames of predicted synthetic images,
wherein the image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on the N frames of sample content images and optical flow information, the second low-rank matrix is obtained based on the N frames of predicted synthesized images and the optical flow information, and the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frames of images in the N frames of sample content images.
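The claim does not spell out how the two low-rank matrices are built, so the sketch below is one plausible reading under explicit assumptions: each matrix stacks the flow-aligned frames as rows, a truncated SVD supplies the low-rank part, and the loss is the mean absolute difference between the two low-rank matrices. PyTorch, the backward-warping convention and the rank-1 truncation are likewise assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp an image (1, C, H, W) with per-pixel optical flow (1, 2, H, W),
    i.e. move each pixel by the position difference between two adjacent frames."""
    _, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=img.device),
                            torch.arange(w, device=img.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().unsqueeze(0)   # (1, H, W, 2) pixel coords
    grid = grid + flow.permute(0, 2, 3, 1)                      # displace by the flow
    grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0           # normalise to [-1, 1]
    grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0
    return F.grid_sample(img, grid, align_corners=True)

def low_rank_matrix(frames, flows, rank=1):
    """Stack flow-aligned frames as rows and keep a rank-`rank` approximation."""
    rows = [warp(f, fl).flatten() for f, fl in zip(frames[:-1], flows)]
    rows.append(frames[-1].flatten())               # last frame kept unwarped in this sketch
    m = torch.stack(rows)                           # (N, C*H*W)
    u, s, vh = torch.linalg.svd(m, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vh[:rank]     # low-rank part of the stack

def low_rank_loss(content_frames, pred_frames, flows):
    """Difference between the first and second low-rank matrices (claim 15)."""
    m_first = low_rank_matrix(content_frames, flows)   # from the sample content images
    m_second = low_rank_matrix(pred_frames, flows)     # from the predicted synthetic images
    return (m_second - m_first).abs().mean()
```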
16. The training apparatus of claim 15, wherein the image loss function further comprises a residual loss function, the residual loss function being derived from a difference between a first sample synthetic image and a second sample synthetic image,
the first sample synthetic image is an image obtained by performing image style migration processing on the N frames of sample content images through a first model, the second sample synthetic image is an image obtained by performing image style migration processing on the N frames of sample content images through a second model, the first model and the second model are image style migration models trained in advance according to the sample style images, the second model comprises an optical flow module, the first model does not comprise the optical flow module, and the optical flow module is used for determining the optical flow information.
17. The training apparatus of claim 16, wherein the first model and the second model are pre-trained teacher models, and the target style migration model is a target student model obtained by training a student model to be trained according to the residual loss function and a knowledge distillation algorithm.
18. The training apparatus of claim 17, wherein the residual loss function is obtained according to the following equation:

L_res = Σ_i || (N_T(x_i) - N'_T(x_i)) - (N_S(x_i) - N'_S(x_i)) ||

wherein L_res represents the residual loss function; N_T represents the second model; N'_T represents the first model; N_S represents the student model to be trained; N'_S represents the pre-trained basic model, which has the same network structure as the student model to be trained; and x_i represents the i-th frame of sample content image included in the sample video, i being a positive integer.
19. The training apparatus according to any one of claims 15 to 18, wherein the image loss function further comprises a perceptual loss function, the perceptual loss function comprising a content loss and a style loss, wherein the content loss represents the image content difference between the N frames of predicted synthetic images and the corresponding N frames of sample content images, and the style loss represents the image style difference between the N frames of predicted synthetic images and the sample style images.
20. The training apparatus of claim 19, wherein the image loss function is obtained by weighting the low rank loss function, the residual loss function, and the perceptual loss function.
21. The training apparatus according to any one of claims 15 to 20, wherein the parameters of the target style migration model are obtained by performing a plurality of iterations of a back-propagation algorithm based on the image loss function.
22. An apparatus for video style migration, comprising:
the device comprises an acquisition unit and a processing unit, wherein the acquisition unit is used for acquiring a video to be processed, the video to be processed comprises N frames of content images to be processed, and N is an integer greater than or equal to 2;
the processing unit is used for carrying out image style migration processing on the N frames of content images to be processed according to the target style migration model to obtain N frames of synthetic images, and for obtaining, according to the N frames of synthetic images, the video after style migration processing that corresponds to the video to be processed,
wherein the parameters of the target style migration model are determined according to an image loss function used by the target style migration model for performing style migration processing on the N frames of sample content images, the image loss function comprises a low-rank loss function representing a difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is derived based on the N frames of sample content images and optical flow information, the second low-rank matrix is derived based on N frames of predicted synthetic images and the optical flow information, the optical flow information represents the position difference of corresponding pixel points between two adjacent frames of images in the N frames of sample content images, and the N frames of predicted synthetic images are obtained by performing, by the target style migration model, image style migration processing on the N frames of sample content images according to the sample style images.
23. The apparatus of claim 22, wherein the image loss function further comprises a residual loss function, the residual loss function being derived from a difference between a first sample synthetic image and a second sample synthetic image,
the first sample synthetic image is an image obtained by performing image style migration processing on the N frames of sample content images through a first model, the second sample synthetic image is an image obtained by performing image style migration processing on the N frames of sample content images through a second model, the first model and the second model are image style migration models trained in advance according to the sample style images, the second model comprises an optical flow module, the first model does not comprise the optical flow module, and the optical flow module is used for determining the optical flow information.
24. The apparatus of claim 23, wherein the first model and the second model are pre-trained teacher models, and the target style migration model is a target student model obtained by training a student model to be trained according to the residual loss function and a knowledge distillation algorithm.
25. The apparatus of claim 24, wherein the residual loss function is derived according to the following equation:

L_res = Σ_i || (N_T(x_i) - N'_T(x_i)) - (N_S(x_i) - N'_S(x_i)) ||

wherein L_res represents the residual loss function; N_T represents the second model; N'_T represents the first model; N_S represents the student model to be trained; N'_S represents the pre-trained basic model, which has the same network structure as the student model to be trained; and x_i represents the i-th frame of sample content image included in the sample video, i being a positive integer.
26. The apparatus according to any one of claims 22 to 25, wherein the image loss function further comprises a perceptual loss function, the perceptual loss function comprising a content loss and a style loss, wherein the content loss represents the image content difference between the N frames of predicted synthetic images and the corresponding N frames of sample content images, and the style loss represents the image style difference between the N frames of predicted synthetic images and the sample style images.
27. The apparatus of claim 26, wherein the image loss function is obtained by weighting the low rank loss function, the residual loss function, and the perceptual loss function.
28. The apparatus according to any one of claims 22 to 27, wherein the parameters of the target style migration model are obtained by performing a plurality of iterations of a back-propagation algorithm based on the image loss function.
29. A training apparatus for a style migration model, comprising a processor and a memory, wherein the memory is used for storing program instructions, and the processor is used for invoking the program instructions to execute the method of any one of claims 1 to 7 or 8 to 14.
30. A computer-readable storage medium, in which program instructions are stored, which, when executed by a processor, implement the method of any one of claims 1 to 7 or 8 to 14.
CN202010409043.0A 2020-05-14 2020-05-14 Training method of style migration model, video style migration method and device Active CN111667399B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010409043.0A CN111667399B (en) 2020-05-14 2020-05-14 Training method of style migration model, video style migration method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010409043.0A CN111667399B (en) 2020-05-14 2020-05-14 Training method of style migration model, video style migration method and device

Publications (2)

Publication Number Publication Date
CN111667399A true CN111667399A (en) 2020-09-15
CN111667399B CN111667399B (en) 2023-08-25

Family

ID=72383795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010409043.0A Active CN111667399B (en) 2020-05-14 2020-05-14 Training method of style migration model, video style migration method and device

Country Status (1)

Country Link
CN (1) CN111667399B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140156575A1 (en) * 2012-11-30 2014-06-05 Nuance Communications, Inc. Method and Apparatus of Processing Data Using Deep Belief Networks Employing Low-Rank Matrix Factorization
US20150254513A1 (en) * 2014-03-10 2015-09-10 Mitsubishi Electric Research Laboratories, Inc. Method for Extracting Low-Rank Descriptors from Images and Videos for Querying, Classification, and Object Detection
US20170169563A1 (en) * 2015-12-11 2017-06-15 Macau University Of Science And Technology Low-Rank and Sparse Matrix Decomposition Based on Schatten p=1/2 and L1/2 Regularizations for Separation of Background and Dynamic Components for Dynamic MRI
US20180300850A1 (en) * 2017-04-14 2018-10-18 Facebook, Inc. Artifact reduction for image style transfer
US20180373999A1 (en) * 2017-06-26 2018-12-27 Konica Minolta Laboratory U.S.A., Inc. Targeted data augmentation using neural style transfer
CN109840531A (en) * 2017-11-24 2019-06-04 华为技术有限公司 The method and apparatus of training multi-tag disaggregated model
CN108537776A (en) * 2018-03-12 2018-09-14 维沃移动通信有限公司 A kind of image Style Transfer model generating method and mobile terminal
CN109859096A (en) * 2018-12-28 2019-06-07 北京达佳互联信息技术有限公司 Image Style Transfer method, apparatus, electronic equipment and storage medium
CN109766895A (en) * 2019-01-03 2019-05-17 京东方科技集团股份有限公司 The training method and image Style Transfer method of convolutional neural networks for image Style Transfer
CN110175951A (en) * 2019-05-16 2019-08-27 西安电子科技大学 Video Style Transfer method based on time domain consistency constraint
CN110598781A (en) * 2019-09-05 2019-12-20 Oppo广东移动通信有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN111079781A (en) * 2019-11-07 2020-04-28 华南理工大学 Lightweight convolutional neural network image identification method based on low rank and sparse decomposition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TIAN, JIALU; DENG, LIGUO: "Monkey Image Classification Method Based on Improved VGG16", Information Technology and Network Security, no. 05 *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2596901A (en) * 2020-05-15 2022-01-12 Nvidia Corp Content-aware style encoding using neural networks
WO2022095757A1 (en) * 2020-11-09 2022-05-12 华为技术有限公司 Image rendering method and apparatus
CN112365556A (en) * 2020-11-10 2021-02-12 成都信息工程大学 Image extension method based on perception loss and style loss
CN114615421B (en) * 2020-12-07 2023-06-30 华为技术有限公司 Image processing method and electronic equipment
CN114615421A (en) * 2020-12-07 2022-06-10 华为技术有限公司 Image processing method and electronic device
CN112734627A (en) * 2020-12-24 2021-04-30 北京达佳互联信息技术有限公司 Training method of image style migration model, and image style migration method and device
CN112734627B (en) * 2020-12-24 2023-07-11 北京达佳互联信息技术有限公司 Training method of image style migration model, image style migration method and device
CN112785493A (en) * 2021-01-22 2021-05-11 北京百度网讯科技有限公司 Model training method, style migration method, device, equipment and storage medium
CN112785493B (en) * 2021-01-22 2024-02-09 北京百度网讯科技有限公司 Model training method, style migration method, device, equipment and storage medium
CN113822957A (en) * 2021-02-26 2021-12-21 北京沃东天骏信息技术有限公司 Method and apparatus for synthesizing image
CN113076685A (en) * 2021-03-04 2021-07-06 华为技术有限公司 Training method of image reconstruction model, image reconstruction method and device thereof
CN113033566A (en) * 2021-03-19 2021-06-25 北京百度网讯科技有限公司 Model training method, recognition method, device, storage medium, and program product
CN113362243A (en) * 2021-06-03 2021-09-07 Oppo广东移动通信有限公司 Model training method, image processing method and apparatus, medium, and electronic device
CN113362243B (en) * 2021-06-03 2024-06-11 Oppo广东移动通信有限公司 Model training method, image processing method and device, medium and electronic equipment
CN113327265B (en) * 2021-06-10 2022-07-15 厦门市美亚柏科信息股份有限公司 Optical flow estimation method and system based on guiding learning strategy
CN113327265A (en) * 2021-06-10 2021-08-31 厦门市美亚柏科信息股份有限公司 Optical flow estimation method and system based on guiding learning strategy
CN113570636A (en) * 2021-06-16 2021-10-29 北京农业信息技术研究中心 Draught fan ventilation amount detection method and device
CN113570636B (en) * 2021-06-16 2024-05-10 北京农业信息技术研究中心 Method and device for detecting ventilation quantity of fan
WO2023116711A1 (en) * 2021-12-24 2023-06-29 北京字跳网络技术有限公司 Video texture migration method and apparatus, electronic device and storage medium
WO2024066549A1 (en) * 2022-09-29 2024-04-04 华为技术有限公司 Data processing method and related device
CN117078790A (en) * 2023-10-13 2023-11-17 腾讯科技(深圳)有限公司 Image generation method, device, computer equipment and storage medium
CN117078790B (en) * 2023-10-13 2024-03-29 腾讯科技(深圳)有限公司 Image generation method, device, computer equipment and storage medium
CN117541625A (en) * 2024-01-05 2024-02-09 大连理工大学 Video multi-target tracking method based on domain adaptation feature fusion
CN117541625B (en) * 2024-01-05 2024-03-29 大连理工大学 Video multi-target tracking method based on domain adaptation feature fusion

Also Published As

Publication number Publication date
CN111667399B (en) 2023-08-25

Similar Documents

Publication Publication Date Title
CN111667399B (en) Training method of style migration model, video style migration method and device
CN110532871B (en) Image processing method and device
WO2021043168A1 (en) Person re-identification network training method and person re-identification method and apparatus
CN112308200B (en) Searching method and device for neural network
CN113284054B (en) Image enhancement method and image enhancement device
WO2022042713A1 (en) Deep learning training method and apparatus for use in computing device
CN109993707B (en) Image denoising method and device
CN111914997B (en) Method for training neural network, image processing method and device
CN112418392A (en) Neural network construction method and device
CN112446380A (en) Image processing method and device
CN112639828A (en) Data processing method, method and equipment for training neural network model
CN112446398A (en) Image classification method and device
CN112446834A (en) Image enhancement method and device
CN112070664B (en) Image processing method and device
CN113705769A (en) Neural network training method and device
US20220157046A1 (en) Image Classification Method And Apparatus
CN113570029A (en) Method for obtaining neural network model, image processing method and device
CN111882031A (en) Neural network distillation method and device
CN114255361A (en) Neural network model training method, image processing method and device
CN113191489B (en) Training method of binary neural network model, image processing method and device
CN112215332A (en) Searching method of neural network structure, image processing method and device
CN111797882A (en) Image classification method and device
CN112561028A (en) Method for training neural network model, and method and device for data processing
CN113065645A (en) Twin attention network, image processing method and device
CN112464930A (en) Target detection network construction method, target detection method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant