CN111182367A - Video generation method and device and computer system - Google Patents

Video generation method and device and computer system

Info

Publication number
CN111182367A
CN111182367A (application CN201911396267.6A)
Authority
CN
China
Prior art keywords
video
preset
target
classification
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911396267.6A
Other languages
Chinese (zh)
Inventor
黄敏敏
董邦发
杨现
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suning Cloud Computing Co Ltd
Original Assignee
Suning Cloud Computing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suning Cloud Computing Co Ltd filed Critical Suning Cloud Computing Co Ltd
Priority to CN201911396267.6A priority Critical patent/CN111182367A/en
Publication of CN111182367A publication Critical patent/CN111182367A/en
Priority to CA3166347A priority patent/CA3166347A1/en
Priority to PCT/CN2020/111952 priority patent/WO2021135320A1/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/812Monomedia components thereof involving advertisement data

Abstract

The application discloses a video generation method, a video generation device, and a computer system. The method comprises: receiving an initial video and a target video classification; segmenting the initial video into video segments according to a preset video segmentation method; inputting the video segments into a preset model and determining, for each video segment, the confidence of every preset video classification; determining the video segments corresponding to the target video classification according to the target video classification and those confidences; and splicing the video segments corresponding to the target video classification according to preset splicing parameters to obtain the target video. A target video meeting the requirements is thus generated automatically from the initial video, ensuring both the timeliness and the accuracy of video generation.

Description

Video generation method and device and computer system
Technical Field
The invention relates to the technical field of computer vision, in particular to a video generation method, a video generation device and a computer system.
Background
With the accelerating pace of life, consumers expect to obtain product information more intuitively. The traditional approach of displaying a commodity through a fixed set of product images can no longer satisfy an e-commerce platform's need to highlight product characteristics or help consumers make purchase decisions, and short product-display videos showing a commodity's functions or actual usage effects have become the mainstream of product promotion on major e-commerce platforms. However, the massive commodity videos uploaded by merchants and other users vary widely in quality and length and cannot directly meet the platform's publishing requirements.
In the prior art, commodity video generation methods fall into two categories: traditional manual editing and image-text-to-video conversion. In the traditional manual method, the uploaded original video is manually segmented into shots according to scene content, target material, and so on; the video segments meeting the publishing standard are then manually screened and spliced to obtain a short commodity-publishing video that meets the user's requirements.
In the image-text-to-video method, commodity display pictures provided by merchants are matted out and laid out over preset image backgrounds to form commodity pictures; template files such as video templates and background music are obtained from the platform's existing video material library, and commodity videos are generated in batches according to those template files. Although this enables large-batch generation, the style and format of the resulting videos depend entirely on the pre-configured templates, so the generated videos look alike, offer few formats, cannot visually present the actual state of the commodity to consumers, and have limited expressive power.
Disclosure of Invention
In order to solve the defects of the prior art, the invention mainly aims to provide a video generation method to realize automatic generation of a target video according to an initial video.
In order to achieve the above object, the present invention provides, in a first aspect, a method for generating a video, the method including:
receiving an initial video and a target video classification;
segmenting the initial video into video segments according to a preset video segmentation method;
inputting the video clips into a preset model, and determining the confidence of each video clip corresponding to all preset video classifications;
determining the video segments corresponding to the target video classification according to the target video classification and the confidence degrees of all preset video classifications corresponding to each video segment;
and splicing the video segments corresponding to the target video classification according to preset splicing parameters to obtain the target video.
In some embodiments, the slicing the initial video into video segments according to a preset video slicing method includes:
determining a shot boundary contained in the initial video by using a preset shot boundary detection method;
and segmenting the initial video into video segments according to the determined shot boundary.
In some embodiments, the shot boundaries include abrupt shots and gradual shots of the initial video, and the segmenting the initial video into video segments according to the determined shot boundaries comprises:
and removing the abrupt shot and the gradual shot from the initial video to obtain a video clip set, wherein the video clip set consists of the video clips left after removal.
In some embodiments, the video is composed of consecutive frames, and the determination of the abrupt shot and the gradual shot comprises:
calculating the degree of difference between each frame and its adjacent frames;
when the degree of difference exceeds a first preset threshold, judging the frame to be an abrupt-change frame, the abrupt shot consisting of consecutive abrupt-change frames;
when the degree of difference is between the first preset threshold and a second preset threshold, judging the frame to be a potential gradual-change frame;
when the number of consecutive potential gradual-change frames exceeds a third preset threshold, judging the potential gradual-change frames to be gradual-change frames, the gradual shot consisting of consecutive gradual-change frames.
In some embodiments, the inputting the video segments into a preset model, and the determining the confidence level of each video segment corresponding to all preset video classifications includes:
sampling the video clip according to a preset sampling method to obtain at least two sampling frames corresponding to the video clip;
and preprocessing the sampling frame, inputting the preprocessed sampling frame into the preset model, and obtaining the confidence coefficient of the video clip corresponding to all the preset video classifications.
In some embodiments, the inputting the preprocessed sampling frame into the preset model includes:
and extracting space-time characteristics contained in the preprocessed sampling frame, and inputting the space-time characteristics into the preset model.
In some embodiments, the preset model is a pre-trained MFnet three-dimensional convolutional neural network model.
In some embodiments, the method further includes receiving a target duration, and determining the video segments corresponding to the target video classification according to the target video classification and the confidence levels of all preset video classifications corresponding to each of the video segments includes:
and determining the video segments corresponding to the target video classification according to the target duration, the target video classification, the confidence degrees of all preset video classifications corresponding to each video segment and the duration of the video segments.
In a second aspect, an apparatus for generating a video, the apparatus comprising:
the receiving module is used for receiving the initial video and the target video classification;
the segmentation module is used for segmenting the initial video into video segments according to a preset video segmentation method;
the processing module is used for inputting the video clips into a preset model and determining the confidence of each video clip corresponding to all preset video classifications;
the matching module is used for determining the video segments corresponding to the target video classification according to the target video classification and the confidence degrees of all preset video classifications corresponding to each video segment;
and the splicing module is used for splicing the video clips corresponding to the target video classification according to preset splicing parameters to obtain the target video.
In a third aspect, the present application provides a computer system comprising:
one or more processors;
and memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
receiving an initial video and a target video classification;
segmenting the initial video into video segments according to a preset video segmentation method;
inputting the video clips into a preset model, and determining the confidence of each video clip corresponding to all preset video classifications;
determining the video segments corresponding to the target video classification according to the target video classification and the confidence degrees of all preset video classifications corresponding to each video segment;
and splicing the video segments corresponding to the target video classification according to preset splicing parameters to obtain the target video.
The invention has the following beneficial effects:
the invention discloses a video generation method, which comprises the steps of receiving an initial video and a target video classification, segmenting the initial video into video segments according to a preset video segmentation method, inputting the video segments into a preset model, obtaining confidence coefficients of all preset video classifications corresponding to each video segment, and determining the video segments corresponding to the target video classification according to the target video classification and the confidence coefficients of all the preset video classifications corresponding to each video segment; according to the preset splicing parameters, the video segments corresponding to the target video classification are spliced to obtain the target video, so that the target video meeting the requirements is generated according to the initial video, and the timeliness and the accuracy of video generation are ensured;
the invention also provides a preset shot boundary detection method for determining the shot boundary contained in the initial video; segmenting the initial video into video segments according to the determined shot boundaries, and further providing that the shot boundaries comprise abrupt shots and gradual shots of the initial video, wherein the segmenting the initial video into the video segments according to the determined shot boundaries comprises: and removing the abrupt shot and the gradual shot from the initial video to obtain a video clip set, wherein the video clip set consists of the video clips left after removal. The accuracy of video segment segmentation is ensured;
the application discloses sampling the video clip according to a preset sampling method to obtain at least two sampling frames corresponding to the video clip; preprocessing the sampling frame, inputting the preprocessed sampling frame into the preset model, and obtaining the confidence degrees of all preset video classifications corresponding to the video clips; determining the preset video classification corresponding to the confidence coefficient with the maximum value as the preset video classification corresponding to the video clip, wherein the confidence coefficient with the maximum value is the confidence coefficient of the video clip; and determining the confidence degrees of the video segments corresponding to the target video classification and the corresponding video segments according to the preset video classifications and the confidence degrees corresponding to all the video segments, thereby ensuring the accuracy of confidence degree calculation.
All products of the present invention need not have all of the above-described effects.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are obviously only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a model network structure provided in an embodiment of the present application;
fig. 2 is a flowchart of shot segmentation provided in an embodiment of the present application;
FIG. 3 is a flow chart of model training provided by an embodiment of the present application;
FIG. 4 is a flow chart of a method provided by an embodiment of the present application;
FIG. 5 is a block diagram of an apparatus according to an embodiment of the present disclosure;
fig. 6 is a computer system structure diagram provided in the embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As described in the background, the two commodity-video generation methods commonly used in the prior art each have limitations. Manual editing requires high labor cost and has low efficiency, so it cannot meet the actual demand for generating commodity videos in large batches; the image-text conversion method is more efficient, but the available video formats and styles are few and fixed, and its expressive power is limited.
To solve this technical problem, the application proposes segmenting the video uploaded by the user with a preset segmentation method to obtain video segments, classifying each video segment with a preset classification model to obtain the confidence corresponding to each segment, and then, according to the target video classification selected by the user, splicing the segments of that classification whose confidences meet preset conditions to obtain the target video. A target video meeting the requirements is thus generated from the user's uploaded video while the timeliness of video generation is ensured.
Example one
To classify the video segments obtained by segmentation, a classification model needs to be trained in advance; specifically, an MFnet three-dimensional convolutional neural network model may be used as the classification model. MFnet is a lightweight deep learning model: compared with recent deep learning models such as I3D and SlowFast, it is more compact, requires fewer floating-point operations (FLOPs), and achieves better results on the test data set.
The training process comprises:
110. importing a training data set;
the training data set may be generated by:
111. acquiring a preset number of commodity videos and creating a corresponding video folder for each video;
112. dividing the segments contained in each video into different categories according to the content presented, the categories including but not limited to commodity main-body appearance, commodity usage scene, and commodity content introduction, and manually clipping according to the divided categories;
113. establishing, in the folder corresponding to each video, a main folder for each category, labeled with that category, where each main folder contains one or more sub-video-clip folders of that category and each sub-video-clip folder stores one or more image frames of the corresponding video clip;
114. densely sampling the folder corresponding to each video and normalizing the samples into N × C × H × W, where N represents the number of frames sampled from each sub-video-clip folder, C the RGB channels of each frame, H the preset frame height, and W the preset frame width; preferably, N is at least 8 (a minimal sampling sketch follows this list).
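The following is a minimal sketch of the dense sampling and normalization of step 114, assuming each sub-video-clip folder stores the clip's frames as image files; the function name, the 224 × 224 preset size, the [0, 1] scaling, and the OpenCV/NumPy tooling are illustrative assumptions, not part of the original disclosure.

```python
import os
import cv2
import numpy as np

def load_clip_tensor(clip_dir, n_frames=8, height=224, width=224):
    """Densely sample n_frames from a sub-video-clip folder and
    normalize them into an N x C x H x W array (step 114).  The
    224 x 224 preset size and [0, 1] scaling are assumptions."""
    frame_files = sorted(os.listdir(clip_dir))
    # Uniformly spaced indices over the stored frames (dense sampling).
    idx = np.linspace(0, len(frame_files) - 1, n_frames).astype(int)
    frames = []
    for i in idx:
        img = cv2.imread(os.path.join(clip_dir, frame_files[i]))
        img = cv2.cvtColor(cv2.resize(img, (width, height)), cv2.COLOR_BGR2RGB)
        frames.append(img)
    clip = np.stack(frames).astype(np.float32) / 255.0  # N x H x W x C
    return clip.transpose(0, 3, 1, 2)                   # N x C x H x W
```

The resulting N × C × H × W array matches the sample layout described above; the simple pixel scaling stands in for whatever normalization the actual pipeline applies.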
120. Training the MFnet three-dimensional convolution neural network model by using a training data set to obtain a preset model;
Fig. 1 shows a schematic diagram of the model's network structure. It contains a 3D CNN for extracting the three-dimensional convolutional features of each sample; these features include spatio-temporal information, i.e. the motion of objects in the video stream, such as the movement trend of the commodity and changes of the background.
3D Pooling is the model's pooling layer; it pools the output of the 3D CNN and feeds the pooling result into the 3D MF-Unit layer, which performs convolutions with different kernels such as 1 × 1 × 1, 3 × 3 × 3, and 1 × 3 × 3;
Global Pool retains the main features of the input while reducing unnecessary parameters;
the FC layer is a fully connected layer that outputs a confidence for each category for each video segment.
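As an illustration only, here is a simplified PyTorch stand-in for the layer roles just described (3D CNN, 3D pooling, mixed 1 × 1 × 1 / 3 × 3 × 3 / 1 × 3 × 3 convolutions, global pooling, and a fully connected head). The actual MFnet/MF-Unit topology is not disclosed in this text, so every layer width and kernel choice below is hypothetical.

```python
import torch
import torch.nn as nn

class Simple3DClassifier(nn.Module):
    """Simplified stand-in for the described pipeline: 3D CNN ->
    3D pooling -> mixed-kernel units -> global pooling -> FC layer.
    All widths and kernel choices here are hypothetical."""
    def __init__(self, num_classes=3):
        super().__init__()
        self.stem = nn.Conv3d(3, 16, kernel_size=3, padding=1)   # "3D CNN"
        self.pool = nn.MaxPool3d(kernel_size=2)                  # "3D Pooling"
        self.mixed = nn.Sequential(                              # MF-Unit-like kernel mix
            nn.Conv3d(16, 32, kernel_size=(1, 1, 1)),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, 32, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, 64, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.ReLU(inplace=True),
        )
        self.head = nn.Linear(64, num_classes)                   # "FC layer"

    def forward(self, x):                 # x: batch x C x N x H x W
        x = self.mixed(self.pool(self.stem(x)))
        x = x.mean(dim=(2, 3, 4))         # "Global Pool" over time and space
        return torch.softmax(self.head(x), dim=1)  # per-category confidence
```

A clip tensor from the sampling sketch above would be permuted to C × N × H × W and batched before being fed in.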
Using the model, a test set of 56 commodity short videos was evaluated; the results are shown in Table 1 (the table appears only as an image in the original publication, so its contents are not reproduced here).
The model classifies samples obtained by dense sampling of a single shot. On 1119 test samples from the video data set, the classification accuracy reaches 95.92%; the single model is only 29.6 MB, and the forward inference time for a densely sampled single-shot video is 330 ms, so the model is both accurate and fast.
After the preset model is obtained, the generation of the video can be realized according to the model, as shown in fig. 2, the generation process includes:
step one, receiving an initial video input by a user;
step two, performing shot boundary detection on the initial video, segmenting the video according to a detection result, and eliminating redundant segments to obtain video segments;
as shown in fig. 3, the shot boundary detection process includes:
Firstly, each frame of the initial video is divided equally into a preset number of sub-blocks using the same preset method; a sub-histogram is then calculated for each sub-block, and the histogram difference between sub-blocks at the same position in adjacent frames is calculated from the sub-histograms, where the adjacent frames of a frame are the frames immediately before and after it. When the difference value exceeds a first preset threshold T_H for a number of sub-blocks higher than a second preset threshold, the frame is considered an abrupt-change frame, and consecutive abrupt-change frames form an abrupt shot. A frame whose difference value lies between the first preset threshold T_H and a third preset threshold T_L is considered a potential start frame; when the difference values of the following frames also lie between T_L and T_H and this lasts longer than a fourth preset threshold, those consecutive frames are regarded as gradual-change frames forming a gradual shot. The shots remaining after the gradual and abrupt shots are removed are regarded as normal shots.
To ensure the quality of the generated video, normal shots shorter than a fifth preset threshold also need to be removed, finally yielding the required set of video segments.
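Below is a compact sketch of the block-histogram difference computation underlying the detection logic above, assuming grayscale frames; the block count, bin count, and thresholds are placeholders, since the text leaves the concrete preset values (T_H, T_L, the sub-block count threshold, and the minimum shot length) unspecified.

```python
import cv2
import numpy as np

def block_hist_diffs(f1, f2, blocks=4, bins=32):
    """Histogram differences between sub-blocks at the same position of
    two adjacent grayscale frames (each frame split into blocks x blocks
    sub-blocks, as described above)."""
    h, w = f1.shape[:2]
    diffs = []
    for r in range(blocks):
        for c in range(blocks):
            sl = np.s_[r * h // blocks:(r + 1) * h // blocks,
                       c * w // blocks:(c + 1) * w // blocks]
            h1 = cv2.calcHist([f1[sl]], [0], None, [bins], [0, 256])
            h2 = cv2.calcHist([f2[sl]], [0], None, [bins], [0, 256])
            diffs.append(float(np.abs(h1 - h2).sum()))
    return diffs

def abrupt_frames(gray_frames, t_high, block_count_thr, blocks=4):
    """Flag abrupt-change frames: a frame is abrupt when the number of
    sub-blocks whose difference to the previous frame exceeds t_high is
    itself above block_count_thr.  Thresholds are assumed inputs; the
    text gives no concrete values."""
    flagged = []
    for i in range(1, len(gray_frames)):
        diffs = block_hist_diffs(gray_frames[i - 1], gray_frames[i], blocks)
        if sum(d > t_high for d in diffs) > block_count_thr:
            flagged.append(i)
    return flagged
```

Gradual-shot detection follows the same per-block differences, tracking runs of frames whose differences stay between T_L and T_H, as described in the prose above.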
step three, sampling the video clips and inputting the sampling results into the preset model to obtain the category and confidence corresponding to each video clip;
Firstly, the video clips are subjected to random dense sampling in the time order of the video.
The random dense sampling process comprises:
and (3) randomly initializing sampling points on the video clip, uniformly sampling N frames by taking the sampling points as seven points and the end of the video clip as a key point, and preprocessing the sampling frames to ensure that the sampling frames meet the input size requirement of a preset model.
And then inputting the preprocessed sampling frame into a preset model to obtain confidence degrees of all categories corresponding to the video clip containing the sampling frame.
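A small sketch of the random dense sampling step follows, under the reading that a random start point is drawn and N frame indices are then sampled uniformly from that point to the end of the clip; the function name and defaults are illustrative.

```python
import numpy as np

def random_dense_sample(clip_len, n=8, rng=None):
    """Random dense sampling sketch: draw a random start point on the
    clip, then uniformly sample n frame indices from the start point
    to the end of the clip (names and defaults are assumptions)."""
    if rng is None:
        rng = np.random.default_rng()
    start = int(rng.integers(0, max(1, clip_len - n)))
    return np.linspace(start, clip_len - 1, n).astype(int)
```

The sampled indices would then pass through the same resize/normalize preprocessing sketched in the training section before inference.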
step four, according to the target category and the target duration selected by the user, screening the video clips corresponding to the target category that will be spliced into the target video;
For example, when the user wants a video displaying the appearance of the commodity, the video clips are sorted by their confidence in the appearance-display category, and the clips meeting the requirements are screened out.
Specific screening rules may include:
When the duration T_i of the video segment with the highest confidence meets the target duration requirement, that video segment is directly taken as the target video.
When the duration T_i of the video segment with the highest confidence does not meet the target duration requirement, n video segments with durations T_j, j ∈ [1, n], are selected sequentially in descending order of confidence until the following condition is satisfied:
T_1 ≤ Σ_{j=1}^{n} T_j ≤ T_2
where the interval from T_1 to T_2 represents the target duration range.
When the duration of the n + 1 shots selected according to the confidence ranking exceeds the maximum duration T_2, the longest shot is trimmed at head and tail according to the duration of each shot until the total duration meets the target duration requirement.
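The screening rules can be read as a greedy selection by descending confidence. The sketch below follows that reading, with (confidence, duration, id) tuples and a target range [t_min, t_max] (T_1 and T_2 in the text) as assumed inputs; the head-and-tail trimming of the longest shot is left to the caller, which only receives the overshoot amount.

```python
def select_segments(segments, t_min, t_max):
    """Greedy screening sketch.  segments: list of (confidence,
    duration, segment_id) tuples for the target category.  Segments
    are added in descending confidence until the total duration
    reaches the target range; any overshoot beyond t_max is returned
    so the caller can trim the longest shot head and tail."""
    ranked = sorted(segments, key=lambda s: s[0], reverse=True)
    chosen, total = [], 0.0
    for conf, dur, seg_id in ranked:
        chosen.append((conf, dur, seg_id))
        total += dur
        if total >= t_min:
            break
    overshoot = max(0.0, total - t_max)  # amount to trim from the longest shot
    return chosen, overshoot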
step five, sequentially splicing the video clips obtained in step four in the time order of the initial video to obtain the target video.
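For the final splice, one possible realization concatenates the selected clips in the time order of the initial video, e.g. with moviepy; the library choice and output defaults are assumptions, as the text only speaks of preset splicing parameters.

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

def splice(clip_paths_in_time_order, out_path="target.mp4"):
    """Concatenate the selected segments in the time order of the
    initial video (step five).  moviepy is one possible tool; the
    codec/fps defaults here stand in for the preset splicing
    parameters."""
    clips = [VideoFileClip(p) for p in clip_paths_in_time_order]
    concatenate_videoclips(clips).write_videofile(out_path)
    for c in clips:
        c.close()
```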
The generated target video can be stored in a video database and reused when needed next time, or used for continuously training the model.
Based on the scheme provided by the application, the target video meeting the requirement can be generated according to the uploaded video of the user, and meanwhile, the timeliness of video generation is guaranteed.
Example two
Corresponding to the foregoing embodiments, the present application provides a video generation method, as shown in fig. 4, the method includes:
410. receiving an initial video and a target video classification;
420. segmenting the initial video into video segments according to a preset video segmentation method;
preferably, the method comprises:
421. determining a shot boundary contained in the initial video by using a preset shot boundary detection method;
and segmenting the initial video into video segments according to the determined shot boundary.
Preferably, the shot boundaries include abrupt shots and gradual shots of the initial video, the method comprising:
422. removing the abrupt shot and the gradual shot from the initial video to obtain a video clip set, wherein the video clip set consists of the video clips left after removal.
Preferably, the video is composed of consecutive frames, and the process of determining the abrupt shot and the gradual shot includes:
423. calculating the degree of difference between each frame and its adjacent frames;
when the degree of difference exceeds a first preset threshold, judging the frame to be an abrupt-change frame, the abrupt shot consisting of consecutive abrupt-change frames;
when the degree of difference is between the first preset threshold and a second preset threshold, judging the frame to be a potential gradual-change frame;
when the number of consecutive potential gradual-change frames exceeds a third preset threshold, judging the potential gradual-change frames to be gradual-change frames, the gradual shot consisting of consecutive gradual-change frames.
430. Inputting the video clips into a preset model, and determining the confidence of each video clip corresponding to all preset video classifications;
preferably, the method comprises:
431. sampling the video clip according to a preset sampling method to obtain at least two sampling frames corresponding to the video clip;
and preprocessing the sampling frame, inputting the preprocessed sampling frame into the preset model, and obtaining the confidence coefficient of the video clip corresponding to all the preset video classifications.
Preferably, at least eight sampling frames are obtained.
Preferably, the inputting the preprocessed sampling frame into the preset model includes:
432. and extracting space-time characteristics contained in the preprocessed sampling frame, and inputting the space-time characteristics into the preset model.
Preferably, the preset model is a pre-trained MFnet three-dimensional convolutional neural network model.
440. Determining the video segments corresponding to the target video classification according to the target video classification and the confidence degrees of all preset video classifications corresponding to each video segment;
preferably, the method further includes receiving a target duration, and determining the video segments corresponding to the target video classification according to the target video classification and confidence levels of all preset video classifications corresponding to each video segment includes:
441. and determining the video segments corresponding to the target video classification according to the target duration, the target video classification, the confidence degrees of all preset video classifications corresponding to each video segment and the duration of the video segments.
450. And splicing the video segments corresponding to the target video classification according to preset splicing parameters to obtain the target video.
EXAMPLE III
Corresponding to the above method embodiment, the present application provides a video generation apparatus, as shown in fig. 5, the apparatus includes:
a receiving module 510, configured to receive an initial video and a target video category;
a segmentation module 520, configured to segment the initial video into video segments according to a preset video segmentation method;
a processing module 530, configured to input the video segments into a preset model, and determine confidence levels of all preset video classifications corresponding to each of the video segments;
a matching module 540, configured to determine, according to the target video classification and confidence levels of all preset video classifications corresponding to each video segment, the video segment corresponding to the target video classification;
and a splicing module 550, configured to splice the video segments corresponding to the target video categories according to preset splicing parameters, so as to obtain a target video.
Preferably, the segmentation module 520 is further configured to determine a shot boundary included in the initial video by using a preset shot boundary detection method;
and segmenting the initial video into video segments according to the determined shot boundary.
Preferably, the shot boundary includes a sudden change shot and a gradual change shot of the initial video, and the segmentation module 520 may be further configured to remove the sudden change shot and the gradual change shot from the initial video to obtain a video segment set, where the video segment set is composed of the video segments remaining after the removal.
Preferably, the video is composed of consecutive frames, and the segmentation module 520 is further configured to calculate the degree of difference between each frame and its adjacent frames; when the degree of difference exceeds a first preset threshold, judge the frame to be an abrupt-change frame, the abrupt shot consisting of consecutive abrupt-change frames; when the degree of difference is between the first preset threshold and a second preset threshold, judge the frame to be a potential gradual-change frame; and when the number of consecutive potential gradual-change frames exceeds a third preset threshold, judge the potential gradual-change frames to be gradual-change frames, the gradual shot consisting of consecutive gradual-change frames.
Preferably, the processing module 530 is further configured to sample the video segment according to a preset sampling method to obtain at least two sampling frames corresponding to the video segment, preprocess the sampling frames, and input the preprocessed sampling frames into the preset model to obtain the confidence of the video clip corresponding to all the preset video classifications.
Preferably, the processing module 530 is further configured to extract the spatio-temporal features contained in the preprocessed sampling frames and input the spatio-temporal features into the preset model.
Preferably, the preset model is a pre-trained MFnet three-dimensional convolutional neural network model.
Preferably, the receiving module 510 is further configured to receive a target duration, and the matching module 540 is further configured to determine the video segment corresponding to the target video classification according to the target duration, the target video classification, confidence levels of all preset video classifications corresponding to each video segment, and the duration of the video segment.
Example four
Corresponding to the above method, apparatus, and system, a fourth embodiment of the present application provides a computer system, including: one or more processors; and memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising: receiving an initial video and a target video classification;
segmenting the initial video into video segments according to a preset video segmentation method;
inputting the video clips into a preset model, and determining the confidence of each video clip corresponding to all preset video classifications;
determining the video segments corresponding to the target video classification according to the target video classification and the confidence degrees of all preset video classifications corresponding to each video segment;
and splicing the video segments corresponding to the target video classification according to preset splicing parameters to obtain the target video.
Fig. 6 illustrates an architecture of a computer system, which may include, in particular, a processor 1510, a video display adapter 1511, a disk drive 1512, an input/output interface 1513, a network interface 1514, and a memory 1520. The processor 1510, video display adapter 1511, disk drive 1512, input/output interface 1513, network interface 1514, and memory 1520 may be communicatively coupled via a communication bus 1530.
The processor 1510 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solution provided by the present Application.
The memory 1520 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1520 may store an operating system 1521 for controlling the operation of the computer system 1500 and a Basic Input Output System (BIOS) for controlling low-level operations of the computer system 1500. In addition, a web browser 1523, a data storage management system 1524, an icon font processing system 1525, and the like can also be stored. The icon font processing system 1525 may be the application program that implements the operations of the foregoing steps in this embodiment of the application. When the technical solution provided by the present application is implemented by software or firmware, the relevant program code is stored in the memory 1520 and called for execution by the processor 1510.
The input/output interface 1513 is used for connecting an input/output module to realize information input and output. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The network interface 1514 is used to connect a communication module (not shown) to enable the device to communicatively interact with other devices. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
The bus 1530 includes a path to transfer information between the various components of the device, such as the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, and the memory 1520.
In addition, the computer system 1500 may also obtain information of specific extraction conditions from the virtual resource object extraction condition information database 1541 for performing condition judgment, and the like.
It should be noted that although the above devices only show the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, the memory 1520, the bus 1530, etc., in a specific implementation, the devices may also include other components necessary for proper operation. Furthermore, it will be understood by those skilled in the art that the apparatus described above may also include only the components necessary to implement the solution of the present application, and not necessarily all of the components shown in the figures.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, or the like, and includes several instructions for enabling a computer device (which may be a personal computer, a cloud server, or a network device) to execute the method according to the embodiments or some parts of the embodiments of the present application.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, the system or system embodiments are substantially similar to the method embodiments and therefore are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described system and system embodiments are only illustrative, wherein the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A method for generating a video, the method comprising:
receiving an initial video and a target video classification;
segmenting the initial video into video segments according to a preset video segmentation method;
inputting the video clips into a preset model, and determining the confidence of each video clip corresponding to all preset video classifications;
determining the video segments corresponding to the target video classification according to the target video classification and the confidence degrees of all preset video classifications corresponding to each video segment;
and splicing the video segments corresponding to the target video classification according to preset splicing parameters to obtain the target video.
2. The method of claim 1, wherein said segmenting the initial video into video segments according to a preset video segmentation method comprises:
determining a shot boundary contained in the initial video by using a preset shot boundary detection method;
and segmenting the initial video into video segments according to the determined shot boundary.
3. The method of claim 2, wherein the shot boundaries comprise abrupt shots and gradual shots of the initial video, and wherein the segmenting the initial video into video segments according to the determined shot boundaries comprises:
and removing the abrupt shot and the gradual shot from the initial video to obtain a video clip set, wherein the video clip set consists of the video clips left after removal.
4. The method according to claim 3, wherein the video is composed of consecutive frames, and the determining process of the abrupt shot and the gradual shot comprises:
calculating the degree of difference between each frame and its adjacent frames;
when the degree of difference exceeds a first preset threshold, judging the frame to be an abrupt-change frame, wherein the abrupt shot consists of consecutive abrupt-change frames;
when the degree of difference is between the first preset threshold and a second preset threshold, judging the frame to be a potential gradual-change frame;
when the number of consecutive potential gradual-change frames exceeds a third preset threshold, judging the potential gradual-change frames to be gradual-change frames, wherein the gradual shot consists of consecutive gradual-change frames.
5. The method according to any one of claims 1-4, wherein said inputting said video segments into a predetermined model and said determining confidence level of each of said video segments for all predetermined video categories comprises:
sampling the video clip according to a preset sampling method to obtain at least two sampling frames corresponding to the video clip;
and preprocessing the sampling frame, inputting the preprocessed sampling frame into the preset model, and obtaining the confidence coefficient of the video clip corresponding to all the preset video classifications.
6. The method of claim 5, wherein the inputting the preprocessed sample frame into the preset model comprises:
and extracting space-time characteristics contained in the preprocessed sampling frame, and inputting the space-time characteristics into the preset model.
7. The method according to any one of claims 1-4, wherein the predetermined model is a pre-trained MFnet three-dimensional convolutional neural network model.
8. The method according to any one of claims 1-4, wherein the method further comprises receiving a target duration, and wherein determining the video segment corresponding to the target video classification according to the target video classification and the confidence level of each of the video segments corresponding to all of the preset video classifications comprises:
and determining the video segments corresponding to the target video classification according to the target duration, the target video classification, the confidence degrees of all preset video classifications corresponding to each video segment and the duration of the video segments.
9. An apparatus for generating a video, the apparatus comprising:
the receiving module is used for receiving the initial video and the target video classification;
the segmentation module is used for segmenting the initial video into video segments according to a preset video segmentation method;
the processing module is used for inputting the video clips into a preset model and determining the confidence of each video clip corresponding to all preset video classifications;
the matching module is used for determining the video segments corresponding to the target video classification according to the target video classification and the confidence degrees of all preset video classifications corresponding to each video segment;
and the splicing module is used for splicing the video clips corresponding to the target video classification according to preset splicing parameters to obtain the target video.
10. A computer system, the system comprising:
one or more processors;
and memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
receiving an initial video and a target video classification;
segmenting the initial video into video segments according to a preset video segmentation method;
inputting the video clips into a preset model, and determining the confidence of each video clip corresponding to all preset video classifications;
determining the video segments corresponding to the target video classification according to the target video classification and the confidence degrees of all preset video classifications corresponding to each video segment;
and splicing the video segments corresponding to the target video classification according to preset splicing parameters to obtain the target video.
CN201911396267.6A 2019-12-30 2019-12-30 Video generation method and device and computer system Pending CN111182367A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201911396267.6A CN111182367A (en) 2019-12-30 2019-12-30 Video generation method and device and computer system
CA3166347A CA3166347A1 (en) 2019-12-30 2020-08-28 Video generation method and apparatus, and computer system
PCT/CN2020/111952 WO2021135320A1 (en) 2019-12-30 2020-08-28 Video generation method and apparatus, and computer system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911396267.6A CN111182367A (en) 2019-12-30 2019-12-30 Video generation method and device and computer system

Publications (1)

Publication Number Publication Date
CN111182367A true CN111182367A (en) 2020-05-19

Family

ID=70657587

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911396267.6A Pending CN111182367A (en) 2019-12-30 2019-12-30 Video generation method and device and computer system

Country Status (3)

Country Link
CN (1) CN111182367A (en)
CA (1) CA3166347A1 (en)
WO (1) WO2021135320A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110545462A (en) * 2018-05-29 2019-12-06 优酷网络技术(北京)有限公司 video processing method and device
CN112132931A (en) * 2020-09-29 2020-12-25 新华智云科技有限公司 Processing method, device and system for templated video synthesis
CN112632326A (en) * 2020-12-24 2021-04-09 北京风平科技有限公司 Video production method and device based on video script semantic recognition
WO2021120685A1 (en) * 2019-12-20 2021-06-24 苏宁云计算有限公司 Video generation method and apparatus, and computer system
CN113676671A (en) * 2021-09-27 2021-11-19 北京达佳互联信息技术有限公司 Video editing method and device, electronic equipment and storage medium
CN115460446A (en) * 2022-08-19 2022-12-09 上海爱奇艺新媒体科技有限公司 Alignment method and device for multiple paths of video signals and electronic equipment

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115442660B (en) * 2022-08-31 2023-05-19 杭州影象官科技有限公司 Self-supervision countermeasure video abstract extraction method, device, equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101013444A (en) * 2007-02-13 2007-08-08 华为技术有限公司 Method and apparatus for adaptively generating abstract of football video
CN101252646A (en) * 2008-01-24 2008-08-27 王志远 Method for realizing video frequency propaganda film modularization making
CN109121021A (en) * 2018-09-28 2019-01-01 北京周同科技有限公司 A kind of generation method of Video Roundup, device, electronic equipment and storage medium
US20190052701A1 (en) * 2013-09-15 2019-02-14 Yogesh Rathod System, method and platform for user content sharing with location-based external content integration
CN109657100A (en) * 2019-01-25 2019-04-19 深圳市商汤科技有限公司 Video Roundup generation method and device, electronic equipment and storage medium
CN110232357A (en) * 2019-06-17 2019-09-13 深圳航天科技创新研究院 A kind of video lens dividing method and system
US20190286749A1 (en) * 2018-03-15 2019-09-19 Microsoft Technology Licensing, Llc Query interpolation in computer text input
CN110602526A (en) * 2019-09-11 2019-12-20 腾讯科技(深圳)有限公司 Video processing method, video processing device, computer equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2216781C2 (en) * 2001-06-29 2003-11-20 Самсунг Электроникс Ко., Лтд Image-based method for presenting and visualizing three-dimensional object and method for presenting and visualizing animated object
US9189884B2 (en) * 2012-11-13 2015-11-17 Google Inc. Using video to encode assets for swivel/360-degree spinners
RU2586566C1 (en) * 2015-03-25 2016-06-10 Общество с ограниченной ответственностью "Лаборатория 24" Method of displaying object
US10147226B1 (en) * 2016-03-08 2018-12-04 Pixelworks, Inc. 2D motion vectors from 3D model data
CN107767432A (en) * 2017-09-26 2018-03-06 盐城师范学院 A kind of real estate promotional system using three dimensional virtual technique
CN110312117B (en) * 2019-06-12 2021-06-18 北京达佳互联信息技术有限公司 Data refreshing method and device
CN111161392B (en) * 2019-12-20 2022-12-16 苏宁云计算有限公司 Video generation method and device and computer system

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101013444A (en) * 2007-02-13 2007-08-08 华为技术有限公司 Method and apparatus for adaptively generating abstract of football video
CN101252646A (en) * 2008-01-24 2008-08-27 王志远 Method for realizing video frequency propaganda film modularization making
US20190052701A1 (en) * 2013-09-15 2019-02-14 Yogesh Rathod System, method and platform for user content sharing with location-based external content integration
US20190286749A1 (en) * 2018-03-15 2019-09-19 Microsoft Technology Licensing, Llc Query interpolation in computer text input
CN109121021A (en) * 2018-09-28 2019-01-01 北京周同科技有限公司 A kind of generation method of Video Roundup, device, electronic equipment and storage medium
CN109657100A (en) * 2019-01-25 2019-04-19 深圳市商汤科技有限公司 Video Roundup generation method and device, electronic equipment and storage medium
CN110232357A (en) * 2019-06-17 2019-09-13 深圳航天科技创新研究院 A kind of video lens dividing method and system
CN110602526A (en) * 2019-09-11 2019-12-20 腾讯科技(深圳)有限公司 Video processing method, video processing device, computer equipment and storage medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110545462A (en) * 2018-05-29 2019-12-06 优酷网络技术(北京)有限公司 video processing method and device
WO2021120685A1 (en) * 2019-12-20 2021-06-24 苏宁云计算有限公司 Video generation method and apparatus, and computer system
CN112132931A (en) * 2020-09-29 2020-12-25 新华智云科技有限公司 Processing method, device and system for templated video synthesis
CN112132931B (en) * 2020-09-29 2023-12-19 新华智云科技有限公司 Processing method, device and system for templated video synthesis
CN112632326A (en) * 2020-12-24 2021-04-09 北京风平科技有限公司 Video production method and device based on video script semantic recognition
CN112632326B (en) * 2020-12-24 2022-02-18 北京风平科技有限公司 Video production method and device based on video script semantic recognition
CN113676671A (en) * 2021-09-27 2021-11-19 北京达佳互联信息技术有限公司 Video editing method and device, electronic equipment and storage medium
CN115460446A (en) * 2022-08-19 2022-12-09 上海爱奇艺新媒体科技有限公司 Alignment method and device for multiple paths of video signals and electronic equipment

Also Published As

Publication number Publication date
CA3166347A1 (en) 2021-07-08
WO2021135320A1 (en) 2021-07-08

Similar Documents

Publication Publication Date Title
CN111182367A (en) Video generation method and device and computer system
CN109688463B (en) Clip video generation method and device, terminal equipment and storage medium
US10657652B2 (en) Image matting using deep learning
WO2021120685A1 (en) Video generation method and apparatus, and computer system
CN109977983B (en) Method and device for obtaining training image
CN111950723A (en) Neural network model training method, image processing method, device and terminal equipment
JP7394809B2 (en) Methods, devices, electronic devices, media and computer programs for processing video
CN111680678B (en) Target area identification method, device, equipment and readable storage medium
CN111432206A (en) Video definition processing method and device based on artificial intelligence and electronic equipment
CN111062854A (en) Method, device, terminal and storage medium for detecting watermark
CN111814913A (en) Training method and device for image classification model, electronic equipment and storage medium
CN109241930B (en) Method and apparatus for processing eyebrow image
CN111461211A (en) Feature extraction method for lightweight target detection and corresponding detection method
CN111144215A (en) Image processing method, image processing device, electronic equipment and storage medium
CN112839185B (en) Method, apparatus, device and medium for processing image
CN111597845A (en) Two-dimensional code detection method, device and equipment and readable storage medium
CN112215221A (en) Automatic vehicle frame number identification method
CN114399497A (en) Text image quality detection method and device, computer equipment and storage medium
CN113987264A (en) Video abstract generation method, device, equipment, system and medium
CN114360053A (en) Action recognition method, terminal and storage medium
CN115147434A (en) Image processing method, device, terminal equipment and computer readable storage medium
CN112449249A (en) Video stream processing method and device, electronic equipment and storage medium
CN113971627A (en) License plate picture generation method and device
CN112749660A (en) Method and equipment for generating video content description information
CN112528897B (en) Portrait age estimation method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20200519