CN111182367A - Video generation method and device and computer system - Google Patents

Video generation method and device and computer system

Info

Publication number
CN111182367A
CN111182367A (application CN201911396267.6A)
Authority
CN
China
Prior art keywords
video
preset
target
classification
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911396267.6A
Other languages
Chinese (zh)
Inventor
黄敏敏
董邦发
杨现
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suning Cloud Computing Co Ltd
Original Assignee
Suning Cloud Computing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suning Cloud Computing Co Ltd filed Critical Suning Cloud Computing Co Ltd
Priority to CN201911396267.6A priority Critical patent/CN111182367A/en
Publication of CN111182367A publication Critical patent/CN111182367A/en
Priority to CA3166347A priority patent/CA3166347A1/en
Priority to PCT/CN2020/111952 priority patent/WO2021135320A1/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/812Monomedia components thereof involving advertisement data

Abstract

The application discloses a video generation method, a video generation device, and a computer system. The method comprises: receiving an initial video and a target video classification; segmenting the initial video into video segments according to a preset video segmentation method; inputting the video segments into a preset model and determining, for each video segment, the confidence of every preset video classification; determining the video segments corresponding to the target video classification according to the target video classification and those confidences; and splicing the video segments corresponding to the target video classification according to preset splicing parameters to obtain the target video. A target video meeting the requirements is thus generated automatically from the initial video, ensuring both the timeliness and the accuracy of video generation.

Description

Video generation method and device and computer system
Technical Field
The invention relates to the technical field of computer vision, in particular to a video generation method, a video generation device and a computer system.
Background
With the accelerating pace of life, consumers expect to obtain product information more intuitively. The traditional approach of displaying a commodity through a fixed set of product images can no longer satisfy an e-commerce platform's need to highlight product characteristics or help consumers make purchase decisions, and short product-display videos showing a commodity's functions or actual usage effects have become the mainstream of product promotion on major e-commerce platforms. However, the massive commodity videos uploaded by merchants and other users vary widely in quality and length and cannot directly meet the platform's publishing requirements.
In the prior art, commodity video generation methods fall into two categories: traditional manual editing and image-text-to-video conversion. In the traditional manual method, the uploaded original video is manually segmented into shots according to scene content, target material, and so on; the video segments meeting the publishing standard are then manually screened and spliced to obtain a short commodity-publishing video that meets the user's requirements.
In the image-text-to-video method, commodity display pictures provided by merchants are matted out and laid out over preset image backgrounds to form commodity pictures; template files such as video templates and background music are obtained from the platform's existing video material library, and commodity videos are generated in batches according to those template files. Although this enables large-batch generation, the style and format of the resulting videos depend entirely on the pre-configured templates, so the generated videos look alike, offer few formats, cannot visually present the actual state of the commodity to consumers, and have limited expressive power.
Disclosure of Invention
In order to solve the defects of the prior art, the invention mainly aims to provide a video generation method to realize automatic generation of a target video according to an initial video.
In order to achieve the above object, the present invention provides, in a first aspect, a method for generating a video, the method including:
receiving an initial video and a target video classification;
segmenting the initial video into video segments according to a preset video segmentation method;
inputting the video clips into a preset model, and determining the confidence of each video clip corresponding to all preset video classifications;
determining the video segments corresponding to the target video classification according to the target video classification and the confidence degrees of all preset video classifications corresponding to each video segment;
and splicing the video segments corresponding to the target video classification according to preset splicing parameters to obtain the target video.
In some embodiments, the slicing the initial video into video segments according to a preset video slicing method includes:
determining a shot boundary contained in the initial video by using a preset shot boundary detection method;
and segmenting the initial video into video segments according to the determined shot boundary.
In some embodiments, the shot boundaries include abrupt shots and gradual shots of the initial video, and the segmenting the initial video into video segments according to the determined shot boundaries comprises:
and removing the abrupt shot and the gradual shot from the initial video to obtain a video clip set, wherein the video clip set consists of the video clips left after removal.
In some embodiments, the video is composed of consecutive frames, and the determination of the abrupt shot and the gradual shot comprises:
calculating the degree of difference between each frame and its adjacent frames;
when the degree of difference exceeds a first preset threshold, judging the frame to be an abrupt-change frame, the abrupt shot consisting of consecutive abrupt-change frames;
when the degree of difference is between the first preset threshold and a second preset threshold, judging the frame to be a potential gradual-change frame;
when the number of consecutive potential gradual-change frames exceeds a third preset threshold, judging the potential gradual-change frames to be gradual-change frames, the gradual shot consisting of consecutive gradual-change frames.
In some embodiments, the inputting the video segments into a preset model, and the determining the confidence level of each video segment corresponding to all preset video classifications includes:
sampling the video clip according to a preset sampling method to obtain at least two sampling frames corresponding to the video clip;
and preprocessing the sampling frame, inputting the preprocessed sampling frame into the preset model, and obtaining the confidence coefficient of the video clip corresponding to all the preset video classifications.
In some embodiments, the inputting the preprocessed sampling frame into the preset model includes:
and extracting space-time characteristics contained in the preprocessed sampling frame, and inputting the space-time characteristics into the preset model.
In some embodiments, the preset model is a pre-trained MFnet three-dimensional convolutional neural network model.
In some embodiments, the method further includes receiving a target duration, and determining the video segments corresponding to the target video classification according to the target video classification and the confidence levels of all preset video classifications corresponding to each of the video segments includes:
and determining the video segments corresponding to the target video classification according to the target duration, the target video classification, the confidence degrees of all preset video classifications corresponding to each video segment and the duration of the video segments.
In a second aspect, an apparatus for generating a video, the apparatus comprising:
the receiving module is used for receiving the initial video and the target video classification;
the segmentation module is used for segmenting the initial video into video segments according to a preset video segmentation method;
the processing module is used for inputting the video clips into a preset model and determining the confidence of each video clip corresponding to all preset video classifications;
the matching module is used for determining the video segments corresponding to the target video classification according to the target video classification and the confidence degrees of all preset video classifications corresponding to each video segment;
and the splicing module is used for splicing the video clips corresponding to the target video classification according to preset splicing parameters to obtain the target video.
In a third aspect, the present application provides a computer system comprising:
one or more processors;
and memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
receiving an initial video and a target video classification;
segmenting the initial video into video segments according to a preset video segmentation method;
inputting the video clips into a preset model, and determining the confidence of each video clip corresponding to all preset video classifications;
determining the video segments corresponding to the target video classification according to the target video classification and the confidence degrees of all preset video classifications corresponding to each video segment;
and splicing the video segments corresponding to the target video classification according to preset splicing parameters to obtain the target video.
The invention has the following beneficial effects:
the invention discloses a video generation method, which comprises the steps of receiving an initial video and a target video classification, segmenting the initial video into video segments according to a preset video segmentation method, inputting the video segments into a preset model, obtaining confidence coefficients of all preset video classifications corresponding to each video segment, and determining the video segments corresponding to the target video classification according to the target video classification and the confidence coefficients of all the preset video classifications corresponding to each video segment; according to the preset splicing parameters, the video segments corresponding to the target video classification are spliced to obtain the target video, so that the target video meeting the requirements is generated according to the initial video, and the timeliness and the accuracy of video generation are ensured;
the invention also provides a preset shot boundary detection method for determining the shot boundary contained in the initial video; segmenting the initial video into video segments according to the determined shot boundaries, and further providing that the shot boundaries comprise abrupt shots and gradual shots of the initial video, wherein the segmenting the initial video into the video segments according to the determined shot boundaries comprises: and removing the abrupt shot and the gradual shot from the initial video to obtain a video clip set, wherein the video clip set consists of the video clips left after removal. The accuracy of video segment segmentation is ensured;
the application discloses sampling the video clip according to a preset sampling method to obtain at least two sampling frames corresponding to the video clip; preprocessing the sampling frame, inputting the preprocessed sampling frame into the preset model, and obtaining the confidence degrees of all preset video classifications corresponding to the video clips; determining the preset video classification corresponding to the confidence coefficient with the maximum value as the preset video classification corresponding to the video clip, wherein the confidence coefficient with the maximum value is the confidence coefficient of the video clip; and determining the confidence degrees of the video segments corresponding to the target video classification and the corresponding video segments according to the preset video classifications and the confidence degrees corresponding to all the video segments, thereby ensuring the accuracy of confidence degree calculation.
All products of the present invention need not have all of the above-described effects.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are obviously only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a model network structure provided in an embodiment of the present application;
fig. 2 is a flowchart of shot segmentation provided in an embodiment of the present application;
FIG. 3 is a flow chart of model training provided by an embodiment of the present application;
FIG. 4 is a flow chart of a method provided by an embodiment of the present application;
FIG. 5 is a block diagram of an apparatus according to an embodiment of the present disclosure;
fig. 6 is a computer system structure diagram provided in the embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As described in the background, the two commodity-video generation methods commonly used in the prior art each have limitations. Manual editing requires high labor cost and has low efficiency, so it cannot meet the actual demand for generating commodity videos in large batches; the image-text conversion method is more efficient, but the available video formats and styles are few and fixed, and its expressive power is limited.
To solve this technical problem, the application proposes segmenting the video uploaded by the user with a preset segmentation method to obtain video segments, classifying each video segment with a preset classification model to obtain the confidence corresponding to each segment, and then, according to the target video classification selected by the user, splicing the segments of that classification whose confidences meet preset conditions to obtain the target video. A target video meeting the requirements is thus generated from the user's uploaded video while the timeliness of video generation is ensured.
Example one
To classify the video segments obtained by segmentation, a classification model needs to be trained in advance; specifically, an MFnet three-dimensional convolutional neural network model may be used as the classification model. MFnet is a lightweight deep learning model: compared with recent deep learning models such as I3D and SlowFast, it is more compact, requires fewer floating-point operations (FLOPs), and achieves better results on the test data set.
The training process comprises:
110. importing a training data set;
the training data set may be generated by:
111. acquiring a preset number of commodity videos and creating a corresponding video folder for each video;
112. dividing the segments contained in each video into different categories according to the content presented, the categories including but not limited to commodity main-body appearance, commodity usage scene, and commodity content introduction, and manually clipping according to the divided categories;
113. establishing, in the folder corresponding to each video, a main folder for each category, labeled with that category, where each main folder contains one or more sub-video-clip folders of that category and each sub-video-clip folder stores one or more image frames of the corresponding video clip;
114. densely sampling the folder corresponding to each video and normalizing the samples into N × C × H × W, where N represents the number of frames sampled from each sub-video-clip folder, C the RGB channels of each frame, H the preset frame height, and W the preset frame width; preferably, N is at least 8 (a minimal sampling sketch follows this list).
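The following is a minimal sketch of the dense sampling and normalization of step 114, assuming each sub-video-clip folder stores the clip's frames as image files; the function name, the 224 × 224 preset size, the [0, 1] scaling, and the OpenCV/NumPy tooling are illustrative assumptions, not part of the original disclosure.

```python
import os
import cv2
import numpy as np

def load_clip_tensor(clip_dir, n_frames=8, height=224, width=224):
    """Densely sample n_frames from a sub-video-clip folder and
    normalize them into an N x C x H x W array (step 114).  The
    224 x 224 preset size and [0, 1] scaling are assumptions."""
    frame_files = sorted(os.listdir(clip_dir))
    # Uniformly spaced indices over the stored frames (dense sampling).
    idx = np.linspace(0, len(frame_files) - 1, n_frames).astype(int)
    frames = []
    for i in idx:
        img = cv2.imread(os.path.join(clip_dir, frame_files[i]))
        img = cv2.cvtColor(cv2.resize(img, (width, height)), cv2.COLOR_BGR2RGB)
        frames.append(img)
    clip = np.stack(frames).astype(np.float32) / 255.0  # N x H x W x C
    return clip.transpose(0, 3, 1, 2)                   # N x C x H x W
```

The resulting N × C × H × W array matches the sample layout described above; the simple pixel scaling stands in for whatever normalization the actual pipeline applies.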
120. Training the MFnet three-dimensional convolution neural network model by using a training data set to obtain a preset model;
Fig. 1 shows a schematic diagram of the model's network structure. It contains a 3D CNN for extracting the three-dimensional convolutional features of each sample; these features include spatio-temporal information, i.e. the motion of objects in the video stream, such as the movement trend of the commodity and changes of the background.
3D Pooling is the model's pooling layer; it pools the output of the 3D CNN and feeds the pooling result into the 3D MF-Unit layer, which performs convolutions with different kernels such as 1 × 1 × 1, 3 × 3 × 3, and 1 × 3 × 3;
Global Pool retains the main features of the input while reducing unnecessary parameters;
the FC layer is a fully connected layer that outputs a confidence for each category for each video segment.
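As an illustration only, here is a simplified PyTorch stand-in for the layer roles just described (3D CNN, 3D pooling, mixed 1 × 1 × 1 / 3 × 3 × 3 / 1 × 3 × 3 convolutions, global pooling, and a fully connected head). The actual MFnet/MF-Unit topology is not disclosed in this text, so every layer width and kernel choice below is hypothetical.

```python
import torch
import torch.nn as nn

class Simple3DClassifier(nn.Module):
    """Simplified stand-in for the described pipeline: 3D CNN ->
    3D pooling -> mixed-kernel units -> global pooling -> FC layer.
    All widths and kernel choices here are hypothetical."""
    def __init__(self, num_classes=3):
        super().__init__()
        self.stem = nn.Conv3d(3, 16, kernel_size=3, padding=1)   # "3D CNN"
        self.pool = nn.MaxPool3d(kernel_size=2)                  # "3D Pooling"
        self.mixed = nn.Sequential(                              # MF-Unit-like kernel mix
            nn.Conv3d(16, 32, kernel_size=(1, 1, 1)),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, 32, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, 64, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.ReLU(inplace=True),
        )
        self.head = nn.Linear(64, num_classes)                   # "FC layer"

    def forward(self, x):                 # x: batch x C x N x H x W
        x = self.mixed(self.pool(self.stem(x)))
        x = x.mean(dim=(2, 3, 4))         # "Global Pool" over time and space
        return torch.softmax(self.head(x), dim=1)  # per-category confidence
```

A clip tensor from the sampling sketch above would be permuted to C × N × H × W and batched before being fed in.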
Using the model, a test set of 56 commodity short videos was evaluated; the results are shown in Table 1 (the table appears only as an image in the original publication, so its contents are not reproduced here).
The model classifies samples obtained by dense sampling of a single shot. On 1119 test samples from the video data set, the classification accuracy reaches 95.92%; the single model is only 29.6 MB, and the forward inference time for a densely sampled single-shot video is 330 ms, so the model is both accurate and fast.
After the preset model is obtained, the generation of the video can be realized according to the model, as shown in fig. 2, the generation process includes:
step one, receiving an initial video input by a user;
step two, performing shot boundary detection on the initial video, segmenting the video according to a detection result, and eliminating redundant segments to obtain video segments;
as shown in fig. 3, the shot boundary detection process includes:
Firstly, each frame of the initial video is divided equally into a preset number of sub-blocks using the same preset method; a sub-histogram is then calculated for each sub-block, and the histogram difference between sub-blocks at the same position in adjacent frames is calculated from the sub-histograms, where the adjacent frames of a frame are the frames immediately before and after it. When the difference value exceeds a first preset threshold T_H for a number of sub-blocks higher than a second preset threshold, the frame is considered an abrupt-change frame, and consecutive abrupt-change frames form an abrupt shot. A frame whose difference value lies between the first preset threshold T_H and a third preset threshold T_L is considered a potential start frame; when the difference values of the following frames also lie between T_L and T_H and this lasts longer than a fourth preset threshold, those consecutive frames are regarded as gradual-change frames forming a gradual shot. The shots remaining after the gradual and abrupt shots are removed are regarded as normal shots.
To ensure the quality of the generated video, normal shots shorter than a fifth preset threshold also need to be removed, finally yielding the required set of video segments.
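Below is a compact sketch of the block-histogram difference computation underlying the detection logic above, assuming grayscale frames; the block count, bin count, and thresholds are placeholders, since the text leaves the concrete preset values (T_H, T_L, the sub-block count threshold, and the minimum shot length) unspecified.

```python
import cv2
import numpy as np

def block_hist_diffs(f1, f2, blocks=4, bins=32):
    """Histogram differences between sub-blocks at the same position of
    two adjacent grayscale frames (each frame split into blocks x blocks
    sub-blocks, as described above)."""
    h, w = f1.shape[:2]
    diffs = []
    for r in range(blocks):
        for c in range(blocks):
            sl = np.s_[r * h // blocks:(r + 1) * h // blocks,
                       c * w // blocks:(c + 1) * w // blocks]
            h1 = cv2.calcHist([f1[sl]], [0], None, [bins], [0, 256])
            h2 = cv2.calcHist([f2[sl]], [0], None, [bins], [0, 256])
            diffs.append(float(np.abs(h1 - h2).sum()))
    return diffs

def abrupt_frames(gray_frames, t_high, block_count_thr, blocks=4):
    """Flag abrupt-change frames: a frame is abrupt when the number of
    sub-blocks whose difference to the previous frame exceeds t_high is
    itself above block_count_thr.  Thresholds are assumed inputs; the
    text gives no concrete values."""
    flagged = []
    for i in range(1, len(gray_frames)):
        diffs = block_hist_diffs(gray_frames[i - 1], gray_frames[i], blocks)
        if sum(d > t_high for d in diffs) > block_count_thr:
            flagged.append(i)
    return flagged
```

Gradual-shot detection follows the same per-block differences, tracking runs of frames whose differences stay between T_L and T_H, as described in the prose above.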
step three, sampling the video clips and inputting the sampling results into the preset model to obtain the category and confidence corresponding to each video clip;
Firstly, the video clips are subjected to random dense sampling in the time order of the video.
The random dense sampling process comprises:
and (3) randomly initializing sampling points on the video clip, uniformly sampling N frames by taking the sampling points as seven points and the end of the video clip as a key point, and preprocessing the sampling frames to ensure that the sampling frames meet the input size requirement of a preset model.
And then inputting the preprocessed sampling frame into a preset model to obtain confidence degrees of all categories corresponding to the video clip containing the sampling frame.
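A small sketch of the random dense sampling step follows, under the reading that a random start point is drawn and N frame indices are then sampled uniformly from that point to the end of the clip; the function name and defaults are illustrative.

```python
import numpy as np

def random_dense_sample(clip_len, n=8, rng=None):
    """Random dense sampling sketch: draw a random start point on the
    clip, then uniformly sample n frame indices from the start point
    to the end of the clip (names and defaults are assumptions)."""
    if rng is None:
        rng = np.random.default_rng()
    start = int(rng.integers(0, max(1, clip_len - n)))
    return np.linspace(start, clip_len - 1, n).astype(int)
```

The sampled indices would then pass through the same resize/normalize preprocessing sketched in the training section before inference.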
step four, according to the target category and the target duration selected by the user, screening the video clips corresponding to the target category that will be spliced into the target video;
For example, when the user wants a video displaying the appearance of the commodity, the video clips are sorted by their confidence in the appearance-display category, and the clips meeting the requirements are screened out.
Specific screening rules may include:
When the duration T_i of the video segment with the highest confidence meets the target duration requirement, that video segment is directly taken as the target video.
When the duration T_i of the video segment with the highest confidence does not meet the target duration requirement, n video segments with durations T_j, j ∈ [1, n], are selected sequentially in descending order of confidence until the following condition is satisfied:
T_1 ≤ Σ_{j=1}^{n} T_j ≤ T_2
where the interval from T_1 to T_2 represents the target duration range.
When the duration of the n + 1 shots selected according to the confidence ranking exceeds the maximum duration T_2, the longest shot is trimmed at head and tail according to the duration of each shot until the total duration meets the target duration requirement.
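The screening rules can be read as a greedy selection by descending confidence. The sketch below follows that reading, with (confidence, duration, id) tuples and a target range [t_min, t_max] (T_1 and T_2 in the text) as assumed inputs; the head-and-tail trimming of the longest shot is left to the caller, which only receives the overshoot amount.

```python
def select_segments(segments, t_min, t_max):
    """Greedy screening sketch.  segments: list of (confidence,
    duration, segment_id) tuples for the target category.  Segments
    are added in descending confidence until the total duration
    reaches the target range; any overshoot beyond t_max is returned
    so the caller can trim the longest shot head and tail."""
    ranked = sorted(segments, key=lambda s: s[0], reverse=True)
    chosen, total = [], 0.0
    for conf, dur, seg_id in ranked:
        chosen.append((conf, dur, seg_id))
        total += dur
        if total >= t_min:
            break
    overshoot = max(0.0, total - t_max)  # amount to trim from the longest shot
    return chosen, overshoot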
step five, sequentially splicing the video clips obtained in step four in the time order of the initial video to obtain the target video.
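For the final splice, one possible realization concatenates the selected clips in the time order of the initial video, e.g. with moviepy; the library choice and output defaults are assumptions, as the text only speaks of preset splicing parameters.

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

def splice(clip_paths_in_time_order, out_path="target.mp4"):
    """Concatenate the selected segments in the time order of the
    initial video (step five).  moviepy is one possible tool; the
    codec/fps defaults here stand in for the preset splicing
    parameters."""
    clips = [VideoFileClip(p) for p in clip_paths_in_time_order]
    concatenate_videoclips(clips).write_videofile(out_path)
    for c in clips:
        c.close()
```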
The generated target video can be stored in a video database and reused when needed next time, or used for continuously training the model.
Based on the scheme provided by the application, the target video meeting the requirement can be generated according to the uploaded video of the user, and meanwhile, the timeliness of video generation is guaranteed.
Example two
Corresponding to the foregoing embodiments, the present application provides a video generation method, as shown in fig. 4, the method includes:
410. receiving an initial video and a target video classification;
420. segmenting the initial video into video segments according to a preset video segmentation method;
preferably, the method comprises:
421. determining a shot boundary contained in the initial video by using a preset shot boundary detection method;
and segmenting the initial video into video segments according to the determined shot boundary.
Preferably, the shot boundaries include abrupt shots and gradual shots of the initial video, the method comprising:
422. removing the abrupt shot and the gradual shot from the initial video to obtain a video clip set, wherein the video clip set consists of the video clips left after removal.
Preferably, the video is composed of consecutive frames, and the process of determining the abrupt shot and the gradual shot includes:
423. calculating the degree of difference between each frame and its adjacent frames;
when the degree of difference exceeds a first preset threshold, judging the frame to be an abrupt-change frame, the abrupt shot consisting of consecutive abrupt-change frames;
when the degree of difference is between the first preset threshold and a second preset threshold, judging the frame to be a potential gradual-change frame;
when the number of consecutive potential gradual-change frames exceeds a third preset threshold, judging the potential gradual-change frames to be gradual-change frames, the gradual shot consisting of consecutive gradual-change frames.
430. Inputting the video clips into a preset model, and determining the confidence of each video clip corresponding to all preset video classifications;
preferably, the method comprises:
431. sampling the video clip according to a preset sampling method to obtain at least two sampling frames corresponding to the video clip;
and preprocessing the sampling frame, inputting the preprocessed sampling frame into the preset model, and obtaining the confidence coefficient of the video clip corresponding to all the preset video classifications.
Preferably, at least eight sampling frames are obtained.
Preferably, the inputting the preprocessed sampling frame into the preset model includes:
432. and extracting space-time characteristics contained in the preprocessed sampling frame, and inputting the space-time characteristics into the preset model.
Preferably, the preset model is a pre-trained MFnet three-dimensional convolutional neural network model.
440. Determining the video segments corresponding to the target video classification according to the target video classification and the confidence degrees of all preset video classifications corresponding to each video segment;
preferably, the method further includes receiving a target duration, and determining the video segments corresponding to the target video classification according to the target video classification and confidence levels of all preset video classifications corresponding to each video segment includes:
441. and determining the video segments corresponding to the target video classification according to the target duration, the target video classification, the confidence degrees of all preset video classifications corresponding to each video segment and the duration of the video segments.
450. And splicing the video segments corresponding to the target video classification according to preset splicing parameters to obtain the target video.
EXAMPLE III
Corresponding to the above method embodiment, the present application provides a video generation apparatus, as shown in fig. 5, the apparatus includes:
a receiving module 510, configured to receive an initial video and a target video category;
a segmentation module 520, configured to segment the initial video into video segments according to a preset video segmentation method;
a processing module 530, configured to input the video segments into a preset model, and determine confidence levels of all preset video classifications corresponding to each of the video segments;
a matching module 540, configured to determine, according to the target video classification and confidence levels of all preset video classifications corresponding to each video segment, the video segment corresponding to the target video classification;
and a splicing module 550, configured to splice the video segments corresponding to the target video categories according to preset splicing parameters, so as to obtain a target video.
Preferably, the segmentation module 520 is further configured to determine a shot boundary included in the initial video by using a preset shot boundary detection method;
and segmenting the initial video into video segments according to the determined shot boundary.
Preferably, the shot boundary includes a sudden change shot and a gradual change shot of the initial video, and the segmentation module 520 may be further configured to remove the sudden change shot and the gradual change shot from the initial video to obtain a video segment set, where the video segment set is composed of the video segments remaining after the removal.
Preferably, the video is composed of consecutive frames, and the segmentation module 520 is further configured to calculate the degree of difference between each frame and its adjacent frames; when the degree of difference exceeds a first preset threshold, judge the frame to be an abrupt-change frame, the abrupt shot consisting of consecutive abrupt-change frames; when the degree of difference is between the first preset threshold and a second preset threshold, judge the frame to be a potential gradual-change frame; and when the number of consecutive potential gradual-change frames exceeds a third preset threshold, judge the potential gradual-change frames to be gradual-change frames, the gradual shot consisting of consecutive gradual-change frames.
Preferably, the processing module 530 is further configured to sample the video segment according to a preset sampling method to obtain at least two sampling frames corresponding to the video segment, preprocess the sampling frames, and input the preprocessed sampling frames into the preset model to obtain the confidence of the video clip corresponding to all the preset video classifications.
Preferably, the processing module 530 is further configured to extract the spatio-temporal features contained in the preprocessed sampling frames and input the spatio-temporal features into the preset model.
Preferably, the preset model is a pre-trained MFnet three-dimensional convolutional neural network model.
Preferably, the receiving module 510 is further configured to receive a target duration, and the matching module 540 is further configured to determine the video segment corresponding to the target video classification according to the target duration, the target video classification, confidence levels of all preset video classifications corresponding to each video segment, and the duration of the video segment.
Example four
Corresponding to the above method, apparatus, and system, a fourth embodiment of the present application provides a computer system, including: one or more processors; and memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising: receiving an initial video and a target video classification;
segmenting the initial video into video segments according to a preset video segmentation method;
inputting the video clips into a preset model, and determining the confidence of each video clip corresponding to all preset video classifications;
determining the video segments corresponding to the target video classification according to the target video classification and the confidence degrees of all preset video classifications corresponding to each video segment;
and splicing the video segments corresponding to the target video classification according to preset splicing parameters to obtain the target video.
Fig. 6 illustrates an architecture of a computer system, which may include, in particular, a processor 1510, a video display adapter 1511, a disk drive 1512, an input/output interface 1513, a network interface 1514, and a memory 1520. The processor 1510, video display adapter 1511, disk drive 1512, input/output interface 1513, network interface 1514, and memory 1520 may be communicatively coupled via a communication bus 1530.
The processor 1510 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solution provided by the present Application.
The memory 1520 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1520 may store an operating system 1521 for controlling the operation of the computer system 1500 and a Basic Input Output System (BIOS) for controlling low-level operations of the computer system 1500. In addition, a web browser 1523, a data storage management system 1524, an icon font processing system 1525, and the like can also be stored. The icon font processing system 1525 may be the application program that implements the operations of the foregoing steps in this embodiment of the application. When the technical solution provided by the present application is implemented by software or firmware, the relevant program code is stored in the memory 1520 and called for execution by the processor 1510.
The input/output interface 1513 is used for connecting an input/output module to realize information input and output. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The network interface 1514 is used to connect a communication module (not shown) to enable the device to communicatively interact with other devices. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
The bus 1530 includes a path to transfer information between the various components of the device, such as the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, and the memory 1520.
In addition, the computer system 1500 may also obtain information of specific extraction conditions from the virtual resource object extraction condition information database 1541 for performing condition judgment, and the like.
It should be noted that although the above devices only show the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, the memory 1520, the bus 1530, etc., in a specific implementation, the devices may also include other components necessary for proper operation. Furthermore, it will be understood by those skilled in the art that the apparatus described above may also include only the components necessary to implement the solution of the present application, and not necessarily all of the components shown in the figures.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, or the like, and includes several instructions for enabling a computer device (which may be a personal computer, a cloud server, or a network device) to execute the method according to the embodiments or some parts of the embodiments of the present application.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, the system or system embodiments are substantially similar to the method embodiments and therefore are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described system and system embodiments are only illustrative, wherein the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A method for generating a video, the method comprising:
receiving an initial video and a target video classification;
segmenting the initial video into video segments according to a preset video segmentation method;
inputting the video clips into a preset model, and determining the confidence of each video clip corresponding to all preset video classifications;
determining the video segments corresponding to the target video classification according to the target video classification and the confidence degrees of all preset video classifications corresponding to each video segment;
and splicing the video segments corresponding to the target video classification according to preset splicing parameters to obtain the target video.
2. The method of claim 1, wherein said segmenting the initial video into video segments according to a preset video segmentation method comprises:
determining a shot boundary contained in the initial video by using a preset shot boundary detection method;
and segmenting the initial video into video segments according to the determined shot boundary.
3. The method of claim 2, wherein the shot boundaries comprise abrupt shots and gradual shots of the initial video, and wherein the segmenting the initial video into video segments according to the determined shot boundaries comprises:
and removing the abrupt shot and the gradual shot from the initial video to obtain a video clip set, wherein the video clip set consists of the video clips left after removal.
4. The method according to claim 3, wherein the video is composed of consecutive frames, and the determining process of the abrupt shot and the gradual shot comprises:
calculating the degree of difference between each frame and its adjacent frames;
when the degree of difference exceeds a first preset threshold, judging the frame to be an abrupt-change frame, wherein the abrupt shot consists of consecutive abrupt-change frames;
when the degree of difference is between the first preset threshold and a second preset threshold, judging the frame to be a potential gradual-change frame;
when the number of consecutive potential gradual-change frames exceeds a third preset threshold, judging the potential gradual-change frames to be gradual-change frames, wherein the gradual shot consists of consecutive gradual-change frames.
5. The method according to any one of claims 1-4, wherein said inputting said video segments into a predetermined model and said determining confidence level of each of said video segments for all predetermined video categories comprises:
sampling the video clip according to a preset sampling method to obtain at least two sampling frames corresponding to the video clip;
and preprocessing the sampling frame, inputting the preprocessed sampling frame into the preset model, and obtaining the confidence coefficient of the video clip corresponding to all the preset video classifications.
6. The method of claim 5, wherein the inputting the preprocessed sample frame into the preset model comprises:
and extracting space-time characteristics contained in the preprocessed sampling frame, and inputting the space-time characteristics into the preset model.
7. The method according to any one of claims 1-4, wherein the predetermined model is a pre-trained MFnet three-dimensional convolutional neural network model.
8. The method according to any one of claims 1-4, wherein the method further comprises receiving a target duration, and wherein determining the video segment corresponding to the target video classification according to the target video classification and the confidence level of each of the video segments corresponding to all of the preset video classifications comprises:
and determining the video segments corresponding to the target video classification according to the target duration, the target video classification, the confidence degrees of all preset video classifications corresponding to each video segment and the duration of the video segments.
9. An apparatus for generating a video, the apparatus comprising:
the receiving module is used for receiving the initial video and the target video classification;
the segmentation module is used for segmenting the initial video into video segments according to a preset video segmentation method;
the processing module is used for inputting the video clips into a preset model and determining the confidence of each video clip corresponding to all preset video classifications;
the matching module is used for determining the video segments corresponding to the target video classification according to the target video classification and the confidence degrees of all preset video classifications corresponding to each video segment;
and the splicing module is used for splicing the video clips corresponding to the target video classification according to preset splicing parameters to obtain the target video.
10. A computer system, the system comprising:
one or more processors;
and memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
receiving an initial video and a target video classification;
segmenting the initial video into video segments according to a preset video segmentation method;
inputting the video clips into a preset model, and determining the confidence of each video clip corresponding to all preset video classifications;
determining the video segments corresponding to the target video classification according to the target video classification and the confidence degrees of all preset video classifications corresponding to each video segment;
and splicing the video segments corresponding to the target video classification according to preset splicing parameters to obtain the target video.
CN201911396267.6A 2019-12-30 2019-12-30 Video generation method and device and computer system Pending CN111182367A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201911396267.6A CN111182367A (en) 2019-12-30 2019-12-30 Video generation method and device and computer system
CA3166347A CA3166347A1 (en) 2019-12-30 2020-08-28 Video generation method and apparatus, and computer system
PCT/CN2020/111952 WO2021135320A1 (en) 2019-12-30 2020-08-28 Video generation method and apparatus, and computer system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911396267.6A CN111182367A (en) 2019-12-30 2019-12-30 Video generation method and device and computer system

Publications (1)

Publication Number Publication Date
CN111182367A true CN111182367A (en) 2020-05-19

Family

ID=70657587

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911396267.6A Pending CN111182367A (en) 2019-12-30 2019-12-30 Video generation method and device and computer system

Country Status (3)

Country Link
CN (1) CN111182367A (en)
CA (1) CA3166347A1 (en)
WO (1) WO2021135320A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110545462A (en) * 2018-05-29 2019-12-06 优酷网络技术(北京)有限公司 video processing method and device
CN112132931A (en) * 2020-09-29 2020-12-25 新华智云科技有限公司 Processing method, device and system for templated video synthesis
CN112632326A (en) * 2020-12-24 2021-04-09 北京风平科技有限公司 Video production method and device based on video script semantic recognition
WO2021120685A1 (en) * 2019-12-20 2021-06-24 苏宁云计算有限公司 Video generation method and apparatus, and computer system
CN113676671A (en) * 2021-09-27 2021-11-19 北京达佳互联信息技术有限公司 Video editing method and device, electronic equipment and storage medium
CN115460446A (en) * 2022-08-19 2022-12-09 上海爱奇艺新媒体科技有限公司 Alignment method and device for multiple paths of video signals and electronic equipment

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115442660B (en) * 2022-08-31 2023-05-19 杭州影象官科技有限公司 Self-supervision countermeasure video abstract extraction method, device, equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101013444A (en) * 2007-02-13 2007-08-08 华为技术有限公司 Method and apparatus for adaptively generating abstract of football video
CN101252646A (en) * 2008-01-24 2008-08-27 王志远 Method for realizing video frequency propaganda film modularization making
CN109121021A (en) * 2018-09-28 2019-01-01 北京周同科技有限公司 A kind of generation method of Video Roundup, device, electronic equipment and storage medium
US20190052701A1 (en) * 2013-09-15 2019-02-14 Yogesh Rathod System, method and platform for user content sharing with location-based external content integration
CN109657100A (en) * 2019-01-25 2019-04-19 深圳市商汤科技有限公司 Video Roundup generation method and device, electronic equipment and storage medium
CN110232357A (en) * 2019-06-17 2019-09-13 深圳航天科技创新研究院 A kind of video lens dividing method and system
US20190286749A1 (en) * 2018-03-15 2019-09-19 Microsoft Technology Licensing, Llc Query interpolation in computer text input
CN110602526A (en) * 2019-09-11 2019-12-20 腾讯科技(深圳)有限公司 Video processing method, video processing device, computer equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2216781C2 (en) * 2001-06-29 2003-11-20 Самсунг Электроникс Ко., Лтд Image-based method for presenting and visualizing three-dimensional object and method for presenting and visualizing animated object
US9189884B2 (en) * 2012-11-13 2015-11-17 Google Inc. Using video to encode assets for swivel/360-degree spinners
RU2586566C1 (en) * 2015-03-25 2016-06-10 Общество с ограниченной ответственностью "Лаборатория 24" Method of displaying object
US10147226B1 (en) * 2016-03-08 2018-12-04 Pixelworks, Inc. 2D motion vectors from 3D model data
CN107767432A (en) * 2017-09-26 2018-03-06 盐城师范学院 A kind of real estate promotional system using three dimensional virtual technique
CN110312117B (en) * 2019-06-12 2021-06-18 北京达佳互联信息技术有限公司 Data refreshing method and device
CN111161392B (en) * 2019-12-20 2022-12-16 苏宁云计算有限公司 Video generation method and device and computer system

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101013444A (en) * 2007-02-13 2007-08-08 华为技术有限公司 Method and apparatus for adaptively generating abstract of football video
CN101252646A (en) * 2008-01-24 2008-08-27 王志远 Method for realizing video frequency propaganda film modularization making
US20190052701A1 (en) * 2013-09-15 2019-02-14 Yogesh Rathod System, method and platform for user content sharing with location-based external content integration
US20190286749A1 (en) * 2018-03-15 2019-09-19 Microsoft Technology Licensing, Llc Query interpolation in computer text input
CN109121021A (en) * 2018-09-28 2019-01-01 北京周同科技有限公司 A kind of generation method of Video Roundup, device, electronic equipment and storage medium
CN109657100A (en) * 2019-01-25 2019-04-19 深圳市商汤科技有限公司 Video Roundup generation method and device, electronic equipment and storage medium
CN110232357A (en) * 2019-06-17 2019-09-13 深圳航天科技创新研究院 A kind of video lens dividing method and system
CN110602526A (en) * 2019-09-11 2019-12-20 腾讯科技(深圳)有限公司 Video processing method, video processing device, computer equipment and storage medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110545462A (en) * 2018-05-29 2019-12-06 优酷网络技术(北京)有限公司 video processing method and device
WO2021120685A1 (en) * 2019-12-20 2021-06-24 苏宁云计算有限公司 Video generation method and apparatus, and computer system
CN112132931A (en) * 2020-09-29 2020-12-25 新华智云科技有限公司 Processing method, device and system for templated video synthesis
CN112132931B (en) * 2020-09-29 2023-12-19 新华智云科技有限公司 Processing method, device and system for templated video synthesis
CN112632326A (en) * 2020-12-24 2021-04-09 北京风平科技有限公司 Video production method and device based on video script semantic recognition
CN112632326B (en) * 2020-12-24 2022-02-18 北京风平科技有限公司 Video production method and device based on video script semantic recognition
CN113676671A (en) * 2021-09-27 2021-11-19 北京达佳互联信息技术有限公司 Video editing method and device, electronic equipment and storage medium
CN115460446A (en) * 2022-08-19 2022-12-09 上海爱奇艺新媒体科技有限公司 Alignment method and device for multiple paths of video signals and electronic equipment

Also Published As

Publication number Publication date
CA3166347A1 (en) 2021-07-08
WO2021135320A1 (en) 2021-07-08

Similar Documents

Publication Publication Date Title
CN111182367A (en) Video generation method and device and computer system
CN109688463B (en) Clip video generation method and device, terminal equipment and storage medium
US10657652B2 (en) Image matting using deep learning
WO2021120685A1 (en) Video generation method and apparatus, and computer system
CN109977983B (en) Method and device for obtaining training image
CN111950723A (en) Neural network model training method, image processing method, device and terminal equipment
JP7394809B2 (en) Methods, devices, electronic devices, media and computer programs for processing video
CN111680678B (en) Target area identification method, device, equipment and readable storage medium
CN111432206A (en) Video definition processing method and device based on artificial intelligence and electronic equipment
CN111062854A (en) Method, device, terminal and storage medium for detecting watermark
CN111814913A (en) Training method and device for image classification model, electronic equipment and storage medium
CN109241930B (en) Method and apparatus for processing eyebrow image
CN111461211A (en) Feature extraction method for lightweight target detection and corresponding detection method
CN111144215A (en) Image processing method, image processing device, electronic equipment and storage medium
CN112839185B (en) Method, apparatus, device and medium for processing image
CN111597845A (en) Two-dimensional code detection method, device and equipment and readable storage medium
CN112215221A (en) Automatic vehicle frame number identification method
CN114399497A (en) Text image quality detection method and device, computer equipment and storage medium
CN113987264A (en) Video abstract generation method, device, equipment, system and medium
CN114360053A (en) Action recognition method, terminal and storage medium
CN115147434A (en) Image processing method, device, terminal equipment and computer readable storage medium
CN112449249A (en) Video stream processing method and device, electronic equipment and storage medium
CN113971627A (en) License plate picture generation method and device
CN112749660A (en) Method and equipment for generating video content description information
CN112528897B (en) Portrait age estimation method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20200519