WO2024099171A1 - Video generation method and apparatus - Google Patents

Video generation method and apparatus

Info

Publication number
WO2024099171A1
WO2024099171A1 (PCT/CN2023/128301)
Authority
WO
WIPO (PCT)
Prior art keywords
video
processing
editing result
clips
result
Prior art date
Application number
PCT/CN2023/128301
Other languages
English (en)
French (fr)
Inventor
卢杨
牟俊舟
郭士伟
吕晶晶
Original Assignee
北京沃东天骏信息技术有限公司
北京京东世纪贸易有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京沃东天骏信息技术有限公司 and 北京京东世纪贸易有限公司
Publication of WO2024099171A1

Links

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects

Definitions

  • Embodiments of the present disclosure relate to the field of computer technology, and in particular to a video generation method and device.
  • Video editing generally refers to the use of various applications or tools to perform non-linear editing on videos, such as cutting and merging videos, or adding pictures, background music, special effects, scenes and other materials to videos to generate new videos with different expressiveness.
  • With the all-round development of the multimedia industry, video is becoming more and more common as a main form of expression in various fields, such as short video platforms, product promotion, knowledge popularization, travel photography sharing, etc.
  • In some scenarios, users expect to edit a specified video to form a new video with a shorter duration.
  • For example, presenting summary videos at specified locations on some pages allows users to quickly determine whether they are interested, and to browse the full video if they are.
  • For another example, an e-commerce platform can display on a product page a short video that highlights the features of the product, so that users can quickly understand the product.
  • the embodiments of the present disclosure provide a video generation method and device.
  • the present disclosure provides a video generation method, the method comprising: obtaining at least two video segments obtained by segmenting a video to be edited; processing the at least two video segments using a pre-trained video processing model to obtain a processing result, wherein the processing result indicates the probability that each video segment belongs to the video editing result,
  • the training samples of the video processing model are obtained through the following steps: obtaining a video editing result set corresponding to the original video, determining the effect index value of each video editing result in the video editing result set, and generating the training samples of the video processing model according to the effect index value; according to the processing result, selecting a video clip from at least two video clips to generate a video editing result.
  • the present disclosure provides a video generating device, which includes: an acquiring unit, configured to acquire at least two video segments obtained by segmenting a video to be edited; a processing unit, configured to process the at least two video segments using a pre-trained video processing model to obtain a processing result, wherein the processing result represents the probability that each video segment belongs to a video editing result, and the training sample of the video processing model is obtained by the following steps: acquiring a video editing result set corresponding to the original video, respectively determining an effect index value of each video editing result in the video editing result set, and generating a training sample of the video processing model according to the effect index value; and a generating unit, configured to select a video segment from at least two video segments according to the processing result to generate a video editing result.
  • the present disclosure provides an electronic device, comprising: one or more processors; and a storage device for storing one or more programs; when the one or more programs are executed by one or more processors, the one or more processors implement a method described in any implementation manner in any of the above embodiments.
  • the present disclosure provides a non-transitory computer-readable medium having a computer program stored thereon.
  • the computer program is executed by a processor, the method described in any implementation manner in any of the above embodiments is implemented.
  • an embodiment of the present disclosure provides a computer program product, including a computer program, which implements the method described in any implementation manner in any of the above embodiments when executed by a processor.
  • FIG. 1 is an exemplary system architecture diagram in which an embodiment of the present disclosure may be applied
  • FIG. 2 is a flow chart of an embodiment of a video generation method according to the present disclosure.
  • FIG. 3 is a flow chart of an embodiment of generating training samples for a video processing model
  • FIG. 4 is a schematic diagram of the network structure of the video processing model
  • FIG. 5 is a schematic structural diagram of an embodiment of a video generating device according to the present disclosure.
  • FIG. 6 is a schematic diagram of the structure of an electronic device suitable for implementing the embodiments of the present disclosure.
  • FIG. 1 shows an exemplary architecture 100 to which an embodiment of a video generating method or a video generating apparatus of the present disclosure can be applied.
  • system architecture 100 may include terminal devices 101, 102, 103, network 104 and server 105.
  • Network 104 is used to provide a medium for communication links between terminal devices 101, 102, 103 and server 105.
  • Network 104 may include various connection types, such as wired, wireless communication links or optical fiber cables, etc.
  • the terminal devices 101, 102, 103 interact with the server 105 through the network 104 to receive or send messages, etc.
  • Various client applications may be installed on the terminal devices 101, 102, 103. For example, browser applications, search applications, shopping applications, social platforms, video processing applications, instant messaging tools, etc.
  • Terminal devices 101, 102, 103 can be hardware or software.
  • When the terminal devices 101, 102, 103 are hardware, they can be various electronic devices, including but not limited to smartphones, tablet computers, e-book readers, laptop computers, desktop computers, etc.
  • When the terminal devices 101, 102, 103 are software, they can be installed in the electronic devices listed above. They can be implemented as multiple software programs or software modules (for example, multiple software programs or modules for providing distributed services), or as a single software program or module. No specific limitation is made here.
  • the server 105 may be a server that provides various services, such as a server that provides backend support for client applications installed on the terminal devices 101, 102, and 103.
  • the server may segment the video to be edited sent by the terminal devices 101, 102, and 103, and use a pre-trained video processing model to process at least two video segments obtained by segmentation to obtain a processing result, and then select a video segment from the at least two video segments according to the processing result to generate a video editing result of the video to be edited.
  • the above-mentioned video to be edited can also be directly stored locally on the server 105.
  • the server 105 can directly extract the locally stored video to be edited and process it.
  • the terminal devices 101, 102, 103 and the network 104 may not exist.
  • the video generating method provided in the embodiments of the present disclosure is generally executed by the server 105 , and accordingly, the video generating device is generally disposed in the server 105 .
  • a video processing application may also be installed in the terminal devices 101, 102, and 103, and the terminal devices 101, 102, and 103 may also process the video to be edited based on the video processing application.
  • the video generation method may also be executed by the terminal devices 101, 102, and 103, and accordingly, the video generation device may also be set in the terminal devices 101, 102, and 103.
  • the exemplary system architecture 100 may not have the server 105 and the network 104.
  • the server 105 can be hardware or software.
  • the server 105 can be implemented as a distributed server cluster consisting of multiple servers, or it can be implemented as a single server.
  • the server 105 is software, it can be implemented as multiple software or software modules (for example, multiple software or software modules for providing distributed services), or it can be implemented as a single software or software module. No specific limitation is made here.
  • the number of terminal devices, networks and servers in Figure 1 is only illustrative. Any number of terminal devices, networks and servers may be provided according to implementation requirements.
  • FIG. 2 it shows a process 200 of an embodiment of a video generation method according to the present disclosure.
  • the video generation method comprises the following steps:
  • Step 201 Obtain at least two video segments obtained by segmenting the video to be edited.
  • the video to be edited can be any type of video, which can be determined according to the actual application scenario.
  • the video to be edited can be an introduction video of a product.
  • the video to be edited may be a video recording of a certain event.
  • the video to be edited is usually a video that is expected to be edited so as to obtain, as the video editing result, a video with a shorter duration than the video to be edited itself.
  • various video segmentation methods can be used to segment the video to be edited according to actual application requirements to obtain at least two video segments, that is, multiple video segments.
  • the video to be edited can be segmented at equal intervals to obtain multiple video segments.
  • the video to be edited can be segmented into multiple video segments based on the video content (such as the continuity and relevance of the content).
  • the specific segmentation implementation can be achieved using various existing video editing applications or tools.
  • the durations of the video segments obtained by segmentation can be the same or different.
  • the content of the video segments obtained by segmentation belongs to the video to be edited.
  • the execution subject of the video generation method can obtain the at least two video clips from various data sources such as a local data source, a connected database, or a third-party data platform. It should be noted that the execution subject that divides the video to be edited to obtain the at least two video clips can be the same as the execution subject of the video generation method, or can be different.
  • Step 202 Use a pre-trained video processing model to process at least two video clips to obtain a processing result.
  • the processing result may represent the probability that each of the at least two video segments belongs to the video editing result, that is, the probability that the expected video editing result includes the content of each video segment.
  • the greater the corresponding probability, the more likely the video editing result is to include the content of the video segment.
  • the input of the video processing model may be at least two video clips, and the output may be a processing result indicating the probability that each video clip belongs to a video editing result.
  • the video processing model may be a neural network model of various types, and the specific network structure may be flexibly set by a technician.
  • the video processing model may be trained in advance using training samples based on methods such as back propagation and gradient descent.
  • the training samples of the video processing model can be obtained through the following steps:
  • Step 1 Obtain the video editing result set corresponding to the original video.
  • the original video can be any video.
  • the video editing result set corresponding to the original video can be obtained in various ways. For example, various existing video editing methods can be used to edit the original video according to application requirements (such as the duration requirement of the video editing result) to obtain multiple video editing results. For another example, the original video can be segmented at equal intervals, and each video segment obtained by the segmentation can be used as a video editing result.
  • the subject that obtains the video editing result set may be the same as or different from the subject that executes the video generation method.
  • the subject that obtains the video editing result set may obtain the video editing result set corresponding to the original video from a local or other data source.
  • Step 2 Respectively determine the effect index value of each video editing result in the video editing result set.
  • the effect index may refer to the effect or goal that is expected to be achieved.
  • the effect index of the video editing result may refer to the effect or optimization goal that is expected to be achieved by the video editing result.
  • the effect index may be flexibly set according to the actual application requirements.
  • the effect index may be a click-through rate, a completion rate, a conversion rate, etc.
  • the effect index value is the specific value of the effect index.
  • the effect index value of each video editing result can be determined by various methods according to the actual application scenario. For example, the effect index value of each video editing result can be predicted by a preset prediction method. For another example, each video editing result can be used online (such as online delivery, etc.), and then the effect index value of each video editing result can be obtained by statistics or other methods.
  • Step 3 Generate training samples for the video processing model based on the effect index value.
  • for example, the input of the video processing model may be multiple video clips,
  • and the output may be a ranking result among the video clips.
  • the ranking result can be arranged in descending or ascending order of the probability that the corresponding video clips belong to the video editing result.
  • in this case, the video editing results can be sorted in descending order of effect index value to obtain the ranking result, and then each video editing result and the corresponding ranking result can be used as a training sample.
  • the probability that a video clip belongs to the video editing result is positively correlated with the effect index value of the video; that is, the larger the effect index value of a video clip, the higher the probability that the video clip belongs to the video editing result.
  • multiple original videos can be obtained, and the above steps are used to process each original video to obtain multiple training samples.
  • Step 203 According to the processing result, a video segment is selected from at least two video segments to generate a video editing result.
  • after obtaining the processing result output by the video processing model, a video segment can be selected from the at least two video segments, and a video editing result can be generated based on the selected video segment.
  • various selection methods can be used to select the video segment according to the actual application scenario, and various generation methods can be used to generate the video editing result based on the video segment.
  • the video clip with the highest corresponding probability can be selected from the at least two video clips, and the selected video clip can be directly used as the video editing result.
  • if the ranking result is formed by arranging the video clips in descending order of the corresponding probabilities,
  • the first-ranked video clip is selected as the video editing result.
  • correspondingly, if the ranking result is formed in ascending order of the corresponding probabilities,
  • the last-ranked video clip can be selected as the video editing result.
  • based on the expected effect of the video editing result, the feedback of online effect index values is used to construct training samples to obtain a video processing model; the video processing model is then used to process the video to be edited, and the video editing result is generated according to the processing result.
  • the actual effect of video editing results determined by existing video editing methods, for example in terms of image quality, content diversity and representativeness, is unstable.
  • the video generation method provided by the present disclosure proposes to start directly from online effect indexes and use the feedback of the effect indexes to construct the video processing model, so as to generate the video editing result with the video processing model. This makes the video editing result better match the expected effect; moreover, online effect indexes can reflect users' interests to a certain extent, so that the generated video editing result can match users' preferences and improve the user experience.
  • FIG3 shows a flow chart of an embodiment of generating training samples of a video processing model. Specifically, the following steps are included:
  • Step 301 Obtain a video editing result set corresponding to the original video.
  • Step 302 Respectively determine the effect index value of each video editing result in the video editing result set.
  • Step 303 Select a video editing result whose corresponding effect index value meets a preset condition from the video editing result set.
  • the preset condition can be flexibly set by the technician according to the actual application requirements.
  • the preset condition can be that the effect index value is greater than a preset effect index value threshold.
  • the preset condition can be that the effect index value is the maximum.
  • the video editing result with the largest corresponding effect index value can be selected from the video editing results in the video editing result set corresponding to the original video.
  • Step 304 Determine the time period of the selected video editing result in the original video as the target time period.
  • the time period of the video editing result in the original video is the time period composed of the time points at which the video editing result appears in the original video.
  • for example, when the video editing result is a continuous video segment, the time period from its start time point to its end time point in the original video can be regarded as the target time period.
  • Step 305 Divide the original video into at least two original video segments, and determine an annotation for each original video segment.
  • the original video may be segmented in various ways to obtain at least two original video segments, i.e., multiple original video segments.
  • the segments may be segmented at equal intervals.
  • generally, the duration of an original video segment is not greater than the duration of the video editing results in the video editing result set of step 301 above.
  • the annotation of each original video segment can indicate whether the time period of the original video segment in the original video belongs to the target time period.
  • a Boolean value can be used to represent the annotation. As an example, "1" indicates that the time period of the original video segment in the original video belongs to the above target time period, and "0" indicates that it does not.
  • Step 306 Determine at least two original video clips and the annotations corresponding to each original video clip as training samples for the video processing model.
  • for an original video, the at least two segments corresponding to the original video and the annotations corresponding to each of those video segments can be used as training samples of the video processing model.
  • multiple original videos can be used to obtain multiple training samples.
  • the video processing model can be trained based on the machine learning method using the multiple training samples.
  • the video processing model can be obtained by training through the following steps: obtaining an initial model, wherein the initial model can include an initial video processing model and an initial discriminant model, wherein the initial video processing model can be various types of neural network models (such as deep learning models, etc.), and its input can be multiple video clips, and the output can be the probability that each input video clip belongs to the video editing result.
  • the initial discriminant model can be various types of discriminant models (such as binary classifiers). Its input can be the probabilities, output by the initial video processing model, that each video clip belongs to the video editing result, and its output can be a binary classification result indicating whether each video clip belongs to the video editing result: one category indicates that the video clip belongs to the video editing result, and the other indicates that it does not. The binary classification result here corresponds to the annotation of the above video clip.
  • the above training samples can be used to train the initial model using back propagation and gradient descent algorithms based on a preset loss function (such as a loss function designed based on KL divergence, etc.) to obtain a trained initial model.
  • the initial video processing model included in the trained initial model can be determined as the trained video processing model.
  • the video processing model may include a first feature extraction model, a second feature extraction model, and a generation model.
  • the first feature extraction model may be used to extract features of video clips.
  • the second feature extraction model may determine the temporal relationship features between the video clips based on the features of the video clips respectively extracted by the first feature extraction model, and the generation model may generate the above processing results based on the temporal relationship features between the video clips extracted by the second feature extraction model.
  • the first feature extraction model can first be used to extract the feature vector of each of the at least two video segments.
  • the feature vectors corresponding to the at least two video segments are then input into the second feature extraction model to obtain, for each video clip, a feature vector representing the temporal relationship features between the video clips.
  • the feature vectors corresponding to each video clip output by the second feature extraction model are then input into the generation model to obtain the processing results.
  • the network structures of the first feature extraction model, the second feature extraction model and the generation model can be flexibly set by technical personnel according to actual application requirements.
  • the first feature extraction model can be constructed based on a C3D network (Convolutional 3D Network) or a C2D network (Convolutional 2D Network) combined with LSTM (Long Short-Term Memory), which can extract the features of the input video clip.
  • the second feature extraction model can be constructed based on a Transformer model.
  • the generation model can be constructed based on an MLP (Multilayer Perceptron).
  • the C3D model can be used to model video sequences with good feature expression capabilities.
  • Transformer and other methods can be used for temporal modeling to learn the features of the contextual video clips of each video clip.
  • MLP and other methods can be used to map the features to the probability that each video clip belongs to the video editing result as the processing result, thereby achieving good modeling and processing of various videos (such as long videos, etc.), which in turn helps to improve the accuracy of the processing results.
  • video clips may be selected and combined in descending order of corresponding probabilities to obtain a video editing result.
  • the number of selected video clips can be flexibly set according to actual application requirements. For example, it can be determined based on the duration of the desired video editing result and the duration of each video clip to ensure that the total duration of the selected video clips is not greater than the duration of the desired video editing result.
  • the selected video clips can be combined in various ways to obtain the video editing result. For example, the selected video clips are combined in a random order. For another example, the selected video clips can be combined in sequence according to the order of the time periods of the selected video clips in the video to be edited.
  • a sliding window of preset duration may be used to select a group of video clips in the sliding window from the original video and combine them to obtain a video editing result.
  • the duration (or size) of the sliding window can be flexibly set.
  • the sum of the probabilities corresponding to the respective video clips in the selected set of video clips can meet the preset conditions.
  • the preset conditions can be preset by a technician.
  • the preset condition can be greater than a preset threshold.
  • the preset condition can be the maximum sum.
  • This not only ensures that the video clips that make up the video editing result have a high probability, but also ensures the continuity of their content, which helps to improve the fluency of the video editing result. It can be applied in scenarios of generating introductory videos (such as product introductions and promotional videos).
  • the video generation method uses the effect index values corresponding to the video editing results corresponding to the original video to guide the generation of training samples, so as to pre-train a video processing model, and then use the video processing model to process the video segments obtained after the video to be edited is segmented, to obtain the probability that each video segment belongs to the video editing result, and then select video segments from the video segments based on this to generate the video editing result of the video to be edited. Since the training samples are generated by using the effect index, it can ensure to a certain extent that the video editing result generated by the processing result of the trained video processing model can meet the expected effect, thereby improving the quality of the video editing result.
  • the present disclosure provides an embodiment of a video generating device, which corresponds to the method embodiment shown in FIG. 2 , and can be specifically applied to various electronic devices.
  • the video generation device 500 includes an acquisition unit 501, a processing unit 502 and a generation unit 503.
  • the acquisition unit 501 is configured to acquire at least two video segments obtained by segmenting the video to be edited;
  • the processing unit 502 is configured to process the at least two video segments using a pre-trained video processing model to obtain a processing result, wherein the processing result indicates the probability that each video segment belongs to a video editing result, and the training sample of the video processing model is obtained by the following steps: acquiring a video editing result set corresponding to the original video, determining the effect index value of each video editing result in the video editing result set, and generating a training sample of the video processing model according to the effect index value;
  • the generating unit 503 is configured to select a video segment from at least two video segments to generate a video editing result according to the processing result.
  • the specific processing of the acquisition unit 501, the processing unit 502 and the generating unit 503 and the technical effects brought about by them can be respectively referred to the relevant descriptions of step 201, step 202 and step 203 in the corresponding embodiment of Figure 2, and will not be repeated here.
  • the above steps also include: selecting video editing results whose corresponding effect index values meet preset conditions from the video editing result set; determining the time period of the selected video editing results in the original video as the target time period; dividing the original video into at least two original video segments, and determining a label for each original video segment, the label of the original video segment indicating whether the time period of the original video segment in the original video belongs to the target time period; determining at least two original video segments and the labels corresponding to each original video segment as training samples of the video processing model.
  • the video processing model includes a first feature extraction model, a second feature extraction model and a generation model; and the processing unit 502 is further configured to: use the first feature extraction model to respectively extract features of each video clip in at least two video clips, and obtain first feature vectors corresponding to each video clip; use the second feature extraction model to extract temporal relationship features between each video clip, and obtain second feature vectors corresponding to each video clip; and use the generation model to generate processing results according to the second feature vectors corresponding to each video clip.
  • the generation unit 503 is further configured to select video clips in descending order of corresponding probabilities to combine and obtain a video editing result.
  • the generation unit 503 is further configured to utilize a sliding window of preset duration to select a group of video clips in the sliding window from the original video and combine them to obtain a video editing result, wherein the sum of the probabilities corresponding to each video clip in the selected group of video clips meets a preset condition.
  • the apparatus obtains, through an acquisition unit, at least two video segments obtained by segmenting the video to be edited; the processing unit processes the at least two video segments using a pre-trained video processing model to obtain a processing result, wherein the processing result represents the probability that each video clip belongs to the video editing result, and the training samples of the video processing model are obtained through the following steps: obtaining the video editing result set corresponding to the original video, respectively determining the effect index value of each video editing result in the video editing result set, and generating the training samples of the video processing model according to the effect index values; the generation unit selects video clips from the at least two video clips according to the processing result to generate the video editing result. Since the training samples are generated using the effect index, it can be ensured to a certain extent that the video editing result generated from the processing result of the trained video processing model meets the expected effect, thereby improving the quality of the video editing result.
  • the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
  • Referring to FIG. 6, it shows a schematic diagram of the structure of an electronic device 600 (such as the server in FIG. 1) suitable for implementing embodiments of the present disclosure.
  • the server shown in Figure 6 is only an example and should not bring any limitation to the functions and scope of use of the embodiments of the present disclosure.
  • the electronic device 600 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 601, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 to a random access memory (RAM) 603.
  • various programs and data required for the operation of the electronic device 600 are also stored.
  • the processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604.
  • An input/output (I/O) interface 605 is also connected to the bus 604.
  • the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; output devices 607 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; storage devices 608 including, for example, a magnetic tape, a hard disk, etc.; and communication devices 609.
  • the communication device 609 may allow the electronic device 600 to communicate wirelessly or wired with other devices to exchange data.
  • Although FIG. 6 shows an electronic device 600 with various devices, it should be understood that it is not required to implement or possess all the devices shown; more or fewer devices may alternatively be implemented or possessed. Each box shown in FIG. 6 may represent one device, or may represent multiple devices as needed.
  • an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a computer-readable medium, and the computer program contains program code for executing the method shown in the flowchart.
  • the computer program can be downloaded and installed from the network through the communication device 609, or installed from the storage device 608, or installed from the ROM 602.
  • when the computer program is executed by the processing device 601, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.
  • the computer-readable medium described in the embodiments of the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the above two.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device or device, or any combination of the above.
  • Computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in combination with an instruction execution system, device or device.
  • the computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, in which a computer-readable program code is carried.
  • This propagated data signal may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • the computer readable signal medium may also be any computer readable medium other than a computer readable storage medium, which may send, propagate or transmit a program for use by or in conjunction with an instruction execution system, apparatus or device.
  • the program code contained on the computer readable medium may be transmitted using any suitable medium, including but not limited to: wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
  • the computer readable medium may be included in the electronic device; or it may exist independently without being installed in the electronic device.
  • the computer readable medium carries one or more programs. When the one or more programs are executed by the electronic device, the electronic device is caused to: obtain at least two video segments obtained by segmenting a video to be edited; process the at least two video segments using a pre-trained video processing model to obtain a processing result, wherein the processing result represents the probability that each video segment belongs to a video editing result, and the training samples of the video processing model are obtained through the following steps: obtaining a video editing result set corresponding to the original video, respectively determining the effect index value of each video editing result in the video editing result set, and generating the training samples of the video processing model according to the effect index values; and select video segments from the at least two video segments according to the processing result to generate a video editing result.
  • Computer program code for performing the operations of embodiments of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages, such as Java, Smalltalk, C++, and conventional procedural programming languages, such as "C" or similar programming languages.
  • the program code may be executed entirely on a user's computer, partially on a user's computer, as a separate software package, partially on a user's computer and partially on a remote computer, or entirely on a remote computer or server.
  • the remote computer may be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., via the Internet using an Internet service provider).
  • each square box in the flow chart or block diagram can represent a module, a program segment or a part of a code, and the module, the program segment or a part of the code contains one or more executable instructions for realizing the specified logical function.
  • the functions marked in the square box can also occur in a sequence different from that marked in the accompanying drawings. For example, two square boxes represented in succession can actually be executed substantially in parallel, and they can sometimes be executed in the opposite order, depending on the functions involved.
  • each square box in the block diagram and/or flow chart, and the combination of the square boxes in the block diagram and/or flow chart can be implemented with a dedicated hardware-based system that performs a specified function or operation, or can be implemented with a combination of dedicated hardware and computer instructions.
  • a processor includes an acquisition unit, a processing unit and a generation unit.
  • the names of these units do not constitute a limitation on the unit itself in some cases.
  • the acquisition unit can also be described as "a unit for acquiring at least two video segments obtained by segmenting the video to be edited".

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

Embodiments of the present disclosure disclose a video generation method and apparatus. A specific implementation of the method comprises: obtaining at least two video segments obtained by segmenting a video to be edited; processing the at least two video segments using a pre-trained video processing model to obtain a processing result, wherein the processing result represents the probability that each video segment belongs to a video editing result, and training samples of the video processing model are obtained through the following steps: obtaining a video editing result set corresponding to an original video, separately determining an effect index value of each video editing result in the video editing result set, and generating the training samples of the video processing model according to the effect index values; and selecting video segments from the at least two video segments according to the processing result to generate a video editing result.

Description

Video generation method and apparatus
This patent application claims priority to Chinese patent application No. 202211389074.X, filed on November 8, 2022 and entitled "Video Generation Method and Apparatus", the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, and in particular to a video generation method and apparatus.
Background
Video editing generally refers to performing non-linear editing on a video using various applications or tools, for example cutting and merging the video, or adding pictures, background music, special effects, scenes and other material to the video, so as to generate a new video with different expressiveness.
With the all-round development of the multimedia industry, video is increasingly common as a main form of expression in various fields, such as short video platforms, product promotion, knowledge popularization and travel photography sharing. In some scenarios, users expect to edit a specified video into a new video with a shorter duration. For example, presenting a summary video at a designated position on a page allows users to quickly decide whether they are interested and, if so, to browse the full video. For another example, an e-commerce platform can display on a product page a short video that highlights the product's features, so that users can quickly understand the product. For yet another example, for some sports events or film and television works, it may be necessary to replay some highlight videos.
Summary
Embodiments of the present disclosure propose a video generation method and apparatus.
In one or more embodiments, the present disclosure provides a video generation method, the method comprising: obtaining at least two video segments obtained by segmenting a video to be edited; processing the at least two video segments using a pre-trained video processing model to obtain a processing result, wherein the processing result represents the probability that each video segment belongs to a video editing result, and training samples of the video processing model are obtained through the following steps: obtaining a video editing result set corresponding to an original video, separately determining an effect index value of each video editing result in the video editing result set, and generating the training samples of the video processing model according to the effect index values; and selecting video segments from the at least two video segments according to the processing result to generate a video editing result.
In one or more embodiments, the present disclosure provides a video generation apparatus, the apparatus comprising: an acquisition unit configured to obtain at least two video segments obtained by segmenting a video to be edited; a processing unit configured to process the at least two video segments using a pre-trained video processing model to obtain a processing result, wherein the processing result represents the probability that each video segment belongs to a video editing result, and training samples of the video processing model are obtained through the following steps: obtaining a video editing result set corresponding to an original video, separately determining an effect index value of each video editing result in the video editing result set, and generating the training samples of the video processing model according to the effect index values; and a generation unit configured to select video segments from the at least two video segments according to the processing result to generate a video editing result.
In one or more embodiments, the present disclosure provides an electronic device, comprising: one or more processors; and a storage device for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the method described in any implementation of any of the above embodiments.
In one or more embodiments, the present disclosure provides a non-transitory computer-readable medium on which a computer program is stored, where the computer program, when executed by a processor, implements the method described in any implementation of any of the above embodiments.
In one or more embodiments, embodiments of the present disclosure provide a computer program product, including a computer program, where the computer program, when executed by a processor, implements the method described in any implementation of any of the above embodiments.
Brief Description of the Drawings
Other features, objects and advantages of the present disclosure will become more apparent by reading the following detailed description of non-limiting embodiments made with reference to the accompanying drawings:
FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present disclosure may be applied;
FIG. 2 is a flowchart of an embodiment of a video generation method according to the present disclosure;
FIG. 3 is a flowchart of an embodiment of generating training samples for a video processing model;
FIG. 4 is a schematic diagram of a network structure of the video processing model;
FIG. 5 is a schematic structural diagram of an embodiment of a video generation apparatus according to the present disclosure;
FIG. 6 is a schematic structural diagram of an electronic device suitable for implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the related invention, not to limit it. It should also be noted that, for ease of description, only the parts related to the relevant invention are shown in the drawings.
It should be noted that, where there is no conflict, the embodiments of the present disclosure and the features in the embodiments may be combined with each other. The present disclosure will be described in detail below with reference to the accompanying drawings and in conjunction with embodiments.
FIG. 1 shows an exemplary architecture 100 to which embodiments of the video generation method or video generation apparatus of the present disclosure may be applied.
As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
The terminal devices 101, 102, 103 interact with the server 105 through the network 104 to receive or send messages and the like. Various client applications may be installed on the terminal devices 101, 102, 103, for example browser applications, search applications, shopping applications, social platforms, video processing applications and instant messaging tools.
The terminal devices 101, 102, 103 may be hardware or software. When they are hardware, they may be various electronic devices, including but not limited to smartphones, tablet computers, e-book readers, laptop computers and desktop computers. When they are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (for example, multiple pieces of software or software modules for providing distributed services), or as a single piece of software or software module. No specific limitation is made here.
The server 105 may be a server providing various services, for example a server providing back-end support for the client applications installed on the terminal devices 101, 102, 103. The server may segment a video to be edited sent by the terminal devices 101, 102, 103, process the at least two video segments obtained by the segmentation using a pre-trained video processing model to obtain a processing result, and then select video segments from the at least two video segments according to the processing result to generate a video editing result of the video to be edited.
It should be noted that the above video to be edited may also be stored directly locally on the server 105, and the server 105 may directly retrieve and process the locally stored video to be edited; in this case, the terminal devices 101, 102, 103 and the network 104 may be absent.
It should be noted that the video generation method provided by the embodiments of the present disclosure is generally executed by the server 105, and accordingly the video generation apparatus is generally disposed in the server 105.
It should also be pointed out that a video processing application may also be installed in the terminal devices 101, 102, 103, and the terminal devices 101, 102, 103 may also process the video to be edited based on the video processing application. In this case, the video generation method may also be executed by the terminal devices 101, 102, 103, and accordingly the video generation apparatus may also be disposed in the terminal devices 101, 102, 103. In this case, the exemplary system architecture 100 may have no server 105 and no network 104.
It should be noted that the server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (for example, multiple pieces of software or software modules for providing distributed services), or as a single piece of software or software module. No specific limitation is made here.
It should be understood that the numbers of terminal devices, networks and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks and servers according to implementation requirements.
Continuing to refer to FIG. 2, it shows a flow 200 of an embodiment of a video generation method according to the present disclosure. The video generation method includes the following steps:
Step 201: obtain at least two video segments obtained by segmenting a video to be edited.
In this embodiment, the video to be edited may be any type of video, which may be determined according to the actual application scenario. For example, the video to be edited may be an introduction video of a certain product. For another example, the video to be edited may be a recording of a certain sports event. The video to be edited is usually a video that one expects to edit so as to obtain, as the video editing result, a video whose duration is shorter than that of the video to be edited itself.
Specifically, various video segmentation methods may be used to segment the video to be edited according to actual application requirements, to obtain at least two video segments, i.e., multiple video segments. For example, the video to be edited may be segmented at equal intervals to obtain multiple video segments. For another example, the video to be edited may be segmented into multiple video segments according to the video content (such as the continuity and relevance of the content). The segmentation itself may be implemented with various existing video editing applications or tools. The durations of the video segments obtained by segmentation may be the same or different. Generally, the content of each video segment obtained by segmentation belongs to the video to be edited.
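As a minimal sketch of the equal-interval option, the following Python snippet only computes segment boundaries; the actual cutting would be delegated to a video editing tool, and the function name is illustrative:

```python
from typing import List, Tuple

def equal_interval_boundaries(total_s: float, segment_s: float) -> List[Tuple[float, float]]:
    """Split [0, total_s) into consecutive windows of segment_s seconds.

    The last window may be shorter. Frame extraction itself would be done
    with an external video editing tool; only (start, end) timestamps are
    computed here.
    """
    bounds = []
    start = 0.0
    while start < total_s:
        end = min(start + segment_s, total_s)
        bounds.append((start, end))
        start = end
    return bounds

# Example: a 95-second video cut into 10-second segments yields 10 segments,
# the last one covering [90.0, 95.0).
print(equal_interval_boundaries(95.0, 10.0))
```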
The execution subject of the video generation method (such as the server 105 shown in FIG. 1) may obtain the above at least two video segments from various data sources such as local storage, a connected database, or a third-party data platform. It should be noted that the execution subject that segments the video to be edited to obtain the above at least two video segments may be the same as or different from the execution subject of the video generation method.
Step 202: process the at least two video segments using a pre-trained video processing model to obtain a processing result.
In this embodiment, the processing result may represent the probability that each of the above at least two video segments belongs to the video editing result, i.e., the probability that the expected video editing result includes the content of each video segment. Generally, the greater the corresponding probability, the more likely the video editing result is to include the content of the video segment.
The input of the video processing model may be at least two video segments, and the output may be a processing result representing the probability that each video segment belongs to a video editing result. The video processing model may be any of various types of neural network models, and its specific network structure may be flexibly set by a technician. The video processing model may be trained in advance using training samples, based on methods such as back propagation and gradient descent.
The training samples of the video processing model may be obtained through the following steps:
Step 1: obtain a video editing result set corresponding to an original video.
In this step, the original video may be any video. The video editing result set corresponding to the original video may be obtained in various ways. For example, various existing video editing methods may be used to edit the original video according to application requirements (such as the required duration of the video editing result) to obtain multiple video editing results. For another example, the original video may be segmented at equal intervals, and each video segment obtained by the segmentation may be used as a video editing result.
It should be noted that the execution subject that obtains the above video editing result set may be the same as or different from the execution subject of the above video generation method. The execution subject that obtains the video editing result set may obtain the video editing result set corresponding to the original video from local storage or various other data sources.
Step 2: separately determine the effect index value of each video editing result in the video editing result set.
In this step, an effect index may refer to an effect or goal that is expected to be achieved. The effect index of a video editing result may refer to the effect or optimization goal that the video editing result is expected to achieve. The effect index may be flexibly set according to actual application requirements; for example, it may be a click-through rate, a completion rate, a conversion rate, and so on. The effect index value is the specific numerical value of the effect index.
The effect index value of each video editing result may be determined by various methods according to the actual application scenario. For example, the effect index value of each video editing result may be predicted by a preset prediction method. For another example, each video editing result may be used online (e.g., delivered online), and the effect index value of each video editing result may then be obtained through statistics or other means.
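If, for instance, the effect index is click-through rate, its value per editing result can be computed from online delivery statistics; a trivial sketch with made-up counts:

```python
def click_through_rate(impressions: int, clicks: int) -> float:
    """Effect index value for one video editing result, from online stats."""
    return clicks / impressions if impressions else 0.0

# Example: 84 clicks over 1200 impressions gives a CTR of 0.07.
print(click_through_rate(1200, 84))
```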
Step 3: generate the training samples of the video processing model according to the effect index values.
In this step, after the effect index value of each video editing result is obtained, the training samples of the video processing model may be generated flexibly in various ways according to the specific input and output forms of the video processing model.
For example, the input of the video processing model may be multiple video segments, and the output may be a ranking result among the video segments, where the ranking result may be formed by arranging the segments in descending or ascending order of the probability that the corresponding video segment belongs to the video editing result. In this case, after the effect index values of the video editing results corresponding to the original video are obtained, the video editing results may be sorted in descending order of effect index value to obtain a ranking result, and the video editing results together with the corresponding ranking result may then be used as a training sample. Here, the probability that a video segment belongs to the video editing result is positively correlated with the effect index value of the video; that is, the larger the effect index value of a video segment, the higher the probability that the video segment belongs to the video editing result. In the same way, multiple original videos may be obtained, and the above steps may be applied to each original video to obtain multiple training samples.
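A minimal sketch of this ranking-based sample construction, assuming each candidate editing result is identified by an id mapped to its measured effect index value (names are illustrative):

```python
def build_ranking_sample(effect_values: dict) -> list:
    """Sort candidate editing results by effect index value, descending.

    effect_values maps a clip id to its observed effect index value
    (e.g. CTR). The returned ranking, paired with the clips themselves,
    forms one training sample: a higher effect value implies a higher
    probability of belonging to the editing result.
    """
    return sorted(effect_values, key=effect_values.get, reverse=True)

# Example: clip "b" performed best online, so it ranks first.
print(build_ranking_sample({"a": 0.02, "b": 0.09, "c": 0.05}))  # ['b', 'c', 'a']
```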
Step 203: select video segments from the at least two video segments according to the processing result to generate a video editing result.
In this embodiment, after the processing result output by the video processing model is obtained, video segments may be selected from the above at least two video segments, and the video editing result may be generated from the selected video segments. Specifically, various selection methods may be used to select the video segments according to the actual application scenario, and various generation methods may be used to generate the video editing result from the video segments.
For example, in the case where the input of the video processing model is multiple video segments and the output is a ranking result among the video segments, the video segment with the largest corresponding probability may be selected from the above at least two video segments and used directly as the video editing result. If the ranking result is arranged in descending order of the corresponding probabilities, the first-ranked video segment is selected as the video editing result; correspondingly, if the ranking result is arranged in ascending order of the corresponding probabilities, the last-ranked video segment may be selected as the video editing result.
Based on the expected effect of the video editing result, feedback in the form of online effect index values is used to construct training samples and obtain the video processing model; the video processing model is then used to process the video to be edited, and the video editing result is generated according to the processing result. The actual effect of video editing results determined by existing video editing methods, for example from the perspectives of image quality, content diversity and representativeness, is unstable. The video generation method provided by the present disclosure proposes to start directly from online effect indexes and use the feedback of the effect indexes to construct the video processing model, so that the video editing result is generated with the video processing model. This makes the video editing result better match the expected effect; moreover, online effect indexes can to a certain extent reflect users' interests, so that the generated video editing result can match users' preferences and improve the user experience.
Referring now to FIG. 3, it shows a flowchart of an embodiment of generating training samples of the video processing model, which specifically includes the following steps:
Step 301: obtain a video editing result set corresponding to an original video.
Step 302: separately determine the effect index value of each video editing result in the video editing result set.
Step 303: select, from the video editing result set, a video editing result whose corresponding effect index value meets a preset condition.
In this embodiment, the preset condition may be flexibly set by a technician according to actual application requirements. For example, the preset condition may be that the effect index value is greater than a preset effect index value threshold. For another example, the preset condition may be that the effect index value is the largest.
As an example, taking the preset condition that the effect index value is the largest, the video editing result with the largest corresponding effect index value may be selected from the video editing results in the video editing result set corresponding to the original video.
Step 304: determine the time period of the selected video editing result in the original video as a target time period.
In this step, the time period of a video editing result in the original video is the time period composed of the time points at which the video editing result appears in the original video. For example, when the video editing result is a continuous video segment, the time period from its start time point to its end time point in the original video may be regarded as the target time period.
Step 305: segment the original video into at least two original video segments, and determine a label for each original video segment.
In this step, various segmentation methods may be used to segment the original video to obtain at least two original video segments, i.e., multiple original video segments; for example, the original video may be segmented at equal intervals. Generally, the duration of an original video segment is not greater than the duration of the video editing results in the video editing result set of step 301 above.
The label of each original video segment may indicate whether the time period of that original video segment in the original video belongs to the target time period. For example, a Boolean value may be used to represent the label; as an example, "1" indicates that the time period of the original video segment in the original video belongs to the above target time period, and "0" indicates that it does not.
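A minimal sketch of this labeling, assuming a segment is labeled 1 only when it lies entirely within the target time period (this overlap rule is an assumption; the embodiment only requires that the label reflect membership in the target period):

```python
def label_segments(boundaries, target):
    """Label each (start, end) segment 1 if it lies inside the target
    time period of the best-performing editing result, else 0."""
    t_start, t_end = target
    return [1 if start >= t_start and end <= t_end else 0
            for start, end in boundaries]

# Example: target period [30, 60) of the original video.
segs = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
print(label_segments(segs, (30, 60)))  # [0, 0, 0, 1, 1, 1, 0]
```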
Step 306: determine the at least two original video segments and the label corresponding to each original video segment as training samples of the video processing model.
In this step, for an original video, the at least two segments corresponding to the original video and the labels respectively corresponding to those video segments may be used as training samples of the video processing model. Multiple training samples can thus be obtained from multiple original videos, after which the video processing model can be trained with the multiple training samples based on machine learning methods.
As an example, the video processing model may be obtained through the following training steps: an initial model is obtained, where the initial model may include an initial video processing model and an initial discriminant model. The initial video processing model may be any of various types of neural network models (such as deep learning models); its input may be multiple video segments, and its output may be the probability that each input video segment belongs to the video editing result. The initial discriminant model may be any of various types of discriminant models (such as a binary classifier); its input may be the probabilities, output by the initial video processing model, that the video segments belong to the video editing result, and its output may be a binary classification result indicating whether each video segment belongs to the video editing result, with one class indicating that a video segment belongs to the video editing result and the other class indicating that it does not; the binary classification result here corresponds to the label of the above video segment. Then, the above training samples may be used to train the initial model with back propagation and gradient descent algorithms, based on a preset loss function (such as a loss function designed based on KL divergence), to obtain a trained initial model. The initial video processing model included in the trained initial model may then be determined as the trained video processing model.
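The following is a compressed, hypothetical PyTorch sketch of one such training step, with random placeholder features standing in for the segment features and one plausible reading of the KL-divergence-based loss; the disclosure does not fix the exact loss form, layer sizes, or discriminator design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# `scorer` stands in for the initial video processing model (features -> per-segment
# logits); `disc` stands in for the initial binary discriminant model over the scores.
scorer = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1))
disc = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 2))
opt = torch.optim.Adam(list(scorer.parameters()) + list(disc.parameters()), lr=1e-4)

features = torch.randn(8, 12, 256)     # 8 videos, 12 segments each (placeholder)
labels = torch.randint(0, 2, (8, 12))  # Boolean labels from step 305

logits = scorer(features).squeeze(-1)                       # (8, 12)
probs = torch.sigmoid(logits)                               # membership probabilities
# One plausible KL-based loss: compare the predicted distribution over a video's
# segments with the normalized label distribution.
pred_dist = F.log_softmax(logits, dim=1)
target_dist = labels.float() / labels.float().sum(dim=1, keepdim=True).clamp(min=1)
kl_loss = F.kl_div(pred_dist, target_dist, reduction="batchmean")
# The discriminator classifies each segment score against its Boolean label.
disc_logits = disc(probs.unsqueeze(-1)).reshape(-1, 2)
disc_loss = F.cross_entropy(disc_logits, labels.reshape(-1))

loss = kl_loss + disc_loss
opt.zero_grad()
loss.backward()
opt.step()
```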
Generally, many factors may influence the effect index of a video. Using the effect index value of the video and labeling each video segment with a Boolean value to form training samples for training the video processing model helps ensure that the video processing model learns the features of whether a segment belongs to a video editing result, thereby assisting in generating video editing results with better effect.
In some optional implementations of this embodiment, the video processing model may include a first feature extraction model, a second feature extraction model and a generation model. The first feature extraction model may be used to extract features of the video segments; the second feature extraction model may determine the temporal relationship features between the video segments according to the features of the video segments respectively extracted by the first feature extraction model; and the generation model may generate the above processing result according to the temporal relationship features between the video segments extracted by the second feature extraction model.
In this case, after the at least two video segments obtained by segmenting the video to be edited are obtained, the first feature extraction model may first be used to extract the feature vector of each of the at least two video segments; the feature vectors respectively corresponding to the at least two video segments are then input into the second feature extraction model to obtain, for each video segment, a feature vector representing the temporal relationship features between the video segments; and the feature vectors respectively corresponding to the video segments output by the second feature extraction model are then input into the generation model to obtain the processing result.
The network structures of the first feature extraction model, the second feature extraction model and the generation model may all be flexibly set by a technician according to actual application requirements.
As an example, FIG. 4 shows a schematic diagram of a network structure of the video processing model. The first feature extraction model may be built on a C3D network (Convolutional 3D Network), or on a C2D network (Convolutional 2D Network) combined with an LSTM (Long Short-Term Memory), and can extract the features of the input video segments. The second feature extraction model may be built on a Transformer model. The generation model may be built on an MLP (Multilayer Perceptron).
A C3D model or the like can model video sequences with good feature expression capability; a Transformer or the like is then used for temporal modeling to learn the features of the context video segments of each video segment; and an MLP or the like then maps the features to the probability that each video segment belongs to the video editing result as the processing result. This achieves good modeling and processing of various videos (such as long videos), which in turn helps improve the accuracy of the processing result.
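A runnable miniature of the FIG. 4 structure in PyTorch, with a small 3D convolution standing in for a real C3D backbone (all layer sizes are illustrative assumptions, not the patented configuration):

```python
import torch
import torch.nn as nn

class SegmentScorer(nn.Module):
    """Per-segment feature extractor -> Transformer encoder over the segment
    sequence -> MLP head emitting each segment's membership probability."""

    def __init__(self, d_model: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(            # stand-in for a C3D network
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(16, d_model),
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)  # second model
        self.head = nn.Sequential(                 # MLP generation model
            nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, 1),
        )

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, n_segments, channels, frames, height, width)
        b, n = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1)).view(b, n, -1)  # first model
        feats = self.temporal(feats)                               # temporal context
        return torch.sigmoid(self.head(feats)).squeeze(-1)         # (b, n) probabilities

# Example: 2 videos, 6 segments of 8 RGB frames at 32x32 resolution.
model = SegmentScorer()
print(model(torch.randn(2, 6, 3, 8, 32, 32)).shape)  # torch.Size([2, 6])
```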
In some optional implementations of this embodiment, video segments may be selected in descending order of their corresponding probabilities and combined to obtain a video editing result.
The number of selected video segments may be flexibly set according to actual application requirements. For example, it may be determined according to the desired duration of the video editing result and the duration of each video segment, so as to ensure that the total duration of the selected video segments is not greater than the desired duration of the video editing result. After the video segments are selected, they may be combined in various ways to obtain the video editing result. For example, the selected video segments may be combined in random order. For another example, the selected video segments may be combined in the chronological order of their time periods in the video to be edited.
This ensures that the video segments composing the video editing result all correspond to relatively high probabilities, and can be applied in scenarios such as video summarization and highlight video generation.
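A minimal sketch of this probability-ordered selection under a duration budget (the budget rule follows the duration constraint described above; the function name is illustrative):

```python
def select_top_segments(probs, durations, budget_s):
    """Greedily pick segments in descending probability while the total
    duration stays within the desired editing-result duration, then
    reorder the picks chronologically (one of the combination orders
    mentioned above)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    picked, total = [], 0.0
    for i in order:
        if total + durations[i] <= budget_s:
            picked.append(i)
            total += durations[i]
    return sorted(picked)  # chronological order by segment index

# Example: five 10-second segments, 30-second budget -> segments 0, 2, 3.
print(select_top_segments([0.9, 0.2, 0.7, 0.8, 0.1], [10, 10, 10, 10, 10], 30))
```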
In some optional implementations of this embodiment, a sliding window of preset duration may be used to select from the original video a group of video segments located within the sliding window, which are combined to obtain a video editing result.
The duration (or size) of the sliding window may be set flexibly. The sum of the probabilities respectively corresponding to the video segments in the selected group may meet a preset condition, which may be set by a technician in advance. For example, the preset condition may be that the sum is greater than a preset threshold. For another example, the preset condition may be that the sum is the largest.
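A minimal sketch of the sliding-window selection, assuming equal-length segments so that a window of preset duration corresponds to k consecutive segments and the "maximum sum" preset condition can be checked with a running sum:

```python
def best_window(probs, k):
    """Among all runs of k consecutive segments (a sliding window of preset
    duration over equal-length segments), return the start index of the run
    whose probability sum is largest, and that sum."""
    window = sum(probs[:k])
    best_sum, best_start = window, 0
    for i in range(k, len(probs)):
        window += probs[i] - probs[i - k]   # slide the window by one segment
        if window > best_sum:
            best_sum, best_start = window, i - k + 1
    return best_start, best_sum

# Example: the best 3-segment window starts at index 1 with sum 2.4.
print(best_window([0.1, 0.8, 0.9, 0.7, 0.2, 0.3], 3))
```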
This not only ensures that the video segments composing the video editing result have relatively high probabilities, but also ensures the continuity of the content of those segments, which helps improve the fluency of the video editing result's content. It can be applied in scenarios of generating introductory videos (such as product introductions and promotional videos).
In addition, according to actual application requirements, when combining the selected video segments, special effects, various other materials, background music and other content may be added to enrich the content and presentation of the generated video editing result.
The video generation method provided by the embodiments of the present disclosure uses the effect index values respectively corresponding to the video editing results of an original video to guide the generation of training samples, so as to pre-train a video processing model; the video processing model is then used to process the video segments obtained by segmenting the video to be edited, to obtain the probability that each video segment belongs to the video editing result; video segments are then selected on this basis to generate the video editing result of the video to be edited. Since the training samples are generated using effect indexes, it can be ensured to a certain extent that the video editing result generated from the processing result of the trained video processing model matches the expected effect, improving the quality of the video editing result.
With further reference to FIG. 5, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of a video generation apparatus. This apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus may be specifically applied in various electronic devices.
As shown in FIG. 5, the video generation apparatus 500 provided by this embodiment includes an acquisition unit 501, a processing unit 502 and a generation unit 503. The acquisition unit 501 is configured to obtain at least two video segments obtained by segmenting a video to be edited; the processing unit 502 is configured to process the at least two video segments using a pre-trained video processing model to obtain a processing result, wherein the processing result represents the probability that each video segment belongs to a video editing result, and training samples of the video processing model are obtained through the following steps: obtaining a video editing result set corresponding to an original video, separately determining the effect index value of each video editing result in the video editing result set, and generating the training samples of the video processing model according to the effect index values; the generation unit 503 is configured to select video segments from the at least two video segments according to the processing result to generate a video editing result.
In this embodiment, for the specific processing of the acquisition unit 501, the processing unit 502 and the generation unit 503 in the video generation apparatus 500 and the technical effects they bring about, reference may be made to the relevant descriptions of step 201, step 202 and step 203 in the embodiment corresponding to FIG. 2, which will not be repeated here.
In some optional implementations of this embodiment, the above steps further include: selecting, from the video editing result set, video editing results whose corresponding effect index values meet a preset condition; determining the time period of the selected video editing result in the original video as a target time period; segmenting the original video into at least two original video segments, and determining a label for each original video segment, the label of an original video segment indicating whether the time period of the original video segment in the original video belongs to the target time period; and determining the at least two original video segments and the label corresponding to each original video segment as training samples of the video processing model.
In some optional implementations of this embodiment, the above video processing model includes a first feature extraction model, a second feature extraction model and a generation model; and the above processing unit 502 is further configured to: use the first feature extraction model to respectively extract the features of each of the at least two video segments, to obtain first feature vectors respectively corresponding to the video segments; use the second feature extraction model to extract the temporal relationship features between the video segments, to obtain second feature vectors respectively corresponding to the video segments; and use the generation model to generate the processing result according to the second feature vectors respectively corresponding to the video segments.
In some optional implementations of this embodiment, the above generation unit 503 is further configured to select video segments in descending order of their corresponding probabilities and combine them to obtain a video editing result.
In some optional implementations of this embodiment, the above generation unit 503 is further configured to use a sliding window of preset duration to select from the original video a group of video segments located within the sliding window and combine them to obtain a video editing result, wherein the sum of the probabilities respectively corresponding to the video segments in the selected group meets a preset condition.
In the apparatus provided by the above embodiment of the present disclosure, the acquisition unit obtains at least two video segments obtained by segmenting a video to be edited; the processing unit processes the at least two video segments using a pre-trained video processing model to obtain a processing result, wherein the processing result represents the probability that each video segment belongs to a video editing result, and the training samples of the video processing model are obtained through the following steps: obtaining a video editing result set corresponding to an original video, separately determining the effect index value of each video editing result in the video editing result set, and generating the training samples of the video processing model according to the effect index values; the generation unit selects video segments from the at least two video segments according to the processing result to generate a video editing result. Since the training samples are generated using effect indexes, it can be ensured to a certain extent that the video editing result generated from the processing result of the trained video processing model matches the expected effect, improving the quality of the video editing result.
According to embodiments of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium and a computer program product.
Referring now to FIG. 6, it shows a schematic structural diagram of an electronic device 600 (such as the server in FIG. 1) suitable for implementing embodiments of the present disclosure. The server shown in FIG. 6 is merely an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 6, the electronic device 600 may include a processing device (such as a central processing unit or a graphics processing unit) 601, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the electronic device 600. The processing device 601, the ROM 602 and the RAM 603 are connected to one another via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer and a gyroscope; output devices 607 including, for example, a liquid crystal display (LCD), a speaker and a vibrator; storage devices 608 including, for example, a magnetic tape and a hard disk; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 6 shows an electronic device 600 with various devices, it should be understood that it is not required to implement or possess all the devices shown; more or fewer devices may alternatively be implemented or possessed. Each block shown in FIG. 6 may represent one device, or may represent multiple devices as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product, which includes a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 609, or installed from the storage device 608, or installed from the ROM 602. When the computer program is executed by the processing device 601, the above functions defined in the methods of the embodiments of the present disclosure are performed.
It should be noted that the computer-readable medium described in the embodiments of the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the embodiments of the present disclosure, a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in combination with an instruction execution system, apparatus or device. In the embodiments of the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; it may send, propagate or transmit a program for use by or in combination with an instruction execution system, apparatus or device. The program code contained on the computer-readable medium may be transmitted using any appropriate medium, including but not limited to wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
The above computer-readable medium may be included in the above electronic device, or may exist separately without being assembled into the electronic device. The above computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device is caused to: obtain at least two video segments obtained by segmenting a video to be edited; process the at least two video segments using a pre-trained video processing model to obtain a processing result, wherein the processing result represents the probability that each video segment belongs to a video editing result, and training samples of the video processing model are obtained through the following steps: obtaining a video editing result set corresponding to an original video, separately determining the effect index value of each video editing result in the video editing result set, and generating the training samples of the video processing model according to the effect index values; and select video segments from the at least two video segments according to the processing result to generate a video editing result.
Computer program code for performing the operations of the embodiments of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions and operations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment or part of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented in software or in hardware. The described units may also be provided in a processor; for example, a processor may be described as including an acquisition unit, a processing unit and a generation unit. The names of these units do not in some cases constitute a limitation on the units themselves; for example, the acquisition unit may also be described as "a unit for obtaining at least two video segments obtained by segmenting a video to be edited".
The above description is only of preferred embodiments of the present disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of the invention involved in the embodiments of the present disclosure is not limited to technical solutions formed by specific combinations of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above inventive concept, for example technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the embodiments of the present disclosure.

Claims (13)

  1. A video generation method, comprising:
    obtaining at least two video segments obtained by segmenting a video to be edited;
    processing the at least two video segments using a pre-trained video processing model to obtain a processing result, wherein the processing result represents the probability that each video segment belongs to a video editing result, and training samples of the video processing model are obtained through the following steps: obtaining a video editing result set corresponding to an original video, separately determining an effect index value of each video editing result in the video editing result set, and generating the training samples of the video processing model according to the effect index values;
    selecting video segments from the at least two video segments according to the processing result to generate a video editing result.
  2. The method according to claim 1, wherein the generating the samples of the video processing model according to the effect index values comprises:
    selecting, from the video editing result set, a video editing result whose corresponding effect index value meets a preset condition;
    determining a time period of the selected video editing result in the original video as a target time period;
    segmenting the original video into at least two original video segments, and determining a label for each original video segment, the label of an original video segment indicating whether the time period of the original video segment in the original video belongs to the target time period;
    determining the at least two original video segments and the label corresponding to each original video segment as the training samples of the video processing model.
  3. The method according to claim 2, wherein the video processing model comprises a first feature extraction model, a second feature extraction model and a generation model; and
    the processing the at least two video segments using a pre-trained video processing model to obtain a processing result comprises:
    using the first feature extraction model to respectively extract features of each of the at least two video segments, to obtain first feature vectors respectively corresponding to the video segments;
    using the second feature extraction model to extract temporal relationship features between the video segments, to obtain second feature vectors respectively corresponding to the video segments;
    using the generation model to generate the processing result according to the second feature vectors respectively corresponding to the video segments.
  4. The method according to any one of claims 1-3, wherein the selecting video segments from the at least two video segments according to the processing result to generate a video editing result comprises:
    selecting video segments in descending order of the corresponding probabilities and combining them to obtain a video editing result.
  5. The method according to any one of claims 1-3, wherein the selecting video segments from the at least two video segments according to the processing result to generate a video editing result comprises:
    using a sliding window of a preset duration to select, from the original video, a group of video segments located within the sliding window and combining them to obtain a video editing result, wherein a sum of the probabilities respectively corresponding to the video segments in the selected group meets a preset condition.
  6. A video generation apparatus, comprising:
    an acquisition unit configured to obtain at least two video segments obtained by segmenting a video to be edited;
    a processing unit configured to process the at least two video segments using a pre-trained video processing model to obtain a processing result, wherein the processing result represents the probability that each video segment belongs to a video editing result, and training samples of the video processing model are obtained through the following steps: obtaining a video editing result set corresponding to an original video, separately determining an effect index value of each video editing result in the video editing result set, and generating the training samples of the video processing model according to the effect index values;
    a generation unit configured to select video segments from the at least two video segments according to the processing result to generate a video editing result.
  7. The apparatus according to claim 6, wherein the steps of obtaining the training samples of the video processing model further comprise:
    selecting, from the video editing result set, a video editing result whose corresponding effect index value meets a preset condition;
    determining a time period of the selected video editing result in the original video as a target time period;
    segmenting the original video into at least two original video segments, and determining a label for each original video segment, the label of an original video segment indicating whether the time period of the original video segment in the original video belongs to the target time period;
    determining the at least two original video segments and the label corresponding to each original video segment as the training samples of the video processing model.
  8. The apparatus according to claim 7, wherein the video processing model comprises a first feature extraction model, a second feature extraction model and a generation model; and
    the processing unit is further configured to: use the first feature extraction model to respectively extract features of each of the at least two video segments, to obtain first feature vectors respectively corresponding to the video segments;
    use the second feature extraction model to extract temporal relationship features between the video segments, to obtain second feature vectors respectively corresponding to the video segments;
    use the generation model to generate the processing result according to the second feature vectors respectively corresponding to the video segments.
  9. The apparatus according to any one of claims 6-8, wherein the generation unit is further configured to select video segments in descending order of the corresponding probabilities and combine them to obtain a video editing result.
  10. The apparatus according to any one of claims 6-8, wherein the generation unit is further configured to use a sliding window of a preset duration to select, from the original video, a group of video segments located within the sliding window and combine them to obtain a video editing result, wherein a sum of the probabilities respectively corresponding to the video segments in the selected group meets a preset condition.
  11. An electronic device, comprising:
    one or more processors; and
    a storage device having one or more computer programs stored thereon;
    wherein the one or more computer programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-5.
  12. A non-transitory computer-readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method according to any one of claims 1-5.
  13. A computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-5.
PCT/CN2023/128301 2022-11-08 2023-10-31 Video generation method and apparatus WO2024099171A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211389074.X 2022-11-08
CN202211389074.XA CN115801980A (zh) 2022-11-08 2022-11-08 Video generation method and apparatus

Publications (1)

Publication Number Publication Date
WO2024099171A1 true WO2024099171A1 (zh) 2024-05-16

Family

ID=85436006

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/128301 WO2024099171A1 (zh) Video generation method and apparatus 2022-11-08 2023-10-31

Country Status (2)

Country Link
CN (1) CN115801980A (zh)
WO (1) WO2024099171A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115801980A (zh) 2022-11-08 2023-03-14 北京沃东天骏信息技术有限公司 Video generation method and apparatus
CN116132752B (zh) * 2023-04-13 2023-12-08 北京百度网讯科技有限公司 Video comparison group construction, model training and video scoring method, apparatus and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107566907A (zh) * 2017-09-20 2018-01-09 广东欧珀移动通信有限公司 Video editing method, apparatus, storage medium and terminal
CN110401873A (zh) * 2019-06-17 2019-11-01 北京奇艺世纪科技有限公司 Video editing method, apparatus, electronic device and computer-readable medium
CN110505519A (zh) * 2019-08-14 2019-11-26 咪咕文化科技有限公司 Video editing method, electronic device and storage medium
CN112532897A (zh) * 2020-11-25 2021-03-19 腾讯科技(深圳)有限公司 Video editing method, apparatus, device and computer-readable storage medium
CN112770061A (zh) * 2020-12-16 2021-05-07 影石创新科技股份有限公司 Video editing method, system, electronic device and storage medium
CN115801980A (zh) * 2022-11-08 2023-03-14 北京沃东天骏信息技术有限公司 Video generation method and apparatus

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109982109B (zh) * 2019-04-03 2021-08-03 睿魔智能科技(深圳)有限公司 Short video generation method and apparatus, server and storage medium
KR102378746B1 (ko) * 2019-08-16 2022-03-25 서울여자대학교 산학협력단 Method for automatic segmentation of abdominal organs in medical images based on deep learning


Also Published As

Publication number Publication date
CN115801980A (zh) 2023-03-14

Similar Documents

Publication Publication Date Title
CN109460513B (zh) Method and apparatus for generating a click-through rate prediction model
US20180365257A1 (en) Method and apparatus for querying
US11758088B2 (en) Method and apparatus for aligning paragraph and video
WO2024099171A1 (zh) Video generation method and apparatus
CN107943877B (zh) Method and apparatus for generating multimedia content to be played
JP7394809B2 (ja) Method, apparatus, electronic device, medium and computer program for processing video
CN109862100B (zh) Method and apparatus for pushing information
CN109255037B (zh) Method and apparatus for outputting information
CN109255035B (zh) Method and apparatus for constructing a knowledge graph
CN110866040B (zh) User profile generation method, apparatus and system
US20200409998A1 (en) Method and device for outputting information
CN112287168A (zh) Method and apparatus for generating video
US20210377628A1 (en) Method and apparatus for outputting information
CN111897950A (zh) Method and apparatus for generating information
CN108038172B (zh) Artificial intelligence-based search method and apparatus
US20230367972A1 (en) Method and apparatus for processing model data, electronic device, and computer readable medium
CN112182255A (zh) Method and apparatus for storing media files and for retrieving media files
CN110059172B (zh) Method and apparatus for recommending answers based on natural language understanding
CN111078849A (zh) Method and apparatus for outputting information
CN111125502B (zh) Method and apparatus for generating information
CN109857838B (zh) Method and apparatus for generating information
CN110888583B (zh) Page display method, system, apparatus and electronic device
US10910014B2 (en) Method and apparatus for generating video
CN114239501A (zh) Contract generation method, apparatus, device and medium
CN112287173A(zh) Method and apparatus for generating information