CN113949828B - Video editing method, device, electronic equipment and storage medium - Google Patents

Video editing method, device, electronic equipment and storage medium

Info

Publication number
CN113949828B
CN113949828B · Application CN202111211990.XA
Authority
CN
China
Prior art keywords
video
content
segment
clip
prediction model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111211990.XA
Other languages
Chinese (zh)
Other versions
CN113949828A (en)
Inventor
梅立军
付瑞吉
李月雷
张德兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202111211990.XA
Publication of CN113949828A
Priority to PCT/CN2022/094576 (WO2023065663A1)
Application granted
Publication of CN113949828B
Legal status: Active
Anticipated expiration

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/265Mixing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Signal Processing (AREA)
  • Molecular Biology (AREA)
  • Library & Information Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Television Signal Processing For Recording (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management Or Editing Of Information On Record Carriers (AREA)

Abstract

The disclosure relates to a video editing method and apparatus, an electronic device, and a storage medium. The method comprises the following steps: acquiring a selection instruction for a clipping point of an original video, and extracting a target video segment from the original video, the target video segment being a video segment of a preset duration before or after the clipping point in the original video; inputting video content features corresponding to the target video segment into a content feature prediction model to obtain predicted video content features; determining a video segment to be inserted from a set of video material segments according to the predicted video content features, wherein the degree of matching between the video content features corresponding to the video segment to be inserted and the predicted video content features satisfies a preset condition; and feeding back the video segment to be inserted to a user for insertion at the clipping point of the original video. With the video editing method and apparatus, video editing is optimized and the joining of video segments is handled intelligently, so that the edited video is more natural and smooth.

Description

Video editing method, device, electronic equipment and storage medium
Technical Field
The disclosure relates to the field of computer technology, and in particular to a video editing method, a video editing apparatus, an electronic device, and a storage medium.
Background
At present, short video editing is generally performed by inserting segments from several different short videos into one video, or by directly combining a group of short video segments into a single video. However, this approach requires manually labeled collection of video segments and relies on manual mixed-cut (mashup) editing. Automatic short video mixed cutting is largely absent, and what little exists merely combines video segments by simple attribute aggregation, so no intelligence is reflected in how the video segments are joined.
Therefore, the related art lacks automatic, intelligent mixed cutting.
Disclosure of Invention
The disclosure provides a video editing method and apparatus, an electronic device, and a storage medium, which at least solve the problem that the related art lacks automatic, intelligent mixed cutting. The technical solution of the present disclosure is as follows:
according to a first aspect of an embodiment of the present disclosure, there is provided a video editing method, including:
acquiring a selection instruction for a clipping point of an original video, and extracting a target video segment from the original video; the target video segment is a video segment of the original video with a preset duration before the clipping point or after the clipping point;
inputting video content features corresponding to the target video segment into a content feature prediction model to obtain predicted video content features;
determining a video segment to be inserted from a set of video material segments according to the predicted video content features; the degree of matching between the video content features corresponding to the video segment to be inserted and the predicted video content features satisfies a preset condition;
and feeding back the video segment to be inserted to a user for insertion at the clipping point of the original video.
In one possible implementation, the determining a video segment to be inserted from a set of video material segments according to the predicted video content features includes:
determining a matching-degree ranking result between a plurality of video content features and the predicted video content features based on the video content features corresponding to a plurality of video material segments in the set of video material segments;
when a matching degree is greater than a preset threshold, determining that the matching degree between the corresponding video content features and the predicted video content features satisfies the preset condition;
and taking the video material segment corresponding to those video content features as a video segment to be inserted.
In one possible implementation, the video segments to be inserted include a plurality of video segments, and the feeding back the video segments to be inserted to the user includes:
acquiring preset feedback index information;
ranking the plurality of video segments to be inserted according to the feedback index information to obtain a feedback ranking result;
and feeding back the plurality of video segments to be inserted based on the feedback ranking result.
In one possible implementation, after the step of feeding back the video segments to be inserted to the user, the method further includes:
determining a target insertion video segment from the plurality of video segments to be inserted according to insertion selection information returned by the user;
and inserting the target insertion video segment before or after the clipping point of the original video.
In one possible implementation, the method further includes:
acquiring training sample data; the training sample data includes a plurality of video segment pairs; each video segment pair comprises a first video segment and a second video segment belonging to the same sample video; the first video segment is a video segment of a preset duration before a video key point in the sample video; and the second video segment is a video segment of a preset duration after the video key point in the sample video;
and training the content feature prediction model to be trained using the training sample data to obtain the content feature prediction model.
In one possible implementation, in the case where the target video segment is a video segment of the original video with a preset duration before the clipping point, training the content feature prediction model to be trained using the training sample data to obtain the content feature prediction model includes:
inputting video content features corresponding to the first video segment into the content feature prediction model to be trained to obtain predicted video content features corresponding to the first video segment;
and adjusting model parameters of the content feature prediction model to be trained based on the difference between the predicted video content features corresponding to the first video segment and the video content features corresponding to the second video segment, until the adjusted content feature prediction model satisfies a preset training condition, to obtain the content feature prediction model.
In the case where the target video segment is a video segment of the original video with a preset duration after the clipping point, training the content feature prediction model to be trained using the training sample data to obtain the content feature prediction model includes:
inputting video content features corresponding to the second video segment into the content feature prediction model to be trained to obtain predicted video content features corresponding to the second video segment;
and adjusting model parameters of the content feature prediction model to be trained based on the difference between the predicted video content features corresponding to the second video segment and the video content features corresponding to the first video segment, until the adjusted content feature prediction model satisfies a preset training condition, to obtain the content feature prediction model.
In one possible implementation manner, after the step of obtaining training sample data, the method further includes:
for each image content feature dimension, adjusting each image frame in the first video segment and the second video segment of each video segment pair according to an image preprocessing mode corresponding to that image content feature dimension, to obtain adjusted image frames;
performing image feature extraction on the adjusted image frames to obtain a plurality of image feature vectors;
and splicing the plurality of image feature vectors to obtain video feature vectors corresponding to the first video segment and the second video segment respectively; the video feature vectors are used for characterizing the video content features corresponding to the first video segment and the second video segment respectively.
In one possible implementation manner, the acquiring training sample data includes:
acquiring a video highlight point set of a sample video;
determining, for each video highlight point, a first video segment of a preset duration before the video highlight point in the sample video and a second video segment of a preset duration after the video highlight point in the sample video;
and obtaining a video segment pair corresponding to the video highlight point from the first video segment and the second video segment.
In one possible implementation, the acquiring the video highlight point set of the sample video includes:
acquiring preset highlight point extraction information; the highlight point extraction information is used for identifying video highlight points according to picture information, sound information and text information in the video;
and determining a plurality of video highlight points from the sample video according to the highlight point extraction information, to obtain the video highlight point set of the sample video.
According to a second aspect of embodiments of the present disclosure, there is provided a video editing apparatus, comprising:
an acquisition unit configured to acquire a selection instruction for a clipping point of an original video and extract a target video segment from the original video; the target video segment is a video segment of the original video with a preset duration before the clipping point or after the clipping point;
a prediction unit configured to input video content features corresponding to the target video segment into a content feature prediction model to obtain predicted video content features;
a video segment matching unit configured to determine a video segment to be inserted from a set of video material segments according to the predicted video content features; the degree of matching between the video content features corresponding to the video segment to be inserted and the predicted video content features satisfies a preset condition;
and a feedback unit configured to feed back the video segment to be inserted to a user for insertion at the clipping point of the original video.
In one possible implementation, the video segment matching unit is specifically configured to determine a matching-degree ranking result between a plurality of video content features and the predicted video content features based on the video content features corresponding to a plurality of video material segments in the set of video material segments; when a matching degree is greater than a preset threshold, determine that the matching degree between the corresponding video content features and the predicted video content features satisfies the preset condition; and take the video material segment corresponding to those video content features as a video segment to be inserted.
In one possible implementation, the video segments to be inserted include a plurality of video segments, and the feedback unit is specifically configured to acquire preset feedback index information; rank the plurality of video segments to be inserted according to the feedback index information to obtain a feedback ranking result; and feed back the plurality of video segments to be inserted based on the feedback ranking result.
In one possible implementation, the apparatus further includes:
a target insertion video segment determining unit configured to determine a target insertion video segment from the plurality of video segments to be inserted according to insertion selection information returned by the user;
and a target insertion video segment inserting unit configured to insert the target insertion video segment before or after the clipping point of the original video.
In one possible implementation, the apparatus further includes:
a training sample data acquisition unit configured to acquire training sample data; the training sample data includes a plurality of video segment pairs; each video segment pair comprises a first video segment and a second video segment belonging to the same sample video; the first video segment is a video segment of a preset duration before a video key point in the sample video; and the second video segment is a video segment of a preset duration after the video key point in the sample video;
and a model training unit configured to train a content feature prediction model to be trained using the training sample data to obtain the content feature prediction model.
In one possible implementation, in the case where the target video segment is a video segment of the original video with a preset duration before the clipping point, the model training unit is specifically configured to input video content features corresponding to the first video segment into the content feature prediction model to be trained to obtain predicted video content features corresponding to the first video segment; and adjust model parameters of the content feature prediction model to be trained based on the difference between the predicted video content features corresponding to the first video segment and the video content features corresponding to the second video segment, until the adjusted content feature prediction model satisfies a preset training condition, to obtain the content feature prediction model.
In the case where the target video segment is a video segment of the original video with a preset duration after the clipping point, the model training unit is specifically configured to input video content features corresponding to the second video segment into the content feature prediction model to be trained to obtain predicted video content features corresponding to the second video segment; and adjust model parameters of the content feature prediction model to be trained based on the difference between the predicted video content features corresponding to the second video segment and the video content features corresponding to the first video segment, until the adjusted content feature prediction model satisfies a preset training condition, to obtain the content feature prediction model.
In one possible implementation, the apparatus further includes:
an image preprocessing unit configured to, for each image content feature dimension, adjust each image frame in the first video segment and the second video segment of each video segment pair according to an image preprocessing mode corresponding to that image content feature dimension, to obtain adjusted image frames;
an image feature extraction unit configured to perform image feature extraction on the adjusted image frames to obtain a plurality of image feature vectors;
and a splicing unit configured to splice the plurality of image feature vectors to obtain video feature vectors corresponding to the first video segment and the second video segment respectively; the video feature vectors are used for characterizing the video content features corresponding to the first video segment and the second video segment respectively.
In one possible implementation, the training sample data acquisition unit is specifically configured to acquire a video highlight point set of a sample video; determine, for each video highlight point, a first video segment of a preset duration before the video highlight point in the sample video and a second video segment of a preset duration after the video highlight point in the sample video; and obtain a video segment pair corresponding to the video highlight point from the first video segment and the second video segment.
In one possible implementation, the training sample data acquisition unit is specifically configured to acquire preset highlight point extraction information, the highlight point extraction information being used for identifying video highlight points according to picture information, sound information and text information in the video; and determine a plurality of video highlight points from the sample video according to the highlight point extraction information to obtain the video highlight point set of the sample video.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device comprising a memory storing a computer program and a processor which, when executing the computer program, implements the video editing method according to the first aspect or any one of its possible implementations.
According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the video editing method according to the first aspect or any one of its possible implementations.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program stored in a readable storage medium, from which at least one processor of a device reads and executes the computer program, causing the device to perform the video editing method according to any of the embodiments of the first aspect.
The technical solutions provided by the embodiments of the present disclosure bring at least the following beneficial effects. A selection instruction for a clipping point of an original video is acquired and a target video segment is extracted from the original video, the target video segment being a video segment of a preset duration before or after the clipping point in the original video. Video content features corresponding to the target video segment are input into a content feature prediction model to obtain predicted video content features. A video segment to be inserted is then determined from a set of video material segments according to the predicted video content features, where the degree of matching between the video content features corresponding to the video segment to be inserted and the predicted video content features satisfies a preset condition, and the video segment to be inserted is fed back to a user for insertion at the clipping point of the original video. In this way, predicted video content features can be obtained from the video content features of the target video segment, and matching video segments to be inserted can be retrieved from the set of video material segments and fed back, so that video editing is optimized, the edited video is more natural and smooth, the joining of video segments is handled intelligently, and abrupt transitions in the edited video are avoided.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
Fig. 1 is a diagram of an application environment of a video editing method according to an exemplary embodiment.
Fig. 2 is a flowchart illustrating a video editing method according to an exemplary embodiment.
Fig. 3 is a schematic diagram illustrating a process flow of intelligent video clip editing according to an exemplary embodiment.
Fig. 4 is a flowchart illustrating a model training step according to an exemplary embodiment.
Fig. 5a is a schematic diagram illustrating model training according to an exemplary embodiment.
Fig. 5b is a schematic diagram illustrating a process flow for training data preparation and model training according to an exemplary embodiment.
Fig. 6 is a flowchart illustrating another video editing method according to an exemplary embodiment.
Fig. 7 is a block diagram of a video editing apparatus according to an exemplary embodiment.
Fig. 8 is a diagram of the internal structure of an electronic device according to an exemplary embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure.
It should be further noted that, the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for presentation, analyzed data, etc.) related to the present disclosure are information and data authorized by the user or sufficiently authorized by each party.
The video editing method provided by the present disclosure can be applied to the application environment shown in Fig. 1, in which the client 110 interacts with the server 120 through a network. The server 120 acquires a selection instruction for a clipping point of an original video, extracts a target video segment from the original video, inputs video content features corresponding to the target video segment into a content feature prediction model to obtain predicted video content features, determines a video segment to be inserted from a set of video material segments according to the predicted video content features, and feeds back the video segment to be inserted to the client 110. In practical applications, the client 110 may be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer or a portable wearable device, and the server 120 may be implemented as a stand-alone server or as a server cluster formed by a plurality of servers.
Fig. 2 is a flowchart illustrating a video editing method according to an exemplary embodiment. As shown in Fig. 2, the method is used in the server 120 of Fig. 1 and includes the following steps.
In step S210, a selection instruction for a clipping point of an original video is obtained, and a target video segment is extracted from the original video; the target video segment is a video segment of a preset duration before the clipping point or after the clipping point in the original video.
The original video may be the video into which a clip is to be inserted; for example, the base video currently being edited on the user side may be used as the original video.
The target video segment may be the segment of the original video from which adjoining video content is to be predicted; for example, a video segment whose content joins naturally with that of the target video segment may be predicted based on the target video segment extracted from the original video.
As an example, the clipping point may be a user-specified insertion time position in the original video, such as an insertion time position p specified according to the user's requirements.
In a specific implementation, during video editing the server may receive a selection instruction, sent by the user side, for a clipping point of the original video, and may then extract a target video segment from the original video according to the obtained instruction, where the target video segment may be a video segment of a preset duration before or after the clipping point in the original video.
For example, after the insertion time position p is determined, the video covering a time interval [t_p - n, t_p] immediately before the insertion time position may be extracted from the original video as the target video segment.
In an example, since the total duration of a short video is small, the preset duration n (i.e., the length of the time interval) before the clipping point may be chosen within the 10-15 s preceding the time position p; the preset duration n may also take other values, which is not limited here.
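As an illustrative sketch only (the disclosure does not prescribe any particular library or parameter values), the extraction of the target video segment for the interval [t_p - n, t_p] could look as follows; the file name, clipping point and duration below are hypothetical.

```python
# Hedged sketch: read the frames of the original video falling in
# [clip_point - preset_duration, clip_point]; file name and timings are examples.
import cv2

def extract_target_clip(video_path: str, clip_point_s: float, preset_duration_s: float = 10.0):
    """Return the frames of the preset-duration interval ending at the clipping point."""
    cap = cv2.VideoCapture(video_path)
    start_s = max(0.0, clip_point_s - preset_duration_s)
    cap.set(cv2.CAP_PROP_POS_MSEC, start_s * 1000.0)  # seek to the start of the interval
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok or cap.get(cv2.CAP_PROP_POS_MSEC) > clip_point_s * 1000.0:
            break
        frames.append(frame)
    cap.release()
    return frames

target_frames = extract_target_clip("original.mp4", clip_point_s=42.0, preset_duration_s=10.0)
```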
In step S220, video content features corresponding to the target video clip are input to the content feature prediction model, so as to obtain predicted video content features;
The video content features corresponding to the target video segment may be a feature vector sequence obtained by multidimensional feature extraction from the target video segment, and may be used to characterize the video content of the segment.
In a specific implementation, after the target video segment is obtained, multidimensional feature extraction may be performed on it to obtain the corresponding video content features, which may then be input into the content feature prediction model to obtain predicted video content features; the video content represented by the predicted video content features joins naturally with the video content represented by the features of the target video segment.
In an example, the predicted video content features may be a set of vector sequences. For example, based on the target video segment extracted at the insertion time position p, a candidate set Y_p may be predicted by the pre-trained content feature prediction model. The set may have a plurality of elements; each element y is a vector sequence, each vector corresponds to a video frame, and the complete vector sequence corresponds to a video segment. In other words, the candidate set Y_p is the set of vector sequences corresponding to the predicted candidate video segments.
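As a sketch of what such a predictor could look like (the description lists LSTM and other recurrent networks as options but fixes no architecture), an encoder-decoder LSTM mapping the feature sequence of the target video segment to a predicted feature sequence might be written as below; the layer sizes and number of predicted frames are assumptions.

```python
# Hedged sketch of a content feature prediction model; all dimensions are illustrative.
import torch
import torch.nn as nn

class ContentFeaturePredictor(nn.Module):
    def __init__(self, feat_dim: int = 512, hidden_dim: int = 256, out_steps: int = 32):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.project = nn.Linear(hidden_dim, feat_dim)
        self.out_steps = out_steps

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        # clip_features: (batch, num_frames, feat_dim) for the target video segment.
        _, (h, c) = self.encoder(clip_features)
        # Repeat the final hidden state as decoder input for each frame to be predicted.
        dec_in = h[-1].unsqueeze(1).repeat(1, self.out_steps, 1)
        dec_out, _ = self.decoder(dec_in, (h, c))
        return self.project(dec_out)  # (batch, out_steps, feat_dim) predicted feature sequence

predictor = ContentFeaturePredictor()
predicted_features = predictor(torch.randn(1, 300, 512))  # e.g. a 10 s clip at 30 fps
```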
In step S230, determining a video clip to be inserted from the video material clip set according to the predicted video content characteristics; the matching degree between the video content characteristics corresponding to the video clips to be inserted and the predicted video content characteristics meets the preset condition;
the set of video material segments may be a set of video segments, and each video segment may correspond to a sequence of vectors characterizing video content of the video segment.
In a specific implementation, a search can be performed in the set of video material segments according to the predicted video content features: through similarity matching between the predicted video content features and the vector sequences of the video content features corresponding to the video segments in the set, the vector sequences whose degree of matching with the predicted video content features satisfies the preset condition can be found, and the video segments corresponding to the found vector sequences can be taken as the video segments to be inserted.
In an example, the similarity matching result may be the N video segments with the highest degree of matching with the predicted video content features, which are taken as the video segments to be inserted.
In step S240, the video clip to be inserted is fed back to the user for insertion into the clip point of the original video.
After the video clip to be inserted is obtained, the server can feed back the video clip to be inserted to the user side, and then the video clip to be inserted can be inserted to the clipping point of the original video based on user operation, so that the video after mixed editing is obtained.
With the above video editing method, a target video segment is extracted from the original video upon acquiring a selection instruction for a clipping point of the original video; the video content features corresponding to the target video segment are input into the content feature prediction model to obtain predicted video content features; and video segments to be inserted are determined from the set of video material segments according to the predicted video content features and fed back to the user for insertion at the clipping point of the original video. In this way, predicted video content features can be obtained from the video content features of the target video segment, matching video segments to be inserted can be retrieved from the set of video material segments and fed back, video editing is optimized, the edited video is more natural and smooth, the joining of video segments is handled intelligently, and abrupt transitions in the edited video are avoided.
In an exemplary embodiment, obtaining a selection instruction for a clipping point of an original video and extracting a target video segment from the original video includes: acquiring the selection instruction for the clipping point of the original video, and determining a time interval before the clipping point or after the clipping point in the original video; and extracting the target video segment from the original video based on the time interval.
In a specific implementation, the server may receive the selection instruction for the clipping point of the original video sent by the user side, determine from it the time interval before or after the clipping point in the original video, and, once the time interval is obtained, extract the target video segment corresponding to that interval from the original video.
For example, the insertion time position p may be determined according to the selection instruction, the time interval [t_p - n, t_p] before the insertion time position may then be obtained based on the preset duration n, and the video segment corresponding to [t_p - n, t_p] may be extracted from the original video as the target video segment.
With the technical solution of this embodiment, the time interval before or after the clipping point in the original video is determined by acquiring the selection instruction for the clipping point, and the target video segment is then extracted from the original video based on that interval, so the target video segment can be extracted accurately according to the user's requirements, providing data support for the subsequent prediction of video content features.
In an exemplary embodiment, determining a video segment to be inserted from the set of video material segments according to the predicted video content features includes: determining a matching-degree ranking result between a plurality of video content features and the predicted video content features based on the video content features corresponding to a plurality of video material segments in the set; when a matching degree is greater than a preset threshold, determining that the matching degree between the corresponding video content features and the predicted video content features satisfies the preset condition; and taking the video material segment corresponding to those video content features as a video segment to be inserted.
In a specific implementation, the set of video material segments contains a plurality of video material segments. Based on the video content features corresponding to these segments, a search can be performed in the set according to the predicted video content features: through similarity matching between the predicted video content features and the video content features of the material segments, the video content features whose degree of matching with the predicted features satisfies the preset condition can be found, and the corresponding video material segments can be used as the video segments to be inserted.
For example, when the predicted video content features contain 5 elements, the 10 video segments with the highest similarity can be retrieved for each element, and the video segments to be inserted can then be formed from the 10 segments retrieved for each of the 5 elements, i.e. 50 segments in total.
With the technical solution of this embodiment, after obtaining the predicted video content features, the server determines the matching-degree ranking result between the video content features of the material segments and the predicted features, determines that the preset condition is satisfied when a matching degree exceeds the preset threshold, and takes the corresponding video material segments as the video segments to be inserted. Video material segments with high similarity can thus be matched effectively to the predicted video content features, improving how well the video content joins.
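A minimal sketch of this retrieval step, under the simplifying assumption that each material segment and each predicted element are summarized by a single pooled feature vector (the description itself works with vector sequences):

```python
# Hedged sketch: cosine-similarity matching of a predicted feature against the material set.
import numpy as np

def top_n_matches(predicted_feat: np.ndarray, material_feats: np.ndarray,
                  n: int = 10, threshold: float = 0.8):
    """Return up to n material-segment indices whose matching degree exceeds the threshold,
    ordered from best to worst, together with all matching degrees."""
    pred = predicted_feat / (np.linalg.norm(predicted_feat) + 1e-8)
    mats = material_feats / (np.linalg.norm(material_feats, axis=1, keepdims=True) + 1e-8)
    sims = mats @ pred                       # matching degree for every material segment
    ranking = np.argsort(-sims)              # matching-degree ranking result
    kept = [int(i) for i in ranking if sims[i] > threshold]  # preset condition
    return kept[:n], sims

material_library = np.random.rand(1000, 512).astype(np.float32)   # placeholder features
candidates, scores = top_n_matches(np.random.rand(512).astype(np.float32), material_library)
```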
In an exemplary embodiment, the video segments to be inserted may include a plurality of video segments, and feeding them back to the user includes: acquiring preset feedback index information; ranking the plurality of video segments to be inserted according to the feedback index information to obtain a feedback ranking result; and feeding back the plurality of video segments to be inserted based on the feedback ranking result.
The feedback index information may include a plurality of specified indices, such as relevance and highlight degree.
In a specific implementation, there may be several video segments to be inserted; they may be ranked by recommendation degree according to the preset feedback index information to obtain the feedback ranking result, and the server may then feed the ranked segments back to the user side.
With the technical solution of this embodiment, preset feedback index information is acquired, the plurality of video segments to be inserted are ranked according to it, and they are then fed back based on the ranking result, so that intelligent mixed-cut material is provided to the user and the edited video can be more natural and smooth.
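A small sketch of such a ranking step; the index names ("relevance", "highlight_degree") and their weights are assumptions drawn from the examples above, not values fixed by the disclosure:

```python
# Hedged sketch: rank candidate segments by a weighted sum of preset feedback indices.
def rank_candidates(candidates, weights=None):
    """candidates: list of dicts such as {"clip_id": "a", "relevance": 0.9, "highlight_degree": 0.7}."""
    weights = weights or {"relevance": 0.6, "highlight_degree": 0.4}
    def score(c):
        return sum(w * c.get(k, 0.0) for k, w in weights.items())
    return sorted(candidates, key=score, reverse=True)

feedback_ranking = rank_candidates([
    {"clip_id": "a", "relevance": 0.92, "highlight_degree": 0.40},
    {"clip_id": "b", "relevance": 0.80, "highlight_degree": 0.95},
])
```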
In an exemplary embodiment, after the step of feeding back the video segments to be inserted to the user, the method further includes: determining a target insertion video segment from the plurality of video segments to be inserted according to insertion selection information returned by the user; and inserting the target insertion video segment before or after the clipping point of the original video.
In practical application, after the video segments to be inserted are fed back to the user, the target insertion video segment can be determined from them according to the insertion selection information returned by the user, and can then be inserted before or after the clipping point of the original video. For example, the target insertion video segment can be determined according to the user's selection among the ranked candidate segments and spliced into the original video to obtain the mixed-cut video.
In an example, when the target video segment whose following content is to be predicted is the segment of the original video with a preset duration before the clipping point, the target insertion video segment may be inserted after the clipping point of the original video; when the target video segment is the segment with a preset duration after the clipping point, the target insertion video segment may be inserted before the clipping point of the original video.
With the technical solution of this embodiment, the target insertion video segment is determined from the candidate segments according to the insertion selection information returned by the user and inserted before or after the clipping point of the original video, so that intelligent mixed cutting can be performed based on the user's selection and the joining of video segments is handled intelligently, making the edited video more natural and smooth.
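As an illustration of the final splice (the disclosure does not name a tool; moviepy is used here only as one possible choice, and the file names are hypothetical):

```python
# Hedged sketch: insert the chosen segment at the clipping point of the original video.
from moviepy.editor import VideoFileClip, concatenate_videoclips

def splice_at_clipping_point(original_path: str, insert_path: str,
                             clip_point_s: float, out_path: str) -> None:
    original = VideoFileClip(original_path)
    insert = VideoFileClip(insert_path)
    head = original.subclip(0, clip_point_s)   # content before the clipping point
    tail = original.subclip(clip_point_s)      # content after the clipping point
    concatenate_videoclips([head, insert, tail]).write_videofile(out_path)

splice_at_clipping_point("original.mp4", "target_insert.mp4", clip_point_s=42.0,
                         out_path="mixed_cut.mp4")
```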
For ease of understanding, Fig. 3 provides an exemplary process flow for intelligent video clip editing. As shown in Fig. 3, a user may specify an insertion time position p (i.e., a clipping point) of a base video (i.e., the original video) on the user side. The server extracts the video corresponding to the time interval [t_p - n, t_p] from the existing video (i.e., the target video segment), performs multidimensional feature extraction on it, and generates a candidate set Y_p (i.e., the predicted video content features) by means of a generative deep learning model (i.e., the content feature prediction model). For each element y in the candidate set Y_p, a search is performed over the candidate clip videos (i.e., the set of video material segments) to obtain a set Y_y of video segments to be inserted. The set Y_y is then ranked according to the specified indices and fed back to the user according to the ranking result.
Fig. 4 is a flowchart illustrating another video editing method according to an exemplary embodiment. As shown in Fig. 4, the method is used in the server 120 of Fig. 1 and includes the following steps.
In step S410, training sample data is acquired; the training sample data includes a plurality of video segment pairs; each video clip pair comprises a first video clip and a second video clip belonging to the same sample video; the first video clip is a video clip with a preset duration before a video key point in the sample video; the second video clip is a video clip with a preset duration after the video key point in the sample video;
In a specific implementation, before acquiring the selection instruction for the clipping point of the original video and extracting the target video segment, the server also needs to train the content feature prediction model, and therefore acquires training sample data. The training sample data may include a plurality of video segment pairs; each pair may include a first video segment and a second video segment belonging to the same sample video, the first video segment being a segment of a preset duration before a video key point in the sample video and the second video segment being a segment of a preset duration after that key point.
In an example, the content feature prediction model may be a generative deep learning model, which may employ VAEs, GANs and their variants, recurrent neural networks such as bidirectional RNNs, deep (bidirectional) RNNs and LSTM, convolutional neural networks (CNN), and the like.
In step S420, training the content feature prediction model to be trained by using training sample data, so as to obtain the content feature prediction model.
In practical application, the server may use training sample data to train the content feature prediction model to be trained to obtain a content feature prediction model, and specifically, may train the content feature prediction model to be trained based on the first video segment and the second video segment of each video segment pair to obtain the content feature prediction model.
With the technical solution of this embodiment, training sample data is acquired and used to train the content feature prediction model to be trained, yielding the content feature prediction model. Video content prediction can then be performed with the pre-trained model, video editing is optimized, and the joining of edited video segments is handled intelligently.
In an exemplary embodiment, in the case where the target video segment is a video segment of the original video with a preset duration before the clipping point, training the content feature prediction model to be trained using the training sample data to obtain the content feature prediction model includes:
inputting video content features corresponding to the first video segment into the content feature prediction model to be trained to obtain predicted video content features corresponding to the first video segment;
and adjusting model parameters of the content feature prediction model to be trained based on the difference between the predicted video content features corresponding to the first video segment and the video content features corresponding to the second video segment, until the adjusted content feature prediction model satisfies a preset training condition, to obtain the content feature prediction model.
In the case where the target video segment is a video segment of the original video with a preset duration after the clipping point, training the content feature prediction model to be trained using the training sample data to obtain the content feature prediction model includes:
inputting video content features corresponding to the second video segment into the content feature prediction model to be trained to obtain predicted video content features corresponding to the second video segment;
and adjusting model parameters of the content feature prediction model to be trained based on the difference between the predicted video content features corresponding to the second video segment and the video content features corresponding to the first video segment, until the adjusted content feature prediction model satisfies a preset training condition, to obtain the content feature prediction model.
In a specific implementation, if the target video segment is a segment of the original video with a preset duration before the clipping point, then during model training the video content features corresponding to the first video segment are input into the content feature prediction model to be trained to obtain predicted video content features corresponding to the first video segment, and the model parameters are adjusted based on the difference between those predicted features and the video content features corresponding to the second video segment, until the adjusted model satisfies the preset training condition, at which point the content feature prediction model is obtained.
If the target video segment is a segment of the original video with a preset duration after the clipping point, then during model training the video content features corresponding to the second video segment are input into the content feature prediction model to be trained to obtain predicted video content features corresponding to the second video segment, and the model parameters are adjusted based on the difference between those predicted features and the video content features corresponding to the first video segment, until the adjusted model satisfies the preset training condition, at which point the content feature prediction model is obtained.
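A hedged sketch of one such training step, reusing the predictor sketched earlier and using mean-squared error as the "difference" between predicted and target feature sequences; the loss function and the assumption that both segments yield sequences of equal length are choices made here, not by the disclosure.

```python
# Hedged sketch: one optimization step for either training direction.
import torch
import torch.nn.functional as F

def train_step(model, optimizer, first_feats, second_feats, target_before_clip_point=True):
    """first_feats / second_feats: (batch, frames, feat_dim) features of the segments
    before / after the video key point; both are assumed to have the same length here.
    target_before_clip_point=True trains the model used when the target segment lies
    before the clipping point (first -> second); False trains the reverse direction."""
    src, tgt = (first_feats, second_feats) if target_before_clip_point else (second_feats, first_feats)
    predicted = model(src)
    loss = F.mse_loss(predicted, tgt)   # difference between predicted and actual features
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```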
With the technical solution of this embodiment, the model is trained in the direction matching the target video segment: when the target segment lies before the clipping point, the features of the first video segment are used to predict those of the second; when it lies after the clipping point, the features of the second video segment are used to predict those of the first. In each case the model parameters are adjusted based on the difference between predicted and actual features until the preset training condition is met. Video content prediction can thus be performed effectively for segments either before or after the clipping point, improving the video editing effect.
In an exemplary embodiment, after the step of acquiring training sample data, further comprising: aiming at each image content characteristic dimension, adjusting each image frame in the first video segment and the second video segment of each video segment pair according to an image preprocessing mode corresponding to the image content characteristic dimension to obtain an adjusted image frame; extracting image features of the adjusted image frames to obtain a plurality of image feature vectors; splicing the plurality of image feature vectors to obtain video feature vectors corresponding to the first video segment and the second video segment respectively; the video feature vector is used for representing video content features corresponding to the first video segment and the second video segment respectively.
In a specific implementation, because video content and video quality are inconsistent between mixed and sheared videos, in order to enhance generalization capability of a content feature prediction model, after training sample data is acquired, each video segment in the training sample data can be preprocessed on a corresponding picture sequence from multiple dimensions, each image frame in a first video segment and a second video segment of each video segment pair is adjusted according to an image preprocessing mode corresponding to each image content feature dimension by aiming at each image content feature dimension, an adjusted image frame is obtained, then image feature extraction is carried out on the adjusted image frame, and multiple image feature vectors are obtained, and then the multiple image feature vectors can be spliced to obtain the video feature vectors corresponding to each first video segment and each second video segment.
In an example, the image feature extraction process may be as follows: the video clip is converted into a picture sequence, and a convolutional neural network is then used to extract image features from each picture in the picture sequence, yielding one image feature vector per picture. The image feature vectors corresponding to the pictures are spliced to obtain the video feature vector corresponding to the video clip, for example a feature vector sequence. A sketch of this step is shown below.
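As an illustration of this step, the sketch below uses a torchvision ResNet-18 backbone as the convolutional neural network; the choice of network, input size and normalization are assumptions for illustration only, since the embodiment does not mandate a particular CNN.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()          # keep the 512-d pooled feature per frame
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def clip_to_feature_sequence(frames):
    """frames: list of HxWx3 uint8 RGB arrays (the picture sequence of one clip).
    Returns a (num_frames, 512) tensor, i.e. the spliced per-frame feature vectors."""
    batch = torch.stack([preprocess(f) for f in frames])
    return backbone(batch)                 # one image feature vector per picture
```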
For example, the multiple dimensions may include whether the background is retained (with background, without background), whether picture colour is ignored (yes, no), whether only persons are retained (yes, no), and whether only moving objects are retained (yes, no); retaining and removing the background may be treated as two separate dimensions. For each dimension, feature extraction may be performed on the video segment pair to obtain the video content features, such as feature vector sequences, corresponding to the first video segment and the second video segment in the pair, and the content feature prediction model can then be trained on these features.
As shown in fig. 5a, for dimension 1, the dimension-1 input feature data (i.e., the plurality of image feature vectors corresponding to the first video segment) may be spliced based on the first video segment in the video segment pair, such as the video segment of a preset duration before a video key point in the sample video, to obtain the video content features corresponding to the first video segment; likewise, the dimension-1 output feature data (i.e., the plurality of image feature vectors corresponding to the second video segment) may be spliced based on the second video segment in the pair, such as the video segment of a preset duration after the video key point in the sample video, to obtain the video content features corresponding to the second video segment. The generative deep learning model (i.e., the content feature prediction model to be trained) can then be trained on the video content features corresponding to the first video segment and the second video segment.
According to the technical solution of this embodiment, each image frame in the first video segment and the second video segment of each video segment pair is adjusted according to the image preprocessing mode corresponding to each image content feature dimension to obtain adjusted image frames; image feature extraction is performed on the adjusted image frames to obtain a plurality of image feature vectors; and the plurality of image feature vectors are spliced to obtain the video feature vectors corresponding to the first video segment and the second video segment respectively. Model training can therefore be carried out across multiple image content feature dimensions, which enhances the generalization capability of the content feature prediction model. A preprocessing sketch follows.
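A sketch of the per-dimension preprocessing is given below. Only the "ignore picture colour" dimension is implemented concretely (grayscale conversion with OpenCV); background removal and person/moving-object isolation are left as hypothetical hooks, since this embodiment does not fix a particular segmentation algorithm.

```python
import cv2
import numpy as np

def ignore_colour(frame: np.ndarray) -> np.ndarray:
    """Drop colour information but keep 3 channels so the same CNN can be reused."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)

def foreground_only(frame: np.ndarray) -> np.ndarray:
    """Hypothetical hook: remove the background, e.g. with a segmentation model."""
    raise NotImplementedError("segmentation backend is an assumption, not specified here")

# One preprocessing mode per image content feature dimension.
PREPROCESS_BY_DIMENSION = {
    "with_background": lambda f: f,        # frame kept unchanged
    "ignore_colour": ignore_colour,
    "foreground_only": foreground_only,
}

def preprocess_clip(frames, dimension: str):
    """Apply the preprocessing mode of one dimension to every frame of a clip."""
    fn = PREPROCESS_BY_DIMENSION[dimension]
    return [fn(frame) for frame in frames]
```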
In an exemplary embodiment, obtaining training sample data includes: acquiring a video highlight set of a sample video; for each video highlight, determining a first video segment of a preset duration before the video highlight in the sample video and a second video segment of a preset duration after the video highlight in the sample video; and obtaining a video clip pair corresponding to the video highlight according to the first video clip and the second video clip.
In a specific implementation, a video highlight point set of a sample video is obtained, then for each video highlight point, a first video segment with a preset duration before the video highlight point in the sample video and a second video segment with a preset duration after the video highlight point in the sample video are determined, and further, a video segment pair corresponding to the video highlight point can be obtained according to the first video segment and the second video segment.
According to this technical solution, the video highlight point set of the sample video is obtained; then, for each video highlight point, a first video segment of a preset duration before the video highlight point in the sample video and a second video segment of a preset duration after the video highlight point in the sample video are determined; and the video segment pair corresponding to the video highlight point is obtained from the first video segment and the second video segment. The video segments used for training can thus be obtained accurately based on the video highlight points, providing data support for model training.
For ease of understanding by those skilled in the art, fig. 5b provides an exemplary process flow diagram for training data preparation and model training. As shown in fig. 5b, a key point set K (i.e., the set of video highlight points of a sample video) is extracted from an existing video (i.e., the sample video). For each key point k (i.e., video highlight point) in the set K, a video training pair <x_k, y_k> (i.e., a video clip pair) may be extracted from the existing video, where x_k is the video of the time interval [t_k - n, t_k] (i.e., the first video clip) and y_k is the video of the time interval [t_k, t_k + n] (i.e., the second video clip). Multi-dimensional feature extraction may then be performed on the training pair <x_k, y_k> to obtain the corresponding video feature vectors, which are used to train the generative deep learning model (i.e., the content feature prediction model to be trained). A sketch of the pair extraction follows.
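The sketch below illustrates the training-pair extraction of fig. 5b: for each highlight timestamp t_k, the n seconds before it are cut out as x_k and the n seconds after it as y_k. The use of ffmpeg, the output file names, and the default value of n are illustrative assumptions.

```python
import subprocess

def cut(src: str, start: float, duration: float, dst: str) -> None:
    """Cut [start, start + duration] out of src into dst with ffmpeg (stream copy)."""
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start), "-i", src,
         "-t", str(duration), "-c", "copy", dst],
        check=True,
    )

def extract_pairs(video_path: str, highlight_points, n: float = 5.0):
    """highlight_points: iterable of timestamps t_k in seconds.
    Returns (x_k, y_k) file pairs: n seconds before / after each highlight point."""
    pairs = []
    for k, t_k in enumerate(highlight_points):
        x_k = f"pair{k}_before.mp4"        # video of [t_k - n, t_k]
        y_k = f"pair{k}_after.mp4"         # video of [t_k, t_k + n]
        cut(video_path, max(t_k - n, 0.0), n, x_k)
        cut(video_path, t_k, n, y_k)
        pairs.append((x_k, y_k))
    return pairs
```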
In an exemplary embodiment, obtaining a set of video highlights for a sample video includes: acquiring preset highlight extraction information; the highlight point extraction information is used for identifying the highlight point of the video according to picture information, sound information and text information in the video; and according to the highlight point extraction information, determining a plurality of video highlight points from the sample video to obtain a video highlight point set of the sample video.
Here, a video highlight point may be the time center point of a highlight segment in the video.
In a specific implementation, the preset highlight point extraction information is obtained, so that a plurality of video highlight points can be identified from the sample video according to the picture information, sound information and text information in the video, thereby obtaining the video highlight point set of the sample video.
For example, because a short video only lasts a short time, the most exciting part of the video needs to be found in order to attract users. The following methods may be used to extract video highlight points:
1. Identify video highlight points by training a visual recognition model. Taking a football match as an example, the video highlight points may be the time points corresponding to video pictures that contain a shot on goal, a goal, or a red or yellow card;
2. Identify video highlight points by training an acoustic recognition model. Taking a football match as an example, the parts where the sound loudness exceeds a threshold (for example, 1.5 times the average loudness of the whole audio track) may be confirmed as highlights, and the video highlight points may then be the time points at which the loudness exceeds the threshold (see the sketch after this list);
3. Convert the speech portion of the audio into text through ASR (Automatic Speech Recognition) technology, and then identify video highlight points by recognizing keywords in the text, such as 'goal', 'red card' and 'yellow card'.
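The following sketch illustrates the loudness-based extraction in item 2 above, marking the moments whose frame-wise loudness exceeds 1.5 times the mean loudness of the whole audio track. The choice of librosa and the hop length are assumptions, not requirements of this embodiment.

```python
import librosa
import numpy as np

def loudness_highlights(audio_path: str, ratio: float = 1.5, hop_length: int = 512):
    """Return candidate highlight timestamps (seconds) where the frame-wise
    loudness exceeds ratio x the mean loudness of the whole audio track."""
    y, sr = librosa.load(audio_path, sr=None)
    rms = librosa.feature.rms(y=y, hop_length=hop_length)[0]   # frame-wise loudness
    threshold = ratio * rms.mean()
    frames = np.where(rms > threshold)[0]
    return librosa.frames_to_time(frames, sr=sr, hop_length=hop_length).tolist()
```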
According to this technical solution, the preset highlight point extraction information is obtained, and a plurality of video highlight points are determined from the sample video according to it, yielding the video highlight point set of the sample video. Video highlight points can thus be determined for the exciting segments in a video, which facilitates the user's video editing operations.
Fig. 6 is a flowchart illustrating another video editing method according to an exemplary embodiment. As shown in fig. 6, the method is used in the server 120 of fig. 1 and includes the following steps.
In step S601, training sample data is acquired; the training sample data includes a plurality of video segment pairs, each video segment pair includes a first video segment and a second video segment that belong to the same sample video, the first video segment is a video segment of a preset duration before a video key point in the sample video, and the second video segment is a video segment of a preset duration after the video key point in the sample video.

In step S602, for each image content feature dimension, each image frame in the first video segment and the second video segment of each video segment pair is adjusted according to the image preprocessing mode corresponding to that image content feature dimension, to obtain adjusted image frames.

In step S603, image feature extraction is performed on the adjusted image frames to obtain a plurality of image feature vectors.

In step S604, the plurality of image feature vectors are spliced to obtain the video feature vectors corresponding to the first video segment and the second video segment respectively; the video feature vectors are used to represent the video content features corresponding to the first video segment and the second video segment respectively.

In step S605, the content feature prediction model to be trained is trained using the training sample data, to obtain the content feature prediction model.

In step S606, a selection instruction for a clipping point of an original video is acquired, and a target video clip is extracted from the original video; the target video clip is a video clip of the original video of a preset duration before or after the clipping point.

In step S607, the video content features corresponding to the target video clip are input into the content feature prediction model to obtain predicted video content features.

In step S608, a video clip to be inserted is determined from the video material clip set according to the predicted video content features; the matching degree between the video content features corresponding to the video clip to be inserted and the predicted video content features meets the preset condition.

In step S609, the video clip to be inserted is fed back to the user.

In step S610, a target insertion video clip is determined from the plurality of video clips to be inserted according to the insertion selection information returned by the user.

In step S611, the target insertion video clip is inserted before or after the clipping point of the original video.

It should be noted that, for the specific limitations of the above steps, reference may be made to the specific limitations of the video editing method described above, which are not repeated here. A sketch of the matching and feedback steps follows.
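To make the matching steps concrete, the sketch below illustrates S607 and S608, using cosine similarity as one possible matching degree. The function names, the threshold value, and the use of cosine similarity are illustrative assumptions rather than limitations of the method; the predictor and the material library features are passed in as parameters.

```python
import numpy as np

def matching_degree(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two flattened feature vectors."""
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def recommend_clips(target_features: np.ndarray, predictor, material_library,
                    threshold: float = 0.8, top_k: int = 3):
    """S607/S608: predict the missing side of the clipping point, then rank the
    material clips whose matching degree exceeds the preset threshold.
    material_library is a list of (clip_id, features) for the video material set;
    the returned candidates are what is fed back to the user in S609."""
    predicted = predictor(target_features)                       # S607
    scored = [(clip_id, matching_degree(feats, predicted))       # S608
              for clip_id, feats in material_library]
    hits = [item for item in scored if item[1] > threshold]      # preset condition
    return sorted(hits, key=lambda x: x[1], reverse=True)[:top_k]
```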
It should be understood that, although the steps in the flowcharts of fig. 2, 4, and 6 are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 2, 4, and 6 may include a plurality of steps or stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of the steps or stages is not necessarily sequential, but may be performed in turn or alternately with at least some of the other steps or stages.
Fig. 7 is a block diagram of a video editing apparatus according to an exemplary embodiment. Referring to fig. 7, the apparatus includes:

an obtaining unit 701 configured to obtain a selection instruction for a clipping point of an original video and to extract a target video clip from the original video; the target video clip is a video clip of the original video of a preset duration before or after the clipping point;

a prediction unit 702 configured to input the video content features corresponding to the target video clip into a content feature prediction model, to obtain predicted video content features;

a video clip matching unit 703 configured to determine a video clip to be inserted from a video material clip set according to the predicted video content features; the matching degree between the video content features corresponding to the video clip to be inserted and the predicted video content features meets the preset condition;

and a feedback unit 704 configured to feed back the video clip to be inserted to a user, for inserting the video clip to be inserted at the clipping point of the original video.
In one possible implementation, the video clip matching unit 703 is specifically configured to determine matching degree ranking results between a plurality of video content features and the predicted video content features, based on the video content features corresponding to each of a plurality of video material clips in the video material clip set; when a matching degree is greater than a preset threshold, determine that the matching degree between the corresponding video content features and the predicted video content features meets the preset condition; and take the video material clip corresponding to those video content features as a video clip to be inserted.
In one possible implementation, there are a plurality of video clips to be inserted; the feedback unit 704 is specifically configured to acquire preset feedback index information; sequence the plurality of video clips to be inserted according to the feedback index information to obtain a feedback sequencing result; and feed back the plurality of video clips to be inserted based on the feedback sequencing result.
In one possible implementation, the apparatus further includes:
A target insertion video clip determining unit configured to perform determination of a target insertion video clip from the plurality of video clips to be inserted according to insertion selection information returned by a user;
A target insertion video clip insertion unit configured to perform insertion of the target insertion video clip before or after a clip point of the original video.
In one possible implementation, the apparatus further includes:
A training sample data acquisition unit configured to perform acquisition of training sample data; the training sample data includes a plurality of video segment pairs; each video segment pair comprises a first video segment and a second video segment which belong to the same sample video; the first video segment is a video segment with a preset duration before a video key point in the sample video; the second video segment is a video segment with a preset duration after a video key point in the sample video;
and the model training unit is configured to train the content characteristic prediction model to be trained by adopting the training sample data to obtain the content characteristic prediction model.
In one possible implementation manner, when the target video segment is a video segment of the original video, which is of a preset duration before the clipping point, the model training unit is specifically configured to perform inputting a video content feature corresponding to the first video segment into a content feature prediction model to be trained, so as to obtain a predicted video content feature corresponding to the first video segment; based on the difference between the predicted video content characteristics corresponding to the first video segment and the video content characteristics corresponding to the second video segment, adjusting model parameters of a content characteristic prediction model to be trained until the adjusted content characteristic prediction model meets preset training conditions, and obtaining the content characteristic prediction model;
When the target video clip is a video clip of a preset duration after the clipping point in the original video, the model training unit is specifically configured to input the video content features corresponding to the second video segment into the content feature prediction model to be trained, so as to obtain predicted video content features corresponding to the second video segment; and to adjust the model parameters of the content feature prediction model to be trained based on the difference between the predicted video content features corresponding to the second video segment and the video content features corresponding to the first video segment, until the adjusted content feature prediction model meets the preset training conditions, so as to obtain the content feature prediction model.
In one possible implementation, the apparatus further includes:
an image preprocessing unit configured to, for each image content feature dimension, adjust each image frame in the first video segment and the second video segment of each video segment pair according to the image preprocessing mode corresponding to that image content feature dimension, to obtain adjusted image frames;
An image feature extraction unit configured to perform image feature extraction on the adjusted image frame to obtain a plurality of image feature vectors;
The splicing unit is configured to splice the plurality of image feature vectors to obtain video feature vectors corresponding to the first video segment and the second video segment respectively; the video feature vector is used for representing video content features corresponding to the first video segment and the second video segment respectively.
In one possible implementation manner, the training sample data obtaining unit is specifically configured to perform obtaining a video highlight set of the sample video; determining, for each video highlight, a first video segment of a preset duration before the video highlight in the sample video and a second video segment of a preset duration after the video highlight in the sample video; and obtaining a video fragment pair corresponding to the video highlight according to the first video fragment and the second video fragment.
In one possible implementation manner, the training sample data obtaining unit is specifically configured to obtain preset highlight extraction information; the highlight point extraction information is used for identifying video highlight points according to picture information, sound information and text information in the video; and according to the highlight point extraction information, determining a plurality of video highlight points from the sample video to obtain a video highlight point set of the sample video.
The specific manner in which the various modules of the apparatus in the above embodiments perform operations has been described in detail in the embodiments of the method, and will not be described in detail here.
Fig. 8 is a block diagram illustrating an electronic device 800 for performing a video editing method according to an exemplary embodiment. For example, the electronic device 800 may be a server. Referring to fig. 8, the electronic device 800 includes a processing component 820, which further includes one or more processors, and memory resources represented by a memory 822 for storing instructions, such as application programs, executable by the processing component 820. The application programs stored in the memory 822 may include one or more modules, each corresponding to a set of instructions. Further, the processing component 820 is configured to execute the instructions to perform the methods described above.
The electronic device 800 may further include: a power component 824 configured to perform power management of the electronic device 800, a wired or wireless network interface 826 configured to connect the electronic device 800 to a network, and an input/output (I/O) interface 828. The electronic device 800 may operate based on an operating system stored in the memory 822, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, or the like.
In an exemplary embodiment, a computer-readable storage medium is also provided, such as the memory 822 including instructions executable by a processor of the electronic device 800 to perform the above method. The storage medium may be a computer-readable storage medium, which may be, for example, a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer program product is also provided, including instructions executable by a processor of the electronic device 800 to perform the above method.
It should be noted that the descriptions of the foregoing apparatus, the electronic device, the computer readable storage medium, the computer program product, and the like according to the method embodiments may further include other implementations, and the specific implementation may refer to the descriptions of the related method embodiments and are not described herein in detail.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following the general principles thereof and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (20)

1. A method of video editing, the method comprising:
acquiring a selection instruction of a clipping point of an original video, and extracting a target video fragment from the original video; the target video clip is a video clip of the original video, which is before the clipping point or after the clipping point and has a preset duration;
Inputting the video content characteristics corresponding to the target video segments into a content characteristic prediction model to obtain predicted video content characteristics; the video content represented by the predicted video content features is linked with the video content represented by the video content features corresponding to the target video segment;
determining a video segment to be inserted from a video material segment set according to the predicted video content characteristics; the matching degree between the video content characteristics corresponding to the video clips to be inserted and the predicted video content characteristics meets the preset condition;
And feeding back the video clips to be inserted to a user for inserting the video clips to be inserted into the clipping points of the original video.
2. The method of claim 1, wherein said determining a video segment to be inserted from a set of video material segments based on said predicted video content characteristics comprises:
Determining matching degree sequencing results between a plurality of video content features and the predicted video content features based on video content features corresponding to a plurality of video material fragments in a video material fragment set;
When the matching degree is larger than a preset threshold value, judging that the matching degree between the video content characteristics and the predicted video content characteristics meets a preset condition;
and taking the video material fragments corresponding to the video content characteristics as video fragments to be inserted.
3. The method of claim 1, wherein the video clip to be inserted comprises a plurality of video clips, and wherein the feeding back the video clip to be inserted to the user comprises:
acquiring preset feedback index information;
Sequencing the video clips to be inserted according to the feedback index information to obtain a feedback sequencing result;
and feeding back the plurality of video clips to be inserted based on the feedback sequencing result.
4. A method according to claim 3, further comprising, after the step of feeding back the video clip to be inserted to the user:
determining a target inserted video segment from the plurality of video segments to be inserted according to the insertion selection information returned by the user;
inserting the target insert video segment before or after the point of clipping of the original video.
5. The method according to any one of claims 1 to 4, further comprising:
Acquiring training sample data; the training sample data includes a plurality of video segment pairs; each video segment pair comprises a first video segment and a second video segment which belong to the same sample video; the first video segment is a video segment with a preset duration before a video key point in the sample video; the second video segment is a video segment with a preset duration after a video key point in the sample video;
and training the content characteristic prediction model to be trained by adopting the training sample data to obtain the content characteristic prediction model.
6. The method of claim 5, wherein, when the target video clip is a video clip of the original video of a preset duration before the clipping point, training the content feature prediction model to be trained by adopting the training sample data to obtain the content feature prediction model comprises:
Inputting the video content characteristics corresponding to the first video segment into a content characteristic prediction model to be trained to obtain predicted video content characteristics corresponding to the first video segment;
Based on the difference between the predicted video content characteristics corresponding to the first video segment and the video content characteristics corresponding to the second video segment, adjusting model parameters of a content characteristic prediction model to be trained until the adjusted content characteristic prediction model meets preset training conditions, and obtaining the content characteristic prediction model;
and when the target video clip is a video clip of a preset duration after the clipping point in the original video, training the content feature prediction model to be trained by adopting the training sample data to obtain the content feature prediction model comprises:
inputting the video content characteristics corresponding to the second video segment into a content characteristic prediction model to be trained to obtain predicted video content characteristics corresponding to the second video segment;
And adjusting model parameters of a content feature prediction model to be trained based on the difference between the predicted video content feature corresponding to the second video segment and the video content feature corresponding to the first video segment until the adjusted content feature prediction model meets preset training conditions, so as to obtain the content feature prediction model.
7. The method of claim 5, further comprising, after the step of obtaining training sample data:
Aiming at each image content characteristic dimension, adjusting each image frame in a first video segment and a second video segment of each video segment pair according to an image preprocessing mode corresponding to the image content characteristic dimension to obtain an adjusted image frame;
extracting image features of the adjusted image frames to obtain a plurality of image feature vectors;
splicing the plurality of image feature vectors to obtain video feature vectors corresponding to the first video segment and the second video segment respectively; the video feature vector is used for representing video content features corresponding to the first video segment and the second video segment respectively.
8. The method of claim 5, wherein the acquiring training sample data comprises:
Acquiring a video highlight set of a sample video;
Determining, for each video highlight, a first video segment of a preset duration before the video highlight in the sample video and a second video segment of a preset duration after the video highlight in the sample video;
and obtaining a video fragment pair corresponding to the video highlight according to the first video fragment and the second video fragment.
9. The method of claim 8, wherein the obtaining a set of video highlights of the sample video comprises:
acquiring preset highlight extraction information; the highlight point extraction information is used for identifying video highlight points according to picture information, sound information and text information in the video;
And according to the highlight point extraction information, determining a plurality of video highlight points from the sample video to obtain a video highlight point set of the sample video.
10. A video editing apparatus, comprising:
an acquisition unit configured to execute a selection instruction for acquiring a clip point of an original video from which a target video clip is extracted; the target video clip is a video clip of the original video, which is before the clipping point or after the clipping point and has a preset duration;
The prediction unit is configured to input video content characteristics corresponding to the target video clips into a content characteristic prediction model to obtain predicted video content characteristics; the video content represented by the predicted video content features is linked with the video content represented by the video content features corresponding to the target video segment;
the video segment matching unit is configured to determine a video segment to be inserted from a video material segment set according to the predicted video content characteristics; the matching degree between the video content characteristics corresponding to the video clips to be inserted and the predicted video content characteristics meets the preset condition;
and the feedback unit is configured to perform feedback of the video clips to be inserted to a user for inserting the video clips to be inserted into the clipping points of the original video.
11. The apparatus according to claim 10, wherein the video clip matching unit is specifically configured to perform determining a matching degree ranking result between a plurality of the video content features and the predicted video content features based on video content features corresponding to each of a plurality of video material clips in a video material clip set; when the matching degree is larger than a preset threshold value, judging that the matching degree between the video content characteristics and the predicted video content characteristics meets a preset condition; and taking the video material fragments corresponding to the video content characteristics as video fragments to be inserted.
12. The apparatus according to claim 10, wherein the video clips to be inserted include a plurality of video clips, and the feedback unit is specifically configured to perform acquiring preset feedback index information; sequencing the plurality of video clips to be inserted according to the feedback index information to obtain a feedback sequencing result; and feeding back the plurality of video clips to be inserted based on the feedback sequencing result.
13. The apparatus of claim 12, wherein the apparatus further comprises:
A target insertion video clip determining unit configured to perform determination of a target insertion video clip from the plurality of video clips to be inserted according to insertion selection information returned by a user;
A target insertion video clip insertion unit configured to perform insertion of the target insertion video clip before or after a clip point of the original video.
14. The apparatus according to any one of claims 10 to 13, further comprising:
A training sample data acquisition unit configured to perform acquisition of training sample data; the training sample data includes a plurality of video segment pairs; each video segment pair comprises a first video segment and a second video segment which belong to the same sample video; the first video segment is a video segment with a preset duration before a video key point in the sample video; the second video segment is a video segment with a preset duration after a video key point in the sample video;
and the model training unit is configured to train the content characteristic prediction model to be trained by adopting the training sample data to obtain the content characteristic prediction model.
15. The apparatus according to claim 14, wherein, in the case that the target video clip is a video clip of the original video that is a preset duration before the clip point, the model training unit is specifically configured to perform inputting a video content feature corresponding to the first video clip into a content feature prediction model to be trained, so as to obtain a predicted video content feature corresponding to the first video clip; based on the difference between the predicted video content characteristics corresponding to the first video segment and the video content characteristics corresponding to the second video segment, adjusting model parameters of a content characteristic prediction model to be trained until the adjusted content characteristic prediction model meets preset training conditions, and obtaining the content characteristic prediction model;
When the target video clip is a video clip of a preset duration after the clipping point in the original video, the model training unit is specifically configured to input video content features corresponding to the second video segment into a content feature prediction model to be trained, so as to obtain predicted video content features corresponding to the second video segment; and adjust model parameters of the content feature prediction model to be trained based on the difference between the predicted video content features corresponding to the second video segment and the video content features corresponding to the first video segment until the adjusted content feature prediction model meets preset training conditions, so as to obtain the content feature prediction model.
16. The apparatus of claim 14, wherein the apparatus further comprises:
the image preprocessing unit is configured to execute image preprocessing modes corresponding to each image content characteristic dimension, and adjust each image frame in the first video segment and the second video segment of each video segment pair according to the image preprocessing modes corresponding to the image content characteristic dimension to obtain adjusted image frames;
An image feature extraction unit configured to perform image feature extraction on the adjusted image frame to obtain a plurality of image feature vectors;
The splicing unit is configured to splice the plurality of image feature vectors to obtain video feature vectors corresponding to the first video segment and the second video segment respectively; the video feature vector is used for representing video content features corresponding to the first video segment and the second video segment respectively.
17. The apparatus according to claim 14, wherein the training sample data acquisition unit is specifically configured to perform acquiring a set of video highlights of a sample video; determining, for each video highlight, a first video segment of a preset duration before the video highlight in the sample video and a second video segment of a preset duration after the video highlight in the sample video; and obtaining a video fragment pair corresponding to the video highlight according to the first video fragment and the second video fragment.
18. The apparatus according to claim 17, wherein the training sample data acquisition unit is specifically configured to acquire preset highlight extraction information; the highlight point extraction information is used for identifying video highlight points according to picture information, sound information and text information in the video; and according to the highlight point extraction information, determining a plurality of video highlight points from the sample video to obtain a video highlight point set of the sample video.
19. An electronic device, comprising:
A processor;
A memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the video editing method of any one of claims 1 to 9.
20. A computer-readable storage medium storing instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the video editing method of any one of claims 1 to 9.
CN202111211990.XA 2021-10-18 2021-10-18 Video editing method, device, electronic equipment and storage medium Active CN113949828B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111211990.XA CN113949828B (en) 2021-10-18 2021-10-18 Video editing method, device, electronic equipment and storage medium
PCT/CN2022/094576 WO2023065663A1 (en) 2021-10-18 2022-05-23 Video editing method and apparatus, and electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111211990.XA CN113949828B (en) 2021-10-18 2021-10-18 Video editing method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113949828A CN113949828A (en) 2022-01-18
CN113949828B true CN113949828B (en) 2024-04-30

Family

ID=79331391

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111211990.XA Active CN113949828B (en) 2021-10-18 2021-10-18 Video editing method, device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN113949828B (en)
WO (1) WO2023065663A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113949828B (en) * 2021-10-18 2024-04-30 北京达佳互联信息技术有限公司 Video editing method, device, electronic equipment and storage medium
CN117278801B (en) * 2023-10-11 2024-03-22 广州智威智能科技有限公司 AI algorithm-based student activity highlight instant shooting and analyzing method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101714155A (en) * 2008-10-07 2010-05-26 汤姆森特许公司 Method for inserting an advertising clip into a video sequence and corresponding device
CN111708915A (en) * 2020-06-12 2020-09-25 腾讯科技(深圳)有限公司 Content recommendation method and device, computer equipment and storage medium
CN111726685A (en) * 2020-06-28 2020-09-29 百度在线网络技术(北京)有限公司 Video processing method, video processing device, electronic equipment and medium
CN111988638A (en) * 2020-08-19 2020-11-24 北京字节跳动网络技术有限公司 Method and device for acquiring spliced video, electronic equipment and storage medium
WO2021104242A1 (en) * 2019-11-26 2021-06-03 Oppo广东移动通信有限公司 Video processing method, electronic device, and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8238718B2 (en) * 2002-06-19 2012-08-07 Microsoft Corporation System and method for automatically generating video cliplets from digital video
US9554093B2 (en) * 2006-02-27 2017-01-24 Microsoft Technology Licensing, Llc Automatically inserting advertisements into source video content playback streams
CN102543136B (en) * 2012-02-17 2015-05-20 广州盈可视电子科技有限公司 Method and device for clipping video
CN111246246A (en) * 2018-11-28 2020-06-05 华为技术有限公司 Video playing method and device
CN113949828B (en) * 2021-10-18 2024-04-30 北京达佳互联信息技术有限公司 Video editing method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2023065663A1 (en) 2023-04-27
CN113949828A (en) 2022-01-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant