CN111918094B - Video processing method and device, electronic equipment and storage medium - Google Patents

Video processing method and device, electronic equipment and storage medium

Info

Publication number
CN111918094B
CN111918094B (application CN202010602733.8A)
Authority
CN
China
Prior art keywords
video
feature
candidate
music
feature information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010602733.8A
Other languages
Chinese (zh)
Other versions
CN111918094A (en)
Inventor
薛学通
任晖
杨敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010602733.8A
Publication of CN111918094A
Application granted
Publication of CN111918094B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23418 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43 Querying
    • G06F16/432 Query formulation
    • G06F16/433 Query formulation using audio data
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233 Processing of audio elementary streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 Processing of audio elementary streams
    • H04N21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The application discloses a video processing method and apparatus, an electronic device, and a storage medium, relating to the technical fields of deep learning, artificial intelligence, computer vision, and natural language processing, and applicable to related scenarios in the video processing field. The scheme is as follows: acquire first feature information of a source video; acquire second feature information of each candidate music material; map the first feature information and the second feature information into a feature space to obtain the similarity between the source video and each candidate music material; select a target music material from the candidate music materials according to the similarity corresponding to each candidate music material; and load the target music material into the source video to generate a target video. By mapping the two types of feature information into a common feature space, a better-matched target music material can be added to the source video without manual intervention, which improves the accuracy and efficiency of video processing and saves labor cost.

Description

Video processing method and device, electronic equipment and storage medium
Technical Field
Embodiments of the present application relate generally to the field of image processing technology, and more particularly, to the fields of deep learning, artificial intelligence, computer vision, and natural language processing technology.
Background
In recent years, with the rapid development of video processing technology, adding well-matched background music to a video has become an important step in the video processing pipeline. Accurately adding matched background music to a video brings convenience and entertainment to users. Therefore, improving the accuracy of this step has become an important research direction in video processing.
Disclosure of Invention
The application provides a video processing method, a video processing device, electronic equipment and a storage medium.
According to a first aspect, there is provided a video processing method comprising:
acquiring first characteristic information of a source video;
acquiring second characteristic information of each candidate music material;
mapping the first characteristic information and the second characteristic information in a characteristic space respectively to obtain the similarity of the source video and the candidate music material;
selecting a target music material from the candidate music materials according to the similarity corresponding to each candidate music material; and
loading the target musical material in the source video to generate a target video.
According to a second aspect, there is provided a video processing apparatus comprising:
the first acquisition module is used for acquiring first characteristic information of a source video;
the second acquisition module is used for acquiring second characteristic information of each candidate music material;
a similarity obtaining module, configured to map the first feature information and the second feature information in a feature space, respectively, so as to obtain a similarity between the source video and the candidate music material;
the material selection module is used for selecting a target music material from the candidate music materials according to the similarity corresponding to each candidate music material; and
and the generating module is used for loading the target music material in the source video so as to generate a target video.
According to a third aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the video processing method of the first aspect of the present application.
According to a fourth aspect, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the video processing method of the first aspect of the present application.
According to a fifth aspect, there is provided a computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the video processing method of the first aspect of the application.
The embodiment provided by the application at least has the following beneficial technical effects:
according to the video processing method, the first characteristic information of the source video and the second characteristic information of each candidate music material are obtained, the similarity between the source video and the candidate music materials is obtained according to the obtained first characteristic information and the obtained second characteristic information, then the target music materials are selected according to the obtained similarity corresponding to each candidate music material, and further the target music materials are loaded in the source video to generate the target video, so that the video processing is realized. Therefore, by means of mapping of the two types of feature information in the feature space, more accurate target music materials with higher matching degree can be added to the source video, the addition of background music to the video is not dependent on manual intervention, the accuracy and efficiency in the video processing process are improved, and the labor cost is saved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a schematic illustration according to a first embodiment of the present application;
FIG. 2 is a schematic diagram according to a second embodiment of the present application;
FIG. 3 is a schematic illustration according to a third embodiment of the present application;
FIG. 4 is a schematic diagram of a method for obtaining a first word vector through a continuous bag of words model;
FIG. 5 is a schematic illustration according to a fourth embodiment of the present application;
FIG. 6 is a schematic illustration according to a fifth embodiment of the present application;
FIG. 7 is a schematic illustration according to a sixth embodiment of the present application;
FIG. 8 is a schematic illustration according to a seventh embodiment of the present application;
FIG. 9 is a schematic illustration according to an eighth embodiment of the present application;
FIG. 10 is a schematic illustration according to a ninth embodiment of the present application;
FIG. 11 is a schematic illustration in accordance with a tenth embodiment of the present application;
fig. 12 is a block diagram of a video processing apparatus for implementing a video processing method according to an embodiment of the present application;
fig. 13 is a block diagram of a video processing apparatus for implementing a video processing method according to an embodiment of the present application;
FIG. 14 is a block diagram of video processing electronics for implementing embodiments of the present application.
Detailed Description
The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application to assist in understanding, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
A video processing method, apparatus, electronic device, and storage medium according to embodiments of the present application are described below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram according to a first embodiment of the present application. It should be noted that the execution body of the video processing method of this embodiment is a video processing apparatus, which may be a hardware device (such as a terminal device or a server) or software running on a hardware device. As shown in fig. 1, the video processing method proposed in this embodiment includes the following steps:
s101, first characteristic information of a source video is obtained.
The source video can be any video. For example, it may be a video album composed of at least two photographs, a video collection composed of multiple videos, or a video file composed of at least one photograph and at least one video.
It should be noted that when attempting to acquire a source video, a video stored in advance in a local or remote storage area may be acquired, or a video may be directly captured. Optionally, the stored video or image may be retrieved from at least one of a local or remote video library, an image library to form a source video; alternatively, the video may be recorded directly to form the source video. The method for acquiring the source video is not limited, and can be selected according to actual conditions.
After the source video is acquired, feature extraction may be performed on it to obtain the first feature information. Because the source video is composed of multiple frames of images, the first feature information at least includes image features of the source video; image features mainly include color features, texture features, shape features, spatial relationship features, and the like.
And S102, acquiring second characteristic information of each candidate music material.
The candidate music material may be any audio. For example, it may be any track among popular songs obtained based on historical click-through rates, or any piece of music the user has collected.
It should be noted that, when acquiring candidate music materials, materials stored in advance in a local or remote storage area may be used, or audio may be recorded directly. Optionally, stored audio files may be obtained from at least one of a local or remote audio library or an audio library in music playing software to form the candidate music materials; alternatively, an audio file may be recorded directly to form a candidate music material. The manner of acquiring candidate music materials is not limited in this application and may be selected according to the actual situation.
After a candidate music material is acquired, feature extraction may be performed on it to obtain the second feature information. Because a candidate music material is audio, the second feature information at least includes audio features of the candidate music material; audio features mainly include the spectral contrast, spectral centroid, chromagram, Mel-frequency cepstral coefficients (MFCC), and the like.
S103, mapping the first characteristic information and the second characteristic information in a characteristic space respectively to acquire the similarity between the source video and the candidate music material.
In the embodiment of the present application, the first feature information and the second feature information describe two different types of objects and belong to different feature spaces. That is, they follow two different data distributions, so their similarity cannot be compared directly. Therefore, the first feature information and the second feature information need to be mapped into the same feature space and represented there. This is equivalent to translating two different data distributions into the same "language", so that the similarity between the source video and the candidate music material can be obtained from the mapped representations.
And S104, selecting a target music material from the candidate music materials according to the corresponding similarity of each candidate music material.
In the embodiment of the application, after the similarity between the source video and the candidate musical materials is obtained, the similarity corresponding to each candidate musical material may be extracted, and the similarities corresponding to each candidate musical material may be sorted, so as to select the target musical material from the plurality of candidate musical materials.
As a possible implementation, the similarities corresponding to the candidate music materials may be sorted in descending order, and the candidate music materials within a preset ranking range are selected as target music materials. For example, the top 5 candidate music materials may serve as background-music candidates for the source video; in this case, these 5 candidates can be recommended to the user, and the user selects the final target music material from them. Optionally, the target music material that best matches the user's preferences may be selected from the 5 candidates based on the user's profile.
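For illustration only, a minimal Python sketch of this descending-order ranking and top-k selection (the helper name and the example similarity values are hypothetical and not part of this application) could look as follows:

    def select_target_materials(similarities, candidates, k=5):
        # similarities[i] is the similarity between the source video and candidates[i]
        ranked = sorted(zip(similarities, candidates), key=lambda pair: pair[0], reverse=True)
        return [candidate for _, candidate in ranked[:k]]

    # Recommend the 5 best-matching materials; the user (or a user-profile rule)
    # then picks the final target music material from this shortlist.
    shortlist = select_target_materials([0.91, 0.34, 0.78, 0.66, 0.12, 0.88],
                                        ["m1", "m2", "m3", "m4", "m5", "m6"])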
And S105, loading the target music material in the source video to generate the target video.
In the embodiment of the application, after the target music material is selected, the target music material is loaded into the source video to generate the target video.
In the process of loading the target music material into the source video, it is necessary to synchronize data related to the image frame, lyrics, and audio playing time of the target music material, so that the target music material and the source video can be displayed in the target video synchronously.
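One possible way to perform this loading step, sketched here with the third-party moviepy library (which the application does not name; the file names are placeholders), is to trim the target music material to the source video's duration and attach it as the audio track:

    from moviepy.editor import VideoFileClip, AudioFileClip

    video = VideoFileClip("source_video.mp4")
    music = AudioFileClip("target_music.mp3").subclip(0, video.duration)  # align playing time
    target = video.set_audio(music)   # load the target music material into the source video
    target.write_videofile("target_video.mp4", audio_codec="aac")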
In the video processing method of this embodiment, first feature information of the source video and second feature information of each candidate music material are acquired; the similarity between the source video and each candidate music material is obtained from the acquired first and second feature information; a target music material is then selected according to the similarity corresponding to each candidate music material; and the target music material is loaded into the source video to generate the target video, thereby realizing the video processing. By mapping the two types of feature information into a common feature space, a better-matched target music material can be added to the source video, background music no longer needs to be added through manual intervention, the accuracy and efficiency of video processing are improved, and labor cost is saved.
It should be noted that, in the present application, when first feature information of a source video is attempted to be obtained, feature extraction and entity identification may be performed on the source video to obtain an image feature and a first entity keyword, so as to generate the first feature information.
As a possible implementation manner, as shown in fig. 2, on the basis of the foregoing embodiment, the process of acquiring the first feature information of the source video in step S101 specifically includes the following steps:
s201, extracting the characteristics of the source video to acquire the image characteristics of the source video.
Image features are attributes of an image that can serve as distinguishing marks, and include grayscale features, texture features, shape features, and the like. Obtaining the image features of the source video means processing and analyzing the information contained in the source video and extracting the information least affected by random interference as the features of the image; in other words, obtaining the image features of the source video is a process of removing redundant information from the source video.
It should be noted that, when obtaining the image features of the source video, grayscale features may be obtained from the grayscale mean, variance, and the like; texture features may be obtained using grey-level difference statistics, a grey-gradient co-occurrence matrix, or similar methods; and shape features may be obtained using invariant moments.
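A minimal sketch of these three kinds of image features, assuming OpenCV and NumPy are available (the frame path is a placeholder), might be:

    import cv2
    import numpy as np

    frame = cv2.imread("frame.jpg")                            # one frame of the source video
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    gray_mean, gray_var = gray.mean(), gray.var()              # grayscale mean and variance
    hu = cv2.HuMoments(cv2.moments(gray)).flatten()            # invariant-moment shape features

    # simple grey-level difference statistics as a stand-in for the texture features
    diff = np.abs(gray[:, 1:].astype(np.int32) - gray[:, :-1].astype(np.int32))
    texture = np.histogram(diff, bins=16, range=(0, 256), density=True)[0]

    image_features = np.concatenate([[gray_mean, gray_var], hu, texture])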
S202, carrying out entity identification on the source video to obtain a first entity keyword of the source video.
The first entity keywords are the set of entity keywords obtained by mining the content of the source video through deep learning (DL). For example, first entity keywords such as window, table, door, man, looking up, and smiling may be extracted from the acquired source video.
It should be noted that, when trying to obtain the first entity keyword of the source video, for example, a Convolutional Neural Network (CNN) based method may be adopted to perform entity identification on the source video to obtain the first entity keyword of the source video.
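For illustration, a pretrained ImageNet classifier can stand in for the entity-recognition CNN (the application does not specify a particular network; the frame path is a placeholder):

    import torch
    from PIL import Image
    from torchvision import models
    from torchvision.models import ResNet50_Weights

    weights = ResNet50_Weights.DEFAULT
    model = models.resnet50(weights=weights).eval()
    preprocess = weights.transforms()

    frame = Image.open("frame.jpg")                       # one frame of the source video
    with torch.no_grad():
        probs = model(preprocess(frame).unsqueeze(0)).softmax(dim=1)[0]
    top5 = probs.topk(5)
    first_entity_keywords = [weights.meta["categories"][int(i)] for i in top5.indices]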
S203, generating first feature information according to the image features and the first entity keywords.
As a possible implementation manner, as shown in fig. 3, on the basis of the foregoing embodiment, the process of generating the first feature information according to the image feature and the first entity keyword in step S203 specifically includes the following steps:
s301, obtaining a first word vector of the first entity keyword.
The process of obtaining the first word vector refers to a process of converting the first entity keyword into a dense vector, that is, a process of converting a word represented by a natural language into a vector or a matrix form that can be understood by a computer. And the corresponding word vectors of the entity key words with similar semantics are similar.
In the present application, the manner of generating the first word vector is not limited, and may be selected according to actual situations. For example, the first word vector may be generated based on a statistical method; as another example, the first word vector may be generated based on a Language Model (Language Model).
As a possible implementation manner, a Continuous Bag of Words (CBOW) model in Word2vec (Word to Vector) may be selected to obtain the first word vector of the first entity keywords.
For example, as shown in fig. 4, the entity keywords (t-1) to (t-n) and the entity keywords (t + 1) to (t + n) in the first entity keywords may be input into the CBOW model to obtain the first word vector (t).
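A small sketch of obtaining such CBOW word vectors with the gensim library (the keyword corpus below is hypothetical; in practice it would be mined from many videos) could be:

    from gensim.models import Word2Vec

    keyword_sequences = [
        ["window", "table", "door", "man", "smile"],
        ["sun", "flower", "photo", "smile"],
    ]

    # sg=0 selects the CBOW architecture: the centre word (t) is predicted from
    # its context words (t-n .. t-1, t+1 .. t+n), as in fig. 4.
    cbow = Word2Vec(sentences=keyword_sequences, vector_size=128, window=2, sg=0, min_count=1)
    first_word_vector = cbow.wv["smile"]                  # 128-dimensional first word vector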
S302, the image features and the first word vectors are spliced to obtain first feature information.
In the embodiment of the application, the image features and the first word vector can be spliced, so that the obtained first feature information is as close as possible to complete description of the source video.
It should be noted that, when trying to stitch the image feature and the first word vector, the first word vector may be reshaped (reshape) to a matrix consistent with the dimension of the image feature, so as to stitch the image feature and the first word vector to obtain the first feature information.
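As a sketch of this reshape-and-splice step (the dimensions are placeholders chosen only so that the reshape works out):

    import numpy as np

    image_features = np.random.rand(16, 16)      # placeholder image-feature matrix
    first_word_vector = np.random.rand(256)      # placeholder word vector of the first entity keywords

    # reshape the word vector to match the second dimension of the image features,
    # then stack the two to form the first feature information
    word_matrix = first_word_vector.reshape(-1, image_features.shape[1])
    first_feature_info = np.concatenate([image_features, word_matrix], axis=0)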
In the video processing method described above, feature extraction and entity recognition are performed on the source video to obtain the image features and the first entity keywords of the source video; the first entity keywords are vectorized into a first word vector; and the first feature information is obtained from the image features and the first word vector. Vectorizing the first entity keywords on top of the image features yields a first word vector with richer semantics, making the obtained first feature information more accurate and further improving the accuracy of video processing.
It should be noted that, in the present application, when trying to obtain the second feature information of each candidate musical material, feature extraction and speech information recognition may be performed on the candidate musical material to obtain the audio features and the second entity keywords of the candidate musical material, so as to generate the second feature information.
As a possible implementation manner, as shown in fig. 5, on the basis of the foregoing embodiment, the process of acquiring the second feature information of each candidate musical material in the foregoing step S102 specifically includes the following steps:
s401, extracting the characteristics of the candidate music materials to obtain the audio characteristics of the candidate music materials.
When obtaining the audio features of a candidate music material, the material may be processed by pre-emphasis, framing, windowing, fast Fourier transform, Mel filtering, logarithm, and discrete cosine transform, among other operations, so as to obtain its audio features.
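The librosa library, for example, bundles this framing / windowing / FFT / Mel-filtering / log / DCT pipeline into a single MFCC call (the audio path is a placeholder; the application itself does not name a library):

    import librosa

    waveform, sample_rate = librosa.load("candidate_music.mp3", sr=22050)
    mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=20)   # shape (20, n_frames)
    audio_features = mfcc.mean(axis=1)   # e.g. average over frames to a fixed-length feature vector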
S402, performing voice information recognition on the candidate music materials to obtain second entity keywords of the candidate music materials.
The second entity keywords are the set of entity keywords obtained by mining the candidate music material through deep learning (DL). For example, second entity keywords such as sun, flower, photo, and smile may be extracted from the lyric information of the candidate music material "Going to School Song".
It should be noted that, when obtaining the second entity keywords of a candidate music material, optical character recognition (OCR) or speech recognition technology may be used to extract the entity keywords from the lyric information of the candidate music material, so as to obtain its second entity keywords.
And S403, generating second characteristic information according to the audio characteristics and the second entity keywords.
As a possible implementation manner, as shown in fig. 6, on the basis of the foregoing embodiment, the process of generating the second feature information according to the audio feature and the second entity keyword in step S403 specifically includes the following steps:
s501, obtaining a second word vector of the second entity keyword.
The process of obtaining the second word vector refers to a process of converting the second entity keyword into a dense vector, that is, a process of converting a word represented by a natural language into a vector or a matrix which can be understood by a computer. And the corresponding word vectors of the entity key words with similar semantics are similar.
In the present application, the manner of generating the second word vector is not limited and may be selected according to the actual situation. For example, the second word vector may be generated based on statistical methods, or based on a language model.
As a possible implementation manner, a continuous Word bag CBOW model in Word2vec can be selected, and the current Word is predicted according to the context words in a certain range, so as to obtain a second Word vector of the second entity keyword.
S502, splicing the audio features and the second word vectors to obtain second feature information.
In the embodiment of the application, the audio features and the second word vectors can be spliced, so that the obtained second feature information is as close as possible to the complete description of each candidate music material.
It should be noted that, when an attempt is made to concatenate the audio feature and the second word vector, the second word vector may be reshaped (reshape) to a matrix consistent with the audio feature dimension, so as to concatenate the audio feature and the second word vector to obtain the second feature information.
In the video processing method described above, feature extraction and speech information recognition are performed on the candidate music material to obtain its audio features and second entity keywords; the second entity keywords are vectorized into a second word vector; and the second feature information is obtained from the audio features and the second word vector. Vectorizing the second entity keywords on top of the audio features yields a second word vector with richer semantics, making the obtained second feature information more accurate and further improving the accuracy of video processing.
It should be noted that, in the present application, when obtaining the similarity between the source video and a candidate music material, metric learning may be performed on the first feature information and the second feature information, respectively, so as to obtain the first feature representation of the source video and the second feature representation of the candidate music material in the feature space, and then obtain the similarity between the source video and the candidate music material from these representations.
As a possible implementation manner, as shown in fig. 7, on the basis of the foregoing embodiment, in the above step S103, a process of mapping the first feature information and the second feature information in the feature space respectively to obtain the similarity between the source video and the candidate music material specifically includes the following steps:
s601, metric learning is carried out on the first feature information to obtain a first feature representation of the source video in the feature space.
And S602, performing metric learning on the second characteristic information to acquire a second characteristic representation of the candidate music material in the characteristic space.
Metric learning is a space-mapping method. Through metric learning, a feature (embedding) space is obtained in which all data are converted into feature vectors; the distance between the feature vectors of similar samples is small and the distance between the feature vectors of dissimilar samples is large, which makes the data easy to distinguish.
In the embodiment of the application, the first feature information and the second feature information can be respectively input into the target metric learning model to perform metric learning on the first feature information and the second feature information, so as to obtain a first feature representation of the source video in the feature space and a second feature representation of the candidate music material in the feature space.
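A minimal sketch of such a target metric learning model, written here as a two-branch PyTorch module (the layer sizes are illustrative assumptions), projects both kinds of feature information into one shared embedding space:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TwoTowerMetricModel(nn.Module):
        """One branch per modality; both branches map into the same feature space."""
        def __init__(self, video_dim, audio_dim, embed_dim=128):
            super().__init__()
            self.video_branch = nn.Sequential(nn.Linear(video_dim, 256), nn.ReLU(),
                                              nn.Linear(256, embed_dim))
            self.audio_branch = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU(),
                                              nn.Linear(256, embed_dim))

        def forward(self, first_feature_info, second_feature_info):
            first_repr = F.normalize(self.video_branch(first_feature_info), dim=-1)
            second_repr = F.normalize(self.audio_branch(second_feature_info), dim=-1)
            return first_repr, second_repr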
And S603, acquiring the similarity between the first characteristic representation and the second characteristic representation as the similarity between the source video and the candidate music material.
In this embodiment of the application, similarity comparison processing may be performed according to the acquired first feature representation and the acquired second feature representation to acquire the similarity between the source video and the candidate music material.
It should be noted that the similarity between the source video and the candidate musical materials can be represented by the feature distance, and the smaller the feature distance, the higher the similarity between the source video and the candidate musical materials is. The specific calculation mode of the characteristic distance is not limited in the application, and can be selected according to actual conditions. For example, a cosine distance, a Minkowski distance, or the like may be acquired.
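For example, the cosine similarity between the two feature representations (equivalently, one minus the cosine distance) can be computed directly:

    import numpy as np

    def cosine_similarity(first_repr, second_repr):
        first_repr, second_repr = np.asarray(first_repr), np.asarray(second_repr)
        return float(first_repr @ second_repr /
                     (np.linalg.norm(first_repr) * np.linalg.norm(second_repr)))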
It should be noted that the target metric learning model is trained in advance. In the embodiment of the present application, as shown in fig. 8, the target metric learning model may be established in advance by:
s701, selecting sample data, wherein the sample data comprises a sample video and background music matched with the sample video.
The sample data may be collected in advance, so as to obtain the third characteristic information of the sample video and the fourth characteristic information of the background music in the following. The number of sample data may be preset, for example, 100 sample data may be acquired.
As a possible implementation manner, as shown in fig. 9, on the basis of the foregoing embodiment, the process of acquiring sample data specifically includes the following steps:
s801, obtaining the candidate sample video and the description information of the candidate sample video.
The description information refers to information for screening candidate sample videos. For example, the identification result of whether the candidate sample video contains background music, the click rate of the candidate sample video, and the like may be included.
S802, screening out sample videos from the candidate sample videos according to the description information.
In the embodiment of the application, the screening condition can be preset, and then based on the screening condition, the sample video is screened from the candidate sample video according to the description information.
For example, suppose the predetermined screening condition is: contains background music and has a click-through rate greater than 10%. The description information of candidate sample video A is: contains background music, click-through rate 2%; the description information of candidate sample video B is: contains background music, click-through rate 12%. Based on the predetermined screening condition, candidate sample video B is screened out as a sample video.
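A sketch of this screening step under the example condition above (the description-information field names are hypothetical):

    candidate_videos = [
        {"id": "A", "has_background_music": True, "click_rate": 0.02},
        {"id": "B", "has_background_music": True, "click_rate": 0.12},
    ]
    sample_videos = [v for v in candidate_videos
                     if v["has_background_music"] and v["click_rate"] > 0.10]
    # only candidate sample video B is kept as a sample video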
S702, respectively obtaining third characteristic information of the sample video and fourth characteristic information of the background music.
As a possible implementation manner, as shown in fig. 10, on the basis of the foregoing embodiment, the process of acquiring the third feature information and the fourth feature information specifically includes the following steps:
and S901, separating a sample video and background music from the sample data.
In the embodiment of the application, the sample video and the background music can be separated from the sample data according to the labels of the sample video and the background music.
And S902, inputting the sample video into a video channel for feature extraction to obtain third feature information.
In the embodiment of the application, the extracted sample video can be input into a video channel for feature extraction, so that the image features of the sample video and the third entity keywords of the sample video can be obtained, and further, third feature information can be generated according to the image features of the sample video and the third entity keywords of the sample video.
And S903, inputting the background music into an audio channel for feature extraction to obtain fourth feature information.
In the embodiment of the application, the background music can be input into the audio channel for feature extraction, so as to obtain the audio features of the background music and the fourth entity keywords of the background music, and further, fourth feature information can be generated according to the audio features of the background music and the fourth entity keywords of the background music.
And S703, training the metric learning model by using the third characteristic information and the fourth characteristic information to generate a target metric learning model, wherein the target metric learning model is used for metric learning of the first characteristic information and the second characteristic information.
In the embodiment of the application, loss calculation can be performed according to the acquired third feature information and the acquired fourth feature information, so that the metric learning model is trained, and then the target metric learning model is generated.
When performing the loss calculation based on the acquired third feature information and fourth feature information, an existing calculation method may be used. The specific loss is not limited in the present application and may be selected according to the actual situation. For example, a contrastive loss (Contrastive Loss) or a triplet loss (Triplet Loss) may be used to train the metric learning model and generate the target metric learning model. The specific calculation of the contrastive loss, the triplet loss, and the like is prior art and is not described here again.
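Sketched with PyTorch's built-in triplet loss (the data loader and dimensions are hypothetical; the model is the two-branch sketch shown earlier), one training step pulls the sample video towards its own background music and pushes it away from background music taken from a different sample:

    import torch
    import torch.nn as nn

    model = TwoTowerMetricModel(video_dim=512, audio_dim=256)   # two-branch sketch from above
    criterion = nn.TripletMarginLoss(margin=0.2)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    for third_info, fourth_info, negative_music in loader:      # hypothetical DataLoader
        anchor, positive = model(third_info, fourth_info)       # sample video + matched music
        _, negative = model(third_info, negative_music)         # mismatched background music
        loss = criterion(anchor, positive, negative)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()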
In this embodiment, the model may be trained until convergence based on the third feature information and the fourth feature information, so that a trained target metric learning model may be obtained.
In the video processing method described above, a target metric learning model trained to convergence can be obtained by selecting sample data in advance and training on it. After the first and second feature information are obtained, they are input into the target metric learning model for metric learning to generate the first and second feature representations in the feature space, so the similarity between the source video and candidate music materials of different modalities can be obtained more accurately. This overcomes the technical problem that the semantic gap between the source video and candidate music materials of different modalities makes the obtained similarity inaccurate, and further improves the accuracy of video processing.
Fig. 11 is a schematic diagram according to a tenth embodiment of the present application. As shown in fig. 11, on the basis of the foregoing embodiment, the video processing method provided in this embodiment includes the following steps:
s1001, obtaining the candidate sample video and the description information of the candidate sample video. .
S1002, screening out sample videos from the candidate sample videos according to the description information.
And S1003, separating the sample video and the background music from the sample data.
And S1004, inputting the sample video into a video channel for feature extraction to obtain third feature information.
S1005, inputting the background music into the audio channel for feature extraction to obtain fourth feature information.
And S1006, training the metric learning model by using the third feature information and the fourth feature information to generate a target metric learning model, wherein the target metric learning model is used for metric learning of the first feature information and the second feature information.
And S1007, extracting the characteristics of the source video to obtain the image characteristics of the source video.
S1008, entity recognition is carried out on the source video to obtain a first entity keyword of the source video.
S1009, acquiring a first word vector of the first entity keyword.
S1010, the image features and the first word vectors are spliced to obtain first feature information.
And S1011, performing feature extraction on the candidate music materials to obtain the audio features of the candidate music materials.
And S1012, performing voice information identification on the candidate music materials to obtain second entity keywords of the candidate music materials.
S1013, a second word vector of the second entity keyword is obtained.
And S1014, splicing the audio features and the second word vector to acquire second feature information.
And S1015, performing metric learning on the first feature information to acquire a first feature representation of the source video in the feature space.
And S1016, performing metric learning on the second characteristic information to obtain a second characteristic representation of the candidate music material in the characteristic space.
And S1017, acquiring the similarity between the first feature representation and the second feature representation as the similarity between the source video and the candidate music material.
And S1018, selecting the target musical material from the candidate musical materials according to the similarity corresponding to each candidate musical material.
And S1019, loading the target music material in the source video to generate the target video.
It should be noted that, for the descriptions of steps S1001 to S1019, reference may be made to the relevant descriptions in the above embodiments, and details are not repeated here.
It should be noted that the video processing method provided by the present application can be applied to a variety of scenes related to the video processing field.
For an automatic video generation application scenario, after a complete source video is generated through artificial intelligence (AI) technology, the similarity between the source video and the candidate music materials can be obtained automatically by combining natural language processing and computer vision. Then, according to the similarity corresponding to each candidate music material in the music library, the music material with the highest similarity is selected from the candidates as the target music material and loaded into the source video to generate the target video. The automatic video generation process thus no longer depends on any manual intervention, fully automated video production is achieved, generation efficiency is improved, and labor cost is greatly reduced.
For a short-video application scenario, after a user finishes recording a video on a terminal such as a smartphone, the 3-5 music materials with the highest similarity can be selected from the candidate music materials and automatically recommended to the user according to the similarity corresponding to each candidate music material in the music library. The user then selects a target music material from these 3-5 materials, and once the user has made the selection, the target music material is loaded into the source video to generate the target video. This noticeably reduces the user's workload of picking suitable music from the massive number of materials in the music library.
Further, in the short-video application scenario, the music library can be narrowed further, for example to the songs the user has collected in the music playing application on their mobile phone. In this way, music material recommendation can be personalized on top of accurate background-music matching, further improving the user experience.
In the video processing method of the present application, first feature information of the source video and second feature information of each candidate music material are acquired; the similarity between the source video and each candidate music material is obtained from the acquired first and second feature information; a target music material is then selected according to the similarity corresponding to each candidate music material; and the target music material is loaded into the source video to generate the target video, thereby realizing the video processing. By mapping the two types of feature information into a common feature space, a better-matched target music material can be added to the source video, background music no longer needs to be added through manual intervention, the accuracy and efficiency of video processing are improved, and labor cost is saved.
Corresponding to the video processing methods provided by the above embodiments, an embodiment of the present application further provides a video processing apparatus, and since the video processing apparatus provided by the embodiment of the present application corresponds to the video processing methods provided by the above embodiments, the implementation manner of the video processing method is also applicable to the video processing apparatus provided by the embodiment, and is not described in detail in the embodiment. Fig. 12 to 13 are schematic structural diagrams of a video processing apparatus according to an embodiment of the present application.
As shown in fig. 12, the video processing apparatus 1000 includes: the system comprises a first obtaining module 100, a second obtaining module 200, a similarity obtaining module 300, a material selecting module 400 and a generating module 500. Wherein:
a first obtaining module 100, configured to obtain first feature information of a source video;
a second obtaining module 200, configured to obtain second feature information of each candidate musical material;
a similarity obtaining module 300, configured to map the first feature information and the second feature information in a feature space, respectively, so as to obtain a similarity between the source video and the candidate music material;
a material selecting module 400, configured to select a target music material from the multiple candidate music materials according to the similarity corresponding to each candidate music material; and
a generating module 500, configured to load the target music material in the source video to generate a target video.
In an embodiment of the present application, as shown in fig. 13, the similarity obtaining module 300 in fig. 12 includes: a feature representation obtaining unit 310, configured to perform metric learning on the first feature information to obtain a first feature representation of the source video in the feature space, and perform metric learning on the second feature information to obtain a second feature representation of the candidate musical material in the feature space; a similarity obtaining unit 320 configured to obtain a similarity between the first feature representation and the second feature representation as the similarity of the source video and the candidate musical material.
In an embodiment of the present application, as shown in fig. 13, the first obtaining module 100 in fig. 12 includes: a feature extraction unit 110, configured to perform feature extraction on the source video to obtain an image feature of the source video; an entity identifying unit 120, configured to perform entity identification on the source video to obtain a first entity keyword of the source video; a first generating unit 130, configured to generate the first feature information according to the image feature and the first entity keyword.
In the embodiment of the present application, as shown in fig. 13, the first generating unit 130 in fig. 12 includes: a first obtaining subunit 131, configured to obtain a first word vector of the first entity keyword; a second obtaining subunit 132, configured to splice the image feature and the first word vector to obtain the first feature information.
In an embodiment of the present application, as shown in fig. 13, the second obtaining module 200 in fig. 12 includes: a first obtaining unit 210 configured to perform feature extraction on the candidate musical materials to obtain audio features of the candidate musical materials; a second obtaining unit 220, configured to perform speech information recognition on the candidate musical material to obtain a second entity keyword of the candidate musical material; a second generating unit 230, configured to generate the second feature information according to the audio feature and the second entity keyword.
In an embodiment of the present application, as shown in fig. 13, the second generating unit 230 in fig. 12 includes: a third obtaining subunit 231, configured to obtain a second word vector of the second entity keyword; a fourth obtaining subunit 232, configured to splice the audio feature and the second word vector to obtain the second feature information.
In an embodiment of the present application, as shown in fig. 13, the similarity obtaining module 300 in fig. 12 further includes: a sample selecting unit 330, configured to select sample data, where the sample data includes a sample video and background music matched with the sample video; a third obtaining unit 340, configured to obtain third feature information of the sample video and fourth feature information of the background music, respectively; and a third generating unit 350, configured to train a metric learning model with the third feature information and the fourth feature information to generate a target metric learning model, where the target metric learning model is used to perform metric learning on the first feature information and the second feature information.
In an embodiment of the present application, as shown in fig. 13, the sample selecting unit 330 in fig. 12 includes: a fifth acquiring subunit 331, configured to acquire a candidate sample video and description information of the candidate sample video; a screening subunit 332, configured to screen out the sample video from the candidate sample videos according to the description information.
In an embodiment of the present application, as shown in fig. 13, the third obtaining unit 340 in fig. 12 includes: a separating subunit 341, configured to separate the sample video and the background music from the sample data; a sixth obtaining subunit 342, configured to input the sample video into a video channel for feature extraction, so as to obtain the third feature information; a seventh obtaining subunit 343, configured to input the background music into an audio channel for feature extraction, so as to obtain the fourth feature information.
In the video processing apparatus of the embodiment of the application, first feature information of the source video and second feature information of each candidate music material are acquired; the similarity between the source video and each candidate music material is obtained from the acquired first and second feature information; a target music material is then selected according to the similarity corresponding to each candidate music material; and the target music material is loaded into the source video to generate the target video, thereby realizing the video processing. By mapping the two types of feature information into a common feature space, a better-matched target music material can be added to the source video, background music no longer needs to be added through manual intervention, the accuracy and efficiency of video processing are improved, and labor cost is saved.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
As shown in fig. 14, is a block diagram of a video processing electronic device according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 14, the electronic apparatus includes: one or more processors 1100, a memory 1200, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used together with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). Fig. 14 takes one processor 1100 as an example.
The memory 1200 is a non-transitory computer readable storage medium provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the video processing method provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the video processing method provided by the present application.
The memory 1200, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the video processing method in the embodiment of the present application (for example, the first obtaining module 100, the second obtaining module 200, the similarity obtaining module 300, the material selecting module 400, and the generating module 500 shown in fig. 12). By executing the non-transitory software programs, instructions, and modules stored in the memory 1200, the processor 1100 executes various functional applications and data processing of the server, that is, implements the video processing method in the above-described method embodiments.
The memory 1200 may include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required for at least one function, and the storage data area may store data created from the use of the video processing electronic device, and the like. Further, the memory 1200 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 1200 may optionally include memory located remotely from the processor 1100, which may be connected to the video processing electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The video processing electronic device may further comprise: an input device 1300 and an output device 1400. The processor 1100, the memory 1200, the input device 1300, and the output device 1400 may be connected by a bus or other means; in fig. 14, connection by a bus is taken as an example.
The input device 1300 may receive input numeric or character information and generate key signal inputs related to user settings and function controls of the video processing electronic device; examples include a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or other input devices. The output device 1400 may include a display device, an auxiliary lighting device (e.g., an LED), a haptic feedback device (e.g., a vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose and may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship with each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability of traditional physical hosts and VPS (virtual private server) services. The server may also be a server of a distributed system, or a server combined with a blockchain.
The present application also provides a computer program product; when instructions in the computer program product are executed by a processor, the video processing method as described above is implemented.
According to the video processing method of the embodiment of the present application, first feature information of the source video and second feature information of each candidate music material are acquired; the similarity between the source video and each candidate music material is obtained according to the acquired first feature information and second feature information; a target music material is then selected according to the similarity corresponding to each candidate music material, and the target music material is loaded into the source video to generate the target video, thereby realizing the video processing. By mapping the two types of feature information into the feature space, a target music material that matches the source video more accurately can be added to it, so that adding background music to a video no longer depends on manual intervention, which improves the accuracy and efficiency of the video processing process and saves labor cost.
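The method relies on a pre-trained target metric learning model that maps the first and second feature information into a common feature space. The present application does not fix a particular training objective; the sketch below assumes a simple two-tower model trained with a triplet margin loss on matched sample-video/background-music feature pairs, with unmatched music as negatives. The dimensions, architecture, and loss choice are illustrative assumptions only.

```python
# Assumed two-tower metric learning sketch (PyTorch); the loss, dimensions,
# and architecture are illustrative choices, not taken from the application.
import torch
import torch.nn as nn


class TwoTowerMetricModel(nn.Module):
    def __init__(self, video_dim: int = 2348, audio_dim: int = 340, embed_dim: int = 128):
        super().__init__()
        self.video_tower = nn.Sequential(nn.Linear(video_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim))
        self.audio_tower = nn.Sequential(nn.Linear(audio_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim))

    def embed_video(self, x):  # first/third feature information -> feature space
        return nn.functional.normalize(self.video_tower(x), dim=-1)

    def embed_audio(self, x):  # second/fourth feature information -> feature space
        return nn.functional.normalize(self.audio_tower(x), dim=-1)


def train_step(model, optimizer, video_feats, pos_music_feats, neg_music_feats):
    """One triplet-loss update: matched background music is pulled toward its
    sample video in the feature space, unmatched music is pushed away."""
    loss_fn = nn.TripletMarginLoss(margin=0.2)
    anchor = model.embed_video(video_feats)
    positive = model.embed_audio(pos_music_feats)
    negative = model.embed_audio(neg_music_feats)
    loss = loss_fn(anchor, positive, negative)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```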
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders; the present application is not limited in this respect, as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments are not intended to limit the scope of the present disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (6)

1. A video processing method, comprising:
acquiring first characteristic information of a source video;
acquiring second characteristic information of each candidate music material;
inputting the first feature information into a target metric learning model for metric learning to obtain a first feature representation of the source video in a feature space, wherein the target metric learning model is obtained by pre-training;
inputting the second feature information into the target metric learning model for metric learning to obtain a second feature representation of the candidate musical material in the feature space;
acquiring the similarity between the first feature representation and the second feature representation as the similarity of the source video and the candidate music material;
selecting a target music material from the candidate music materials according to the similarity corresponding to each candidate music material; and
loading the target music material in the source video to generate a target video;
the method further comprises the following steps:
selecting sample data, wherein the sample data comprises a sample video and background music matched with the sample video;
separating the sample video and the background music from the sample data;
inputting the sample video into a video channel for feature extraction to obtain third feature information; and
inputting the background music into an audio channel for feature extraction to obtain fourth feature information;
training a metric learning model by using the third feature information and the fourth feature information to generate a target metric learning model, wherein the target metric learning model is used for metric learning of the first feature information and the second feature information;
the obtaining of the first feature information of the source video includes:
performing feature extraction on the source video to acquire image features of the source video;
performing entity identification on the source video to acquire a first entity keyword of the source video;
acquiring a first word vector of the first entity keyword; and
splicing the image features and the first word vector to obtain the first feature information;
wherein the obtaining of the second feature information of each candidate musical material includes:
performing feature extraction on the candidate music materials to obtain audio features of the candidate music materials;
performing voice information identification on the candidate music materials to obtain second entity keywords of the candidate music materials;
acquiring a second word vector of the second entity keyword; and
splicing the audio features and the second word vector to obtain the second feature information.
2. The video processing method according to claim 1, wherein said selecting sample data comprises:
acquiring a candidate sample video and description information of the candidate sample video; and
screening the sample video from the candidate sample video according to the description information.
3. A video processing apparatus, comprising:
the first acquisition module is used for acquiring first characteristic information of a source video;
the second acquisition module is used for acquiring second characteristic information of each candidate music material;
a similarity obtaining module, configured to map the first feature information and the second feature information in a feature space, respectively, so as to obtain a similarity between the source video and the candidate music material;
the material selection module is used for selecting a target music material from a plurality of candidate music materials according to the similarity corresponding to each candidate music material; and
a generating module, configured to load the target music material in the source video to generate a target video;
wherein, the similarity obtaining module includes:
a feature representation obtaining unit, configured to input the first feature information into a target metric learning model for metric learning to obtain a first feature representation of the source video in the feature space, and input the second feature information into the target metric learning model for metric learning to obtain a second feature representation of the candidate music material in the feature space, where the target metric learning model is trained in advance;
a similarity acquisition unit configured to acquire a similarity between the first feature representation and the second feature representation as the similarity of the source video and the candidate musical material;
wherein, the similarity obtaining module further comprises:
a sample selecting unit, configured to select sample data, wherein the sample data comprises a sample video and background music matched with the sample video;
a third obtaining unit, configured to obtain third feature information of the sample video and fourth feature information of the background music, respectively;
a third generating unit, configured to train a metric learning model to generate a target metric learning model, where the target metric learning model is used for metric learning of the first feature information and the second feature information, and the third feature information and the fourth feature information are used for training the metric learning model;
wherein, the third obtaining unit further comprises:
a separating subunit, configured to separate the sample video and the background music from the sample data;
a sixth obtaining subunit, configured to input the sample video into a video channel for feature extraction, so as to obtain the third feature information; and
a seventh obtaining subunit, configured to input the background music into an audio channel for feature extraction, so as to obtain the fourth feature information;
wherein, the first obtaining module comprises:
the feature extraction unit is used for performing feature extraction on the source video to acquire image features of the source video;
the entity identification unit is used for carrying out entity identification on the source video so as to obtain a first entity keyword of the source video; and
the first generating unit is used for acquiring a first word vector of the first entity keyword, and splicing the image features and the first word vector to obtain the first feature information;
wherein the second obtaining module includes:
a first obtaining unit configured to perform feature extraction on the candidate musical material to obtain an audio feature of the candidate musical material;
a second obtaining unit, configured to perform speech information recognition on the candidate music material to obtain a second entity keyword of the candidate music material; and
a second generating unit, configured to obtain a second word vector of the second entity keyword; and splicing the audio features and the second word vector to obtain the second feature information.
4. The video processing apparatus of claim 3, wherein the sample selection unit comprises:
a fifth obtaining subunit, configured to obtain a candidate sample video and description information of the candidate sample video; and
the screening subunit is used for screening the sample video from the candidate sample video according to the description information.
5. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the video processing method of any of claims 1-2.
6. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the video processing method of any one of claims 1-2.
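For illustration only, the following non-limiting sketch shows two steps recited in claims 1 and 2: splicing modality features with entity-keyword word vectors to form the first and second feature information, and screening sample videos by their description information. Every helper, data structure, and keyword heuristic here is an assumption made for the sketch, not the claimed implementation.

```python
# Illustrative only; helpers and the screening heuristic are assumptions.
import numpy as np


def build_first_feature_information(image_features: np.ndarray,
                                    first_entity_keywords: list[str],
                                    word_vectors: dict[str, np.ndarray]) -> np.ndarray:
    """Splice image features with the averaged word vector of the first
    entity keywords recognized in the source video."""
    first_word_vector = np.mean([word_vectors[w] for w in first_entity_keywords], axis=0)
    return np.concatenate([image_features, first_word_vector])


def build_second_feature_information(audio_features: np.ndarray,
                                     second_entity_keywords: list[str],
                                     word_vectors: dict[str, np.ndarray]) -> np.ndarray:
    """Splice audio features with the averaged word vector of the second
    entity keywords recognized from the candidate music's speech."""
    second_word_vector = np.mean([word_vectors[w] for w in second_entity_keywords], axis=0)
    return np.concatenate([audio_features, second_word_vector])


def screen_sample_videos(candidates: list[dict]) -> list[dict]:
    """Keep candidate sample videos whose description information suggests
    a well-matched soundtrack (hypothetical keyword heuristic)."""
    keywords = ("soundtrack", "bgm", "background music")
    return [c for c in candidates
            if any(k in c.get("description", "").lower() for k in keywords)]
```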
CN202010602733.8A 2020-06-29 2020-06-29 Video processing method and device, electronic equipment and storage medium Active CN111918094B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010602733.8A CN111918094B (en) 2020-06-29 2020-06-29 Video processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010602733.8A CN111918094B (en) 2020-06-29 2020-06-29 Video processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111918094A CN111918094A (en) 2020-11-10
CN111918094B true CN111918094B (en) 2023-01-24

Family

ID=73226904

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010602733.8A Active CN111918094B (en) 2020-06-29 2020-06-29 Video processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111918094B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597320A (en) * 2020-12-09 2021-04-02 上海掌门科技有限公司 Social information generation method, device and computer readable medium
CN112637629B (en) * 2020-12-25 2023-06-20 百度在线网络技术(北京)有限公司 Live content recommendation method and device, electronic equipment and medium
CN113572981B (en) * 2021-01-19 2022-07-19 腾讯科技(深圳)有限公司 Video dubbing method and device, electronic equipment and storage medium
CN113438428B (en) * 2021-06-23 2022-11-25 北京百度网讯科技有限公司 Method, apparatus, device and computer-readable storage medium for automated video generation
CN114302225A (en) * 2021-12-23 2022-04-08 阿里巴巴(中国)有限公司 Video dubbing method, data processing method, device and storage medium
CN115103232B (en) * 2022-07-07 2023-12-08 北京字跳网络技术有限公司 Video playing method, device, equipment and storage medium
CN115169472A (en) * 2022-07-19 2022-10-11 腾讯科技(深圳)有限公司 Music matching method and device for multimedia data and computer equipment
CN118035508A (en) * 2022-11-11 2024-05-14 Oppo广东移动通信有限公司 Material data processing method and related product
CN116600168A (en) * 2023-04-10 2023-08-15 深圳市赛凌伟业科技有限公司 Multimedia data processing method and device, electronic equipment and storage medium
CN117376634B (en) * 2023-12-08 2024-03-08 湖南快乐阳光互动娱乐传媒有限公司 Short video music distribution method and device, electronic equipment and storage medium

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10467287B2 (en) * 2013-12-12 2019-11-05 Google Llc Systems and methods for automatically suggesting media accompaniments based on identified media content
US9411808B2 (en) * 2014-03-04 2016-08-09 Microsoft Technology Licensing, Llc Automapping of music tracks to music videos
US10681408B2 (en) * 2015-05-11 2020-06-09 David Leiberman Systems and methods for creating composite videos
CN109344695B (en) * 2018-08-14 2022-03-22 中山大学 Target re-identification method and device based on feature selection convolutional neural network
CN109587554B (en) * 2018-10-29 2021-08-03 百度在线网络技术(北京)有限公司 Video data processing method and device and readable storage medium
CN111198958A (en) * 2018-11-19 2020-05-26 Tcl集团股份有限公司 Method, device and terminal for matching background music
CN110704682B (en) * 2019-09-26 2022-03-18 新华智云科技有限公司 Method and system for intelligently recommending background music based on video multidimensional characteristics
CN110795625B (en) * 2019-10-25 2021-11-23 腾讯科技(深圳)有限公司 Recommendation method and device, computer equipment and storage medium
CN110839173A (en) * 2019-11-18 2020-02-25 上海极链网络科技有限公司 Music matching method, device, terminal and storage medium
CN110971969B (en) * 2019-12-09 2021-09-07 北京字节跳动网络技术有限公司 Video dubbing method and device, electronic equipment and computer readable storage medium
CN111259192B (en) * 2020-01-15 2023-12-01 腾讯科技(深圳)有限公司 Audio recommendation method and device
CN111324773A (en) * 2020-02-12 2020-06-23 腾讯科技(深圳)有限公司 Background music construction method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Algorithms and *** Implementation of Content-Based Music Retrieval; Guo Heping; China Master's Theses Full-text Database, Information Science and Technology; 2007-11-15; full text *

Also Published As

Publication number Publication date
CN111918094A (en) 2020-11-10

Similar Documents

Publication Publication Date Title
CN111918094B (en) Video processing method and device, electronic equipment and storage medium
CN112507715B (en) Method, device, equipment and storage medium for determining association relation between entities
CN107291828B (en) Spoken language query analysis method and device based on artificial intelligence and storage medium
CN108985358B (en) Emotion recognition method, device, equipment and storage medium
US10303768B2 (en) Exploiting multi-modal affect and semantics to assess the persuasiveness of a video
CN111221984A (en) Multimodal content processing method, device, equipment and storage medium
JP6361351B2 (en) Method, program and computing system for ranking spoken words
JP7106802B2 (en) Resource sorting method, method for training a sorting model and corresponding apparatus
CN112541076B (en) Method and device for generating expanded corpus in target field and electronic equipment
CN111428514A (en) Semantic matching method, device, equipment and storage medium
CN112528001B (en) Information query method and device and electronic equipment
CN115982376B (en) Method and device for training model based on text, multimode data and knowledge
CN112489676A (en) Model training method, device, equipment and storage medium
CN111639228A (en) Video retrieval method, device, equipment and storage medium
CN112507090A (en) Method, apparatus, device and storage medium for outputting information
JP2022518645A (en) Video distribution aging determination method and equipment
CN114860913B (en) Intelligent question-answering system construction method, question-answering processing method and device
CN113596601A (en) Video picture positioning method, related device, equipment and storage medium
CN112784009A (en) Subject term mining method and device, electronic equipment and storage medium
CN114495916B (en) Method, device, equipment and storage medium for determining insertion time point of background music
CN113704507A (en) Data processing method, computer device and readable storage medium
WO2022121684A1 (en) Alternative soft label generation
CN110390015B (en) Data information processing method, device and system
CN112487239A (en) Video retrieval method, model training method, device, equipment and storage medium
CN112712056A (en) Video semantic analysis method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant