CN114302225A - Video dubbing method, data processing method, device and storage medium

Info

Publication number
CN114302225A
Authority
CN
China
Prior art keywords
music
video
feature vector
list
data
Prior art date
Legal status
Pending
Application number
CN202111593542.0A
Other languages
Chinese (zh)
Inventor
邓俊祺
康力
陈思宇
熊子钦
王立波
陈颖
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202111593542.0A priority Critical patent/CN114302225A/en
Publication of CN114302225A publication Critical patent/CN114302225A/en
Pending legal-status Critical Current

Landscapes

  • Management Or Editing Of Information On Record Carriers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a video dubbing method, a data processing method, a device and a storage medium, wherein the method comprises the following steps: determining a feature vector and a label of a video to be dubbed according to the video; matching music for the video according to the feature vector and the label of the video and the feature vectors and the labels of the music in a music matching library; wherein the feature vector of the video or the music is used for characterizing the position of the video or the music in a cross-modal space of video and music. The method and the device can fuse the general information reflected by the label with the detailed content reflected by the feature vector, locate the dubbing range more accurately, improve the accuracy of video dubbing, and improve the user experience.

Description

Video dubbing method, data processing method, device and storage medium
Technical Field
The present application relates to the field of video processing technologies, and in particular, to a video dubbing method, a data processing method, a device, and a storage medium.
Background
With the continuous development of internet technology, video content is spread more and more widely. In various videos, the soundtrack is a very important element, and how to quickly and accurately dub music for a video has become a pressing problem.
At present, a commonly used method for dubbing music for a video is to determine a tag corresponding to the video, and then select a suitable piece of music from a music library according to the tag. However, the information carried by a tag can hardly reflect the details of the video, and therefore the accuracy of such video dubbing is poor.
Disclosure of Invention
The embodiments of the present application mainly aim to provide a video dubbing method, a data processing method, a device and a storage medium, so as to improve the accuracy of video dubbing.
In a first aspect, an embodiment of the present application provides a video dubbing method, including:
determining a feature vector and a label of a video according to the video to be dubbed;
matching music for the video according to the feature vectors and the labels of the video and the feature vectors and the labels of the music in the music matching library;
wherein the feature vector of the video or music is used for representing the position of the video or music in the cross-modal space of the video and music.
In a second aspect, an embodiment of the present application provides a video dubbing method, including:
acquiring a video shot by a user for a commodity to be evaluated or recommended, and determining a feature vector and a label of the video;
matching music for the video according to the feature vectors and the labels of the video and the feature vectors and the labels of the music in the music matching library;
displaying the video after the score corresponding to the commodity to be evaluated or recommended;
wherein the feature vector of the video or music is used for representing the position of the video or music in the cross-modal space of the video and music.
In a third aspect, an embodiment of the present application provides a data processing method, including:
determining a feature vector and a label of first data according to the first data to be processed;
selecting at least one second data from the plurality of second data to be fused with the first data according to the feature vector and the label of the first data and the feature vector and the label of the plurality of second data;
wherein a feature vector of the first or second data is used to characterize the position of the first or second data in a cross-modal space; the first data and the second data are any two of video, image, music, text, sensing data and scene characteristics.
In a fourth aspect, an embodiment of the present application provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor to cause the electronic device to perform the method of any of the above aspects.
In a fifth aspect, an embodiment of the present application provides a computer-readable storage medium, where computer-executable instructions are stored, and when a processor executes the computer-executable instructions, the method according to any one of the above aspects is implemented.
The video dubbing method, the data processing method, the device and the storage medium provided by the present application can determine the feature vector and the label of a video to be dubbed, and match music for the video according to the feature vector and the label of the video and the feature vector and the label of each piece of music in a music library, wherein the feature vector of the video or the music characterizes its position in the cross-modal space of video and music. Music matching the video can thus be retrieved in the cross-modal space, the rough information reflected by the label is fused with the detailed content reflected by the feature vector, the dubbing range is located more accurately, the accuracy of video dubbing is improved, and the user experience is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of a video and merchandise associated display provided by an embodiment of the present application;
fig. 3 is a schematic flowchart of a video dubbing method according to an embodiment of the present application;
fig. 4 is a schematic flow chart illustrating the process of dubbing the music for the video according to the feature vectors and the tags of the music in the dubbing library according to the embodiment of the present application;
fig. 5 is a schematic diagram of a video score according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a model training provided in an embodiment of the present application;
fig. 7 is a schematic flowchart of another video dubbing method according to an embodiment of the present application;
fig. 8 is a schematic flowchart of a training method for a video score network according to an embodiment of the present disclosure;
fig. 9 is a schematic flowchart of an information processing method according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
With the above figures, there are shown specific embodiments of the present application, which will be described in more detail below. These drawings and written description are not intended to limit the scope of the inventive concepts in any manner, but rather to illustrate the inventive concepts to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The terms referred to in this application are explained first:
dubbing music library: a library of a number of music suitable for use as a video soundtrack is stored.
Video dubbing: given a video, a list of music that is suitable as background music for that video is found from a library of soundtracks.
Video label: the label of a video, specifically the label output after the video passes through a classification network, which can be used to represent the category of the video.
Music label: the label of a piece of music, which can be used to indicate the category of the music; music labels generally include genre labels, mood labels, scene labels, and so on.
Label mapping: a process of performing similarity calculation between video labels and music labels through semantic correlation, so as to map one video label to at least one music label.
Video feature vector: the feature vector of the video, which is used for representing the position of the video in the "video-music" cross-modal space, may be a vector output after the video passes through a video feature vector extraction network.
Music feature vector: the feature vector of music, which is used to represent the position of music in the "video-music" cross-modal space, may be a vector output after music passes through the music feature vector extraction network.
Vector space retrieval: the process of a vector finding k adjacent vectors in its vector space. The video feature vector is used for searching the music feature vector, namely the process of searching k music feature vectors closest to the video feature vector in a cross-modal vector space. k may be a constant.
The following explains an application scenario and an inventive concept of the present application.
The method and the device provided by the embodiments of the present application can be applied to any scenario that requires video dubbing, such as dubbing short videos, dubbing video clips, dubbing product recommendation videos, dubbing videos in a user's photo album, and so on; any video dubbing scenario can use the scheme provided by the embodiments of the present application.
In the video consumption experience, the soundtrack is an important element. A good soundtrack can correctly render the content and emotion conveyed by the video and, together with the video, achieve a 1+1>2 effect; a poor soundtrack produces a 1+1<2 effect. Therefore, the soundtrack plays an extremely important role for a video.
In some designs, only the labels of the video or only the feature vectors of the video are considered when dubbing music. The information of a label is approximate: it can reflect general video information but cannot reflect details; in contrast, the information of a feature vector can characterize content details, but lacks the abstraction of the core label. With either of these two methods alone, it is difficult to accurately compute a suitable soundtrack.
In view of this, the embodiment of the present application provides a video dubbing method in which the basic emotion and style of the soundtrack are associated with the video scene at the label level, while the original content of the video and the soundtrack are associated in a cross-modal space, so that the final dubbing effect is better than schemes that consider only the label or only the feature vector, and the accuracy of video dubbing is improved.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application. As shown in fig. 1, in a shopping APP, a seller may shoot a video of a commodity to be recommended through a terminal device; the terminal device may determine a corresponding feature vector and a corresponding tag according to the video, select appropriate background music from a music score library, add the background music to the video, and then upload the video to a server and bind it with the recommended commodity, so that a consumer (buyer) may browse the videos corresponding to the commodities and select a commodity of interest through his or her own terminal device.
In other scenes, a consumer can shoot a video for a commodity to be evaluated, after background music is added to the video, the video can be uploaded to the cloud and bound with the commodity, and other consumers can know the evaluation of the consumer who has purchased the commodity on the commodity through the video, so that the self shopping is guided.
Fig. 2 is a schematic diagram of a video and commodity associated display provided in an embodiment of the present application. As shown in fig. 2, in the shopping APP, a link or other information of a corresponding commodity can be displayed below the video playing interface, so that a consumer watching a video can know the commodity corresponding to the video conveniently. Of course, the video and the product may have other associated display methods, for example, the corresponding video may also be added to the product detail page, and the corresponding video may be manually or automatically played when the consumer browses the product detail page.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The features of the embodiments and examples described below may be combined with each other without conflict between the embodiments. In addition, the sequence of steps in each method embodiment described below is only an example and is not strictly limited.
Fig. 3 is a schematic flowchart of a video dubbing method according to an embodiment of the present application. The execution subject of the method in this embodiment may be any device having a data processing function, such as a terminal device or a server. As shown in fig. 3, the method may include:
step 301, determining a feature vector and a label of a video to be dubbed according to the video.
Wherein the feature vector of the video is used for representing the position of the video in the cross-modal space of the video and the music. In the cross-modal space, the distance between the video and the music can be used for representing the matching degree of the video and the music, and the closer the distance is, the higher the matching degree is.
Alternatively, the feature vector may be a multi-dimensional vector that uniquely identifies the video. The tags of the videos may be used to represent categories of the videos.
And 302, matching music for the video according to the feature vectors and the labels of the video and the feature vectors and the labels of the music in the music matching library.
Wherein the feature vector of the music is used to characterize the position of the music in the cross-modal space of video and music.
Optionally, the music may be any type of sound such as a song, pure music, human voice, ambient sound, white noise, and the like. The music score library can contain a plurality of pieces of music, and after the feature vectors and labels of the music and the video are determined, music can be matched for the video according to these feature vectors and labels.
In practical application, the feature vectors and labels of music in the music score library can be predetermined, and when the video music score is needed, the feature vectors and labels of the video can be directly calculated and matched.
Alternatively, music whose tags are the same as or similar to those of the video and whose feature vector is close in distance to the feature vector of the video may be selected as the music recommended for the video.
The information of a label is often approximate and can reflect general information but not details; in contrast, the information of a feature vector can characterize content details, but lacks the abstraction of the core label. For example, a video of the casual-wear type, whose tag is "casual wear", can correspond to a wide range of music tags due to the lack of a detailed description. After the feature vector is added, a detailed description is added; this is not an explicit literal description, but adds information about the shooting scene, the tone, the characters and the like of the outfit video, which facilitates accurate positioning of the soundtrack range. Casual wear shot against a rural background and casual wear shot in a street-snap setting are obviously suited to different soundtracks. The embodiment can fuse the information of the feature vector and the information of the label, so as to obtain an accurate dubbing result.
In summary, the video music matching method provided by this embodiment may determine the feature vector and the tag of the video according to the video to be matched, and match the video according to the feature vector and the tag of the video and the feature vector and the tag of each piece of music in the music matching library, where the feature vector of the video or the piece of music is used to represent the position of the video or the piece of music in the cross-modal space of the video and the piece of music, so that the piece of music matched with the video can be retrieved in the cross-modal space, and the rough information reflected by the tag and the detailed content reflected by the feature vector are fused, so that the music matching range is more accurately located, the accuracy of the video music matching is improved, and the user experience is improved.
Fig. 4 is a schematic flow chart illustrating the process of dubbing the music for the video according to the feature vectors and the tags of the music in the dubbing library according to the embodiment of the present application. As shown in fig. 4, the dubbing music for the video according to the feature vector and the label of the video and the feature vector and the label of each music in the dubbing music library may include:
step 401, selecting a first music list from the music score library according to the feature vector of each music in the music score library and the feature vector of the video.
Alternatively, for each piece of music in the library of music scores, the degree of matching between the piece of music and the video may be calculated from the distance between the feature vector of the piece of music and the feature vector of the video, and the first music list may be selected from the pieces of music in the library of music scores according to the degree of matching.
For example, the feature vector of each music in the score library may be extracted in advance to obtain a score feature vector library. Correspondingly, selecting the first music list from the music score library according to the feature vector of each music in the music score library and the feature vector of the video may include: and according to the feature vector of the video, performing cross-modal vector retrieval in the dubbing music feature vector library, determining music with the distance between the feature vector and the feature vector of the video meeting the requirement, and forming a first music list.
The requirement may be set according to actual needs, for example, a distance threshold may be set, and if the distance threshold is smaller than the distance threshold, the requirement is considered to be satisfied, otherwise, the requirement is considered to be not satisfied. Alternatively, the distance may be a euclidean distance between the feature vector of the video and the feature vector of the music.
Because the feature vectors of the music and the video can represent their positions in the cross-modal space, at least one piece of music adjacent to the video to be dubbed can be quickly and accurately found in the cross-modal space by performing cross-modal vector retrieval in the dubbing music feature vector library, and the first music list can be formed from this adjacent music, which improves the efficiency and accuracy of determining the music list corresponding to the feature vector.
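As an illustration of this retrieval step, the following is a minimal sketch with assumed names: a brute-force Euclidean search stands in for whatever vector index a production system would use, the distance threshold is illustrative, and a precomputed music feature vector library is assumed to be held in memory.

import numpy as np

def build_first_music_list(video_vec, music_vecs, music_ids, dist_threshold=1.0):
    # Euclidean distance between the video feature vector and every music feature vector
    dists = np.linalg.norm(music_vecs - video_vec[None, :], axis=1)
    # Sort from nearest to farthest and keep only music whose distance satisfies the requirement
    order = np.argsort(dists)
    return [music_ids[i] for i in order if dists[i] < dist_threshold]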
Step 402, selecting a second music list from the score library according to the labels of the music in the score library and the labels of the videos.
Optionally, the music and the video may be matched according to the tags of the music and the tags of the video, and the matching music forms the second music list.
In an optional implementation manner, the tags of the music and the tags of the video may be tags that can be directly matched, for example, there are 100 tags that are selectable for the music, and the tags are consistent with the 100 tags that are selectable for the video, so that after the tags of the video to be dubbed are obtained, the music corresponding to the tags is directly searched.
In another alternative implementation, the tags for music and the tags for video may not be tags that can be directly matched. Illustratively, the tags of the videos can be tags conforming to the e-commerce platform style, and the tags of the music can be tags conforming to the music platform style, so that the tags of the videos and the music are difficult to directly correspond, and the music tags corresponding to the video tags can be determined by using a semantic mapping method.
Optionally, selecting the second music list from the score library according to the tags of the music in the score library and the tags of the videos may include: performing semantic mapping according to the labels of the video, and determining at least one music label matched with the labels of the video; and selecting a second music list corresponding to the matched at least one music label from the music score library according to the labels of the music in the music score library.
Through semantic mapping, at least one music label can be matched for the label of the video, labels of different platform styles can be fused, and the use requirements under different scenes are met.
In one example, the semantic mapping may be implemented by way of a table lookup. For example, a mapping table containing the correspondence between video tags and music tags may be set, and the music tags corresponding to the tags of the video to be dubbed may be determined by looking up the mapping table.
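A minimal sketch of such a table lookup follows; the table contents and names are illustrative examples rather than values taken from this application.

VIDEO_TO_MUSIC_TAGS = {
    "casual wear": ["relaxed", "pop", "daily life"],
    "outdoor sports": ["energetic", "electronic"],
}

def map_video_tag(video_tag):
    # Return the music tags mapped to this video tag; an empty list means no match
    return VIDEO_TO_MUSIC_TAGS.get(video_tag, [])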
In another example, semantic mapping may be implemented through a tag mapping network. Alternatively, a music tag mapping network and a video tag mapping network may be provided. Performing semantic mapping according to the tags of the video, and determining at least one music tag matched with the tags of the video, wherein the steps comprise: inputting the video label to a video label mapping network to obtain a feature vector corresponding to the video label; inputting each music label into a music label mapping network to obtain a characteristic vector corresponding to each music label; and determining at least one music label matched with the label of the video according to the distance between the characteristic vector of the label of the video and the characteristic vector corresponding to each music label.
The music label mapping network and the video label mapping network can respectively extract the feature vectors of the music labels and the feature vectors of the video labels, and the distance between the feature vectors can be used for representing the matching degree of the two labels. For a video to be dubbed, one or more music tags with the highest matching degree with the tags of the video can be determined, and corresponding music is selected according to the determined music tags to form a second music list. Or, a music tag with a matching degree greater than a certain threshold may be selected as the at least one matched music tag, if there is no music tag with a matching degree greater than the threshold, it may be considered that no tag is matched, and accordingly, the final tag matching result is that 0 music tags are matched, in this case, the dubbing list of the video may be determined directly based on the first music list corresponding to the feature vector, for example, the top k pieces of music may be selected from the first music list to form the dubbing list.
The music label mapping network and the video label mapping network are used for respectively extracting the feature vectors of the music labels and the feature vectors of the video labels, and the music labels matched with the video labels are determined according to the distance of the feature vectors, so that the semantic mapping accuracy can be effectively improved.
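The following sketch shows what the matching step could look like once both mapping networks have produced feature vectors; the function names, the use of Euclidean distance, and the matching threshold are assumptions for illustration.

import numpy as np

def match_music_tags(video_tag_vec, music_tag_vecs, music_tags, max_dist=0.5):
    # Distance between the video tag's feature vector and each music tag's feature vector
    dists = np.linalg.norm(music_tag_vecs - video_tag_vec[None, :], axis=1)
    # Keep tags whose matching degree is high enough (distance below max_dist);
    # the result may be empty, i.e. 0 matched music tags
    matched = [(music_tags[i], float(d)) for i, d in enumerate(dists) if d < max_dist]
    return sorted(matched, key=lambda pair: pair[1])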
Step 403, determining a score list of the video according to an intersection of the first music list and the second music list.
Optionally, after the first music list and the second music list are obtained, the first music list and the second music list may be fused, and a dubbing list of the video may be selected.
For example, in a case where an intersection of the first music list and the second music list is not empty, the score list may contain at least one music in the intersection. The music extracted from the intersection can meet the requirement that both the feature vector and the label are matched, the video music matching can be better realized, and the music matching accuracy is improved. In the case where the intersection of the first music list and the second music list is empty, music may be preferentially selected from the first music list.
The soundtrack list may include at least one music, and background music may be added to the video based on music in the soundtrack list or may be added to the video based on music selected by the user from the soundtrack list.
Fig. 5 is a schematic diagram of a video score according to an embodiment of the present application. As shown in fig. 5, information is extracted from the video to be dubbed along two paths:
In one path, a video feature vector is extracted from the video; a music feature vector is extracted from each piece of music in the music score library to obtain a music feature vector library; and cross-modal vector retrieval is performed on the video feature vector in the music feature vector library to obtain music list 1, namely the first music list.
In the other path, the video is classified to obtain a video label, and the video label is matched with at least one music label through semantic mapping. Illustratively, one video label may be matched with a plurality of music labels: music label 1, music label 2, music label 3, ..., each with a certain matching degree; the corresponding songs are extracted from the music score library according to these music labels and their matching degrees to obtain music list 2, namely the second music list. The first music list and the second music list are then fused to obtain the dubbing list.
Optionally, when the first music list and the second music list are merged, the first music list may be used as a reference.
In an alternative implementation, determining the score list of the video according to the intersection of the first music list and the second music list may include: taking an intersection of the first music list and the second music list, and sequencing the music in the intersection according to the matching degree of the music in the intersection and the feature vector of the video; if the number of the music in the intersection is larger than or equal to the preset number, taking the preset number of music to form a music matching list; and if the number of the music in the intersection is smaller than the preset number, taking the rest part of the music in the first music list and the intersection to form a music matching list.
The remaining music in the first music list refers to music in the first music list that does not belong to the intersection; when the number of pieces of music in the intersection is insufficient, some music may be selected from the remaining music to form the score list together with the intersection.
Exemplarily, assume that the first music list is M with length m, the second music list is N with length n, and the score list to be returned is K with length k. The intersection J of the first music list and the second music list, with length j, can be taken first. If j is greater than or equal to k, the first k songs of J can be used as the returned score list; if j < k, all of J is taken, and the remaining k-j songs are taken from M-J.
The music in M and the music in the intersection J may be sorted according to the degree of matching with the feature vector of the video; specifically, the smaller the distance between the feature vector of a piece of music and the feature vector of the video, the earlier it may be ranked. When the remaining k-j songs are taken from M-J, the k-j songs with the smallest distances can be taken from M-J based on the feature vector distances.
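A minimal sketch of this fusion logic follows, with M as the first music list, N as the second music list, and dist an assumed function giving the feature vector distance between a piece of music and the video; all names are illustrative.

def fuse_score_list(M, N, k, dist):
    n_set = set(N)
    # Intersection of the two lists, sorted with the nearest feature vectors first
    J = sorted((m for m in M if m in n_set), key=dist)
    if len(J) >= k:
        return J[:k]
    # Top up with the remaining music in M (i.e. M - J), nearest first
    j_set = set(J)
    rest = sorted((m for m in M if m not in j_set), key=dist)
    return J + rest[:k - len(J)]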
The intersection of the first music list and the second music list is selected, the music meeting the requirement is selected from the intersection according to the matching degree of the first music list and the second music list, the music with the number meeting the requirement is selected to form the dubbing list, the music with matched labels and approximate feature vectors can be selected to the maximum extent, and the accuracy of video dubbing is further improved.
In an alternative implementation, if the tags of the video match at least two music tags, each matching music tag has its corresponding second music list; determining the score list of the video according to the intersection of the first music list and the second music list may include: according to the priorities of the matched music tags, sequentially performing the following processing on the second music list corresponding to each music tag until the number of pieces of music in the score list reaches a preset number: taking the intersection of the first music list and that second music list, and adding at least part of the music in the intersection to the score list; wherein the priority of a music tag is determined by the matching degree between the music tag and the tags of the video.
For example, for a video to be dubbed, the matching degree of each music tag with the tags of the video may be calculated, and a fixed number of music tags with the highest matching degree may be selected from the music tags, or a plurality of music tags with matching degree higher than the threshold matching degree may be selected as the music tags matching with the tags of the video. The priority corresponding to each music label is determined by the matching degree of the music label and the label of the video, and the higher the matching degree is, the higher the priority is.
Assume that the first music list is M with length m, that there are L music labels matched with the video, that the second music list corresponding to the i-th label is Ni with length ni, where i = 1, 2, ..., L, and that the score list to be returned is K with length k. The L second music lists may be processed sequentially according to priority.
Specifically, the second music list N1 with the highest priority, that is, the highest matching degree, may be processed first: the intersection J1 of N1 and M is taken, with length j1. If j1 is greater than or equal to k, the first k pieces of music of J1 may be used as the returned score list; if j1 < k, all of J1 is added to the score list, and then the second music list N2 with the next priority is processed.
When Ni (i > 1) is processed, the intersection Ji of Ni and M is taken, with length ji. If ji is greater than or equal to the remaining quantity, the top pieces of music in Ji are added to the score list until it reaches length k, where the remaining quantity equals k minus the number of pieces of music currently in the score list; if ji is less than the remaining quantity, all of Ji is added to the score list, and the next second music list is processed, until the number of pieces of music in the score list reaches k.
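The priority-ordered processing described above can be sketched as follows; second_lists is assumed to hold the second music lists ordered by tag priority (highest matching degree first), and skipping duplicates across lists is an added assumption for clarity.

def fuse_multi_tag_lists(M, second_lists, k, dist):
    m_set, score_list, seen = set(M), [], set()
    for Ni in second_lists:
        # Intersection Ji of this second list with M, nearest feature vectors first
        Ji = sorted((m for m in Ni if m in m_set and m not in seen), key=dist)
        remaining = k - len(score_list)          # how many pieces are still needed
        chosen = Ji[:remaining]
        score_list.extend(chosen)
        seen.update(chosen)
        if len(score_list) >= k:
            break
    return score_list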
By selecting a plurality of matched music labels for the video, processing the corresponding second music lists in order of priority, and adding the music in the intersection of each second music list and the first music list to the score list, the variety of the dubbing can be improved and the personalized requirements of different users can be met.
In other optional implementations, for each second music list, intersecting the first music list and the second music list, and adding at least part of the music in the intersection to the score list may include: determining the ratio of the number of the music selected from the second music list to the preset number according to the priority of the music label corresponding to the second music list; and selecting corresponding amount of music from the second music list to add to the score list according to the corresponding ratio of the second music list.
For example, when performing tag matching, the top 3 music tags with the highest matching degree are selected, with corresponding ratios of 50%, 30% and 20%. Assuming that the finally returned score list contains k pieces of music, 50%×k pieces of music may be selected from the second music list corresponding to the music tag with the highest matching degree, 30%×k pieces from the second music list corresponding to the music tag with the second highest matching degree, and 20%×k pieces from the second music list corresponding to the music tag with the third highest matching degree, so as to form a score list containing k pieces of music.
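A sketch of this ratio-based selection under the same assumptions (each quota is taken from the intersection of that second music list with the first music list M, the ratios are illustrative, and duplicates are skipped):

def fuse_by_ratio(M, second_lists, ratios, k):
    m_set, score_list = set(M), []
    for Ni, ratio in zip(second_lists, ratios):  # e.g. ratios = [0.5, 0.3, 0.2]
        quota = round(ratio * k)                 # share of the k slots for this tag's list
        candidates = [m for m in Ni if m in m_set and m not in score_list]
        score_list.extend(candidates[:quota])
    return score_list[:k]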
By setting a corresponding quantity ratio for each of the plurality of matched music labels, the second music lists corresponding to music labels of different priorities can be considered in a comprehensive and balanced manner, which improves the balance and comprehensiveness of the video dubbing.
In this embodiment, a first music list is selected from the music library according to the feature vector of each music in the music library and the feature vector of the video, a second music list is selected from the music library according to the tag of each music in the music library and the tag of the video, and the music list of the video is determined according to the intersection of the first music list and the second music list, so that the feature vector and the tag can be quickly fused, and the finally obtained music list meets both the requirement of tag matching and the requirement that the feature vector is closer in distance, and both the accuracy and the efficiency are considered.
In one or more embodiments of the present application, the label and the feature vector may be determined by a classification network and a feature vector extraction network.
Optionally, the music classification network and the video classification network may be trained through training samples; and after the trained music classification network and video classification network are obtained, training a music characteristic vector extraction network and a video characteristic vector extraction network based on the music classification network and the video classification network.
The input of the music characteristic vector extraction network is the output of the middle layer of the music classification network, and the input of the video characteristic vector extraction network is the output of the middle layer of the video classification network; the music classification network is used for determining labels of music, the video classification network is used for determining labels of videos, the music feature vector extraction network is used for extracting feature vectors of music, and the video feature vector extraction network is used for extracting feature vectors of videos.
Optionally, the training samples may include music and a tag real value corresponding thereto, and the video and a tag real value corresponding thereto, and the tag real value may be determined by a manual marking method. According to the actual values of the labels of the music and the videos, the music classification network and the video classification network can be trained respectively. The music classification network and the video classification network, the music feature vector extraction network and the video feature vector extraction network can be any type of deep learning network, such as a convolutional neural network and the like.
The music classification network and the video classification network may each comprise multiple layers. For example, the music classification network may include a plurality of sequentially connected sub-networks, and the sub-network may be any type of network capable of extracting features, such as a ResNet network or other CNN network, and each sub-network may serve as a layer, or each convolutional layer in the network may serve as a layer.
Therefore, the input of the music feature vector extraction network is the output of the middle layer of the music classification network, and it can also be understood that the music classification network includes a first network and a second network, the first network is used for extracting the features of the music, the second network is used for determining the labels of the music according to the features extracted by the first network, and the input of the music feature vector extraction network is the output of the first network, and the feature vector of the music can be determined according to the features extracted by the first network. The form of the first network and the second network is not limited.
Optionally, one output may be led out from any layer in the middle of the music classification network as the input of the music feature vector extraction network. Illustratively, the features output by the 2nd layer of the music classification network can be input into the music feature vector extraction network to obtain the feature vector of the music. The network structure and training method for video are similar to those for music, and are not described herein again.
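A minimal PyTorch sketch of this structure follows; the class names, layer counts and layer sizes are illustrative assumptions, not the actual networks of this application. The classification network exposes its middle-layer features, and a separate feature vector extraction network maps them into the cross-modal space. The video-side networks would be built analogously.

import torch.nn as nn
import torch.nn.functional as F

class MusicClassifier(nn.Module):
    def __init__(self, in_dim=128, hidden=256, num_tags=100):
        super().__init__()
        # "First network": extracts features from the music input
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # "Second network": predicts the music label from the extracted features
        self.head = nn.Linear(hidden, num_tags)

    def forward(self, x):
        feats = self.backbone(x)          # middle-layer output
        return self.head(feats), feats    # label logits and middle-layer features

class MusicFeatureVectorNet(nn.Module):
    def __init__(self, hidden=256, embed_dim=64):
        super().__init__()
        self.proj = nn.Linear(hidden, embed_dim)

    def forward(self, middle_feats):
        # Map the classifier's middle-layer features into the cross-modal space
        return F.normalize(self.proj(middle_feats), dim=-1)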
Optionally, before the video data and the audio data are input to the network, the video data and the audio data may be further processed by an algorithm such as digital signal processing, for example, time domain and frequency domain conversion is performed on the audio data, and the processed video data and audio data are input to a corresponding network for feature extraction.
By training the music classification network and the video classification network and then training the music characteristic extraction network and the video characteristic extraction network according to the trained music classification network and video classification network, the joint training process of the characteristic extraction network can be guided by the knowledge learned by pre-training, and the efficiency and the accuracy of training the characteristic extraction network are improved.
Optionally, training the music feature vector extraction network and the video feature vector extraction network based on the music classification network and the video classification network may include: the method comprises the steps of extracting the distance between a feature vector output by a network and a feature vector output by a video feature vector extraction network, a label predicted value output by a music classification network and a label predicted value output by the video classification network through the music feature vector, constructing a loss function, and training the music feature vector extraction network and the video feature vector extraction network based on the loss function according to a positive sample group and a negative sample group which are obtained in advance.
Wherein the positive samples comprise matched music and video, and the negative samples comprise unmatched music and video; the distance between the loss function and the positive sample is in a positive correlation relationship, and the distance between the loss function and the negative sample is in a negative correlation relationship; and the difference between the predicted value and the true value of the label is in positive correlation with the loss function.
Optionally, the positive sample group may include a plurality of positive samples, the negative sample group may include a plurality of negative samples, and a plurality of pieces of music and a plurality of pieces of video may be obtained in advance and divided into a plurality of positive and negative samples by manual labeling. The music and video in the positive sample group and the negative sample may be the same as or different from the music and video in the training sample when training the classification network.
The music characteristic vector extraction network and the video characteristic vector extraction network can be trained simultaneously, the loss function during training can consider the distance between the extracted characteristic vectors of music and video, and the parameters of the two characteristic vector extraction networks are optimized during training, so that the smaller the distance between the characteristic vectors of music and video in a positive sample is, the better the distance between the characteristic vectors of music and video in a negative sample is, the larger the distance between the characteristic vectors of music and video in a negative sample is, the better the distance between the characteristic vectors of music and video in a negative sample is. The closer the distance between the feature vectors of music and video is, the higher the similarity between the feature vectors of music and video is, and the closer the position in the cross-modal space is, the higher the matching degree between music and video is.
In the training process, besides the distance between the feature vectors, the labels of music and videos can be considered, and optionally, the loss function may include at least the following two parts: the difference between the predicted value of the music label and the real value and the difference between the predicted value of the video label and the real value aim to ensure that the predicted value of the music label or the video label output by the video classification network is closer to the real value of the label as well as better.
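One way to write such a loss is sketched below; the margin-based term for negative samples and the equal weighting are assumptions, since the description only fixes the correlations (positive-pair distance and label error enter positively, negative-pair distance negatively). d_pos and d_neg are feature vector distances for positive and negative pairs; the logits and labels are the classifier outputs and true tags for music and video.

import torch.nn.functional as F

def joint_loss(d_pos, d_neg, music_logits, music_labels, video_logits, video_labels,
               margin=1.0, w_vec=1.0, w_cls=1.0):
    # Feature vector term: small distances for positive pairs, distances beyond a margin for negative pairs
    vec_loss = d_pos.mean() + F.relu(margin - d_neg).mean()
    # Label term: classification error of both the music and the video classification networks
    cls_loss = F.cross_entropy(music_logits, music_labels) + \
               F.cross_entropy(video_logits, video_labels)
    return w_vec * vec_loss + w_cls * cls_loss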
Fig. 6 is a schematic diagram illustrating a principle of model training according to an embodiment of the present application. As shown in fig. 6, video data is input to a video classification network, resulting in video tags. After the video data passes through a part of the video classification network, data obtained from the middle layer of the video classification network can be input into the video feature vector extraction network to obtain the feature vector of the video. Similar processing is performed on music, so that labels and feature vectors of the music can be obtained. The label obtained here can be a label prediction value.
During training, the video classification network and the music classification network can be trained first, and then the video classification network, the music classification network, the video feature vector extraction network and the music feature vector extraction network are jointly trained; the loss function considered in the joint training includes the distance between feature vectors, the labels of the music and the video, and the like.
After the joint training is completed, the feature vectors of videos and music can be extracted using the video feature vector extraction network and the music feature vector extraction network obtained after the joint training; the extraction process is similar to that in training. When determining the labels of videos and music, either the classification networks after the joint training or the classification networks before the joint training can be used.
Through this joint training method, the feature vectors of matched music and video are brought as close as possible, and the feature vectors of unmatched music and video are pushed as far apart as possible; in addition, during training, the feature vector extraction networks are constrained by the true labels, which can optimize the distribution of the extracted feature vectors and improve the diversity and accuracy of video dubbing.
The embodiment of the application also provides a video dubbing method associated with the commodity. Fig. 7 is a flowchart illustrating another video dubbing method according to an embodiment of the present application. As shown in fig. 7, the method includes:
step 701, acquiring a video shot by a user for a commodity to be evaluated or recommended, and determining a feature vector and a label of the video.
Optionally, the user may be a seller or a consumer, the user may take a video for a target commodity, and the target commodity may be a commodity to be evaluated or recommended.
And step 702, matching music for the video according to the feature vectors and the labels of the video and the feature vectors and the labels of the music in the music matching library.
Optionally, the video may be dubbed based on the method described in any of the above embodiments.
And step 703, displaying the video after the score corresponding to the commodity to be evaluated or recommended.
Wherein the feature vector of the video or music is used for representing the position of the video or music in the cross-modal space of the video and music.
Optionally, the video after the score is dubbed may be used for performing associated display with at least one of a link, a name, and a keyword of the commodity. An example of this presentation can be seen in fig. 2.
The video dubbing method provided by the embodiment can acquire videos shot by users for commodities to be evaluated or recommended, determine feature vectors and labels of the videos, dub the videos according to the feature vectors and the labels of the videos and feature vectors and labels of music in a dubbing music library, and display the dubbed videos corresponding to the commodities to be evaluated or recommended, wherein the feature vectors of the videos or the music are used for representing positions of the videos or the music in a cross-modal space of the videos and the music, so that the accuracy of video dubbing can be improved, and the display effect of products can be improved.
Fig. 8 is a flowchart illustrating a training method for a video score network according to an embodiment of the present application. As shown in fig. 8, the method includes:
step 801, obtaining a training sample.
Step 802, training a video score model according to the training samples, wherein the video score model comprises a video feature vector extraction network, a music feature vector extraction network, a video classification network and a music classification network.
The video characteristic vector extraction network and the music characteristic vector extraction network are respectively used for extracting a characteristic vector of a video and a characteristic vector of music; the feature vector of the video or music is used for representing the position of the video or music in the cross-modal space of the video and music; the video classification network and the music classification network are used for determining labels of videos and labels of music, respectively.
The trained video score model is used for determining the feature vector and the label of the video to be scored and the feature vector and the label of each music in the score library so as to realize video score.
Optionally, training the video score model according to the training samples includes: training the music classification network and the video classification network through training samples; and after the trained music classification network and video classification network are obtained, training the music feature vector extraction network and the video feature vector extraction network based on the music classification network and the video classification network; wherein the input of the music feature vector extraction network is the output of the middle layer of the music classification network, and the input of the video feature vector extraction network is the output of the middle layer of the video classification network.
Optionally, training the music feature vector extraction network and the video feature vector extraction network based on the music classification network and the video classification network includes: constructing a loss function from the distance between the feature vector output by the music feature vector extraction network and the feature vector output by the video feature vector extraction network, the label predicted value output by the music classification network, and the label predicted value output by the video classification network, and training the music feature vector extraction network and the video feature vector extraction network based on the loss function according to a pre-obtained positive sample group and negative sample group; wherein a positive sample contains matched music and video, and a negative sample contains unmatched music and video; the loss function is positively correlated with the feature vector distance of a positive sample, negatively correlated with the feature vector distance of a negative sample, and positively correlated with the difference between the predicted label value and the true label value.
The implementation principle and the technical effect of the model training method provided by this embodiment may be referred to in the foregoing embodiments, and are not described herein again.
Fig. 9 is a schematic flowchart of an information processing method according to an embodiment of the present application. As shown in fig. 9, the method includes:
step 901, determining a feature vector and a label of first data according to the first data to be processed.
And 902, selecting at least one second data from the plurality of second data to be fused with the first data according to the feature vector and the label of the first data and the feature vector and the label of the plurality of second data.
Wherein a feature vector of the first or second data is used to characterize the position of the first or second data in a cross-modal space; the first data and the second data are any two of video, image, music, text, sensing data and scene characteristics.
The foregoing embodiment has described the example where the first data is a video and the second data is music, on this basis, the first data and the second data may be replaced by other information, and fusion of the first data and the second data is achieved based on a similar principle. Optionally, the fusion may refer to fusion in any manner, for example, simultaneous playing/displaying, splicing together for playing/displaying, and the like.
In one example, the first data is a video and the second data is a text, which can be implemented as a video text.
In another example, the first data is music and the second data is an image, which can be implemented as a music background image.
In another example, the first data is sensing data and the second data is music. The sensing data may be vehicle driving speed, road information, roadside device information, surrounding obstacle information, and the like, and is matched with corresponding music: warm and relaxing music under better road conditions, and more invigorating music under road conditions with more obstacles, so that the user can be more directly immersed in the current driving environment, improving the user experience.
In another example, the first data may be a video and the second data may be scene characteristics. A user may shoot a video containing a portrait; the portrait in the video is extracted, and according to characteristics of the portrait such as clothing, actions and expressions, corresponding scene characteristics are added, such as a garden scenery scene or a city leisure scene. The user can thus add various scenes to the video without actually going to a specific location, and the recommended scene matches the video shot by the user more closely, which improves both the efficiency of shooting videos and the user experience.
In the information processing method provided by this embodiment, a feature vector and a tag of first data may be determined according to the first data to be processed, and at least one second data is selected from the plurality of second data to be fused with the first data according to the feature vector and the tag of the first data and the feature vector and the tag of the plurality of second data, where the feature vector of the first data or the second data is used to characterize a position of the first data or the second data in a cross-modal space; the first data and the second data are any two data of video, image, music, text, sensing data and scene characteristics, and proper second data can be selected for the first data according to the label and the characteristic vector for fusion, so that the accuracy of information fusion is improved.
In the information processing method, optionally, selecting at least one second data from the plurality of second data to be fused with the first data according to the feature vector and the label of the first data and the feature vector and the label of the plurality of second data, and the method includes:
selecting a second data list corresponding to the feature vector according to the feature vector of each second data and the feature vector of the first data;
selecting a second data list corresponding to the label according to the label of each second data and the label of the first data;
and selecting at least one second data to be fused with the first data according to the intersection of the second data list corresponding to the feature vector and the second data list corresponding to the label.
Optionally, selecting at least one second data to be fused with the first data according to the intersection of the second data list corresponding to the feature vector and the second data list corresponding to the tag includes:
taking the intersection of the second data list corresponding to the feature vector and the second data list corresponding to the tag, and sorting the second data in the intersection by the degree to which each item's feature vector matches the feature vector of the first data;
if the number of second data items in the intersection is greater than or equal to a preset number, taking the preset number of second data items and fusing them with the first data;
and if the number of second data items in the intersection is less than the preset number, supplementing the intersection with remaining second data items from the second data list corresponding to the feature vector and fusing the result with the first data (see the sketch after this list).
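By way of illustration only, this intersection-with-fallback selection may be sketched as follows in Python; the `match_score` helper, the list contents, and the preset number are assumptions of the example rather than part of the method:

```python
def select_second_data(vector_list, tag_list, first_vector, preset_count, match_score):
    """Select second data by intersecting the vector-based and tag-based candidate lists.

    vector_list / tag_list: candidate second-data items chosen by feature-vector
    distance and by tag matching, respectively.
    match_score(item, first_vector): degree to which an item's feature vector matches
    the feature vector of the first data (higher is better) -- an assumed helper.
    """
    tag_set = set(tag_list)

    # Intersection of the two lists, sorted by feature-vector match degree
    intersection = sorted(
        (item for item in vector_list if item in tag_set),
        key=lambda item: match_score(item, first_vector),
        reverse=True,
    )
    if len(intersection) >= preset_count:
        # Enough candidates in the intersection: take the preset number
        return intersection[:preset_count]

    # Shortfall: supplement with the remaining items from the vector-based list
    chosen = set(intersection)
    remainder = [item for item in vector_list if item not in chosen]
    remainder.sort(key=lambda item: match_score(item, first_vector), reverse=True)
    return intersection + remainder[:preset_count - len(intersection)]
```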
Optionally, if the tag of the first data matches at least two second data tags, each matching second data tag has a corresponding second data list;
selecting at least one second data to be fused with the first data according to the intersection of the second data list corresponding to the feature vector and the second data list corresponding to the tag includes:
and according to the priorities of the matched second data labels, sequentially carrying out the following processing on the second data list corresponding to each second data label until the quantity of the selected second data reaches a preset quantity:
taking an intersection of a second data list corresponding to the feature vector and a second data list corresponding to the second data label, and selecting at least part of second data in the intersection;
wherein the priority of the second data tag is determined by the matching degree of the second data tag and the tag of the first data.
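Purely as an illustrative sketch, and assuming the per-tag candidate lists and their priorities have already been computed, the priority-ordered processing described above might look like:

```python
def select_by_tag_priority(vector_list, tag_lists, preset_count):
    """Process per-tag candidate lists in priority order until enough second data is selected.

    vector_list: the second data list corresponding to the feature vector.
    tag_lists: iterable of (priority, second_data_list) pairs; the priority reflects how
    well that second data tag matches the tag of the first data (assumed precomputed).
    """
    vector_set = set(vector_list)
    selected = []
    # Higher priority (better tag match) is processed first
    for _, candidates in sorted(tag_lists, key=lambda pair: pair[0], reverse=True):
        if len(selected) >= preset_count:
            break
        # Intersection of the vector-based list and this tag's list, skipping duplicates
        intersection = [c for c in candidates if c in vector_set and c not in selected]
        selected.extend(intersection[:preset_count - len(selected)])
    return selected
```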
Optionally, the method further includes: extracting the feature vector of each second data to obtain a feature vector library;
correspondingly, selecting a second data list corresponding to the feature vector according to the feature vector of each second data and the feature vector of the first data, including:
and performing cross-modal vector retrieval in the feature vector library according to the feature vector of the first data, determining second data whose feature vectors are within the required distance of the feature vector of the first data, and forming the second data list corresponding to the feature vector.
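As an illustrative sketch only, the cross-modal retrieval step described above could be implemented with a brute-force nearest-neighbour search as below; an approximate nearest-neighbour index would typically replace this in a large library, and the distance threshold and top-k value are assumptions of the example:

```python
import numpy as np

def retrieve_by_vector(first_vector, library_vectors, library_ids, top_k=50, max_distance=None):
    """Cross-modal retrieval: find second data whose feature vectors lie close to the
    feature vector of the first data in the shared cross-modal space.

    first_vector: (d,) feature vector of the first data.
    library_vectors: (n, d) feature vector library of the second data.
    library_ids: identifiers of the n second data items.
    """
    # Euclidean distance between the first-data vector and every library vector
    distances = np.linalg.norm(library_vectors - first_vector[None, :], axis=1)
    order = np.argsort(distances)[:top_k]
    if max_distance is not None:
        order = [i for i in order if distances[i] <= max_distance]
    # The resulting list is the "second data list corresponding to the feature vector"
    return [library_ids[i] for i in order]
```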
Optionally, the method further includes:
training a second data classification network and a first data classification network through training samples;
after the trained second data classification network and the trained first data classification network are obtained, training a second data feature vector extraction network and a first data feature vector extraction network based on the second data classification network and the first data classification network;
the input of the second data characteristic vector extraction network is the output of the middle layer of the second data classification network, and the input of the first data characteristic vector extraction network is the output of the middle layer of the first data classification network;
the second data classification network is used for determining the label of second data, the first data classification network is used for determining the label of first data, the second data feature vector extraction network is used for extracting the feature vector of the second data, and the first data feature vector extraction network is used for extracting the feature vector of the first data.
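One possible wiring of these networks is sketched below in PyTorch, with the feature vector extraction network fed by the middle-layer output of the classification network; the backbone, layer sizes, and tag count are illustrative assumptions, not part of the method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassificationNet(nn.Module):
    """Classification network: a backbone whose output serves as the 'middle layer',
    followed by a tag prediction head."""
    def __init__(self, in_dim=512, mid_dim=256, num_tags=30):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, mid_dim), nn.ReLU())
        self.head = nn.Linear(mid_dim, num_tags)

    def forward(self, x):
        mid = self.backbone(x)        # middle-layer output
        return self.head(mid), mid    # tag logits plus the middle-layer features

class FeatureVectorNet(nn.Module):
    """Feature vector extraction network: maps the middle-layer output of a trained
    classification network into the shared cross-modal space."""
    def __init__(self, mid_dim=256, embed_dim=128):
        super().__init__()
        self.proj = nn.Linear(mid_dim, embed_dim)

    def forward(self, mid):
        # L2-normalised so distances are comparable across modalities
        return F.normalize(self.proj(mid), dim=-1)

# Stage 1: train the two classification networks; Stage 2: feed their middle-layer
# outputs into the feature vector extraction networks and train those.
first_cls, second_cls = ClassificationNet(), ClassificationNet()
first_feat, second_feat = FeatureVectorNet(), FeatureVectorNet()
```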
Optionally, training the second data feature vector extraction network and the first data feature vector extraction network based on the second data classification network and the first data classification network includes:
constructing a loss function from the distance between the feature vector output by the second data feature vector extraction network and the feature vector output by the first data feature vector extraction network, together with the tag prediction values output by the second data classification network and the first data classification network, and training the second data feature vector extraction network and the first data feature vector extraction network based on the loss function according to pre-obtained positive and negative sample groups;
wherein the positive samples comprise matched second data and first data, and the negative samples comprise unmatched second data and first data;
the loss function is positively correlated with the feature-vector distance of a positive sample pair and negatively correlated with the feature-vector distance of a negative sample pair; the loss function is also positively correlated with the difference between the predicted tag value and the true tag value.
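A plausible concrete form of such a loss is sketched below; the margin, the weighting of the classification terms, and the use of a contrastive-style distance term are assumptions of the example, chosen only to satisfy the stated correlations:

```python
import torch
import torch.nn.functional as F

def combined_loss(first_vec, second_vec, is_positive,
                  first_logits, first_tags, second_logits, second_tags,
                  margin=0.5, cls_weight=1.0):
    """Loss over a batch of (first data, second data) pairs.

    first_vec / second_vec: (B, d) outputs of the two feature vector extraction networks.
    is_positive: (B,) float tensor, 1.0 for matched pairs (positive samples) and 0.0
    for unmatched pairs (negative samples).
    *_logits / *_tags: tag predictions and ground-truth tag indices for each modality.
    """
    dist = F.pairwise_distance(first_vec, second_vec)            # cross-modal distance
    # Positive pairs: loss grows with distance; negative pairs: loss grows as distance shrinks
    contrastive = is_positive * dist.pow(2) + (1.0 - is_positive) * F.relu(margin - dist).pow(2)
    # The gap between predicted and true tags also increases the loss
    cls = F.cross_entropy(first_logits, first_tags) + F.cross_entropy(second_logits, second_tags)
    return contrastive.mean() + cls_weight * cls
```

The margin term in this sketch pushes unmatched pairs apart in the cross-modal space while matched pairs are pulled together, which is one standard way to realise the positive/negative correlations described above.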
Optionally, selecting a second data list corresponding to the tag according to the tag of each second data and the tag of the first data, where the selecting includes:
performing semantic mapping according to the tags of the first data, and determining at least one second data tag matched with the tags of the first data;
and selecting a second data list corresponding to the matched at least one second data label according to the label of each second data.
Optionally, performing semantic mapping according to the tag of the first data, and determining at least one second data tag matched with the tag of the first data, including:
inputting the label of the first data into a first data label mapping network to obtain a feature vector corresponding to the label of the first data;
inputting each second data label into a second data label mapping network to obtain a feature vector corresponding to each second data label;
and determining at least one second data label matched with the label of the first data according to the distance between the characteristic vector of the label of the first data and the characteristic vector corresponding to each second data label.
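As a minimal sketch, and assuming the two tag mapping networks have already produced the tag embeddings, the matching step described above could be:

```python
import numpy as np

def match_tags(first_tag_vec, second_tag_vecs, second_tag_names, top_k=3):
    """Match the tag of the first data to second data tags by distance in the tag
    embedding space.

    first_tag_vec: embedding produced by the first data tag mapping network.
    second_tag_vecs: (n, d) embeddings produced by the second data tag mapping network.
    second_tag_names: the n candidate second data tags.
    Both mapping networks are assumed to be already trained.
    """
    distances = np.linalg.norm(second_tag_vecs - first_tag_vec[None, :], axis=1)
    order = np.argsort(distances)[:top_k]
    # The closest second data tags are treated as matching the tag of the first data
    return [second_tag_names[i] for i in order]
```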
For the implementation principle and the technical effect of each data processing method provided by this embodiment, reference may be made to the foregoing embodiments, which are not described herein again.
The methods provided by the embodiments of the present application may be applied to a server or to a terminal device, or some of the steps may be deployed on the server and the remaining steps on the terminal device.
Corresponding to the above method, an embodiment of the present application further provides a video dubbing apparatus, including:
the first determination module is used for determining a feature vector and a label of a video according to the video to be dubbed;
the first score module is used for matching music for the video according to the feature vector and the label of the video and the feature vector and the label of each music in the music matching library;
wherein the feature vector of the video or music is used for representing the position of the video or music in the cross-modal space of the video and music.
An embodiment of the present application further provides a video dubbing apparatus, including:
the first acquisition module is used for acquiring a video shot by a user for a commodity to be evaluated or recommended, and determining a feature vector and a label of the video;
the second score module is used for matching music for the video according to the feature vector and the label of the video and the feature vector and the label of each music in the music matching library;
the display module is used for displaying the video with its matched music in correspondence with the commodity to be evaluated or recommended;
wherein the feature vector of the video or music is used for representing the position of the video or music in the cross-modal space of the video and music.
The embodiment of the present application further provides a training apparatus for a video dubbing network, including:
the second acquisition module is used for acquiring a training sample;
the training module is used for training a video dubbing network according to the training samples, where the video dubbing network comprises a video feature vector extraction network, a music feature vector extraction network, a video classification network and a music classification network; the video feature vector extraction network and the music feature vector extraction network are used for extracting the feature vector of a video and the feature vector of music, respectively; the feature vector of the video or music is used for representing the position of the video or music in the cross-modal space of video and music; the video classification network and the music classification network are used for determining the labels of videos and the labels of music, respectively;
the trained video dubbing network is used for determining the feature vector and the label of the video to be dubbed and the feature vector and the label of each music in the music matching library, so as to realize video dubbing.
An embodiment of the present application further provides a data processing apparatus, including:
the second determining module is used for determining a feature vector and a label of the first data according to the first data to be processed;
the fusion module is used for selecting at least one second data from the plurality of second data to be fused with the first data according to the feature vector and the label of the first data and the feature vector and the label of the plurality of second data;
wherein a feature vector of the first or second data is used to characterize the position of the first or second data in a cross-modal space; the first data and the second data are any two of video, image, music, text, sensing data and scene characteristics.
For specific implementation principles and technical effects of the devices provided in the embodiments of the present application, reference may be made to the foregoing embodiments, which are not described herein again.
Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 10, the electronic device of the present embodiment may include:
at least one processor 1001; and
a memory 1002 communicatively coupled to the at least one processor;
the memory 1002 stores instructions executable by the at least one processor 1001, and the instructions are executed by the at least one processor 1001 to cause the electronic device to perform the method according to any one of the embodiments.
Alternatively, the memory 1002 may be separate or integrated with the processor 1001.
For the implementation principle and the technical effect of the electronic device provided by this embodiment, reference may be made to the foregoing embodiments, which are not described herein again.
The embodiment of the present application further provides a computer-readable storage medium, in which computer-executable instructions are stored, and when a processor executes the computer-executable instructions, the method described in any one of the foregoing embodiments is implemented.
The present application further provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the method described in any of the foregoing embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules is only one logical division, and other divisions may be realized in practice, for example, a plurality of modules may be combined or integrated into another system, or some features may be omitted, or not executed.
The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some steps of the methods described in the embodiments of the present application.
It should be understood that the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the methods disclosed in the embodiments of the present application may be performed directly by a hardware processor, or by a combination of hardware and software modules in the processor. The memory may include a high-speed RAM and may further include non-volatile memory (NVM), such as at least one magnetic disk; it may also be a USB flash drive, a removable hard disk, a read-only memory, a magnetic disk, an optical disc, or the like.
The storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). Alternatively, the processor and the storage medium may reside as discrete components in an electronic device or host device.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are included in the scope of the present application.

Claims (12)

1. A method for video dubbing music, comprising:
determining a feature vector and a label of a video according to the video to be dubbed;
matching music for the video according to the feature vectors and the labels of the video and the feature vectors and the labels of the music in the music matching library;
wherein the feature vector of the video or music is used for representing the position of the video or music in the cross-modal space of the video and music.
2. The method of claim 1, wherein matching music for the video according to the feature vectors and the labels of the video and the feature vectors and the labels of the music in the music matching library comprises:
selecting a first music list from a music score library according to the feature vector of each music in the music score library and the feature vector of the video;
selecting a second music list from the score library according to the labels of the music in the score library and the labels of the videos;
and determining a dubbing list of the video according to the intersection of the first music list and the second music list.
3. The method of claim 2, wherein determining the score list for the video according to the intersection of the first music list and the second music list comprises:
taking an intersection of the first music list and the second music list, and sequencing the music in the intersection according to the matching degree of the music in the intersection and the feature vector of the video;
if the number of the music in the intersection is larger than or equal to the preset number, taking the preset number of music to form a music matching list;
and if the number of the music in the intersection is smaller than the preset number, supplementing the intersection with remaining music from the first music list to form a music matching list.
4. The method of claim 2, wherein if the tags of the video match at least two music tags, each matching music tag has its corresponding second music list;
determining a score list of the video according to an intersection of the first music list and the second music list, including:
and according to the priorities of the matched music labels, sequentially carrying out the following processing on the second music list corresponding to each music label until the number of the music in the dubbing list reaches a preset number:
intersecting the first music list and the second music list, and adding at least part of the music in the intersection to the dubbing list;
wherein the priority of the music label is determined by the matching degree of the music label and the label of the video.
5. The method according to any one of claims 2-4, further comprising:
extracting the feature vector of each music in the score library to obtain a score feature vector library;
correspondingly, selecting a first music list from the music score library according to the feature vector of each music in the music score library and the feature vector of the video, comprises:
and according to the feature vector of the video, performing cross-modal vector retrieval in the dubbing music feature vector library, determining music whose feature vectors are within the required distance of the feature vector of the video, and forming a first music list.
6. The method of claim 5, further comprising:
training a music classification network and a video classification network through training samples;
after the trained music classification network and video classification network are obtained, training a music characteristic vector extraction network and a video characteristic vector extraction network based on the music classification network and the video classification network;
the input of the music characteristic vector extraction network is the output of the middle layer of the music classification network, and the input of the video characteristic vector extraction network is the output of the middle layer of the video classification network;
the music classification network is used for determining labels of music, the video classification network is used for determining labels of videos, the music feature vector extraction network is used for extracting feature vectors of music, and the video feature vector extraction network is used for extracting feature vectors of videos.
7. The method of claim 6, wherein training the music feature vector extraction network and the video feature vector extraction network based on the music classification network and the video classification network comprises:
constructing a loss function from the distance between the feature vector output by the music feature vector extraction network and the feature vector output by the video feature vector extraction network, together with the label predicted value output by the music classification network and the label predicted value output by the video classification network, and training the music feature vector extraction network and the video feature vector extraction network based on the loss function according to a pre-obtained positive sample group and a pre-obtained negative sample group;
wherein the positive samples comprise matched music and video, and the negative samples comprise unmatched music and video;
the loss function is positively correlated with the feature-vector distance of a positive sample and negatively correlated with the feature-vector distance of a negative sample; the loss function is also positively correlated with the difference between the predicted value and the true value of the label.
8. The method according to any one of claims 2-4, wherein selecting a second music list from the library of music tracks based on the tags of the respective music in the library of music tracks and the tags of the videos comprises:
performing semantic mapping according to the labels of the video, and determining at least one music label matched with the labels of the video;
and selecting a second music list corresponding to the matched at least one music label from the music score library according to the labels of the music in the music score library.
9. A method for video dubbing music, comprising:
acquiring a video shot by a user for a commodity to be evaluated or recommended, and determining a feature vector and a label of the video;
matching music for the video according to the feature vectors and the labels of the video and the feature vectors and the labels of the music in the music matching library;
displaying the video after the score corresponding to the commodity to be evaluated or recommended;
wherein the feature vector of the video or music is used for representing the position of the video or music in the cross-modal space of the video and music.
10. A data processing method, comprising:
determining a feature vector and a label of first data according to the first data to be processed;
selecting at least one second data from the plurality of second data to be fused with the first data according to the feature vector and the label of the first data and the feature vector and the label of the plurality of second data;
wherein a feature vector of the first or second data is used to characterize the position of the first or second data in a cross-modal space; the first data and the second data are any two of video, image, music, text, sensing data and scene characteristics.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor to cause the electronic device to perform the method of any of claims 1-10.
12. A computer-readable storage medium having computer-executable instructions stored thereon which, when executed by a processor, implement the method of any one of claims 1-10.