CN114528762A - Model training method, device, equipment and storage medium - Google Patents


Info

Publication number
CN114528762A
Authority
CN
China
Prior art keywords
audio
information
music
visual
visual information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210148011.9A
Other languages
Chinese (zh)
Other versions
CN114528762B (en)
Inventor
于家硕
蒲俊福
单瀛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210148011.9A priority Critical patent/CN114528762B/en
Publication of CN114528762A publication Critical patent/CN114528762A/en
Application granted granted Critical
Publication of CN114528762B publication Critical patent/CN114528762B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/20 Design optimisation, verification or simulation
    • G06F 30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Geometry (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Processing Or Creating Images (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The application discloses a model training method, apparatus, device and storage medium, which can be applied in various scenarios such as cloud technology, artificial intelligence, intelligent transportation and driving assistance. The method comprises: acquiring visual information and audio information from a dance video; extracting an onset feature of the audio information, the onset feature representing the music-theory element points (e.g., rhythm points) of the audio in the dance video; and predicting the music-theory element points of the visual information with an initial model, so that the initial model is optimized according to the music-theory element points of the visual information and the onset feature of the audio information until the music-theory element points of the visual information are aligned with those of the audio, yielding a pre-trained model. By using the alignment of the music-theory element points of the visual information and of the audio as the proxy task for model pre-training, the method takes into account features unique to dance videos, so that the pre-trained model adapts well to various dance-related downstream tasks without data annotation and achieves satisfactory performance.

Description

Model training method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a model training method, apparatus, device, and storage medium.
Background
Representation Learning refers to using a specific training method to obtain a task-agnostic pre-trained model, which can be applied to several related specific tasks with only slight modification and achieve good performance.
The current characterization learning mainly focuses on audio-visual multi-modal pre-training of generalized videos, for example, video and audio correlation (AVC) is used as an agent task, and a model needs to judge whether a given audio and video sequence comes from the same video, so that the model is pre-trained by using audio and video. Or, by using video and audio time sequence synchronization (AVTS) as an agent task, the model needs to judge whether a given audio and video sequence is time sequence corresponding, and then performs model pre-training by using audio and video.
However, although these training methods are free from the limitation of data annotation, they do not take into account features unique to dance videos. They therefore cannot adapt to downstream tasks for this particular type of video, i.e., they cannot be directly applied to dance/music-related downstream tasks, and even when applied to some dance tasks their performance is not ideal.
Disclosure of Invention
In order to solve the above technical problems, the present application provides a model training method, apparatus, device and storage medium, in which aligning the music-theory element points of the visual information with those of the audio serves as the proxy task for model pre-training. Because features unique to dance videos are taken into account, the pre-trained model obtained through training adapts well to multiple dance/music-related downstream tasks without data annotation, and its performance is satisfactory.
The embodiment of the application discloses the following technical scheme:
in one aspect, an embodiment of the present application provides a model training method, where the method includes:
acquiring visual information and audio information in a dance video;
extracting an onset feature of the audio information based on a spectrogram of the audio information, wherein the onset feature represents the music-theory element points of the audio in the dance video;
predicting the music-theory element points of the visual information by using an initial model;
and optimizing the initial model according to the music-theory element points of the visual information and the onset feature of the audio information, so that the music-theory element points of the visual information are aligned with those of the audio, to obtain a pre-trained model.
In one aspect, an embodiment of the present application provides a model training apparatus, where the apparatus includes an acquisition unit, an extraction unit, a prediction unit, and an optimization unit:
the acquisition unit is configured to acquire visual information and audio information in a dance video;
the extraction unit is configured to extract an onset feature of the audio information based on a spectrogram of the audio information, wherein the onset feature represents the music-theory element points of the audio in the dance video;
the prediction unit is configured to predict the music-theory element points of the visual information by using an initial model;
and the optimization unit is configured to optimize the initial model according to the music-theory element points of the visual information and the onset feature of the audio information, so that the music-theory element points of the visual information are aligned with those of the audio, to obtain a pre-trained model.
In one aspect, an embodiment of the present application provides an electronic device for model training, where the electronic device includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the model training method of the preceding aspect according to instructions in the program code.
In one aspect, the present application provides a computer-readable storage medium for storing program code for executing the model training method of the foregoing aspect.
In one aspect, the present application provides a computer program product, which includes a computer program, and when executed by a processor, the computer program implements the model training method of the foregoing aspect.
According to the above technical solution, when a dance video is used for pre-training, the visual information and audio information in the dance video can be acquired. A dance video has unique features, such as the music-theory elements of rhythm, melody and beat, and essentially all dance/music-related downstream tasks need to take these into account. To ensure that the pre-trained model performs well on downstream tasks for this particular type of video, an onset feature of the audio information can be extracted based on the spectrogram of the audio information, the onset feature representing the music-theory element points of the audio in the dance video, and the music-theory element points of the visual information can be predicted with the initial model. The initial model can then be optimized according to the music-theory element points of the visual information and the onset feature of the audio information, so that the music-theory element points of the visual information are aligned with those of the audio, yielding the pre-trained model. Aligning the music-theory element points of the visual information with those of the audio serves as the proxy task for model pre-training, and features unique to dance videos are taken into account, so the pre-trained model obtained through training adapts well to multiple dance/music-related downstream tasks without data annotation, and its performance is satisfactory.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and for a person of ordinary skill in the art, other drawings can be obtained according to these drawings without inventive exercise.
Fig. 1 is an architecture diagram of an application scenario of a model training method according to an embodiment of the present application;
FIG. 2 is a flow chart of a model training method provided in an embodiment of the present application;
FIG. 3 is a block diagram of a model training method according to an embodiment of the present disclosure;
FIG. 4 is a diagram illustrating comparison results of three beat-point dance video re-creation modes provided by an embodiment of the application;
FIG. 5 is a block diagram of a model training apparatus according to an embodiment of the present disclosure;
fig. 6 is a structural diagram of a terminal according to an embodiment of the present application;
fig. 7 is a block diagram of a server according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
First, terms to which embodiments of the present application may relate are explained:
characterization Learning (Representation Learning): the method is characterized in that a task-independent (task-oriented) pre-training model is obtained by using a specific training method, and the model can be applied to a plurality of related specific tasks and obtains good performance after being slightly modified. There are two main types of characterization learning, namely supervised characterization learning (supervised pre-training) and unsupervised characterization learning (unsupervised pre-supervised training).
Unsupervised pre-training: the method is characterized in that large-scale data training is carried out in a specific mode without data labeling, and the obtained pre-training model can be applied to a plurality of related downstream tasks.
Proxy task (pretext task): based on some prior information in the data, a task without data annotation is designed for unsupervised pre-training of the model.
Downstream task (downstream task): in contrast to agent tasks, this refers to specific tasks to which a pre-trained model obtained through unsupervised learning can be applied.
Audio-video Learning (Audio-Visual Learning): one type of multi-modal learning refers to model training and application using and combining information from both modalities, sound (i.e., audio information) and visual (i.e., visual information).
AVC: the Audio-Visual Correspondence task, one of the proxy tasks in audio-visual learning; it requires a model to determine whether a given audio-video pair comes from the same video, i.e., whether the video and audio content are correlated.
AVTS: the Audio-Visual Temporal Synchronization task, one of the proxy tasks in audio-visual learning; it requires a model to determine whether a given audio-video pair comes from synchronized segments of a video, i.e., whether the video and audio content are temporally synchronized.
HOOF: Histogram of Oriented Optical Flow (optical-flow direction histogram), a feature used to characterize temporal motion information.
The audio-visual multi-modal pre-training method for generalized videos provided in the related art, although being free from the limitation of data annotation, cannot be directly applied to some dance/music related downstream tasks, such as intelligent dance/music creation. Even though it may be applied to certain dance tasks, such as dance classification and retrieval, these methods do not perform well. This is because these methods do not take into account some unique features in dance videos, such as the dancer's movements, the melody, the rhythm of the music, etc., and thus do not accommodate downstream tasks of this particular type of dance video.
In addition, many technical solutions related to dance videos provided in the related art are supervised learning methods tied to specific tasks. Although effective, a model trained for one task cannot be applied to multiple dance/music-related tasks, so the generalization of such solutions is poor. Moreover, these methods rely on a large amount of data annotation, their performance depends largely on the scale of the data, and this greatly limits the practical deployment of the technical solutions.
In order to solve the above technical problems, the embodiments of the present application provide a model training method in which aligning the music-theory element points of the visual information with those of the audio serves as the proxy task for model pre-training. Because features unique to dance videos are taken into account, the trained pre-trained model adapts well to multiple dance/music-related downstream tasks without data annotation and achieves satisfactory performance.
As shown in fig. 1, fig. 1 is a schematic diagram illustrating an architecture of an application scenario of a model training method. A server 101 may be included in this scenario. The server 101 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services.
The server 101 may acquire the visual information and audio information in a dance video. A dance video has unique features, such as the music-theory elements of rhythm, melody and beat, and essentially all dance/music-related downstream tasks need to take them into account. To ensure that the pre-trained model performs well on downstream tasks for this particular type of video, the server 101 may extract an onset feature of the audio information based on its spectrogram, the onset feature representing the music-theory element points of the audio in the dance video. The music-theory elements reflect characteristics unique to dance videos, such as rhythm, melody, beat, variation amplitude and intensity; this application mainly takes rhythm points as the example of music-theory element points.
The server 101 may also predict the music-theory element points of the visual information using the initial model, and then optimize the initial model according to the music-theory element points of the visual information and the onset feature of the audio information, so that the music-theory element points of the visual information are aligned with those of the audio, to obtain a pre-trained model.
It should be noted that the pre-trained model obtained by the method provided in the embodiments of the present application can be applied to several downstream tasks, such as dance classification, dance-music retrieval, and beat-point dance video re-creation. Dance classification and dance-music retrieval are applicable to user recommendation and search on short/long video platforms. Beat-point dance video re-creation can be applied to intelligent dance video creation on short video platforms and has important practical value. In these downstream applications, performance comparable to, or even better than, fully supervised methods can be obtained with only slight changes to the pre-trained model. Processing downstream tasks with the pre-trained model can be applied in various scenarios, including but not limited to cloud technology, artificial intelligence, intelligent transportation and driving assistance.
It should be noted that fig. 1 is only described by taking the server as an example to execute the model training method, and in some cases, the terminal may also execute the model training method. The terminal can be a mobile phone, a computer, an intelligent voice interaction device, an intelligent household appliance, a vehicle-mounted terminal, an aircraft and the like, but is not limited thereto.
When the pre-trained model is applied to the downstream task, the downstream task may be processed by the server or the terminal through the pre-trained model. Similarly, the server for processing the downstream task may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing a cloud computing service. The terminal for processing the downstream task may be a mobile phone, a computer, an intelligent voice interaction device, an intelligent household appliance, a vehicle-mounted terminal, an aircraft, etc., but is not limited thereto.
It is understood that the methods provided by the embodiments of the present application may involve Artificial Intelligence (AI), which is a theory, method, technique, and application that utilizes a digital computer or a machine controlled by a digital computer to simulate, extend, and extend human Intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning, automatic driving, intelligent traffic and the like.
The method provided by the embodiments of the present application may in particular involve Computer Vision (CV) technology, a science that studies how to make machines "see": using cameras and computers instead of human eyes to recognize, track and measure targets, and to further process images so that they become more suitable for human observation or for transmission to instruments for detection. For example, the music-theory element points of the visual information can be predicted by computer vision techniques, and visual features can be extracted by image semantic understanding within computer vision technology.
The method provided by the embodiments of the present application may also involve key technologies of Speech Technology, namely automatic speech recognition, speech synthesis and voiceprint recognition. Enabling computers to listen, see, speak and feel is a development direction of future human-computer interaction, and voice is expected to become one of the most promising interaction modes. For example, the onset feature of the audio information may be extracted with speech-processing techniques, and audio features may subsequently be extracted by speech recognition.
The method provided by the embodiment of the application can particularly relate to Machine Learning (ML), which is a multi-field cross subject, and relates to multi-field subjects such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The method specially studies how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning. For example, a pre-trained model is derived based on machine learning training.
Next, a model training method performed by a server will be taken as an example, and details of the model training method provided in the embodiments of the present application will be described with reference to the drawings.
Referring to fig. 2, fig. 2 shows a flow chart of a model training method, the method comprising:
s201, obtaining visual information and audio information in the dance video.
In the embodiment of the application, the dance video can be used as a training sample, and generally, the dance video reflects that a dancer dances according to music, so that the dance video can include visual information representing dance content and audio information representing music content, and a server can acquire the visual information and the audio information of the dance video.
S202, extracting an onset feature of the audio information based on the spectrogram of the audio information, wherein the onset feature represents the music-theory element points of the audio in the dance video.
For the audio information, the server may extract an onset feature based on the spectrogram of the audio information. The onset feature represents the moments at which the audio intensity starts to rise and can be regarded as the music-theory element points (e.g., rhythm points) of the audio. It is obtained in the following manner:
X(n, k) = Σ_{q=0}^{N-1} v(n·h + q) · w(q) · e^{-2πiqk/N}

where X(n, k) denotes the audio signal strength at the n-th temporal position and the k-th frequency position, N denotes the size of the temporal window, q denotes the temporal offset within the window, v(·) denotes the audio signal strength at the corresponding temporal position, h denotes the step size, and w(q) denotes the Hamming window function; N, q and h can be set according to actual requirements.
Referring to fig. 3, fig. 3 is a flow chart illustrating the structure of a model training method. As shown in 301 in fig. 3, the audio information may be an audio waveform, and the audio waveform is subjected to Fast Fourier Transform (FFT) of a certain time sequence window to obtain a spectrogram, which may be specifically referred to the above formula. In 301 of fig. 3, a spectrogram of the Audio information is analyzed by spectral flux (spectral flux) to obtain an Onset envelope (Audio Onset Envelopes) of the Audio, and the Onset envelope of the Audio is processed by selecting Local maxima (Picking Local maxima) to obtain musical key points of the Audio.
Specifically, the music-theory element points of the audio can be obtained by extracting the onset feature from the spectrogram according to the following formula:

OE(n) = Σ_k max(0, |X(n, k)| - |X_ref(n - μ, k)|)

where OE(n) denotes the onset-envelope curve of the music-theory element points of the audio, X_ref(n - μ, k) denotes the audio signal intensity at the (n - μ)-th temporal position and the k-th frequency position obtained by maximum-value calibration of X(n, k), max(·) takes the maximum value, and μ denotes the temporal delay.
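For illustration only (not part of the patent text), the following NumPy sketch mirrors the pipeline described above: a short-time Fourier transform with a Hamming window, a spectral-flux onset envelope, and local-maximum picking. All names, window sizes and the threshold are assumptions.

```python
import numpy as np

def onset_envelope(audio, n_fft=1024, hop=512, mu=1):
    """Spectral-flux onset envelope of a mono waveform (illustrative sketch)."""
    window = np.hamming(n_fft)                                   # w(q)
    n_frames = 1 + (len(audio) - n_fft) // hop
    # X(n, k): magnitude spectrogram of windowed frames
    frames = np.stack([audio[n * hop:n * hop + n_fft] * window
                       for n in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))
    # OE(n): positive spectral increase w.r.t. the frame mu steps earlier
    return np.maximum(0.0, spec[mu:] - spec[:-mu]).sum(axis=1)

def pick_local_maxima(env, threshold=0.1):
    """Frame indices whose envelope value is a local maximum above a threshold."""
    peaks = [n for n in range(1, len(env) - 1)
             if env[n] > env[n - 1] and env[n] >= env[n + 1]
             and env[n] > threshold * env.max()]
    return np.array(peaks)

# usage sketch: onsets of a one-second synthetic signal sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t) * (np.sin(2 * np.pi * 2 * t) > 0.9)
onsets = pick_local_maxima(onset_envelope(audio))
```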
S203, predicting the music-theory element points of the visual information by using the initial model.
For the visual information, the established initial model can be used in the embodiments of the present application to predict, from the input visual information, the music-theory element points of the visual information.
In one possible implementation, the initial model may include an explicit music-theory element alignment branch network, which is used to achieve explicit alignment of music-theory elements (e.g., explicit tempo alignment). In this case, the music-theory element points of the visual information may be predicted by the explicit music-theory element alignment branch network.
It should be noted that the explicit music-theory element alignment branch network may include a motion information extraction module and a video music-element predictor. In this case, the music-theory element points of the visual information may be predicted by having the motion information extraction module extract motion information from the visual information, and then having the video music-element predictor perform prediction based on the motion information to obtain the music-theory element points of the visual information.
The motion information extraction module is a module for extracting motion information; motion information can be extracted in many ways, such as optical flow extraction, motion capture, trajectory tracking and pose estimation. Here, the motion information extraction module may include an optical flow extraction network module and a histogram computation module, and the extracted motion information may be the optical-flow direction histogram feature.
Referring to 302 in FIG. 3, after the visual information is input to the explicit music-theory element alignment branch network, it may be encoded by the visual encoder of the optical flow extraction network (PWC-Net) module, and the encoded visual information may then be processed by the PWC-Net module to extract optical flow. Next, the histogram computation module computes a histogram over the extracted optical flow to obtain the optical-flow direction histogram feature, so that the video music-element predictor can make predictions based on this feature to obtain the music-theory element points of the visual information. The optical-flow direction histogram feature captures the amplitude and direction of optical-flow changes while avoiding interference from factors such as noise and camera angle, thereby improving prediction accuracy.
The optical-flow direction histogram feature is computed as follows:

H(t, k) = Σ_{(x,y)} M_t(x, y) · 1_k(P_t(x, y))

where H(t, k) denotes the k-th bin of the optical-flow direction histogram at the t-th temporal position, (x, y) denotes the two-dimensional spatial position coordinates, M_t(x, y) denotes the optical-flow magnitude of the position with coordinates (x, y) at the t-th time node, P_t(x, y) denotes the arctangent value (flow direction angle) of that position at the t-th time node, and 1_k(·) is a 0/1 indicator function. The indicator 1_k(φ) equals 1 when the angle φ = P_t(x, y) falls within the k-th angular bin of width θ, and 0 otherwise, where θ is a hyper-parameter that can be set according to actual requirements.
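Purely for illustration, a NumPy sketch of the optical-flow direction histogram for one frame pair is shown below; it assumes the flow field has already been extracted (e.g., by PWC-Net), and the number of bins and the bin width θ = 2π/n_bins are assumptions.

```python
import numpy as np

def hoof(flow, n_bins=8):
    """Optical-flow direction histogram (HOOF) for one frame pair.

    flow: (H, W, 2) array of per-pixel displacements. Each pixel votes for the
    angular bin of its flow direction P_t(x, y), weighted by its flow magnitude
    M_t(x, y); theta = 2*pi/n_bins is the assumed bin width.
    """
    dx, dy = flow[..., 0], flow[..., 1]
    magnitude = np.sqrt(dx ** 2 + dy ** 2)            # M_t(x, y)
    angle = np.arctan2(dy, dx) % (2 * np.pi)          # P_t(x, y) in [0, 2*pi)
    bins = (angle / (2 * np.pi / n_bins)).astype(int) % n_bins
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), magnitude.ravel())  # magnitude-weighted votes
    return hist / (hist.sum() + 1e-8)                 # normalised histogram

# usage sketch with a random flow field
h = hoof(np.random.randn(64, 64, 2).astype(np.float32))   # shape (8,)
```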
In one possible implementation, the explicit music element alignment branch network may further comprise a three primary color injector, the three primary colors being Red-Green-Blue (RGB), and thus the three primary color injector may be referred to as an RGB injector. In this case, the video musical element predictor predicts the musical element points of the visual information based on the motion information by extracting the three primary color information (i.e., RGB information) from the visual information by the three primary color injector and then predicting the musical element points of the visual information based on the three primary color information and the motion information by the video musical element predictor.
After the three-primary-color (RGB) information and the motion information (e.g., the optical-flow direction histogram feature) are obtained, they can be combined and input to the video music-element predictor to generate the music-theory element points of the visual information. The predictor uses a linear projection layer over the first-order differences of the RGB information and of the optical-flow direction histogram feature, implemented as follows:

o_v = a(W_e · (U_mot(f'_mot) ⊕ U_inj(f'_inj)) + b_e)

where o_v denotes the predicted music-theory element points of the visual information, a(·) denotes an activation function, W_e denotes a learnable weight parameter, U_mot denotes the fully connected layer for the motion information, f'_mot denotes the first-order difference of the optical-flow direction histogram feature, ⊕ denotes the feature combination operation, U_inj denotes the fully connected layer for the three-primary-color information, f'_inj denotes the first-order difference of the three-primary-color information, and b_e denotes a learnable bias.
In the embodiments of the present application, the RGB information and the motion information (such as the optical-flow direction histogram feature shown in FIG. 3) are combined to predict the music-theory element points of the visual information, so that the prediction is based on more visual information and the accuracy is further improved.
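A minimal PyTorch sketch of such a predictor is given below for illustration only: a linear projection over the concatenated first-order differences of the motion (HOOF) features and the injected RGB features. Feature dimensions, the sigmoid activation, and the use of concatenation for the combination operation ⊕ are assumptions.

```python
import torch
import torch.nn as nn

class VideoOnsetPredictor(nn.Module):
    """Illustrative sketch of a video music-element predictor."""

    def __init__(self, motion_dim=8, inj_dim=128, hidden_dim=64):
        super().__init__()
        self.u_mot = nn.Linear(motion_dim, hidden_dim)   # U_mot
        self.u_inj = nn.Linear(inj_dim, hidden_dim)      # U_inj
        self.w_e = nn.Linear(2 * hidden_dim, 1)          # W_e and b_e
        self.act = nn.Sigmoid()                          # a(): per-step onset score

    def forward(self, f_mot, f_inj):
        # first-order temporal differences f'_mot, f'_inj: (B, T, D) -> (B, T-1, D)
        d_mot = f_mot[:, 1:] - f_mot[:, :-1]
        d_inj = f_inj[:, 1:] - f_inj[:, :-1]
        fused = torch.cat([self.u_mot(d_mot), self.u_inj(d_inj)], dim=-1)
        return self.act(self.w_e(fused)).squeeze(-1)

# usage sketch: 32 time steps of HOOF (8-d) and injected RGB (128-d) features
pred = VideoOnsetPredictor()
onset_v = pred(torch.randn(2, 32, 8), torch.randn(2, 32, 128))   # (2, 31)
```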
In one possible implementation, the three primary color injector is composed of two parts, the first part being an audio-guided cross-modal attention module and the second part being a timing attention module, i.e. the three primary color injector comprises a cross-modal attention module and a timing attention module (see 303 in fig. 3). In this case, the extracting of the trichromatic information from the visual information by the trichromatic injector may be performed by determining a cross-modal association between the visual information and the first audio information by a cross-modal attention module, and performing a time-sequential attention calculation on the cross-modal association by a time-sequential attention module to obtain the trichromatic information.
The cross-modal attention module can acquire cross-modal association between the visual information and the first audio information, so that the audio information is introduced into the three-primary color information, and therefore when the three-primary color information and the action information are combined, the effect of enhancing visual representation by using the audio information is achieved.
The audio-modality input of the cross-modal attention module is the first audio information, which can be the audio feature obtained by passing the spectrogram of the audio information in the dance video through several convolution layers; the video-modality input is the visual feature obtained by encoding the visual information in the dance video with the visual encoder of PWC-Net. The cross-modal attention module is implemented as follows:

α_c = σ(W_1 · (F_a^c(p_a(f_a)) ⊙ F_rgb^c(f_rgb)))

f_rgb,i^c = α_c,i ⊙ f_rgb,i, for i = 1, ..., k

α_s = softmax(δ(W_2 · (F_a^s(f_a) ⊙ F_rgb^s(f_rgb^c))))

f_a:rgb,i = α_s,i ⊙ f_rgb,i^c

where c denotes the channel dimension and a denotes the first audio information; α_c denotes the attention weight in the channel dimension, σ(·) denotes a nonlinear activation function, W_1 denotes a learnable weight parameter, F_a^c denotes the audio fully connected layer in the channel dimension, p_a denotes a global average pooling layer, f_a denotes the audio feature, F_rgb^c denotes the RGB fully connected layer in the channel dimension, ⊙ denotes the element-wise combination operation, and f_rgb denotes the RGB feature; f_rgb^c denotes the RGB feature enhanced by the audio feature of the first audio information in the channel dimension, i denotes the index into the matrix, k denotes the number of elements in the matrix, f_rgb,i^c denotes the i-th element of the f_rgb^c matrix, and f_rgb,i denotes the i-th element of the f_rgb matrix; s denotes the spatial dimension, α_s denotes the attention weight in the spatial dimension, softmax(·) denotes the activation function, δ(·) denotes the hyperbolic tangent function, W_2 denotes a learnable weight parameter, F_a^s denotes the audio fully connected layer in the spatial dimension, and F_rgb^s denotes the RGB fully connected layer in the spatial dimension; f_a:rgb denotes the finally obtained RGB information, α_s,i denotes the i-th element of the α_s matrix in the spatial dimension, and f_rgb,i^c denotes the i-th element of the f_rgb^c matrix in the channel dimension.
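A rough PyTorch sketch of an audio-guided channel-then-spatial attention block of this kind is shown below for illustration. The layer sizes, the pooling choice and the element-wise fusion are assumptions and do not claim to reproduce the patent's exact implementation.

```python
import torch
import torch.nn as nn

class AudioGuidedAttention(nn.Module):
    """Sketch: channel attention then spatial attention, both guided by audio."""

    def __init__(self, a_dim=128, c_dim=256):
        super().__init__()
        self.fa_c = nn.Linear(a_dim, c_dim)     # audio FC, channel stage (F_a^c)
        self.frgb_c = nn.Linear(c_dim, c_dim)   # RGB FC, channel stage (F_rgb^c)
        self.w1 = nn.Linear(c_dim, c_dim)
        self.fa_s = nn.Linear(a_dim, c_dim)     # audio FC, spatial stage (F_a^s)
        self.frgb_s = nn.Linear(c_dim, c_dim)   # RGB FC, spatial stage (F_rgb^s)
        self.w2 = nn.Linear(c_dim, 1)

    def forward(self, f_rgb, f_a):
        # f_rgb: (B, S, C) visual tokens over S spatial positions, f_a: (B, A)
        a_c = self.fa_c(f_a).unsqueeze(1)                        # (B, 1, C)
        pooled = self.frgb_c(f_rgb.mean(dim=1, keepdim=True))    # pooled RGB (an assumption)
        alpha_c = torch.sigmoid(self.w1(a_c * pooled))           # channel attention weights
        f_rgb_c = alpha_c * f_rgb                                # channel-enhanced RGB
        logits = self.w2(torch.tanh(self.fa_s(f_a)).unsqueeze(1) * self.frgb_s(f_rgb_c))
        alpha_s = torch.softmax(logits, dim=1)                   # (B, S, 1) spatial weights
        return alpha_s * f_rgb_c                                 # f_{a:rgb}

# usage sketch: 7x7 = 49 spatial tokens with 256 channels, 128-d audio feature
att = AudioGuidedAttention()
out = att(torch.randn(2, 49, 256), torch.randn(2, 128))          # (2, 49, 256)
```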
It should be noted that, in order to avoid the dependence of the entire model on audio information and enable the model to be applied to a single-modality downstream task, the explicit music element alignment branch network provided in the embodiment of the present application may further include an audio gating module (see 304 in fig. 3), which may select a first audio information from the audio information of the dance video to be input to the cross-modality attention module, for example, a fixed portion (e.g., the first audio information) of each input batch (audio information of the dance video) may be led to the cross-modality attention module with a certain probability, and another portion is subjected to the same dimension transformation through linear projection and is not subjected to the attention mechanism.
The second part of the three-primary-color injector is the temporal attention module, which can capture long-range feature interactions; therefore, temporal attention is computed over the cross-modal association by the temporal attention module to obtain the three-primary-color information. This part is implemented as follows:
frgb1,frgb2=AudioDropout(frgb,p)
AudioDropout(f,p)=f[b*p:],f[:b*p]
fa:rgb1=AGVA(frgb1,fa)
f′rgb2=Linear(Tile(frgb2))
fa:rgb=Concat(fa:rgb1,f′rgb2)
finj=Att(fa:rgb,fa:rgb,fa:rgb)
where AudioDropout(·) denotes the audio gating module and p is a parameter (for example, the specific probability mentioned above) by which each input batch is divided into two parts: the first part is input to the cross-modal attention module for attention processing, and the second part only undergoes linear projection; f_rgb1 denotes the first part, f_rgb2 denotes the second part, and f_rgb denotes the RGB features of the batch from the dance videos;
b denotes the batch size, f[b*p:] denotes the first part, and f[:b*p] denotes the second part;
f_a:rgb1 denotes the cross-modal association obtained after attention processing by the cross-modal attention module, AGVA(·) denotes the attention processing, f_a denotes the audio feature of the audio information in the dance video, f'_rgb2 denotes the result of the linear projection, Linear(·) denotes a linear projection function, Tile(·) denotes a dimension-transformation operation, f_a:rgb denotes the result of merging f_a:rgb1 and f'_rgb2, Concat(·) denotes the merge (concatenation) function, f_inj denotes the result obtained after processing by the temporal attention module (i.e., the finally obtained three-primary-color information), and Att(·) denotes temporal self-attention.
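For illustration, a minimal sketch of this batch-level audio gating is given below; the split follows the AudioDropout formula, with the first part routed to the cross-modal attention branch and the second to a plain linear projection, while the shapes and the value of p are assumptions.

```python
import torch

def audio_dropout(f_rgb, p):
    """Split the batch into an attended part and a linearly projected part.

    Following the formula above, f_rgb1 = f_rgb[b*p:] goes through the
    audio-guided cross-modal attention and f_rgb2 = f_rgb[:b*p] only through
    a linear projection (illustrative sketch).
    """
    k = int(f_rgb.size(0) * p)
    return f_rgb[k:], f_rgb[:k]

# usage sketch: 8 clips, half of them bypass the attention branch
f_rgb = torch.randn(8, 49, 256)
f_rgb1, f_rgb2 = audio_dropout(f_rgb, p=0.5)    # shapes (4, 49, 256) and (4, 49, 256)
```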
S204, optimizing the initial model according to the music-theory element points of the visual information and the onset feature of the audio information, so that the music-theory element points of the visual information are aligned with those of the audio, to obtain a pre-trained model.
In the embodiments of the present application, the music-theory element points of the audio can be used as the reference, and aligning the predicted music-theory element points of the visual information with those of the audio serves as the optimization objective (i.e., the proxy task). The initial model is optimized until the predicted music-theory element points of the visual information are aligned with those of the audio, and the model obtained at that point is taken as the pre-trained model.
When the initial model includes the explicit music-theory element alignment branch network, S204 may be implemented by constructing a first loss function from the music-theory element points of the visual information and the onset feature of the audio information, and optimizing the explicit music-theory element alignment branch network according to the first loss function to obtain the pre-trained model.
In a possible implementation, to avoid the information imbalance between music-theory element points and non-music-theory-element points, the focal loss may be used as the first loss function for optimization:

L_focal = -α_t · (1 - p_t)^γ · log(p_t)

where L_focal denotes the value of the focal loss, p_t denotes the predicted probability for the ground-truth label at each temporal position (i.e., whether that position is a music-theory element point), and α_t is a hyper-parameter that can be set according to actual requirements (γ is the usual focusing parameter of the focal loss).
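For illustration, a binary focal loss of this standard form between the predicted visual onset curve and the audio onset targets could be sketched as follows; the values of α_t and γ are assumptions.

```python
import torch

def focal_loss(onset_pred, onset_target, alpha_t=0.75, gamma=2.0, eps=1e-6):
    """Standard binary focal loss (illustrative sketch).

    onset_pred: predicted onset probabilities of the visual information, in (0, 1).
    onset_target: 0/1 audio onset targets derived from the onset feature.
    """
    p_t = torch.where(onset_target > 0.5, onset_pred, 1.0 - onset_pred)
    a_t = torch.where(onset_target > 0.5,
                      torch.full_like(onset_pred, alpha_t),
                      torch.full_like(onset_pred, 1.0 - alpha_t))
    return (-a_t * (1.0 - p_t) ** gamma * torch.log(p_t + eps)).mean()

# usage sketch
loss = focal_loss(torch.rand(2, 31), (torch.rand(2, 31) > 0.8).float())
```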
According to the above technical solution, when a dance video is used for pre-training, the visual information and audio information in the dance video can be acquired. Because a dance video has unique features, such as the music-theory elements of rhythm, melody and beat, which essentially all dance/music-related downstream tasks must take into account, an onset feature representing the music-theory element points of the audio is extracted from the spectrogram of the audio information, and the music-theory element points of the visual information are predicted with the initial model; the initial model is then optimized according to both, so that the music-theory element points of the visual information are aligned with those of the audio, yielding the pre-trained model. With this alignment as the proxy task, and with the unique features of dance videos taken into account, the pre-trained model adapts well to multiple dance/music-related downstream tasks without data annotation, and its performance is satisfactory.
It can be understood that, since the music-theory element points of the audio and of the visual information are both extracted from the audio-visual information (i.e., the audio information and the visual information) in the dance video, the music-theory element points are actually hidden in the video stream and the audio stream. Based on this, the embodiments of the present application further propose implicit music-theory element alignment: AVC and AVTS are used as proxy tasks, and the model determines whether a given audio-video pair comes from correlated/temporally synchronized video, which implicitly also determines whether the music-theory elements are correlated/temporally synchronized.
In this case, the initial model may further include an implicit music-theory element alignment branch network (see 305 in FIG. 3), through which visual features are obtained from the visual information of the dance video and audio features are obtained from the audio information of the dance video. If the dance videos include a positive sample and a negative sample, optimizing the explicit music-theory element alignment branch network according to the first loss function to obtain the pre-trained model may be implemented by constructing a second loss function from the visual feature of the positive sample, the audio feature of the positive sample and the audio feature of the negative sample, then constructing a target loss function from the first loss function and the second loss function, and jointly optimizing the explicit and implicit music-theory element alignment branch networks according to the target loss function.
Joint optimization of the explicit and implicit music-theory element alignment branch networks improves the generalization of the pre-trained model, so that it can be applied to more downstream tasks.
For implicit music-theory element alignment, the triplet loss is used for optimization in the embodiments of the present application. For each positive/negative sample pair, the model takes the visual information of the positive sample and the audio information of the positive and negative samples to form a triplet. The aim is to reduce the distance between the visual information of the positive sample and the audio information of the positive sample, and to enlarge the distance between the visual information of the positive sample and the audio information of the negative sample, implemented as follows:
L_tri = (1/N) · Σ max(0, ||f'_v - f'_a:pos||² - ||f'_v - f'_a:neg||² + α)

where L_tri denotes the triplet loss, N denotes the size of the temporal window, f'_v denotes the visual representation, f'_a:pos denotes the audio representation of the positive sample, f'_a:neg denotes the audio representation of the negative sample, ||·||² denotes the squared norm, and α denotes a threshold hyper-parameter that can be set according to actual requirements.
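For illustration, the triplet objective above can be sketched in PyTorch as follows; the margin value and the feature dimensions are assumptions.

```python
import torch
import torch.nn.functional as F

def triplet_loss(f_v, f_a_pos, f_a_neg, alpha=0.2):
    """Pull (visual, positive audio) together, push (visual, negative audio) apart."""
    d_pos = (f_v - f_a_pos).pow(2).sum(dim=-1)   # squared distance to positive audio
    d_neg = (f_v - f_a_neg).pow(2).sum(dim=-1)   # squared distance to negative audio
    return F.relu(d_pos - d_neg + alpha).mean()  # hinge with margin alpha

# usage sketch with 256-d embeddings
loss = triplet_loss(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256))
```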
Finally, the explicit and implicit music-theory element alignment branch networks are jointly optimized as follows:

L = λ_1 · L_focal + λ_2 · L_tri

where L denotes the overall objective function, λ_1 and λ_2 denote hyper-parameters that can be set according to actual requirements, and N denotes the size of the temporal window.
In a possible implementation, the implicit music-theory element alignment branch network may be a two-tower model comprising a first branch network and a second branch network: the visual features are obtained from the visual information of the dance video through the first branch network, and the audio features are obtained from the audio information of the dance video through the second branch network.
Specifically, the two-tower model can be built on a Transformer architecture. The visual information is encoded with the visual encoder of the PWC-Net network, and the audio information is encoded with an audio convolutional network consisting of several convolution layers as the audio encoder; the audio information may also pass through a feature extractor before being encoded. The encoded visual information passes through a Visual Transformer and a Visual Embedding layer, which output the final visual feature for optimization; the encoded audio information passes through an Audio Transformer and an Audio Embedding layer, which output the final audio feature for optimization. After the visual feature and the audio feature are obtained, they can be combined into a cross-modal feature for output, or output separately.
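A minimal PyTorch sketch of such a two-tower branch is shown below for illustration: each tower is a small Transformer encoder over pre-extracted clip features followed by an embedding layer. The feature dimensions, depths and the mean-pooling readout are assumptions.

```python
import torch
import torch.nn as nn

class TwoTowerEncoder(nn.Module):
    """Illustrative visual/audio two-tower encoder with Transformer towers."""

    def __init__(self, v_dim=512, a_dim=128, d_model=256, n_layers=2):
        super().__init__()
        def layer():
            return nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.v_proj, self.a_proj = nn.Linear(v_dim, d_model), nn.Linear(a_dim, d_model)
        self.v_tower = nn.TransformerEncoder(layer(), n_layers)
        self.a_tower = nn.TransformerEncoder(layer(), n_layers)
        self.v_embed, self.a_embed = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)

    def forward(self, v_feats, a_feats):
        # v_feats: (B, Tv, v_dim) encoded frames, a_feats: (B, Ta, a_dim) encoded audio
        f_v = self.v_embed(self.v_tower(self.v_proj(v_feats)).mean(dim=1))
        f_a = self.a_embed(self.a_tower(self.a_proj(a_feats)).mean(dim=1))
        return f_v, f_a              # per-clip visual and audio embeddings

# usage sketch
model = TwoTowerEncoder()
f_v, f_a = model(torch.randn(2, 16, 512), torch.randn(2, 64, 128))   # (2, 256) each
```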
It should be noted that, in the implicit music theory element alignment process, AVC and AVTS may be used as proxy tasks, and for AVC and AVTS, negative samples need to be selected first. The negative examples may be chosen differently due to the different proxy tasks. For the AVC agent task, the visual information of the original dance video and the audio information of the other dance video are combined to form a negative sample irrelevant to the audio and video. In this case, it is necessary to avoid a negative sample sampling problem, i.e., another dance video selected at random has exactly the same music as the original dance video, so that additional filtering of the audio information of the two dance videos is required. In this case, the audio information that is a negative sample can be selected only when the difference between the musical key distribution of the two audio information is significant. Taking the original dance video as a first dance video and the other dance video as a second dance video as an example, when the negative sample is selected, the audio information of the first dance video and the audio information of the second dance video can be obtained, and if the similarity score between the audio information of the first dance video and the audio information of the second dance video is greater than zero, the visual information of the first dance video and the audio information of the second dance video form the negative sample.
The similarity score is calculated as follows:

S_rhy = (1/T) · Σ_{t=1}^{T} |o_t^(1) - o_t^(2)| - α

where S_rhy denotes the similarity score between the audio information of the first dance video and the audio information of the second dance video, t denotes the time node, T denotes the total length, o_t^(1) denotes the music-theory element curve of the audio information in the first dance video at the t-th time node, o_t^(2) denotes that of the audio information in the second dance video at the t-th time node, and α denotes a hyper-parameter.
Generally, only when the similarity score S_rhy is greater than 0 is the screening condition met, so that the audio information can be used as the audio information of a negative sample.
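For illustration, this negative-sample screening for the AVC task might be sketched as follows; because the exact scoring function in the original figure is not recoverable, the mean absolute difference of the two onset curves minus a threshold α is an assumption.

```python
import numpy as np

def rhythm_similarity_score(onset_a, onset_b, alpha=0.1):
    """Positive only when the two music-element (onset) curves differ enough."""
    t = min(len(onset_a), len(onset_b))
    return np.mean(np.abs(onset_a[:t] - onset_b[:t])) - alpha

def is_valid_avc_negative(onset_a, onset_b):
    return rhythm_similarity_score(onset_a, onset_b) > 0   # S_rhy > 0

# usage sketch with two random onset envelopes
ok = is_valid_avc_negative(np.random.rand(100), np.random.rand(100))
```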
For the negative samples of AVTS, a negative sample can be produced by applying a temporal offset within the same dance video. Because the music-theory element points may be distributed at similar temporal intervals, a random offset may happen to equal a multiple of that interval, in which case the music-theory element points of the offset audio would still be synchronized with those of the original dance video and the negative sample would be invalid; additional screening is therefore needed. In this case, the randomly selected offset duration must not equal a multiple of the interval between music-theory element points. However, because this interval differs from video to video, the embodiments of the present application use the beat instead: the duration of one eighth beat is selected as the basic duration unit, and the offset duration must not equal a multiple of one eighth beat. That is, the visual information and audio information of the first dance video are acquired, the visual information is temporally offset by the offset duration, and the offset visual information and the audio information of the first dance video form a negative sample, where the offset duration is not equal to a multiple of the basic duration unit of the audio information, the basic duration unit being the duration of one eighth beat. The specific screening formula is:

f_sft mod d_1/8 ≠ 0

where f_sft denotes the number of offset video frames (which can be converted into the offset duration), mod denotes the remainder operation, d_1/8 denotes the number of video frames corresponding to one eighth beat of the audio, computed from k_fps, the frame rate of the first dance video, and k_bpm, the beat tempo of the audio information.
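For illustration, sampling a valid AVTS offset could be sketched as follows; it assumes k_bpm counts quarter-note beats per minute and uses a small tolerance around exact multiples of the one-eighth-beat duration.

```python
import numpy as np

def sample_valid_offset(n_frames, fps, bpm, rng=np.random):
    """Draw a frame offset that is not (close to) a multiple of 1/8 beat."""
    eighth_beat_frames = fps * 60.0 / (8.0 * bpm)      # assumed frames per 1/8 beat
    while True:
        f_sft = rng.randint(1, n_frames)               # candidate offset, in frames
        rem = f_sft % eighth_beat_frames
        if min(rem, eighth_beat_frames - rem) > 0.5:   # reject near-multiples
            return f_sft

# usage sketch: 25 fps video, 120 BPM music
offset = sample_valid_offset(n_frames=200, fps=25, bpm=120)
```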
A pre-trained model can be obtained through the training described above, and it can be used for various downstream tasks. The embodiments of the present application mainly take three downstream tasks as examples: dance classification, dance-music cross-modal retrieval, and beat-point dance video re-creation.
For dance classification: on a video platform there are several specific dance categories, such as jazz, street dance and modern dance. After a user uploads a dance video, the downstream task can be processed based on the pre-trained model obtained in the embodiments of the present application: a downstream-task model obtained by fine-tuning on dance classification performs content-based automatic classification, which facilitates viewing and searching by users.
In one possible implementation, a classification layer may be added on top of the pre-trained model. Dance classification may then be performed by acquiring the dance video to be classified; obtaining, through the pre-trained model, the visual feature and audio feature of the dance video to be classified as well as the music-theory element points of the audio and of the visual information in that video; splicing these visual features, audio features and music-theory element points to obtain a spliced feature; and classifying the spliced feature through the classification layer to obtain the classification result of the dance video to be classified.
In one possible implementation, two different linear layer projections and nonlinear function activations may also be added between the pre-trained model and the classification layer.
During model training of dance classification downstream tasks, pre-training parameters on agent tasks can be directly used for transfer learning.
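For illustration, a classification head of the kind described above might look like the following sketch; the pre-trained backbone is assumed to be fine-tuned or frozen separately, and all dimensions and the number of classes are assumptions.

```python
import torch
import torch.nn as nn

class DanceClassifierHead(nn.Module):
    """Concatenate visual/audio features and the two onset curves, then classify."""

    def __init__(self, feat_dim=1024, hidden_dim=256, n_classes=10):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),      # first projection
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),    # second projection
            nn.Linear(hidden_dim, n_classes),                # classification layer
        )

    def forward(self, f_v, f_a, onset_v, onset_a):
        fused = torch.cat([f_v, f_a, onset_v, onset_a], dim=-1)   # splicing step
        return self.head(fused)                                    # class logits

# usage sketch: 256-d features and 256-step onset curves, so feat_dim = 4 * 256
head = DanceClassifierHead()
logits = head(torch.randn(2, 256), torch.randn(2, 256),
              torch.randn(2, 256), torch.randn(2, 256))
```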
For dance-music cross-modal retrieval, in a video platform, a user often performs music/dance retrieval corresponding to a specific dance picture or music. The application scene can be further expanded to scenes such as dance music score, dance recommendation corresponding to music, and intelligent dance grading. The application can process the downstream tasks based on the pre-training model obtained in the embodiment of the application: dance-music cross-modal retrieval is achieved without the need for additional training.
In one possible mode, when retrieving the dance corresponding to a specific piece of music, the audio to be retrieved can be obtained; a first audio feature and a first music element point of the audio to be retrieved are determined through the pre-training model, and a first visual feature and a second music element point are determined through the pre-training model according to the visual information of the dance videos in a retrieval database; a first similarity between the first audio feature and the first visual feature and a second similarity between the first music element point and the second music element point are calculated respectively; a similarity matrix is obtained based on the first similarity and the second similarity; and the visual information of the dance video matched with the audio to be retrieved is determined from the retrieval database according to the similarity matrix.
If music retrieval corresponding to a specific dance is carried out, the visual information of the dance video to be retrieved can be obtained; a second visual feature and a third music element point are determined through the pre-training model according to the visual information of the dance video to be retrieved, and a second audio feature and a fourth music element point are determined through the pre-training model according to the audio information of the dance videos in the retrieval database; a third similarity between the second visual feature and the second audio feature and a fourth similarity between the third music element point and the fourth music element point are calculated respectively; a similarity matrix is obtained based on the third similarity and the fourth similarity; and the audio information matched with the visual information of the dance video to be retrieved is determined from the retrieval database according to the similarity matrix.
The similarity matrix may be obtained by weighted combination of the first similarity and the second similarity, or by weighted combination of the third similarity and the fourth similarity, and the weighted combination is as follows:
S_hyb = λ3 × S_e + (1 − λ3) × S_r

wherein S_hyb represents the similarity matrix obtained by weighted combination, S_e denotes the first similarity (or the third similarity), S_r denotes the second similarity (or the fourth similarity), and λ3 represents a hyperparameter that can be set according to actual requirements.
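A minimal sketch of this weighted combination for retrieval is shown below; the cosine similarity for the feature term and the normalized overlap measure for the music element point term are illustrative choices, not fixed by the embodiment:

```python
import numpy as np

def cosine_similarity_matrix(query_feats: np.ndarray, gallery_feats: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity: rows are queries, columns are gallery items."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    return q @ g.T

def onset_similarity_matrix(query_onsets: np.ndarray, gallery_onsets: np.ndarray) -> np.ndarray:
    """Similarity between music-element-point sequences (illustrative normalized inner product)."""
    inter = query_onsets @ gallery_onsets.T
    norm = (np.linalg.norm(query_onsets, axis=1, keepdims=True)
            * np.linalg.norm(gallery_onsets, axis=1, keepdims=True).T + 1e-8)
    return inter / norm

def hybrid_similarity(s_e: np.ndarray, s_r: np.ndarray, lam3: float = 0.5) -> np.ndarray:
    """S_hyb = lambda3 * S_e + (1 - lambda3) * S_r."""
    return lam3 * s_e + (1.0 - lam3) * s_r

# Retrieval sketch: rank gallery dance videos for each audio query by hybrid similarity.
s_e = cosine_similarity_matrix(np.random.rand(2, 256), np.random.rand(5, 256))
s_r = onset_similarity_matrix(np.random.rand(2, 100), np.random.rand(5, 100))
ranking = np.argsort(-hybrid_similarity(s_e, s_r, lam3=0.6), axis=1)
```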
For dance video smart card point re-creation, in a video platform, a user may perform secondary creation of a dance video, such as matching an original dance with a new piece of music. In this application scenario, downstream task processing can be performed based on the pre-training model obtained in the embodiment of the application: the stuck-point dance video re-creation scheme helps the user perform automatic, intelligent creation.
In one possible mode, dance video smart card point re-creation may be performed as follows: obtaining the dance video to be created and the audio to be created; determining the music element points of the dance video to be created and the music element points of the audio to be created by using the pre-training model; and aligning the music element points of the dance video to be created with the music element points of the audio to be created by means of time sequence acceleration/deceleration, time sequence offset or dynamic time planning, so as to generate the stuck-point video.
There are three implementations of this alignment. The first is time sequence offset: the shorter of the two sequences (the music element points of the dance video to be created and the music element points of the audio to be created) is slid over the other sequence in a sliding-window manner to find a more accurate correspondence of the music element points, so that the stuck points are completed through a time sequence offset. The second is time sequence acceleration/deceleration: the time sequence length between every two music element points of the two sequences is made completely consistent, where acceleration can be realized by video frame interpolation and deceleration by deleting video frames. The third is dynamic time planning: this method searches for the optimal correspondence of each music element point in the two sequences by dynamic programming, taking the distance between music element points as the cost and making the sum of costs between the two sequences as small as possible. The specific realization is as follows:
c(i,j)=d(i,j)+M{c(i-1,j-1),c(i-1,j),c(i,j-1)}
d(i, j) = |P_i − P_j|

wherein c(i, j) represents the cumulative dynamic time planning distance when the i-th music element point of one sequence is aligned with the j-th music element point of the other sequence (the smaller c(i, j) is, the better), d(i, j) represents the distance between the i-th music element point and the j-th music element point, M{} represents taking the minimum, c(i−1, j−1), c(i−1, j) and c(i, j−1) represent the cumulative distances for the (i−1, j−1), (i−1, j) and (i, j−1) alignments respectively, P_i indicates the time sequence position of the i-th music element point, and P_j indicates the time sequence position of the j-th music element point.
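A minimal sketch of the dynamic time planning alignment between the two music element point sequences is shown below; the frame editing that produces the final stuck-point video is omitted:

```python
import numpy as np

def dtw_align(p_video, p_audio):
    """Dynamic-time-planning alignment between two sequences of music element
    point positions (in seconds). Returns the list of (i, j) index pairs whose
    summed distance cost is minimal, following c(i,j) = d(i,j) + min(...)."""
    n, m = len(p_video), len(p_audio)
    c = np.full((n + 1, m + 1), np.inf)
    c[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(p_video[i - 1] - p_audio[j - 1])          # d(i, j) = |P_i - P_j|
            c[i, j] = d + min(c[i - 1, j - 1], c[i - 1, j], c[i, j - 1])
    # Backtrack to recover which video element point maps to which audio element point.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([c[i - 1, j - 1], c[i - 1, j], c[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Example: align dance element points to the element points of a new music track.
print(dtw_align([0.5, 1.4, 2.6, 3.9], [0.4, 1.5, 2.5, 4.0, 5.1]))
```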
It should be noted that the pre-training model obtained in the embodiment of the present application performs well on downstream tasks. The results of comparing the performance of the pre-training model provided in the embodiment of the present application on the three downstream tasks are as follows.
For dance classification, the comparison between the performance of the pre-training model provided in the embodiment of the application and that of the models provided in the related art can be seen in Table 1:
TABLE 1
(The content of Table 1 is provided as an image in the original publication.)
Table 1 compares the pre-training model provided in the embodiment of the present application with two fully-supervised methods, Castro et al. and MDR, and with three audio-visual self-supervised methods, Multisensory, AVTS and LLA, on the Let's Dance data set. The results show that the pre-training model provided in the embodiment of the present application significantly exceeds all the fully-supervised and self-supervised methods, which proves its effectiveness on the dance classification task. For example, with a similar amount of data, the dance classification accuracy reaches 81.7%, higher than that of the other methods provided by the related art.
For dance-music cross-modal retrieval, the comparison between the performance of the pre-training model provided in the embodiment of the application and that of the models provided in the related art can be seen in Table 2:
TABLE 2
(The content of Table 2 is provided as an image in the original publication.)
Table 2 shows the cross-modal retrieval performance of the pre-training model provided in the embodiment of the present application on the Dance-50 data set. The pre-training model is compared with three audio-visual self-supervised methods, and the results show that its performance is significantly better than that of the other audio-visual self-supervised methods, exceeding them by more than 15% on each evaluation index. The evaluation indexes are R@1, R@5, P@10 and mAP, whose values are 0.622, 0.924, 0.661 and 0.633 respectively, all higher than the corresponding values in the related art. Here R (recall) denotes the proportion of truly matching samples that are retrieved, P (precision) denotes the proportion of retrieved samples that truly match, and mAP (mean Average Precision) denotes the mean average precision.
For dance video smart card point re-creation, the comparison of the three re-creation modes provided in the embodiment of the application can be seen in Fig. 4. Fig. 4 shows dance video smart card point re-creation visualization results; since no suitable quantitative metric is available, the comparison is provided for qualitative analysis. The results show that the re-created video obtained by time sequence acceleration/deceleration corresponds well to the rhythm points, but contains many speed changes and therefore does not look smooth; the time sequence offset result is smooth to watch, but without acceleration/deceleration some rhythm points do not correspond well; and dynamic time planning obtains the best result, maintaining excellent viewing continuity while achieving good rhythm point correspondence.
It should be noted that, on the basis of the implementation manners provided by the above aspects, the present application may be further combined to provide further implementation manners.
Based on the model training method provided in the embodiment corresponding to fig. 2, the embodiment of the present application further provides a model training apparatus 500. Referring to fig. 5, the model training apparatus 500 includes an obtaining unit 501, an extracting unit 502, a predicting unit 503, and an optimizing unit 504:
the acquiring unit 501 is configured to acquire visual information and audio information in a dance video;
the extracting unit 502 is configured to extract, based on the spectrogram of the audio information, a starting point feature of the audio information, where the starting point feature is used to represent a musical element point of an audio in the dance video;
the prediction unit 503 is configured to predict a musical key point of the visual information by using an initial model;
the optimizing unit 504 is configured to optimize the initial model according to the musical key points of the visual information and the start point features of the audio information, so that the musical key points of the visual information are aligned with the musical key points of the audio to obtain a pre-training model.
In a possible implementation manner, the initial model includes an explicit music theory element alignment branch network, and the prediction unit 503 is configured to:
predicting the music element points of the visual information through the explicit music theory element alignment branch network;
the optimizing unit 504 is configured to:
constructing a first loss function according to the musical key points of the visual information and the initial point characteristics of the audio information;
and optimizing the dominant music theory element alignment branch network according to the first loss function to obtain the pre-training model.
In a possible implementation manner, the explicit music theory element alignment branch network includes an action information extraction module and a video music theory element predictor, and the prediction unit 503 is configured to:
extracting action information according to the visual information through the action information extraction module;
and predicting through the video music theory element predictor according to the action information to obtain music theory element points of the visual information.
In a possible implementation, the explicit music element alignment branch network further includes a three primary color injector, and the prediction unit 503 is configured to:
extracting, by the tri-chromatic injector, tri-chromatic information from the visual information;
and predicting the music principle points of the visual information according to the three-primary-color information and the action information through the video music principle element predictor.
In a possible implementation manner, the three primary color injector includes a cross-mode attention module and a time sequence attention module, and the prediction unit 503 is configured to:
determining, by the cross-modal attention module, a cross-modal association between the visual information and first audio information;
and performing time sequence attention calculation on the cross-modal association through the time sequence attention module to obtain the three primary color information.
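A rough sketch of such an injector is shown below; the use of PyTorch multi-head attention, the dimensions and the TriElementInjector name are illustrative assumptions of this sketch rather than details specified by the embodiment:

```python
import torch
import torch.nn as nn

class TriElementInjector(nn.Module):
    """Cross-modal attention (visual tokens query audio keys/values) followed by
    temporal self-attention over the resulting cross-modal association."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual_seq, audio_seq):
        # Cross-modal association between the visual information and the first audio information.
        fused, _ = self.cross_attn(query=visual_seq, key=audio_seq, value=audio_seq)
        # Time sequence attention calculation over the cross-modal association.
        out, _ = self.temporal_attn(query=fused, key=fused, value=fused)
        return out

injector = TriElementInjector()
info = injector(torch.randn(2, 32, 256), torch.randn(2, 48, 256))  # (batch, time, dim)
```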
In a possible implementation manner, the explicit music theory element alignment branch network further includes an audio gating module, and the prediction unit 503 is further configured to select, by the audio gating module, the first audio information from the audio information of the dance video to be input to the cross-modal attention module before determining the cross-modal association between the visual information and the audio information by the cross-modal attention module.
In a possible implementation manner, the motion information extraction module includes an optical flow extraction network module and a histogram calculation module, and the motion information is an optical flow direction histogram feature.
In a possible implementation manner, the initial model further includes an implicit music theory element alignment branch network, and the apparatus further includes a determining unit:
the determining unit is used for aligning the branch network through the implicit music theory element, obtaining a visual characteristic according to the visual information of the dance video and obtaining an audio characteristic according to the audio information of the dance video;
the dance video comprises a positive sample and a negative sample, and the optimization unit 504 is configured to:
constructing a second loss function according to the visual characteristics of the positive sample, the audio characteristics of the positive sample and the audio characteristics of the negative sample;
constructing a target loss function according to the first loss function and the second loss function;
and performing joint optimization on the explicit music theory element alignment branch network and the implicit music theory element alignment branch network according to the target loss function to obtain the pre-training model.
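A minimal sketch of how the target loss might combine the two branches is shown below; the triplet form of the second loss and the weighting coefficient lam are illustrative assumptions of this sketch, not taken from the embodiment:

```python
import torch
import torch.nn.functional as F

def second_loss(vis_pos, aud_pos, aud_neg, margin: float = 0.2):
    """Triplet-style contrastive loss: pull the positive pair's visual and audio
    features together and push the negative audio features away (illustrative form)."""
    pos_sim = F.cosine_similarity(vis_pos, aud_pos, dim=-1)
    neg_sim = F.cosine_similarity(vis_pos, aud_neg, dim=-1)
    return F.relu(margin - pos_sim + neg_sim).mean()

def target_loss(first_loss: torch.Tensor, vis_pos, aud_pos, aud_neg, lam: float = 0.5):
    """Weighted combination of the explicit-alignment (first) loss and the
    implicit-alignment (second) loss; lam is a hypothetical hyperparameter."""
    return lam * first_loss + (1.0 - lam) * second_loss(vis_pos, aud_pos, aud_neg)

# Joint optimization sketch with random placeholder features.
loss = target_loss(torch.tensor(0.8), torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 256))
```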
In a possible implementation manner, the implicit music theory element alignment branch network is a double-tower model, the double-tower model includes a first branch network and a second branch network, and the determining unit is configured to:
obtaining visual characteristics according to the visual information of the dance video through the first branch network;
and obtaining audio characteristics according to the audio information of the dance video through the second branch network.
In a possible implementation manner, the negative examples are selected in a manner that:
acquiring audio information of a first dance video and audio information of a second dance video;
if the similarity score between the audio information of the first dance video and the audio information of the second dance video is larger than zero, forming the visual information of the first dance video and the audio information of the second dance video into the negative sample;
or,
acquiring visual information and audio information of a first dance video;
and performing time sequence offset on the visual information according to offset time, and forming the negative sample by the visual information after the time sequence offset and the audio information of the first dance video, wherein the offset time length is not equal to the multiple of a basic time length unit of the audio information, and the basic time length unit is the time length of one eighth beat of the audio information.
In a possible implementation manner, the apparatus further includes a determining unit, a splicing unit, and a classifying unit:
the obtaining unit 501 is further configured to obtain a dance video to be classified;
the determining unit is used for obtaining the visual characteristics and the audio characteristics of the dance video to be classified, the musical key points of the audio frequency in the dance video to be classified and the musical key points of the visual information in the dance video to be classified through the pre-training model;
the splicing unit is used for splicing the visual characteristics and the audio characteristics of the dance video to be classified, the musical key points of the audio frequency in the dance video to be classified and the musical key points of the visual information in the dance video to be classified to obtain splicing characteristics;
and the classification unit is used for classifying through the classification layer according to the splicing characteristics to obtain a classification result of the dance videos to be classified.
In one possible implementation, the apparatus further includes a determining unit and a calculating unit:
the obtaining unit 501 is further configured to obtain an audio to be retrieved;
the determining unit is used for determining a first audio characteristic and a first music key point of the audio to be retrieved through the pre-training model, and determining a first visual characteristic and a second music key point according to visual information of a dance video in a retrieval database through the pre-training model;
the calculating unit is used for calculating a first similarity between the first audio feature and the first visual feature and a second similarity between the first music element point and the second music element point respectively;
the determining unit is further configured to obtain a similarity matrix based on the first similarity and the second similarity;
the determining unit is further configured to determine, from the retrieval database, visual information of a dance video matched with the audio to be retrieved according to the similarity matrix;
or,
the obtaining unit 501 is further configured to obtain visual information of a dance video to be retrieved;
the determining unit is used for determining a second visual characteristic and a third music principle prime point according to the visual information of the dance video to be retrieved through the pre-training model, and determining a second audio characteristic and a fourth music principle prime point according to the audio information of the dance video in the retrieval database through the pre-training model;
the calculating unit is configured to calculate a third similarity between the second visual feature and the second audio feature, and a fourth similarity between the third music element point and the fourth music element point, respectively;
the determining unit is further configured to obtain a similarity matrix based on the third similarity and the fourth similarity;
and the determining unit is further used for determining audio information matched with the visual information of the dance video to be retrieved from the retrieval database according to the similarity matrix.
In one possible implementation manner, the apparatus further includes a determining unit and a generating unit:
the obtaining unit 501 is further configured to obtain a dance video to be created and an audio to be created;
the determining unit is used for determining music key points of the dance video to be authored and music key points of the audio to be authored by utilizing the pre-training model;
and the generating unit is used for aligning the music key points of the dance video to be created and the music key points of the audio to be created by using a time sequence acceleration/deceleration, time sequence offset or dynamic time planning mode to generate the click video.
According to the technical scheme, when the dance video is used for pre-training, the visual information and the audio information in the dance video can be acquired. Because some unique features, such as musical key elements of rhythm, melody, beat and the like, exist in the dance video, and basically all the unique features need to be considered in some downstream tasks related to dance/music, in order to ensure that a pre-training model obtained through pre-training can be ideal in the downstream task of the dance video of a special type, the initial point feature of audio information can be extracted based on a spectrogram of the audio information, the initial point feature is used for representing musical key elements of audio in the dance video, and the musical key elements of visual information are predicted by using the initial model, so that the initial model can be optimized according to the musical key elements of the visual information and the initial point feature of the audio information, so that the musical key elements of the visual information are aligned with the musical key elements of the audio to obtain the pre-training model. The method and the device align the music principle element points of the visual information and the music principle element points of the audio to serve as agent tasks to realize model pre-training, and take some unique characteristics in dance videos into consideration, so that the pre-training model obtained through training can be better suitable for a plurality of downstream tasks related to dance/music under the condition of no need of data annotation, and the performance is ideal.
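To make the extraction of the start point (onset) features from the spectrogram concrete, a minimal sketch is shown below; the use of librosa and the mel-spectrogram settings are illustrative assumptions of this sketch, not specified by the embodiment:

```python
import librosa

def extract_onset_features(audio_path: str, sr: int = 22050):
    """Compute a spectrogram-based onset strength envelope and the detected
    onset times, which stand in for the music element points of the audio."""
    y, sr = librosa.load(audio_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    onset_env = librosa.onset.onset_strength(S=librosa.power_to_db(mel), sr=sr)
    onset_frames = librosa.onset.onset_detect(onset_envelope=onset_env, sr=sr)
    onset_times = librosa.frames_to_time(onset_frames, sr=sr)
    return onset_env, onset_times

# onset_env is the frame-level start point feature; onset_times are element point positions in seconds.
```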
The embodiment of the application further provides an electronic device for model training, which can be a terminal, and takes the terminal as a smart phone as an example:
fig. 6 is a block diagram illustrating a partial structure of a smartphone according to an embodiment of the present application. Referring to fig. 6, the smart phone includes: Radio Frequency (RF) circuit 610, memory 620, input unit 630, display unit 640, sensor 650, audio circuit 660, wireless fidelity (WiFi) module 670, processor 680, and power supply 690. The input unit 630 may include a touch panel 631 and other input devices 632, the display unit 640 may include a display panel 641, and the audio circuit 660 may include a speaker 661 and a microphone 662. It will be appreciated that the smartphone configuration shown in fig. 6 is not intended to be limiting of smartphones, which may include more or fewer components than shown, combine some components, or have a different arrangement of components.
The memory 620 may be used to store software programs and modules, and the processor 680 may execute various functional applications and data processing of the smart phone by operating the software programs and modules stored in the memory 620. The memory 620 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the smartphone, and the like. Further, the memory 620 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The processor 680 is a control center of the smart phone, connects various parts of the entire smart phone using various interfaces and lines, and performs various functions of the smart phone and processes data by operating or executing software programs and/or modules stored in the memory 620 and calling data stored in the memory 620. Optionally, processor 680 may include one or more processing units; preferably, the processor 680 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 680.
In this embodiment, the processor 680 in the smart phone may perform the following steps:
acquiring visual information and audio information in a dance video;
extracting a starting point feature of the audio information based on the spectrogram of the audio information, wherein the starting point feature is used for representing musical key points of the audio in the dance video;
predicting the musical key points of the visual information by using an initial model;
and optimizing the initial model according to the musical principle prime points of the visual information and the initial point characteristics of the audio information so as to align the musical principle prime points of the visual information with the musical principle prime points of the audio to obtain a pre-training model.
Referring to fig. 7, fig. 7 is a block diagram of a server 700 provided in this embodiment, and the server 700 may have a relatively large difference due to different configurations or performances, and may include one or more Central Processing Units (CPUs) 722 (e.g., one or more processors) and a memory 732, and one or more storage media 730 (e.g., one or more mass storage devices) storing an application program 742 or data 744. Memory 732 and storage medium 730 may be, among other things, transient storage or persistent storage. The program stored in the storage medium 730 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Further, the central processor 722 may be configured to communicate with the storage medium 730, and execute a series of instruction operations in the storage medium 730 on the server 700.
The server 700 may also include one or more power supplies 726, one or more wired or wireless network interfaces 750, one or more input-output interfaces 758, and/or one or more operating systems 741, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
In this embodiment, the central processor 722 in the server 700 may perform the following steps:
acquiring visual information and audio information in a dance video;
extracting a starting point feature of the audio information based on the spectrogram of the audio information, wherein the starting point feature is used for representing musical key points of the audio in the dance video;
predicting the musical key points of the visual information by using an initial model;
and optimizing the initial model according to the musical principle prime points of the visual information and the initial point characteristics of the audio information so as to align the musical principle prime points of the visual information with the musical principle prime points of the audio to obtain a pre-training model.
According to an aspect of the present application, a computer-readable storage medium is provided, which is used for storing program codes, and the program codes are used for executing the model training method described in the foregoing embodiments.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternative implementations of the embodiment.
The description of the flow or structure corresponding to each of the above drawings has emphasis, and a part not described in detail in a certain flow or structure may refer to the related description of other flows or structures.
The terms "first," "second," "third," "fourth," and the like in the description of the application and the above-described figures, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (17)

1. A method of model training, the method comprising:
acquiring visual information and audio information in a dance video;
extracting a starting point feature of the audio information based on the spectrogram of the audio information, wherein the starting point feature is used for representing music key points of the audio in the dance video;
predicting the musical key points of the visual information by using an initial model;
and optimizing the initial model according to the music principle prime points of the visual information and the initial point characteristics of the audio information so as to align the music principle prime points of the visual information and the music principle prime points of the audio to obtain a pre-training model.
2. The method of claim 1, wherein the initial model comprises an explicit music element alignment branch network, and wherein predicting music element points of the visual information using the initial model comprises:
predicting the music element points of the visual information through the explicit music element alignment branch network;
wherein the optimizing the initial model according to the musical theory key points of the visual information and the initial point characteristics of the audio information, to align the musical theory key points of the visual information with the musical theory key points of the audio information to obtain a pre-training model, comprises:
constructing a first loss function according to the musical key points of the visual information and the initial point characteristics of the audio information;
and optimizing the dominant music theory element alignment branch network according to the first loss function to obtain the pre-training model.
3. The method of claim 2, wherein the explicit music theory element alignment branch network comprises an action information extraction module and a video music theory element predictor, and wherein predicting music theory element points of the visual information through the explicit music theory element alignment branch network comprises:
extracting action information according to the visual information through the action information extraction module;
and predicting through the video music theory element predictor according to the action information to obtain music theory element points of the visual information.
4. The method of claim 3, wherein the explicit music element alignment branch network further comprises a tri-chromatic injector, and wherein predicting the music element points of the visual information by the video music element predictor according to the motion information comprises:
extracting, by the tri-chromatic injector, tri-chromatic information from the visual information;
and predicting the music principle points of the visual information according to the three primary color information and the action information by the video music principle predictor.
5. The method of claim 4, wherein the three primary color injector comprises a cross-mode attention module and a timing attention module, and wherein extracting, by the three primary color injector, three primary color information from the visual information comprises:
determining, by the cross-modal attention module, a cross-modal association between the visual information and first audio information;
and performing time sequence attention calculation on the cross-modal association through the time sequence attention module to obtain the three primary color information.
6. The method of claim 5, wherein the network of explicit music element alignment branches further comprises an audio gating module, and wherein prior to determining, by the cross-modal attention module, the cross-modal association between the visual information and the audio information, the method further comprises:
and selecting the first audio information from the audio information of the dance video through the audio gating module and inputting the first audio information into the cross-modal attention module.
7. The method according to any one of claims 2-6, wherein the action information extraction module comprises an optical flow extraction network module and a histogram calculation module, and the action information is an optical flow direction histogram feature.
8. The method of claim 2, wherein the initial model further comprises an implicit musical element alignment branching network, the method further comprising:
aligning the branch network through the implicit music theory element, obtaining visual characteristics according to the visual information of the dance video, and obtaining audio characteristics according to the audio information of the dance video;
wherein the dance video comprises a positive sample and a negative sample, and the optimizing the dominant music theory element alignment branch network according to the first loss function to obtain the pre-training model comprises:
constructing a second loss function according to the visual characteristics of the positive sample, the audio characteristics of the positive sample and the audio characteristics of the negative sample;
constructing a target loss function according to the first loss function and the second loss function;
and performing joint optimization on the explicit music theory element alignment branch network and the implicit music theory element alignment branch network according to the target loss function to obtain the pre-training model.
9. The method of claim 8, wherein the implicit musical theory element alignment branch network is a double-tower model, the double-tower model comprises a first branch network and a second branch network, and the obtaining of the visual feature according to the visual information of the dance video and the audio feature according to the audio information of the dance video through the implicit musical theory element alignment branch network comprises:
obtaining visual characteristics according to the visual information of the dance video through the first branch network;
and obtaining audio characteristics according to the audio information of the dance video through the second branch network.
10. The method of claim 8, wherein the negative examples are selected by:
acquiring audio information of a first dance video and audio information of a second dance video;
if the similarity score between the audio information of the first dance video and the audio information of the second dance video is larger than zero, forming the visual information of the first dance video and the audio information of the second dance video into the negative sample;
or,
acquiring visual information and audio information of a first dance video;
and performing time sequence offset on the visual information according to offset time, and forming the negative sample by the visual information after the time sequence offset and the audio information of the first dance video, wherein the offset time length is not equal to the multiple of a basic time length unit of the audio information, and the basic time length unit is the time length of one eighth beat of the audio information.
11. The method according to any one of claims 8-10, further comprising:
obtaining dance videos to be classified;
obtaining the visual characteristics and the audio characteristics of the dance video to be classified, as well as musical theory key points of the audio frequency in the dance video to be classified and musical theory key points of the visual information in the dance video to be classified through the pre-training model;
splicing the visual characteristics and the audio characteristics of the dance video to be classified, and musical key elements of the audio frequency in the dance video to be classified and musical key elements of the visual information in the dance video to be classified to obtain splicing characteristics;
and according to the splicing characteristics, classifying through the classification layer to obtain a classification result of the dance video to be classified.
12. The method according to any one of claims 8-10, further comprising:
acquiring audio to be retrieved;
determining a first audio characteristic and a first music principle prime point of the audio to be retrieved through the pre-training model, and determining a first visual characteristic and a second music principle prime point according to visual information of a dance video in a retrieval database through the pre-training model;
respectively calculating a first similarity between the first audio feature and the first visual feature and a second similarity between the first music element point and the second music element point;
obtaining a similarity matrix based on the first similarity and the second similarity;
determining the visual information of the dance video matched with the audio to be retrieved from the retrieval database according to the similarity matrix;
or,
acquiring visual information of a dance video to be retrieved;
determining a second visual characteristic and a third music principle prime point according to the visual information of the dance video to be retrieved through the pre-training model, and determining a second audio characteristic and a fourth music principle prime point according to the audio information of the dance video in the retrieval database through the pre-training model;
calculating a third similarity between the second visual feature and the second audio feature and a fourth similarity between the third music principle element point and the fourth music principle element point respectively;
obtaining a similarity matrix based on the third similarity and the fourth similarity;
and determining audio information matched with the visual information of the dance video to be retrieved from the retrieval database according to the similarity matrix.
13. The method according to any one of claims 1-6, further comprising:
obtaining a dance video to be created and an audio to be created;
determining the musical key points of the dance video to be created and the musical key points of the audio to be created by utilizing the pre-training model;
and aligning the music principle points of the dance video to be created and the music principle points of the audio to be created by using a time sequence acceleration/deceleration, time sequence offset or dynamic time planning mode to generate a click video.
14. A model training device is characterized by comprising an acquisition unit, an extraction unit, a prediction unit and an optimization unit:
the obtaining unit is used for obtaining visual information and audio information in the dance video;
the extracting unit is used for extracting a starting point feature of the audio information based on the spectrogram of the audio information, wherein the starting point feature is used for representing musical key points of the audio in the dance video;
the prediction unit is used for predicting the musical key points of the visual information by using an initial model;
and the optimization unit is used for optimizing the initial model according to the music principle prime points of the visual information and the initial point characteristics of the audio information so as to align the music principle prime points of the visual information with the music principle prime points of the audio information and obtain a pre-training model.
15. An electronic device for model training, the electronic device comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of any of claims 1-13 according to instructions in the program code.
16. A computer-readable storage medium for storing program code, which when executed by a processor causes the processor to perform the method of any one of claims 1-13.
17. A computer program product comprising a computer program, characterized in that the computer program realizes the method of any of claims 1-13 when executed by a processor.
CN202210148011.9A 2022-02-17 2022-02-17 Model training method, device, equipment and storage medium Active CN114528762B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210148011.9A CN114528762B (en) 2022-02-17 2022-02-17 Model training method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210148011.9A CN114528762B (en) 2022-02-17 2022-02-17 Model training method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114528762A true CN114528762A (en) 2022-05-24
CN114528762B CN114528762B (en) 2024-02-20

Family

ID=81622355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210148011.9A Active CN114528762B (en) 2022-02-17 2022-02-17 Model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114528762B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117012228A (en) * 2023-07-28 2023-11-07 支付宝(杭州)信息技术有限公司 Method and device for training evaluation model and evaluating video quality
CN117857868A (en) * 2024-03-07 2024-04-09 腾讯科技(深圳)有限公司 Method and device for predicting audio beats in video and computer equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040267536A1 (en) * 2003-06-27 2004-12-30 Hershey John R. Speech detection and enhancement using audio/video fusion
CN110264984A (en) * 2019-05-13 2019-09-20 北京奇艺世纪科技有限公司 Model training method, music generating method, device and electronic equipment
KR102192210B1 (en) * 2020-06-23 2020-12-16 인하대학교 산학협력단 Method and Apparatus for Generation of LSTM-based Dance Motion
WO2021138855A1 (en) * 2020-01-08 2021-07-15 深圳市欢太科技有限公司 Model training method, video processing method and apparatus, storage medium and electronic device
CN113449148A (en) * 2021-06-24 2021-09-28 北京百度网讯科技有限公司 Video classification method and device, electronic equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040267536A1 (en) * 2003-06-27 2004-12-30 Hershey John R. Speech detection and enhancement using audio/video fusion
CN110264984A (en) * 2019-05-13 2019-09-20 北京奇艺世纪科技有限公司 Model training method, music generating method, device and electronic equipment
WO2021138855A1 (en) * 2020-01-08 2021-07-15 深圳市欢太科技有限公司 Model training method, video processing method and apparatus, storage medium and electronic device
KR102192210B1 (en) * 2020-06-23 2020-12-16 인하대학교 산학협력단 Method and Apparatus for Generation of LSTM-based Dance Motion
CN113449148A (en) * 2021-06-24 2021-09-28 北京百度网讯科技有限公司 Video classification method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YING CHENG et al.: "Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning", MM '20: Proceedings of the 28th ACM International Conference on Multimedia, page 3884 *
TIAN Chunna et al.: "A Survey of Self-Supervised Video Representation Learning" (自监督视频表征学习综述), Journal of Xidian University (西安电子科技大学学报), vol. 48, no. 5, pages 222-230 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117012228A (en) * 2023-07-28 2023-11-07 支付宝(杭州)信息技术有限公司 Method and device for training evaluation model and evaluating video quality
CN117857868A (en) * 2024-03-07 2024-04-09 腾讯科技(深圳)有限公司 Method and device for predicting audio beats in video and computer equipment
CN117857868B (en) * 2024-03-07 2024-05-31 腾讯科技(深圳)有限公司 Method and device for predicting audio beats in video and computer equipment

Also Published As

Publication number Publication date
CN114528762B (en) 2024-02-20

Similar Documents

Publication Publication Date Title
CN110472531B (en) Video processing method, device, electronic equipment and storage medium
CN111930992B (en) Neural network training method and device and electronic equipment
CN110674350B (en) Video character retrieval method, medium, device and computing equipment
CN110852256B (en) Method, device and equipment for generating time sequence action nomination and storage medium
CN116171473A (en) Bimodal relationship network for audio-visual event localization
CN105468781A (en) Video query method and device
CN114528762B (en) Model training method, device, equipment and storage medium
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
CN112380377A (en) Audio recommendation method and device, electronic equipment and computer storage medium
CN114282047A (en) Small sample action recognition model training method and device, electronic equipment and storage medium
CN116050496A (en) Determination method and device, medium and equipment of picture description information generation model
CN115083435A (en) Audio data processing method and device, computer equipment and storage medium
CN114495916B (en) Method, device, equipment and storage medium for determining insertion time point of background music
CN115147641A (en) Video classification method based on knowledge distillation and multi-mode fusion
CN112085120A (en) Multimedia data processing method and device, electronic equipment and storage medium
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
Liu et al. Discriminative Feature Representation Based on Cascaded Attention Network with Adversarial Joint Loss for Speech Emotion Recognition.
CN116310975B (en) Audiovisual event positioning method based on consistent fragment selection
CN117152815A (en) Student activity accompanying data analysis method, device and equipment
CN115222047A (en) Model training method, device, equipment and storage medium
CN115512104A (en) Data processing method and related equipment
WO2021147084A1 (en) Systems and methods for emotion recognition in user-generated video(ugv)
CN114329064A (en) Video processing method, video processing device, computer equipment and storage medium
CN114360491A (en) Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium
CN114419480A (en) Multi-person identity and action association identification method and device and readable medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40072970

Country of ref document: HK

GR01 Patent grant
GR01 Patent grant