CN111898753A - Music transcription model training method, music transcription method and corresponding device - Google Patents

Music transcription model training method, music transcription method and corresponding device

Info

Publication number
CN111898753A
Authority
CN
China
Prior art keywords
sample
frame
audio
training
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010779114.6A
Other languages
Chinese (zh)
Other versions
CN111898753B (en)
Inventor
孔秋强
王雨轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ByteDance Inc
Original Assignee
ByteDance Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ByteDance Inc filed Critical ByteDance Inc
Priority to CN202010779114.6A priority Critical patent/CN111898753B/en
Priority claimed from CN202010779114.6A external-priority patent/CN111898753B/en
Publication of CN111898753A publication Critical patent/CN111898753A/en
Application granted granted Critical
Publication of CN111898753B publication Critical patent/CN111898753B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The embodiment of the disclosure discloses a training method of a music transcription model, a music transcription method and corresponding devices. The training method comprises: acquiring training data, wherein each training sample in the training data comprises an audio feature vector of a sample audio, a sample music score corresponding to the sample audio, and a first sample time feature value and a second sample time feature value corresponding to each frame in the sample audio; and training an initial neural network model based on the training data until the total loss function corresponding to the model converges, and determining the model at the end of training as the music transcription model. The input of the model is the audio feature vector of the sample audio, and the output of the model comprises a first predicted time feature value, a second predicted time feature value and a predicted music score corresponding to each frame in the sample audio. The training method provided by the embodiment of the disclosure can improve the accuracy of music transcription, so that the transcribed music score is closer to the real musical content of the audio, and the method has high applicability.

Description

Music transcription model training method, music transcription method and corresponding device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a music transcription model training method, a music transcription method, and a corresponding apparatus.
Background
Automatic Music Transcription (AMT) translates raw music audio into symbolic notation, mainly the start time and end time of each note in the music audio, and is widely used in music teaching, music appreciation, music theory analysis, and the like.
However, conventional music transcription methods mainly perform transcription by predicting whether a note is present in each frame of the audio, and their accuracy is low. Because a piece of music contains many notes and varied melodies, the correspondence between frames and notes established by conventional methods tends to drift, so the resulting music score often differs from the true musical content of the audio.
Therefore, how to further improve the accuracy of music transcription becomes an urgent problem to be solved.
Disclosure of Invention
The disclosed embodiments provide a music transcription model training method, a music transcription method, and corresponding apparatuses. This summary is provided to introduce concepts in a simplified form that are described in detail in the following detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, an embodiment of the present disclosure provides a method for training a music transcription model, where the method includes:
acquiring training data, wherein each training sample in the training data comprises an audio feature vector of a sample audio, a sample music score corresponding to the sample audio, and a first sample time feature value and a second sample time feature value corresponding to each frame in the sample audio, wherein for any frame, the first sample time feature value represents a time difference between a middle time point of the frame and a nearest note start time point of the frame, and the second sample time feature value represents a time difference between the middle time point of the frame and a nearest note end time point of the frame;
training an initial neural network model based on the training data until the total loss function corresponding to the model converges, and determining the model at the end of training as the music transcription model;
the input of the model is the audio feature vector of the sample audio, and the output of the model comprises the first prediction time feature value, the second prediction time feature value and the prediction music score corresponding to each frame in the sample audio;
the total loss function includes a first training loss function, a second training loss function, and a third training loss function, wherein a value of the first training loss function represents a difference between a sample score corresponding to the sample audio and the predicted score, a value of the second training loss represents a difference between a first sample time feature value corresponding to the sample audio and the first predicted time feature value, and a value of the third training loss function represents a difference between a second sample time feature value corresponding to the sample audio and the second predicted time feature value.
In a second aspect, embodiments of the present disclosure provide a music transcription method, including:
acquiring audio to be processed, and determining an audio characteristic vector corresponding to the audio to be processed;
inputting the audio characteristic vector of the audio to be processed into a music transcription model, and obtaining a music score corresponding to the audio to be processed based on an output result of the music transcription model;
the music transcription model is obtained by training through the training method of the music transcription model provided by the embodiment of the disclosure.
In a third aspect, an embodiment of the present disclosure provides a training apparatus for a music transcription model, including:
a training data obtaining module, configured to obtain training data, where each training sample in the training data includes an audio feature vector of a sample audio, a sample score corresponding to the sample audio, and a first sample time feature value and a second sample time feature value corresponding to each frame in the sample audio, where, for any frame, the first sample time feature value represents a time difference between a middle time point of the frame and a nearest note start time point of the frame, and the second sample time feature value represents a time difference between the middle time point of the frame and a nearest note end time point of the frame;
the training module is used for training the initial neural network model based on the training data until the total loss function corresponding to the model converges, and determining the model at the end of training as the music transcription model;
the input of the model is the audio feature vector of the sample audio, and the output of the model comprises the first prediction time feature value, the second prediction time feature value and the prediction music score corresponding to each frame in the sample audio;
the total loss function includes a first training loss function, a second training loss function, and a third training loss function, wherein a value of the first training loss function represents a difference between a sample score corresponding to the sample audio and the predicted score, a value of the second training loss represents a difference between a first sample time feature value corresponding to the sample audio and the first predicted time feature value, and a value of the third training loss function represents a difference between a second sample time feature value corresponding to the sample audio and the second predicted time feature value.
In a fourth aspect, an embodiment of the present disclosure provides a music transcription apparatus, including:
the audio processing device comprises a to-be-processed audio acquisition module, a to-be-processed audio acquisition module and a to-be-processed audio acquisition module, wherein the to-be-processed audio acquisition module is used for acquiring audio to be processed and determining an audio characteristic vector corresponding to the audio to be processed;
the transcription module is used for inputting the audio characteristic vector of the audio to be processed into a music transcription model and obtaining a music score corresponding to the audio to be processed based on the output result of the music transcription model;
the music transcription model is obtained through training by a training device of the music transcription model provided by the embodiment of the disclosure.
In a fifth aspect, an embodiment of the present disclosure provides an electronic device, which includes a processor and a memory, where the processor and the memory are connected to each other;
the memory is used for storing computer programs;
the processor is configured to execute the training method of the music transcription model provided by the embodiment of the disclosure and/or the music transcription method provided by the embodiment of the disclosure when the computer program is called.
In a sixth aspect, the present disclosure provides a computer readable medium, which stores a computer program, where the computer program is executed by a processor to implement the training method of the music transcription model provided by the present disclosure and/or the music transcription method provided by the present disclosure.
In the embodiment of the present disclosure, by using the sample audio feature vector, the sample volume, the music score corresponding to the sample audio in each training sample in the training data, and the time feature value corresponding to each frame in the sample audio, the initial neural network model can be comprehensively trained from the aspects of the note starting time point, the note ending time point, the volume, the whole music score, and the like, and the model is further modified and trained by combining the loss functions corresponding to the above factors, so as to improve the transcription accuracy of the music transcription model obtained after training. On the other hand, the embodiment of the present disclosure may also transcribe the audio to be processed based on the music transcription model to obtain a corresponding music score, and the applicability is high.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
Fig. 1 is a schematic flow chart of a training method of a music transcription model provided by an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of a method for determining a sample time characteristic value according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a distribution of note onset time points provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a scenario for determining a first sample temporal feature value according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of an initial neural network model provided by an embodiment of the present disclosure;
FIG. 6 is another schematic diagram of an initial neural network model provided by an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of another structure of an initial neural network model provided in an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a model testing method provided by an embodiment of the present disclosure;
FIG. 9 is a schematic structural diagram of a training apparatus for a music transcription model provided in an embodiment of the present disclosure;
FIG. 10 is a schematic structural diagram of a music transcription apparatus provided in an embodiment of the present disclosure;
fig. 11 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing the devices, modules or units, and are not used for limiting the devices, modules or units to be different devices, modules or units, and also for limiting the sequence or interdependence relationship of the functions executed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Referring to fig. 1, fig. 1 is a schematic flowchart of a training method of a music transcription model provided in an embodiment of the present disclosure. As shown in fig. 1, the training method of a music transcription model provided by an embodiment of the present disclosure may include the following steps:
and S11, acquiring training data.
Specifically, the training data in the present disclosure includes a plurality of training samples (also referred to as training targets), each of which includes an audio feature vector of a sample audio, a sample score corresponding to the sample audio, and a first sample time feature value and a second sample time feature value corresponding to each frame in the sample audio.
The sample audio is a time-domain signal and cannot directly participate in the training of the music transcription model, so the sample audio included in each training sample is converted into an audio feature vector. That is, when training the music transcription model, the input of the model is the audio feature vector of each sample audio. Optionally, the audio feature vector of the sample audio may be the logarithmic Mel spectrum feature of the audio. Specifically, the sample audio may be converted from the time domain to the frequency domain, the log Mel spectrum feature of each frame of the sample audio may be extracted, and the audio feature vector of the sample audio may be obtained based on the log Mel spectrum features of the frames. The logarithmic Mel spectrum X may be represented by a T × F matrix, where T is the total number of frames of the logarithmic Mel spectrum and F is the number of frequency bins of the logarithmic Mel spectrum.
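By way of a non-limiting illustration, such a T × F logarithmic Mel feature matrix might be computed along the following lines; the librosa library, sampling rate, hop length and number of Mel bins used here are assumptions of this sketch, not values specified by the disclosure.

```python
# Illustrative sketch only: extract a T x F log-Mel feature matrix from a sample audio file.
import librosa

def log_mel_features(path, sr=16000, n_fft=2048, hop_length=160, n_mels=229):
    audio, _ = librosa.load(path, sr=sr, mono=True)       # time-domain sample audio
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                    # logarithmic Mel spectrum
    return log_mel.T                                      # shape (T frames, F frequency bins)
```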
The sample audio in the present disclosure may be obtained based on various ways, and may specifically be determined based on the requirements of the actual application scenario, which is not limited herein. Such as recording music to obtain sample audio, obtaining sample audio from other devices, networks based on data transmission, and obtaining sample audio from storage spaces such as databases, data carriers, etc. Moreover, the music content of the sample audio includes, but is not limited to, piano playing music, guitar playing music, and the like, and may also be determined based on the requirements of the actual application scenario, which is not limited herein.
Wherein, for any frame in any sample audio, the first sample temporal feature value corresponding to the frame characterizes the time difference between the middle time point of the frame and the nearest note-starting time point (onset) of the frame, and the second sample temporal feature value corresponding to the frame characterizes the time difference between the middle time point of the frame and the nearest note-ending time point (offset) of the frame.
Optionally, the first sample temporal feature value and the second sample temporal feature value corresponding to each frame of the sample audio may be determined in a manner as shown in fig. 2. Fig. 2 is a schematic flowchart of a method for determining a sample time characteristic value according to an embodiment of the present disclosure. As shown in fig. 2, the method for determining a sample time characteristic value provided by the embodiment of the present disclosure includes the following steps:
and S111, performing framing processing on the sample audio to obtain each frame of the sample audio, and determining the middle time point of each frame of the sample audio.
Specifically, the sample audio may be divided into frames based on a window function to obtain each frame of the sample audio, so as to determine the middle time point of each frame of the sample audio; the first sample time characteristic value and the second sample time characteristic value corresponding to each frame of the sample audio are then determined on the basis of these middle time points. The window function includes, but is not limited to, a rectangular window, a Hanning window, a Hamming window, and the like, and may be determined based on the actual application scenario requirements, which is not limited herein.
S112, the note start time point and the note end time point of each note included in the sample audio are obtained, and for each frame, the target note start time point and the target note end time point closest to the frame time are determined.
Specifically, the sample audio is composed of different notes: the note start time points and note end time points of the notes, together with the duration of each note, make up the sample audio. Therefore, after determining the middle time point of each frame of the sample audio, the note start time point and note end time point of each note in the sample audio can be obtained. Further, for each frame in the sample audio, the target note start time point and target note end time point closest to the frame can be determined, so that the corresponding first and second sample time characteristic values can be determined based on the target note start time point and target note end time point corresponding to each frame.
As an example, fig. 3 is a schematic diagram of a distribution of note onset time points provided by an embodiment of the present disclosure, in which 6 frames of a sample audio are shown. In fig. 3, it is assumed that there are 3 note onset time points for the 6 frames of sample audio, distributed as shown in the figure. For the 2nd frame, the note start time point closest to it in time is time point 1, i.e., time point 1 is the target start time point corresponding to the 2nd frame; for the 3rd frame, the closest note start time point is time point 2, i.e., time point 2 is the target start time point corresponding to the 3rd frame; for the 4th frame, the closest note start time point is also time point 2, i.e., time point 2 is the target start time point corresponding to the 4th frame. Based on the above method, the target note start time point corresponding to each frame can be determined, and the target note end time point corresponding to each frame can be determined similarly. It should be noted that the middle time point of each frame, the note start time point of a note, and the note end time point of a note are all relative time points, i.e., they represent time positions within the sample audio.
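As a hedged sketch of step S112, the target note start time point for each frame can be found with a nearest-neighbour search over the (sorted) note start time points; the function and variable names below are illustrative assumptions, and the same routine applies to note end time points.

```python
# Illustrative sketch: for each frame middle time point, pick the nearest note start (or end) time point.
import numpy as np

def nearest_event_per_frame(frame_mid_times, event_times):
    frame_mid_times = np.asarray(frame_mid_times)         # (T,) middle time point of each frame
    event_times = np.asarray(event_times)                 # (N,) sorted note start (or end) time points
    idx = np.searchsorted(event_times, frame_mid_times)
    left = np.clip(idx - 1, 0, len(event_times) - 1)
    right = np.clip(idx, 0, len(event_times) - 1)
    pick_left = (np.abs(frame_mid_times - event_times[left])
                 <= np.abs(event_times[right] - frame_mid_times))
    return np.where(pick_left, event_times[left], event_times[right])
```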
And S113, determining a first sample time characteristic value and a second sample time characteristic value corresponding to each frame of the sample audio based on the middle time point of each frame of the sample audio and the target note starting time point and the target note ending time point corresponding to each frame.
Specifically, for each frame in each sample audio, the time difference between the middle time point of the frame and the start time point of the corresponding target note (hereinafter referred to as the first time difference for convenience of description) and the time difference between the middle time point of the frame and the end time point of the corresponding target note (hereinafter referred to as the second time difference) may be determined.
Taking the first time difference corresponding to a frame as an example: when the first time difference corresponding to the frame is greater than a preset time difference, the first sample time characteristic value corresponding to the frame is determined as 0; when the first time difference corresponding to the frame is less than or equal to the preset time difference, the first sample time characteristic value corresponding to the frame is determined by g(Δ_onset) = 1 - α|Δ_onset|, where Δ_onset is the first time difference corresponding to the frame, α is a normalization coefficient, and g(Δ_onset) is the first sample time characteristic value corresponding to the frame. Since the first sample time characteristic value corresponding to each frame of the sample audio is a value between 0 and 1, α is used to normalize the first time difference corresponding to each frame to a value between 0 and 1. Therefore, for a sample audio, the first sample time characteristic value corresponding to each frame can be expressed as:
g(Δ_i) = 1 - α|Δ_i|, if |Δ_i| ≤ Δ_A
g(Δ_i) = 0, if |Δ_i| > Δ_A
where Δ_A denotes the preset time difference, i is the frame index, Δ_i is the first time difference corresponding to the i-th frame, and g(Δ_i) is the first sample time characteristic value corresponding to the i-th frame.
Referring to fig. 4, fig. 4 is a schematic view of a scenario for determining the first sample time characteristic value according to an embodiment of the present disclosure. As shown in fig. 4, assuming that there is only one note onset time point in the sample audio, that note onset time point is the target note onset time point corresponding to all frames. In fig. 4, it is easy to see that the first time differences corresponding to the 1st to 5th frames (Δ_1, Δ_2, Δ_3, Δ_4 and Δ_5, respectively) are all less than or equal to the preset time difference Δ_A, and therefore, according to the above formula, the corresponding first sample time characteristic values are g(Δ_1), g(Δ_2), g(Δ_3), g(Δ_4) and g(Δ_5), respectively. For the 6th and 7th frames, the corresponding first time differences are clearly larger than the preset time difference Δ_A, so based on the above formula it can be determined that the first sample time characteristic values corresponding to the 6th and 7th frames are both 0.
Similarly, based on the above implementation manner, the second sample time characteristic value corresponding to each frame in the sample audio may be determined based on the second time difference corresponding to each frame in the sample audio, which is not described herein again.
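A minimal sketch of step S113 is given below, assuming the per-frame time differences have already been computed as above; the preset time difference Δ_A and the normalization coefficient α used here are placeholder values, not values fixed by the disclosure.

```python
# Illustrative sketch: map per-frame time differences to sample time characteristic values.
import numpy as np

def sample_time_feature_values(frame_mid_times, target_times, delta_a=0.05, alpha=20.0):
    delta = np.asarray(frame_mid_times) - np.asarray(target_times)  # first (or second) time difference
    g = 1.0 - alpha * np.abs(delta)                                  # g(delta) = 1 - alpha * |delta|
    return np.where(np.abs(delta) <= delta_a, g, 0.0)                # 0 beyond the preset time difference
```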
And S12, training the initial neural network model based on the training data until the total loss function corresponding to the model converges, and determining the model at the end of training as the music transcription model.
Specifically, when the initial neural network model is trained based on training data, the audio feature vector of the sample audio of each training sample may be used as the input of the model, and the output of the model is the first predicted time feature value, the second predicted time feature value and the predicted score corresponding to each frame in the sample audio.
The model architecture of the initial neural network model in the embodiment of the present disclosure is not limited in the embodiment of the present disclosure, and the initial neural network model includes, but is not limited to, network models based on neural networks such as a Recurrent Neural Network (RNN), a Long Short-Term Memory artificial neural network (LSTM), a Gated Recurrent Unit (GRU), and the like, and may be specifically determined based on actual application scenario requirements, and is not limited herein.
For each frame, the corresponding first predicted time characteristic value represents the predicted time difference between the middle time point of the frame and the nearest note start time point of the frame, and the corresponding second predicted time characteristic value represents the predicted time difference between the middle time point of the frame and the nearest note end time point of the frame.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an initial neural network model provided by the embodiment of the present disclosure. The initial neural network model shown in fig. 5 is mainly composed of initial neural networks (hidden layers), and in the process of training the initial neural network model based on training data, after the initial neural network model is trained based on training samples of the training data, a first prediction time characteristic value and a second prediction time characteristic value corresponding to each frame of a sample audio of the training samples and a prediction music score corresponding to the sample audio can be output.
And continuously performing iterative training on the model based on the training samples and the total loss function of the model until the total loss function corresponding to the model is converged. When the total loss function corresponding to the model converges, it indicates that the audio transcription capability and accuracy of the model tend to be stable at this time, and the model at the end of training may be determined as the music transcription model at this time.
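Purely as an illustration of this iterative training, a sketch of a training loop is given below; PyTorch, the Adam optimizer, the learning rate and the simple loss-plateau convergence test are all assumptions of the sketch rather than choices made by the disclosure, and the total loss function itself is detailed below.

```python
# Illustrative sketch: train the initial model until the total loss function (approximately) converges.
import torch

def train_until_converged(model, data_loader, total_loss_fn, lr=1e-3, tol=1e-4, max_epochs=200):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    previous = float("inf")
    for _ in range(max_epochs):
        running = 0.0
        for features, targets in data_loader:           # audio feature vectors and sample labels
            outputs = model(features)                   # predicted score and time characteristic values
            loss = total_loss_fn(outputs, targets)      # l_total
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            running += loss.item()
        running /= max(len(data_loader), 1)
        if abs(previous - running) < tol:               # total loss has stopped decreasing noticeably
            break
        previous = running
    return model
```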
The total loss function l_total corresponding to the model includes a first training loss function, a second training loss function, and a third training loss function, i.e., l_total = l_frame + l_onset + l_offset.
The first training loss function is:
l_frame = Σ_t Σ_k l_bce(I_frame(t, k), P_frame(t, k))
where the sums run over the T frames and K notes, I_frame(t, k) represents the sample score corresponding to the sample audio, P_frame(t, k) represents the predicted score output by the model, and l_bce(I_frame(t, k), P_frame(t, k)) represents the difference between the sample score and the predicted score in the same frame, which may be represented by a corresponding cross-entropy loss function or determined by another loss function, without limitation.
The second training loss function is:
l_onset = Σ_t Σ_k l_bce(G_onset(t, k), R_onset(t, k))
where G_onset(t, k) represents the first sample time characteristic value corresponding to each frame of the sample audio, R_onset(t, k) represents the first predicted time characteristic value corresponding to each frame of the model output, and l_bce(G_onset(t, k), R_onset(t, k)) represents the difference between the first sample time characteristic value and the first predicted time characteristic value of each frame in the sample audio, which may be represented by a corresponding cross-entropy loss function or determined by another loss function, without limitation.
The third training loss function is:
l_offset = Σ_t Σ_k l_bce(G_offset(t, k), R_offset(t, k))
where G_offset(t, k) represents the second sample time characteristic value corresponding to each frame of the sample audio, R_offset(t, k) represents the second predicted time characteristic value corresponding to each frame of the model output, and l_bce(G_offset(t, k), R_offset(t, k)) represents the difference between the second sample time characteristic value and the second predicted time characteristic value of each frame in the sample audio, which may be represented by a corresponding cross-entropy loss function or determined by another loss function, without limitation.
For example, for piano audio, the total number of notes is the number of notes corresponding to the 88 keys of a piano, i.e., K = 88, where k indexes the k-th of the K notes and t indexes the t-th frame. The cross-entropy loss function can be determined as follows: l_bce = -y ln p - (1 - y) ln(1 - p). For the first training loss function, y indicates whether a frame in the sample audio contains a note: it is 1 if the note is contained, or 0 if not. p represents the predicted probability that the frame contains the note, and 1 - p represents the probability of not containing the note. As another example, for the second training loss function, y represents the first sample time characteristic value of a frame in the sample audio, and p represents the corresponding first predicted time characteristic value of the frame.
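A hedged sketch of these three loss terms is shown below; the element-wise cross entropy and the double sum over frames t and notes k follow the formulas above, while the tensor names and the use of PyTorch are assumptions of the sketch.

```python
# Illustrative sketch: element-wise cross entropy and the three-term total loss.
import torch

def l_bce(y, p, eps=1e-7):
    p = p.clamp(eps, 1.0 - eps)                         # avoid log(0)
    return -(y * torch.log(p) + (1.0 - y) * torch.log(1.0 - p))

def total_loss_three_terms(I_frame, P_frame, G_onset, R_onset, G_offset, R_offset):
    # Each tensor has shape (T frames, K notes); the sums run over t and k.
    l_frame = l_bce(I_frame, P_frame).sum()
    l_onset = l_bce(G_onset, R_onset).sum()
    l_offset = l_bce(G_offset, R_offset).sum()
    return l_frame + l_onset + l_offset
```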
Wherein the value of the first training loss function characterizes a difference between a sample score corresponding to the sample audio and a predicted score output by the model.
The value of the second training loss function represents the difference between the first sample time characteristic value corresponding to the sample audio and the first predicted time characteristic value output by the model, namely the difference between the degree of correspondence between each frame of the sample audio and the note onset time point and the degree of correspondence between each frame of the sample audio and the note onset time point predicted by the model.
The value of the third training loss function represents a difference between a second sample time characteristic value corresponding to the sample audio and a second predicted time characteristic value output by the model, namely, a difference between a degree of correspondence between each frame of the sample audio and the note ending time point and a degree of correspondence between each frame of the sample audio and the note ending time point predicted by the model.
Optionally, since the note start time points of the sample audio are influenced to a certain extent by the volume (velocity) of the sample audio, the first sample time characteristic value of each frame of the sample audio is also influenced by the volume. Therefore, in order to further improve the stability and accuracy of the final music transcription model, the influence of the volume of the sample audio can be taken into account when training the initial neural network model. The volume of the sample audio characterizes the velocity with which each note in the sample audio is played.
With reference to fig. 6, fig. 6 is another structural diagram of the initial neural network model provided in the embodiment of the present disclosure. As shown in fig. 6, in this case each training sample further includes a sample volume corresponding to the sample audio as an additional training target, and the output of the model during each training pass further includes a predicted volume corresponding to the sample audio. The hidden layer of the initial neural network model is again mainly composed of a neural network.
Further, in this case, the total loss function l_total for model training includes, in addition to the first training loss function, the second training loss function and the third training loss function, a fourth training loss function, whose value characterizes the difference between the sample volume of the sample audio and the predicted volume corresponding to the sample audio.
Specifically, when determining the fourth training loss function, a first loss l_bce(I_vel(t, k), P_vel(t, k)) corresponding to each frame of the sample audio may be determined based on the sample volume and the predicted volume corresponding to each frame of the sample audio. Here, I_vel(t, k) represents the sample volume corresponding to each frame of the sample audio, P_vel(t, k) represents the predicted volume of each frame of the sample audio predicted by the model, and the first loss l_bce(I_vel(t, k), P_vel(t, k)) represents the difference between the sample volume of a frame of the sample audio and the corresponding predicted volume, which may likewise be represented by a corresponding cross-entropy loss function or determined by another loss function, without limitation.
Further, a second loss I_onset(t, k) · l_bce(I_vel(t, k), P_vel(t, k)) corresponding to each frame of the sample audio may be determined based on the note characterization value I_onset(t, k) of each frame of the sample audio and the first loss corresponding to that frame, where the note characterization value I_onset(t, k) indicates whether a note exists in the frame (if it exists, the corresponding note characterization value is 1; if not, it is 0).
Based on the above, the fourth training loss function used when training the model based on the training data can be obtained as:
l_vel = Σ_t Σ_k I_onset(t, k) · l_bce(I_vel(t, k), P_vel(t, k))
In this case, the total loss function for model training is l_total = l_frame + l_onset + l_offset + l_vel.
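For this optional volume-aware variant, the fourth term can be sketched in the same style, reusing the l_bce helper from the sketch above; the masking by the note characterization value follows the formula for l_vel.

```python
# Illustrative sketch: the fourth training loss, masked by the note characterization value I_onset(t, k).
def velocity_loss(I_onset, I_vel, P_vel):
    return (I_onset * l_bce(I_vel, P_vel)).sum()        # summed over frames t and notes k

# Extended total loss: l_total = l_frame + l_onset + l_offset + l_vel
```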
As an alternative, fig. 7 shows another structural schematic diagram of the initial neural network model provided by the present disclosure. As shown in fig. 7, the initial neural network model is built from GRU-based neural networks. The input of the model may be the log Mel spectrogram of the sample audio, and the sample feature vectors of the sample audio are obtained through the convolution layers. The model obtains the predicted volume and the second predicted time characteristic value corresponding to the sample audio through separate GRU neural networks. Meanwhile, the first predicted time characteristic value corresponding to each frame of the sample audio is obtained through a GRU neural network based on the output of the predicted-volume branch and a preliminary first predicted time characteristic value produced by an independent neural network. On the other hand, the initial neural network model may combine the first predicted time characteristic value, the second predicted time characteristic value and the sample feature vectors of the sample audio to obtain the predicted score corresponding to the sample audio.
Optionally, to further ensure the transcription accuracy of the music transcription model obtained by training, the model at the end of training may be further tested when the total loss function corresponding to the model converges, so that the model at the end of training is determined as the music transcription model only when the test result satisfies a preset test condition.
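As a non-limiting illustration of the fig. 7 arrangement described above, one possible reading is sketched below in PyTorch before turning to testing; the convolutional stack, layer sizes, sigmoid outputs and the exact wiring between the branches are assumptions of the sketch, not concrete parameters of the disclosure.

```python
import torch
import torch.nn as nn

class TranscriptionSketch(nn.Module):
    # Illustrative only: a conv feature extractor followed by GRU branches for volume,
    # onset, offset and frame/score outputs, loosely following the fig. 7 description.
    def __init__(self, n_mels=229, n_notes=88, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(                       # extracts per-frame feature vectors
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
        )
        feat = 16 * n_mels
        def branch():
            gru = nn.GRU(feat, hidden, batch_first=True, bidirectional=True)
            head = nn.Linear(2 * hidden, n_notes)
            return gru, head
        self.vel_gru, self.vel_head = branch()           # predicted volume
        self.off_gru, self.off_head = branch()           # second predicted time characteristic value
        self.pre_onset_gru, self.pre_onset_head = branch()
        self.onset_gru = nn.GRU(2 * n_notes, hidden, batch_first=True, bidirectional=True)
        self.onset_head = nn.Linear(2 * hidden, n_notes) # first predicted time characteristic value
        self.frame_gru = nn.GRU(feat + 2 * n_notes, hidden, batch_first=True, bidirectional=True)
        self.frame_head = nn.Linear(2 * hidden, n_notes) # predicted score

    def forward(self, log_mel):                          # log_mel: (batch, T, n_mels)
        x = self.conv(log_mel.unsqueeze(1))              # (batch, 16, T, n_mels)
        x = x.permute(0, 2, 1, 3).flatten(2)             # (batch, T, 16 * n_mels)
        vel = torch.sigmoid(self.vel_head(self.vel_gru(x)[0]))
        offset = torch.sigmoid(self.off_head(self.off_gru(x)[0]))
        pre_onset = torch.sigmoid(self.pre_onset_head(self.pre_onset_gru(x)[0]))
        onset_in = torch.cat([pre_onset, vel.detach()], dim=-1)
        onset = torch.sigmoid(self.onset_head(self.onset_gru(onset_in)[0]))
        frame_in = torch.cat([x, onset.detach(), offset.detach()], dim=-1)
        frame = torch.sigmoid(self.frame_head(self.frame_gru(frame_in)[0]))
        return frame, onset, offset, vel
```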
Specifically, for this testing, test data may be obtained, where each test audio in the test data is accompanied by a test score corresponding to the test audio, and by a first test time characteristic value and a second test time characteristic value corresponding to each frame of the test audio. The first test time characteristic value represents the time difference between the middle time point of the corresponding frame and the nearest note start time point, and the second test time characteristic value represents the time difference between the middle time point of the corresponding frame and the nearest note end time point.
Furthermore, the distribution of the note start time points and note end time points of each note in the test audio can be accurately obtained based on the first test time characteristic value and the second test time characteristic value corresponding to each frame of the test audio. Specifically, a frame interval containing a local maximum of the first test time characteristic value may be determined from the first test time characteristic values corresponding to the frames of the test audio. Such a frame interval contains 3 frames whose first test time characteristic values are different from 0, and the first test time characteristic value of the middle frame is the maximum within the interval. That is, a note start time point lies within the frame interval, and its precise position can be determined from the first test time characteristic values of the 3 frames in the interval. With reference to fig. 8, fig. 8 is a schematic diagram of a model testing method provided in the embodiment of the present disclosure. As shown in fig. 8, assume that the abscissas of A, B and C are the middle time points of the three frames and their ordinates are the corresponding first test time characteristic values, with B the local maximum. Based on geometric reasoning, AB can be extended to D, where DC is perpendicular to the abscissa; the midpoint E of DC is then determined, and a line EF parallel to the abscissa is drawn through E, intersecting AD at G. The difference between the abscissas of B and G is the time difference between the note start time point and the middle time point of the frame corresponding to B. Writing g(A), g(B) and g(C) for the first test time characteristic values of the three frames and Δt for the time interval between adjacent frame middle time points, this construction gives, when g(C) is greater than or equal to g(A):
time difference = Δt · (g(C) - g(A)) / (2 · (g(B) - g(A)))
and, when the first test time characteristic value corresponding to A is greater than that corresponding to C (both being less than that corresponding to B):
time difference = Δt · (g(C) - g(A)) / (2 · (g(B) - g(C)))
by analogy, the precise distribution of the note starting time points of each note in the test audio can be obtained.
Similarly, the precise distribution of the note ending time points of each note in the test audio can be determined based on the second test time characteristic value corresponding to each frame in the test audio, and is not described herein again.
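The analytical refinement described above can be sketched as follows for three adjacent frames A, B and C whose first (or second) test time characteristic values are g_a, g_b and g_c, with g_b the local maximum; the expressions restate the geometric construction in code, carry the same caveats as the reconstructed formulas above, and the frame hop parameter is an assumption of the sketch.

```python
# Illustrative sketch: refine a note start (or end) time point from three adjacent frames.
def refine_time_point(t_b, g_a, g_b, g_c, frame_hop):
    # t_b: middle time point of the middle frame B; frame_hop: time between adjacent frame middle points.
    if g_c >= g_a:                                      # the time point lies between B and C (or at B)
        shift = frame_hop * (g_c - g_a) / (2.0 * (g_b - g_a))
    else:                                               # the time point lies between A and B
        shift = frame_hop * (g_c - g_a) / (2.0 * (g_b - g_c))
    return t_b + shift
```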
On the other hand, the test audio is transcribed using the model at the end of training to obtain the score predicted for the test audio, and the note start time point and note end time point of each note in this transcribed score are then determined. When the note start time points and note end time points of the notes in the transcribed score, as the test result, satisfy the preset test condition, the model at the end of training can be determined as the music transcription model.
The preset test condition may be determined based on the actual application scenario requirement, and is not limited herein. For example, when the note-on time point and the note-off time point of each note in the test audio respectively coincide with the note-on time point and the note-off time point in the test score, or the error is within a preset range, it may be determined that the test result satisfies the preset test condition. For another example, when the note start time point and note end time point of a certain proportion of notes in the test audio are respectively consistent with the note start time point and note end time point in the test score or the error is within a preset range, it may be determined that the test result satisfies the preset test condition.
When the test result does not meet the preset test condition, the model can be trained continuously based on the training data until the test result meets the preset test condition, and the model at the end of training is determined as the music transcription model.
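By way of illustration only, one possible form of the preset test condition, matching note start and end times against the reference times within a preset error range, is sketched below; the tolerance, the required proportion and the representation of notes as (onset, offset) pairs are assumptions of the sketch.

```python
# Illustrative sketch: check whether enough notes match within a preset error range.
def test_condition_satisfied(ref_notes, predicted_notes, tolerance=0.05, required_ratio=0.9):
    matched, used = 0, set()
    for ref_onset, ref_offset in ref_notes:             # reference (onset, offset) pairs in seconds
        for j, (p_onset, p_offset) in enumerate(predicted_notes):
            if j in used:
                continue
            if abs(ref_onset - p_onset) <= tolerance and abs(ref_offset - p_offset) <= tolerance:
                matched += 1
                used.add(j)
                break
    return matched >= required_ratio * max(len(ref_notes), 1)
```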
In some possible embodiments, an audio to be processed may further be obtained and the audio feature vector corresponding to the audio to be processed determined; the specific determination manner may refer to the implementation of step S11 for the sample audio in fig. 1, which is not repeated here. Further, the audio feature vector of the audio to be processed may be input into the music transcription model, so as to obtain the music score corresponding to the audio to be processed based on the output result of the music transcription model, for example transcribing a piano performance into a piano roll, or a guitar performance into a guitar score.
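A minimal inference sketch, reusing the feature-extraction and model sketches above, is given below; the 0.5 threshold used to binarize the predicted score is an illustrative assumption.

```python
# Illustrative sketch: transcribe an audio to be processed with the trained music transcription model.
import torch

def transcribe(path, model):
    features = torch.from_numpy(log_mel_features(path)).float().unsqueeze(0)  # (1, T, F)
    with torch.no_grad():
        frame, onset, offset, vel = model(features)     # outputs of the music transcription model
    return (frame.squeeze(0) > 0.5)                     # (T, K) piano-roll style score
```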
In the embodiment of the present disclosure, by using the sample audio feature vector, the sample volume, the music score corresponding to the sample audio in each training sample in the training data, and the time feature value corresponding to each frame in the sample audio, the initial neural network model can be comprehensively trained from the aspects of the note starting time point, the note ending time point, the volume, the whole music score, and the like, and the model is further modified and trained by combining the loss functions corresponding to the above factors, so as to improve the transcription accuracy of the music transcription model obtained after training. On the other hand, the embodiment of the present disclosure may also transcribe the audio to be processed based on the music transcription model to obtain a corresponding music score, and the applicability is high.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a training apparatus for a music transcription model provided in an embodiment of the present disclosure. The device 1 provided by the embodiment of the disclosure comprises:
a training data obtaining module 11, configured to obtain training data, where each training sample in the training data includes an audio feature vector of a sample audio, a sample score corresponding to the sample audio, and a first sample time feature value and a second sample time feature value corresponding to each frame in the sample audio, where, for any frame, the first sample time feature value represents a time difference between a middle time point of the frame and a nearest note start time point of the frame, and the second sample time feature value represents a time difference between the middle time point of the frame and a nearest note end time point of the frame;
the training module 12 is configured to train the initial neural network model based on the training data until the model converges to the corresponding total loss function, and determine the model after the training as a music transcription model;
the input of the model is the audio feature vector of the sample audio, and the output of the model comprises the first prediction time feature value, the second prediction time feature value and the prediction music score corresponding to each frame in the sample audio;
the total loss function includes a first training loss function, a second training loss function, and a third training loss function, wherein a value of the first training loss function represents a difference between a sample score corresponding to the sample audio and the predicted score, a value of the second training loss represents a difference between a first sample time feature value corresponding to the sample audio and the first predicted time feature value, and a value of the third training loss function represents a difference between a second sample time feature value corresponding to the sample audio and the second predicted time feature value.
In some possible embodiments, each of the training samples further includes a sample volume corresponding to the sample audio as an additional training target, the output of the model further includes a predicted volume corresponding to the sample audio, and the total loss function further includes a fourth training loss function, a value of which characterizes the difference between the sample volume of the sample audio and the predicted volume.
In some possible embodiments, each training sample further includes a note representation value of each frame included in the sample audio, where the note representation value represents whether a note start time point is included in a frame; a sample volume of the sample audio includes a sample volume of each frame included in the sample audio, and the predicted volume includes a predicted volume of each frame;
wherein, the training module 12 is configured to:
for the sample audio, calculating a first loss corresponding to each frame of the sample audio based on a sample volume and a predicted volume corresponding to each frame of the sample audio;
determining a second loss corresponding to each frame of the sample audio based on the note characterization value of each frame of the sample audio and the corresponding first loss;
and obtaining the fourth training loss function based on the second loss corresponding to each frame of each sample audio.
In some possible embodiments, the first training loss function, the second training loss function, and the third training loss function are cross entropy loss functions, respectively.
In some possible embodiments, the training data obtaining module 11 is further configured to:
performing framing processing on the sample audio to obtain each frame of the sample audio, and determining the middle time point of each frame of the sample audio;
determining a note start time point and a note end time point of each note included in the sample audio, and for each frame, determining a target note start time point and a target note end time point which are closest to the frame time;
and determining a first sample time characteristic value and a second sample time characteristic value corresponding to each frame of the sample audio based on the target note starting time point and the target note ending time point corresponding to each frame of the sample audio.
In some possible embodiments, the training data obtaining module 11 is further configured to:
for each frame, determining a first time difference between the middle time point of the frame and the start time point of the corresponding target note, and a second time difference between the middle time point of the frame and the end time point of the corresponding target note;
determining a first sample time characteristic value corresponding to each frame in the sample audio based on a first time difference corresponding to each frame in the sample audio;
and determining a second sample time characteristic value corresponding to each frame in the sample audio based on a second time difference corresponding to each frame in the sample audio.
In some possible embodiments, the training data obtaining module 11 is configured to:
for each frame, if the first time difference corresponding to the frame is greater than the preset time difference, determining the first sample time characteristic value corresponding to the frame as 0;
if the first time difference corresponding to the frame is less than or equal to the preset time difference, the first sample time characteristic value corresponding to the frame is determined by the following method:
g(Δ_onset) = 1 - α|Δ_onset|
where Δ_onset is the first time difference corresponding to the frame, α is the normalization coefficient, and g(Δ_onset) is the first sample time characteristic value corresponding to the frame.
In some possible embodiments, the training data obtaining module 11 is configured to:
for each frame, if the second time difference corresponding to the frame is greater than the preset time difference, determining the second sample time characteristic value corresponding to the frame as 0;
if the second time difference corresponding to the frame is less than or equal to the preset time difference, the second sample time characteristic value corresponding to the frame is determined by the following method:
g(Δ_offset) = 1 - α|Δ_offset|
where Δ_offset is the second time difference corresponding to the frame, α is the normalization coefficient, and g(Δ_offset) is the second sample time characteristic value corresponding to the frame.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a music transcription apparatus provided in an embodiment of the present disclosure. The device 2 provided by the embodiment of the disclosure comprises:
a to-be-processed audio acquiring module 21, configured to acquire a to-be-processed audio and determine an audio feature vector corresponding to the to-be-processed audio;
a transcription module 22, configured to input the audio feature vector of the audio to be processed into a music transcription model, and obtain a music score corresponding to the audio to be processed based on an output result of the music transcription model;
the music transcription model is obtained by training the music transcription model in the training method of fig. 1.
Referring now to FIG. 11, shown is a schematic diagram of an electronic device 600 suitable for use in implementing embodiments of the present disclosure. The electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., car navigation terminals), and the like, and fixed terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 11 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
The electronic device includes: a memory and a processor, wherein the processor may be referred to as the processing device 601 described below, and the memory may include at least one of a Read Only Memory (ROM)602, a Random Access Memory (RAM)603, and a storage device 608, which are described below:
as shown in fig. 11, electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate wirelessly or by wire with other electronic devices to exchange data. While fig. 11 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the client and the server may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring training data, wherein each training sample in the training data comprises an audio feature vector of a sample audio, a sample music score corresponding to the sample audio, and a first sample time feature value and a second sample time feature value corresponding to each frame in the sample audio, wherein for any frame, the first sample time feature value represents a time difference between a middle time point of the frame and a nearest note start time point of the frame, and the second sample time feature value represents a time difference between the middle time point of the frame and a nearest note end time point of the frame; training the initial neural network model based on the training data until the model converges to the corresponding total loss function, and determining the model after the training is finished as a music transcription model; the input of the model is the audio feature vector of the sample audio, and the output of the model comprises the first prediction time feature value, the second prediction time feature value and the prediction music score corresponding to each frame in the sample audio; the total loss function includes a first training loss function, a second training loss function, and a third training loss function, wherein a value of the first training loss function represents a difference between a sample score corresponding to the sample audio and the predicted score, a value of the second training loss represents a difference between a first sample time feature value corresponding to the sample audio and the first predicted time feature value, and a value of the third training loss function represents a difference between a second sample time feature value corresponding to the sample audio and the second predicted time feature value.
Alternatively, the computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring audio to be processed, and determining an audio characteristic vector corresponding to the audio to be processed; inputting the audio characteristic vector of the audio to be processed into a music transcription model, and obtaining a music score corresponding to the audio to be processed based on an output result of the music transcription model; the music transcription model is obtained by training through the training method of the music transcription model provided by the embodiment of the disclosure.
Computer program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules or units described in the embodiments of the present disclosure may be implemented by software or hardware. In some cases, the name of a module or unit does not constitute a limitation on the unit itself; for example, the training module may also be described as a "music transcription model training module".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or electronic device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or electronic device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, example one provides a training method of a music transcription model, including:
acquiring training data, wherein each training sample in the training data comprises an audio feature vector of a sample audio, a sample music score corresponding to the sample audio, and a first sample time feature value and a second sample time feature value corresponding to each frame in the sample audio, wherein for any frame, the first sample time feature value represents a time difference between a middle time point of the frame and a nearest note start time point of the frame, and the second sample time feature value represents a time difference between the middle time point of the frame and a nearest note end time point of the frame;
training the initial neural network model based on the training data until the model converges to the corresponding total loss function, and determining the model after the training is finished as a music transcription model;
the input of the model is the audio feature vector of the sample audio, and the output of the model comprises the first prediction time feature value, the second prediction time feature value and the prediction music score corresponding to each frame in the sample audio;
the total loss function includes a first training loss function, a second training loss function, and a third training loss function, wherein a value of the first training loss function represents a difference between a sample score corresponding to the sample audio and the predicted score, a value of the second training loss represents a difference between a first sample time feature value corresponding to the sample audio and the first predicted time feature value, and a value of the third training loss function represents a difference between a second sample time feature value corresponding to the sample audio and the second predicted time feature value.
In a possible implementation, each of the training samples further includes a sample volume corresponding to the sample audio, the input of the model further includes a sample volume of the sample audio, the output of the model further includes a predicted volume corresponding to the sample audio, and the total loss function further includes a fourth training loss function, and a value of the fourth training loss function represents a difference between the sample volume and the predicted volume of the sample audio.
In a possible embodiment, each of the training samples further includes a note characterization value of each frame included in the sample audio, where the note characterization value represents whether a note start time point is included in the frame; a sample volume of the sample audio includes a sample volume of each frame included in the sample audio, and the predicted volume includes a predicted volume of each frame;
wherein the fourth training loss function is obtained by:
for the sample audio, calculating a first loss corresponding to each frame of the sample audio based on a sample volume and a predicted volume corresponding to each frame of the sample audio;
determining a second loss corresponding to each frame of the sample audio based on the note characterization value of each frame of the sample audio and the corresponding first loss;
and obtaining the fourth training loss function based on the second loss corresponding to each frame of each sample audio.
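One possible reading of this procedure, in which the note characterization value acts as a per-frame mask so that only frames containing a note start time point contribute to the volume loss, is sketched below. The masking interpretation, the tensor shapes, and names such as note_mask are assumptions rather than a definitive implementation of the disclosure.

```python
import torch.nn.functional as F

def fourth_training_loss(pred_volume, sample_volume, note_mask):
    """Sketch of the fourth training loss under the masking reading.

    pred_volume and sample_volume hold per-frame volumes scaled to [0, 1];
    note_mask holds the note characterization value of each frame (1.0 if
    the frame contains a note start time point, otherwise 0.0). Shapes of
    (batch, frames, pitches) are assumed.
    """
    # First loss: per-frame difference between the sample volume and the predicted volume.
    first_loss = F.binary_cross_entropy(pred_volume, sample_volume, reduction="none")
    # Second loss: weight each frame's first loss by its note characterization value.
    second_loss = first_loss * note_mask
    # Aggregate the second losses over all frames of all sample audios.
    return second_loss.sum() / note_mask.sum().clamp(min=1.0)
```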
In one possible embodiment, the first training loss function, the second training loss function, and the third training loss function are cross entropy loss functions, respectively.
In a possible embodiment, the method further includes:
performing framing processing on the sample audio to obtain each frame of the sample audio, and determining the middle time point of each frame of the sample audio;
determining a note start time point and a note end time point of each note included in the sample audio, and for each frame, determining a target note start time point and a target note end time point which are closest to the frame time;
and determining a first sample time characteristic value and a second sample time characteristic value corresponding to each frame of the sample audio based on the target note starting time point and the target note ending time point corresponding to each frame of the sample audio.
In one possible implementation, the determining a first sample time characteristic value and a second sample time characteristic value corresponding to each frame of the sample audio based on a target note start time point and a target note end time point corresponding to each frame of the sample audio includes:
for each frame, determining a first time difference between the middle time point of the frame and the start time point of the corresponding target note, and a second time difference between the middle time point of the frame and the end time point of the corresponding target note;
determining a first sample time characteristic value corresponding to each frame in the sample audio based on a first time difference corresponding to each frame in the sample audio;
and determining a second sample time characteristic value corresponding to each frame in the sample audio based on a second time difference corresponding to each frame in the sample audio.
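The quantities described above can be precomputed when preparing the training data. The NumPy sketch below takes the frame count, the frame hop, and the note start/end time points of a sample audio, and returns the first and second time differences of every frame; the hop-based definition of the frame middle time point and all names are assumptions for illustration.

```python
import numpy as np

def frame_time_differences(num_frames, hop_time, onset_times, offset_times):
    """Per-frame first and second time differences, as described above.

    hop_time is the frame hop in seconds; onset_times and offset_times are
    1-D arrays of the note start/end time points contained in the sample
    audio. Frame k is assumed to span [k*hop_time, (k+1)*hop_time).
    """
    # Middle time point of each frame.
    mid_times = (np.arange(num_frames) + 0.5) * hop_time

    onset_times = np.asarray(onset_times, dtype=float)
    offset_times = np.asarray(offset_times, dtype=float)

    # Target note start/end time points: the ones closest in time to each frame.
    nearest_onset = onset_times[np.abs(mid_times[:, None] - onset_times[None, :]).argmin(axis=1)]
    nearest_offset = offset_times[np.abs(mid_times[:, None] - offset_times[None, :]).argmin(axis=1)]

    # First time difference (to the target note start time point) and
    # second time difference (to the target note end time point).
    first_diff = mid_times - nearest_onset
    second_diff = mid_times - nearest_offset
    return first_diff, second_diff
```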
In one possible embodiment, the determining the first sample temporal feature value corresponding to each frame in the sample audio based on the first time difference corresponding to each frame in the sample audio includes:
for each frame, if the first time difference corresponding to the frame is greater than the preset time difference, determining the first sample time characteristic value corresponding to the frame as 0;
if the first time difference corresponding to the frame is less than or equal to the preset time difference, the first sample time characteristic value corresponding to the frame is determined by the following method:
g(Δonset) = 1 - α|Δonset|
where Δonset is the first time difference corresponding to the frame, α is the normalization coefficient, and g(Δonset) is the first sample time characteristic value corresponding to the frame.
In one possible embodiment, the determining the second sample time characteristic value corresponding to each frame in the sample audio based on the second time difference corresponding to each frame in the sample audio includes:
for each frame, if the second time difference corresponding to the frame is greater than the preset time difference, determining the second sample time characteristic value corresponding to the frame as 0;
if the second time difference corresponding to the frame is less than or equal to the preset time difference, the second sample time characteristic value corresponding to the frame is determined by the following method:
g(Δoffset) = 1 - α|Δoffset|
where Δoffset is the second time difference corresponding to the frame, α is the normalization coefficient, and g(Δoffset) is the second sample time characteristic value corresponding to the frame.
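Both sample time characteristic values follow the same g(Δ) form, differing only in whether the first (onset-related) or second (offset-related) time difference is supplied. A vectorised sketch is given below; reading the threshold comparison as applying to the magnitude of the time difference, and the names preset_diff and alpha, are assumptions. Combined with the frame_time_differences sketch above, passing the first time differences yields the first sample time characteristic values and passing the second time differences yields the second.

```python
import numpy as np

def sample_time_feature_value(delta, preset_diff, alpha):
    """g(Δ) = 1 - α|Δ| when the time difference is within the preset time
    difference, and 0 otherwise.

    delta may hold either the first or the second time difference of each
    frame; alpha is the normalization coefficient.
    """
    delta = np.asarray(delta, dtype=float)
    value = 1.0 - alpha * np.abs(delta)
    # Frames whose time difference exceeds the preset time difference get value 0.
    return np.where(np.abs(delta) <= preset_diff, value, 0.0)
```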
According to one or more embodiments of the present disclosure, example two provides a music transcription method, including:
acquiring audio to be processed, and determining an audio characteristic vector corresponding to the audio to be processed;
inputting the audio characteristic vector of the audio to be processed into a music transcription model, and obtaining a music score corresponding to the audio to be processed based on an output result of the music transcription model;
the music transcription model is obtained by training through the method.
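As a non-authoritative sketch of how example two might be exercised once a model has been trained, the snippet below runs a trained model on the feature vector of the audio to be processed. The output key "score", the batching convention, and the omission of any post-processing into an actual notation format are assumptions.

```python
import torch

def transcribe(audio_feature_vector, music_transcription_model):
    """Obtain the music score corresponding to the audio to be processed.

    audio_feature_vector is assumed to be a (frames, feature_dim) tensor,
    e.g. a log-mel spectrogram; the model is assumed to return a dict whose
    "score" entry holds the per-frame predicted score.
    """
    music_transcription_model.eval()
    with torch.no_grad():
        # Add a batch dimension and run the music transcription model.
        output = music_transcription_model(audio_feature_vector.unsqueeze(0))
    # The music score corresponding to the audio is derived from the output result.
    return output["score"].squeeze(0)
```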
Example three provides a training apparatus of a music transcription model corresponding to example one, including:
a training data obtaining module, configured to obtain training data, where each training sample in the training data includes an audio feature vector of a sample audio, a sample score corresponding to the sample audio, and a first sample time feature value and a second sample time feature value corresponding to each frame in the sample audio, where, for any frame, the first sample time feature value represents a time difference between a middle time point of the frame and a nearest note start time point of the frame, and the second sample time feature value represents a time difference between the middle time point of the frame and a nearest note end time point of the frame;
the training module is used for training the initial neural network model based on the training data until the model converges to the corresponding total loss function, and determining the model after the training is finished as a music transcription model;
the input of the model is the audio feature vector of the sample audio, and the output of the model comprises the first prediction time feature value, the second prediction time feature value and the prediction music score corresponding to each frame in the sample audio;
the total loss function includes a first training loss function, a second training loss function, and a third training loss function, wherein a value of the first training loss function represents a difference between a sample score corresponding to the sample audio and the predicted score, a value of the second training loss represents a difference between a first sample time feature value corresponding to the sample audio and the first predicted time feature value, and a value of the third training loss function represents a difference between a second sample time feature value corresponding to the sample audio and the second predicted time feature value.
In some possible embodiments, each of the training samples further includes a sample volume corresponding to the sample audio, the input of the model further includes a sample volume of the sample audio, the output of the model further includes a predicted volume corresponding to the sample audio, and the total loss function further includes a fourth training loss function, and a value of the fourth training loss function represents a difference between the sample volume and the predicted volume of the sample audio.
In some possible embodiments, each training sample further includes a note characterization value of each frame included in the sample audio, where the note characterization value represents whether a note start time point is included in the frame; a sample volume of the sample audio includes a sample volume of each frame included in the sample audio, and the predicted volume includes a predicted volume of each frame;
wherein, the training module is used for:
for the sample audio, calculating a first loss corresponding to each frame of the sample audio based on a sample volume and a predicted volume corresponding to each frame of the sample audio;
determining a second loss corresponding to each frame of the sample audio based on the note characterization value of each frame of the sample audio and the corresponding first loss;
and obtaining the fourth training loss function based on the second loss corresponding to each frame of each sample audio.
In some possible embodiments, the first training loss function, the second training loss function, and the third training loss function are cross entropy loss functions, respectively.
In some possible embodiments, the training data obtaining module is further configured to:
performing framing processing on the sample audio to obtain each frame of the sample audio, and determining the middle time point of each frame of the sample audio;
determining a note start time point and a note end time point of each note included in the sample audio, and for each frame, determining a target note start time point and a target note end time point which are closest to the frame time;
and determining a first sample time characteristic value and a second sample time characteristic value corresponding to each frame of the sample audio based on the target note starting time point and the target note ending time point corresponding to each frame of the sample audio.
In some possible embodiments, the training data obtaining module is further configured to:
for each frame, determining a first time difference between the middle time point of the frame and the start time point of the corresponding target note, and a second time difference between the middle time point of the frame and the end time point of the corresponding target note;
determining a first sample time characteristic value corresponding to each frame in the sample audio based on a first time difference corresponding to each frame in the sample audio;
and determining a second sample time characteristic value corresponding to each frame in the sample audio based on a second time difference corresponding to each frame in the sample audio.
In some possible embodiments, the training data obtaining module is configured to:
for each frame, if the first time difference corresponding to the frame is greater than the preset time difference, determining the first sample time characteristic value corresponding to the frame as 0;
if the first time difference corresponding to the frame is less than or equal to the preset time difference, the first sample time characteristic value corresponding to the frame is determined by the following method:
g(Δonset) = 1 - α|Δonset|
where Δonset is the first time difference corresponding to the frame, α is the normalization coefficient, and g(Δonset) is the first sample time characteristic value corresponding to the frame.
In some possible embodiments, the training data obtaining module is configured to:
for each frame, if the second time difference corresponding to the frame is greater than the preset time difference, determining the second sample time characteristic value corresponding to the frame as 0;
if the second time difference corresponding to the frame is less than or equal to the preset time difference, the second sample time characteristic value corresponding to the frame is determined by the following method:
g(Δoffset) = 1 - α|Δoffset|
where Δoffset is the second time difference corresponding to the frame, α is the normalization coefficient, and g(Δoffset) is the second sample time characteristic value corresponding to the frame.
Example four provides a music transcription apparatus corresponding to example two, including:
the audio processing device comprises a to-be-processed audio acquisition module, a to-be-processed audio acquisition module and a to-be-processed audio acquisition module, wherein the to-be-processed audio acquisition module is used for acquiring audio to be processed and determining an audio characteristic vector corresponding to the audio to be processed;
the transcription module is used for inputting the audio characteristic vector of the audio to be processed into a music transcription model and obtaining a music score corresponding to the audio to be processed based on the output result of the music transcription model;
the music transcription model is obtained by training through the training method of the music transcription model in the first example.
The foregoing description is merely illustrative of the preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combinations of the features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in the present disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (13)

1. A method for training a music transcription model, the method comprising:
acquiring training data, wherein each training sample in the training data comprises an audio feature vector of a sample audio, a sample music score corresponding to the sample audio, and a first sample time feature value and a second sample time feature value corresponding to each frame in the sample audio, wherein for any frame, the first sample time feature value represents a time difference between a middle time point of the frame and a nearest note start time point of the frame, and the second sample time feature value represents a time difference between the middle time point of the frame and a nearest note end time point of the frame;
training the initial neural network model based on the training data until the model converges to the corresponding total loss function, and determining the model after the training is finished as a music transcription model;
the input of the model is an audio feature vector of the sample audio, and the output of the model comprises the first prediction time feature value, the second prediction time feature value and the prediction music score corresponding to each frame in the sample audio;
the total loss function includes a first training loss function whose values characterize a difference between a sample score corresponding to the sample audio and the predicted score, a second training loss function whose values characterize a difference between a first sample temporal feature value corresponding to the sample audio and the first predicted temporal feature value, and a third training loss function whose values characterize a difference between a second sample temporal feature value corresponding to the sample audio and the second predicted temporal feature value.
2. The method of claim 1, wherein each training sample further comprises a corresponding sample volume of the sample audio, wherein the input to the model further comprises the sample volume of the sample audio, wherein the output to the model further comprises a corresponding predicted volume of the sample audio, and wherein the total loss function further comprises a fourth training loss function, wherein the value of the fourth training loss function characterizes a difference between the sample volume and the predicted volume of the sample audio.
3. The method of claim 2, wherein each training sample further comprises a note characterization value of each frame included in the sample audio, the note characterization value characterizing whether a frame includes a note start time point; a sample volume of the sample audio comprises a sample volume of each frame included in the sample audio, the predicted volume including a predicted volume of the each frame;
wherein the fourth training loss function is obtained by:
for the sample audio, calculating a first loss corresponding to each frame of the sample audio based on a sample volume and a predicted volume corresponding to each frame of the sample audio;
determining a second loss corresponding to each frame of the sample audio based on the note characterization value and the corresponding first loss of each frame of the sample audio;
and obtaining the fourth training loss function based on the second loss corresponding to each frame of each sample audio.
4. The method of claim 1, wherein the first training loss function, the second training loss function, and the third training loss function are each cross entropy loss functions.
5. The method of claim 1, further comprising:
performing framing processing on the sample audio to obtain each frame of the sample audio, and determining the middle time point of each frame of the sample audio;
obtaining the note starting time point and the note ending time point of each note contained in the sample audio, and determining the target note starting time point and the target note ending time point which are closest to the frame time for each frame;
and determining a first sample time characteristic value and a second sample time characteristic value corresponding to each frame of the sample audio based on the middle time point of each frame of the sample audio and the target note starting time point and the target note ending time point corresponding to each frame.
6. A method as claimed in claim 5, wherein the determining the first and second sample temporal feature values corresponding to each frame of the sample audio based on the target note-starting time point and the target note-ending time point corresponding to each frame of the sample audio comprises:
for each frame, determining a first time difference between the middle time point of the frame and the start time point of the corresponding target note, and a second time difference between the middle time point of the frame and the end time point of the corresponding target note;
determining a first sample time characteristic value corresponding to each frame in the sample audio based on a first time difference corresponding to each frame in the sample audio;
and determining a second sample time characteristic value corresponding to each frame in the sample audio based on a second time difference corresponding to each frame in the sample audio.
7. The method of claim 6, wherein the determining the first sample temporal feature value corresponding to each frame in the sample audio based on the first time difference corresponding to each frame in the sample audio comprises:
for each frame, if the first time difference corresponding to the frame is greater than the preset time difference, determining the first sample time characteristic value corresponding to the frame as 0;
if the first time difference corresponding to the frame is less than or equal to the preset time difference, the first sample time characteristic value corresponding to the frame is determined by the following method:
g(Δonset) = 1 - α|Δonset|
where Δonset is the first time difference corresponding to the frame, α is the normalization coefficient, and g(Δonset) is the first sample time characteristic value corresponding to the frame.
8. The method of claim 6, wherein the determining the second sample temporal feature value for each frame in the sample audio based on the second time difference for each frame in the sample audio comprises:
for each frame, if the second time difference corresponding to the frame is greater than the preset time difference, determining the second sample time characteristic value corresponding to the frame as 0;
if the second time difference corresponding to the frame is less than or equal to the preset time difference, the second sample time characteristic value corresponding to the frame is determined by the following method:
g(Δoffset) = 1 - α|Δoffset|
where Δoffset is the second time difference corresponding to the frame, α is the normalization coefficient, and g(Δoffset) is the second sample time characteristic value corresponding to the frame.
9. A music transcription method, the method comprising:
acquiring audio to be processed, and determining an audio characteristic vector corresponding to the audio to be processed;
inputting the audio characteristic vector of the audio to be processed into a music transcription model, and obtaining a music score corresponding to the audio to be processed based on an output result of the music transcription model;
wherein the music transcription model is trained by the method of any one of claims 1 to 8.
10. A training apparatus for a music transcription model, the training apparatus comprising:
a training data obtaining module, configured to obtain training data, where each training sample in the training data includes an audio feature vector of a sample audio, a sample score corresponding to the sample audio, and a first sample time feature value and a second sample time feature value corresponding to each frame in the sample audio, where for any frame, the first sample time feature value represents a time difference between a middle time point of the frame and a nearest note start time point of the frame, and the second sample time feature value represents a time difference between the middle time point of the frame and a nearest note end time point of the frame;
the training module is used for training the initial neural network model based on the training data until the model converges to the corresponding total loss function, and determining the model after the training is finished as a music transcription model;
the input of the model is an audio feature vector of the sample audio, and the output of the model comprises the first prediction time feature value, the second prediction time feature value and the prediction music score corresponding to each frame in the sample audio;
the total loss function includes a first training loss function whose values characterize a difference between a sample score corresponding to the sample audio and the predicted score, a second training loss function whose values characterize a difference between a first sample temporal feature value corresponding to the sample audio and the first predicted temporal feature value, and a third training loss function whose values characterize a difference between a second sample temporal feature value corresponding to the sample audio and the second predicted temporal feature value.
11. A music transcription device, characterized in that the music transcription device comprises:
a to-be-processed audio acquisition module, wherein the to-be-processed audio acquisition module is used for acquiring audio to be processed and determining an audio characteristic vector corresponding to the audio to be processed;
the transcription module is used for inputting the audio characteristic vector of the audio to be processed into a music transcription model and obtaining a music score corresponding to the audio to be processed based on an output result of the music transcription model;
wherein the music transcription model is obtained by training through the method of any one of claims 1 to 8.
12. An electronic device comprising a processor and a memory, the processor and the memory being interconnected;
the memory is used for storing a computer program;
the processor is configured to perform the method of any one of claims 1 to 8 or to perform the method of claim 9 when the computer program is invoked.
13. A computer-readable medium, characterized in that the computer-readable medium stores a computer program which is executed by a processor to implement the method of any one of claims 1 to 8 or to implement the method of claim 9.
CN202010779114.6A 2020-08-05 Training method of music transcription model, music transcription method and corresponding device Active CN111898753B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010779114.6A CN111898753B (en) 2020-08-05 Training method of music transcription model, music transcription method and corresponding device

Publications (2)

Publication Number Publication Date
CN111898753A true CN111898753A (en) 2020-11-06
CN111898753B CN111898753B (en) 2024-07-02

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101652807A (en) * 2007-02-01 2010-02-17 缪斯亚米有限公司 Music transcription
US20110075851A1 (en) * 2009-09-28 2011-03-31 Leboeuf Jay Automatic labeling and control of audio algorithms by audio recognition
US20170243571A1 (en) * 2016-02-18 2017-08-24 University Of Rochester Context-dependent piano music transcription with convolutional sparse coding
CN110008372A (en) * 2019-02-22 2019-07-12 北京奇艺世纪科技有限公司 Model generating method, audio-frequency processing method, device, terminal and storage medium
CN111261147A (en) * 2020-01-20 2020-06-09 浙江工业大学 Music embedding attack defense method facing voice recognition system
CN111369982A (en) * 2020-03-13 2020-07-03 北京远鉴信息技术有限公司 Training method of audio classification model, audio classification method, device and equipment
CN111429940A (en) * 2020-06-15 2020-07-17 杭州贝哆蜂智能有限公司 Real-time music transcription and music score matching method based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
黄志清; 贾翔; 郭一帆; 张菁: "End-to-end music score note recognition based on deep learning" (基于深度学习的端到端乐谱音符识别), 天津大学学报(自然科学与工程技术版) [Journal of Tianjin University (Science and Technology)], no. 06 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634841A (en) * 2020-12-02 2021-04-09 爱荔枝科技(北京)有限公司 Guitar music automatic generation method based on voice recognition
CN112634841B (en) * 2020-12-02 2022-11-29 爱荔枝科技(北京)有限公司 Guitar music automatic generation method based on voice recognition
CN112669796A (en) * 2020-12-29 2021-04-16 西交利物浦大学 Method and device for converting music into music book based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant