CN111898753B - Training method of music transcription model, music transcription method and corresponding device


Info

Publication number
CN111898753B
Authority
CN
China
Prior art keywords
sample
audio
frame
training
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010779114.6A
Other languages
Chinese (zh)
Other versions
CN111898753A (en)
Inventor
孔秋强
王雨轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ByteDance Inc
Original Assignee
ByteDance Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ByteDance Inc filed Critical ByteDance Inc
Priority to CN202010779114.6A priority Critical patent/CN111898753B/en
Publication of CN111898753A publication Critical patent/CN111898753A/en
Application granted granted Critical
Publication of CN111898753B publication Critical patent/CN111898753B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

Embodiments of the disclosure disclose a training method of a music transcription model, a music transcription method, and corresponding devices. The method includes: acquiring training data, where each training sample in the training data includes an audio feature vector of sample audio, a sample music score corresponding to the sample audio, and a first sample time feature value and a second sample time feature value corresponding to each frame in the sample audio; and training an initial neural network model based on the training data until the corresponding total loss function converges, and determining the model at the end of training as the music transcription model. The input of the model is the audio feature vector of the sample audio, and the output of the model includes a first predicted time feature value, a second predicted time feature value, and a predicted music score corresponding to each frame in the sample audio. The training method provided by the embodiments of the disclosure can improve the accuracy of music transcription, so that the transcribed score is closer to the actual musical expression of the audio, and the applicability is high.

Description

Training method of music transcription model, music transcription method and corresponding device
Technical Field
The disclosure relates to the field of computer technology, and in particular, to a training method of a music transcription model, a music transcription method and a corresponding device.
Background
Automatic Music Transcription (AMT) translates raw music audio into symbolic notation, mainly including the start time, the end time, and the like of each note in the music audio, and has wide application in music teaching, music appreciation, music theory analysis, and so on.
However, conventional music transcription methods mainly perform transcription by predicting whether a note is present in each audio frame, and their accuracy is low. Because a musical piece contains a large number of notes and varied melodies, the correspondence between frames and notes often deviates in conventional music transcription methods, so the resulting score often differs from the actual musical expression of the music audio.
Therefore, how to further improve the accuracy of music transcription is a problem to be solved.
Disclosure of Invention
The disclosed embodiments provide a training method of a music transcription model, a music transcription method, and corresponding apparatuses. This summary is provided to introduce concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, an embodiment of the present disclosure provides a method for training a music transcription model, the method including:
Acquiring training data, wherein each training sample in the training data comprises an audio feature vector of sample audio, a sample music score corresponding to the sample audio, and a first sample time feature value and a second sample time feature value corresponding to each frame in the sample audio, wherein for any frame, the first sample time feature value represents a time difference between a middle time point of the frame and a nearest note starting time point of the frame, and the second sample time feature value represents a time difference between the middle time point of the frame and a nearest note ending time point of the frame;
Training an initial neural network model based on the training data until the total loss function corresponding to the model converges, and determining the model at the end of training as the music transcription model;
The input of the model is an audio feature vector of the sample audio, and the output of the model comprises the first prediction time feature value, the second prediction time feature value and the prediction music score corresponding to each frame in the sample audio;
The total loss function includes a first training loss function, a second training loss function, and a third training loss function, wherein a value of the first training loss function characterizes a difference between a sample score corresponding to the sample audio and the predicted score, a value of the second training loss function characterizes a difference between a first sample time feature value corresponding to the sample audio and the first predicted time feature value, and a value of the third training loss function characterizes a difference between a second sample time feature value corresponding to the sample audio and the second predicted time feature value.
In a second aspect, embodiments of the present disclosure provide a music transcription method, the method comprising:
acquiring audio to be processed, and determining an audio feature vector corresponding to the audio to be processed;
Inputting the audio feature vector of the audio to be processed into a music transcription model, and obtaining a music score corresponding to the audio to be processed based on an output result of the music transcription model;
The music transcription model is obtained through training by the training method of the music transcription model provided by the embodiment of the disclosure.
In a third aspect, an embodiment of the present disclosure provides a training apparatus for a music transcription model, the apparatus including:
the training data acquisition module is used for acquiring training data, each training sample in the training data comprises an audio feature vector of sample audio, a sample music score corresponding to the sample audio, and a first sample time feature value and a second sample time feature value corresponding to each frame in the sample audio, wherein for any frame, the first sample time feature value represents a time difference between a middle time point of the frame and a nearest note starting time point of the frame, and the second sample time feature value represents a time difference between the middle time point of the frame and a nearest note ending time point of the frame;
the training module is used for training the initial neural network model based on the training data until the total loss function corresponding to the model converges, and determining the model at the end of training as the music transcription model;
The input of the model is an audio feature vector of the sample audio, and the output of the model comprises the first prediction time feature value, the second prediction time feature value and the prediction music score corresponding to each frame in the sample audio;
The total loss function includes a first training loss function, a second training loss function, and a third training loss function, wherein a value of the first training loss function characterizes a difference between a sample score corresponding to the sample audio and the predicted score, a value of the second training loss function characterizes a difference between a first sample time feature value corresponding to the sample audio and the first predicted time feature value, and a value of the third training loss function characterizes a difference between a second sample time feature value corresponding to the sample audio and the second predicted time feature value.
In a fourth aspect, embodiments of the present disclosure provide a music transcription apparatus, the apparatus including:
the audio to be processed obtaining module is used for obtaining the audio to be processed and determining the audio feature vector corresponding to the audio to be processed;
The transcription module is used for inputting the audio feature vector of the audio to be processed into a music transcription model, and obtaining a music score corresponding to the audio to be processed based on the output result of the music transcription model;
The music transcription model is obtained through training by the training device of the music transcription model.
In a fifth aspect, embodiments of the present disclosure provide an electronic device including a processor and a memory, the processor and the memory being interconnected;
the memory is used for storing a computer program;
The processor is configured to execute the training method of the music transcription model provided by the embodiment of the present disclosure and/or the music transcription method provided by the embodiment of the present disclosure when the computer program is invoked.
In a sixth aspect, the disclosed embodiments provide a computer readable medium storing a computer program that is executed by a processor to implement the training method of the music transcription model provided by the disclosed embodiments and/or the music transcription method provided by the disclosed embodiments.
In the embodiments of the disclosure, through the audio feature vector of the sample audio, the sample volume, the music score corresponding to the sample audio in each training sample of the training data, and the time feature values corresponding to each frame in the sample audio, the initial neural network model can be comprehensively trained with respect to the note start time point, the note end time point, the volume, the overall score, and so on, and the model is further corrected during training by combining the loss functions corresponding to these factors, thereby improving the transcription accuracy of the music transcription model obtained after training. On the other hand, the embodiments of the disclosure can also transcribe audio to be processed based on the music transcription model to obtain the corresponding score, so the applicability is high.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
FIG. 1 is a flow chart of a training method of a music transcription model provided by an embodiment of the present disclosure;
FIG. 2 is a flow chart of a method for determining a sample time feature value provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a distribution of note onset time points provided by embodiments of the present disclosure;
FIG. 4 is a schematic illustration of a scenario for determining a first sample time feature value provided by an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of an initial neural network model provided by an embodiment of the present disclosure;
FIG. 6 is another schematic structural view of an initial neural network model provided by an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of yet another structure of an initial neural network model provided by an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a model test method provided by an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a training apparatus for a music transcription model provided by an embodiment of the present disclosure;
Fig. 10 is a schematic structural view of a music transcription device provided in an embodiment of the present disclosure;
Fig. 11 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and completely. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are used merely to distinguish one device, module, or unit from another device, module, or unit, and are not intended to limit the order or interdependence of the functions performed by the devices, modules, or units.
It should be noted that references to "a", "an", and "a plurality of" in this disclosure are intended to be illustrative rather than limiting; those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
Referring to fig. 1, fig. 1 is a flowchart of a training method of a music transcription model according to an embodiment of the disclosure. As shown in fig. 1, the training method for music transcription provided in the embodiment of the present disclosure may include the following steps:
S11, training data are acquired.
Specifically, the training data in the present disclosure includes a plurality of training samples (may also be referred to as training targets), where each training sample includes an audio feature vector of sample audio, a sample score corresponding to the sample audio, and a first sample time feature value and a second sample time feature value corresponding to each frame in the sample audio.
Because the sample audio is a time-domain signal and cannot directly participate in training of the music transcription model, the sample audio included in each training sample can be converted into an audio feature vector. That is, in the training of the music transcription model, the input of the model is the audio feature vector of each sample audio. Alternatively, the audio feature vector of the sample audio may be the logarithmic mel-spectral feature of the audio. Specifically, the sample audio may be converted from the time domain to the frequency domain, the logarithmic mel spectrum feature of each frame of the sample audio may be extracted, and the audio feature vector of the sample audio may be obtained based on the logarithmic mel spectrum feature of each frame. The logarithmic mel spectrogram X may be represented by a T×F matrix, where T represents the total number of frames of the logarithmic mel spectrogram and F represents the number of frequency bins of the logarithmic mel spectrogram.
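As an illustration of this feature extraction step, the following is a minimal sketch assuming the librosa library is used; the sample rate, FFT size, hop length, and number of mel bins are hypothetical values and not parameters fixed by the disclosure:

```python
import numpy as np
import librosa

def log_mel_feature(audio_path, sr=16000, n_fft=2048, hop_length=160, n_mels=229):
    """Convert a time-domain audio file into a T x F log-mel feature matrix."""
    y, sr = librosa.load(audio_path, sr=sr, mono=True)
    # Short-time Fourier transform followed by a mel filter bank.
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    # Logarithmic compression; result has shape (n_mels, T), transpose to (T, F).
    log_mel = librosa.power_to_db(mel, ref=np.max)
    return log_mel.T  # T frames x F frequency bins
```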
The sample audio in the present disclosure may be obtained in a variety of manners, which may be determined based on actual application scene requirements and are not limited herein; for example, sample audio may be obtained by recording music, or acquired via data transmission from other devices, networks, or storage spaces such as databases and data carriers. Likewise, the music content of the sample audio includes, but is not limited to, piano performances, guitar performances, etc., which may be determined based on actual application scene requirements and is not limited herein.
Wherein, for any frame in any sample audio, the first sample time feature value corresponding to the frame characterizes a time difference between the middle time point of the frame and the nearest note start time point (onset) of the frame, and the second sample time feature value corresponding to the frame characterizes a time difference between the middle time point of the frame and the nearest note end time point (offset) of the frame.
Alternatively, the determination of the first sample time feature value and the second sample time feature value corresponding to each frame of the sample audio may be described with reference to fig. 2. Fig. 2 is a flowchart illustrating a method for determining a sample time feature value according to an embodiment of the present disclosure. As shown in fig. 2, the method for determining a sample time feature value provided in the embodiment of the present disclosure includes the following steps:
S111, carrying out framing processing on the sample audio to obtain frames of the sample audio, and determining a middle time point of each frame of the sample audio.
Specifically, the frame processing can be performed on the sample audio based on the window function to obtain each frame of the sample audio, so that the middle time point of each frame of the sample audio is determined, and the first sample time characteristic value and the second sample time characteristic value corresponding to each frame of the sample audio are determined on the basis of the middle time point of each frame of the sample audio. The window functions include, but are not limited to, rectangular window, hanning window, hamming window, etc., which can be specifically determined based on the actual application scene requirements, and are not limited herein.
S112, acquiring a note start time point and a note end time point of each note contained in the sample audio, and determining a target note start time point and a target note end time point which are closest to the frame time for each frame.
Specifically, the sample audio is composed of different notes: the note start time points, note end time points, and durations of the different notes make up the sample audio. Thus, after determining the middle time point of each frame of the sample audio, the note start time point and note end time point of each note in the sample audio can be obtained. Further, for each frame in each sample audio, the target note start time point and target note end time point closest to the frame can be determined, so that the corresponding first sample time feature value and second sample time feature value are determined based on the target note start time point and target note end time point corresponding to each frame.
As an example, fig. 3 is a schematic diagram of the distribution of note start time points provided by an embodiment of the present disclosure, which shows 6 frames of one sample audio. In fig. 3, it is assumed that there are 3 note start time points within the 6 frames of the sample audio, distributed as shown in fig. 3. For the 2nd frame, the note start time point closest to it is time point 1, that is, time point 1 is the target start time point corresponding to the 2nd frame; for the 3rd frame, the closest note start time point is time point 2, that is, time point 2 is the target start time point corresponding to the 3rd frame; for the 4th frame, the closest note start time point is time point 2, that is, time point 2 is the target start time point corresponding to the 4th frame. Based on the above method, the target note start time point corresponding to each frame can be determined, and similarly, the target note end time point corresponding to each frame can be determined. It should be noted that the middle time point of each frame, the note start time point of a note, and the note end time point of a note are relative time points, that is, they are used to represent time positions within the sample audio.
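The nearest target note start (or end) time point for each frame could, for example, be located as in the following sketch; the function name and the use of numpy are illustrative assumptions, not part of the disclosure:

```python
import numpy as np

def nearest_event_times(frame_mid_times, event_times):
    """For each frame middle time, return the closest event time (note start or end).

    frame_mid_times: 1-D array of the middle time point of each frame, in seconds.
    event_times:     1-D array of note start (or end) time points, in seconds.
    """
    frame_mid_times = np.asarray(frame_mid_times)
    event_times = np.sort(np.asarray(event_times))
    # Index of the first event that is >= the frame middle time.
    idx = np.searchsorted(event_times, frame_mid_times)
    left = np.clip(idx - 1, 0, len(event_times) - 1)
    right = np.clip(idx, 0, len(event_times) - 1)
    # Pick whichever neighbour is closer in time.
    pick_left = np.abs(frame_mid_times - event_times[left]) <= \
                np.abs(event_times[right] - frame_mid_times)
    return np.where(pick_left, event_times[left], event_times[right])

# Example: 6 frames with a 10 ms hop, note start time points at 0.012 s and 0.041 s.
frame_mid = np.arange(6) * 0.010 + 0.005
print(nearest_event_times(frame_mid, [0.012, 0.041]))
```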
S113, determining a first sample time characteristic value and a second sample time characteristic value corresponding to each frame of the sample audio based on the middle time point of each frame of the sample audio, and the target note starting time point and the target note ending time point corresponding to each frame.
Specifically, for each frame in each sample audio, a time difference between the middle time point of the frame and the corresponding target note start time point (hereinafter referred to as a first time difference for convenience of description) and a time difference between the middle time point and the corresponding target note end time point (hereinafter referred to as a second time difference for convenience of description) may be determined.
Taking the first time difference corresponding to a frame as an example: when the first time difference corresponding to the frame is greater than a preset time difference, the first sample time feature value corresponding to the frame is determined as 0; when the first time difference corresponding to the frame is less than or equal to the preset time difference, the first sample time feature value corresponding to the frame is determined by g(Δonset) = 1 - α|Δonset|, where Δonset is the first time difference corresponding to the frame, α is a normalization coefficient, and g(Δonset) is the first sample time feature value corresponding to the frame.
In other words, denoting the preset time difference by ΔA and the frame index by i, the first sample time feature value g(Δi) corresponding to the i-th frame is g(Δi) = 1 - α|Δi| if |Δi| ≤ ΔA, and g(Δi) = 0 otherwise.
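A minimal sketch of how the first (or, analogously, the second) sample time feature value could be computed from the per-frame time differences is given below; the values chosen for the preset time difference ΔA and the normalization coefficient α are hypothetical:

```python
import numpy as np

def time_feature_values(time_diffs, delta_a=0.05, alpha=10.0):
    """Map each frame's time difference to a sample time feature value.

    g(delta) = 1 - alpha * |delta| when |delta| <= delta_a, and 0 otherwise.
    delta_a (preset time difference, seconds) and alpha (normalization
    coefficient) are illustrative values only.
    """
    d = np.abs(np.asarray(time_diffs))
    g = 1.0 - alpha * d
    return np.where(d <= delta_a, g, 0.0)
```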
Referring to fig. 4, fig. 4 is a schematic view of a scenario for determining a first sample time feature value provided by an embodiment of the present disclosure. As shown in fig. 4, it is assumed that only one note start time point exists in the sample audio, and this note start time point is the target note start time point corresponding to all frames. As can be readily seen in fig. 4, the first time differences corresponding to the 1st to 5th frames (Δ1, Δ2, Δ3, Δ4 and Δ5, respectively) are smaller than the preset time difference ΔA, so the corresponding first sample time feature values are g(Δ1), g(Δ2), g(Δ3), g(Δ4) and g(Δ5), respectively, by combining the above formula. For the 6th and 7th frames, the corresponding first time differences are obviously larger than the preset time difference ΔA, so the first sample time feature values corresponding to the 6th and 7th frames can be determined as 0 based on the above formula.
Similarly, based on the implementation manner, the second sample time feature value corresponding to each frame in the sample audio can be determined based on the second time difference corresponding to each frame in the sample audio, which is not described herein.
And S12, training the initial neural network model based on the training data until the total loss function corresponding to the model converges, and determining the model at the end of training as the music transcription model.
Specifically, when the initial neural network model is trained based on training data, an audio feature vector of sample audio of each training sample can be used as an input of the model, and output of the model is a first prediction time feature value, a second prediction time feature value and a prediction score corresponding to each frame in the sample audio.
The model architecture of the initial neural network model is not limited in the embodiments of the present disclosure; the initial neural network model includes, but is not limited to, network models based on neural networks such as recurrent neural networks (Recurrent Neural Networks, RNN), long short-term memory networks (Long Short-Term Memory, LSTM), and gated recurrent units (Gated Recurrent Unit, GRU), which may be determined based on actual application scene requirements and are not limited herein.
Wherein, for each frame, the corresponding first predicted time feature value characterizes the predicted time difference between the middle time point of the frame and the nearest note start time point of the frame, and the corresponding second predicted time feature value characterizes the predicted time difference between the middle time point of the frame and the nearest note end time point of the frame.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an initial neural network model according to an embodiment of the present disclosure. The initial neural network model shown in fig. 5 is mainly composed of each initial neural network (hidden layer), and in the process of training the initial neural network model based on training data, after training the initial neural network model based on training samples of the training data, a first prediction time feature value, a second prediction time feature value and a prediction score corresponding to each frame of sample audio of the training samples can be output.
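For illustration only, the following PyTorch sketch shows one possible shape of such an initial neural network model with three frame-wise outputs (first predicted time feature value, second predicted time feature value, and predicted score); the layer types and sizes are assumptions and do not reproduce the exact structure of fig. 5:

```python
import torch
import torch.nn as nn

class TranscriptionModel(nn.Module):
    """Sketch of an initial neural network model with three frame-wise outputs:
    onset regression, offset regression, and the predicted score."""

    def __init__(self, n_mels=229, hidden=256, n_notes=88):
        super().__init__()
        self.gru = nn.GRU(input_size=n_mels, hidden_size=hidden,
                          num_layers=2, batch_first=True, bidirectional=True)
        self.onset_head = nn.Linear(2 * hidden, n_notes)   # first predicted time feature value
        self.offset_head = nn.Linear(2 * hidden, n_notes)  # second predicted time feature value
        self.frame_head = nn.Linear(2 * hidden, n_notes)   # predicted score (note presence per frame)

    def forward(self, x):
        # x: (batch, T, n_mels) audio feature vectors (e.g. log-mel frames)
        h, _ = self.gru(x)
        onset = torch.sigmoid(self.onset_head(h))
        offset = torch.sigmoid(self.offset_head(h))
        frame = torch.sigmoid(self.frame_head(h))
        return onset, offset, frame
```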
Based on each training sample and the total loss function of the model, the model is continuously and iteratively trained until the total loss function corresponding to the model converges. When the total loss function corresponding to the model converges, the audio transcription capability and accuracy of the model tend to be stable, and the model at the end of training can be determined as the music transcription model.
The total loss function l_total corresponding to the model includes a first training loss function l_frame, a second training loss function l_onset, and a third training loss function l_offset, i.e., l_total = l_frame + l_onset + l_offset.
The first training loss function is: l_frame = Σ_t Σ_k l_bce(I_frame(t,k), P_frame(t,k)), where I_frame(t,k) represents the sample score corresponding to the sample audio, P_frame(t,k) represents the predicted score output by the model, and l_bce(I_frame(t,k), P_frame(t,k)) represents the difference between the sample score and the predicted score at the same frame, which may be represented by the corresponding cross entropy loss function, or determined by other loss functions, without limitation.
The second training loss function is: l_onset = Σ_t Σ_k l_bce(G_onset(t,k), R_onset(t,k)), where G_onset(t,k) represents the first sample time feature value corresponding to each frame of the sample audio, R_onset(t,k) represents the first predicted time feature value corresponding to each frame output by the model, and l_bce(G_onset(t,k), R_onset(t,k)) represents the difference between the first sample time feature value and the first predicted time feature value of each frame of the sample audio, which may be represented by the corresponding cross entropy loss function, or determined by other loss functions, without limitation.
The third training loss function is: l_offset = Σ_t Σ_k l_bce(G_offset(t,k), R_offset(t,k)), where G_offset(t,k) represents the second sample time feature value corresponding to each frame of the sample audio, R_offset(t,k) represents the second predicted time feature value corresponding to each frame output by the model, and l_bce(G_offset(t,k), R_offset(t,k)) represents the difference between the second sample time feature value and the second predicted time feature value of each frame of the sample audio, which may be represented by the corresponding cross entropy loss function, or determined by other loss functions, without limitation.
For one sample audio, T represents the total number of frames of the sample audio, t represents the t-th frame in the sample audio, K represents the number of total notes in the music field to which the sample audio belongs (for example, for piano audio, the number of total notes is the number of notes corresponding to the 88 keys of the piano, i.e., K = 88), and k represents the k-th note among the K notes. The cross entropy loss function may be determined as follows: l_bce = -y·ln(p) - (1-y)·ln(1-p). For the first training loss function, y indicates whether a certain frame in the sample audio contains a certain note (1 if it does, 0 if it does not), p represents the predicted probability that the frame contains the note, and 1-p represents the probability that it does not. For another example, for the second training loss function, y represents the first sample time feature value of a frame in the sample audio, and p represents the first predicted time feature value corresponding to that frame.
Wherein the value of the first training loss function characterizes a difference between the sample score corresponding to the sample audio and the predicted score output by the model.
The value of the second training loss function characterizes the difference between the first sample time feature value corresponding to the sample audio and the first predicted time feature value output by the model, that is, the difference between the degree to which each frame of the sample audio actually corresponds to a note start time point and the degree predicted by the model.
The value of the third training loss function characterizes the difference between the second sample time feature value corresponding to the sample audio and the second predicted time feature value output by the model, that is, the difference between the degree to which each frame of the sample audio actually corresponds to a note end time point and the degree predicted by the model.
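Under the assumption that the predictions and targets are arranged as (batch, T, K) tensors, the total loss l_total = l_frame + l_onset + l_offset could be assembled as in the following PyTorch sketch (PyTorch averages the binary cross entropy by default, whereas the formulas above sum over t and k):

```python
import torch.nn.functional as F

def total_loss(pred_frame, pred_onset, pred_offset,
               target_frame, target_onset, target_offset):
    """All tensors are assumed to have shape (batch, T, K); targets lie in [0, 1]."""
    l_frame = F.binary_cross_entropy(pred_frame, target_frame)
    l_onset = F.binary_cross_entropy(pred_onset, target_onset)
    l_offset = F.binary_cross_entropy(pred_offset, target_offset)
    # l_total = l_frame + l_onset + l_offset
    return l_frame + l_onset + l_offset
```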
Optionally, since the note start time point of the sample audio is affected by the volume (Velocity) of the sample audio to some extent, so that the first sample time feature value of each frame of the sample audio is affected by the volume, in order to further improve the stability and accuracy of the final music transcription model, the influence caused by the volume of the sample audio can be considered in the training process of the initial neural network model. The volume of the sample audio can reflect the playing speed of each note in the sample audio.
With reference to fig. 6, fig. 6 is another schematic structural diagram of an initial neural network model provided by an embodiment of the present disclosure. In this case, as shown in fig. 6, each training sample further includes a sample volume corresponding to the sample audio, the input of the model may further include the sample volume of the sample audio, and the output of the model during each training round further includes a predicted volume corresponding to the sample audio. The hidden layers in the initial neural network model are mainly composed of a neural network.
Further, in this case, the total loss function l total corresponding to the model training includes, in addition to the first training loss function, the second training loss function, and the third training loss function, a fourth training loss function for characterizing a difference between the sample volume of the sample audio and the predicted volume corresponding to the sample audio.
Specifically, in determining the fourth training loss function, a first loss l_bce(I_vel(t,k), P_vel(t,k)) corresponding to each frame of the sample audio may be determined based on the sample volume and the predicted volume corresponding to each frame of the sample audio. Here, I_vel(t,k) represents the sample volume corresponding to each frame of the sample audio, P_vel(t,k) represents the predicted volume corresponding to each frame of the sample audio predicted by the model, and the first loss l_bce(I_vel(t,k), P_vel(t,k)) represents the difference between the sample volume and the predicted volume of the same frame of the sample audio, which may likewise be represented by the corresponding cross entropy loss function, or determined by other loss functions, without limitation.
Further, a second loss I_onset(t,k)·l_bce(I_vel(t,k), P_vel(t,k)) for each frame of the sample audio may be determined based on the note characterization value I_onset(t,k) of each frame of the sample audio and the first loss of that frame. The note characterization value I_onset(t,k) indicates whether a note start time point exists in each frame of the sample audio (the value is 1 if it does and 0 if it does not).
Based on the above implementation manner, the fourth training loss function in the process of training the model based on the training data can be obtained as l_vel = Σ_t Σ_k I_onset(t,k)·l_bce(I_vel(t,k), P_vel(t,k)). In this case, the total loss function corresponding to model training is l_total = l_frame + l_onset + l_offset + l_vel.
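A hedged sketch of this fourth training loss is given below; velocities are assumed to be normalized to [0, 1], and the normalization by the number of active onsets is an illustrative choice rather than part of the disclosure:

```python
import torch
import torch.nn.functional as F

def velocity_loss(pred_vel, target_vel, onset_mask):
    """Fourth training loss: the velocity error is counted only for (frame, note)
    positions whose note characterization value I_onset(t, k) is 1.

    pred_vel, target_vel, onset_mask: float tensors of shape (batch, T, K),
    following l_vel = sum_t sum_k I_onset(t,k) * l_bce(I_vel(t,k), P_vel(t,k)).
    """
    per_element = F.binary_cross_entropy(pred_vel, target_vel, reduction='none')
    masked = onset_mask * per_element
    # Normalize by the number of active onsets to keep the scale comparable.
    return masked.sum() / onset_mask.sum().clamp(min=1)
```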
As an alternative, fig. 7 shows a schematic diagram of yet another structure of the initial neural network model provided by the present disclosure. As shown in fig. 7, the initial neural network model is formed by GRU-based neural networks; the input of the model may be the logarithmic mel spectrogram of the sample audio, and the sample feature vectors of the sample audio are obtained through the convolution layers. The model obtains the predicted volume and the second predicted time feature value corresponding to the sample audio through separate, independent GRU neural networks. A preliminary first predicted time feature value is obtained by an independent GRU neural network, and the first predicted time feature value corresponding to each frame of the sample audio is then obtained through a further GRU neural network based on this preliminary value and the output corresponding to the predicted volume. On the other hand, the initial neural network model can integrate the first predicted time feature value, the second predicted time feature value, and the sample feature vector of the sample audio to obtain the predicted score corresponding to the sample audio.
Optionally, to further ensure the transcription accuracy of the music transcription model obtained by training, when the total loss function corresponding to the model converges, the model at the end of training may be further tested, and the model at the end of training is determined as the music transcription model only when the test result meets a preset test condition.
Specifically, test data may be obtained, where each test audio of the test data includes a test score corresponding to the test audio, and a first test time feature value and a second test time feature value corresponding to each frame. Wherein the first test time feature value characterizes a time difference between an intermediate time point of the corresponding frame and a nearest note start time point thereof, and the second test time feature value characterizes a time difference between the intermediate time point of the corresponding frame and a nearest note end time point thereof.
Furthermore, the distribution of the note start time points and note end time points of the notes in the test audio can be accurately obtained based on the first test time feature value and the second test time feature value corresponding to each frame of the test audio. Specifically, a frame interval in which the first test time feature value reaches a local maximum can be determined from the first test time feature values corresponding to the frames of the test audio. Such a frame interval contains 3 frames whose first test time feature values are not 0, and the first test time feature value of the middle frame is the maximum within the interval. That is, a note start time point lies within the frame interval, and the precise position of the note start time point is determined based on the first test time feature values of the 3 frames in the interval. With reference to fig. 8, fig. 8 is a schematic diagram of a model test method provided in an embodiment of the disclosure. As shown in fig. 8, assume that the abscissas of points A, B and C correspond to the middle time points of the three frames, and their ordinates correspond to the respective first test time feature values. Line AB may be extended to a point D such that DC is perpendicular to the abscissa axis; the midpoint E of DC is then determined, and a line EF parallel to the abscissa axis intersects AD at a point G. The difference between the abscissas of B and G is the time difference between the note start time point and the middle time point of the frame corresponding to B. When the first test time feature value corresponding to A is greater than that corresponding to C (both being smaller than that corresponding to B), the note start time point is obtained by the same construction on the corresponding side of B. In this way, an accurate distribution of the note start time points of the notes in the test audio can be obtained.
Similarly, based on the second test time feature value corresponding to each frame in the test audio, the accurate distribution of the note ending time points of each note in the test audio can be determined, which is not described herein.
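At test time, the frame intervals containing a local maximum of the first (or second) test time feature value could be located as in the following sketch; the sub-frame refinement of fig. 8 is deliberately not reproduced here, and the threshold is a hypothetical parameter:

```python
import numpy as np

def detect_onset_frames(onset_values, threshold=0.0):
    """Return indices of frames whose first test time feature value is a local
    maximum with non-zero neighbours, i.e. the 3-frame intervals that contain
    a note start time point. The sub-frame refinement illustrated in fig. 8
    (using points A, B, C) is not reproduced here."""
    v = np.asarray(onset_values)
    peaks = []
    for i in range(1, len(v) - 1):
        if v[i] > threshold and v[i - 1] > threshold and v[i + 1] > threshold \
                and v[i] >= v[i - 1] and v[i] >= v[i + 1]:
            peaks.append(i)
    return peaks
```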
On the other hand, the test audio is transcribed based on the model at the end of training to obtain a predicted score of the test audio, and the distribution of the note start time points and note end time points of the notes in the predicted score is then determined. When the note start time points and note end time points of the notes in the predicted score satisfy the preset test condition, the model at the end of training may be determined as the music transcription model.
The preset test conditions may be determined based on actual application scene requirements, and are not limited herein. For example, when the note start time point and the note end time point of each note in the test audio coincide with the note start time point and the note end time point thereof in the test score, respectively, or the error is within a preset range, it may be determined that the test result satisfies a preset test condition. For another example, when the note start time point and the note end time point of a certain proportion of notes in the test audio are respectively consistent with the note start time point and the note end time point of the notes in the test score or the error is within a preset range, the test result can be determined to meet the preset test condition.
When the test result does not meet the preset test condition, training can be continued on the model based on the training data until the test result meets the preset test condition, and the model at the end of training is determined to be a music transcription model.
In some possible implementations, the embodiments of the present disclosure may further obtain audio to be processed and determine the audio feature vector corresponding to the audio to be processed; for the specific determination manner, reference may be made to the related description of the sample audio in step S11 of fig. 1, which is not repeated here. Further, the embodiments of the present disclosure can input the audio feature vector of the audio to be processed into the music transcription model, so that the score corresponding to the audio to be processed is obtained based on the output result of the music transcription model, such as transcribing a piano accompaniment into a piano roll, or a guitar performance into a guitar score.
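An illustrative end-to-end usage sketch, reusing the hypothetical log_mel_feature() helper and TranscriptionModel class from the earlier sketches (neither is an API defined by the disclosure), and assuming a checkpoint file name:

```python
import torch

# Hypothetical end-to-end usage of a trained music transcription model.
features = log_mel_feature("to_be_processed.wav")            # (T, F) matrix
x = torch.from_numpy(features).float().unsqueeze(0)          # (1, T, F)

model = TranscriptionModel()
model.load_state_dict(torch.load("music_transcription_model.pt"))
model.eval()

with torch.no_grad():
    onset, offset, frame = model(x)
# 'frame' gives the per-frame, per-note activations from which the score
# (e.g. a piano roll) is assembled; 'onset'/'offset' refine note boundaries.
```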
In the embodiments of the disclosure, through the audio feature vector of the sample audio, the sample volume, the music score corresponding to the sample audio in each training sample of the training data, and the time feature values corresponding to each frame in the sample audio, the initial neural network model can be comprehensively trained with respect to the note start time point, the note end time point, the volume, the overall score, and so on, and the model is further corrected during training by combining the loss functions corresponding to these factors, thereby improving the transcription accuracy of the music transcription model obtained after training. On the other hand, the embodiments of the disclosure can also transcribe audio to be processed based on the music transcription model to obtain the corresponding score, so the applicability is high.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a training device for a music transcription model according to an embodiment of the present disclosure. The device 1 provided in the embodiment of the present disclosure includes:
a training data obtaining module 11, configured to obtain training data, where each training sample in the training data includes an audio feature vector of a sample audio, a sample score corresponding to the sample audio, and a first sample time feature value and a second sample time feature value corresponding to each frame in the sample audio, where, for any frame, the first sample time feature value represents a time difference between an intermediate time point of the frame and a nearest note start time point of the frame, and the second sample time feature value represents a time difference between the intermediate time point of the frame and a nearest note end time point of the frame;
The training module 12 is configured to train the initial neural network model based on the training data until the total loss function corresponding to the model converges, and determine the model at the end of training as the music transcription model;
The input of the model is an audio feature vector of the sample audio, and the output of the model comprises the first prediction time feature value, the second prediction time feature value and the prediction music score corresponding to each frame in the sample audio;
The total loss function includes a first training loss function, a second training loss function, and a third training loss function, wherein a value of the first training loss function characterizes a difference between a sample score corresponding to the sample audio and the predicted score, a value of the second training loss function characterizes a difference between a first sample time feature value corresponding to the sample audio and the first predicted time feature value, and a value of the third training loss function characterizes a difference between a second sample time feature value corresponding to the sample audio and the second predicted time feature value.
In some possible embodiments, each training sample further includes a sample volume corresponding to the sample audio, the input of the model further includes a sample volume of the sample audio, the output of the model further includes a corresponding predicted volume of the sample audio, and the total loss function further includes a fourth training loss function, a value of the fourth training loss function characterizing a difference between the sample volume of the sample audio and the predicted volume.
In some possible embodiments, each training sample further includes a note characterization value of each frame included in the sample audio, where the note characterization value characterizes whether a note start time point is included in the frame; the sample volume of the sample audio includes a sample volume of each frame included in the sample audio, and the predicted volume includes a predicted volume of each frame;
Wherein, the training module 12 is used for:
for the sample audio, calculating a first loss corresponding to each frame of the sample audio based on a sample volume and a predicted volume corresponding to each frame of the sample audio;
Determining a second loss corresponding to each frame of the sample audio based on the note characterization value and the corresponding first loss for each frame of the sample audio;
and obtaining the fourth training loss function based on the second loss corresponding to each frame of each sample audio.
In some possible embodiments, the first training loss function, the second training loss function, and the third training loss function are each cross entropy loss functions.
In some possible embodiments, the training data acquisition module 11 is further configured to:
Framing the sample audio to obtain frames of the sample audio, and determining the middle time point of each frame of the sample audio;
Determining a note start time point and a note end time point of each note included in the sample audio, and determining a target note start time point and a target note end time point closest to the frame time for each frame;
and determining a first sample time characteristic value and a second sample time characteristic value corresponding to each frame of the sample audio based on the target note starting time point and the target note ending time point corresponding to each frame of the sample audio.
In some possible embodiments, the training data acquisition module 11 is further configured to:
For each frame, determining a first time difference between a middle time point of the frame and a corresponding target note start time point, and a second time difference between the middle time point of the frame and a corresponding target note end time point;
Determining a first sample time characteristic value corresponding to each frame in the sample audio based on a first time difference corresponding to each frame in the sample audio;
And determining a second sample time characteristic value corresponding to each frame in the sample audio based on a second time difference corresponding to each frame in the sample audio.
In some possible embodiments, the training data acquisition module 11 is configured to:
For each frame, if the first time difference corresponding to the frame is larger than the preset time difference, determining the first sample time characteristic value corresponding to the frame as 0;
if the first time difference corresponding to the frame is smaller than or equal to the preset time difference, determining a first sample time characteristic value corresponding to the frame by the following method:
g(Δonset) = 1 - α|Δonset|
Wherein Δonset is the first time difference corresponding to the frame, α is the normalization coefficient, and g(Δonset) is the first sample time feature value corresponding to the frame.
In some possible embodiments, the training data acquisition module 11 is configured to:
for each frame, if the second time difference corresponding to the frame is greater than the preset time difference, determining the second sample time characteristic value corresponding to the frame as 0;
if the second time difference corresponding to the frame is smaller than or equal to the preset time difference, determining a second sample time characteristic value corresponding to the frame by the following method:
g(Δoffset) = 1 - α|Δoffset|
Wherein Δoffset is the second time difference corresponding to the frame, α is the normalization coefficient, and g(Δoffset) is the second sample time feature value corresponding to the frame.
Referring to fig. 10, fig. 10 is a schematic structural view of a music transcription device provided in an embodiment of the present disclosure. The device 2 provided in the embodiment of the present disclosure includes:
The audio to be processed obtaining module 21 is configured to obtain audio to be processed, and determine an audio feature vector corresponding to the audio to be processed;
A transcription module 22, configured to input the audio feature vector of the audio to be processed into a music transcription model, and obtain a score corresponding to the audio to be processed based on an output result of the music transcription model;
wherein, the music transcription model is obtained by training the training method of the music transcription model in fig. 1.
Referring now to fig. 11, a schematic diagram of an electronic device 600 suitable for use in implementing embodiments of the present disclosure is shown. The electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 11 is merely an example, and should not impose any limitations on the functionality and scope of use of embodiments of the present disclosure.
An electronic device includes: a memory and a processor, where the processor may be referred to as a processing device 601 described below, the memory may include at least one of a Read Only Memory (ROM) 602, a Random Access Memory (RAM) 603, and a storage device 608 described below, as follows:
As shown in fig. 11, the electronic device 600 may include a processing means (e.g., a central processing unit, a graphic processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
In general, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; output devices 607 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, and the like; storage devices 608 including, for example, magnetic tape, hard disk, etc.; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate with other electronic devices wirelessly or by wire to exchange data. While fig. 11 shows an electronic device 600 having various devices, it is to be understood that not all of the illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via communication means 609, or from storage means 608, or from ROM 602. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 601.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some embodiments, the clients and the servers may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire training data, wherein each training sample in the training data comprises an audio feature vector of sample audio, a sample music score corresponding to the sample audio, and a first sample time feature value and a second sample time feature value corresponding to each frame in the sample audio, wherein for any frame, the first sample time feature value represents a time difference between a middle time point of the frame and a nearest note starting time point of the frame, and the second sample time feature value represents a time difference between the middle time point of the frame and a nearest note ending time point of the frame; train the initial neural network model based on the training data until the model converges to the corresponding total loss function, and determine the model after training is finished as a music transcription model; the input of the model is an audio feature vector of the sample audio, and the output of the model comprises the first prediction time feature value, the second prediction time feature value and the prediction music score corresponding to each frame in the sample audio; the total loss function includes a first training loss function, a second training loss function, and a third training loss function, wherein a value of the first training loss function characterizes a difference between a sample score corresponding to the sample audio and the predicted score, a value of the second training loss function characterizes a difference between a first sample time feature value corresponding to the sample audio and the first predicted time feature value, and a value of the third training loss function characterizes a difference between a second sample time feature value corresponding to the sample audio and the second predicted time feature value.
Alternatively, the computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire audio to be processed, and determine an audio feature vector corresponding to the audio to be processed; input the audio feature vector of the audio to be processed into a music transcription model, and obtain a music score corresponding to the audio to be processed based on an output result of the music transcription model; the music transcription model is obtained through training by the training method of the music transcription model provided by the embodiments of the present disclosure.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules or units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. In some cases, the name of a module or unit does not constitute a limitation on the unit itself; for example, the training module may also be described as a "music transcription model training module".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or electronic device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or electronic device, or any suitable combination of the preceding. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In accordance with one or more embodiments of the present disclosure, example one provides a training method of a music transcription model, comprising:
Acquiring training data, wherein each training sample in the training data comprises an audio feature vector of sample audio, a sample music score corresponding to the sample audio, and a first sample time feature value and a second sample time feature value corresponding to each frame in the sample audio, wherein for any frame, the first sample time feature value represents a time difference between a middle time point of the frame and a nearest note starting time point of the frame, and the second sample time feature value represents a time difference between the middle time point of the frame and a nearest note ending time point of the frame;
Training the initial neural network model based on the training data until the model converges to the corresponding total loss function, and determining the model after training is finished as a music transcription model;
The input of the model is an audio feature vector of the sample audio, and the output of the model comprises the first prediction time feature value, the second prediction time feature value and the prediction music score corresponding to each frame in the sample audio;
The total loss function includes a first training loss function, a second training loss function, and a third training loss function, wherein a value of the first training loss function characterizes a difference between a sample score corresponding to the sample audio and the predicted score, a value of the second training loss function characterizes a difference between a first sample time feature value corresponding to the sample audio and the first predicted time feature value, and a value of the third training loss function characterizes a difference between a second sample time feature value corresponding to the sample audio and the second predicted time feature value.
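As an illustrative sketch only: the total loss above can be assembled as the sum of the three training losses. The Python code below assumes per-frame targets in [0, 1] with a (batch, frames, pitches) layout and uses binary cross-entropy as the concrete cross-entropy loss; these choices and the function names are assumptions, not details taken from the disclosure.

```python
import torch.nn.functional as F

def total_loss(pred_score, sample_score,
               pred_onset_value, sample_onset_value,
               pred_offset_value, sample_offset_value):
    """Sum of the first, second and third training losses.

    All tensors are assumed to be (batch, frames, pitches) with values in
    [0, 1]; binary cross-entropy stands in for the cross-entropy loss named
    in the embodiments.
    """
    first = F.binary_cross_entropy(pred_score, sample_score)                # sample score vs. predicted score
    second = F.binary_cross_entropy(pred_onset_value, sample_onset_value)   # first sample vs. first predicted time feature value
    third = F.binary_cross_entropy(pred_offset_value, sample_offset_value)  # second sample vs. second predicted time feature value
    return first + second + third
```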
In a possible implementation manner, each training sample further includes a sample volume corresponding to the sample audio, the input of the model further includes the sample volume of the sample audio, the output of the model further includes a corresponding predicted volume of the sample audio, and the total loss function further includes a fourth training loss function, and a value of the fourth training loss function characterizes a difference between the sample volume of the sample audio and the predicted volume.
In one possible implementation, each training sample further includes a note representation value of each frame included in the sample audio, where the note representation value represents whether a note start time point is included in a frame; the sample volume of the sample audio includes a sample volume of each frame included in the sample audio, and the predicted volume includes a predicted volume of each frame;
Wherein the fourth training loss function is obtained by:
for the sample audio, calculating a first loss corresponding to each frame of the sample audio based on a sample volume and a predicted volume corresponding to each frame of the sample audio;
Determining a second loss corresponding to each frame of the sample audio based on the note characterization value and the corresponding first loss for each frame of the sample audio;
and obtaining the fourth training loss function based on the second loss corresponding to each frame of each sample audio.
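A minimal sketch of how the fourth training loss function might be computed from the steps above, assuming the note characterization value is a 0/1 mask, that "determining the second loss based on the note characterization value and the first loss" simply means multiplying by that mask, and that binary cross-entropy and mean aggregation are used; all of these are assumptions.

```python
import torch.nn.functional as F

def fourth_training_loss(pred_volume, sample_volume, note_flag):
    """Sketch of the fourth training loss function.

    pred_volume, sample_volume: (batch, frames, pitches), volumes scaled to [0, 1].
    note_flag: (batch, frames, pitches), 1.0 where the frame contains a note
    start time point, 0.0 otherwise (the note characterization value).
    """
    # First loss: per-frame difference between the sample volume and the predicted volume.
    first_loss = F.binary_cross_entropy(pred_volume, sample_volume, reduction="none")
    # Second loss: gate the first loss with the note characterization value so that
    # only frames containing a note onset contribute (one plausible reading).
    second_loss = note_flag * first_loss
    # Fourth training loss: aggregate the second losses over all frames of all samples.
    return second_loss.sum() / note_flag.sum().clamp(min=1.0)
```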
In one possible embodiment, the first training loss function, the second training loss function, and the third training loss function are cross entropy loss functions, respectively.
In one possible embodiment, the method further comprises:
Framing the sample audio to obtain frames of the sample audio, and determining the middle time point of each frame of the sample audio;
Determining a note start time point and a note end time point of each note included in the sample audio, and determining, for each frame, a target note start time point and a target note end time point that are closest in time to the middle time point of the frame;
and determining a first sample time characteristic value and a second sample time characteristic value corresponding to each frame of the sample audio based on the target note starting time point and the target note ending time point corresponding to each frame of the sample audio.
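For illustration, the frame-wise quantities above might be computed as sketched below. The hop time, the definition of the frame's middle time point, and the pooling of all notes regardless of pitch are simplifying assumptions; a per-pitch variant would keep separate onset and offset lists for each pitch.

```python
import numpy as np

def nearest_note_time_differences(num_frames, hop_time, onset_times, offset_times):
    """For each frame of the sample audio, return the time difference between the
    frame's middle time point and its nearest note start / note end time point.

    num_frames: number of frames obtained by framing the sample audio.
    hop_time: time between consecutive frames, in seconds (illustrative value).
    onset_times, offset_times: note start / end time points of the notes, in seconds.
    """
    frame_centers = (np.arange(num_frames) + 0.5) * hop_time  # middle time point of each frame
    onset_times = np.asarray(onset_times, dtype=float)
    offset_times = np.asarray(offset_times, dtype=float)
    # Target note start / end time point closest to each frame's middle time point.
    first_time_diff = np.abs(frame_centers[:, None] - onset_times[None, :]).min(axis=1)
    second_time_diff = np.abs(frame_centers[:, None] - offset_times[None, :]).min(axis=1)
    return first_time_diff, second_time_diff
```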
In one possible embodiment, the determining the first sample time feature value and the second sample time feature value corresponding to each frame of the sample audio based on the target note start time point and the target note end time point corresponding to each frame of the sample audio includes:
For each frame, determining a first time difference between a middle time point of the frame and a corresponding target note start time point, and a second time difference between the middle time point of the frame and a corresponding target note end time point;
Determining a first sample time characteristic value corresponding to each frame in the sample audio based on a first time difference corresponding to each frame in the sample audio;
And determining a second sample time characteristic value corresponding to each frame in the sample audio based on a second time difference corresponding to each frame in the sample audio.
In one possible implementation manner, the determining, based on the first time difference corresponding to each frame in the sample audio, a first sample time feature value corresponding to each frame in the sample audio includes:
For each frame, if the first time difference corresponding to the frame is larger than the preset time difference, determining the first sample time characteristic value corresponding to the frame as 0;
if the first time difference corresponding to the frame is smaller than or equal to the preset time difference, determining a first sample time characteristic value corresponding to the frame by the following method:
g(Δonset) = 1 - α|Δonset|
Wherein Δonset is the first time difference corresponding to the frame, α is the normalization coefficient, and g(Δonset) is the first sample time feature value corresponding to the frame.
In one possible implementation manner, the determining, based on the second time difference corresponding to each frame in the sample audio, a second sample time feature value corresponding to each frame in the sample audio includes:
for each frame, if the second time difference corresponding to the frame is greater than the preset time difference, determining the second sample time characteristic value corresponding to the frame as 0;
if the second time difference corresponding to the frame is smaller than or equal to the preset time difference, determining the second sample time characteristic value corresponding to the frame by the following method:
g(Δoffset) = 1 - α|Δoffset|
Wherein Δoffset is the second time difference corresponding to the frame, α is the normalization coefficient, and g(Δoffset) is the second sample time feature value corresponding to the frame.
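For illustration, the same mapping serves both the onset and the offset case and can be written as one function. Leaving α as an explicit parameter, and the idea of choosing α = 1 / preset_time_diff so the value decays linearly from 1 at the note boundary to 0 at the preset time difference, are assumptions, not values from the disclosure.

```python
def sample_time_feature_value(time_diff, alpha, preset_time_diff):
    """g(Δ) = 1 - α·|Δ| if |Δ| <= preset_time_diff, else 0.

    Used for both the first (onset) and second (offset) sample time feature
    values; time_diff is Δonset or Δoffset for the frame.
    """
    if abs(time_diff) > preset_time_diff:
        return 0.0
    return 1.0 - alpha * abs(time_diff)
```

For example, with an illustrative preset_time_diff of 0.05 s and α = 20, a frame whose middle time point lies 10 ms from the nearest note start would receive a first sample time feature value of 1 - 20 × 0.01 = 0.8.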
According to one or more embodiments of the present disclosure, example two provides a music transcription method, comprising:
acquiring audio to be processed, and determining an audio feature vector corresponding to the audio to be processed;
Inputting the audio feature vector of the audio to be processed into a music transcription model, and obtaining a music score corresponding to the audio to be processed based on an output result of the music transcription model;
Wherein the music transcription model is trained by the method described above.
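An inference-time sketch of this music transcription method is given below. Claim 1 states that the audio feature vector is based on logarithmic mel frequency spectrum features; the sample rate, mel resolution, hop length, the ordering of the model outputs, and the thresholding post-processing are assumptions made only for illustration.

```python
import librosa
import torch

def transcribe(audio_path, model, sr=16000, n_mels=229, hop_length=160, threshold=0.5):
    """Sketch: compute the audio feature vector for the audio to be processed and
    obtain a score from the music transcription model's output."""
    audio, _ = librosa.load(audio_path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=n_mels, hop_length=hop_length)
    log_mel = librosa.power_to_db(mel).T                     # (frames, n_mels) log-mel features
    feats = torch.from_numpy(log_mel).float().unsqueeze(0)   # (1, frames, n_mels)
    with torch.no_grad():
        onset_value, offset_value, frame_score = model(feats)  # assumed output ordering
    # A simple way to turn the predicted score into per-frame, per-pitch note activity;
    # the actual post-processing is not detailed in the disclosure.
    return (frame_score.squeeze(0) > threshold)
```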
According to one or more embodiments of the present disclosure, example three provides a training apparatus of a music transcription model corresponding to example one, including:
the training data acquisition module is used for acquiring training data, wherein each training sample in the training data comprises an audio feature vector of sample audio, a sample music score corresponding to the sample audio, and a first sample time feature value and a second sample time feature value corresponding to each frame in the sample audio, wherein for any frame, the first sample time feature value represents a time difference between a middle time point of the frame and a nearest note starting time point of the frame, and the second sample time feature value represents a time difference between the middle time point of the frame and a nearest note ending time point of the frame;
the training module is used for training the initial neural network model based on the training data until the model converges to the corresponding total loss function, and determining the model after training is finished as a music transcription model;
The input of the model is an audio feature vector of the sample audio, and the output of the model comprises the first prediction time feature value, the second prediction time feature value and the prediction music score corresponding to each frame in the sample audio;
The total loss function includes a first training loss function, a second training loss function, and a third training loss function, wherein a value of the first training loss function characterizes a difference between a sample score corresponding to the sample audio and the predicted score, a value of the second training loss function characterizes a difference between a first sample time feature value corresponding to the sample audio and the first predicted time feature value, and a value of the third training loss function characterizes a difference between a second sample time feature value corresponding to the sample audio and the second predicted time feature value.
In some possible embodiments, each training sample further includes a sample volume corresponding to the sample audio, the input of the model further includes a sample volume of the sample audio, the output of the model further includes a corresponding predicted volume of the sample audio, and the total loss function further includes a fourth training loss function, a value of the fourth training loss function characterizing a difference between the sample volume of the sample audio and the predicted volume.
In some possible embodiments, each training sample further includes a note representation value of each frame included in the sample audio, where the note representation value represents whether a note start time point is included in a frame; the sample volume of the sample audio includes a sample volume of each frame included in the sample audio, and the predicted volume includes a predicted volume of each frame;
Wherein the training module is used for obtaining the fourth training loss function by:
for the sample audio, calculating a first loss corresponding to each frame of the sample audio based on a sample volume and a predicted volume corresponding to each frame of the sample audio;
Determining a second loss corresponding to each frame of the sample audio based on the note characterization value and the corresponding first loss for each frame of the sample audio;
and obtaining the fourth training loss function based on the second loss corresponding to each frame of each sample audio.
In some possible embodiments, the first training loss function, the second training loss function, and the third training loss function are each cross entropy loss functions.
In some possible embodiments, the training data acquisition module is further configured to:
Framing the sample audio to obtain frames of the sample audio, and determining the middle time point of each frame of the sample audio;
Determining a note start time point and a note end time point of each note included in the sample audio, and determining a target note start time point and a target note end time point closest to the frame time for each frame;
and determining a first sample time characteristic value and a second sample time characteristic value corresponding to each frame of the sample audio based on the target note starting time point and the target note ending time point corresponding to each frame of the sample audio.
In some possible embodiments, the training data acquisition module is further configured to:
For each frame, determining a first time difference between a middle time point of the frame and a corresponding target note start time point, and a second time difference between the middle time point of the frame and a corresponding target note end time point;
Determining a first sample time characteristic value corresponding to each frame in the sample audio based on a first time difference corresponding to each frame in the sample audio;
And determining a second sample time characteristic value corresponding to each frame in the sample audio based on a second time difference corresponding to each frame in the sample audio.
In some possible embodiments, the training data acquisition module is configured to:
For each frame, if the first time difference corresponding to the frame is larger than the preset time difference, determining the first sample time characteristic value corresponding to the frame as 0;
if the first time difference corresponding to the frame is smaller than or equal to the preset time difference, determining a first sample time characteristic value corresponding to the frame by the following method:
g(Δonset) = 1 - α|Δonset|
Wherein Δonset is the first time difference corresponding to the frame, α is the normalization coefficient, and g(Δonset) is the first sample time feature value corresponding to the frame.
In some possible embodiments, the training data acquisition module is configured to:
for each frame, if the second time difference corresponding to the frame is greater than the preset time difference, determining the second sample time characteristic value corresponding to the frame as 0;
if the second time difference corresponding to the frame is smaller than or equal to the preset time difference, determining the second sample time characteristic value corresponding to the frame by the following method:
g(Δoffset) = 1 - α|Δoffset|
Wherein Δoffset is the second time difference corresponding to the frame, α is the normalization coefficient, and g(Δoffset) is the second sample time feature value corresponding to the frame.
According to one or more embodiments of the present disclosure, example four provides a music transcription apparatus corresponding to example two, including:
the to-be-processed audio acquisition module is used for acquiring the audio to be processed and determining the audio feature vector corresponding to the audio to be processed;
The transcription module is used for inputting the audio feature vector of the audio to be processed into a music transcription model, and obtaining a music score corresponding to the audio to be processed based on the output result of the music transcription model;
Wherein the music transcription model is trained by the training method of the music transcription model in example one.
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the specific combinations of the features described above, but also covers other technical solutions formed by any combination of the features described above or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by substituting the features described above with technical features having similar functions disclosed in the present disclosure (but not limited thereto).
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (9)

1. A method of training a music transcription model, the method comprising:
Obtaining training data, wherein each training sample in the training data comprises an audio feature vector of sample audio, a sample music score corresponding to the sample audio, and a first sample time feature value and a second sample time feature value corresponding to each frame in the sample audio, wherein for any frame, the first sample time feature value is 0 in response to the time difference between the middle time point of the frame and the nearest note starting time point of the frame being greater than a preset threshold, and the first sample time feature value is the difference between 1 and the product of the time difference and a preset normalization coefficient in response to the time difference being less than the preset threshold; for any frame, the second sample time feature value is 0 in response to the time difference between the middle time point of the frame and the nearest note ending time point of the frame being greater than the preset threshold, and the second sample time feature value is the difference between 1 and the product of the time difference and the preset normalization coefficient in response to the time difference being less than the preset threshold; and the audio feature vector is obtained based on logarithmic mel frequency spectrum features;
training an initial neural network model based on the training data until the model converges to a corresponding total loss function, and determining the model after training is finished as a music transcription model;
The input of the model is an audio feature vector of the sample audio, and the output of the model comprises a first prediction time feature value, a second prediction time feature value and a prediction music score corresponding to each frame in the sample audio;
the total loss function comprises a first training loss function, a second training loss function and a third training loss function, wherein the value of the first training loss function represents the difference between a sample music score corresponding to the sample audio and the predicted music score, the value of the second training loss function represents the difference between a first sample time characteristic value corresponding to the sample audio and the first predicted time characteristic value, and the value of the third training loss function represents the difference between a second sample time characteristic value corresponding to the sample audio and the second predicted time characteristic value.
2. The method of claim 1, wherein each training sample further comprises a sample volume corresponding to the sample audio, wherein the input of the model further comprises a sample volume of the sample audio, wherein the output of the model further comprises a corresponding predicted volume of the sample audio, wherein the total loss function further comprises a fourth training loss function, wherein a value of the fourth training loss function characterizes a difference between the sample volume and the predicted volume of the sample audio.
3. The method of claim 2, wherein each training sample further comprises a note-representative value for each frame contained in the sample audio, the note-representative value being representative of whether a note-start time point is contained in a frame; the sample volume of the sample audio comprises the sample volume of each frame contained in the sample audio, and the predicted volume comprises the predicted volume of each frame;
wherein the fourth training loss function is obtained by:
for the sample audio, calculating a first loss corresponding to each frame of the sample audio based on a sample volume and a predicted volume corresponding to each frame of the sample audio;
Determining a second penalty corresponding to each frame of the sample audio based on the note characterization value and the corresponding first penalty for each frame of the sample audio;
And obtaining the fourth training loss function based on the second loss corresponding to each frame of each sample audio.
4. The method of claim 1, wherein the first training loss function, the second training loss function, and the third training loss function are each cross entropy loss functions.
5. A method of music transcription, the method comprising:
Acquiring audio to be processed, and determining an audio feature vector corresponding to the audio to be processed;
inputting the audio feature vector of the audio to be processed into a music transcription model, and obtaining a music score corresponding to the audio to be processed based on an output result of the music transcription model;
Wherein the music transcription model is trained by the method of any one of claims 1 to 4.
6. A training device for a music transcription model, the training device comprising:
The training data acquisition module is used for acquiring training data, wherein each training sample in the training data comprises an audio feature vector of sample audio, a sample music score corresponding to the sample audio, and a first sample time feature value and a second sample time feature value corresponding to each frame in the sample audio, wherein for any frame, the first sample time feature value is 0 in response to the time difference between the middle time point of the frame and the nearest note starting time point of the frame being greater than a preset threshold, and the first sample time feature value is the difference between 1 and the product of the time difference and a preset normalization coefficient in response to the time difference being less than the preset threshold; for any frame, the second sample time feature value is 0 in response to the time difference between the middle time point of the frame and the nearest note ending time point of the frame being greater than the preset threshold, and the second sample time feature value is the difference between 1 and the product of the time difference and the preset normalization coefficient in response to the time difference being less than the preset threshold; and the audio feature vector is obtained based on logarithmic mel frequency spectrum features;
The training module is used for training the initial neural network model based on the training data until the model converges to the corresponding total loss function, and determining the model after training is finished as a music transcription model;
The input of the model is an audio feature vector of the sample audio, and the output of the model comprises a first prediction time feature value, a second prediction time feature value and a prediction music score corresponding to each frame in the sample audio;
the total loss function comprises a first training loss function, a second training loss function and a third training loss function, wherein the value of the first training loss function represents the difference between a sample music score corresponding to the sample audio and the predicted music score, the value of the second training loss function represents the difference between a first sample time characteristic value corresponding to the sample audio and the first predicted time characteristic value, and the value of the third training loss function represents the difference between a second sample time characteristic value corresponding to the sample audio and the second predicted time characteristic value.
7. A music transcription apparatus, characterized in that the music transcription apparatus comprises:
The to-be-processed audio acquisition module is used for acquiring audio to be processed and determining an audio feature vector corresponding to the audio to be processed;
The transcription module is used for inputting the audio feature vector of the audio to be processed into a music transcription model, and obtaining a music score corresponding to the audio to be processed based on an output result of the music transcription model;
wherein the music transcription model is trained by the method of any one of claims 1 to 4.
8. An electronic device comprising a processor and a memory, the processor and the memory being interconnected;
the memory is used for storing a computer program;
The processor is configured to perform the method of any of claims 1 to 4 or to perform the method of claim 5 when the computer program is invoked.
9. A computer readable medium, characterized in that it stores a computer program that is executed by a processor to implement the method of any one of claims 1 to 4 or to implement the method of claim 5.
CN202010779114.6A 2020-08-05 2020-08-05 Training method of music transcription model, music transcription method and corresponding device Active CN111898753B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010779114.6A CN111898753B (en) 2020-08-05 2020-08-05 Training method of music transcription model, music transcription method and corresponding device

Publications (2)

Publication Number Publication Date
CN111898753A CN111898753A (en) 2020-11-06
CN111898753B true CN111898753B (en) 2024-07-02

Family

ID=73246894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010779114.6A Active CN111898753B (en) 2020-08-05 2020-08-05 Training method of music transcription model, music transcription method and corresponding device

Country Status (1)

Country Link
CN (1) CN111898753B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634841B (en) * 2020-12-02 2022-11-29 爱荔枝科技(北京)有限公司 Guitar music automatic generation method based on voice recognition
CN112669796A (en) * 2020-12-29 2021-04-16 西交利物浦大学 Method and device for converting music into music book based on artificial intelligence

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101652807A (en) * 2007-02-01 2010-02-17 缪斯亚米有限公司 Music transcription
CN110008372A (en) * 2019-02-22 2019-07-12 北京奇艺世纪科技有限公司 Model generating method, audio-frequency processing method, device, terminal and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9031243B2 (en) * 2009-09-28 2015-05-12 iZotope, Inc. Automatic labeling and control of audio algorithms by audio recognition
US9779706B2 (en) * 2016-02-18 2017-10-03 University Of Rochester Context-dependent piano music transcription with convolutional sparse coding
CN111261147B (en) * 2020-01-20 2022-10-11 浙江工业大学 Music embedding attack defense method for voice recognition system
CN111369982B (en) * 2020-03-13 2024-06-25 北京远鉴信息技术有限公司 Training method of audio classification model, audio classification method, device and equipment
CN111429940B (en) * 2020-06-15 2020-10-09 杭州贝哆蜂智能有限公司 Real-time music transcription and music score matching method based on deep learning

Also Published As

Publication number Publication date
CN111898753A (en) 2020-11-06

Similar Documents

Publication Publication Date Title
CN112153460B (en) Video dubbing method and device, electronic equipment and storage medium
CN111883117B (en) Voice wake-up method and device
CN111898753B (en) Training method of music transcription model, music transcription method and corresponding device
WO2023273596A1 (en) Method and apparatus for determining text correlation, readable medium, and electronic device
CN111597825B (en) Voice translation method and device, readable medium and electronic equipment
WO2022156413A1 (en) Speech style migration method and apparatus, readable medium and electronic device
WO2023088280A1 (en) Intention recognition method and apparatus, readable medium, and electronic device
CN114443891B (en) Encoder generation method, fingerprint extraction method, medium, and electronic device
CN112634928A (en) Sound signal processing method and device and electronic equipment
CN112562633A (en) Singing synthesis method and device, electronic equipment and storage medium
CN111785247A (en) Voice generation method, device, equipment and computer readable medium
WO2023211369A2 (en) Speech recognition model generation method and apparatus, speech recognition method and apparatus, medium, and device
CN115270717A (en) Method, device, equipment and medium for detecting vertical position
CN112257459B (en) Language translation model training method, translation method, device and electronic equipment
CN114595361B (en) Music heat prediction method and device, storage medium and electronic equipment
WO2023000782A1 (en) Method and apparatus for acquiring video hotspot, readable medium, and electronic device
CN112286808B (en) Application program testing method and device, electronic equipment and medium
CN113593527B (en) Method and device for generating acoustic features, training voice model and recognizing voice
CN112185186B (en) Pronunciation correction method and device, electronic equipment and storage medium
CN114428867A (en) Data mining method and device, storage medium and electronic equipment
CN111754984B (en) Text selection method, apparatus, device and computer readable medium
CN116821327A (en) Text data processing method, apparatus, device, readable storage medium and product
CN112951274A (en) Voice similarity determination method and device, and program product
CN113345426B (en) Voice intention recognition method and device and readable storage medium
CN112382266B (en) Speech synthesis method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant