CN112633175A - Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment - Google Patents

Info

Publication number
CN112633175A
CN112633175A
Authority
CN
China
Prior art keywords
audio
note
neural network
convolutional neural
identification
Prior art date
Legal status
Pending
Application number
CN202011549426.4A
Other languages
Chinese (zh)
Inventor
卢迪
邢湘琦
Current Assignee
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Harbin University of Science and Technology filed Critical Harbin University of Science and Technology
Priority to CN202011549426.4A priority Critical patent/CN112633175A/en
Publication of CN112633175A publication Critical patent/CN112633175A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2218/00 - Aspects of pattern recognition specially adapted for signal processing
    • G06F 2218/08 - Feature extraction
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V 10/449 - Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 - Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 - Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Signal Processing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The invention discloses a real-time single-note recognition algorithm with an audio-extraction front end. The main steps are: first, select a clean audio data set for the instrument to be recognized; add various noises to the clean audio to simulate real recording conditions; extract features from the input waveform with an encoder; model the extracted feature vectors with several stacked temporal convolutional networks; finally, recover the clean audio waveform with a decoder; and build a multi-scale convolutional neural network for note recognition. The invention provides a time-domain audio processing method, which overcomes the drawback of traditional frequency-domain processing that phase information is ignored. The target audio to be recognized is extracted from the complex environment before note recognition, and features are extracted with a multi-scale convolutional neural network, improving note-recognition performance. This addresses the sensitivity of existing note-recognition algorithms to noise; the model is small, and note-recognition speed and accuracy are greatly improved.

Description

Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment
Technical Field
The invention belongs to the field of audio signal processing and relates to a real-time, multi-scale convolutional neural network algorithm for monophonic note recognition with a noise-reduction function.
Background
With economic development, people pay increasing attention to their spiritual life, and music has become an important part of daily entertainment. The number of instrument learners keeps growing and includes many amateurs and beginners. Traditional instrument study requires guidance from a professional teacher, which is costly, and time and space constraints make it impossible for a teacher to supervise every practice session. Learners therefore cannot check their playing level in real time after each performance. A note-recognition algorithm can help a player judge playing accuracy, and note recognition can also reduce the workload of musicians processing music.
At present, recognition of clean notes in relatively quiet environments is increasingly mature, but existing algorithms apply only to a narrow range of clean, ideal audio: they place strict requirements on music-signal quality and have poor robustness. In real environments, the notes to be recognized are polluted by ambient noise, or note signals from several instruments are mixed together, which interferes heavily with feature extraction. Feature extraction is a key step in audio processing: its essence is to extract, from the rich information in the audio signal, parameters that express the essential character of the note signal. If feature extraction goes badly wrong, note-recognition accuracy drops sharply and the recognition task fails. Denoising the music signal and extracting the target audio to be recognized is therefore an indispensable part of the whole system.
Traditional unsupervised noise-reduction methods such as spectral subtraction, Kalman filtering, and Wiener filtering rely on particular assumptions and prior information when denoising audio in complex environments: for example, that the noise is stationary, that the clean audio and the noise are uncorrelated, or that the Fourier coefficients of the note audio and the noise are mutually independent in the time-frequency domain. Although such methods are simple to implement, they are very demanding in their noise estimates, the parameters of a traditional extraction system need hand tuning, and system robustness is poor. To improve robustness, the invention extracts the audio with a neural network, so the target audio can be extracted effectively in real time under complex conditions with multiple noises or multiple instruments.
Existing audio-processing algorithms depend almost entirely on spectrogram representations, converting the waveform to the frequency domain for feature extraction. However, applying the short-time Fourier transform (STFT) to the audio waveform has several limitations. First, because the model does not estimate the source phase, the phase is usually assumed equal to the mixture phase; a wrong phase estimate degrades the exact reconstruction of the clean source and lowers the upper bound on reconstruction quality. Second, successfully separating a source from a time-frequency representation requires a high-resolution frequency decomposition of the mixture, which demands a long STFT window. This raises the minimum latency of the system and limits its applicability in real-time, low-delay settings.
In summary, the technical problem addressed by the present invention is how to extract the audio signal of the instrument to be recognized accurately and quickly when the signal contains many types of noise. The model generalizes well and is robust, so the system maintains relatively stable accuracy across noise types and noise levels. At the same time, the proposed algorithm is not limited to the notes of one specific instrument but applies generally to instrument note recognition, and it improves note-recognition accuracy.
Disclosure of Invention
The invention provides a single-note real-time recognition algorithm based on a multi-scale convolutional neural network in a complex environment.
The technical solution is based on multi-scale neural networks and end-to-end audio extraction and note recognition. To achieve this purpose, the invention adopts the following steps:
Step 1. Mix the clean audio with the noise data set to build the required data set, and design an encoder module that converts short segments of the mixed signal into corresponding representations in a feature space.
Step 2. Use a clean-audio extraction module to estimate a multiplicative function (mask) for each source.
Step 3. The decoder module reconstructs the source waveform from the features extracted by the encoder, using the obtained mask coefficients.
Step 4. Build the single-note recognition module: design a multi-scale audio encoder that extracts features from the clean audio signal with three convolution modules of different sizes.
Step 5. Train a convolutional neural network for note recognition.
Step 6. Feed the clean audio features extracted in step 4 into the trained neural network to complete note recognition.
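For orientation, the following is a minimal sketch of the encoder-masker-decoder pipeline of steps 1-3, operating directly on waveforms. The layout follows the description; the sizes N and L, the sigmoid mask activation, and the single-source case C=1 are illustrative assumptions, and the simple 1x1 mask estimator stands in for the stacked dilated-convolution separation module detailed later.

```python
import torch
import torch.nn as nn

class EncMaskDec(nn.Module):
    """Encoder -> mask estimator -> decoder, operating directly on waveforms."""
    def __init__(self, N=512, L=16, C=1):
        super().__init__()
        self.N, self.L, self.C = N, L, C
        # Encoder: 1-D conv with 50% overlap (stride L//2); ReLU keeps w non-negative
        self.encoder = nn.Conv1d(1, N, kernel_size=L, stride=L // 2, bias=False)
        # Mask estimator: stand-in for the stacked dilated-TCN separation module
        self.masker = nn.Sequential(nn.Conv1d(N, N * C, kernel_size=1), nn.Sigmoid())
        # Decoder: transposed 1-D conv sums overlapping reconstructed segments
        self.decoder = nn.ConvTranspose1d(N, 1, kernel_size=L, stride=L // 2, bias=False)

    def forward(self, mix):                       # mix: (batch, 1, samples)
        w = torch.relu(self.encoder(mix))         # (batch, N, frames)
        m = self.masker(w).view(mix.size(0), self.C, self.N, -1)
        sources = [self.decoder(w * m[:, i]) for i in range(self.C)]
        return torch.stack(sources, dim=1)        # (batch, C, 1, samples)
```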
Compared with the prior art, the invention has the following notable advantages. The method takes multi-scale neural networks and end-to-end audio extraction and note recognition as its technical basis. First, a convolution module extracts features directly from the waveform, replacing the conventional approach of applying a short-time Fourier transform (STFT) and operating on the spectrogram in the frequency domain; this preserves audio phase information while increasing processing speed. The target audio signal is extracted at the front end of recognition, so the system maintains relatively stable accuracy across noise types and levels, such as white Gaussian noise, background noise, or interfering sound sources, where traditional algorithms perform poorly in practice. The audio-extraction stage also makes the monophonic note recognition system robust to interference from complex environmental factors. Moreover, the proposed algorithm is not limited to the notes of one specific instrument but generalizes across instruments. A multi-scale convolutional encoding and decoding module is adopted in the note-recognition module, greatly improving recognition accuracy. The model has strong generalization ability and precision, processes audio directly in the time domain, and runs in real time.
Drawings
To illustrate the technical solution of the present invention more clearly, the drawings used in its description are briefly introduced below;
FIG. 1 is a general flow chart of a single note real-time identification algorithm based on a multi-scale convolutional neural network under a complex environment;
FIG. 2 is a detailed flowchart of the algorithm;
FIG. 3 is a block diagram of a pure target audio extraction process under a complex environment according to the present invention;
FIG. 4 is a block diagram of an overall network architecture for a time-series convolutional neural network (TCN) for audio extraction according to the present invention;
FIG. 5 is a block diagram of a design framework for each one-dimensional convolution block of the audio extraction time-series convolutional neural network (TCN) of the present invention;
FIG. 6 is a block diagram of the monophonic note recognition convolutional neural network model of the present invention.
Detailed Description
The single-note real-time recognition method based on a multi-scale convolutional neural network in a complex environment builds its model on a deep-learning convolutional neural network and on target-audio extraction in complex environments. The overall flow is shown in FIG. 1, and the detailed flow of the audio extraction and recognition scheme is shown in FIG. 2.
First, a clean audio data set of the instrument to be recognized is selected; because the model generalizes well and is not limited to the notes of one specified instrument, this data set can be replaced with any target data set. The selected piano audio (MAPS) data set contains piano audio files, associated MIDI files, and labeled txt files. The data set is divided into 9 directories according to piano type and recording conditions, each containing isolated notes, chords, and complete piano pieces. Each directory holds 30 complete piano pieces, 270 in total, with a total duration of about 18 hours. Audio extraction is applied to the complex audio to be recognized in the preprocessing stage at the front end of note recognition: a noisy piano audio signal is input and features are extracted by a one-dimensional convolutional encoder; the extracted feature vectors are modeled by several stacked temporal convolutional networks (TCNs); finally, the result is fed to a one-dimensional convolutional decoder to obtain the clean piano waveform. Meanwhile, a multi-scale convolutional neural network note-recognition model is built and trained on the clean piano data set (MAPS) to obtain a model weight file, enabling accurate recognition of piano notes in complex environments.
This solves the problems that existing note-recognition algorithms are sensitive to noise, demand high audio quality, and suit only a single scenario. The real-time monophonic note recognition method based on a multi-scale convolutional neural network in a complex environment comprises the following steps:
step one, establishing and processing an audio data set in a complex environment. In this section the pure piano audio (MAPS) data set selected for use with the present invention. The pure piano audio and noise audio sampling rates are first adjusted to the same 44100 HZ. And then carrying out batch aliasing on the pure piano audio and the noise to form a noise-containing mixed audio data set Y, wherein the noise audio data comprises white Gaussian noise, human speaking noise, sudden whistling noise and other musical instrument drum sounds. The formula for mixing the audio is:
Figure BDA0002856615420000041
and dividing the audio data under each environment in the audio data set Y into a training set, a verification set and a test set. The distribution proportion is that the number of the training set audio pieces, the number of the verification set audio pieces and the number of the test set audio pieces are 3: 2.
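A sketch of the mixing step, assuming the additive model y(t) = x(t) + n(t) above. The per-example SNR control and the noise tiling are added assumptions, since the text only states that clean piano audio and noise are aliased in batches at 44100 Hz:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db, rng=None):
    """Additively mix a clean clip with noise scaled to the given SNR (both float arrays)."""
    rng = rng or np.random.default_rng()
    # Tile the noise if it is shorter than the clean clip, then crop at a random offset
    reps = int(np.ceil(len(clean) / len(noise))) + 1
    tiled = np.tile(noise, reps)
    start = rng.integers(0, len(tiled) - len(clean))
    n = tiled[start:start + len(clean)]
    # Scale noise so that 10*log10(P_clean / P_noise) == snr_db
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(n ** 2) + 1e-12
    n = n * np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + n
```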
Step two: construct a one-dimensional convolutional encoder to extract features of the mixed signal. The input mixture is divided into overlapping segments of length L, denoted y_k ∈ R^(1×L), where k = 1, ..., T̂ is the segment index and T̂ is the total number of input segments. A one-dimensional convolution converts y_k into an N-dimensional representation w ∈ R^(1×N); written as a matrix multiplication (dropping the index k from here on):

w = H(y_k U)

where U ∈ R^(N×L) contains N vectors (the basis functions of the encoder), each of length L, and H(·) is the ReLU function, which guarantees a non-negative representation.
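The encoder equation w = H(y_k U) can be written directly as a matrix product over the overlapping segments. In this sketch U is stored as an L×N matrix (the transpose of the U ∈ R^(N×L) in the text), and the 50% segment overlap is an assumption carried over from the experimental configuration below:

```python
import torch

def encode_segments(y, U, L):
    """Split y into 50%-overlapping length-L segments and apply w_k = ReLU(y_k U)."""
    # y: 1-D waveform tensor with at least L samples; U: (L, N) encoder basis matrix
    segments = y.unfold(0, L, L // 2)   # (T_hat, L): T_hat overlapping segments
    return torch.relu(segments @ U)     # (T_hat, N): one embedding row per segment
```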
Step three: design the clean-audio convolutional extraction module; the flow diagram of clean target-audio extraction in a complex environment is shown in FIG. 3.
the present invention uses a full convolution separation module consisting of stacked one-dimensional expansion convolution blocks, as shown in fig. 4. Time-series convolutional neural networks (TCNs) are used in place of Recurrent Neural Networks (RNNs) in various sequence modeling tasks. Each layer in the TCN consists of one-dimensional convolution blocks with gradually increasing expansion factors. The expansion factor increases exponentially to ensure that a sufficiently large time window can be included. Wherein M expansion factors are 1, 2, 4M-1The convolution block of (2) is repeated R times. The inputs of each block are zero padded to ensure that the output length is the same as the input. The output of TCN will be fed to a convolutional block of kernel size 1 to estimate the mask. The 1 x 1 convolutional block, together with the nonlinear activation function, estimates C mask vectors for the C target sources.
Fig. 5 shows the design of each one-dimensional convolution block. Both a residual path and a skip path are applied: the residual path of one block serves as the input to the next block, and the sum of the skip paths of all blocks serves as the output of the temporal convolutional network (TCN).
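A sketch of one dilated one-dimensional convolution block with the residual and skip paths just described. The channel counts B, H, Sc and kernel size P follow the notation used below; their values here are illustrative, and GroupNorm with a single group is used because it normalizes over the channel and time dimensions like the gLN defined later:

```python
import torch.nn as nn

class Conv1DBlock(nn.Module):
    """One TCN block: 1x1 conv -> dilated depthwise conv -> residual and skip outputs."""
    def __init__(self, B=128, H=512, P=3, dilation=1, Sc=128):
        super().__init__()
        pad = (P - 1) * dilation // 2              # zero padding keeps output length == input
        self.in_conv = nn.Conv1d(B, H, kernel_size=1)
        self.prelu1 = nn.PReLU()
        self.norm1 = nn.GroupNorm(1, H, eps=1e-8)  # one group: normalize over channels and time
        self.d_conv = nn.Conv1d(H, H, kernel_size=P, dilation=dilation,
                                padding=pad, groups=H)  # depthwise convolution
        self.prelu2 = nn.PReLU()
        self.norm2 = nn.GroupNorm(1, H, eps=1e-8)
        self.res_conv = nn.Conv1d(H, B, kernel_size=1)   # residual path back to B channels
        self.skip_conv = nn.Conv1d(H, Sc, kernel_size=1) # skip path; Sc may differ from B

    def forward(self, x):                          # x: (batch, B, frames)
        h = self.norm1(self.prelu1(self.in_conv(x)))
        h = self.norm2(self.prelu2(self.d_conv(h)))
        return x + self.res_conv(h), self.skip_conv(h)   # (residual out, skip out)
```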
To further reduce the number of parameters, a depthwise separable convolution (S_conv) replaces the standard convolution in each block. The depthwise separable convolution operator decouples the standard convolution into two successive operations, a depthwise convolution (D_conv(·)) followed by a pointwise convolution with kernel size 1×1:

D_conv(Z, K) = concat(z_j ⊛ k_j), j = 1, ..., N
S_conv(Z, K, L) = D_conv(Z, K) ⊛ L

where Z is the input of S_conv(·), K is a convolution kernel of size P, z_j and k_j are the rows of the matrices Z and K respectively, L is a convolution kernel of size 1, and ⊛ denotes the convolution operation.
The first 1×1 conv(·) and D_conv(·) blocks are each followed by a nonlinear activation function and normalization. The nonlinear activation function is the parametric rectified linear unit (PReLU):

PReLU(x) = max(0, x) + α · min(0, x)

where α is a learnable parameter. Normalization in the network uses global layer normalization (gLN), which normalizes the feature over both the channel and time dimensions:

gLN(F) = ((F − E[F]) / sqrt(Var[F] + ε)) ⊙ γ + β
E[F] = (1/NT) Σ_{N,T} F
Var[F] = (1/NT) Σ_{N,T} (F − E[F])²

where F is the feature, γ and β are trainable parameters, and ε is a small constant for numerical stability.
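A sketch of global layer normalization following the definition above; gamma and beta are the trainable per-channel parameters, and eps plays the role of the small constant ε:

```python
import torch
import torch.nn as nn

class GlobalLayerNorm(nn.Module):
    """gLN: normalize a feature over both the channel (N) and time (T) dimensions."""
    def __init__(self, channels, eps=1e-8):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1, channels, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1))
        self.eps = eps

    def forward(self, F):                          # F: (batch, channels, time)
        mean = F.mean(dim=(1, 2), keepdim=True)    # E[F] over channel and time
        var = F.var(dim=(1, 2), keepdim=True, unbiased=False)
        return self.gamma * (F - mean) / torch.sqrt(var + self.eps) + self.beta
```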
A linear one-dimensional convolution block is added at the beginning of the extraction module as a bottleneck layer. This block determines the number of channels of the input and residual paths of the subsequent convolution blocks. The linear bottleneck layer has B channels; for a one-dimensional convolution block with H channels and kernel size P, the kernel sizes of the first 1×1 convolution block and the first depthwise convolution (D_conv) block are B×H×1 and H×P respectively, and the kernel size in the residual path is H×B×1. The number of output channels of the skip-connection path may differ from B; the kernel size in this path is denoted L_Sc.
Step four: estimate the extraction masks. Separation of each frame is achieved by estimating C vectors (masks) m_i ∈ R^(1×N), where C is the number of noise sources in the mixed signal and m_i ∈ [0, 1]. Applying m_i to the mixed representation w yields the corresponding source representation:

d_i = w ⊙ m_i

where ⊙ denotes element-wise multiplication. The estimated target audio waveform signal x̂_i is then reconstructed by the decoder, as described in step five.
Step five: the decoder reconstructs the waveform from this representation using a one-dimensional transposed convolution, which can be written as a matrix multiplication:

x̂ = d V

where x̂ is the reconstruction of x and the rows of V ∈ R^(N×L) are the basis functions of the decoder, each of length L. The overlapping reconstructed segments are summed to generate the final waveform. The detailed audio-extraction flow is shown in FIG. 2.
Experimental configuration: the network was trained for 150 epochs on 5-second segments. The initial learning rate was set to 1e-3; if validation accuracy did not improve for 3 consecutive epochs, the learning rate was halved. The optimizer was Adam. The convolutional autoencoder used a stride of 50% of the window (i.e., 50% overlap between consecutive frames). A maximum L2-norm gradient clipping of 5 was applied during training.
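These settings map onto standard PyTorch utilities as in the sketch below; `model`, `train_loader`, `val_loader`, and `validate` are assumed to exist, `si_snr` is the helper sketched under the training objective below, and using the negative SI-SNR as the loss is an assumption consistent with maximizing SI-SNR:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=3)  # halve LR after 3 stalled epochs

for epoch in range(150):
    for mix, clean in train_loader:                 # 5-second segments
        loss = -si_snr(model(mix), clean)           # maximize SI-SNR
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5)  # L2 clipping at 5
        optimizer.step()
    scheduler.step(validate(model, val_loader))     # validation accuracy drives the schedule
```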
Training objective: the goal of training the end-to-end system is to maximize the scale-invariant signal-to-noise ratio (SI-SNR), defined as:

s_target = (⟨x̂, x⟩ x) / ‖x‖²
e_noise = x̂ − s_target
SI-SNR = 10 log₁₀ (‖s_target‖² / ‖e_noise‖²)

where x̂ and x are the estimated and target clean sources. The cross-entropy loss function is as follows:

CE = −Σ_i y_i log(ŷ_i)
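A sketch of the SI-SNR computation following the definition above (zero-meaning the signals first is a common convention and an added assumption here); during training the loss is its negative:

```python
import torch

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB; est and ref are 1-D waveforms."""
    est = est - est.mean()
    ref = ref - ref.mean()
    s_target = (torch.dot(est, ref) / (torch.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10 * torch.log10(s_target.pow(2).sum() / (e_noise.pow(2).sum() + eps))
```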
and after the extraction processing of one target audio frequency in the note signal training set is finished, the training set, the verification set and the test set are sequentially processed according to the steps.
Step six: build the single-note recognition model. The algorithm uses a multi-scale convolutional neural network encoder to extract time-domain features and a note-level method based on a convolutional neural network (CNN) to transcribe the monophonic music signal. The CNN model is well suited to detecting spatial structure, and compared with a DNN, the CNN extracts features with shared parameters, which reduces model size, helps prevent overfitting, and improves generalization.
The audio recognition model detects the pitch of the newly played note in the input frame; the output layer of the neural network contains 88 output units, corresponding to the 88 keys of the piano selected in the present invention.
The audio feature-extraction part uses a multi-scale audio encoder, mainly to avoid the phase-estimation problem of frequency-domain methods: a time-domain method is chosen in which a convolutional network converts the time-domain mixture directly into a feature representation. In frequency-domain methods, the audio signal is decomposed by the Fourier transform into an alternative representation characterized by sines and cosines. In the time-domain approach, the filters of the convolutional layer can analogously be treated as basis functions, which corresponds to treating the sine and cosine representations of the frequency domain as embedding coefficients. Time-domain encoding differs from the Fourier transform in two ways: (a) the feature representation does not process real and imaginary parts separately; (b) the basis functions are not predefined as sines or cosines but can be learned from data.
The input clean target audio signal x(t) is encoded into embedding coefficients by a convolutional neural network: several parallel one-dimensional convolutional networks encode the clean target audio into a multi-scale audio embedding, with each CNN module operating at a different time resolution. The number of scales can vary; the system is generic, and the number of time scales can be changed according to the type of audio signal to be recognized. The present invention studies only three time scales.
Three time scales are chosen because the fundamental frequencies of the piano's 88 keys are distributed over 27.5 Hz-4186.0 Hz, divided into low (27.5 Hz-123.47 Hz), middle (130.81 Hz-739.99 Hz), and high (783.99 Hz-4186.00 Hz) bands; the corresponding filters have different lengths of L1 (short), L2 (middle), and L3 (long) samples to cover different window sizes.
Each convolutional neural network (CNN) is followed by a rectified linear unit (ReLU) activation function, producing the note embedding E = [E1 E2 E3].
To concatenate embeddings across time scales, the same stride of L1/2 samples is kept at every scale so that the embeddings are aligned.
As the filter length changes, the encoder learns representations at multiple scales: short windows give good resolution in the high band, and long windows give higher resolution in the low band. The time-domain signal is encoded at three time resolutions in the embedding E. The embedding coefficients E_i at each scale are defined as:

E_i,k = ReLU(x_i,k U_i)
K = 2(T − L1)/L1 + 1

where x_i,k is the k-th window of L_i samples, shifted by L1/2 samples each time, and K is the number of windows. [E1 E2 E3] is concatenated as the extracted feature of the piano audio signal.
A convolutional neural network model for monophonic note recognition is then constructed. The structure of the CNN model is shown in FIG. 6: 8 convolutional layers, 4 max-pooling layers, and 3 fully connected layers.
Normalization is added after each convolutional layer, which accelerates model convergence, provides a regularization effect, and prevents overfitting; it also keeps the data fed to the ReLU from growing so large that network performance becomes unstable. A dropout layer is used to prevent overfitting and improve the generalization ability of the model.
Every layer except the output layer uses a rectified linear activation function. The final output layer contains 88 output units with a sigmoid activation function.
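A sketch matching the stated layout (8 convolutional layers, each followed by normalization, 4 max-pooling layers, dropout, 3 fully connected layers, and 88 sigmoid output units); the channel widths, input feature size, and frame count are illustrative assumptions (e.g., in_ch = 3×N from the multi-scale encoder):

```python
import torch.nn as nn

def conv_bn(c_in, c_out):
    # Conv -> BatchNorm -> ReLU: normalization after every conv layer, as described
    return nn.Sequential(nn.Conv1d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm1d(c_out), nn.ReLU())

class NoteCNN(nn.Module):
    def __init__(self, in_ch=384, frames=64):
        super().__init__()
        self.features = nn.Sequential(
            conv_bn(in_ch, 64), conv_bn(64, 64),   nn.MaxPool1d(2),
            conv_bn(64, 128),   conv_bn(128, 128), nn.MaxPool1d(2),
            conv_bn(128, 256),  conv_bn(256, 256), nn.MaxPool1d(2),
            conv_bn(256, 256),  conv_bn(256, 256), nn.MaxPool1d(2),
        )  # 8 conv layers, 4 max-pooling layers
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * (frames // 16), 512), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, 88), nn.Sigmoid(),      # one output unit per piano key
        )

    def forward(self, x):                          # x: (batch, in_ch, frames)
        return self.classifier(self.features(x))
```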
Experimental configuration: the loss function is the cross-entropy loss, optimized with the Adam algorithm. The training batch size is set to 16 and the number of iterations (epochs) to 50; a weight file is saved after every 500 training clips. The cross-entropy loss function is as follows:

CE = −Σ_i y_i log(ŷ_i)
and training the convolutional neural network model identified by the tone characters in sequence according to the steps until the loss of the network model is converged, and finishing training the tone character identification model. The weight file and various configuration files of the tone character recognition model are saved.
Step seven: use the trained complex-environment real-time note extraction and recognition system to recognize the music-signal audio of the test set, count the note-recognition accuracy, and compare the performance with note recognition of audio recorded in an actual complex environment.
The specific implementation is as follows:
(1) Use the traditional note-recognition system to run note-recognition tests on 1000 test audio clips from the constructed complex-environment audio database without target extraction, and count the note-recognition accuracy.
(2) Use the traditional note-recognition system to run note-recognition tests on 1000 test audio clips from the complex-environment audio database after target extraction, and count the note-recognition accuracy.
(3) Use the real-time monophonic note recognition system of the multi-scale convolutional neural network to run note-recognition tests on 1000 test audio clips from the constructed complex-environment audio database without target extraction, and count the note-recognition accuracy.
(4) Use the real-time monophonic note recognition system of the multi-scale convolutional neural network to run note-recognition tests on 1000 test audio clips from the complex-environment audio database after target extraction, and count the note-recognition accuracy.
Once the statistics are complete, they show that the audio-extraction-based note-recognition algorithm proposed by the invention greatly improves single-note recognition accuracy in various noise, background-noise, and interfering-source environments; compared with the traditional note-recognition algorithm, which performs poorly under these conditions, the proposed algorithm performs consistently well.
The speech-enhancement-based deep neural network note-recognition method for complex environments thus resolves the problems that existing note-recognition algorithms are sensitive to noisy environments, demand high audio quality, and suit only a single scenario, and it achieves real-time note recognition in complex audio environments.
Advantages of the invention
The method builds its model on a deep-learning convolutional neural network and on time-domain audio extraction and analysis. First, a complex acoustic-environment data set is built, and the audio signals to be recognized are enhanced in the preprocessing stage at the front end of note recognition. A convolution module extracts features directly from the waveform, replacing the conventional approach of applying a short-time Fourier transform (STFT) and operating on the spectrogram in the frequency domain; this preserves audio phase information while increasing processing speed. Extracting the target audio signal at the recognition front end keeps accuracy relatively stable across noise types and levels, such as white Gaussian noise, background noise, or interfering sound sources, where conventional algorithms perform poorly in practice. The audio-extraction stage also makes the monophonic note recognition system robust to interference from complex environmental factors. After the target audio signal is extracted from the complex sound environment, the multi-scale monophonic note recognition model is trained with the labels of the audio data set to obtain a model weight file, achieving accurate monophonic note recognition in complex environments. This resolves the problems that existing note-recognition algorithms are sensitive to noise, demand high audio quality, and suit only a single scenario.

Claims (7)

1. A single-note real-time recognition algorithm based on a multi-scale convolutional neural network in a complex environment, characterized by comprising the following steps:
step one, selecting a clean audio data set of the instrument to be recognized, mixing the clean audio with a noise data set to build the required data set, and designing an encoder module that converts short segments of the mixed signal into corresponding representations in a feature space;
step two, estimating a mask for each source with an audio extraction module;
step three, the decoder module reconstructing the source waveform from the features extracted by the encoder, using the obtained mask coefficients;
step four, building a single-note recognition module and designing a multi-scale audio encoder that extracts features from the clean audio signal with three convolution modules of different sizes;
step five, training a convolutional neural network for note recognition;
and step six, feeding the extracted clean audio features into the trained neural network to complete single-note recognition.
2. The single-note real-time recognition algorithm based on a multi-scale convolutional neural network in a complex environment according to claim 1, characterized in that step one builds and processes the audio data set for the complex environment; a clean piano audio data set is selected; the sampling rates of the clean piano audio and the noise audio are first adjusted to the same 44100 Hz; the clean piano audio is then aliased with noise in batches to form a noisy mixed-audio data set Y, wherein the noise audio includes white Gaussian noise, human speech, sudden horn honks, and drum sounds of other instruments; the mixture is formed as:

y(t) = x(t) + n(t)

the audio data for each environment in Y is divided into a training set, a validation set, and a test set in a 3:2 proportion;
a one-dimensional convolutional encoder is constructed to extract features of the mixed signal: the input mixture is divided into overlapping segments of length L, denoted y_k ∈ R^(1×L), where k = 1, ..., T̂ is the segment index and T̂ is the total number of input segments; a one-dimensional convolution converts y_k into an N-dimensional representation w ∈ R^(1×N), written as a matrix multiplication:

w = H(y_k U)

where U ∈ R^(N×L) contains N vectors (the basis functions of the encoder), each of length L, and H(·) is the ReLU function.
3. The single-note real-time recognition algorithm based on a multi-scale convolutional neural network in a complex environment according to claim 1, characterized in that a clean-audio extraction module is designed in step two; a fully convolutional separation module consisting of stacked one-dimensional dilated convolution blocks is used; temporal convolutional networks (TCNs) replace recurrent neural networks (RNNs) in various sequence-modeling tasks; each layer in the TCN consists of one-dimensional convolution blocks with gradually increasing dilation factors, and the dilation factor grows exponentially; a group of M convolution blocks with dilation factors 1, 2, 4, ..., 2^(M-1) is repeated R times; the input of each block is zero-padded; the TCN output is fed to a convolution block with kernel size 1 to estimate the mask; the 1×1 convolution block, together with a nonlinear activation function, estimates C mask vectors for the C target sources; a residual path and a skip path are applied: the residual path of one block serves as the input of the next block, and the sum of the skip paths of all blocks serves as the output of the temporal convolutional network (TCN); extraction-mask estimation achieves separation of each frame by estimating C vectors (masks) m_i ∈ R^(1×N), where C is the number of noise sources in the mixed signal and m_i ∈ [0, 1]; applying m_i to the mixed representation w yields the corresponding source representation:

d_i = w ⊙ m_i (9)

where ⊙ denotes element-wise multiplication; the estimated target audio waveform signal x̂_i is reconstructed by the decoder:

x̂_i = d_i V (10)
4. the algorithm for real-time single note identification based on multi-scale convolutional neural network under complex environment as claimed in claim 1, wherein the decoder reconstructs the waveform from the representation form by using one-dimensional transpose convolution operation in step three, which can be expressed by matrix multiplication as:
Figure FDA0002856615410000023
wherein
Figure FDA0002856615410000024
Is reconstructed x, V ∈ RN×LThe rows of (a) are the basis functions of the decoder, each of length L; adding the overlapping reconstructed segments together to generate a final waveform; training targets the goal of training an end-to-end system is to maximize the scale-invariant signal-to-noise ratio (SI-SNR), which is defined as:
Figure FDA0002856615410000025
the loss function is as follows:
Figure FDA0002856615410000026
5. the algorithm for identifying the monophonic note in real time based on the multi-scale convolutional neural network under the complex environment according to claim 1, wherein a monophonic note identification model is built in the fourth step; extracting time domain characteristics by adopting a multi-scale convolutional neural network encoder, wherein a convolutional neural network single-tone note recognition model can detect tone information of a newly played note in an input frame; the input pure piano audio signal x (t) can be encoded into an embedding coefficient by a convolutional neural network, and a pure target audio signal is encoded into multi-scale audio embedding by a plurality of parallel one-dimensional convolutional neural networks, wherein each convolutional neural network CNN module has different time resolution; the invention only aims at selecting a piano data set to carry out research on three different time scales, because the piano 88 key frequency is distributed between 27.5HZ and 4186.0HZ, the filter lengths of the parallel one-dimensional Convolutional Neural Networks (CNN) divided into three frequency bands of low frequency (27.5HZ to 123.47HZ), intermediate frequency (130.81HZ to 739.99HZ) and high frequency (783.99HZ to 4186.00HZ) are different, and the filter lengths of the parallel one-dimensional Convolutional Neural Networks (CNN) divided into L1(short, L)2(middle) and L3(long) samples to cover different window sizes; the convolutional neural network is followed by a corrective linear unit activation function to produce a note tone embedded E ═ E1 E2 E3](ii) a To connect embeddings at different time scales by keeping the same pace at different scales
Figure FDA0002856615410000027
To align them; the time domain signal is coded into three time resolutions in embedding E; embedding coefficient E for each scaleiIs defined as:
Ei,k=ReLU(xi,kUi) (14);
K=2(T-L1)/L1+1 (15);
each time
Figure FDA0002856615410000031
L of sample shiftiA sample window; will [ E ]1 E2 E3]Is concatenated as a feature extracted from the piano audio signal.
6. The single-note real-time recognition algorithm based on a multi-scale convolutional neural network in a complex environment according to claim 1, characterized in that a convolutional neural network model for monophonic note recognition is constructed in step five, with 8 convolutional layers, 4 max-pooling layers, and 3 fully connected layers; normalization is added after each convolutional layer; a dropout layer is used; every layer except the output layer uses a rectified linear activation function; the final output layer contains 88 output units with a sigmoid activation function; the loss function is the cross-entropy loss, optimized with the Adam algorithm; the training batch size is set to 16 and the number of iterations (epochs) to 50, and a weight file is saved after every 500 training clips; the cross-entropy loss function is as follows:

CE = −Σ_i y_i log(ŷ_i) (16)

the note-recognition convolutional neural network model is trained according to the steps above until the loss function of the network model converges, completing the training of the note-recognition model; the weight file and configuration files of the note-recognition model are saved.
7. The single-note real-time recognition algorithm based on a multi-scale convolutional neural network in a complex environment according to claim 1, characterized in that in step six the trained complex-environment real-time note extraction and recognition system is used to recognize the music signals of the test set, the note-recognition accuracy is counted, and performance is compared against note recognition of audio recorded in an actual complex environment, implemented specifically as follows:
firstly, using the traditional note-recognition system to run note-recognition tests on 1000 test audio clips from the constructed complex-environment audio database without target extraction, and counting the note-recognition accuracy;
secondly, using the traditional note-recognition system to run note-recognition tests on 1000 test audio clips from the complex-environment audio database after target extraction, and counting the note-recognition accuracy;
thirdly, using the real-time monophonic note recognition system of the multi-scale convolutional neural network to run note-recognition tests on 1000 test audio clips from the constructed complex-environment audio database without target extraction, and counting the note-recognition accuracy;
and fourthly, using the real-time monophonic note recognition system of the multi-scale convolutional neural network to run note-recognition tests on 1000 test audio clips from the complex-environment audio database after target extraction, and counting the note-recognition accuracy.
CN202011549426.4A 2020-12-24 2020-12-24 Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment Pending CN112633175A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011549426.4A CN112633175A (en) 2020-12-24 2020-12-24 Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011549426.4A CN112633175A (en) 2020-12-24 2020-12-24 Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment

Publications (1)

Publication Number Publication Date
CN112633175A true CN112633175A (en) 2021-04-09

Family

ID=75324214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011549426.4A Pending CN112633175A (en) 2020-12-24 2020-12-24 Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment

Country Status (1)

Country Link
CN (1) CN112633175A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113314140A (en) * 2021-05-31 2021-08-27 哈尔滨理工大学 Sound source separation algorithm of end-to-end time domain multi-scale convolutional neural network
CN113593598A (en) * 2021-08-09 2021-11-02 深圳远虑科技有限公司 Noise reduction method and device of audio amplifier in standby state and electronic equipment
CN114067820A (en) * 2022-01-18 2022-02-18 深圳市友杰智新科技有限公司 Training method of voice noise reduction model, voice noise reduction method and related equipment
CN114822593A (en) * 2022-06-29 2022-07-29 新缪斯(深圳)音乐科技产业发展有限公司 Performance data identification method and system
WO2022218134A1 (en) * 2021-04-16 2022-10-20 深圳市优必选科技股份有限公司 Multi-channel speech detection system and method
CN117153197A (en) * 2023-10-27 2023-12-01 云南师范大学 Speech emotion recognition method, apparatus, and computer-readable storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170330586A1 (en) * 2016-05-10 2017-11-16 Google Inc. Frequency based audio analysis using neural networks
US10068557B1 (en) * 2017-08-23 2018-09-04 Google Llc Generating music with deep neural networks
CN109584904A (en) * 2018-12-24 2019-04-05 厦门大学 The sightsinging audio roll call for singing education applied to root LeEco identifies modeling method
CN109935243A (en) * 2019-02-25 2019-06-25 重庆大学 Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model
CN110211574A (en) * 2019-06-03 2019-09-06 哈尔滨工业大学 Speech recognition modeling method for building up based on bottleneck characteristic and multiple dimensioned bull attention mechanism
CN110379401A (en) * 2019-08-12 2019-10-25 黑盒子科技(北京)有限公司 A kind of music is virtually chorused system and method
CN110580458A (en) * 2019-08-25 2019-12-17 天津大学 music score image recognition method combining multi-scale residual error type CNN and SRU
CN110782878A (en) * 2019-10-10 2020-02-11 天津大学 Attention mechanism-based multi-scale audio scene recognition method
CN111008595A (en) * 2019-12-05 2020-04-14 武汉大学 Private car interior rear row baby/pet groveling window distinguishing and car interior atmosphere identifying method
CN111986661A (en) * 2020-08-28 2020-11-24 西安电子科技大学 Deep neural network speech recognition method based on speech enhancement in complex environment

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170330586A1 (en) * 2016-05-10 2017-11-16 Google Inc. Frequency based audio analysis using neural networks
US10068557B1 (en) * 2017-08-23 2018-09-04 Google Llc Generating music with deep neural networks
CN109584904A (en) * 2018-12-24 2019-04-05 厦门大学 The sightsinging audio roll call for singing education applied to root LeEco identifies modeling method
CN109935243A (en) * 2019-02-25 2019-06-25 重庆大学 Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model
CN110211574A (en) * 2019-06-03 2019-09-06 哈尔滨工业大学 Speech recognition modeling method for building up based on bottleneck characteristic and multiple dimensioned bull attention mechanism
CN110379401A (en) * 2019-08-12 2019-10-25 黑盒子科技(北京)有限公司 A kind of music is virtually chorused system and method
CN110580458A (en) * 2019-08-25 2019-12-17 天津大学 music score image recognition method combining multi-scale residual error type CNN and SRU
CN110782878A (en) * 2019-10-10 2020-02-11 天津大学 Attention mechanism-based multi-scale audio scene recognition method
CN111008595A (en) * 2019-12-05 2020-04-14 武汉大学 Private car interior rear row baby/pet groveling window distinguishing and car interior atmosphere identifying method
CN111986661A (en) * 2020-08-28 2020-11-24 西安电子科技大学 Deep neural network speech recognition method based on speech enhancement in complex environment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
吴琼 et al., "Optical music score recognition method based on a multi-scale residual convolutional neural network and bidirectional simple recurrent units", Laser & Optoelectronics Progress *
张开玉 et al., "MSER-based fast localization algorithm for slanted text in natural scenes", Journal of Harbin University of Science and Technology *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022218134A1 (en) * 2021-04-16 2022-10-20 深圳市优必选科技股份有限公司 Multi-channel speech detection system and method
CN113314140A (en) * 2021-05-31 2021-08-27 哈尔滨理工大学 Sound source separation algorithm of end-to-end time domain multi-scale convolutional neural network
CN113593598A (en) * 2021-08-09 2021-11-02 深圳远虑科技有限公司 Noise reduction method and device of audio amplifier in standby state and electronic equipment
CN113593598B (en) * 2021-08-09 2024-04-12 深圳远虑科技有限公司 Noise reduction method and device for audio amplifier in standby state and electronic equipment
CN114067820A (en) * 2022-01-18 2022-02-18 深圳市友杰智新科技有限公司 Training method of voice noise reduction model, voice noise reduction method and related equipment
CN114067820B (en) * 2022-01-18 2022-06-28 深圳市友杰智新科技有限公司 Training method of voice noise reduction model, voice noise reduction method and related equipment
CN114822593A (en) * 2022-06-29 2022-07-29 新缪斯(深圳)音乐科技产业发展有限公司 Performance data identification method and system
CN117153197A (en) * 2023-10-27 2023-12-01 云南师范大学 Speech emotion recognition method, apparatus, and computer-readable storage medium
CN117153197B (en) * 2023-10-27 2024-01-02 云南师范大学 Speech emotion recognition method, apparatus, and computer-readable storage medium

Similar Documents

Publication Publication Date Title
Guzhov et al. Esresnet: Environmental sound classification based on visual domain models
CN112633175A (en) Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment
Kavalerov et al. Universal sound separation
CN113314140A (en) Sound source separation algorithm of end-to-end time domain multi-scale convolutional neural network
CN101599271B (en) Recognition method of digital music emotion
CN110600047A (en) Perceptual STARGAN-based many-to-many speaker conversion method
Dua et al. An improved RNN-LSTM based novel approach for sheet music generation
CN111369982A (en) Training method of audio classification model, audio classification method, device and equipment
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
CN110767210A (en) Method and device for generating personalized voice
CN115762536A (en) Small sample optimization bird sound recognition method based on bridge transform
CN114023300A (en) Chinese speech synthesis method based on diffusion probability model
Li et al. Sams-net: A sliced attention-based neural network for music source separation
KR102272554B1 (en) Method and system of text to multiple speech
Sunny et al. Recognition of speech signals: an experimental comparison of linear predictive coding and discrete wavelet transforms
CN114495969A (en) Voice recognition method integrating voice enhancement
CN113257279A (en) GTCN-based real-time voice emotion recognition method and application device
CN116229932A (en) Voice cloning method and system based on cross-domain consistency loss
KR20190135853A (en) Method and system of text to multiple speech
CN113423005B (en) Intelligent music generation method and system based on improved neural network
CN111488486B (en) Electronic music classification method and system based on multi-sound-source separation
Sunny et al. Feature extraction methods based on linear predictive coding and wavelet packet decomposition for recognizing spoken words in malayalam
CN115359775A (en) End-to-end tone and emotion migration Chinese voice cloning method
CN111259188B (en) Lyric alignment method and system based on seq2seq network
CN115240702A (en) Voice separation method based on voiceprint characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20210409)