CN112633175A - Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment - Google Patents

Info

Publication number
CN112633175A
CN112633175A
Authority
CN
China
Prior art keywords
audio
note
neural network
convolutional neural
identification
Prior art date
Legal status
Pending
Application number
CN202011549426.4A
Other languages
Chinese (zh)
Inventor
卢迪
邢湘琦
Current Assignee
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Harbin University of Science and Technology filed Critical Harbin University of Science and Technology
Priority to CN202011549426.4A priority Critical patent/CN112633175A/en
Publication of CN112633175A publication Critical patent/CN112633175A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2218/00 - Aspects of pattern recognition specially adapted for signal processing
    • G06F 2218/08 - Feature extraction
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V 10/449 - Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 - Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 - Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Signal Processing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The invention discloses a real-time single-note recognition algorithm with an audio-extraction front end. The main steps are: first, select a clean audio data set for the instrument to be recognized; add various noises to the clean audio to simulate real recording conditions; extract features from the input waveform with an encoder; model the extracted feature vectors with several stacked temporal convolutional networks; finally, recover the clean audio waveform with a decoder; and build a multi-scale convolutional neural network for note recognition. The invention provides a time-domain audio processing method, which overcomes the drawback of traditional frequency-domain processing that phase information is ignored. The target audio to be recognized is extracted from the complex environment before note recognition, and features are extracted with a multi-scale convolutional neural network, improving note-recognition performance. This addresses the sensitivity of existing note-recognition algorithms to noise; the model is small, and note-recognition speed and accuracy are greatly improved.

Description

Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment
Technical Field
The invention belongs to the field of audio signal processing and relates to a real-time, multi-scale convolutional neural network algorithm for monophonic note recognition with a noise-reduction function.
Background
With economic development, people pay increasing attention to their spiritual life, and music has become an important part of daily entertainment. The number of instrument learners keeps growing and includes many amateurs and beginners. Traditional instrument study requires guidance from a professional teacher, which is costly, and time and space constraints make it impossible for a teacher to supervise every practice session. Learners therefore cannot check their playing level in real time after each performance. A note-recognition algorithm can help a player judge playing accuracy, and note recognition can also reduce the workload of musicians processing music.
At present, recognition of clean notes in relatively quiet environments is increasingly mature, but existing algorithms apply only to a narrow range of clean, ideal audio: they place strict requirements on music-signal quality and have poor robustness. In real environments, the notes to be recognized are polluted by ambient noise, or note signals from several instruments are mixed together, which interferes heavily with feature extraction. Feature extraction is a key step in audio processing: its essence is to extract, from the rich information in the audio signal, parameters that express the essential character of the note signal. If feature extraction goes badly wrong, note-recognition accuracy drops sharply and the recognition task fails. Denoising the music signal and extracting the target audio to be recognized is therefore an indispensable part of the whole system.
Traditional unsupervised noise-reduction methods such as spectral subtraction, Kalman filtering, and Wiener filtering rely on particular assumptions and prior information when denoising audio in complex environments: for example, that the noise is stationary, that the clean audio and the noise are uncorrelated, or that the Fourier coefficients of the note audio and the noise are mutually independent in the time-frequency domain. Although such methods are simple to implement, they are very demanding in their noise estimates, the parameters of a traditional extraction system need hand tuning, and system robustness is poor. To improve robustness, the invention extracts the audio with a neural network, so the target audio can be extracted effectively in real time under complex conditions with multiple noises or multiple instruments.
Existing audio-processing algorithms depend almost entirely on spectrogram representations, converting the waveform to the frequency domain for feature extraction. However, applying the short-time Fourier transform (STFT) to the audio waveform has several limitations. First, because the model does not estimate the source phase, the phase is usually assumed equal to the mixture phase; a wrong phase estimate degrades the exact reconstruction of the clean source and lowers the upper bound on reconstruction quality. Second, successfully separating a source from a time-frequency representation requires a high-resolution frequency decomposition of the mixture, which demands a long STFT window. This raises the minimum latency of the system and limits its applicability in real-time, low-delay settings.
In summary, the technical problem addressed by the present invention is how to extract the audio signal of the instrument to be recognized accurately and quickly when the signal contains many types of noise. The model generalizes well and is robust, so the system maintains relatively stable accuracy across noise types and noise levels. At the same time, the proposed algorithm is not limited to the notes of one specific instrument but applies generally to instrument note recognition, and it improves note-recognition accuracy.
Disclosure of Invention
The invention provides a single-note real-time recognition algorithm based on a multi-scale convolutional neural network in a complex environment.
The technical solution is based on multi-scale neural networks and end-to-end audio extraction and note recognition. To achieve this purpose, the invention adopts the following steps:
Step 1. Mix the clean audio with the noise data set to build the required data set, and design an encoder module that converts short segments of the mixed signal into corresponding representations in a feature space.
Step 2. Use a clean-audio extraction module to estimate a multiplicative function (mask) for each source.
Step 3. The decoder module reconstructs the source waveform from the features extracted by the encoder, using the obtained mask coefficients.
Step 4. Build the single-note recognition module: design a multi-scale audio encoder that extracts features from the clean audio signal with three convolution modules of different sizes.
Step 5. Train a convolutional neural network for note recognition.
Step 6. Feed the clean audio features extracted in step 4 into the trained neural network to complete note recognition.
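For orientation, the following is a minimal sketch of the encoder-masker-decoder pipeline of steps 1-3, operating directly on waveforms. The layout follows the description; the sizes N and L, the sigmoid mask activation, and the single-source case C=1 are illustrative assumptions, and the simple 1x1 mask estimator stands in for the stacked dilated-convolution separation module detailed later.

```python
import torch
import torch.nn as nn

class EncMaskDec(nn.Module):
    """Encoder -> mask estimator -> decoder, operating directly on waveforms."""
    def __init__(self, N=512, L=16, C=1):
        super().__init__()
        self.N, self.L, self.C = N, L, C
        # Encoder: 1-D conv with 50% overlap (stride L//2); ReLU keeps w non-negative
        self.encoder = nn.Conv1d(1, N, kernel_size=L, stride=L // 2, bias=False)
        # Mask estimator: stand-in for the stacked dilated-TCN separation module
        self.masker = nn.Sequential(nn.Conv1d(N, N * C, kernel_size=1), nn.Sigmoid())
        # Decoder: transposed 1-D conv sums overlapping reconstructed segments
        self.decoder = nn.ConvTranspose1d(N, 1, kernel_size=L, stride=L // 2, bias=False)

    def forward(self, mix):                       # mix: (batch, 1, samples)
        w = torch.relu(self.encoder(mix))         # (batch, N, frames)
        m = self.masker(w).view(mix.size(0), self.C, self.N, -1)
        sources = [self.decoder(w * m[:, i]) for i in range(self.C)]
        return torch.stack(sources, dim=1)        # (batch, C, 1, samples)
```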
Compared with the prior art, the invention has the following notable advantages. The method takes multi-scale neural networks and end-to-end audio extraction and note recognition as its technical basis. First, a convolution module extracts features directly from the waveform, replacing the conventional approach of applying a short-time Fourier transform (STFT) and operating on the spectrogram in the frequency domain; this preserves audio phase information while increasing processing speed. The target audio signal is extracted at the front end of recognition, so the system maintains relatively stable accuracy across noise types and levels, such as white Gaussian noise, background noise, or interfering sound sources, where traditional algorithms perform poorly in practice. The audio-extraction stage also makes the monophonic note recognition system robust to interference from complex environmental factors. Moreover, the proposed algorithm is not limited to the notes of one specific instrument but generalizes across instruments. A multi-scale convolutional encoding and decoding module is adopted in the note-recognition module, greatly improving recognition accuracy. The model has strong generalization ability and precision, processes audio directly in the time domain, and runs in real time.
Drawings
To illustrate the technical solution of the present invention more clearly, the drawings used in its description are briefly introduced below;
FIG. 1 is a general flow chart of a single note real-time identification algorithm based on a multi-scale convolutional neural network under a complex environment;
FIG. 2 is a detailed flowchart of the algorithm;
FIG. 3 is a block diagram of a pure target audio extraction process under a complex environment according to the present invention;
FIG. 4 is a block diagram of an overall network architecture for a time-series convolutional neural network (TCN) for audio extraction according to the present invention;
FIG. 5 is a block diagram of a design framework for each one-dimensional convolution block of the audio extraction time-series convolutional neural network (TCN) of the present invention;
FIG. 6 is a block diagram of the monophonic note recognition convolutional neural network model of the present invention.
Detailed Description
The single-note real-time recognition method based on a multi-scale convolutional neural network in a complex environment builds its model on a deep-learning convolutional neural network and on target-audio extraction in complex environments. The overall flow is shown in FIG. 1, and the detailed flow of the audio extraction and recognition scheme is shown in FIG. 2.
First, a clean audio data set of the instrument to be recognized is selected; because the model generalizes well and is not limited to the notes of one specified instrument, this data set can be replaced with any target data set. The selected piano audio (MAPS) data set contains piano audio files, associated MIDI files, and labeled txt files. The data set is divided into 9 directories according to piano type and recording conditions, each containing isolated notes, chords, and complete piano pieces. Each directory holds 30 complete piano pieces, 270 in total, with a total duration of about 18 hours. Audio extraction is applied to the complex audio to be recognized in the preprocessing stage at the front end of note recognition: a noisy piano audio signal is input and features are extracted by a one-dimensional convolutional encoder; the extracted feature vectors are modeled by several stacked temporal convolutional networks (TCNs); finally, the result is fed to a one-dimensional convolutional decoder to obtain the clean piano waveform. Meanwhile, a multi-scale convolutional neural network note-recognition model is built and trained on the clean piano data set (MAPS) to obtain a model weight file, enabling accurate recognition of piano notes in complex environments.
This solves the problems that existing note-recognition algorithms are sensitive to noise, demand high audio quality, and suit only a single scenario. The real-time monophonic note recognition method based on a multi-scale convolutional neural network in a complex environment comprises the following steps:
step one, establishing and processing an audio data set in a complex environment. In this section the pure piano audio (MAPS) data set selected for use with the present invention. The pure piano audio and noise audio sampling rates are first adjusted to the same 44100 HZ. And then carrying out batch aliasing on the pure piano audio and the noise to form a noise-containing mixed audio data set Y, wherein the noise audio data comprises white Gaussian noise, human speaking noise, sudden whistling noise and other musical instrument drum sounds. The formula for mixing the audio is:
Figure BDA0002856615420000041
and dividing the audio data under each environment in the audio data set Y into a training set, a verification set and a test set. The distribution proportion is that the number of the training set audio pieces, the number of the verification set audio pieces and the number of the test set audio pieces are 3: 2.
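A sketch of the mixing step, assuming the additive model y(t) = x(t) + n(t) above. The per-example SNR control and the noise tiling are added assumptions, since the text only states that clean piano audio and noise are aliased in batches at 44100 Hz:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db, rng=None):
    """Additively mix a clean clip with noise scaled to the given SNR (both float arrays)."""
    rng = rng or np.random.default_rng()
    # Tile the noise if it is shorter than the clean clip, then crop at a random offset
    reps = int(np.ceil(len(clean) / len(noise))) + 1
    tiled = np.tile(noise, reps)
    start = rng.integers(0, len(tiled) - len(clean))
    n = tiled[start:start + len(clean)]
    # Scale noise so that 10*log10(P_clean / P_noise) == snr_db
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(n ** 2) + 1e-12
    n = n * np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + n
```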
Step two: construct a one-dimensional convolutional encoder to extract features of the mixed signal. The input mixture is divided into overlapping segments of length L, denoted y_k ∈ R^(1×L), where k = 1, ..., T̂ is the segment index and T̂ is the total number of input segments. A one-dimensional convolution converts y_k into an N-dimensional representation w ∈ R^(1×N); written as a matrix multiplication (dropping the index k from here on):

w = H(y_k U)

where U ∈ R^(N×L) contains N vectors (the basis functions of the encoder), each of length L, and H(·) is the ReLU function, which guarantees a non-negative representation.
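The encoder equation w = H(y_k U) can be written directly as a matrix product over the overlapping segments. In this sketch U is stored as an L×N matrix (the transpose of the U ∈ R^(N×L) in the text), and the 50% segment overlap is an assumption carried over from the experimental configuration below:

```python
import torch

def encode_segments(y, U, L):
    """Split y into 50%-overlapping length-L segments and apply w_k = ReLU(y_k U)."""
    # y: 1-D waveform tensor with at least L samples; U: (L, N) encoder basis matrix
    segments = y.unfold(0, L, L // 2)   # (T_hat, L): T_hat overlapping segments
    return torch.relu(segments @ U)     # (T_hat, N): one embedding row per segment
```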
Step three: design the clean-audio convolutional extraction module; the flow diagram of clean target-audio extraction in a complex environment is shown in FIG. 3.
the present invention uses a full convolution separation module consisting of stacked one-dimensional expansion convolution blocks, as shown in fig. 4. Time-series convolutional neural networks (TCNs) are used in place of Recurrent Neural Networks (RNNs) in various sequence modeling tasks. Each layer in the TCN consists of one-dimensional convolution blocks with gradually increasing expansion factors. The expansion factor increases exponentially to ensure that a sufficiently large time window can be included. Wherein M expansion factors are 1, 2, 4M-1The convolution block of (2) is repeated R times. The inputs of each block are zero padded to ensure that the output length is the same as the input. The output of TCN will be fed to a convolutional block of kernel size 1 to estimate the mask. The 1 x 1 convolutional block, together with the nonlinear activation function, estimates C mask vectors for the C target sources.
Fig. 5 shows the design of each one-dimensional convolution block. Both a residual path and a skip path are applied: the residual path of one block serves as the input to the next block, and the sum of the skip paths of all blocks serves as the output of the temporal convolutional network (TCN).
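A sketch of one dilated one-dimensional convolution block with the residual and skip paths just described. The channel counts B, H, Sc and kernel size P follow the notation used below; their values here are illustrative, and GroupNorm with a single group is used because it normalizes over the channel and time dimensions like the gLN defined later:

```python
import torch.nn as nn

class Conv1DBlock(nn.Module):
    """One TCN block: 1x1 conv -> dilated depthwise conv -> residual and skip outputs."""
    def __init__(self, B=128, H=512, P=3, dilation=1, Sc=128):
        super().__init__()
        pad = (P - 1) * dilation // 2              # zero padding keeps output length == input
        self.in_conv = nn.Conv1d(B, H, kernel_size=1)
        self.prelu1 = nn.PReLU()
        self.norm1 = nn.GroupNorm(1, H, eps=1e-8)  # one group: normalize over channels and time
        self.d_conv = nn.Conv1d(H, H, kernel_size=P, dilation=dilation,
                                padding=pad, groups=H)  # depthwise convolution
        self.prelu2 = nn.PReLU()
        self.norm2 = nn.GroupNorm(1, H, eps=1e-8)
        self.res_conv = nn.Conv1d(H, B, kernel_size=1)   # residual path back to B channels
        self.skip_conv = nn.Conv1d(H, Sc, kernel_size=1) # skip path; Sc may differ from B

    def forward(self, x):                          # x: (batch, B, frames)
        h = self.norm1(self.prelu1(self.in_conv(x)))
        h = self.norm2(self.prelu2(self.d_conv(h)))
        return x + self.res_conv(h), self.skip_conv(h)   # (residual out, skip out)
```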
To further reduce the number of parameters, a depthwise separable convolution (S_conv) replaces the standard convolution in each block. The depthwise separable convolution operator decouples the standard convolution into two successive operations, a depthwise convolution (D_conv(·)) followed by a pointwise convolution with kernel size 1×1:

D_conv(Z, K) = concat(z_j ⊛ k_j), j = 1, ..., N
S_conv(Z, K, L) = D_conv(Z, K) ⊛ L

where Z is the input of S_conv(·), K is a convolution kernel of size P, z_j and k_j are the rows of the matrices Z and K respectively, L is a convolution kernel of size 1, and ⊛ denotes the convolution operation.
The first 1×1 conv(·) and D_conv(·) blocks are each followed by a nonlinear activation function and normalization. The nonlinear activation function is the parametric rectified linear unit (PReLU):

PReLU(x) = max(0, x) + α · min(0, x)

where α is a learnable parameter. Normalization in the network uses global layer normalization (gLN), which normalizes the feature over both the channel and time dimensions:

gLN(F) = ((F − E[F]) / sqrt(Var[F] + ε)) ⊙ γ + β
E[F] = (1/NT) Σ_{N,T} F
Var[F] = (1/NT) Σ_{N,T} (F − E[F])²

where F is the feature, γ and β are trainable parameters, and ε is a small constant for numerical stability.
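A sketch of global layer normalization following the definition above; gamma and beta are the trainable per-channel parameters, and eps plays the role of the small constant ε:

```python
import torch
import torch.nn as nn

class GlobalLayerNorm(nn.Module):
    """gLN: normalize a feature over both the channel (N) and time (T) dimensions."""
    def __init__(self, channels, eps=1e-8):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1, channels, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1))
        self.eps = eps

    def forward(self, F):                          # F: (batch, channels, time)
        mean = F.mean(dim=(1, 2), keepdim=True)    # E[F] over channel and time
        var = F.var(dim=(1, 2), keepdim=True, unbiased=False)
        return self.gamma * (F - mean) / torch.sqrt(var + self.eps) + self.beta
```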
A linear one-dimensional convolution block is added at the beginning of the extraction module as a bottleneck layer. This block determines the number of channels of the input and residual paths of the subsequent convolution blocks. The linear bottleneck layer has B channels; for a one-dimensional convolution block with H channels and kernel size P, the kernel sizes of the first 1×1 convolution block and the first depthwise convolution (D_conv) block are B×H×1 and H×P respectively, and the kernel size in the residual path is H×B×1. The number of output channels of the skip-connection path may differ from B; the kernel size in this path is denoted L_Sc.
Step four: estimate the extraction masks. Separation of each frame is achieved by estimating C vectors (masks) m_i ∈ R^(1×N), where C is the number of noise sources in the mixed signal and m_i ∈ [0, 1]. Applying m_i to the mixed representation w yields the corresponding source representation:

d_i = w ⊙ m_i

where ⊙ denotes element-wise multiplication. The estimated target audio waveform signal x̂_i is then reconstructed by the decoder, as described in step five.
Step five: the decoder reconstructs the waveform from this representation using a one-dimensional transposed convolution, which can be written as a matrix multiplication:

x̂ = d V

where x̂ is the reconstruction of x and the rows of V ∈ R^(N×L) are the basis functions of the decoder, each of length L. The overlapping reconstructed segments are summed to generate the final waveform. The detailed audio-extraction flow is shown in FIG. 2.
Experimental configuration: the network was trained for 150 epochs on 5-second segments. The initial learning rate was set to 1e-3; if validation accuracy did not improve for 3 consecutive epochs, the learning rate was halved. The optimizer was Adam. The convolutional autoencoder used a stride of 50% of the window (i.e., 50% overlap between consecutive frames). A maximum L2-norm gradient clipping of 5 was applied during training.
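These settings map onto standard PyTorch utilities as in the sketch below; `model`, `train_loader`, `val_loader`, and `validate` are assumed to exist, `si_snr` is the helper sketched under the training objective below, and using the negative SI-SNR as the loss is an assumption consistent with maximizing SI-SNR:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=3)  # halve LR after 3 stalled epochs

for epoch in range(150):
    for mix, clean in train_loader:                 # 5-second segments
        loss = -si_snr(model(mix), clean)           # maximize SI-SNR
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5)  # L2 clipping at 5
        optimizer.step()
    scheduler.step(validate(model, val_loader))     # validation accuracy drives the schedule
```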
Training objective: the goal of training the end-to-end system is to maximize the scale-invariant signal-to-noise ratio (SI-SNR), defined as:

s_target = (⟨x̂, x⟩ x) / ‖x‖²
e_noise = x̂ − s_target
SI-SNR = 10 log₁₀ (‖s_target‖² / ‖e_noise‖²)

where x̂ and x are the estimated and target clean sources. The cross-entropy loss function is as follows:

CE = −Σ_i y_i log(ŷ_i)
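A sketch of the SI-SNR computation following the definition above (zero-meaning the signals first is a common convention and an added assumption here); during training the loss is its negative:

```python
import torch

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB; est and ref are 1-D waveforms."""
    est = est - est.mean()
    ref = ref - ref.mean()
    s_target = (torch.dot(est, ref) / (torch.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10 * torch.log10(s_target.pow(2).sum() / (e_noise.pow(2).sum() + eps))
```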
and after the extraction processing of one target audio frequency in the note signal training set is finished, the training set, the verification set and the test set are sequentially processed according to the steps.
Step six: build the single-note recognition model. The algorithm uses a multi-scale convolutional neural network encoder to extract time-domain features and a note-level method based on a convolutional neural network (CNN) to transcribe the monophonic music signal. The CNN model is well suited to detecting spatial structure, and compared with a DNN, the CNN extracts features with shared parameters, which reduces model size, helps prevent overfitting, and improves generalization.
The audio recognition model detects the pitch of the newly played note in the input frame; the output layer of the neural network contains 88 output units, corresponding to the 88 keys of the piano selected in the present invention.
The audio feature-extraction part uses a multi-scale audio encoder, mainly to avoid the phase-estimation problem of frequency-domain methods: a time-domain method is chosen in which a convolutional network converts the time-domain mixture directly into a feature representation. In frequency-domain methods, the audio signal is decomposed by the Fourier transform into an alternative representation characterized by sines and cosines. In the time-domain approach, the filters of the convolutional layer can analogously be treated as basis functions, which corresponds to treating the sine and cosine representations of the frequency domain as embedding coefficients. Time-domain encoding differs from the Fourier transform in two ways: (a) the feature representation does not process real and imaginary parts separately; (b) the basis functions are not predefined as sines or cosines but can be learned from data.
The input clean target audio signal x(t) is encoded into embedding coefficients by a convolutional neural network: several parallel one-dimensional convolutional networks encode the clean target audio into a multi-scale audio embedding, with each CNN module operating at a different time resolution. The number of scales can vary; the system is generic, and the number of time scales can be changed according to the type of audio signal to be recognized. The present invention studies only three time scales.
Three time scales are chosen because the fundamental frequencies of the piano's 88 keys are distributed over 27.5 Hz-4186.0 Hz, divided into low (27.5 Hz-123.47 Hz), middle (130.81 Hz-739.99 Hz), and high (783.99 Hz-4186.00 Hz) bands; the corresponding filters have different lengths of L1 (short), L2 (middle), and L3 (long) samples to cover different window sizes.
Each convolutional neural network (CNN) is followed by a rectified linear unit (ReLU) activation function, producing the note embedding E = [E1 E2 E3].
To concatenate embeddings across time scales, the same stride of L1/2 samples is kept at every scale so that the embeddings are aligned.
As the filter length changes, the encoder learns representations at multiple scales: short windows give good resolution in the high band, and long windows give higher resolution in the low band. The time-domain signal is encoded at three time resolutions in the embedding E. The embedding coefficients E_i at each scale are defined as:

E_i,k = ReLU(x_i,k U_i)
K = 2(T − L1)/L1 + 1

where x_i,k is the k-th window of L_i samples, shifted by L1/2 samples each time, and K is the number of windows. [E1 E2 E3] is concatenated as the extracted feature of the piano audio signal.
A convolutional neural network model for monophonic note recognition is then constructed. The structure of the CNN model is shown in FIG. 6: 8 convolutional layers, 4 max-pooling layers, and 3 fully connected layers.
Normalization is added after each convolutional layer, which accelerates model convergence, provides a regularization effect, and prevents overfitting; it also keeps the data fed to the ReLU from growing so large that network performance becomes unstable. A dropout layer is used to prevent overfitting and improve the generalization ability of the model.
Every layer except the output layer uses a rectified linear activation function. The final output layer contains 88 output units with a sigmoid activation function.
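A sketch matching the stated layout (8 convolutional layers, each followed by normalization, 4 max-pooling layers, dropout, 3 fully connected layers, and 88 sigmoid output units); the channel widths, input feature size, and frame count are illustrative assumptions (e.g., in_ch = 3×N from the multi-scale encoder):

```python
import torch.nn as nn

def conv_bn(c_in, c_out):
    # Conv -> BatchNorm -> ReLU: normalization after every conv layer, as described
    return nn.Sequential(nn.Conv1d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm1d(c_out), nn.ReLU())

class NoteCNN(nn.Module):
    def __init__(self, in_ch=384, frames=64):
        super().__init__()
        self.features = nn.Sequential(
            conv_bn(in_ch, 64), conv_bn(64, 64),   nn.MaxPool1d(2),
            conv_bn(64, 128),   conv_bn(128, 128), nn.MaxPool1d(2),
            conv_bn(128, 256),  conv_bn(256, 256), nn.MaxPool1d(2),
            conv_bn(256, 256),  conv_bn(256, 256), nn.MaxPool1d(2),
        )  # 8 conv layers, 4 max-pooling layers
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * (frames // 16), 512), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, 88), nn.Sigmoid(),      # one output unit per piano key
        )

    def forward(self, x):                          # x: (batch, in_ch, frames)
        return self.classifier(self.features(x))
```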
Experimental configuration: the loss function is the cross-entropy loss, optimized with the Adam algorithm. The training batch size is set to 16 and the number of iterations (epochs) to 50; a weight file is saved after every 500 training clips. The cross-entropy loss function is as follows:

CE = −Σ_i y_i log(ŷ_i)
and training the convolutional neural network model identified by the tone characters in sequence according to the steps until the loss of the network model is converged, and finishing training the tone character identification model. The weight file and various configuration files of the tone character recognition model are saved.
Step seven: use the trained complex-environment real-time note extraction and recognition system to recognize the music-signal audio of the test set, count the note-recognition accuracy, and compare the performance with note recognition of audio recorded in an actual complex environment.
The specific implementation is as follows:
(1) Use the traditional note-recognition system to run note-recognition tests on 1000 test audio clips from the constructed complex-environment audio database without target extraction, and count the note-recognition accuracy.
(2) Use the traditional note-recognition system to run note-recognition tests on 1000 test audio clips from the complex-environment audio database after target extraction, and count the note-recognition accuracy.
(3) Use the real-time monophonic note recognition system of the multi-scale convolutional neural network to run note-recognition tests on 1000 test audio clips from the constructed complex-environment audio database without target extraction, and count the note-recognition accuracy.
(4) Use the real-time monophonic note recognition system of the multi-scale convolutional neural network to run note-recognition tests on 1000 test audio clips from the complex-environment audio database after target extraction, and count the note-recognition accuracy.
Once the statistics are complete, they show that the audio-extraction-based note-recognition algorithm proposed by the invention greatly improves single-note recognition accuracy in various noise, background-noise, and interfering-source environments; compared with the traditional note-recognition algorithm, which performs poorly under these conditions, the proposed algorithm performs consistently well.
The speech-enhancement-based deep neural network note-recognition method for complex environments thus resolves the problems that existing note-recognition algorithms are sensitive to noisy environments, demand high audio quality, and suit only a single scenario, and it achieves real-time note recognition in complex audio environments.
Advantages of the invention
The method builds its model on a deep-learning convolutional neural network and on time-domain audio extraction and analysis. First, a complex acoustic-environment data set is built, and the audio signals to be recognized are enhanced in the preprocessing stage at the front end of note recognition. A convolution module extracts features directly from the waveform, replacing the conventional approach of applying a short-time Fourier transform (STFT) and operating on the spectrogram in the frequency domain; this preserves audio phase information while increasing processing speed. Extracting the target audio signal at the recognition front end keeps accuracy relatively stable across noise types and levels, such as white Gaussian noise, background noise, or interfering sound sources, where conventional algorithms perform poorly in practice. The audio-extraction stage also makes the monophonic note recognition system robust to interference from complex environmental factors. After the target audio signal is extracted from the complex sound environment, the multi-scale monophonic note recognition model is trained with the labels of the audio data set to obtain a model weight file, achieving accurate monophonic note recognition in complex environments. This resolves the problems that existing note-recognition algorithms are sensitive to noise, demand high audio quality, and suit only a single scenario.

Claims (7)

1. A single-note real-time recognition algorithm based on a multi-scale convolutional neural network in a complex environment, characterized by comprising the following steps:
step one, selecting a clean audio data set of the instrument to be recognized, mixing the clean audio with a noise data set to build the required data set, and designing an encoder module that converts short segments of the mixed signal into corresponding representations in a feature space;
step two, estimating a mask for each source with an audio extraction module;
step three, the decoder module reconstructing the source waveform from the features extracted by the encoder, using the obtained mask coefficients;
step four, building a single-note recognition module and designing a multi-scale audio encoder that extracts features from the clean audio signal with three convolution modules of different sizes;
step five, training a convolutional neural network for note recognition;
and step six, feeding the extracted clean audio features into the trained neural network to complete single-note recognition.
2. The single-note real-time recognition algorithm based on a multi-scale convolutional neural network in a complex environment according to claim 1, characterized in that step one builds and processes the audio data set for the complex environment; a clean piano audio data set is selected; the sampling rates of the clean piano audio and the noise audio are first adjusted to the same 44100 Hz; the clean piano audio is then aliased with noise in batches to form a noisy mixed-audio data set Y, wherein the noise audio includes white Gaussian noise, human speech, sudden horn honks, and drum sounds of other instruments; the mixture is formed as:

y(t) = x(t) + n(t)

the audio data for each environment in Y is divided into a training set, a validation set, and a test set in a 3:2 proportion;
a one-dimensional convolutional encoder is constructed to extract features of the mixed signal: the input mixture is divided into overlapping segments of length L, denoted y_k ∈ R^(1×L), where k = 1, ..., T̂ is the segment index and T̂ is the total number of input segments; a one-dimensional convolution converts y_k into an N-dimensional representation w ∈ R^(1×N), written as a matrix multiplication:

w = H(y_k U)

where U ∈ R^(N×L) contains N vectors (the basis functions of the encoder), each of length L, and H(·) is the ReLU function.
3. The single-note real-time recognition algorithm based on a multi-scale convolutional neural network in a complex environment according to claim 1, characterized in that a clean-audio extraction module is designed in step two; a fully convolutional separation module consisting of stacked one-dimensional dilated convolution blocks is used; temporal convolutional networks (TCNs) replace recurrent neural networks (RNNs) in various sequence-modeling tasks; each layer in the TCN consists of one-dimensional convolution blocks with gradually increasing dilation factors, and the dilation factor grows exponentially; a group of M convolution blocks with dilation factors 1, 2, 4, ..., 2^(M-1) is repeated R times; the input of each block is zero-padded; the TCN output is fed to a convolution block with kernel size 1 to estimate the mask; the 1×1 convolution block, together with a nonlinear activation function, estimates C mask vectors for the C target sources; a residual path and a skip path are applied: the residual path of one block serves as the input of the next block, and the sum of the skip paths of all blocks serves as the output of the temporal convolutional network (TCN); extraction-mask estimation achieves separation of each frame by estimating C vectors (masks) m_i ∈ R^(1×N), where C is the number of noise sources in the mixed signal and m_i ∈ [0, 1]; applying m_i to the mixed representation w yields the corresponding source representation:

d_i = w ⊙ m_i (9)

where ⊙ denotes element-wise multiplication; the estimated target audio waveform signal x̂_i is reconstructed by the decoder:

x̂_i = d_i V (10)
4. the algorithm for real-time single note identification based on multi-scale convolutional neural network under complex environment as claimed in claim 1, wherein the decoder reconstructs the waveform from the representation form by using one-dimensional transpose convolution operation in step three, which can be expressed by matrix multiplication as:
Figure FDA0002856615410000023
wherein
Figure FDA0002856615410000024
Is reconstructed x, V ∈ RN×LThe rows of (a) are the basis functions of the decoder, each of length L; adding the overlapping reconstructed segments together to generate a final waveform; training targets the goal of training an end-to-end system is to maximize the scale-invariant signal-to-noise ratio (SI-SNR), which is defined as:
Figure FDA0002856615410000025
the loss function is as follows:
Figure FDA0002856615410000026
5. the algorithm for identifying the monophonic note in real time based on the multi-scale convolutional neural network under the complex environment according to claim 1, wherein a monophonic note identification model is built in the fourth step; extracting time domain characteristics by adopting a multi-scale convolutional neural network encoder, wherein a convolutional neural network single-tone note recognition model can detect tone information of a newly played note in an input frame; the input pure piano audio signal x (t) can be encoded into an embedding coefficient by a convolutional neural network, and a pure target audio signal is encoded into multi-scale audio embedding by a plurality of parallel one-dimensional convolutional neural networks, wherein each convolutional neural network CNN module has different time resolution; the invention only aims at selecting a piano data set to carry out research on three different time scales, because the piano 88 key frequency is distributed between 27.5HZ and 4186.0HZ, the filter lengths of the parallel one-dimensional Convolutional Neural Networks (CNN) divided into three frequency bands of low frequency (27.5HZ to 123.47HZ), intermediate frequency (130.81HZ to 739.99HZ) and high frequency (783.99HZ to 4186.00HZ) are different, and the filter lengths of the parallel one-dimensional Convolutional Neural Networks (CNN) divided into L1(short, L)2(middle) and L3(long) samples to cover different window sizes; the convolutional neural network is followed by a corrective linear unit activation function to produce a note tone embedded E ═ E1 E2 E3](ii) a To connect embeddings at different time scales by keeping the same pace at different scales
Figure FDA0002856615410000027
To align them; the time domain signal is coded into three time resolutions in embedding E; embedding coefficient E for each scaleiIs defined as:
Ei,k=ReLU(xi,kUi) (14);
K=2(T-L1)/L1+1 (15);
each time
Figure FDA0002856615410000031
L of sample shiftiA sample window; will [ E ]1 E2 E3]Is concatenated as a feature extracted from the piano audio signal.
6. The single-note real-time recognition algorithm based on a multi-scale convolutional neural network in a complex environment according to claim 1, characterized in that a convolutional neural network model for monophonic note recognition is constructed in step five, with 8 convolutional layers, 4 max-pooling layers, and 3 fully connected layers; normalization is added after each convolutional layer; a dropout layer is used; every layer except the output layer uses a rectified linear activation function; the final output layer contains 88 output units with a sigmoid activation function; the loss function is the cross-entropy loss, optimized with the Adam algorithm; the training batch size is set to 16 and the number of iterations (epochs) to 50, and a weight file is saved after every 500 training clips; the cross-entropy loss function is as follows:

CE = −Σ_i y_i log(ŷ_i) (16)

the note-recognition convolutional neural network model is trained according to the steps above until the loss function of the network model converges, completing the training of the note-recognition model; the weight file and configuration files of the note-recognition model are saved.
7. The single-note real-time recognition algorithm based on a multi-scale convolutional neural network in a complex environment according to claim 1, characterized in that in step six the trained complex-environment real-time note extraction and recognition system is used to recognize the music signals of the test set, the note-recognition accuracy is counted, and performance is compared against note recognition of audio recorded in an actual complex environment, implemented specifically as follows:
firstly, using the traditional note-recognition system to run note-recognition tests on 1000 test audio clips from the constructed complex-environment audio database without target extraction, and counting the note-recognition accuracy;
secondly, using the traditional note-recognition system to run note-recognition tests on 1000 test audio clips from the complex-environment audio database after target extraction, and counting the note-recognition accuracy;
thirdly, using the real-time monophonic note recognition system of the multi-scale convolutional neural network to run note-recognition tests on 1000 test audio clips from the constructed complex-environment audio database without target extraction, and counting the note-recognition accuracy;
and fourthly, using the real-time monophonic note recognition system of the multi-scale convolutional neural network to run note-recognition tests on 1000 test audio clips from the complex-environment audio database after target extraction, and counting the note-recognition accuracy.
CN202011549426.4A 2020-12-24 2020-12-24 Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment Pending CN112633175A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011549426.4A CN112633175A (en) 2020-12-24 2020-12-24 Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011549426.4A CN112633175A (en) 2020-12-24 2020-12-24 Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment

Publications (1)

Publication Number Publication Date
CN112633175A true CN112633175A (en) 2021-04-09

Family

ID=75324214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011549426.4A Pending CN112633175A (en) 2020-12-24 2020-12-24 Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment

Country Status (1)

Country Link
CN (1) CN112633175A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113314140A (en) * 2021-05-31 2021-08-27 哈尔滨理工大学 Sound source separation algorithm of end-to-end time domain multi-scale convolutional neural network
CN113593598A (en) * 2021-08-09 2021-11-02 深圳远虑科技有限公司 Noise reduction method and device of audio amplifier in standby state and electronic equipment
CN114067820A (en) * 2022-01-18 2022-02-18 深圳市友杰智新科技有限公司 Training method of voice noise reduction model, voice noise reduction method and related equipment
CN114822593A (en) * 2022-06-29 2022-07-29 新缪斯(深圳)音乐科技产业发展有限公司 Performance data identification method and system
WO2022218134A1 (en) * 2021-04-16 2022-10-20 深圳市优必选科技股份有限公司 Multi-channel speech detection system and method
CN117153197A (en) * 2023-10-27 2023-12-01 云南师范大学 Speech emotion recognition method, apparatus, and computer-readable storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170330586A1 (en) * 2016-05-10 2017-11-16 Google Inc. Frequency based audio analysis using neural networks
US10068557B1 (en) * 2017-08-23 2018-09-04 Google Llc Generating music with deep neural networks
CN109584904A (en) * 2018-12-24 2019-04-05 厦门大学 The sightsinging audio roll call for singing education applied to root LeEco identifies modeling method
CN109935243A (en) * 2019-02-25 2019-06-25 重庆大学 Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model
CN110211574A (en) * 2019-06-03 2019-09-06 哈尔滨工业大学 Speech recognition modeling method for building up based on bottleneck characteristic and multiple dimensioned bull attention mechanism
CN110379401A (en) * 2019-08-12 2019-10-25 黑盒子科技(北京)有限公司 A kind of music is virtually chorused system and method
CN110580458A (en) * 2019-08-25 2019-12-17 天津大学 music score image recognition method combining multi-scale residual error type CNN and SRU
CN110782878A (en) * 2019-10-10 2020-02-11 天津大学 Attention mechanism-based multi-scale audio scene recognition method
CN111008595A (en) * 2019-12-05 2020-04-14 武汉大学 Private car interior rear row baby/pet groveling window distinguishing and car interior atmosphere identifying method
CN111986661A (en) * 2020-08-28 2020-11-24 西安电子科技大学 Deep neural network speech recognition method based on speech enhancement in complex environment

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170330586A1 (en) * 2016-05-10 2017-11-16 Google Inc. Frequency based audio analysis using neural networks
US10068557B1 (en) * 2017-08-23 2018-09-04 Google Llc Generating music with deep neural networks
CN109584904A (en) * 2018-12-24 2019-04-05 厦门大学 The sightsinging audio roll call for singing education applied to root LeEco identifies modeling method
CN109935243A (en) * 2019-02-25 2019-06-25 重庆大学 Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model
CN110211574A (en) * 2019-06-03 2019-09-06 哈尔滨工业大学 Speech recognition modeling method for building up based on bottleneck characteristic and multiple dimensioned bull attention mechanism
CN110379401A (en) * 2019-08-12 2019-10-25 黑盒子科技(北京)有限公司 A kind of music is virtually chorused system and method
CN110580458A (en) * 2019-08-25 2019-12-17 天津大学 music score image recognition method combining multi-scale residual error type CNN and SRU
CN110782878A (en) * 2019-10-10 2020-02-11 天津大学 Attention mechanism-based multi-scale audio scene recognition method
CN111008595A (en) * 2019-12-05 2020-04-14 武汉大学 Private car interior rear row baby/pet groveling window distinguishing and car interior atmosphere identifying method
CN111986661A (en) * 2020-08-28 2020-11-24 西安电子科技大学 Deep neural network speech recognition method based on speech enhancement in complex environment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
吴琼 et al., "Optical music score recognition method based on a multi-scale residual convolutional neural network and bidirectional simple recurrent units", Laser & Optoelectronics Progress *
张开玉 et al., "MSER-based fast localization algorithm for slanted text in natural scenes", Journal of Harbin University of Science and Technology *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022218134A1 (en) * 2021-04-16 2022-10-20 深圳市优必选科技股份有限公司 Multi-channel speech detection system and method
CN113314140A (en) * 2021-05-31 2021-08-27 哈尔滨理工大学 Sound source separation algorithm of end-to-end time domain multi-scale convolutional neural network
CN113593598A (en) * 2021-08-09 2021-11-02 深圳远虑科技有限公司 Noise reduction method and device of audio amplifier in standby state and electronic equipment
CN113593598B (en) * 2021-08-09 2024-04-12 深圳远虑科技有限公司 Noise reduction method and device for audio amplifier in standby state and electronic equipment
CN114067820A (en) * 2022-01-18 2022-02-18 深圳市友杰智新科技有限公司 Training method of voice noise reduction model, voice noise reduction method and related equipment
CN114067820B (en) * 2022-01-18 2022-06-28 深圳市友杰智新科技有限公司 Training method of voice noise reduction model, voice noise reduction method and related equipment
CN114822593A (en) * 2022-06-29 2022-07-29 新缪斯(深圳)音乐科技产业发展有限公司 Performance data identification method and system
CN117153197A (en) * 2023-10-27 2023-12-01 云南师范大学 Speech emotion recognition method, apparatus, and computer-readable storage medium
CN117153197B (en) * 2023-10-27 2024-01-02 云南师范大学 Speech emotion recognition method, apparatus, and computer-readable storage medium

Similar Documents

Publication Publication Date Title
Guzhov et al. Esresnet: Environmental sound classification based on visual domain models
CN112633175A (en) Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment
Kavalerov et al. Universal sound separation
CN113314140A (en) Sound source separation algorithm of end-to-end time domain multi-scale convolutional neural network
CN101599271B (en) Recognition method of digital music emotion
CN110600047A (en) Perceptual STARGAN-based many-to-many speaker conversion method
Dua et al. An improved RNN-LSTM based novel approach for sheet music generation
CN111369982A (en) Training method of audio classification model, audio classification method, device and equipment
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
CN110767210A (en) Method and device for generating personalized voice
CN115762536A (en) Small sample optimization bird sound recognition method based on bridge transform
CN114023300A (en) Chinese speech synthesis method based on diffusion probability model
Li et al. Sams-net: A sliced attention-based neural network for music source separation
KR102272554B1 (en) Method and system of text to multiple speech
Sunny et al. Recognition of speech signals: an experimental comparison of linear predictive coding and discrete wavelet transforms
CN114495969A (en) Voice recognition method integrating voice enhancement
CN113257279A (en) GTCN-based real-time voice emotion recognition method and application device
CN116229932A (en) Voice cloning method and system based on cross-domain consistency loss
KR20190135853A (en) Method and system of text to multiple speech
CN113423005B (en) Intelligent music generation method and system based on improved neural network
CN111488486B (en) Electronic music classification method and system based on multi-sound-source separation
Sunny et al. Feature extraction methods based on linear predictive coding and wavelet packet decomposition for recognizing spoken words in malayalam
CN115359775A (en) End-to-end tone and emotion migration Chinese voice cloning method
CN111259188B (en) Lyric alignment method and system based on seq2seq network
CN115240702A (en) Voice separation method based on voiceprint characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20210409)