CN112633175A - Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment - Google Patents
- Publication number
- CN112633175A (application CN202011549426.4A)
- Authority
- CN
- China
- Prior art keywords
- audio
- note
- neural network
- convolutional neural
- identification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2218/00—Aspects of pattern recognition specially adapted for signal processing
- G06F2218/08—Feature extraction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Abstract
The invention discloses a single-note real-time recognition algorithm with an audio-extraction front end. The main steps are: first, select a clean audio data set of the instrument to be recognized; mix the clean audio with various noises to simulate real recording conditions; extract features from the input waveform with an encoder; model the extracted feature vectors with several stacked temporal convolutional networks; finally, recover the clean audio waveform with a decoder; and build a multi-scale convolutional neural network for note recognition. The invention provides a time-domain audio processing method, overcoming the drawback of traditional frequency-domain processing, which discards phase information. The target audio to be recognized is extracted from the complex environment before note recognition, and features are extracted with a multi-scale convolutional neural network, improving note-recognition performance. The algorithm solves the sensitivity of existing note-recognition methods to noise, and its small model size greatly improves recognition speed and accuracy.
Description
Technical Field
The invention belongs to the field of audio signal processing and relates to a real-time single-note recognition algorithm, based on a multi-scale convolutional neural network, with a noise-reduction function.
Background
With economic development, people pay increasing attention to their spiritual life, and music has become an important part of daily entertainment. The number of instrument learners keeps growing and includes many amateurs and beginners. Traditional instrument study requires guidance from a professional teacher, which is costly, and the constraints of time and space mean a teacher cannot supervise every practice session. Learners therefore cannot assess their own playing level in real time after each performance. An algorithm that recognizes musical notes can help a player judge playing accuracy, and note recognition can also reduce the workload of musicians in music processing.
At present, recognition of clean notes in a relatively quiet environment is increasingly mature, but existing algorithms apply only in a narrow range of clean, ideal audio conditions: they place strict demands on signal quality and have poor robustness. In a real environment, the notes to be recognized are polluted by environmental noise, or the note signals of several instruments are mixed together, which greatly interferes with feature extraction. Feature extraction is a key step in audio processing: its essence is to extract, from the rich information contained in the audio signal, parameters that express the nature of the note signal. If large errors occur in feature extraction, recognition accuracy drops sharply and recognition of the target audio fails. Denoising the music signal and extracting the target audio to be recognized are therefore indispensable parts of the whole system.
Traditional unsupervised noise-reduction methods such as spectral subtraction, Kalman filtering and Wiener filtering rely on assumptions and prior information when denoising audio in a complex environment: for example, that the noise is stationary, that clean audio and noise are uncorrelated, or that the Fourier coefficients of the note audio and the noise are mutually independent in the time-frequency domain. Although such methods are simple to implement, they depend heavily on accurate noise estimation, their parameters require manual tuning, and their robustness is poor. To improve robustness, the invention extracts the audio with a neural network, so that the target audio can be extracted effectively in real time under complex conditions with multiple noises or multiple instruments.
Existing audio-processing algorithms rely almost entirely on spectrogram representations of the audio signal, converting the waveform to the frequency domain for feature extraction. However, applying the short-time Fourier transform (STFT) has several limitations. First, because the model does not estimate the source phase, the phase is usually assumed equal to that of the mixture; an incorrect phase estimate degrades the exact reconstruction of the clean source's phase and lowers the upper bound on the quality of the reconstructed audio. Second, successfully separating the source from a time-frequency representation requires a high-resolution frequency decomposition of the mixed signal, which requires a long time window for computing the STFT. This increases the minimum latency of the system and limits its applicability in real-time, low-delay applications.
In summary, the technical problem solved by the present invention is how to extract the audio signal of the instrument to be identified accurately and quickly when the audio contains various types of noise. The model has strong generalization ability and good robustness, so the system maintains relatively stable accuracy under different noise types and noise levels. Moreover, the proposed algorithm is not limited to the notes of one specific instrument but is universal across instruments, and it improves the accuracy of note identification.
Disclosure of Invention
The invention provides a single note real-time identification algorithm based on a multi-scale convolution neural network under a complex environment.
The technical solution takes multi-scale neural networks and end-to-end audio extraction and note recognition as its technical background. To achieve the above object, the invention adopts the following technical solution:
Step 3, the decoder module reconstructs the source waveform from the features extracted by the encoder, using the obtained masking coefficients.
Step 4, construct the single-note recognition module: design a multi-scale audio encoder and extract features from the clean audio signal with three convolution modules of different sizes.
Step 5, train a convolutional neural network to perform note recognition.
Step 6, input the clean audio features extracted in step 4 into the trained neural network to complete note recognition.
Compared with the prior art, the invention has the following notable advantages. The method takes multi-scale neural networks and end-to-end audio extraction and note identification as its technical background. First, a convolution module extracts features directly from the waveform file, replacing the conventional approach of applying a short-time Fourier transform (STFT) and operating on a spectrogram in the frequency domain; audio phase information is thus taken into account while processing speed increases. The target audio signal is extracted at the front end of recognition, so the system maintains relatively stable accuracy under different noise types and levels, such as white Gaussian noise, background noise or interfering sound sources, conditions under which traditional algorithms perform poorly. The audio-extraction step also makes the monophonic-note recognition system robust to interference from complex environmental factors. The algorithm is not limited to one specific instrument but is universal across instruments. A multi-scale convolutional encoding-decoding module in the note-recognition module greatly improves recognition accuracy. The model has strong generalization ability and precision, processes audio directly in the time domain, and runs in real time.
Drawings
To illustrate the technical solution of the present invention more clearly, the drawings used in the description are briefly introduced below;
FIG. 1 is a general flow chart of a single note real-time identification algorithm based on a multi-scale convolutional neural network under a complex environment;
FIG. 2 is a detailed flowchart of the algorithm;
FIG. 3 is a block diagram of a pure target audio extraction process under a complex environment according to the present invention;
FIG. 4 is a block diagram of an overall network architecture for a time-series convolutional neural network (TCN) for audio extraction according to the present invention;
FIG. 5 is a block diagram of a design framework for each one-dimensional convolution block of the audio extraction time-series convolutional neural network (TCN) of the present invention;
FIG. 6 is a block diagram of the present invention of a model of a monophonic identifier convolutional neural network.
Detailed Description
The single-note real-time identification method based on a multi-scale convolutional neural network in a complex environment builds its model on deep-learning convolutional neural networks and on target-audio extraction in a complex environment. The overall flow is shown in Fig. 1, and the detailed flow of the audio extraction and recognition scheme is shown in Fig. 2.
First, a clean audio data set of the instrument to be identified is selected; because the model generalizes well and is not limited to a specified instrument, this data set can be replaced by any target data set. The selected piano audio (MAPS) data set contains piano audio files, associated MIDI files and labeled txt files. It is divided into 9 directories according to piano type and recording conditions, each containing isolated tones, chords and complete piano pieces. Each directory has 30 complete piano pieces, 270 in total, with a total duration of about 18 hours. At the preprocessing stage of the recognition front end, audio extraction is performed on the signal under the complex audio conditions to be identified: a noisy piano audio signal is input and features are extracted by a one-dimensional convolutional encoder; the extracted feature vectors are modeled by several stacked temporal convolutional networks (TCNs); finally the result is passed to a one-dimensional convolutional decoder to obtain the clean piano waveform. Meanwhile, a multi-scale convolutional note-recognition model is built and trained on the clean piano data set (MAPS) to obtain a model weight file, realizing accurate recognition of piano notes in a complex environment.
This solves the problems that existing note-recognition algorithms are sensitive to noise, demand high audio quality and apply to a single scenario. The method for real-time single-note identification based on a multi-scale convolutional neural network in a complex environment comprises the following steps:
step one, establishing and processing an audio data set in a complex environment. In this section the pure piano audio (MAPS) data set selected for use with the present invention. The pure piano audio and noise audio sampling rates are first adjusted to the same 44100 HZ. And then carrying out batch aliasing on the pure piano audio and the noise to form a noise-containing mixed audio data set Y, wherein the noise audio data comprises white Gaussian noise, human speaking noise, sudden whistling noise and other musical instrument drum sounds. The formula for mixing the audio is:
and dividing the audio data under each environment in the audio data set Y into a training set, a verification set and a test set. The distribution proportion is that the number of the training set audio pieces, the number of the verification set audio pieces and the number of the test set audio pieces are 3: 2.
Step two, construct a one-dimensional convolutional encoder to extract features of the mixed signal. The input mixture is divided into overlapping segments of length L, denoted y_k ∈ R^(1×L), where k = 1, …, T̂ is the segment index and T̂ is the total number of input segments. A one-dimensional convolution operation converts y_k into an N-dimensional representation w ∈ R^(1×N), written as a matrix multiplication (dropping the index k from here on):

w = H(yU)

where U ∈ R^(N×L) contains the N basis functions of the encoder, each of length L, and H(·) is the ReLU function, which ensures non-negativity.
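A minimal numerical sketch of the encoder operation w = ReLU(yU) on one segment; the segment length L and number of basis functions N are illustrative sizes, not values fixed by the patent:

```python
# One overlapping segment y of length L is mapped to a non-negative
# N-dimensional representation w = ReLU(y U^T).  Sizes are illustrative.
import numpy as np

L, N = 40, 256
rng = np.random.default_rng(1)
U = 0.1 * rng.standard_normal((N, L))   # N encoder basis functions of length L
y = rng.standard_normal((1, L))         # one segment of the mixture

w = np.maximum(y @ U.T, 0.0)            # ReLU keeps the representation non-negative
```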
Step three, design the clean-audio convolution extraction module; the flow of clean target-audio extraction in a complex environment is shown in Fig. 3.
the present invention uses a full convolution separation module consisting of stacked one-dimensional expansion convolution blocks, as shown in fig. 4. Time-series convolutional neural networks (TCNs) are used in place of Recurrent Neural Networks (RNNs) in various sequence modeling tasks. Each layer in the TCN consists of one-dimensional convolution blocks with gradually increasing expansion factors. The expansion factor increases exponentially to ensure that a sufficiently large time window can be included. Wherein M expansion factors are 1, 2, 4M-1The convolution block of (2) is repeated R times. The inputs of each block are zero padded to ensure that the output length is the same as the input. The output of TCN will be fed to a convolutional block of kernel size 1 to estimate the mask. The 1 x 1 convolutional block, together with the nonlinear activation function, estimates C mask vectors for the C target sources.
Fig. 5 shows the design of each one-dimensional convolution block. Each block has both a residual path and a skip-connection path: the residual path of one block serves as the input to the next block, and the sum of the skip-connection outputs of all blocks serves as the output of the temporal convolutional network (TCN).
To further reduce the number of parameters, a depthwise separable convolution (S_conv) replaces the standard convolution in each convolution block. The depthwise separable convolution operator decouples the standard convolution into two successive operations, a depthwise convolution D_conv(·) followed by a point-wise convolution with kernel size 1:

D_conv(Z, K) = [z_j ⊛ k_j]
S_conv(Z, K, L) = D_conv(Z, K) ⊛ L

where Z is the input of S_conv(·), K is a convolution kernel of size P, z_j and k_j are the rows of the matrices Z and K respectively, L is a convolution kernel of size 1, and ⊛ denotes the convolution operation.
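The parameter saving of the depthwise separable factorisation can be checked by counting weights. The channel counts (G input channels, H output channels) and kernel size P below are illustrative, not taken from the patent:

```python
# Weight-count comparison: standard 1-D convolution vs. depthwise convolution
# followed by a 1x1 point-wise convolution.
def standard_conv_params(g, h, p):
    return g * h * p            # one kernel of size p per (input, output) channel pair

def separable_conv_params(g, h, p):
    return g * p + g * h        # g depthwise kernels of size p, then g*h 1x1 weights

std = standard_conv_params(g=128, h=512, p=3)
sep = separable_conv_params(g=128, h=512, p=3)
```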
The first 1×1 conv(·) block and the D_conv(·) block are each followed by a nonlinear activation function and a normalization. The nonlinear activation function is the parametric rectified linear unit (PReLU):

PReLU(x) = x, if x ≥ 0;  αx, otherwise

where α is a learnable parameter.
the normalization method in the network uses global level normalization (gLN). The features are normalized in the channel and time dimensions at gLN:
where F is a feature, γ and β are trainable parameters, and e is a small constant for numerical stability.
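A minimal sketch of global layer normalization over the channel and time dimensions, with γ and β kept at their initial values (1 and 0) for illustration:

```python
# gLN: normalise the feature F by its mean and variance taken over BOTH the
# channel and time dimensions (unlike per-channel normalisation).
import numpy as np

def gln(F, gamma=1.0, beta=0.0, eps=1e-8):
    mean = F.mean()                     # E[F] over channels and time
    var = F.var()                       # Var[F] over channels and time
    return gamma * (F - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(2)
F = 3.0 * rng.standard_normal((16, 100)) + 5.0   # (channels, time), shifted and scaled
G = gln(F)
```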
A linear one-dimensional convolution block is added at the beginning of the extraction module as a bottleneck layer. This block determines the number of channels in the input and residual paths of the subsequent convolution blocks. The linear bottleneck layer has B channels; for a one-dimensional convolution block with H channels and kernel size P, the kernels in the first 1×1 convolution block and the first depthwise convolution D_conv block have sizes B×H×1 and H×P respectively, and the kernel in the residual path has size H×B×1. The number of output channels in the skip-connection path may differ from B; the kernel size in this path is denoted L_Sc.
Step four, estimate the extraction masks. Separation of each frame is achieved by estimating C mask vectors m_i ∈ R^(1×N), where C is the number of sources in the mixed signal and m_i ∈ [0, 1]. Applying m_i to the mixed representation w yields the corresponding source representation:

d_i = w ⊙ m_i

where ⊙ denotes element-wise multiplication. The estimated target audio waveform ŝ_i is then reconstructed by the decoder.
step five, the decoder reconstructs the waveform from the representation form by using one-dimensional transposition convolution operation, and the waveform can be represented by matrix multiplication as follows:
whereinIs reconstructed x, V ∈ RN×LAre the basis functions of the decoder, each of length L. The overlapping reconstructed segments are added together to generate the final waveform. The specific audio data extraction flow framework diagram is shown in the figure and is explained in figure 2.
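The mask-and-decode steps above can be sketched numerically as follows; the sizes and the sigmoid-shaped mask are illustrative assumptions:

```python
# Steps four and five in miniature: a mask m in [0, 1] is applied element-wise
# to the encoder output w, and the decoder maps the masked representation back
# to a length-L waveform segment via d V.
import numpy as np

L, N = 40, 256
rng = np.random.default_rng(3)
w = np.maximum(rng.standard_normal((1, N)), 0.0)        # encoder representation
m = 1.0 / (1.0 + np.exp(-rng.standard_normal((1, N))))  # mask with entries in [0, 1]
V = 0.1 * rng.standard_normal((N, L))                   # decoder basis functions

d = w * m            # masked source representation (element-wise product)
x_hat = d @ V        # reconstructed segment; overlapping segments are then summed
```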
Experimental configuration: the network is trained for 150 epochs on 5-second segments. The initial learning rate is set to 1e-3 and is halved if the accuracy on the validation set does not improve for 3 consecutive epochs. The optimizer is Adam. The convolutional auto-encoder uses a stride of 50% (i.e. 50% overlap between consecutive frames). Gradient clipping with a maximum L2 norm of 5 is applied during training.
Training target: the end-to-end system is trained to maximize the scale-invariant signal-to-noise ratio (SI-SNR), defined as:

s_target = (⟨ŝ, s⟩ · s) / ‖s‖²
e_noise = ŝ − s_target
SI-SNR = 10 log10 (‖s_target‖² / ‖e_noise‖²)

where ŝ is the estimated source and s is the clean source, both normalized to zero mean.
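The SI-SNR objective can be implemented directly from its definition; the sketch below also illustrates the scale invariance that gives the metric its name (rescaling the estimate leaves the value unchanged). Signal lengths and the noise level are illustrative:

```python
# Direct implementation of SI-SNR: project the estimate onto the reference,
# then compare the power of the projection with the power of the residual.
import numpy as np

def si_snr(est, ref, eps=1e-8):
    est = est - est.mean()                  # zero-mean both signals
    ref = ref - ref.mean()
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10.0 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))

t = np.arange(1000) / 8000.0
s = np.sin(2 * np.pi * 440.0 * t)           # clean reference
rng = np.random.default_rng(4)
noisy = s + 0.3 * rng.standard_normal(s.size)
assert abs(si_snr(2.0 * noisy, s) - si_snr(noisy, s)) < 1e-6   # scale invariance
```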
the cross-entropy loss function is as follows:
and after the extraction processing of one target audio frequency in the note signal training set is finished, the training set, the verification set and the test set are sequentially processed according to the steps.
Step six, build the single-note recognition model. The algorithm uses a multi-scale convolutional-neural-network encoder to extract time-domain features, and a note-level method based on a convolutional neural network (CNN) to transcribe the monophonic music signal. The CNN model is well suited to detecting spatial structural features; in addition, compared with a DNN, a CNN extracts features with shared parameters, which reduces the model size, effectively prevents overfitting and improves the generalization performance of the model.
The recognition model detects the pitch of the newly played note in the input frame; the output layer of the neural network contains 88 output units, corresponding to the notes of the 88 piano keys used in the invention.
The audio feature-extraction part uses a multi-scale audio encoder. It mainly avoids the phase-estimation problem of frequency-domain methods by choosing a time-domain method: the time-domain mixed signal is converted directly into a feature representation by a convolutional network. In a frequency-domain method, the audio signal is decomposed by a Fourier transform into an alternative representation characterized by sines and cosines. In the time-domain approach, the filters of the convolutional layer can analogously be treated as basis functions, which is equivalent to treating the sine and cosine representations of the frequency domain as embedding coefficients. Time-domain coding differs from the Fourier transform in two ways: (a) the feature representation does not process real and imaginary parts separately; (b) the basis functions are not predefined as sines or cosines but can be trained from the data.
The input clean target audio signal x(t) is encoded into embedding coefficients by a convolutional neural network: several parallel one-dimensional CNN modules, each with a different time resolution, encode the signal into a multi-scale audio embedding.
The number of scales can vary, so the system is generic: the number of time scales can be changed according to the type of audio signal to be identified. The present invention studies only three different time scales.
Because the fundamental frequencies of the 88 piano keys are distributed over 27.5 Hz to 4186.0 Hz, they are divided into three frequency bands: low (27.5 Hz to 123.47 Hz), middle (130.81 Hz to 739.99 Hz) and high (783.99 Hz to 4186.00 Hz). The filters therefore have different lengths, L1 (short), L2 (middle) and L3 (long) samples, to cover the different window sizes.
Each convolutional neural network (CNN) branch is followed by a rectified linear unit (ReLU) activation function, producing the note embeddings E = [E1 E2 E3].
To concatenate the embeddings of different time scales, the same hop size is kept at every scale so that the embeddings are aligned in time.
As the filter length changes, the encoder learns representations over multiple scales: short windows give good resolution in the high-frequency band and long windows give higher resolution in the low-frequency band. The time-domain signal is thus encoded at three time resolutions in the embedding E. The embedding coefficients E_i for each scale are defined as:

E_{i,k} = ReLU(x_{i,k} U_i)
K = 2(T − L1)/L1 + 1

where x_{i,k} is the k-th window of L_i samples at scale i, U_i contains the basis functions of scale i, T is the length of the input signal, and K is the number of windows; each window of L_i samples is shifted by L1/2 samples. The concatenation [E1 E2 E3] is used as the extracted feature of the piano audio signal.
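The window count K = 2(T − L1)/L1 + 1 corresponds to a hop of L1/2 samples shared by every scale, which is what keeps the three embeddings aligned. A small sketch with illustrative values of T and L1:

```python
# Number of windows K produced when windows are shifted by L1/2 samples;
# every scale uses the same hop, so every scale produces the same K.
def n_windows(T, L1):
    return 2 * (T - L1) // L1 + 1

T, L1 = 44100, 100     # illustrative: 1 s of audio at 44.1 kHz, short window of 100 samples
K = n_windows(T, L1)
```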
A convolutional neural network model for single-note recognition is constructed. The structure of the CNN model is shown in Fig. 6: 8 convolutional layers, 4 max-pooling layers and 3 fully connected layers.
A normalization layer is added after each convolutional layer, which accelerates model convergence, provides a regularization effect and helps prevent overfitting; it also keeps the data in a stable range before the ReLU, so that overly large values do not destabilize the network. A dropout layer is used to prevent overfitting and improve the generalization ability of the model.
Except for the output layer, every layer uses the rectified linear activation function. The final output layer contains 88 output units and uses the sigmoid activation function.
Experimental configuration: the loss function is the cross-entropy loss, optimized with the Adam algorithm. The batch size is set to 16 and the number of iterations (epochs) to 50; a weight file is saved after every 500 audio clips are trained. The cross-entropy loss function is as follows:

L = −(1/N) Σ_i [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ]

where y_i ∈ {0, 1} indicates whether output unit i corresponds to an active note and ŷ_i is the sigmoid output of unit i.
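A minimal sketch of the mean binary cross-entropy over the 88 sigmoid output units (one unit per piano key); the example target and prediction values are illustrative:

```python
# Mean binary cross-entropy over 88 output units; the target marks a single
# active note and the prediction values are illustrative.
import numpy as np

def bce_loss(y_true, y_pred, eps=1e-7):
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return float(-np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)))

y_true = np.zeros(88)
y_true[39] = 1.0                 # one active key (the index is illustrative)
y_pred = np.full(88, 0.01)       # confident "off" everywhere ...
y_pred[39] = 0.99                # ... and confident "on" for the active key
loss = bce_loss(y_true, y_pred)
```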
The convolutional neural network model for single-note recognition is trained according to the above steps until the loss of the network model converges, completing the training of the note recognition model. The weight file and the configuration files of the note recognition model are saved.
Step seven: the trained real-time note extraction and recognition system for complex environments is used to identify the music-signal audio of the test set; the note-recognition accuracy is counted, and the performance is compared and analyzed against audio note recognition recorded in an actual complex environment.
The specific implementation is as follows:
(1) Using the traditional note recognition system, perform note recognition tests on the 1000-clip audio test set of the established complex-environment audio database without target extraction, and count the note-recognition accuracy.
(2) Using the traditional note recognition system, perform note recognition tests on the 1000-clip audio test set of the complex-environment audio database after target extraction, and count the note-recognition accuracy.
(3) Using the real-time single-note recognition system based on the multi-scale convolutional neural network, perform note recognition tests on the 1000-clip audio test set of the established complex-environment audio database without target extraction, and count the note-recognition accuracy.
(4) Using the real-time single-note recognition system based on the multi-scale convolutional neural network, perform note recognition tests on the 1000-clip audio test set of the complex-environment audio database after target extraction, and count the note-recognition accuracy.
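The accuracy statistic counted in each of the four tests above can be computed as in this sketch (hypothetical label arrays; "accuracy" is taken here to mean the fraction of clips whose predicted note matches the ground truth):

```python
import numpy as np

def note_accuracy(predicted, ground_truth):
    """Fraction of test clips whose predicted note index (0-87 for the
    88 piano keys) matches the labeled note index."""
    predicted = np.asarray(predicted)
    ground_truth = np.asarray(ground_truth)
    return float((predicted == ground_truth).mean())

# Over a 1000-clip test set this would be, e.g.:
#   acc = note_accuracy(model_outputs.argmax(axis=1), labels)
```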
After the statistics are complete, they show that the single-note recognition algorithm based on audio extraction proposed by the invention greatly improves recognition accuracy under various noise, background-noise and interfering-sound-source environments. Compared with the traditional note recognition algorithm, which performs poorly under these conditions, the proposed algorithm achieves a substantially higher recognition accuracy.
Therefore, the deep-neural-network note recognition method based on speech enhancement in complex environments solves the problems that existing note recognition algorithms are sensitive to noisy environments, demand high audio quality and apply only to a single scenario, and realizes real-time note recognition in complex audio environments.
Advantages of the invention
The method builds its model on a deep-learning convolutional neural network and time-domain audio-extraction analysis. First, a complex speech-environment data set is built, and speech enhancement is applied to the audio signals to be recognized, under various complex conditions, in the audio-signal preprocessing stage at the front end of note recognition. A convolution module extracts features directly from the waveform file, replacing the conventional approach of converting to the frequency domain with a short-time Fourier transform (STFT) and operating on a spectrogram; this takes audio phase information into account while increasing processing speed. Extracting the target audio signal at the recognition front end gives the system relatively stable accuracy under different noise types and noise levels, such as Gaussian white noise, background noise or interfering sound sources, conditions under which conventional algorithms perform poorly; the introduction of the audio-extraction method thus makes the single-note recognition system of the present invention robust against interference from these complex environmental factors. After the target audio signal is extracted from the complex sound environment, the multi-scale single-note recognition model is trained with the labels of the audio data set to obtain a model weight file, thereby achieving accurate single-note recognition in complex environments. This solves the problems that existing note recognition algorithms are sensitive to noise, demand high audio quality and apply only to a single scenario.
Claims (7)
1. A real-time single-note recognition algorithm based on a multi-scale convolutional neural network in a complex environment, characterized by comprising the following steps:
step one, select a clean audio data set of the instrument to be recognized, mix the clean audio with a noise data set to establish the required data set, and design an encoder module that converts short segments of the mixed signal into corresponding representations in a feature space;
step two, estimate a mask for each source using an audio extraction module;
step three, the decoder module reconstructs the source waveform from the features extracted by the encoder, using the obtained masking coefficients;
step four, construct a single-note recognition module, design a multi-scale audio encoder, and perform feature extraction on the clean audio signal using three convolution modules of different sizes;
step five, train a convolutional neural network to perform note recognition;
step six, input the extracted clean-audio features into the trained neural network to complete recognition of the single note.
2. The real-time single-note recognition algorithm based on the multi-scale convolutional neural network in a complex environment according to claim 1, wherein step one establishes and processes an audio data set for the complex environment; a clean piano audio data set is selected in this invention; first, the sampling rates of the clean piano audio and the noise audio are adjusted to the same 44100 Hz; the clean piano audio is then mixed with noise in batches to form a noisy mixed audio data set Y, where the noise audio data include Gaussian white noise, human speech, sudden horn noise and drum noise from other instruments; the mixing formula is as follows:
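A common construction for such mixing (an assumption here, not necessarily the patent's exact formula) adds noise scaled to a chosen signal-to-noise ratio, y = x + α·n:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Mix a clean clip with noise scaled so that the resulting mixture
    has the requested SNR in dB: y = x + alpha * n."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    alpha = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + alpha * noise
```

Running this over every (clean clip, noise clip, SNR) combination in batch would produce the noisy data set Y described above.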
the audio data for each environment in the audio data set Y are divided into a training set, a verification set and a test set, with the numbers of training, verification and test audio clips in a ratio of 3:2;
a one-dimensional convolutional encoder is constructed to extract features from the mixed signal; the input mixed signal is divided into overlapping segments of length L, denoted yk ∈ R^(1×L), where k is the index of the segment and K is the total number of input segments; yk is converted into an N-dimensional representation w ∈ R^(1×N) by a one-dimensional convolution, expressed as a matrix multiplication:
3. The real-time single-note recognition algorithm based on the multi-scale convolutional neural network in a complex environment according to claim 1, wherein step two designs a clean-audio extraction module; the invention uses a fully convolutional separation module composed of stacked one-dimensional dilated convolution blocks, employing a temporal convolutional network (TCN) in place of the recurrent neural network (RNN) used in many sequence-modeling tasks; each layer of the TCN consists of one-dimensional convolution blocks with progressively increasing dilation factors, which grow exponentially; the M dilation factors are 1, 2, 4, ..., 2^(M-1), and the stack of blocks is repeated R times; the input of each block is zero-padded; the output of the TCN is fed to a convolution block of kernel size 1 to estimate the masks; this 1×1 convolution block, together with a nonlinear activation function, estimates C mask vectors for the C target sources; a residual path and a skip path are applied: the residual path of each block serves as the input of the next block, and the sum of the skip paths of all blocks serves as the output of the temporal convolutional network (TCN); per-frame mask-based extraction and separation is achieved by estimating C vectors (masks) mi ∈ R^(1×N), where C is the number of sources in the mixed signal and mi ∈ [0, 1]; applying mi to the mixed representation w yields the corresponding source representation:
di=w⊙mi (9);
wherein ⊙ denotes element-wise multiplication; the estimated target audio waveform signal is then reconstructed by the decoder:
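Equation (9) and the decoder reconstruction can be sketched together on toy dimensions (the softmax-style normalization making the masks sum to one across sources is one common choice, assumed here):

```python
import numpy as np

def apply_masks_and_decode(w, masks, V):
    """Apply per-source masks to the mixture representation w (eq. 9,
    d_i = w * m_i) and decode each source segment as d_i @ V."""
    sources = []
    for m in masks:               # each m has entries in [0, 1], shape (N,)
        d = w * m                 # element-wise masking of the mixture
        sources.append(d @ V)     # decoder basis functions, each of length L
    return np.stack(sources)

N, L, C = 8, 16, 2
rng = np.random.default_rng(0)
w = rng.standard_normal(N)        # mixture representation
raw = rng.random((C, N))
masks = raw / raw.sum(axis=0)     # masks sum to 1 across the C sources
V = rng.standard_normal((N, L))   # decoder basis
y = apply_masks_and_decode(w, masks, V)
```

Because the masks sum to one, the decoded sources sum back to the decoded mixture, which is the consistency property mask-based separation relies on.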
4. The real-time single-note recognition algorithm based on the multi-scale convolutional neural network in a complex environment according to claim 1, wherein in step three the decoder reconstructs the waveform from its representation using a one-dimensional transposed convolution, which can be expressed as a matrix multiplication:
where x̂ is the reconstructed x, and the rows of V ∈ R^(N×L) are the basis functions of the decoder, each of length L; the overlapping reconstructed segments are added together to generate the final waveform. The training objective of the end-to-end system is to maximize the scale-invariant signal-to-noise ratio (SI-SNR), defined as:
the loss function is as follows:
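The SI-SNR objective named above, and the loss derived from it, can be sketched as follows (this follows the common definition from time-domain source separation, assumed here to match the patent's abbreviated formulas; the loss is the negative SI-SNR):

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB: project the estimate onto the target,
    then compare the projection's energy with the residual's energy."""
    estimate = estimate - estimate.mean()       # remove DC offset
    target = target - target.mean()
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = scale * target                   # target-aligned component
    e_noise = estimate - s_target               # everything else
    return 10 * np.log10(np.dot(s_target, s_target)
                         / (np.dot(e_noise, e_noise) + eps))

# Training loss: the negative SI-SNR, averaged over the batch.
```

Scaling the estimate scales s_target and e_noise equally, so the ratio, and hence the objective, is unchanged; that is what makes the measure scale-invariant.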
5. The real-time single-note recognition algorithm based on the multi-scale convolutional neural network in a complex environment according to claim 1, wherein step four builds the single-note recognition model; a multi-scale convolutional-neural-network encoder extracts time-domain features, and the convolutional single-note recognition model can detect the pitch information of a newly played note in the input frame; the input clean piano audio signal x(t) is encoded into embedding coefficients: several parallel one-dimensional convolutional neural networks encode the clean target audio signal into a multi-scale audio embedding, each CNN module having a different time resolution; the invention studies three time scales for the selected piano data set because the frequencies of the 88 piano keys are distributed between 27.5 Hz and 4186.0 Hz, divided into low-frequency (27.5 Hz-123.47 Hz), intermediate-frequency (130.81 Hz-739.99 Hz) and high-frequency (783.99 Hz-4186.00 Hz) bands; the filter lengths of the parallel one-dimensional convolutional neural networks (CNNs) for these bands differ, being L1 (short), L2 (middle) and L3 (long) samples, so as to cover different window sizes; each convolutional neural network is followed by a rectified linear unit activation function, producing the note pitch embedding E = [E1, E2, E3]; to concatenate the embeddings at different time scales, the same hop size is kept at every scale so that they stay aligned; the time-domain signal is encoded at three time resolutions in the embedding E; the embedding coefficient Ei for each scale is defined as:
Ei,k = ReLU(xi,k Ui)    (14);
K = 2(T - L1)/L1 + 1    (15);
6. The real-time single-note recognition algorithm based on the multi-scale convolutional neural network in a complex environment according to claim 1, wherein step five constructs the convolutional neural network model for single-note recognition, using 8 convolutional layers, 4 max-pooling layers and 3 fully connected layers; batch normalization is added after each convolutional layer; a dropout layer is used; all layers except the output layer use the ReLU activation function; the final output layer contains 88 output units and uses the sigmoid activation function; the loss function is the cross-entropy loss, optimized with the Adam algorithm; the training batch size is set to 16 and the number of epochs to 50; a weight file is saved after every 500 training samples; the cross-entropy loss function is as follows:
the convolutional neural network model for single-note recognition is trained according to the above steps until the loss function of the network model converges, completing the training of the single-note recognition model; the weight file and the configuration files of the single-note recognition model are saved.
7. The real-time single-note recognition algorithm based on the multi-scale convolutional neural network in a complex environment according to claim 1, wherein in step six the trained real-time single-note extraction and recognition system for complex environments is used to identify the music signals of the test set, the note-recognition accuracy is counted, and a performance comparison analysis is performed against audio note recognition of music signals recorded in an actual complex environment; the specific implementation is as follows:
first, use the traditional note recognition system to perform note recognition tests on the 1000-clip audio test set of the established complex-environment audio database without target extraction, and count the note-recognition accuracy;
second, use the traditional note recognition system to perform note recognition tests on the 1000-clip audio test set of the complex-environment audio database after target extraction, and count the note-recognition accuracy;
third, use the real-time single-note recognition system based on the multi-scale convolutional neural network to perform note recognition tests on the 1000-clip audio test set of the established complex-environment audio database without target extraction, and count the note-recognition accuracy;
fourth, use the real-time single-note recognition system based on the multi-scale convolutional neural network to perform note recognition tests on the 1000-clip audio test set of the complex-environment audio database after target extraction, and count the note-recognition accuracy.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011549426.4A CN112633175A (en) | 2020-12-24 | 2020-12-24 | Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112633175A true CN112633175A (en) | 2021-04-09 |
Family
ID=75324214
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112633175A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113314140A (en) * | 2021-05-31 | 2021-08-27 | 哈尔滨理工大学 | Sound source separation algorithm of end-to-end time domain multi-scale convolutional neural network |
CN113593598A (en) * | 2021-08-09 | 2021-11-02 | 深圳远虑科技有限公司 | Noise reduction method and device of audio amplifier in standby state and electronic equipment |
CN114067820A (en) * | 2022-01-18 | 2022-02-18 | 深圳市友杰智新科技有限公司 | Training method of voice noise reduction model, voice noise reduction method and related equipment |
CN114822593A (en) * | 2022-06-29 | 2022-07-29 | 新缪斯(深圳)音乐科技产业发展有限公司 | Performance data identification method and system |
WO2022218134A1 (en) * | 2021-04-16 | 2022-10-20 | 深圳市优必选科技股份有限公司 | Multi-channel speech detection system and method |
CN117153197A (en) * | 2023-10-27 | 2023-12-01 | 云南师范大学 | Speech emotion recognition method, apparatus, and computer-readable storage medium |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170330586A1 (en) * | 2016-05-10 | 2017-11-16 | Google Inc. | Frequency based audio analysis using neural networks |
US10068557B1 (en) * | 2017-08-23 | 2018-09-04 | Google Llc | Generating music with deep neural networks |
CN109584904A (en) * | 2018-12-24 | 2019-04-05 | 厦门大学 | The sightsinging audio roll call for singing education applied to root LeEco identifies modeling method |
CN109935243A (en) * | 2019-02-25 | 2019-06-25 | 重庆大学 | Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model |
CN110211574A (en) * | 2019-06-03 | 2019-09-06 | 哈尔滨工业大学 | Speech recognition modeling method for building up based on bottleneck characteristic and multiple dimensioned bull attention mechanism |
CN110379401A (en) * | 2019-08-12 | 2019-10-25 | 黑盒子科技(北京)有限公司 | A kind of music is virtually chorused system and method |
CN110580458A (en) * | 2019-08-25 | 2019-12-17 | 天津大学 | music score image recognition method combining multi-scale residual error type CNN and SRU |
CN110782878A (en) * | 2019-10-10 | 2020-02-11 | 天津大学 | Attention mechanism-based multi-scale audio scene recognition method |
CN111008595A (en) * | 2019-12-05 | 2020-04-14 | 武汉大学 | Private car interior rear row baby/pet groveling window distinguishing and car interior atmosphere identifying method |
CN111986661A (en) * | 2020-08-28 | 2020-11-24 | 西安电子科技大学 | Deep neural network speech recognition method based on speech enhancement in complex environment |
2020-12-24: CN CN202011549426.4A patent/CN112633175A/en, status: active Pending
Non-Patent Citations (2)
Title |
---|
Wu Qiong et al.: "Optical music score recognition method based on multi-scale residual convolutional neural network and bidirectional simple recurrent units", Laser & Optoelectronics Progress * |
Zhang Kaiyu et al.: "MSER-based fast tilted-text localization algorithm for natural scenes", Journal of Harbin University of Science and Technology * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2022218134A1 (en) * | 2021-04-16 | 2022-10-20 | 深圳市优必选科技股份有限公司 | Multi-channel speech detection system and method |
CN113314140A (en) * | 2021-05-31 | 2021-08-27 | 哈尔滨理工大学 | Sound source separation algorithm of end-to-end time domain multi-scale convolutional neural network |
CN113593598A (en) * | 2021-08-09 | 2021-11-02 | 深圳远虑科技有限公司 | Noise reduction method and device of audio amplifier in standby state and electronic equipment |
CN113593598B (en) * | 2021-08-09 | 2024-04-12 | 深圳远虑科技有限公司 | Noise reduction method and device for audio amplifier in standby state and electronic equipment |
CN114067820A (en) * | 2022-01-18 | 2022-02-18 | 深圳市友杰智新科技有限公司 | Training method of voice noise reduction model, voice noise reduction method and related equipment |
CN114067820B (en) * | 2022-01-18 | 2022-06-28 | 深圳市友杰智新科技有限公司 | Training method of voice noise reduction model, voice noise reduction method and related equipment |
CN114822593A (en) * | 2022-06-29 | 2022-07-29 | 新缪斯(深圳)音乐科技产业发展有限公司 | Performance data identification method and system |
CN117153197A (en) * | 2023-10-27 | 2023-12-01 | 云南师范大学 | Speech emotion recognition method, apparatus, and computer-readable storage medium |
CN117153197B (en) * | 2023-10-27 | 2024-01-02 | 云南师范大学 | Speech emotion recognition method, apparatus, and computer-readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Guzhov et al. | Esresnet: Environmental sound classification based on visual domain models | |
CN112633175A (en) | Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment | |
Kavalerov et al. | Universal sound separation | |
CN113314140A (en) | Sound source separation algorithm of end-to-end time domain multi-scale convolutional neural network | |
CN101599271B (en) | Recognition method of digital music emotion | |
CN110600047A (en) | Perceptual STARGAN-based many-to-many speaker conversion method | |
Dua et al. | An improved RNN-LSTM based novel approach for sheet music generation | |
CN111369982A (en) | Training method of audio classification model, audio classification method, device and equipment | |
CN111899757B (en) | Single-channel voice separation method and system for target speaker extraction | |
CN110767210A (en) | Method and device for generating personalized voice | |
CN115762536A (en) | Small sample optimization bird sound recognition method based on bridge transform | |
CN114023300A (en) | Chinese speech synthesis method based on diffusion probability model | |
Li et al. | Sams-net: A sliced attention-based neural network for music source separation | |
KR102272554B1 (en) | Method and system of text to multiple speech | |
Sunny et al. | Recognition of speech signals: an experimental comparison of linear predictive coding and discrete wavelet transforms | |
CN114495969A (en) | Voice recognition method integrating voice enhancement | |
CN113257279A (en) | GTCN-based real-time voice emotion recognition method and application device | |
CN116229932A (en) | Voice cloning method and system based on cross-domain consistency loss | |
KR20190135853A (en) | Method and system of text to multiple speech | |
CN113423005B (en) | Intelligent music generation method and system based on improved neural network | |
CN111488486B (en) | Electronic music classification method and system based on multi-sound-source separation | |
Sunny et al. | Feature extraction methods based on linear predictive coding and wavelet packet decomposition for recognizing spoken words in malayalam | |
CN115359775A (en) | End-to-end tone and emotion migration Chinese voice cloning method | |
CN111259188B (en) | Lyric alignment method and system based on seq2seq network | |
CN115240702A (en) | Voice separation method based on voiceprint characteristics |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20210409 |