CN114627885A - Small sample data set musical instrument identification method based on ASRT algorithm - Google Patents

Small sample data set musical instrument identification method based on ASRT algorithm

Info

Publication number
CN114627885A
CN114627885A
Authority
CN
China
Prior art keywords
musical instrument
audio file
sample
layer
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210182234.7A
Other languages
Chinese (zh)
Inventor
王树龙
刘钰
薛慧敏
赵银峰
马兰
孙承坤
陈树鹏
刘红侠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202210182234.7A
Publication of CN114627885A
Legal status: Pending

Classifications

    • G10L 17/26: Recognition of special voice characteristics, e.g. for use in lie detectors; recognition of animal voices
    • G06F 18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/24: Classification techniques
    • G06N 3/045: Neural network architectures; combinations of networks
    • G06N 3/08: Neural network learning methods
    • G10L 17/02: Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L 17/04: Training, enrolment or model building
    • G10L 17/18: Artificial neural networks; connectionist approaches
    • G10L 25/18: Speech or voice analysis in which the extracted parameters are spectral information of each sub-band
    • G10L 25/51: Speech or voice analysis specially adapted for comparison or discrimination
    • G06F 2218/02: Signal-processing pattern recognition; preprocessing
    • G06F 2218/12: Signal-processing pattern recognition; classification and matching

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The invention relates to the field of natural language processing, and in particular to a small sample data set musical instrument identification method based on the ASRT algorithm. The model structure draws on VGG, the network architecture with the best results in image recognition; it has strong expressive power, can take very long past and future context into account, and is more robust than an RNN. At the output end it combines seamlessly with a CTC scheme to train the whole model end to end, transcribing the sound waveform signal directly into the instrument's waveform, from which the instrument is finally judged and the predicted instrument type is output.

Description

Small sample data set musical instrument identification method based on ASRT algorithm
Technical Field
The invention relates to the field of natural language processing, in particular to a small sample data set musical instrument identification method based on an ASRT algorithm.
Background
With the development of artificial intelligence, the emergence of the convolutional neural network (CNN) and connectionist temporal classification (CTC) methods, and the rapid progress of deep neural networks, traditional hand-designed clustering approaches can no longer meet people's needs, and artificial intelligence is applied ever more widely in natural language processing. Instrument recognition has long been a neglected topic, yet people sometimes need to tell instrument types apart, especially in a complete piece of music, where non-professionals and even professionals can hardly distinguish which instruments are being used. A tool is therefore needed to judge the type of a sound.
ASRT is a deep-learning-based Chinese speech recognition system implemented with TensorFlow.
Disclosure of Invention
Aiming at the problems in the prior art, the invention aims to provide a small sample data set musical instrument identification method based on an ASRT algorithm, which can effectively identify the musical instrument type in audio data.
In order to achieve the purpose, the invention is realized by adopting the following technical scheme.
A small sample data set musical instrument identification method based on an ASRT algorithm comprises the following steps:
step 1, obtaining a sample set;
step 2, preprocessing the samples in the sample set, and training the musical instrument recognition model with the preprocessed samples to obtain a trained musical instrument recognition model;
step 3, preprocessing the audio file whose musical instrument type is to be identified, and inputting the preprocessed audio file into the trained musical instrument recognition model to obtain the musical instrument type contained in the audio file.
Compared with the prior art, the invention has the following beneficial effects: the model structure draws on VGG, the network architecture with the best results in image recognition; it has strong expressive power, can take very long past and future context into account, and is more robust than a recurrent neural network (RNN). At the output end it combines seamlessly with a CTC scheme to train the whole model end to end, transcribing the sound waveform signal directly into the instrument's waveform, from which the instrument is finally judged and the predicted instrument type is output.
Drawings
The invention is described in further detail below with reference to the figures and the specific embodiments.
FIG. 1 is a schematic diagram of an instrument recognition model according to the present invention;
FIG. 2 is a schematic structural diagram of the instrument recognition model of the present invention during training.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to examples, but it will be understood by those skilled in the art that the following examples are only illustrative of the present invention and should not be construed as limiting the scope of the present invention.
A small sample data set musical instrument identification method based on an ASRT algorithm comprises the following steps:
step 1, obtaining a sample set;
specifically, a plurality of audio files are obtained from the existing database, and all the audio files are converted into 1600hz and wav formats; wherein, the audio file is an audio file with only single musical instrument sound or an audio file with a plurality of musical instrument sounds;
a converted audio file, together with the types of the instruments it contains, is used as one sample; all samples form the sample set.
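As a rough illustration of this conversion step, the following is a minimal Python sketch; the librosa and soundfile packages are assumptions (the patent does not name the tools used), and the 1600 Hz figure is taken verbatim from the text:

    import librosa
    import soundfile as sf

    def convert_audio(in_path: str, out_path: str, target_sr: int = 1600) -> None:
        """Decode a supported audio file, resample it, and save it as mono 16-bit WAV."""
        y, _ = librosa.load(in_path, sr=target_sr, mono=True)  # decode and resample
        sf.write(out_path, y, target_sr, subtype="PCM_16")     # write the WAV file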
Step 2, preprocess the samples in the sample set, and train the musical instrument recognition model with the preprocessed samples to obtain a trained musical instrument recognition model;
Substep 2.1, preprocess the samples in the sample set: first, parse the audio file in a sample to obtain the data it contains; then convert the data, through framing, windowing, and similar operations, into a two-dimensional spectrum image, i.e. a spectrogram;
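The framing-and-windowing operation can be illustrated with a short Python sketch; the Hamming window and the frame and hop lengths below are the sketch's own assumptions, since the patent does not specify them:

    import numpy as np

    def to_spectrogram(signal: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
        """Frame, window, and FFT a 1-D waveform into a log-magnitude spectrogram."""
        window = np.hamming(frame_len)                    # Hamming window (assumed)
        n_frames = 1 + (len(signal) - frame_len) // hop   # number of full frames
        frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                           for i in range(n_frames)])
        magnitude = np.abs(np.fft.rfft(frames, axis=1))   # per-frame spectrum
        return np.log(magnitude + 1e-10)                  # log compression into a 2-D image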
Substep 2.2, use the spectrogram as the network input to the musical instrument recognition model, and use the instrument types in the sample as the network labels; the musical instrument recognition model outputs the instrument type contained in the audio file corresponding to the spectrogram;
Specifically, referring to fig. 1, the musical instrument recognition model comprises, connected in sequence, convolutional layers and pooling layers, a reshape (matrix recombination) layer, fully-connected layers, and a final Activation function layer;
wherein the convolution kernel size of the first convolutional layer is 1 x 3 x 32, of the second 32 x 3 x 32, of the third 64 x 3 x 32, of the fourth 128 x 3 x 32, of the fifth 256 x 3 x 32, and of the sixth 512 x 3 x 32; after each convolution, a ReLU activation function is applied.
Each time data is fed in, padding is applied once: the matrix is enlarged by 2 in each of the four directions (up, down, left, right) and the new entries are filled with 0; the pooling size pool_size is 2, and the padding mode used is 'same'.
The convolutional network is built on the Keras and TensorFlow frameworks, using this VGG-inspired deep convolutional neural network as the network model.
Substep 2.3, iteratively update the parameters of the musical instrument recognition model using the regression loss function to obtain the trained musical instrument recognition model.
Referring to fig. 2, during training the musical instrument recognition model adds Dropout layers between the first and second convolutional layers, after each pooling layer, between the Reshape layer and the first fully-connected layer, and in each fully-connected stage; a Dropout layer randomly deactivates part of the network's neurons, which prevents the musical instrument recognition model from overfitting during training.
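Putting together the layer sequence of fig. 1, the kernel and pooling settings above, and the Dropout placement of fig. 2, a hedged Keras sketch of the model might look as follows; the input shape, class count, Dropout rates, optimizer, and loss are assumptions rather than values taken from the patent:

    from tensorflow import keras
    from tensorflow.keras import layers

    NUM_CLASSES = 11  # assumed: one class per instrument label

    def build_model(input_shape=(200, 200, 1)):
        """VGG-style stack of Conv/Pool/Dropout blocks ending in Dense + softmax."""
        model = keras.Sequential()
        model.add(keras.Input(shape=input_shape))
        for filters in (32, 32, 64, 128, 256, 512):  # filter counts from the text
            model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
            model.add(layers.MaxPooling2D(pool_size=2, padding="same"))
            model.add(layers.Dropout(0.2))           # fig. 2: Dropout after each pooling layer (rate assumed)
        model.add(layers.Flatten())                  # the patent's reshape (matrix recombination) layer
        model.add(layers.Dense(128, activation="relu"))
        model.add(layers.Dropout(0.3))               # fig. 2: Dropout in the fully-connected stage (rate assumed)
        model.add(layers.Dense(NUM_CLASSES, activation="softmax"))  # final Activation
        # Optimizer and loss are assumptions; the text says only that a
        # regression-style loss function drives the iterative update.
        model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
        return model

The 'same' padding in the sketch stands in for the explicit zero-padding of the matrix described above; the two are equivalent in intent, keeping feature-map sizes stable through each convolution.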
Step 3, preprocess the audio file whose musical instrument type is to be identified, and input the preprocessed audio file into the trained musical instrument recognition model to obtain the musical instrument type contained in the audio file.
Specifically, the audio file whose instrument type is to be identified is preprocessed as follows: first, convert the audio file to a 1600 Hz sampling rate in WAV format; then parse the converted audio file to obtain the data it contains; finally, convert the data, through framing, windowing, and similar operations, into a two-dimensional spectrum image, i.e. a spectrogram.
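A hedged sketch of this inference step is given below; it reuses the to_spectrogram helper sketched in substep 2.1, and the saved-model file name and the label list (patterned on the IRMAS abbreviations in the experiment below) are illustrative assumptions:

    import librosa
    import numpy as np
    from tensorflow import keras

    # Label list patterned on the IRMAS abbreviations used in the experiment
    # below; the ordering is an assumption.
    LABELS = ["cel", "cla", "flu", "gac", "gel", "org", "pia", "sax", "tru", "vio", "voi"]

    def identify(wav_path: str, model_path: str = "instrument_model.h5") -> None:
        model = keras.models.load_model(model_path)             # hypothetical saved model
        signal, _ = librosa.load(wav_path, sr=1600, mono=True)  # 1600 Hz, as in the text
        spec = to_spectrogram(signal)                           # sketch from substep 2.1
        # A real pipeline would crop or pad spec to the model's fixed input
        # shape; that bookkeeping is omitted here for brevity.
        probs = model.predict(spec[np.newaxis, :, :, np.newaxis])[0]
        for name, p in sorted(zip(LABELS, probs), key=lambda t: -t[1]):
            print(f"{name}: {p:.6%}")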
In addition, HTTP-based server software is provided: built on Python's basic HTTP-protocol server package, it exposes a recognition API over the HTTP protocol, so that an API server can easily be set up and client software can send API requests over the network to perform the instrument identification function.
API interface based on HTTP protocol:
the project uses an http server packet built in Python to realize a basic API server based on an http protocol. By using the server program, a simple API server can be directly realized, and data interaction between a user and the server is carried out in a POST mode.
The following table is a list of POST parameters:
[Table of POST parameters: rendered as an image in the original publication and not recoverable here]
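Since the parameter table is not recoverable, the following is only a minimal sketch of such a server using Python's built-in http.server package; the JSON response fields and the port are hypothetical:

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class RecognitionHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            payload = self.rfile.read(length)  # raw POST body (audio bytes, assumed)
            # Placeholder result; a real server would run the recognition model here.
            result = {"instrument": "unknown", "probability": 0.0}
            body = json.dumps(result).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 20000), RecognitionHandler).serve_forever()  # port assumed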
Simulation experiment
The trained instrument recognition model is used to identify 100 pieces of audio data acquired from the external IRMAS website, predicting the probability of each instrument type contained in each piece of audio data.
Results of the experiment
For one clip chosen at random from the 100 pieces of audio data, the prediction result is: a 0.0115% probability of containing flute, 21.4380% of containing electric guitar (gel), 6.0032% of containing piano, 0.0000767% of containing violin, 0.0000923% of containing trumpet, 0.7711% of containing acoustic guitar (gac), 0.0475% of containing saxophone, 18.4390% of containing organ, 0.0002497% of containing cello, 53.2890% of containing human voice (voi), 0.0000048% of containing clarinet, and 0.00000005% of containing other instruments.
Over the 100 pieces of audio data, the accuracy of the single most likely predicted instrument being the correct one (top-1) is 74%;
over the 100 pieces of audio data, the accuracy of the correct instruments appearing among the five most likely predictions (top-5) is 92.5%;
over the 100 pieces of audio data, the overall prediction accuracy reaches 99%, which shows that this musical instrument identification method can accurately identify the instruments in audio.
Although the present invention has been described in detail in this specification with reference to specific illustrative embodiments, it will be apparent to those skilled in the art that modifications and improvements can be made on the basis of it. Accordingly, all such modifications and alterations are intended to fall within the scope of the invention as defined by the appended claims.

Claims (4)

1. A small sample data set musical instrument identification method based on an ASRT algorithm is characterized by comprising the following steps:
step 1, obtaining a sample set;
step 2, preprocessing the samples in the sample set, and training the musical instrument recognition model with the preprocessed samples to obtain a trained musical instrument recognition model;
and step 3, preprocessing the audio file whose musical instrument type is to be identified, and inputting the preprocessed audio file into the trained musical instrument recognition model to obtain the musical instrument type contained in the audio file.
2. The ASRT-algorithm-based small sample data set musical instrument identification method according to claim 1, wherein step 1 specifically comprises: obtaining a plurality of audio files from an existing database and converting all of them to a 1600 Hz sampling rate in WAV format, each audio file containing either the sound of a single instrument or the sounds of several instruments; using a converted audio file, together with the types of the instruments it contains, as one sample; and using all samples as the sample set.
3. The ASRT-algorithm-based small sample data set musical instrument identification method according to claim 1, characterised in that the substeps of step 2 are as follows:
substep 2.1, preprocessing the samples in the sample set: first parsing the audio file in a sample to obtain the data it contains, then converting the data, through framing, windowing, and similar operations, into a two-dimensional spectrum image, i.e. a spectrogram;
substep 2.2, using the spectrogram as the network input to the musical instrument recognition model, and using the instrument types in the sample as the network labels, the musical instrument recognition model outputting the instrument type contained in the audio file corresponding to the spectrogram;
and substep 2.3, iteratively updating the parameters of the musical instrument recognition model using the regression loss function to obtain the trained musical instrument recognition model.
4. The ASRT-algorithm-based small sample data set musical instrument identification method according to claim 1, wherein the musical instrument recognition model in step 2 comprises, connected in sequence, convolutional layers and pooling layers, a reshape (matrix recombination) layer, fully-connected layers, and a final Activation function layer;
wherein the convolution kernel size of the first convolutional layer is 1 x 3 x 32, of the second 32 x 3 x 32, of the third 64 x 3 x 32, of the fourth 128 x 3 x 32, of the fifth 256 x 3 x 32, and of the sixth 512 x 3 x 32; after each convolution, a ReLU activation function is applied.
CN202210182234.7A 2022-02-25 2022-02-25 Small sample data set musical instrument identification method based on ASRT algorithm Pending CN114627885A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210182234.7A CN114627885A (en) 2022-02-25 2022-02-25 Small sample data set musical instrument identification method based on ASRT algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210182234.7A CN114627885A (en) 2022-02-25 2022-02-25 Small sample data set musical instrument identification method based on ASRT algorithm

Publications (1)

Publication Number Publication Date
CN114627885A 2022-06-14

Family

ID=81901000

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210182234.7A Pending CN114627885A (en) 2022-02-25 2022-02-25 Small sample data set musical instrument identification method based on ASRT algorithm

Country Status (1)

Country Link
CN (1) CN114627885A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024048492A1 (en) * 2022-08-30 2024-03-07 ヤマハ株式会社 Musical instrument identifying method, musical instrument identifying device, and musical instrument identifying program


Similar Documents

Publication Publication Date Title
CN109817213B (en) Method, device and equipment for performing voice recognition on self-adaptive language
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
WO2021232725A1 (en) Voice interaction-based information verification method and apparatus, and device and computer storage medium
CN110211565B (en) Dialect identification method and device and computer readable storage medium
CN1169115C (en) Prosodic databases holding fundamental frequency templates for use in speech synthesis
US6910012B2 (en) Method and system for speech recognition using phonetically similar word alternatives
WO2019109787A1 (en) Audio classification method and apparatus, intelligent device, and storage medium
CN113836277A (en) Machine learning system for digital assistant
WO2020006898A1 (en) Method and device for recognizing audio data of instrument, electronic apparatus, and storage medium
JP7266683B2 (en) Information verification method, apparatus, device, computer storage medium, and computer program based on voice interaction
US11011160B1 (en) Computerized system for transforming recorded speech into a derived expression of intent from the recorded speech
CN112712813B (en) Voice processing method, device, equipment and storage medium
WO2023245389A1 (en) Song generation method, apparatus, electronic device, and storage medium
WO2019137392A1 (en) File classification processing method and apparatus, terminal, server, and storage medium
CN106295717A (en) A kind of western musical instrument sorting technique based on rarefaction representation and machine learning
WO2023221345A1 (en) Emotional speech synthesis method and apparatus
Mahanta et al. Deep neural network for musical instrument recognition using MFCCs
CN114627885A (en) Small sample data set musical instrument identification method based on ASRT algorithm
JP5068225B2 (en) Audio file search system, method and program
CN115273826A (en) Singing voice recognition model training method, singing voice recognition method and related device
Qiu et al. A Voice Cloning Method Based on the Improved HiFi‐GAN Model
CN115359775A (en) End-to-end tone and emotion migration Chinese voice cloning method
CN113470612A (en) Music data generation method, device, equipment and storage medium
JP7376896B2 (en) Learning device, learning method, learning program, generation device, generation method, and generation program
JP7376895B2 (en) Learning device, learning method, learning program, generation device, generation method, and generation program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination