CN113129871A - Music emotion recognition method and system based on audio signal and lyrics - Google Patents

Music emotion recognition method and system based on audio signal and lyrics

Info

Publication number
CN113129871A
Authority
CN
China
Prior art keywords: lyric, neural network, convolutional neural, audio, audio signal
Prior art date: 2021-03-26
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110328406.2A
Other languages
Chinese (zh)
Inventor
李风环
李轶
田春晖
徐宏杰
张健炜
符善森
黎其钻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2021-03-26
Publication date: 2021-07-16
2021-03-26: Application filed by Guangdong University of Technology
2021-03-26: Priority to CN202110328406.2A
2021-07-16: Publication of CN113129871A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The invention provides a music emotion recognition method and system based on audio signals and lyrics, addressing the problems that existing music emotion recognition methods consider only a single factor and achieve low recognition accuracy.

Description

Music emotion recognition method and system based on audio signal and lyrics
Technical Field
The invention relates to the technical field of music emotion recognition, in particular to a music emotion recognition method and system based on audio signals and lyrics.
Background
With the development of music and technology, music emotion recognition systems are actively used for various purposes, including personal music collections, music recommendation systems, and music therapy for emotional disorders. Analyzing the emotional content of music is a cross-disciplinary study involving not only signal processing and machine learning but also auditory perception, psychology, cognitive science, and musicology.
Existing music emotion recognition methods first extract acoustic features, such as rhythm and tone, from the musical audio content, and then apply different machine learning algorithms to model the relationship between the extracted features and preset emotion labels.
Chinese patent CN111326178A, published on 23 June 2020, discloses a multi-modal speech emotion recognition system and method based on a convolutional neural network, in which speech signals are processed by a convolutional neural network so that emotion information in speech is recognized from extracted speech features, improving the accuracy of analysis and recognition to a certain extent. However, it ignores the role of lyrics in music emotion recognition, so the accuracy of music emotion recognition still needs to be improved, and research on music emotion recognition methods that combine audio signals and music lyrics is lacking.
Disclosure of Invention
In order to solve the problems that existing music emotion recognition methods consider only a single factor and achieve low recognition accuracy, the invention provides a music emotion recognition method and system based on audio signals and lyrics, in which emotion is recognized by combining the audio signals and the lyrics, thereby improving the accuracy of music emotion recognition.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
A music emotion recognition method based on audio signals and lyrics comprises at least the following steps:
s1, obtaining an audio data sample and a lyric data sample of music to be recognized, and respectively preprocessing the audio data sample and the lyric data sample to obtain an audio signal and a lyric vector;
s2, respectively constructing and training a first convolutional neural network for extracting audio signal characteristics and a second convolutional neural network for extracting lyric vector characteristics;
s3, inputting the preprocessed audio data sample into a first convolutional neural network, inputting the preprocessed lyric data sample into a second convolutional neural network, and serially connecting and fusing the output end of the first convolutional neural network and the output end of the second convolutional neural network by using a fusion module;
S4, the fusion module outputs the fused audio signal features and lyric vector features to the fully connected layer module for analysis and processing, and the fully connected layer module outputs the music emotion recognition result, as sketched below.
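To make the data flow concrete, the following minimal sketch traces steps S1 to S4 at inference time. It is a sketch only: every name and the stand-in callables are hypothetical, not identifiers from the invention, and the real branches are the convolutional networks described below.

import numpy as np

def recognize_music_emotion(mel_audio, lyric_vec, audio_cnn, lyric_cnn, head):
    """S3-S4: run both trained branches, fuse serially, then classify."""
    audio_feat = audio_cnn(mel_audio)                 # audio signal features
    lyric_feat = lyric_cnn(lyric_vec)                 # lyric vector features
    fused = np.concatenate([audio_feat, lyric_feat])  # serial fusion (S3)
    return head(fused)                                # fully connected module (S4)

# Toy stand-ins; S1 would supply a real mel spectrogram and lyric matrix
mel = np.random.rand(40, 128)   # placeholder 2-D mel spectrogram
lyr = np.random.rand(100, 50)   # placeholder lyric vectors (100-dim x 50 words)
label = recognize_music_emotion(
    mel, lyr,
    audio_cnn=lambda x: x.mean(axis=1),   # stand-in for the first CNN
    lyric_cnn=lambda x: x.mean(axis=1),   # stand-in for the second CNN
    head=lambda z: int(z.argmax()))       # stand-in for the FC module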
Preferably, before the audio data samples and the lyric data samples are respectively preprocessed, three additional sample segments are derived from each audio data sample and each lyric data sample using pitch shifting and lossy encoding; the number of audio data samples and lyric data samples is then expanded with this data augmentation technique, improving the richness of the samples.
Preferably, in step S1, when the audio data samples are preprocessed, each audio data sample is converted into a two-dimensional mel spectrogram audio signal, and the conversion process is as follows:
determining the number M of mel filters and the Hann window length N in audio samples;
setting a sampling frequency f;
the audio data samples are input to the mel filter bank and converted into a two-dimensional mel spectrogram audio signal.
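For reference, the M mel filters are triangular filters spaced on the mel scale, which maps a frequency f in hertz to m = 2595 log10(1 + f/700) mels; applying the filter bank to each length-N Hann-windowed frame yields one column of the two-dimensional mel spectrogram.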
Preferably, in step S1, when the lyric data samples are preprocessed, word segments are extracted from each track, and the lyric data samples are made into lyric vectors using K-dimensional vectors.
Preferably, the length-N Hann windows applied before the M mel filters do not overlap.
Preferably, the first convolutional neural network of step S2 comprises a first convolutional layer and a second convolutional layer connected in sequence; both are one-dimensional, with 32 and 16 feature maps respectively of kernel size 8 and stride 1, each followed by a pooling layer of size 4 and stride 4.
Preferably, the second convolutional neural network of step S2 includes a third convolutional layer and an LSTM layer connected in sequence, where the third convolutional layer is a one-dimensional convolutional layer structure.
Preferably, the input of the first convolutional neural network is the audio signal and its output is the audio signal features, with the network parameters of the trained first convolutional neural network obtained by gradient descent; the input of the second convolutional neural network is the lyric vector and its output is the lyric vector features, with the network parameters of the trained second convolutional neural network likewise obtained by gradient descent.
Preferably, the fully connected layer module comprises a first fully connected layer and a second fully connected layer; the input of the first fully connected layer is connected to the common output of the fusion module, and the output of the first fully connected layer is connected to the input of the second fully connected layer; the two fully connected layers jointly predict from the fused audio signal and lyric vector features, and the output of the second fully connected layer gives the music emotion recognition result.
The invention also provides a multimodal music emotion recognition system based on audio signals and lyrics, used for implementing the above music emotion recognition method based on audio signals and lyrics; the system comprises:
the music data acquisition module comprises an audio data acquisition module and a lyric data acquisition module; the audio data acquisition module is used for acquiring an audio data sample of music to be recognized, and the lyric data acquisition module is used for acquiring a lyric data sample of the music to be recognized;
the preprocessing module is used for respectively preprocessing the audio data samples and the lyric data samples to obtain audio signals and lyric vectors;
the feature extraction module is used for extracting the audio signal features and the lyric vector features;
the fusion module is used for fusing the audio signal features and the lyric vector features extracted by the feature extraction module;
and the fully connected layer module is used for receiving the fused audio signal features and lyric vector features output by the fusion module, analyzing and predicting from them, and outputting the music emotion recognition result.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention provides a music emotion recognition method and a music emotion recognition system based on audio signals and lyrics.
Drawings
Fig. 1 is a schematic flow chart of a music emotion recognition method based on audio signals and lyrics according to an embodiment of the present invention;
FIG. 2 is a block diagram of an overall neural network framework for multi-modal music emotion recognition based on audio signals and lyrics in an embodiment of the present invention;
fig. 3 is a system diagram of multimodal music emotion recognition based on audio signals and lyrics according to an embodiment of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for better illustration of the present embodiment, certain parts of the drawings may be omitted, enlarged or reduced, and do not represent actual dimensions;
it will be understood by those skilled in the art that certain well-known descriptions of the figures may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Examples
The positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
fig. 1 is a schematic flow chart of a music emotion recognition method based on audio signals and lyrics, and referring to fig. 1, the method includes:
s1, obtaining an audio data sample and a lyric data sample of music to be recognized, and respectively preprocessing the audio data sample and the lyric data sample to obtain an audio signal and a lyric vector;
s2, respectively constructing and training a first convolutional neural network for extracting audio signal characteristics and a second convolutional neural network for extracting lyric vector characteristics;
s3, inputting the preprocessed audio data sample into a first convolutional neural network, inputting the preprocessed lyric data sample into a second convolutional neural network, and serially connecting and fusing the output end of the first convolutional neural network and the output end of the second convolutional neural network by using a fusion module;
S4, the fusion module outputs the fused audio signal features and lyric vector features to the fully connected layer module for analysis and processing, and the fully connected layer module outputs the music emotion recognition result.
In this embodiment, when the audio data samples and the lyric data samples are obtained in step S1, 30-second segments are extracted from the audio, seven segments per track, sampled uniformly across each song. Before the audio data samples and the lyric data samples are respectively preprocessed, pitch shifting and lossy encoding are used to derive three additional sample segments from each one; in a specific implementation, this increases the size of the whole training set by about 21 times. Finally, the number of audio data samples and lyric data samples is expanded with this data augmentation technique, as sketched below.
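A minimal sketch of the audio-side segmentation and augmentation follows. Python with librosa is an assumed toolchain, and the pitch-shift offsets of -1, +1 and +2 semitones are illustrative; the patent states only that pitch shifting and lossy encoding produce three additional segments per sample.

import librosa

def augment_track(path, sr=44100, seg_dur=30.0, n_segments=7):
    y, _ = librosa.load(path, sr=sr)
    seg_len = int(seg_dur * sr)
    # Seven 30-second segments spaced uniformly across the song
    hop = max((len(y) - seg_len) // max(n_segments - 1, 1), 1)
    segments = [y[i * hop : i * hop + seg_len] for i in range(n_segments)]
    clips = []
    for seg in segments:
        clips.append(seg)  # the original segment
        # Three additional variants per segment; a lossy re-encoding
        # round trip (e.g. via MP3) could replace any of these shifts
        for steps in (-1, 1, 2):
            clips.append(librosa.effects.pitch_shift(seg, sr=sr, n_steps=steps))
    return clips  # 7 segments x 4 variants per track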
When the audio data samples are preprocessed, each is converted into a two-dimensional mel spectrogram audio signal, and the conversion process is as follows:
determining the number M of mel filters and the Hann window length N in audio samples;
setting a sampling frequency f;
the audio data samples are input to a mel filter and converted into a two-dimensional mel-frequency spectrogram audio signal.
When the lyric data samples are preprocessed, word segments are extracted from each track and the lyric data samples are made into lyric vectors using K-dimensional vectors. Specifically, in this embodiment the input lyrics are represented by 100-dimensional vectors, and seven segments of 50 words each are extracted from each track by data expansion to produce the lyric vectors, as sketched below.
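A minimal sketch of this lyric-side preprocessing follows. The embedding table is an assumed input (the patent does not name an embedding model), and the uniform segment spacing mirrors the audio-side extraction.

import numpy as np

def lyrics_to_vectors(lyrics, embeddings, k=100, seg_words=50, n_segments=7):
    """lyrics: full lyric text; embeddings: dict mapping word -> (k,) array."""
    words = lyrics.split()
    span = max(len(words) - seg_words, 0)
    segments = []
    for s in range(n_segments):
        start = (s * span) // max(n_segments - 1, 1)   # uniform spacing
        mat = np.zeros((seg_words, k))                 # zero-pads short segments
        for i, w in enumerate(words[start:start + seg_words]):
            mat[i] = embeddings.get(w, np.zeros(k))    # unknown word -> zeros
        segments.append(mat.T)                         # shape (k, seg_words) = (100, 50)
    return segments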
The length-N Hann windows applied before the M mel filters do not overlap. In this embodiment, the number of mel filters M is 40, the Hann window length N is 1024 audio samples with no overlap between windows, and the sampling frequency f is 44.1 kHz; a sketch of the conversion follows.
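Under these parameters, the conversion can be sketched as follows. librosa is an assumed library choice (its short-time Fourier transform uses a Hann window by default), and the final log scaling is common practice rather than a step the patent states.

import librosa

def to_mel_spectrogram(path, sr=44100, n_mels=40, win=1024):
    y, _ = librosa.load(path, sr=sr)
    # hop_length == win_length makes the Hann windows non-overlapping
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=win,
                                         win_length=win, hop_length=win,
                                         n_mels=n_mels)
    return librosa.power_to_db(mel)  # shape: (40, number of frames)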
In this embodiment, the first convolutional neural network described in step S2 comprises a first convolutional layer and a second convolutional layer connected in sequence; both are one-dimensional, with 32 and 16 feature maps respectively of kernel size 8 and stride 1, each followed by a pooling layer of size 4 and stride 4. The second convolutional neural network comprises a third convolutional layer and an LSTM layer connected in sequence, the third convolutional layer being a one-dimensional convolutional layer.
During training, the audio data samples and lyric data samples of the music to be recognized obtained in step S1 are divided into a test set and a training set and then preprocessed respectively. The input of the first convolutional neural network is the audio signal and its output is the audio signal features; the network parameters of the trained first convolutional neural network are obtained by gradient descent. The input of the second convolutional neural network is the lyric vector and its output is the lyric vector features; the network parameters of the trained second convolutional neural network are likewise obtained by gradient descent. The input size of the lyric network is 100 x 50.
The fully connected layer module comprises a first fully connected layer and a second fully connected layer. The input of the first fully connected layer is connected to the common output of the fusion module, and the output of the first fully connected layer is connected to the input of the second fully connected layer; the two fully connected layers jointly predict from the fused audio signal and lyric vector features, and the output of the second fully connected layer gives the music emotion recognition result. The overall neural network framework for multimodal music emotion recognition based on audio signals and lyrics is shown in fig. 2, and a sketch of one possible realization follows.
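For concreteness, the following PyTorch sketch instantiates one reading of this architecture. PyTorch itself, the lyric-branch kernel size, the global pooling that flattens the audio branch, the LSTM and fully connected widths, and the four-class output are assumptions not fixed by the patent; the convolution and pooling sizes follow the figures given above.

import torch
import torch.nn as nn

class MusicEmotionNet(nn.Module):
    def __init__(self, n_mels=40, lyric_dim=100, n_classes=4):
        super().__init__()
        # First CNN (audio branch): two 1-D conv layers over the mel
        # spectrogram, 32 then 16 feature maps, kernel 8, stride 1,
        # each followed by max pooling of size 4, stride 4
        self.audio_branch = nn.Sequential(
            nn.Conv1d(n_mels, 32, kernel_size=8, stride=1), nn.ReLU(),
            nn.MaxPool1d(kernel_size=4, stride=4),
            nn.Conv1d(32, 16, kernel_size=8, stride=1), nn.ReLU(),
            nn.MaxPool1d(kernel_size=4, stride=4),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),   # -> (batch, 16)
        )
        # Second CNN (lyric branch): a third 1-D conv layer, then an LSTM
        self.lyric_conv = nn.Sequential(
            nn.Conv1d(lyric_dim, 32, kernel_size=3, stride=1), nn.ReLU())
        self.lyric_lstm = nn.LSTM(32, 32, batch_first=True)
        # Fusion is serial concatenation of the branch outputs, followed
        # by the first and second fully connected layers
        self.head = nn.Sequential(
            nn.Linear(16 + 32, 64), nn.ReLU(),
            nn.Linear(64, n_classes))

    def forward(self, mel, lyric):
        a = self.audio_branch(mel)                 # audio signal features
        seq = self.lyric_conv(lyric)               # (batch, 32, T')
        _, (h, _) = self.lyric_lstm(seq.transpose(1, 2))
        fused = torch.cat([a, h[-1]], dim=1)       # serial fusion
        return self.head(fused)                    # emotion prediction

# Toy usage; the lyric input size 100 x 50 follows the embodiment
model = MusicEmotionNet()
mel = torch.randn(2, 40, 1292)    # batch of 2-D mel spectrograms
lyr = torch.randn(2, 100, 50)     # batch of lyric vectors
print(model(mel, lyr).shape)      # torch.Size([2, 4])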
Referring to fig. 3, the present invention further provides a multimodal music emotion recognition system based on audio signals and lyrics, used to implement the music emotion recognition method based on audio signals and lyrics described above; the system comprises:
the music data acquisition module comprises an audio data acquisition module and a lyric data acquisition module; the audio data acquisition module is used for acquiring an audio data sample of music to be recognized, and the lyric data acquisition module is used for acquiring a lyric data sample of the music to be recognized;
the preprocessing module is used for respectively preprocessing the audio data samples and the lyric data samples to obtain audio signals and lyric vectors;
the feature extraction module is used for extracting the audio signal features and the lyric vector features;
the fusion module is used for fusing the audio signal features and the lyric vector features extracted by the feature extraction module; in a specific implementation, a fusion module compatible with the first and second convolutional neural networks is selected according to their specific components and their respective final output ends.
And the fully connected layer module is used for receiving the fused audio signal features and lyric vector features output by the fusion module, analyzing and predicting from them, and outputting the music emotion recognition result.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaustively list all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall be included in the protection scope of the claims of the present invention.

Claims (10)

1. A music emotion recognition method based on audio signals and lyrics, characterized by comprising at least the following steps:
s1, obtaining an audio data sample and a lyric data sample of music to be recognized, and respectively preprocessing the audio data sample and the lyric data sample to obtain an audio signal and a lyric vector;
s2, respectively constructing and training a first convolutional neural network for extracting audio signal characteristics and a second convolutional neural network for extracting lyric vector characteristics;
s3, inputting the preprocessed audio data sample into a first convolutional neural network, inputting the preprocessed lyric data sample into a second convolutional neural network, and serially connecting and fusing the output end of the first convolutional neural network and the output end of the second convolutional neural network by using a fusion module;
S4, the fusion module outputs the fused audio signal features and lyric vector features to the fully connected layer module for analysis and processing, and the fully connected layer module outputs a music emotion recognition result.
2. The method of claim 1, wherein before the audio data samples and the lyric data samples are respectively preprocessed, three additional sample segments are derived from each audio data sample and each lyric data sample using pitch shifting and lossy encoding, and the number of audio data samples and lyric data samples is finally expanded using data augmentation techniques.
3. The method for music emotion recognition based on audio signals and lyrics of claim 2, wherein in step S1, when the audio data samples are preprocessed, each audio data sample is converted into a two-dimensional mel spectrogram audio signal, the conversion process being as follows:
determining the number M of mel filters and the Hann window length N in audio samples;
setting a sampling frequency f;
the audio data samples are input to a mel filter and converted into a two-dimensional mel-frequency spectrogram audio signal.
4. The method for music emotion recognition based on audio signals and lyrics of claim 3, wherein in step S1, when the lyric data samples are preprocessed, word segments are extracted from each track, and the lyric data samples are made into lyric vectors using K-dimensional vectors.
5. The method of claim 4, wherein the length-N Hann windows applied before the M mel filters do not overlap.
6. The method of claim 5, wherein in step S2 the first convolutional neural network comprises a first convolutional layer and a second convolutional layer connected in sequence, both one-dimensional, with 32 and 16 feature maps respectively of kernel size 8 and stride 1, each followed by a pooling layer of size 4 and stride 4.
7. The method for music emotion recognition based on audio signal and lyrics of claim 6, wherein the second convolutional neural network of step S2 includes a third convolutional layer and an LSTM layer connected in sequence, and the third convolutional layer is a one-dimensional convolutional layer structure.
8. The method for music emotion recognition based on audio signals and lyrics of claim 7, wherein the input of the first convolutional neural network is the audio signal and its output is the audio signal features, the network parameters of the trained first convolutional neural network being obtained by gradient descent; and the input of the second convolutional neural network is the lyric vector and its output is the lyric vector features, the network parameters of the trained second convolutional neural network being obtained by gradient descent.
9. The method of claim 8, wherein the fully connected layer module comprises a first fully connected layer and a second fully connected layer, an input of the first fully connected layer is connected to the common output of the fusion module, an output of the first fully connected layer is connected to an input of the second fully connected layer, the first fully connected layer and the second fully connected layer jointly predict from the fused audio signal features and lyric vector features, and an output of the second fully connected layer outputs the music emotion recognition result.
10. A multimodal music emotion recognition system based on audio signals and lyrics, characterized in that the system is used for implementing the music emotion recognition method based on audio signals and lyrics according to any one of claims 1 to 9, the system comprising:
the music data acquisition module comprises an audio data acquisition module and a lyric data acquisition module; the audio data acquisition module is used for acquiring an audio data sample of music to be recognized, and the lyric data acquisition module is used for acquiring a lyric data sample of the music to be recognized;
the preprocessing module is used for respectively preprocessing the audio data samples and the lyric data samples to obtain audio signals and lyric vectors;
the feature extraction module is used for extracting the audio signal features and the lyric vector features;
the fusion module is used for fusing the audio signal features and the lyric vector features extracted by the feature extraction module;
and the fully connected layer module is used for receiving the fused audio signal features and lyric vector features output by the fusion module, analyzing and predicting from them, and outputting the music emotion recognition result.
Application CN202110328406.2A, priority date 2021-03-26, filed 2021-03-26: Music emotion recognition method and system based on audio signal and lyrics. Published as CN113129871A; status: Pending.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110328406.2A | 2021-03-26 | 2021-03-26 | Music emotion recognition method and system based on audio signal and lyrics


Publications (1)

Publication Number | Publication Date
CN113129871A | 2021-07-16

Family

Family ID: 76773908

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202110328406.2A (pending, published as CN113129871A) | Music emotion recognition method and system based on audio signal and lyrics | 2021-03-26 | 2021-03-26

Country Status (1)

Country | Publications
CN | CN113129871A


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN101599271A * | 2009-07-07 | 2009-12-09 | 华中科技大学 (Huazhong University of Science and Technology) | A kind of recognition methods of digital music emotion
US20140058735A1 * | 2012-08-21 | 2014-02-27 | David A. Sharp | Artificial Neural Network Based System for Classification of the Emotional Content of Digital Music
CN106228977A * | 2016-08-02 | 2016-12-14 | 合肥工业大学 (Hefei University of Technology) | The song emotion identification method of multi-modal fusion based on degree of depth study
CN108648767A * | 2018-04-08 | 2018-10-12 | 中国传媒大学 (Communication University of China) | A kind of popular song emotion is comprehensive and sorting technique
CN110674339A * | 2019-09-18 | 2020-01-10 | 北京工业大学 (Beijing University of Technology) | Chinese song emotion classification method based on multi-mode fusion
CN111858943A * | 2020-07-30 | 2020-10-30 | 杭州网易云音乐科技有限公司 (Hangzhou NetEase Cloud Music Technology Co., Ltd.) | Music emotion recognition method and device, storage medium and electronic equipment

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN114242070A * | 2021-12-20 | 2022-03-25 | 阿里巴巴(中国)有限公司 (Alibaba (China) Co., Ltd.) | Video generation method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109817213B (en) Method, device and equipment for performing voice recognition on self-adaptive language
CN110853618B (en) Language identification method, model training method, device and equipment
CN108564942B (en) Voice emotion recognition method and system based on adjustable sensitivity
EP3346463B1 (en) Identity verification method and apparatus based on voiceprint
CN105976809B (en) Identification method and system based on speech and facial expression bimodal emotion fusion
CN111402857A (en) Speech synthesis model training method and device, electronic equipment and storage medium
CN112349301A (en) Information processing apparatus, information processing method, and recording medium
CN107221344A (en) A kind of speech emotional moving method
CN112999490A (en) Music healing system based on brain wave emotion recognition and processing method thereof
CN109377986A (en) A kind of non-parallel corpus voice personalization conversion method
CN113129871A (en) Music emotion recognition method and system based on audio signal and lyrics
CN110348482A (en) A kind of speech emotion recognition system based on depth model integrated architecture
CN117198338B (en) Interphone voiceprint recognition method and system based on artificial intelligence
CN112466287B (en) Voice segmentation method, device and computer readable storage medium
CN113571095A (en) Speech emotion recognition method and system based on nested deep neural network
CN116758451A (en) Audio-visual emotion recognition method and system based on multi-scale and global cross attention
CN114626424B (en) Data enhancement-based silent speech recognition method and device
CN115376560A (en) Voice feature coding model for early screening of mild cognitive impairment and training method thereof
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
Bansod et al. Speaker Recognition using Marathi (Varhadi) Language
Wang et al. Speech Swin-Transformer: Exploring a Hierarchical Transformer with Shifted Windows for Speech Emotion Recognition
CN114420086B (en) Speech synthesis method and device
Pagidirayi et al. An efficient Speech Emotion Recognition using LSTM model.
KR100202424B1 (en) Real time speech recognition method
CN116959499A (en) Method for recognizing audio emotion with indefinite length based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination