CN116010902A - Cross-modal fusion-based music emotion recognition method and system


Info

Publication number
CN116010902A
Authority
CN
China
Prior art keywords
music
features
emotion
music emotion
trained
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310058201.6A
Other languages
Chinese (zh)
Inventor
李伟
赵嘉豪
茹港徽
吴宇伦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University
Priority to CN202310058201.6A
Publication of CN116010902A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of music emotion recognition, and particularly relates to a cross-modal fusion-based music emotion recognition method and system. The method comprises the steps of: obtaining music information of the music to be identified; extracting music emotion features from the music information; preprocessing the music emotion features to obtain a feature fusion sequence with position codes; inputting the feature fusion sequence into a pre-trained Transformer Encoder to obtain fused music emotion features; and inputting the fused music emotion features into a pre-trained fully-connected classifier to obtain the predicted music emotion category. The invention can improve the accuracy of music emotion recognition.

Description

Cross-modal fusion-based music emotion recognition method and system
Technical Field
The invention belongs to the technical field of music emotion recognition, and particularly relates to a cross-modal fusion-based music emotion recognition method and system.
Background
Music Emotion Recognition (MER) is the task of recognizing the emotion information contained in a given piece of music; its input is the original audio file together with other related information such as lyrics and song names, and its output is an emotion category or a valence/arousal value. Music emotion recognition condenses and extracts high-level semantic information from music and is an important subtask of music content understanding, so it occupies an important and distinctive position in the field of music information retrieval. Since emotion is regarded as the soul of music, music emotion recognition is a key step in understanding music content. Its main application is emotion-based music information retrieval, including tasks such as music classification and music recommendation.
Music service providers commonly used in the industry, such as Deezer, Spotify and Apple Music, and domestic services such as NetEase Cloud Music and Tencent Music, provide emotion-based music recommendation in their apps, and emotion categories also appear in the classification and retrieval systems of music software, where mood-based playlists (for example, songs for lifting one's spirits or easing anxiety) are recommended according to the listener's emotional state. In these application scenarios, manual emotion labeling tends to make system development too costly and inefficient, because the song data are usually very large in scale. The research task of the invention, music emotion recognition, aims to use a computer algorithm model to automatically perform emotion recognition on large amounts of song data, so that a large-scale emotion-based music information retrieval system becomes feasible.
Early music emotion recognition methods were based mainly on traditional mid-level music features, or learned implicit features from audio using traditional machine learning models. Mion et al. used feature selection and principal component analysis to explore the mid-level features that affect the expression of musical emotion, and found that information such as attack time, notes per second and highest sound level is highly correlated with musical emotion. Schmidt et al. fused mid-level features such as the spectral centroid and MFCCs (Mel-Frequency Cepstral Coefficients) and designed a representative music emotion recognition model based on these features.
Among traditional machine learning methods, Wang et al. proposed learning implicit features in music with a Gaussian Mixture Model (GMM) and named the approach the Acoustic Emotion Gaussian (AEG) model; it had a large impact on the field of music emotion recognition and was still used in emotion recognition methods, for example by Chen et al., as late as 2017. Besides AEG, Support Vector Machines (SVM), Linear Regression (LR) and other traditional machine learning methods are also commonly used for emotion recognition in the audio domain.
Most recent music emotion recognition research is based on deep learning. In 2016, Li et al. proposed a model based on DBLSTM (Deep Bidirectional Long Short-Term Memory) networks and used multi-resolution feature fusion to achieve SoTA results on the MediaEval dataset, and many other LSTM-based approaches appeared around the same time. Since then, CNNs and RNNs (Recurrent Neural Networks) have remained the mainstream in music emotion recognition, and the bidirectional CRNN model proposed by Dong et al. in 2019 is still influential. At present, emotion analysis performance in the audio domain has hit a significant bottleneck and has progressed slowly in recent years.
Since the factors that reflect and affect musical emotion do not all reside in a single modality, multi-modal methods that use text and audio simultaneously are, in theory, well suited to the emotion recognition problem. Early multi-modal methods were also based on traditional machine learning models: in 2008, Laurier et al. and Yang et al. both proposed emotion recognition models that use SVMs to analyze audio content and lyric content together. In 2016, Huang et al. proposed a multi-modal emotion recognition model based on the Deep Boltzmann Machine (DBM).
A representative multi-modal emotion method of recent years was proposed by Delbouys et al. in 2018; it uses an LSTM to analyze lyric content and a 1D CNN to analyze audio content, and also introduced a high-quality multi-modal music dataset. In 2021, Konstantinos et al. proposed a multi-modal approach that uses pre-trained BERT as the text model and a 2D CNN as the audio model, reaching SoTA performance.
In view of the above background, it is not difficult to conclude that existing music emotion recognition methods share the following problems:
1. The emotion recognition accuracy is not high enough. Practical application scenarios often require classifying music emotion into more categories, but existing methods perform poorly on music datasets with more than four emotion categories and hardly reach commercially usable accuracy;
2. Although methods based on traditional machine learning and mid-level features generalize better and require less data, their recognition accuracy is very low and they cannot handle inputs with more complex musical structure and emotion information;
3. Although deep learning methods are more accurate, their accuracy is still far from what practical applications need; they place high demands on the quantity and labeling quality of training data, and they perform poorly in cross-dataset tests;
4. Existing multi-modal methods, on the one hand, have difficulty exploiting all the available related data, such as the album name, song name and album art; on the other hand, their cross-modal feature fusion is inefficient, since they generally fuse features across modalities by direct concatenation or cross-modal attention.
Disclosure of Invention
The invention aims to provide a cross-modal fusion-based music emotion recognition method and system so as to improve the accuracy of music emotion recognition.
The invention provides a cross-modal fusion-based music emotion recognition method, which comprises the following specific steps:
(1) Acquiring music information of music to be identified; the music information includes audio files, lyrics, song names, singer names, album names and album covers;
(2) Extracting music emotion characteristics of the music information; the music emotion features comprise audio features, text features and image features; the audio features are features of the audio file; the text features include features of the lyrics, the song name, the singer name, and the album name; the image features are features of the album art;
(3) Preprocessing the music emotion characteristics to obtain a characteristic fusion sequence with position codes;
(4) Inputting the feature fusion sequence into a pre-trained Transformer Encoder [1] to obtain fused music emotion features;
(5) Inputting the fused music emotion features into a pre-trained fully-connected classifier to obtain the predicted music emotion category; the music emotion categories include happy, sad, angry and relaxed. (An end-to-end sketch of these five steps is given below.)
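For illustration only, the five steps above can be viewed as a single pipeline. The following minimal Python sketch is not part of the patent text; the helper names (extract_audio_features, build_fusion_sequence, etc.) and the pooling step are hypothetical placeholders used only to show how the stages connect.

```python
import torch

def recognize_emotion(audio_path, lyrics, song, artist, album, cover_path, model):
    # (1)-(2) extract per-modality music emotion features
    audio_feat = model.extract_audio_features(audio_path)   # 2D-CNN over a Mel spectrogram
    text_feat = model.extract_text_features(
        " ".join([lyrics, song, artist, album]))             # pre-trained BERT on spliced text
    image_feat = model.extract_image_features(cover_path)    # pre-trained ResNet50 on album art
    # (3) flatten, segment and position-encode into one fusion sequence
    fusion_seq = model.build_fusion_sequence([audio_feat, text_feat, image_feat])
    # (4) cross-modal fusion with a pre-trained Transformer Encoder
    fused = model.transformer_encoder(fusion_seq)
    # (5) fully-connected classifier -> per-category probabilities -> predicted category
    probs = torch.softmax(model.classifier(fused.mean(dim=1)), dim=-1)
    return probs.argmax(dim=-1)
```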
Further, the extracting the music emotion feature of the music information in the step (2) specifically includes:
(1) Calculating a mel spectrum of the audio file;
(2) Adjusting the spectrogram of the Mel spectrum to a preset size;
(3) Inputting the adjusted spectrogram into a pre-trained 2D-CNN model for audio feature extraction to obtain audio features;
(4) Splicing the lyrics, the song names, the singer names and the album names to obtain a spliced text;
(5) Calculating word embedded representations of the spliced text;
(6) Inputting the word embedded representation into a pre-trained BERT [2] model to obtain text features;
(7) Inputting the album art into a pre-trained Resnet50[3] model to obtain image features.
Optionally, the 2D-CNN includes a convolutional layer and a pooling layer; the convolution layers comprise a first convolution layer, a second convolution layer, a third convolution layer, a fourth convolution layer and a fifth convolution layer; the pooling layers comprise a first pooling layer, a second pooling layer, a third pooling layer, a fourth pooling layer and a fifth pooling layer;
the first convolution layer, the first pooling layer, the second convolution layer, the second pooling layer, the third convolution layer, the third pooling layer, the fourth convolution layer, the fourth pooling layer, the fifth convolution layer and the fifth pooling layer are sequentially connected;
the convolution kernel size of each convolution layer is 3×3; the pooling window size of each pooling layer is 2×2.
Further, the preprocessing of the music emotion features in the step (3) is performed to obtain a feature fusion sequence with position codes, which specifically includes:
(1) Converting the music emotion characteristics into one-dimensional vectors;
(2) Segmenting the music emotion features that have been converted into one-dimensional vectors to obtain a feature sequence whose segments are of equal length;
(3) Applying position coding to the segments of the equal-length feature sequence to obtain a feature fusion sequence with position codes.
Optionally, the format of the audio file is wav format.
Optionally, the format of the album cover is jpg format.
Further, transformer Encoder in step (4) is a 6-layer Transformer Encoder. The transducer mechanism consists of a multi-head self-attention mechanism and a feedforward network, wherein the input of Transformer Encoder is a feature fusion sequence (an embedded representation sequence of multi-mode fusion), the feature fusion sequence is input into Transformer Encoder of 6 layers of pre-training, and the output is the multi-mode fused feature, namely the fused music emotion feature.
Further, in step (5), inputting the fused music emotion features into a pre-trained fully-connected classifier to obtain the predicted music emotion category specifically includes:
inputting the fused music emotion features into the pre-trained fully-connected classifier to obtain the prediction probability of each music emotion category; and obtaining the predicted music emotion category from these prediction probabilities.
Based on the above cross-modal fusion music emotion recognition method, the invention also provides a corresponding cross-modal fusion music emotion recognition system, which comprises:
the acquisition module is used for acquiring music information of the music to be identified; the music information includes audio files, lyrics, song names, singer names, album names and album covers;
the extraction module is used for extracting music emotion characteristics of the music information; the music emotion features comprise audio features, text features and image features; the audio features are features of the audio file; the text features include features of the lyrics, the song name, the singer name, and the album name; the image features are features of the album art;
the preprocessing module is used for preprocessing the music emotion characteristics to obtain a characteristic fusion sequence with position codes;
the fusion module is used for inputting the feature fusion sequence into a pre-trained Transformer Encoder to obtain fused music emotion features;
the prediction module is used for inputting the fused music emotion features into a pre-trained fully-connected classifier to obtain the predicted music emotion category; the music emotion categories include happy, sad, angry and relaxed.
These 5 modules perform the 5 steps of the music emotion recognition method based on cross-modality fusion.
The invention also provides electronic equipment, which comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor runs the computer program to enable the electronic equipment to execute the cross-modal fusion-based music emotion recognition method.
The invention also provides a computer readable storage medium which stores a computer program, wherein the computer program realizes the cross-mode fusion-based music emotion recognition method when being executed by a processor.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the cross-modal fusion-based music emotion recognition method provided by the invention exploits music content other than the audio signal, including lyrics, song names, singer names, album names and album covers, so that more accurate music emotion features can be obtained, and it applies a cross-modal Transformer to fuse the features of the various modalities.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a cross-modal fusion-based music emotion recognition method provided by the invention.
Fig. 2 is a block diagram of a cross-modal fusion-based music emotion recognition model.
Fig. 3 is a block diagram of a cross-modal fusion-based music emotion recognition system provided by the invention.
Reference numerals in the drawings: 1 is an acquisition module, 2 is an extraction module, 3 is a preprocessing module, 4 is a fusion module, and 5 is a prediction module.
Detailed Description
The invention will be further described with reference to examples and figures. The described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
As shown in fig. 1, the present invention provides a method for identifying music emotion based on cross-modal fusion, which comprises:
step S1: acquiring music information of music to be identified; the music information includes audio files, lyrics, song names, singer names, album names and album covers; specifically, the format of the audio file is wav format. The format of the album art is jpg format.
Step S2: extracting music emotion characteristics of the music information; the music emotion features comprise audio features, text features and image features; the audio features are features of the audio file; the text features include features of the lyrics, the song name, the singer name, and the album name; the image features are features of the album art.
S2 specifically comprises:
step S21: and calculating the Mel frequency spectrum of the audio file.
Step S22: and adjusting the size of the spectrogram of the Mel frequency spectrum to a preset size.
Step S23: and inputting the adjusted spectrogram into a pre-trained 2D-CNN model to extract audio features, so as to obtain the audio features. Specifically, the 2D-CNN includes a convolutional layer and a pooling layer; the convolution layers comprise a first convolution layer, a second convolution layer, a third convolution layer, a fourth convolution layer and a fifth convolution layer; the pooling layers comprise a first pooling layer, a second pooling layer, a third pooling layer, a fourth pooling layer and a fifth pooling layer; the first convolution layer, the first pooling layer, the second convolution layer, the second pooling layer, the third convolution layer, the third pooling layer, the fourth convolution layer, the fourth pooling layer, the fifth convolution layer and the fifth pooling layer are sequentially connected; the convolution kernel sizes of the convolution layers are 3*3; the size of the pooling window of the pooling layer is 2 x 2.
In practical application, for an input audio file, the Mel spectrum is computed with the melspectrogram() function of Python's librosa library, and its size is adjusted to a uniform size with a resize() function. Specifically, the Mel spectrum is a spectrogram whose horizontal axis is time and whose vertical axis is frequency, with the energy distribution represented by color intensity; the generated spectrograms are resized so that they all share the same dimensions. After the Mel spectrum is computed, a five-layer 2D-CNN is used for audio feature extraction. The five convolution layers have 32, 64, 128, 256 and 128 filters, respectively, each using a 3×3 convolution kernel with the stride set to 1. To downsample the features, a 2-dimensional max pooling layer with a 2×2 pooling window follows each convolution layer. Through this computation, the audio features of the music are obtained.
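As a non-authoritative illustration of this step, the following minimal Python sketch computes the Mel spectrum with librosa, resizes it, and passes it through a five-layer 2D-CNN with the filter counts stated above. The ReLU activations, the padding, the target spectrogram size (128×128) and the sampling rate are assumptions not fixed by the text.

```python
import librosa
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioFeatureExtractor(nn.Module):
    """Five Conv2d(3x3, stride 1) blocks with 32/64/128/256/128 filters,
    each followed by a 2x2 max pooling layer, as described above.
    ReLU and padding=1 are assumptions."""
    def __init__(self):
        super().__init__()
        chans = [1, 32, 64, 128, 256, 128]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=1, padding=1),
                       nn.ReLU(),
                       nn.MaxPool2d(kernel_size=2)]
        self.net = nn.Sequential(*layers)

    def forward(self, spec):                 # spec: (batch, 1, H, W)
        return self.net(spec)

def audio_features(wav_path, cnn, target_size=(128, 128)):  # target_size is an assumed preset
    y, sr = librosa.load(wav_path, sr=22050)                 # sampling rate is an assumption
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mel = librosa.power_to_db(mel)                           # log-Mel scaling, a common convention
    spec = torch.tensor(mel, dtype=torch.float32)[None, None]   # (1, 1, n_mels, frames)
    spec = F.interpolate(spec, size=target_size, mode="bilinear", align_corners=False)
    return cnn(spec)                                         # audio feature map
```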
Step S24: and splicing the lyrics, the song names, the singer names and the album names to obtain a spliced text.
Step S25: word embedded representations of the spliced text are calculated.
Step S26: and embedding the words into a BERT model representing input pre-training to obtain text characteristics.
In practical application, the input lyrics, song name, singer name and album name are spliced together, and the word embedding representation corresponding to the spliced text is computed. Specifically, this can be done quickly with the BERT tokenizer functions in the transformers library, which convert the input text into the word embedding representation that the BERT model can process. After the word embedding representation of the text is computed, the pre-trained BERT model is used to extract the text features corresponding to the lyrics, song name, singer name and album name. The input of the pre-trained BERT model is the word embedded representation of the text information, and the output is the extracted text features.
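A minimal sketch of this text branch, assuming the Hugging Face transformers API and a bert-base-uncased checkpoint (the specific checkpoint and the 512-token truncation are assumptions; Example 4 only states that an English BERT-base model is used):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed English BERT-base checkpoint
bert = BertModel.from_pretrained("bert-base-uncased")

def text_features(lyrics, song_name, singer_name, album_name):
    text = " ".join([lyrics, song_name, singer_name, album_name])  # splice the four text fields
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():                      # BERT parameters stay frozen, as described below
        out = bert(**enc)
    return out.last_hidden_state               # token-level text features, shape (1, seq_len, 768)
```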
Step S27: inputting the album art into a pre-trained Resnet50 model to obtain image features.
In practical application, for the input album art, a pre-trained Resnet50 model in a torchvision library is used for extracting features, and image features corresponding to the song album art are obtained. Wherein the input of the pretrained Resnet50 model is a picture and the output is an extracted image feature.
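A minimal sketch of this image branch, assuming the standard torchvision ResNet50 with its classification head removed and standard ImageNet preprocessing (the exact preprocessing is not specified in the text):

```python
import torch
import torch.nn as nn
from PIL import Image
from torchvision import models, transforms

resnet = models.resnet50(weights="IMAGENET1K_V1")  # torchvision >= 0.13; older versions use pretrained=True
resnet.fc = nn.Identity()                          # drop the classification head, keep 2048-d features
resnet.eval()

preprocess = transforms.Compose([                  # standard ImageNet preprocessing (assumption)
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def image_features(cover_path):
    img = preprocess(Image.open(cover_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return resnet(img)                         # (1, 2048) image feature vector for the album art
```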
Step S3: and preprocessing the music emotion characteristics to obtain a characteristic fusion sequence with position codes.
S3 specifically comprises:
step S31: and converting the music emotion characteristics into one-dimensional vectors.
Step S32: and cutting the music emotion characteristics converted into one-dimensional vectors to obtain feature sequences with equal length of the fragments.
Step S33: and performing position coding on the fragments in the feature sequences with the same length as the fragments to obtain a feature fusion sequence with position coding.
In practical application, the obtained audio feature, text feature and image feature are flattened into one-dimensional vectors, and are segmented into feature sequences with equal lengths of each segment, position codes are added to the feature sequences to obtain a multi-mode fused embedded representation sequence, specifically, each segment is regarded as an embedded representation, and the position codes corresponding to the segments are directly added or multiplied to the embedded representation sequence to obtain the multi-mode fused embedded representation sequence. The embedded representation sequence of the multi-mode fusion is a feature fusion sequence with position coding.
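A minimal sketch of this preprocessing, assuming a segment length of 512, zero-padding to a whole number of segments, and additive sinusoidal position codes (all assumptions; the text allows either adding or multiplying the position codes and does not fix the segment length):

```python
import torch

def build_fusion_sequence(features, d_model=512):
    """Flatten each modality feature, concatenate, split into equal-length
    segments of size d_model, and add sinusoidal position codes."""
    flat = torch.cat([f.flatten() for f in features])       # one long 1-D vector
    pad = (-flat.numel()) % d_model                          # zero-pad to a multiple of d_model
    if pad:
        flat = torch.cat([flat, flat.new_zeros(pad)])
    tokens = flat.view(-1, d_model)                          # (num_segments, d_model)
    # standard sinusoidal position encoding, added element-wise to each segment
    pos = torch.arange(tokens.size(0)).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2) * (-torch.log(torch.tensor(10000.0)) / d_model))
    pe = torch.zeros_like(tokens)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return (tokens + pe).unsqueeze(0)                        # (1, num_segments, d_model)
```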
Step S4: and inputting the feature fusion sequence into a pre-trained Transformer Encoder to obtain the fused music emotion features. Specifically, pre-trained Transformer Encoder is a 6-layer Transformer Encoder. The transducer mechanism consists of a multi-head self-attention mechanism and a feedforward network, wherein the input of Transformer Encoder is a feature fusion sequence (an embedded representation sequence of multi-mode fusion), the feature fusion sequence is input into Transformer Encoder of 6 layers of pre-training, and the output is the multi-mode fused feature, namely the fused music emotion feature.
Step S5: inputting the fused music emotion characteristics into a pre-trained full-connection classifier to obtain predicted music emotion categories; the musical emotion categories include happy, hard, angry and relaxed.
S5 specifically comprises the following steps:
step S51: and inputting the fused music emotion characteristics into a pre-trained full-connection classifier to obtain the prediction probability of each music emotion type. Specifically, a full-connection classifier is used for the fused emotion characteristics of the music, so that the predicted probability (namely, the probability that the input music corresponds to each emotion type) can be output. The music emotion categories depend on the actual application scenario and training data set, including but not limited to happy, hard, angry and relaxed.
Step S52: and obtaining predicted music emotion types according to the prediction probabilities of the music emotion types.
In practical application, the cross-modal fusion-based music emotion recognition method provided by the invention forms a music emotion recognition model that takes music information as input and outputs the music emotion recognition result. The model is loaded with an initially pre-trained Transformer Encoder, an initially pre-trained 2D-CNN, a pre-trained BERT model and an initially pre-trained Resnet50.
When the music emotion recognition model is trained as a whole, the music information can be obtained from a music database, and open-source datasets can be used for training; for example, a subset of the Million Song Dataset carries emotion labels. During training of the music emotion recognition model, since the Transformer Encoder, 2D-CNN, BERT model and Resnet50 are all pre-trained, only the parameters of the initially pre-trained Transformer Encoder, 2D-CNN and Resnet50 need to be fine-tuned, which saves training time and yields the trained Transformer Encoder, 2D-CNN and Resnet50. The parameters of the pre-trained BERT model are not updated at any point of the training, and the trained fully-connected classifier is obtained once the whole music emotion recognition model has been trained.
The music emotion recognition model is trained by computing the cross-entropy loss between the output probabilities of the recognition model and the labels, and propagating the gradient backward with this loss. There are two criteria for completing training: first, training is considered finished when the loss has essentially converged; second, the performance of the current model is tested on the validation set every few epochs, and the best-performing version is taken as the trained model. Furthermore, the initially pre-trained Transformer Encoder, 2D-CNN, BERT model and Resnet50 require no additional fine-tune (transfer learning) step beforehand: the whole model loaded with these pre-trained components is trained directly on the open-source dataset, and the parameters of the initially pre-trained Transformer Encoder, 2D-CNN and Resnet50 are adjusted adaptively during the training of the music emotion recognition model.
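A minimal sketch of this training procedure, assuming a wrapper model object, standard PyTorch data loaders and a hypothetical evaluate() helper; the optimizer choice, learning rate and parameter-naming convention are assumptions, while freezing BERT, the cross-entropy loss and validation-based checkpoint selection follow the description above.

```python
import torch
import torch.nn as nn

# BERT parameters (assumed to live under a "bert" prefix in the wrapper model) are excluded,
# so the Transformer Encoder, 2D-CNN, Resnet50 and classifier are fine-tuned while BERT stays frozen.
optimizer = torch.optim.Adam(
    [p for n, p in model.named_parameters() if not n.startswith("bert")],
    lr=1e-4)                                       # learning rate is an assumption
criterion = nn.CrossEntropyLoss()                  # cross-entropy between outputs and labels

best_val_acc = 0.0
for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:                     # batches of (music information, emotion label)
        logits = model(batch)                      # forward pass through the whole recognition model
        loss = criterion(logits, batch["label"])
        optimizer.zero_grad()
        loss.backward()                            # backward gradient propagation
        optimizer.step()
    val_acc = evaluate(model, val_loader)          # test on the validation set periodically
    if val_acc > best_val_acc:                     # keep the best-performing version as the trained model
        best_val_acc = val_acc
        torch.save(model.state_dict(), "best_model.pt")
```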
In actual use, the cross-modal fusion-based music emotion recognition method provided by the invention can fall back to default values when part of the input is missing.
Compared with the prior art, the cross-modal fusion-based music emotion recognition method provided by the invention has the following advantages:
1. The cross-modal fusion-based music emotion recognition method provided by the invention makes more effective use of music content other than the audio signal, such as lyrics, song names, singer names and album art; it achieves higher recognition accuracy than existing methods and is better suited to practical application scenarios;
2. The feature extraction models used by the cross-modal fusion-based music emotion recognition method are mainly open-source pre-trained models, so only a small amount of annotated data is needed for fine-tuning in actual use; compared with existing methods, it has lower requirements on training data and lower cost in practical applications;
3. The cross-modal fusion-based music emotion recognition method can still operate when some inputs are missing; it is highly robust in such special cases and can predict emotion from partial input (for example, inputting only the song name and lyrics can still yield predictions of relatively high accuracy);
4. The cross-modal fusion-based music emotion recognition method uses a higher-performance cross-modal Transformer to fuse the features of each modality; compared with existing multi-modal methods that use cross-modal attention or direct feature concatenation, it performs better and is more robust on input data with more complex emotion information.
Example 2
In order to execute the method of Embodiment 1 and achieve the corresponding functions and technical effects, this embodiment provides a cross-modal fusion-based music emotion recognition system, as shown in fig. 3, where the system includes:
the acquisition module 1 is used for acquiring music information of music to be identified; the music information includes audio files, lyrics, song names, singer names, album names, and album covers.
The extraction module 2 is used for extracting the music emotion characteristics of the music information; the music emotion features comprise audio features, text features and image features; the audio features are features of the audio file; the text features include features of the lyrics, the song name, the singer name, and the album name; the image features are features of the album art.
And the preprocessing module 3 is used for preprocessing the music emotion characteristics to obtain a characteristic fusion sequence with position codes.
And the fusion module 4 is used for inputting the feature fusion sequence into a pre-trained Transformer Encoder to obtain the fused music emotion features.
The prediction module 5 is used for inputting the fused music emotion features into a pre-trained fully-connected classifier to obtain the predicted music emotion category; the music emotion categories include happy, sad, angry and relaxed.
Example 3
The invention provides an electronic device, comprising a memory and a processor, wherein the memory is used for storing a computer program, and the processor runs the computer program to enable the electronic device to execute the cross-modal fusion-based music emotion recognition method of the embodiment 1.
Alternatively, the electronic device may be a server.
In addition, the embodiment of the invention also provides a computer readable storage medium, which stores a computer program, and the computer program realizes the cross-modal fusion-based music emotion recognition method of the first embodiment when being executed by a processor.
In the present specification, the embodiments are described progressively, each embodiment focusing on its differences from the others; for identical or similar parts, the embodiments may be referred to one another. Since the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant points can be found in the description of the method.
Example 4
The method is trained and tested on MSDD (Million Song Dataset annotated by Deezer) [4] public data set, the total data amount is 2000 tracks, 60% of the data is used as training set, 20% is used as verification set, and 20% is used as test set.
The experimental parameters were set as follows. The 2D-CNN of the audio feature extraction part has 5 layers, with a max pooling layer of window size 2×2 after each convolution layer; the 5 convolution layers have 32, 64, 128, 256 and 128 convolution kernels, respectively, with a kernel size of 3×3 and a stride of 1. The text feature extraction part uses a pre-trained BERT model, loaded with a BERT-base model pre-trained on an English corpus, which has 12 Transformer encoder layers. The image feature extraction part loads a resnet50 model pre-trained on the ImageNet dataset. For processing the multi-modal fusion features, the original 6-layer Transformer encoder structure proposed in [1] is used.
The method provided by the invention achieves an R2 score of 0.352 on the MSDD dataset, an improvement of 0.043 over the existing SoTA method [5]. The experimental results demonstrate that the proposed method has better recognition performance than existing methods.
Reference to the literature
[1] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. Advances in Neural Information Processing Systems, 2017, 30.
[2] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint arXiv:1810.04805, 2018.
[3] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778.
[4] Delbouys R, Hennequin R, Piccoli F, et al. Music mood detection based on audio and lyrics with deep neural net[J]. 2018.
[5] Zhao J, Ru G, Yu Y, et al. Multimodal music emotion recognition with hierarchical cross-modal attention network[C]//2022 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2022: 1-6.

Claims (10)

1. A music emotion recognition method based on cross-modal fusion is characterized by comprising the following specific steps:
(1) Acquiring music information of music to be identified; the music information includes audio files, lyrics, song names, singer names, album names and album covers;
(2) Extracting music emotion characteristics of the music information; the music emotion features comprise audio features, text features and image features; the audio features are features of the audio file; the text features include features of the lyrics, the song name, the singer name, and the album name; the image features are features of the album art;
(3) Preprocessing the music emotion characteristics to obtain a characteristic fusion sequence with position codes;
(4) Inputting the feature fusion sequence into a pre-trained Transformer Encoder to obtain fused music emotion features;
(5) Inputting the fused music emotion features into a pre-trained fully-connected classifier to obtain the predicted music emotion category; the music emotion categories include happy, sad, angry and relaxed.
2. The music emotion recognition method of claim 1, wherein the format of the audio file is wav format; the format of album art is jpg format.
3. The method for identifying musical emotion according to claim 1, wherein said extracting musical emotion features of said musical information in step (2) specifically comprises:
(1) Calculating a mel spectrum of the audio file;
(2) Adjusting the spectrogram of the Mel spectrum to a preset size;
(3) Inputting the adjusted spectrogram into a pre-trained 2D-CNN model for audio feature extraction to obtain audio features;
(4) Splicing the lyrics, the song names, the singer names and the album names to obtain a spliced text;
(5) Calculating word embedded representations of the spliced text;
(6) Inputting the word embedded representation into a pre-trained BERT model to obtain text features;
(7) Inputting the album art into a pre-trained Resnet50 model to obtain image features.
4. A method of musical emotion recognition according to claim 3, characterized in that said 2D-CNN (two-dimensional convolutional neural network) comprises a convolutional layer and a pooling layer; the convolution layers comprise a first convolution layer, a second convolution layer, a third convolution layer, a fourth convolution layer and a fifth convolution layer; the pooling layers comprise a first pooling layer, a second pooling layer, a third pooling layer, a fourth pooling layer and a fifth pooling layer;
the first convolution layer, the first pooling layer, the second convolution layer, the second pooling layer, the third convolution layer, the third pooling layer, the fourth convolution layer, the fourth pooling layer, the fifth convolution layer and the fifth pooling layer are sequentially connected;
the convolution kernel size of each convolution layer is 3×3; the pooling window size of each pooling layer is 2×2.
5. The method for identifying musical emotion according to claim 1, wherein the preprocessing of the musical emotion features in step (3) results in a feature fusion sequence with position coding, and specifically comprises:
(1) Converting the music emotion characteristics into one-dimensional vectors;
(2) Segmenting the music emotion features that have been converted into one-dimensional vectors to obtain a feature sequence whose segments are of equal length;
(3) Applying position coding to the segments of the equal-length feature sequence to obtain a feature fusion sequence with position codes.
6. The method of claim 1, wherein the Transformer Encoder in step (4) is a 6-layer Transformer Encoder; the Transformer mechanism consists of multi-head self-attention and a feed-forward network; the feature fusion sequence is input into the pre-trained 6-layer Transformer Encoder, and the output is the multi-modal fused feature, i.e. the fused music emotion feature.
7. The method for identifying musical emotion according to claim 1, wherein in step (5), inputting the fused music emotion features into a pre-trained fully-connected classifier to obtain the predicted music emotion category specifically includes:
inputting the fused music emotion features into the pre-trained fully-connected classifier to obtain the prediction probability of each music emotion category; and obtaining the predicted music emotion category from the prediction probabilities of the music emotion categories.
8. A cross-modal fusion based music emotion recognition system based on the music emotion recognition method of one of claims 1 to 7, comprising:
the acquisition module is used for acquiring music information of the music to be identified; the music information includes audio files, lyrics, song names, singer names, album names and album covers;
the extraction module is used for extracting music emotion characteristics of the music information; the music emotion features comprise audio features, text features and image features; the audio features are features of the audio file; the text features include features of the lyrics, the song name, the singer name, and the album name; the image features are features of the album art;
the preprocessing module is used for preprocessing the music emotion characteristics to obtain a characteristic fusion sequence with position codes;
the fusion module is used for inputting the feature fusion sequence into a pre-trained Transformer Encoder to obtain fused music emotion features;
the prediction module is used for inputting the fused music emotion features into a pre-trained fully-connected classifier to obtain the predicted music emotion category; the music emotion categories include happy, sad, angry and relaxed;
these 5 modules perform the 5 steps of the music emotion recognition method based on cross-modality fusion.
9. An electronic device comprising a memory for storing a computer program and a processor that runs the computer program to cause the electronic device to perform the cross-modal fusion-based music emotion recognition method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements a cross-modal fusion based music emotion recognition method as claimed in any one of claims 1 to 7.
CN202310058201.6A 2023-01-16 2023-01-16 Cross-modal fusion-based music emotion recognition method and system Pending CN116010902A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310058201.6A CN116010902A (en) 2023-01-16 2023-01-16 Cross-modal fusion-based music emotion recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310058201.6A CN116010902A (en) 2023-01-16 2023-01-16 Cross-modal fusion-based music emotion recognition method and system

Publications (1)

Publication Number Publication Date
CN116010902A true CN116010902A (en) 2023-04-25

Family

ID=86026714

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310058201.6A Pending CN116010902A (en) 2023-01-16 2023-01-16 Cross-modal fusion-based music emotion recognition method and system

Country Status (1)

Country Link
CN (1) CN116010902A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116843798A (en) * 2023-07-03 2023-10-03 支付宝(杭州)信息技术有限公司 Animation generation method, model training method and device
CN117828537A (en) * 2024-03-04 2024-04-05 北京建筑大学 Music emotion recognition method and device based on CBA model
CN117828537B (en) * 2024-03-04 2024-05-17 北京建筑大学 Music emotion recognition method and device based on CBA model


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination