CN113421589B - Singer identification method, singer identification device, singer identification equipment and storage medium - Google Patents

Singer identification method, singer identification device, singer identification equipment and storage medium

Info

Publication number
CN113421589B
CN113421589B (application CN202110740063.0A)
Authority
CN
China
Prior art keywords
singer
audio
audio file
mel
identified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110740063.0A
Other languages
Chinese (zh)
Other versions
CN113421589A (en)
Inventor
张旭龙
王健宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110740063.0A priority Critical patent/CN113421589B/en
Publication of CN113421589A publication Critical patent/CN113421589A/en
Application granted granted Critical
Publication of CN113421589B publication Critical patent/CN113421589B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Auxiliary Devices For Music (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of artificial intelligence and discloses a singer identification method, device, equipment and storage medium for improving the accuracy and efficiency of singer identification. The singer identification method comprises the following steps: receiving a mixed-recording audio file to be identified, and acquiring a target audio signal of the audio file to be identified; converting the target audio signal into the Mel frequency domain according to a preset Fourier transform algorithm to obtain an audio Mel spectrogram corresponding to the audio file to be identified; extracting a melody Mel spectrogram of the background music part from the audio Mel spectrogram through a preset fundamental frequency extraction algorithm; and performing singer identification on the audio Mel spectrogram and the melody Mel spectrogram based on a trained singer identification model to obtain the singer information corresponding to the audio file to be identified. The invention further relates to blockchain technology, and the singer information can be stored in blockchain nodes.

Description

Singer identification method, singer identification device, singer identification equipment and storage medium
Technical Field
The present invention relates to the field of speech classification, and in particular, to a singer recognition method, apparatus, device, and storage medium.
Background
Singer identification is currently widely used in many fields, particularly music classification: when a singer performs in a given musical environment, an existing singer identification model can identify the singer and thereby provide the singer's information to users.
An existing singer identification model is usually trained by cutting segments from different songs of the same singer to obtain positive samples and cutting segments from songs of different singers to obtain negative samples, then jointly training on these sample pairs. The identification accuracy of such a model depends heavily on the number of sample pairs, the procedure is complex, and the identification efficiency is low.
Disclosure of Invention
The invention provides a singer identification method, a singer identification device, singer identification equipment and a storage medium, which are used for improving accuracy and efficiency of singer identification.
The first aspect of the present invention provides a singer identification method, including:
receiving an audio file to be identified of a mixed recording, and acquiring a target audio signal of the audio file to be identified;
according to a preset Fourier transform algorithm, converting the target audio signal into a Mel frequency domain to obtain an audio Mel spectrogram corresponding to the audio file to be identified;
extracting a melody Mel spectrogram of the background music part from the audio Mel spectrogram by a preset fundamental frequency extraction algorithm;
and carrying out singer identification on the audio Mel spectrogram and the melody Mel spectrogram based on the trained singer identification model to obtain singer information corresponding to the audio file to be identified.
Optionally, in a first implementation manner of the first aspect of the present invention, before the receiving the audio file to be identified of the mixed recording and obtaining the target audio signal of the audio file to be identified, the singer identifying method further includes:
acquiring an initial sample audio file with singer information labels, and expanding the initial sample audio file through a preset data enhancement algorithm to obtain a target sample audio file;
acquiring an initialized convolution recurrent neural network model, and inputting the target sample audio file into the convolution recurrent neural network model, wherein the convolution recurrent neural network model comprises a plurality of convolution layers, a plurality of gating circulation unit layers and a full connection layer;
model training is carried out on the convolution layers, the gating circulating unit layers and the full-connection layer based on the target sample audio file, so that a model loss result is obtained;
and according to the model loss result, adjusting network parameters of the convolution recurrent neural network model to obtain a trained singer identification model.
Optionally, in a second implementation manner of the first aspect of the present invention, the obtaining an initial sample audio file with singer information label, and expanding the initial sample audio file through a preset data enhancement algorithm, to obtain a target sample audio file, includes:
acquiring a plurality of initial sample audio files with singer information marks, and converting the plurality of initial sample audio files into sample audio signals to obtain a plurality of sample audio signals;
deleting musical instrument tracks in the plurality of sample audio signals through a preset music track dividing algorithm to obtain a plurality of voice signals;
respectively extracting background sounds from the plurality of sample audio signals through a preset fundamental frequency extraction algorithm to obtain a plurality of melody signals;
and respectively fusing each human sound signal with the melody signals through a preset data enhancement algorithm to obtain a target sample audio file.
Optionally, in a third implementation manner of the first aspect of the present invention, the deleting, by a preset music track dividing algorithm, musical instrument tracks in the plurality of sample audio signals to obtain a plurality of voice signals includes:
According to a preset music track dividing algorithm, respectively separating the plurality of sample audio signals into a plurality of target tracks;
and eliminating the musical instrument sound tracks in the target sound tracks to obtain a plurality of human sound tracks, and generating a plurality of human sound signals corresponding to the human sound tracks.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the converting, according to a preset fourier transform algorithm, the target audio signal to a mel frequency domain to obtain an audio mel spectrogram corresponding to the audio file to be identified includes:
carrying out framing treatment on the target audio signal according to a time window with a preset frame length to obtain a multi-frame time domain signal;
converting the multi-frame time domain signals into frequency domains through a preset Fourier transform algorithm to obtain multi-frame frequency domain signals;
and acquiring a preset Mel filter group, and performing filtering processing on the multi-frame frequency domain signals to obtain an audio Mel spectrogram corresponding to the audio file to be identified.
Optionally, in a fifth implementation manner of the first aspect of the present invention, the extracting, by a preset fundamental frequency extraction algorithm, a melody mel spectrogram of the background music part from the audio mel spectrogram includes:
extracting a fundamental frequency signal from the target audio signal through a preset fundamental frequency extraction algorithm, wherein the fundamental frequency signal is used for indicating a background music part signal of the audio file to be identified;
and carrying out convolution operation on the audio Mel spectrogram based on the fundamental frequency signal to obtain a melody Mel spectrogram of the background music part in the audio Mel spectrogram.
Optionally, in a sixth implementation manner of the first aspect of the present invention, the performing singer identification on the audio mel spectrogram and the melody mel spectrogram based on the trained singer identification model to obtain singer information corresponding to the audio file to be identified includes:
performing convolution recursive feature extraction on the audio Mel spectrogram and the melody Mel spectrogram through a plurality of convolution layers and a plurality of gating circulating unit layers in the trained singer recognition model to obtain a feature matrix corresponding to the audio file to be recognized;
and carrying out singer probability voting on the feature matrix through a full connection layer in the singer identification model, and taking singer information with highest voting probability as singer information corresponding to the audio file to be identified.
A second aspect of the present invention provides a singer identification apparatus, comprising:
the receiving module is used for receiving the audio file to be identified of the mixed record and acquiring a target audio signal of the audio file to be identified;
the conversion module is used for converting the target audio signal into a Mel frequency domain according to a preset Fourier transform algorithm to obtain an audio Mel frequency spectrogram corresponding to the audio file to be identified;
the extraction module is used for extracting a melody Mel spectrogram of the background music part from the audio Mel spectrogram through a preset fundamental frequency extraction algorithm;
and the identification module is used for carrying out singer identification on the audio Mel spectrogram and the melody Mel spectrogram based on the trained singer identification model to obtain singer information corresponding to the audio file to be identified.
Optionally, in a first implementation manner of the second aspect of the present invention, the singer identifying device further includes:
the expansion module is used for acquiring an initial sample audio file with singer information marks, and expanding the initial sample audio file through a preset data enhancement algorithm to obtain a target sample audio file;
the input module is used for acquiring an initialized convolution recurrent neural network model and inputting the target sample audio file into the convolution recurrent neural network model, and the convolution recurrent neural network model comprises a plurality of convolution layers, a plurality of gating circulation unit layers and a full connection layer;
The training module is used for carrying out model training on the plurality of convolution layers, the plurality of gating circulating unit layers and the full-connection layer based on the target sample audio file to obtain a model loss result;
and the adjusting module is used for adjusting the network parameters of the convolution recurrent neural network model according to the model loss result to obtain a trained singer identification model.
Optionally, in a second implementation manner of the second aspect of the present invention, the extension module includes:
the acquisition unit is used for acquiring a plurality of initial sample audio files with singer information marks, converting the plurality of initial sample audio files into sample audio signals and obtaining a plurality of sample audio signals;
the track dividing unit is used for deleting musical instrument tracks in the plurality of sample audio signals respectively through a preset music track dividing algorithm to obtain a plurality of voice signals;
the extraction unit is used for respectively extracting background sounds from the plurality of sample audio signals through a preset fundamental frequency extraction algorithm to obtain a plurality of melody signals;
and the fusion unit is used for respectively fusing the human sound signals with the melody signals through a preset data enhancement algorithm to obtain a target sample audio file.
Optionally, in a third implementation manner of the second aspect of the present invention, the track dividing unit is specifically configured to:
according to a preset music track dividing algorithm, respectively separating the plurality of sample audio signals into a plurality of target tracks;
and eliminating the musical instrument sound tracks in the target sound tracks to obtain a plurality of human sound tracks, and generating a plurality of human sound signals corresponding to the human sound tracks.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the conversion module is specifically configured to:
carrying out framing treatment on the target audio signal according to a time window with a preset frame length to obtain a multi-frame time domain signal;
converting the multi-frame time domain signals into frequency domains through a preset Fourier transform algorithm to obtain multi-frame frequency domain signals;
and acquiring a preset Mel filter group, and performing filtering processing on the multi-frame frequency domain signals to obtain an audio Mel spectrogram corresponding to the audio file to be identified.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the extracting module is specifically configured to:
extracting a fundamental frequency signal from the target audio signal through a preset fundamental frequency extraction algorithm, wherein the fundamental frequency signal is used for indicating a background music part signal of the audio file to be identified;
And carrying out convolution operation on the audio Mel spectrogram based on the fundamental frequency signal to obtain a melody Mel spectrogram of the background music part in the audio Mel spectrogram.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the identification module is specifically configured to:
performing convolution recursive feature extraction on the audio Mel spectrogram and the melody Mel spectrogram through a plurality of convolution layers and a plurality of gating circulating unit layers in the trained singer recognition model to obtain a feature matrix corresponding to the audio file to be recognized;
and carrying out singer probability voting on the feature matrix through a full connection layer in the singer identification model, and taking singer information with highest voting probability as singer information corresponding to the audio file to be identified.
A third aspect of the present invention provides a singer identification apparatus, comprising: a memory and at least one processor, the memory having instructions stored therein; the at least one processor invokes the instructions in the memory to cause the singer identification device to perform the singer identification method described above.
A fourth aspect of the present invention provides a computer-readable storage medium having instructions stored therein that, when run on a computer, cause the computer to perform the singer identification method described above.
In the technical scheme provided by the invention, a mixed-recording audio file to be identified is received, and a target audio signal of the audio file to be identified is acquired; the target audio signal is converted into the Mel frequency domain according to a preset Fourier transform algorithm to obtain an audio Mel spectrogram corresponding to the audio file to be identified; a melody Mel spectrogram of the background music part is extracted from the audio Mel spectrogram by a preset fundamental frequency extraction algorithm; and singer identification is carried out on the audio Mel spectrogram and the melody Mel spectrogram based on the trained singer identification model to obtain singer information corresponding to the audio file to be identified. In the embodiment of the invention, after obtaining the target audio signal corresponding to the audio file to be identified, the server maps the target audio signal to the Mel frequency domain to obtain an audio Mel spectrogram, extracts the melody Mel spectrogram from the audio Mel spectrogram through a fundamental frequency extraction algorithm, and finally performs singer identification based on the trained singer identification model to obtain the singer information corresponding to the audio file to be identified. The singer identification method and device can thus improve singer identification efficiency and accuracy.
Drawings
FIG. 1 is a schematic diagram of an embodiment of a singer identification method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of another embodiment of a singer identification method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an embodiment of a singer identification device according to an embodiment of the present invention;
FIG. 4 is a schematic view of another embodiment of a singer identification device according to an embodiment of the present invention;
fig. 5 is a schematic diagram of an embodiment of a singer identification device according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a singer identification method, a singer identification device, singer identification equipment and a storage medium, which are used for improving accuracy and efficiency of singer identification.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
For ease of understanding, a specific flow of an embodiment of the present invention is described below with reference to fig. 1, and an embodiment of a singer identification method in an embodiment of the present invention includes:
101. receiving an audio file to be identified of the mixed record, and acquiring a target audio signal of the audio file to be identified;
it is to be understood that the execution subject of the present invention may be a singer identification device, or may be a terminal or a server, which is not limited herein. The embodiment of the invention is described by taking a server as an execution main body as an example.
In this embodiment, the audio file to be identified is a mixed-recording audio file, that is, a music file containing both a singer's vocals and background music. The audio file to be identified may be an entire piece of music or a segment of one, which is not limited herein.
In this embodiment, the server collects, through a preset audio signal collection algorithm, audio signals in the audio file to be identified to obtain a target audio signal, and inputs the target audio signal as a signal source of the audio file to be identified in the singer identification process.
102. According to a preset Fourier transform algorithm, converting a target audio signal into a Mel frequency domain to obtain an audio Mel frequency spectrogram corresponding to the audio file to be identified;
In this embodiment, the Fourier transform algorithm is a linear integral transform algorithm mainly used for transforming signals out of the time domain. Based on the expansion of the Fourier series, the server can transform any target audio signal into a sum of a series of sine and cosine waves through the preset Fourier transform algorithm, so that an abstract, irregular target audio signal can be represented as a curve describing how the audio changes.
In this embodiment, the preset Fourier transform algorithm includes the short-time Fourier transform algorithm and the fast Fourier transform algorithm; the server preferably adopts the short-time Fourier transform algorithm. First, the server performs a short-time Fourier transform on the time domain signal of the target audio signal to obtain the audio frequency domain signal corresponding to the target audio signal; the server then filters the audio frequency domain signal through a filter bank on the Mel frequency scale to obtain the audio Mel spectrogram corresponding to the audio file to be identified, which is used for singer identification of the audio file to be identified.
103. Extracting a melody Mel spectrogram of the background music part from the audio Mel spectrogram by a preset fundamental frequency extraction algorithm;
In this embodiment, since the fundamental frequency of sound usually varies over time, when extracting the fundamental frequency the server first frames the target audio signal (the frame length is usually tens of milliseconds) to obtain multi-frame audio signal segments, and then extracts the fundamental frequency frame by frame. The server can extract the fundamental frequency of each frame of audio signal segment in two ways. The first is to take the waveform of the target audio signal as input and search for the minimum positive period of the waveform, thereby obtaining the melody fundamental frequency signal of the background sound part. The second is to first apply a Fourier transform to the target audio signal to obtain a spectrum (keeping only the amplitude spectrum and discarding the phase spectrum); the spectrum has peaks at integer multiples of the fundamental frequency, and the server takes the greatest common divisor of the peak frequencies as the melody fundamental frequency signal.
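Neither approach is given in code form in the patent. As a rough illustration of the first one (searching each frame's waveform for its minimum positive period), the Python sketch below picks the strongest autocorrelation peak per frame; the function name, frame length and pitch range are illustrative assumptions rather than values from the patent.

```python
import numpy as np

def estimate_f0_autocorr(signal, sr, frame_ms=40, f0_min=50.0, f0_max=1000.0):
    """Frame-by-frame fundamental frequency via the waveform's minimum
    positive period, found as the strongest autocorrelation peak."""
    frame_len = int(sr * frame_ms / 1000)
    lag_min = int(sr / f0_max)            # shortest candidate period, in samples
    lag_max = int(sr / f0_min)            # longest candidate period, in samples
    f0_track = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
        f0_track.append(sr / lag if ac[lag] > 0 else 0.0)  # 0.0 marks unvoiced
    return np.array(f0_track)
```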
In this embodiment, the preset fundamental frequency extraction algorithm includes, but is not limited to: the sawtooth waveform inspired pitch estimator (SWIPE), the self-supervised pitch estimation algorithm (SPICE), and the convolutional representation for pitch estimation (CREPE); the present invention preferably uses CREPE as the fundamental frequency extraction algorithm for extracting the fundamental frequency of the target audio signal.
104. And carrying out singer identification on the audio Mel spectrogram and the melody Mel spectrogram based on the trained singer identification model to obtain singer information corresponding to the audio file to be identified.
In this embodiment, the server inputs the audio Mel spectrogram and the melody Mel spectrogram into the trained singer recognition model, which recognizes them to obtain a singer tag corresponding to the audio file to be recognized; the server then searches a preset singer library with the singer tag to obtain the singer information corresponding to the tag and outputs it to the terminal.
In this embodiment, the trained singer recognition model is a convolutional recurrent neural network (CRNN) model, a stack of 4 convolution layers, 2 gated recurrent unit (GRU) layers and 1 dense layer (i.e., a full connection layer), which can efficiently perform feature extraction, singer classification and other processing on the audio Mel spectrogram and the melody Mel spectrogram.
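The patent fixes this stack but not its dimensions. The minimal PyTorch sketch below is one plausible reading; the channel counts, GRU hidden size, number of singers, and the choice to stack the audio Mel and melody Mel spectrograms as two input channels are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class SingerCRNN(nn.Module):
    """4 convolution layers -> 2 GRU layers -> 1 fully connected layer,
    as described in the patent; all sizes below are assumptions."""
    def __init__(self, n_mels=128, n_singers=100):
        super().__init__()
        chans = [2, 32, 64, 128, 128]        # 2 input channels: audio Mel + melody Mel
        blocks = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            blocks += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                       nn.BatchNorm2d(c_out), nn.ReLU(),
                       nn.MaxPool2d(2)]       # halve the Mel and time axes
        self.conv = nn.Sequential(*blocks)
        self.gru = nn.GRU(128 * (n_mels // 16), 256, num_layers=2, batch_first=True)
        self.fc = nn.Linear(256, n_singers)   # dense layer: one logit per singer

    def forward(self, mel):                   # mel: (batch, 2, n_mels, time)
        x = self.conv(mel)                    # (batch, 128, n_mels/16, time/16)
        x = x.permute(0, 3, 1, 2).flatten(2)  # (batch, time/16, features)
        x, _ = self.gru(x)
        return self.fc(x[:, -1])              # logits from the last time step
```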
Further, the server may store the singer information in a blockchain database, which is not limited herein.
In the embodiment of the invention, after obtaining the target audio signal corresponding to the audio file to be identified, the server maps the target audio signal to the Mel frequency domain to obtain an audio Mel spectrogram, extracts the melody Mel spectrogram from the audio Mel spectrogram through a fundamental frequency extraction algorithm, and finally performs singer identification based on the trained singer identification model to obtain the singer information corresponding to the audio file to be identified. The singer identification method and device can thus improve singer identification efficiency and accuracy.
Referring to fig. 2, another embodiment of the singer identification method in the embodiment of the invention includes:
201. acquiring an initial sample audio file with singer information labels, and expanding the initial sample audio file through a preset data enhancement algorithm to obtain a target sample audio file;
In this embodiment, in order to expand the number of samples, the server expands the initial sample audio files labeled with singer information through a preset data enhancement algorithm, thereby obtaining target sample audio files of a larger order of magnitude.
Specifically, a server acquires a plurality of initial sample audio files with singer information marks, and converts the plurality of initial sample audio files into sample audio signals to obtain a plurality of sample audio signals; the server deletes musical instrument tracks in a plurality of sample audio signals respectively through a preset music track dividing algorithm to obtain a plurality of voice signals; the server extracts background sounds from a plurality of sample audio signals through a preset fundamental frequency extraction algorithm to obtain a plurality of melody signals; and the server respectively fuses each human sound signal with a plurality of melody signals through a preset data enhancement algorithm to obtain a target sample audio file.
In this embodiment, the server separates the melody signal and the vocal signal in each of the plurality of initial sample audio files to obtain a plurality of vocal signals and a plurality of melody signals, and then combines each vocal signal with every melody signal pairwise, so that the number of samples grows multiplicatively (N vocal signals paired with M melody signals yield N×M target samples). The resulting target sample audio files thus have an expanded sample size, which improves the generalization ability of training on small samples and the robustness of singer classification, and provides a data foundation for training the singer identification model.
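The patent describes the pairwise combination but calls the final step simply "fusing". The sketch below assumes a naive additive remix over equal-length prefixes, with samples represented as (singer label, waveform) tuples; both choices are illustrative assumptions, not details from the patent.

```python
import itertools

def expand_samples(vocal_signals, melody_signals):
    """Fuse every vocal signal with every melody signal: N vocals and
    M melodies yield N*M target samples, each labeled with the vocal's singer.

    vocal_signals: list of (singer_label, waveform) tuples (NumPy arrays)
    melody_signals: list of waveforms (NumPy arrays)
    """
    augmented = []
    for (singer, vocal), melody in itertools.product(vocal_signals, melody_signals):
        n = min(len(vocal), len(melody))
        augmented.append((singer, vocal[:n] + melody[:n]))  # naive additive remix
    return augmented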
In this embodiment, the server extracts the vocal signals in the sample audio signals corresponding to each initial sample audio file through a music track-dividing algorithm, and extracts the melody signals in each sample audio signal through a fundamental frequency extraction algorithm, which is similar to the signal extraction method of the audio file to be identified, and detailed description thereof is omitted.
Further, the server deletes musical instrument tracks in the plurality of sample audio signals through a preset music track dividing algorithm to obtain a plurality of voice signals, including: the server separates a plurality of sample audio signals into a plurality of target audio tracks according to a preset music track dividing algorithm; and the server eliminates the musical instrument sound tracks in the target sound tracks to obtain a plurality of human sound tracks, and generates a plurality of human sound signals corresponding to the human sound tracks.
In this embodiment, the server performs track-splitting processing on each sample audio signal through the preset music track dividing algorithm; this embodiment preferably uses Demucs as the preset music track dividing algorithm, which can efficiently separate the plurality of sample audio signals into a plurality of target audio tracks. The server then deletes the instrument tracks among the plurality of target audio tracks to obtain the voice signals corresponding to the human voice tracks.
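Assuming the patent's track-splitting algorithm refers to an open-source separator such as Demucs, the instrument-track deletion could be driven as in the sketch below; the flags follow Demucs's documented CLI but the invocation, file names and output layout should all be treated as assumptions here.

```python
import subprocess

# Split a sample song into a vocal stem and an accompaniment stem with the
# open-source Demucs separator (assuming `pip install demucs`).
subprocess.run(
    ["python", "-m", "demucs", "--two-stems=vocals", "-o", "separated", "song.wav"],
    check=True,
)
# Expected result (model directory name varies by Demucs version):
#   separated/<model_name>/song/vocals.wav     <- kept as the human voice signal
#   separated/<model_name>/song/no_vocals.wav  <- the removed instrument tracks
```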
202. Acquiring an initialized convolution recurrent neural network model, and inputting a target sample audio file into the convolution recurrent neural network model, wherein the convolution recurrent neural network model comprises a plurality of convolution layers, a plurality of gating circulation unit layers and a full connection layer;
In this embodiment, the initial model for singer recognition model training is a convolution recurrent neural network model whose network structure is a stack composed of a plurality of convolution layers, a plurality of gating circulation unit layers and a full connection layer. Before the target sample audio file is input into the initialized convolution recurrent neural network model, the server also performs a series of preprocessing steps on the target sample audio file to obtain the audio Mel spectrogram and melody Mel spectrogram corresponding to it, so as to improve the training efficiency of the convolution recurrent neural network model.
203. Model training is carried out on a plurality of convolution layers, a plurality of gating circulating unit layers and a full connection layer based on a target sample audio file, so that a model loss result is obtained;
In this embodiment, after inputting the audio Mel spectrogram and melody Mel spectrogram corresponding to the target sample audio file into the convolution recurrent neural network model, the server performs multiple rounds of singer identification training through the plurality of convolution layers, the plurality of gating circulation unit layers and the full connection layer in the model, outputting a prediction result in each round. The server computes a loss from the prediction result and the singer information label corresponding to the target sample audio file, thereby obtaining a model loss result for each round of training.
204. According to the model loss result, network parameters of the convolution recurrent neural network model are adjusted to obtain a trained singer identification model;
In this embodiment, the server checks the model loss result of each round of training. When the model loss result is smaller than a preset loss threshold, model training is complete; when it is larger than the preset loss threshold, the model is not yet fully trained and there is a large gap between the predicted result and the actual result, so the server repeats steps 203 and 204 for the next round of training until the model loss result falls below the preset loss threshold, at which point the server generates the trained singer identification model.
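The patent fixes only this stopping rule; the optimizer, loss function and threshold value are not given. The sketch below assumes Adam, cross-entropy loss and illustrative hyperparameters.

```python
import torch
import torch.nn as nn

def train_until_threshold(model, loader, loss_threshold=0.05, max_rounds=100, lr=1e-3):
    """Run rounds of training and stop once the average model loss result
    falls below the preset loss threshold, as in steps 203-204."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(max_rounds):
        total_loss, n_samples = 0.0, 0
        for mel, singer_label in loader:           # mel: (batch, 2, n_mels, time)
            optimizer.zero_grad()
            loss = criterion(model(mel), singer_label)
            loss.backward()
            optimizer.step()                        # adjust the network parameters
            total_loss += loss.item() * len(singer_label)
            n_samples += len(singer_label)
        if total_loss / n_samples < loss_threshold: # model training complete
            break
    return model
```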
205. Receiving an audio file to be identified of the mixed record, and acquiring a target audio signal of the audio file to be identified;
the execution of step 205 is similar to that of step 101, and detailed description thereof will be omitted herein.
206. According to a preset Fourier transform algorithm, converting a target audio signal into a Mel frequency domain to obtain an audio Mel frequency spectrogram corresponding to the audio file to be identified;
Specifically, the server carries out framing processing on the target audio signal according to a time window with a preset frame length to obtain multi-frame time domain signals; the server converts the multi-frame time domain signals into the frequency domain through a preset Fourier transform algorithm to obtain multi-frame frequency domain signals; the server acquires a preset Mel filter group and performs filtering processing on the multi-frame frequency domain signals to obtain an audio Mel spectrogram corresponding to the audio file to be identified.
In this embodiment, since the target audio signal varies over time while singer recognition focuses on the relationship between frequency and energy in the sound, the server needs to convert the target audio signal into a spectrogram for audio analysis so as to recognize singer information. Specifically, the server frames the target audio signal with a time window of preset frame length to obtain multi-frame time domain signals, and then converts each frame of time domain signal into the frequency domain through the Fourier transform algorithm to obtain multi-frame frequency domain signals, whose carrier is a spectrogram.
In this embodiment, since the human ear's perception of sound is not linear and the ear is more sensitive to low frequencies than to high frequencies, the server converts the linear spectrogram into a nonlinear Mel spectrum; specifically, a Mel filter bank is used to filter the multi-frame frequency domain signals, thereby obtaining the audio Mel spectrogram corresponding to the audio file to be identified.
First, the server performs a short-time Fourier transform on the time domain signal of the target audio signal to obtain the audio frequency domain signal corresponding to the target audio signal; the server then filters the audio frequency domain signal through a filter bank on the Mel frequency scale to obtain the audio Mel spectrogram corresponding to the audio file to be identified, which is used for singer identification of the audio file to be identified.
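The framing, Fourier transform and Mel filtering described above match a standard log-Mel pipeline; a minimal sketch using librosa follows, with the frame length, hop size and number of Mel bands chosen as illustrative assumptions rather than values from the patent.

```python
import librosa

def audio_mel_spectrogram(path, sr=22050, frame_ms=25, hop_ms=10, n_mels=128):
    """Frame the target audio signal with a fixed-length time window, apply a
    short-time Fourier transform, then filter with a Mel filter bank."""
    y, sr = librosa.load(path, sr=sr)
    n_fft = int(sr * frame_ms / 1000)            # preset frame-length window
    hop_length = int(sr * hop_ms / 1000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel)              # (n_mels, frames) log-Mel spectrogram
```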
207. Extracting a melody Mel spectrogram of the background music part from the audio Mel spectrogram by a preset fundamental frequency extraction algorithm;
Specifically, the server extracts a fundamental frequency signal from the target audio signal through a preset fundamental frequency extraction algorithm, wherein the fundamental frequency signal is used for indicating the background music part signal of the audio file to be identified; the server then carries out a convolution operation on the audio Mel spectrogram based on the fundamental frequency signal to obtain the melody Mel spectrogram of the background music part in the audio Mel spectrogram.
In this embodiment, the preferred fundamental frequency extraction algorithm is CREPE, which can also be regarded as a fundamental frequency extraction model comprising 6 convolution layers and 1 full connection layer. The output of the model is a 360-dimensional vector, each dimension corresponding to the probability of a candidate fundamental frequency. The output layer does not apply an activation function over the vector as a whole; instead, a sigmoid function is applied to each dimension so that each output probability lies between 0 and 1. The server takes the candidate fundamental frequency with the highest probability as the target fundamental frequency of each frame, and the target fundamental frequencies of all frames are combined to form the melody Mel spectrogram.
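For reference, the published open-source CREPE package exposes this kind of 360-way sigmoid output directly; a usage sketch follows. The file name is assumed, and the patent's in-house 6-layer variant need not match the published model.

```python
import crepe
from scipy.io import wavfile

# Sketch using the open-source CREPE package (assuming `pip install crepe`).
sr, audio = wavfile.read("song.wav")
time, frequency, confidence, activation = crepe.predict(audio, sr, viterbi=True)
# `activation` holds, per frame, the 360 per-dimension sigmoid probabilities;
# `frequency` is the highest-probability candidate fundamental frequency per frame.
```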
208. And carrying out singer identification on the audio Mel spectrogram and the melody Mel spectrogram based on the trained singer identification model to obtain singer information corresponding to the audio file to be identified.
Specifically, the server performs convolutional recursive feature extraction on the audio Mel spectrogram and the melody Mel spectrogram through the plurality of convolution layers and the plurality of gating circulation unit layers in the trained singer recognition model to obtain a feature matrix corresponding to the audio file to be recognized; the server then performs singer probability voting on the feature matrix through the full connection layer in the singer identification model, taking the singer information with the highest voting probability as the singer information corresponding to the audio file to be identified.
In this embodiment, the server can efficiently perform singer identification on the audio file to be identified through the trained singer identification model, that is, through the plurality of convolution layers, the plurality of gating circulation unit layers and the full connection layer whose parameters have been fine-tuned. The final singer information is determined by singer probability voting, and the singer information with the highest probability is taken as the singer information corresponding to the audio file to be identified.
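The patent does not define "singer probability voting" beyond taking the highest-probability singer; one common reading, assumed here, is to average the per-segment probability distributions produced by the full connection layer and take the argmax.

```python
import torch

def vote_singer(segment_logits):
    """Average per-segment singer probabilities from the full connection layer
    and return the index of the singer with the highest voting probability."""
    probs = torch.softmax(segment_logits, dim=-1)   # (n_segments, n_singers)
    return int(probs.mean(dim=0).argmax())
```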
In the embodiment of the invention, when the training samples for the singer identification model are insufficient to complete model training, the server expands the number of initial sample audio files through a data enhancement algorithm to obtain a multiplicatively increased set of target sample audio files. It then performs singer information prediction on the target sample audio files through the plurality of convolution layers, the plurality of gating circulation unit layers and the full connection layer in the initialized convolution recurrent neural network model to obtain model loss results, and finally generates the trained singer identification model according to the model loss results.
The singer identification method in the embodiment of the present invention is described above; the singer identification device in the embodiment of the present invention is described below. Referring to fig. 3, one embodiment of the singer identification device in the embodiment of the present invention includes:
The receiving module 301 is configured to receive an audio file to be identified of a mixed recording, and obtain a target audio signal of the audio file to be identified;
the conversion module 302 is configured to convert the target audio signal to a mel frequency domain according to a preset fourier transform algorithm, so as to obtain an audio mel spectrogram corresponding to the audio file to be identified;
an extracting module 303, configured to extract a melody mel spectrogram of the background music part from the audio mel spectrogram by using a preset fundamental frequency extracting algorithm;
and the recognition module 304 is configured to perform singer recognition on the audio Mel spectrogram and the melody Mel spectrogram based on the trained singer recognition model, so as to obtain singer information corresponding to the audio file to be recognized.
Further, the singer information may be stored in a blockchain database, which is not limited herein.
In the embodiment of the invention, after obtaining the target audio signal corresponding to the audio file to be identified, the server maps the target audio signal to the Mel frequency domain to obtain an audio Mel spectrogram, extracts the melody Mel spectrogram from the audio Mel spectrogram through a fundamental frequency extraction algorithm, and finally performs singer identification based on the trained singer identification model to obtain the singer information corresponding to the audio file to be identified. The singer identification method and device can thus improve singer identification efficiency and accuracy.
Referring to fig. 4, another embodiment of the singer identification device according to the present invention includes:
the receiving module 301 is configured to receive an audio file to be identified of a mixed recording, and obtain a target audio signal of the audio file to be identified;
the conversion module 302 is configured to convert the target audio signal to a mel frequency domain according to a preset fourier transform algorithm, so as to obtain an audio mel spectrogram corresponding to the audio file to be identified;
an extracting module 303, configured to extract a melody mel spectrogram of the background music part from the audio mel spectrogram by using a preset fundamental frequency extracting algorithm;
and the recognition module 304 is configured to perform singer recognition on the audio Mel spectrogram and the melody Mel spectrogram based on the trained singer recognition model, so as to obtain singer information corresponding to the audio file to be recognized.
Optionally, the singer identifying device further includes:
the expansion module 305 is configured to obtain an initial sample audio file with singer information labels, and expand the initial sample audio file through a preset data enhancement algorithm to obtain a target sample audio file;
an input module 306, configured to obtain an initialized convolutional recurrent neural network model, and input the target sample audio file into the convolutional recurrent neural network model, where the convolutional recurrent neural network model includes a plurality of convolutional layers, a plurality of gating cyclic unit layers, and a full connection layer;
The training module 307 is configured to perform model training on the plurality of convolution layers, the plurality of gating circulating unit layers and the fully-connected layer based on the target sample audio file, so as to obtain a model loss result;
and the adjustment module 308 is configured to adjust network parameters of the convolutional recurrent neural network model according to the model loss result, so as to obtain a trained singer identification model.
Optionally, the expansion module 305 includes:
the acquisition unit 3051 is used for acquiring a plurality of initial sample audio files with singer information marks, converting the plurality of initial sample audio files into sample audio signals and obtaining a plurality of sample audio signals;
the track dividing unit 3052 is configured to delete musical instrument tracks in the plurality of sample audio signals respectively through a preset music track dividing algorithm to obtain a plurality of voice signals;
an extracting unit 3053, configured to extract background sounds from the plurality of sample audio signals through a preset fundamental frequency extracting algorithm, so as to obtain a plurality of melody signals;
and the fusion unit 3054 is used for respectively fusing the human sound signals and the melody signals through a preset data enhancement algorithm to obtain a target sample audio file.
Optionally, the track dividing unit 3052 is specifically configured to:
according to a preset music track dividing algorithm, respectively separating the plurality of sample audio signals into a plurality of target tracks;
and eliminating the musical instrument sound tracks in the target sound tracks to obtain a plurality of human sound tracks, and generating a plurality of human sound signals corresponding to the human sound tracks.
Optionally, the conversion module 302 is specifically configured to:
carrying out framing treatment on the target audio signal according to a time window with a preset frame length to obtain a multi-frame time domain signal;
converting the multi-frame time domain signals into frequency domains through a preset Fourier transform algorithm to obtain multi-frame frequency domain signals;
and acquiring a preset Mel filter group, and performing filtering processing on the multi-frame frequency domain signals to obtain an audio Mel spectrogram corresponding to the audio file to be identified.
Optionally, the extracting module 303 is specifically configured to:
extracting a fundamental frequency signal from the target audio signal through a preset fundamental frequency extraction algorithm, wherein the fundamental frequency signal is used for indicating a background music part signal of the audio file to be identified;
and carrying out convolution operation on the audio Mel spectrogram based on the fundamental frequency signal to obtain a melody Mel spectrogram of the background music part in the audio Mel spectrogram.
Optionally, the identifying module 304 is specifically configured to:
performing convolution recursive feature extraction on the audio Mel spectrogram and the melody Mel spectrogram through a plurality of convolution layers and a plurality of gating circulating unit layers in the trained singer recognition model to obtain a feature matrix corresponding to the audio file to be recognized;
and carrying out singer probability voting on the feature matrix through a full connection layer in the singer identification model, and taking singer information with highest voting probability as singer information corresponding to the audio file to be identified.
In the embodiment of the invention, when the training samples for the singer identification model are insufficient to complete model training, the server expands the number of initial sample audio files through a data enhancement algorithm to obtain a multiplicatively increased set of target sample audio files. It then performs singer information prediction on the target sample audio files through the plurality of convolution layers, the plurality of gating circulation unit layers and the full connection layer in the initialized convolution recurrent neural network model to obtain model loss results, and finally generates the trained singer identification model according to the model loss results.
The singer identification device in the embodiment of the present invention is described in detail above from the perspective of the modularized functional entity in fig. 3 and fig. 4; the singer identification apparatus in the embodiment of the present invention is described in detail below from the perspective of hardware processing.
Fig. 5 is a schematic structural diagram of a singer identification device according to an embodiment of the present invention. The singer identification device 500 may vary considerably in configuration or performance, and may include one or more processors (central processing units, CPU) 510 (e.g., one or more processors), a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing application programs 533 or data 532, where the memory 520 and the storage medium 530 may be transitory or persistent storage. The program stored on the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations for the singer identification device 500. Still further, the processor 510 may be configured to communicate with the storage medium 530 and execute the series of instruction operations in the storage medium 530 on the singer identification device 500.
The singer identification device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. It will be appreciated by those skilled in the art that the singer identification device structure shown in fig. 5 does not limit the singer identification device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The present invention also provides a singer identification device, comprising a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the singer identification method in the above embodiments.
The present invention also provides a computer readable storage medium, which may be a non-volatile computer readable storage medium, or a volatile computer readable storage medium, having stored therein instructions that, when executed on a computer, cause the computer to perform the steps of the singer identification method.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created from the use of blockchain nodes, and the like.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated and linked by cryptographic methods, each data block containing a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. A singer identification method, characterized in that the singer identification method comprises:
receiving an audio file to be identified of a mixed recording, and acquiring a target audio signal of the audio file to be identified;
according to a preset Fourier transform algorithm, converting the target audio signal into a Mel frequency domain to obtain an audio Mel spectrogram corresponding to the audio file to be identified;
extracting a melody Mel spectrogram of the background music part from the audio Mel spectrogram by a preset fundamental frequency extraction algorithm;
based on a trained singer identification model, carrying out singer identification on the audio Mel spectrogram and the melody Mel spectrogram to obtain singer information corresponding to the audio file to be identified;
Before the audio file to be identified of the mixed record is received and the target audio signal of the audio file to be identified is acquired, the singer identification method further comprises the following steps:
acquiring an initial sample audio file with singer information labels, and expanding the initial sample audio file through a preset data enhancement algorithm to obtain a target sample audio file;
acquiring an initialized convolution recurrent neural network model, and inputting the target sample audio file into the convolution recurrent neural network model, wherein the convolution recurrent neural network model comprises a plurality of convolution layers, a plurality of gating circulation unit layers and a full connection layer;
model training is carried out on the convolution layers, the gating circulating unit layers and the full-connection layer based on the target sample audio file, so that a model loss result is obtained;
according to the model loss result, network parameters of the convolution recurrent neural network model are adjusted to obtain a trained singer identification model;
the method for obtaining the initial sample audio file with singer information label, and expanding the initial sample audio file through a preset data enhancement algorithm to obtain a target sample audio file comprises the following steps:
acquiring a plurality of initial sample audio files with singer information marks, and converting the plurality of initial sample audio files into sample audio signals to obtain a plurality of sample audio signals;
deleting musical instrument tracks in the plurality of sample audio signals through a preset music track dividing algorithm to obtain a plurality of voice signals;
respectively extracting background sounds from the plurality of sample audio signals through a preset fundamental frequency extraction algorithm to obtain a plurality of melody signals;
respectively fusing each human sound signal with the melody signals through a preset data enhancement algorithm to obtain a target sample audio file;
wherein extracting the melody Mel spectrogram of the background music part from the audio Mel spectrogram through the preset fundamental frequency extraction algorithm comprises:
extracting a fundamental frequency signal from the target audio signal through the preset fundamental frequency extraction algorithm, wherein the fundamental frequency signal indicates the background music part of the audio file to be identified;
performing a convolution operation on the audio Mel spectrogram based on the fundamental frequency signal to obtain the melody Mel spectrogram of the background music part of the audio Mel spectrogram;
and wherein performing singer identification on the audio Mel spectrogram and the melody Mel spectrogram based on the trained singer identification model to obtain the singer information corresponding to the audio file to be identified comprises:
performing convolutional-recurrent feature extraction on the audio Mel spectrogram and the melody Mel spectrogram through the plurality of convolution layers and the plurality of GRU layers in the trained singer identification model to obtain a feature matrix corresponding to the audio file to be identified;
performing singer probability voting on the feature matrix through the fully connected layer in the singer identification model, and taking the singer information with the highest voting probability as the singer information corresponding to the audio file to be identified.
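For illustration only (this sketch is not part of the claims): claim 1 specifies a convolutional recurrent neural network with several convolution layers, several GRU layers, and a fully connected voting layer, but leaves all layer counts and sizes unspecified. The following PyTorch sketch shows one plausible shape of such a model, with the audio Mel spectrogram and melody Mel spectrogram stacked as two input channels; every hyper-parameter here is an assumption.

```python
# Minimal sketch of the claimed CRNN (convolution layers -> GRU layers ->
# fully connected voting layer). Layer counts, channel widths, and sizes
# are illustrative assumptions; the patent does not specify them.
import torch
import torch.nn as nn

class SingerCRNN(nn.Module):
    def __init__(self, n_mels: int = 128, n_singers: int = 100):
        super().__init__()
        # Convolution layers: the audio and melody Mel spectrograms are
        # stacked as two input channels.
        self.conv = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Gated recurrent unit layers model the temporal structure.
        self.gru = nn.GRU(input_size=64 * (n_mels // 4), hidden_size=128,
                          num_layers=2, batch_first=True)
        # Fully connected layer produces per-singer scores ("votes").
        self.fc = nn.Linear(128, n_singers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 2, n_mels, n_frames)
        h = self.conv(x)                      # (B, 64, n_mels/4, n_frames/4)
        h = h.permute(0, 3, 1, 2).flatten(2)  # (B, time, features)
        h, _ = self.gru(h)                    # (B, time, 128)
        return self.fc(h.mean(dim=1))         # average over time, then vote

# Usage: the singer with the highest voting probability is the result.
# probs = SingerCRNN()(spectrograms).softmax(dim=-1)
# singer_id = probs.argmax(dim=-1)
```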
2. The singer identification method of claim 1, wherein deleting the instrument tracks in the plurality of sample audio signals through the preset track-splitting algorithm to obtain the plurality of vocal signals comprises:
separating each of the plurality of sample audio signals into a plurality of target tracks according to the preset track-splitting algorithm;
eliminating the instrument tracks from the plurality of target tracks to obtain a plurality of vocal tracks, and generating a plurality of vocal signals corresponding to the plurality of vocal tracks.
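For illustration only: the claims name a "preset track-splitting algorithm" without identifying one. A widely used off-the-shelf choice is Spleeter's two-stem model, used below purely as a stand-in; the function name and sample-rate assumption are hypothetical.

```python
# Sketch of claim 2's track splitting with Spleeter's 2-stem model as a
# stand-in for the unspecified "preset track-splitting algorithm": the
# signal is separated into vocal and accompaniment tracks, the instrument
# (accompaniment) track is discarded, and the vocal track is kept.
import numpy as np
from spleeter.separator import Separator

def extract_vocal_signal(waveform: np.ndarray) -> np.ndarray:
    """waveform: float array of shape (n_samples, 2), 44.1 kHz assumed."""
    separator = Separator('spleeter:2stems')  # vocals / accompaniment
    stems = separator.separate(waveform)
    return stems['vocals']                    # instrument track deleted
```

In the training pipeline of claim 1, each vocal signal obtained this way would then be fused with melody signals extracted from other sample recordings, enlarging the labeled training set.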
3. The singer identification method of claim 1, wherein converting the target audio signal into the Mel frequency domain according to the preset Fourier transform algorithm to obtain the audio Mel spectrogram corresponding to the audio file to be identified comprises:
framing the target audio signal according to a time window with a preset frame length to obtain a multi-frame time-domain signal;
converting the multi-frame time-domain signal into the frequency domain through the preset Fourier transform algorithm to obtain a multi-frame frequency-domain signal;
acquiring a preset Mel filter bank, and filtering the multi-frame frequency-domain signal to obtain the audio Mel spectrogram corresponding to the audio file to be identified.
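For illustration only: claim 3's three steps (framing with a fixed-length time window, a per-frame Fourier transform, and Mel filter-bank filtering) correspond to a standard Mel spectrogram computation. A librosa sketch, with assumed frame and hop lengths:

```python
# Sketch of claim 3: framing + per-frame FFT (together, the STFT), then
# filtering with a Mel filter bank. n_fft (frame length), hop_length, and
# n_mels are assumed values; the patent only calls them "preset".
import librosa
import numpy as np

def audio_mel_spectrogram(y: np.ndarray, sr: int = 22050) -> np.ndarray:
    # Framing + Fourier transform of each frame (power spectrogram).
    spec = np.abs(librosa.stft(y, n_fft=2048, hop_length=512)) ** 2
    # The Mel filter bank maps linear frequency bins to the Mel frequency domain.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=2048, n_mels=128)
    return mel_fb @ spec  # (n_mels, n_frames) audio Mel spectrogram
```

librosa.feature.melspectrogram wraps these same steps in a single call; they are separated here to mirror the claim's wording.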
4. A singer identification device, characterized in that the singer identification device comprises:
a receiving module, configured to receive a mixed-recording audio file to be identified and acquire a target audio signal of the audio file to be identified;
a conversion module, configured to convert the target audio signal into the Mel frequency domain according to a preset Fourier transform algorithm to obtain an audio Mel spectrogram corresponding to the audio file to be identified;
an extraction module, configured to extract a melody Mel spectrogram of the background music part from the audio Mel spectrogram through a preset fundamental frequency extraction algorithm;
an identification module, configured to perform singer identification on the audio Mel spectrogram and the melody Mel spectrogram based on a trained singer identification model to obtain singer information corresponding to the audio file to be identified;
wherein the singer identification device further comprises:
an expansion module, configured to acquire an initial sample audio file with a singer information label and expand the initial sample audio file through a preset data enhancement algorithm to obtain a target sample audio file;
an input module, configured to acquire an initialized convolutional recurrent neural network model and input the target sample audio file into the convolutional recurrent neural network model, wherein the convolutional recurrent neural network model comprises a plurality of convolution layers, a plurality of gated recurrent unit (GRU) layers, and a fully connected layer;
a training module, configured to perform model training on the plurality of convolution layers, the plurality of GRU layers, and the fully connected layer based on the target sample audio file to obtain a model loss result;
an adjustment module, configured to adjust network parameters of the convolutional recurrent neural network model according to the model loss result to obtain the trained singer identification model;
wherein the expansion module comprises:
an acquisition unit, configured to acquire a plurality of initial sample audio files with singer information labels and convert the plurality of initial sample audio files into sample audio signals to obtain a plurality of sample audio signals;
a track-splitting unit, configured to delete instrument tracks in the plurality of sample audio signals through a preset track-splitting algorithm to obtain a plurality of vocal signals;
an extraction unit, configured to extract background sounds from each of the plurality of sample audio signals through the preset fundamental frequency extraction algorithm to obtain a plurality of melody signals;
a fusion unit, configured to fuse each vocal signal with each of the melody signals through the preset data enhancement algorithm to obtain the target sample audio file;
wherein the extraction module is specifically configured to:
extract a fundamental frequency signal from the target audio signal through the preset fundamental frequency extraction algorithm, wherein the fundamental frequency signal indicates the background music part of the audio file to be identified;
perform a convolution operation on the audio Mel spectrogram based on the fundamental frequency signal to obtain the melody Mel spectrogram of the background music part of the audio Mel spectrogram;
and wherein the identification module is specifically configured to:
perform convolutional-recurrent feature extraction on the audio Mel spectrogram and the melody Mel spectrogram through the plurality of convolution layers and the plurality of GRU layers in the trained singer identification model to obtain a feature matrix corresponding to the audio file to be identified;
perform singer probability voting on the feature matrix through the fully connected layer in the singer identification model, and take the singer information with the highest voting probability as the singer information corresponding to the audio file to be identified.
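For illustration only: the extraction module's two steps can be approximated with pYIN as one possible "preset fundamental frequency extraction algorithm". The Gaussian weighting of Mel bins around the fundamental frequency contour below is an assumed stand-in for the patent's unspecified "convolution operation", and the 200 Hz bandwidth is arbitrary.

```python
# Sketch of the extraction module: estimate a fundamental frequency
# contour with pYIN, then emphasise the Mel bins of the audio Mel
# spectrogram near that contour to obtain a melody Mel spectrogram.
import librosa
import numpy as np

def melody_mel_spectrogram(y: np.ndarray, mel_spec: np.ndarray,
                           sr: int = 22050,
                           hop_length: int = 512) -> np.ndarray:
    # Fundamental frequency extraction (f0 is NaN where no pitch is found).
    f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz('C2'),
                                 fmax=librosa.note_to_hz('C7'),
                                 sr=sr, hop_length=hop_length)
    mel_freqs = librosa.mel_frequencies(n_mels=mel_spec.shape[0], fmax=sr / 2)
    mask = np.zeros_like(mel_spec)
    for t in range(min(mel_spec.shape[1], len(f0))):
        if voiced[t] and not np.isnan(f0[t]):
            # Gaussian weight around the fundamental frequency contour
            # (assumed stand-in for the claimed "convolution operation").
            mask[:, t] = np.exp(-0.5 * ((mel_freqs - f0[t]) / 200.0) ** 2)
    return mel_spec * mask
```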
5. A singer identification apparatus, characterized in that the singer identification apparatus comprises: a memory and at least one processor, the memory having instructions stored therein;
the at least one processor invokes the instructions in the memory to cause the singer identification apparatus to perform the singer identification method according to any one of claims 1-3.
6. A computer-readable storage medium having instructions stored thereon, wherein the instructions, when executed by a processor, implement the singer identification method according to any one of claims 1-3.
CN202110740063.0A 2021-06-30 2021-06-30 Singer identification method, singer identification device, singer identification equipment and storage medium Active CN113421589B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110740063.0A CN113421589B (en) 2021-06-30 2021-06-30 Singer identification method, singer identification device, singer identification equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113421589A (en) 2021-09-21
CN113421589B (en) 2024-03-01

Family

ID=77717485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110740063.0A Active CN113421589B (en) 2021-06-30 2021-06-30 Singer identification method, singer identification device, singer identification equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113421589B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114627892B * 2022-03-18 2024-06-18 厦门大学 Deep learning-based main vocal melody extraction method for multi-part music
CN115221929A (en) * 2022-09-15 2022-10-21 中国电子科技集团公司第二十九研究所 Signal identification method based on time-frequency image


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105788589B (en) * 2016-05-04 2021-07-06 腾讯科技(深圳)有限公司 Audio data processing method and device

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103854644A (en) * 2012-12-05 2014-06-11 中国传媒大学 Automatic duplicating method and device for single track polyphonic music signals
CN104978962A (en) * 2014-04-14 2015-10-14 安徽科大讯飞信息科技股份有限公司 Query by humming method and system
CN107203571A (en) * 2016-03-18 2017-09-26 腾讯科技(深圳)有限公司 Song lyric information processing method and device
CN106095925A (en) * 2016-06-12 2016-11-09 北京邮电大学 Personalized song recommendation system based on vocal music features
CN110310666A (en) * 2019-06-27 2019-10-08 成都嗨翻屋科技有限公司 Instrument recognition method and system based on SE convolutional networks
KR20190106922A (en) * 2019-08-30 2019-09-18 엘지전자 주식회사 Artificial sound source separation method and device thereof
CN112802446A (en) * 2019-11-14 2021-05-14 腾讯科技(深圳)有限公司 Audio synthesis method and device, electronic equipment and computer-readable storage medium
US10854182B1 (en) * 2019-12-16 2020-12-01 Aten International Co., Ltd. Singing assisting system, singing assisting method, and non-transitory computer-readable medium comprising instructions for executing the same
CN111445892A (en) * 2020-03-23 2020-07-24 北京字节跳动网络技术有限公司 Song generation method and device, readable medium and electronic equipment
CN111681637A (en) * 2020-04-28 2020-09-18 平安科技(深圳)有限公司 Song synthesis method, device, equipment and storage medium
CN111681674A (en) * 2020-06-01 2020-09-18 中国人民大学 Method and system for identifying musical instrument types based on naive Bayes model
CN112466334A (en) * 2020-12-14 2021-03-09 腾讯音乐娱乐科技(深圳)有限公司 Audio identification method, equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on a query-by-humming music retrieval system based on fundamental frequency; Cao Hui; China Master's Theses Full-text Database, Information Science and Technology; 2013-03-15 (No. 3); I138-1775 *

Also Published As

Publication number Publication date
CN113421589A (en) 2021-09-21

Similar Documents

Publication Publication Date Title
Ghosal et al. Music Genre Recognition Using Deep Neural Networks and Transfer Learning.
Alías et al. A review of physical and perceptual feature extraction techniques for speech, music and environmental sounds
CN113421589B (en) Singer identification method, singer identification device, singer identification equipment and storage medium
CN113450828B (en) Music genre identification method, device, equipment and storage medium
CN112786057B (en) Voiceprint recognition method and device, electronic equipment and storage medium
Ruvolo et al. A learning approach to hierarchical feature selection and aggregation for audio classification
Cai et al. Music genre classification based on auditory image, spectral and acoustic features
Sarno et al. Classification of music mood using MPEG-7 audio features and SVM with confidence interval
Van Balen et al. Cognition-inspired descriptors for scalable cover song retrieval
KS et al. Comparative performance analysis for speech digit recognition based on MFCC and vector quantization
Dziubinski et al. Estimation of musical sound separation algorithm effectiveness employing neural networks
Lazaro et al. Music tempo classification using audio spectrum centroid, audio spectrum flatness, and audio spectrum spread based on MPEG-7 audio features
Chhetri et al. Carnatic music identification of melakarta ragas through machine and deep learning using audio signal processing
Nasridinov et al. A study on music genre recognition and classification techniques
CN112967734B (en) Music data identification method, device, equipment and storage medium based on multiple sound parts
CN113066512B (en) Buddhism music identification method, device, equipment and storage medium
Yang et al. Sound event detection in real-life audio using joint spectral and temporal features
Ghosal et al. Musical genre and style recognition using deep neural networks and transfer learning
Pishdadian et al. Classifying non-speech vocals: Deep vs signal processing representations
Peeters Template-based estimation of tempo: using unsupervised or supervised learning to create better spectral templates
Rajan et al. Multi-channel CNN-Based Rāga Recognition in Carnatic Music Using Sequential Aggregation Strategy
Guerrero-Turrubiates et al. Guitar chords classification using uncertainty measurements of frequency bins
Sunouchi et al. Diversity-Robust Acoustic Feature Signatures Based on Multiscale Fractal Dimension for Similarity Search of Environmental Sounds
Blaszke et al. Real and Virtual Instruments in Machine Learning–Training and Comparison of Classification Results
Karthik et al. Feature Extraction in Music Information Retrieval using Machine Learning Algorithms

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant