CN117854528A - Audio noise reduction method, training method and device of noise reduction model - Google Patents

Audio noise reduction method, training method and device of noise reduction model

Info

Publication number
CN117854528A
CN117854528A (application CN202311702835.7A)
Authority
CN
China
Prior art keywords
audio
noise reduction
frequency
voiceprint
noisy
Prior art date
Legal status
Pending
Application number
CN202311702835.7A
Other languages
Chinese (zh)
Inventor
陈洲旋
Current Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd filed Critical Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202311702835.7A priority Critical patent/CN117854528A/en
Publication of CN117854528A publication Critical patent/CN117854528A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G10L21/0232 Processing in the frequency domain
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract

The application relates to an audio noise reduction method, a training method of a noise reduction model, and a device. The method comprises the following steps: acquiring noisy audio of a user; performing information extraction on the noisy audio through a first dilated convolution module in a noise reduction model to obtain time-frequency information of the noisy audio, wherein the first dilated convolution module comprises a dilated convolution network along the frequency axis and a dilated convolution network along the time axis, and the time-frequency information comprises information of the noisy audio in the time domain and the frequency domain; inputting the time-frequency information of the noisy audio into an adaptive voiceprint module in the noise reduction model to obtain voiceprint features of the user; and inputting the voiceprint features of the user and the time-frequency information of the noisy audio into a second dilated convolution module in the noise reduction model to obtain the noise-reduced audio corresponding to the noisy audio. With this method, effective noise reduction of noisy audio containing non-stationary noise can be achieved.

Description

Audio noise reduction method, training method and device of noise reduction model
Technical Field
The present application relates to the field of audio processing technology, and in particular, to an audio noise reduction method, a training method of a noise reduction model, a device, a computer device, a storage medium, and a computer program product.
Background
Audio noise reduction generally refers to techniques that, when an audio signal is disturbed or even submerged by various kinds of noise, extract the useful audio signal from the noisy background and suppress or reduce the noise interference. Most existing audio noise reduction methods rely on traditional digital signal processing: they use frequency-domain and time-domain transforms to estimate the noise spectrum in the original noisy audio, and then use the estimated noise spectrum to predict the noise-reduced audio signal from the recorded signal.
However, such noise reduction methods based on frequency-domain and time-domain transforms generally assume that the noise is stationary, i.e., that its statistical properties remain constant throughout the signal, and only under this assumption do they achieve a certain noise reduction effect. When a user is singing, however, accompaniment is playing and the surrounding environment is complex (for example, passers-by talking at the roadside, family or friends chatting during karaoke at home, or vocals present in the accompaniment of a karaoke scene), so the accompaniment and environmental sounds are fully captured during recording, which makes later processing of the human voice much more difficult.
Disclosure of Invention
Based on this, it is necessary to provide an audio noise reduction method, a training method of a noise reduction model, an apparatus, a computer device, a computer-readable storage medium, and a computer program product that address the technical problem that the above methods make later processing of the human voice difficult.
In a first aspect, the present application provides an audio noise reduction method. The method comprises the following steps:
acquiring noisy audio of a user;
performing information extraction on the noisy audio through a first dilated convolution module in a noise reduction model to obtain time-frequency information of the noisy audio; the first dilated convolution module comprises a dilated convolution network along the frequency axis and a dilated convolution network along the time axis, and the time-frequency information comprises information of the noisy audio in the time domain and the frequency domain;
inputting the time-frequency information of the noisy audio into an adaptive voiceprint module in the noise reduction model to obtain voiceprint features of the user;
and inputting the voiceprint features of the user and the time-frequency information of the noisy audio into a second dilated convolution module in the noise reduction model to obtain the noise-reduced audio corresponding to the noisy audio.
In one embodiment, the inputting the time-frequency information of the noisy audio into the adaptive voiceprint module in the noise reduction model to obtain the voiceprint features of the user includes:
performing voiceprint extraction on the time-frequency information of the noisy audio through the adaptive voiceprint module to obtain the current voiceprint features of the user, and determining a cleanliness value of the time-frequency information;
and, in the case that historical voiceprint features of the user are pre-stored, determining the voiceprint features of the user based on the cleanliness value, the current voiceprint features of the user, and the historical voiceprint features.
In one embodiment, the determining the voiceprint features of the user based on the cleanliness value, the current voiceprint features of the user, and the historical voiceprint features includes:
in the case that the cleanliness value is greater than or equal to a threshold, updating the historical voiceprint features of the user based on the current voiceprint features to obtain the voiceprint features of the user;
and, in the case that the cleanliness value is smaller than the threshold, determining the historical voiceprint features of the user as the voiceprint features of the user.
In one embodiment, the updating the historical voiceprint features of the user based on the current voiceprint features to obtain the voiceprint features of the user includes:
acquiring a first weight preset for the historical voiceprint features of the user;
determining a second weight for the current voiceprint features based on the first weight;
and fusing the current voiceprint features and the historical voiceprint features according to the first weight and the second weight to obtain the voiceprint features of the user.
In one embodiment, before the information extraction is performed on the noisy audio through the first dilated convolution module in the noise reduction model to obtain the time-frequency information of the noisy audio, the method further includes:
performing sub-band decomposition on the noisy audio to obtain a plurality of sub-bands of the noisy audio;
and performing time-frequency transformation on each sub-band to obtain the audio features of each sub-band;
the step of performing information extraction on the noisy audio through the first dilated convolution module in the noise reduction model to obtain the time-frequency information of the noisy audio then includes:
performing information extraction on the audio features of each sub-band of the noisy audio through the first dilated convolution module in the noise reduction model to obtain the time-frequency information of each sub-band.
In one embodiment, the voiceprint features of the user include voiceprint features of the user corresponding to each sub-band;
the inputting the voiceprint features of the user and the time-frequency information of the noisy audio into the second dilated convolution module in the noise reduction model to obtain the noise-reduced audio corresponding to the noisy audio includes:
respectively inputting the voiceprint features corresponding to each sub-band and the time-frequency information of each sub-band into the second dilated convolution module in the noise reduction model to obtain a noise-reduced audio spectrum corresponding to each sub-band;
applying the inverse of the time-frequency transform to the noise-reduced audio spectrum of each sub-band to obtain a noise-reduced audio segment of each sub-band;
and synthesizing the noise-reduced audio segments corresponding to the sub-bands to obtain the noise-reduced audio corresponding to the noisy audio.
In a second aspect, the present application provides a method of training a noise reduction model. The method comprises the following steps:
generating a sample noisy audio set of a sample user, wherein the sample noisy audio set comprises sample noisy audio and clean audio corresponding to the sample noisy audio, and the sample noisy audio is obtained by superimposing noise audio and/or accompaniment audio at different signal-to-noise ratios on the clean audio;
performing information extraction on the sample noisy audio through a first dilated convolution module in the noise reduction model to be trained to obtain time-frequency information of the sample noisy audio; the first dilated convolution module comprises a dilated convolution network along the frequency axis and a dilated convolution network along the time axis, and the time-frequency information comprises information of the sample noisy audio in the time domain and the frequency domain;
inputting the time-frequency information of the sample noisy audio into an adaptive voiceprint module in the noise reduction model to be trained to obtain voiceprint features of the sample user;
inputting the voiceprint features of the sample user and the time-frequency information of the sample noisy audio into a second dilated convolution module in the noise reduction model to be trained to obtain predicted noise-reduced audio corresponding to the sample noisy audio;
and training the noise reduction model to be trained based on the difference information between the predicted noise-reduced audio and the clean audio to obtain a trained noise reduction model.
In one embodiment, the sample noisy audio set comprises a first noisy audio set and a second noisy audio set; each first sample noisy audio in the first noisy audio set comprises target clean audio and at least one type of noise; each second sample noisy audio in the second noisy audio set comprises target clean audio, interfering clean audio, and at least one type of noise; the target clean audio is the voice of the sample user;
the method further comprises the steps of:
training the first dilated convolution module in the noise reduction model to be trained based on the first noisy audio set to obtain a first trained noise reduction model;
and training the adaptive voiceprint module and the second dilated convolution module in the noise reduction model to be trained based on the second noisy audio set, with the parameters of the first dilated convolution module kept unchanged, to obtain a second trained noise reduction model.
In a third aspect, the present application also provides an audio noise reduction device. The device comprises:
the audio acquisition module is used for acquiring noisy audio of the user;
the information extraction module is used for performing information extraction on the noisy audio through a first dilated convolution module in the noise reduction model to obtain time-frequency information of the noisy audio; the first dilated convolution module comprises a dilated convolution network along the frequency axis and a dilated convolution network along the time axis, and the time-frequency information comprises information of the noisy audio in the time domain and the frequency domain;
the voiceprint extraction module is used for inputting the time-frequency information of the noisy audio into an adaptive voiceprint module in the noise reduction model to obtain voiceprint features of the user;
and the audio noise reduction module is used for inputting the voiceprint features of the user and the time-frequency information of the noisy audio into a second dilated convolution module in the noise reduction model to obtain the noise-reduced audio corresponding to the noisy audio.
In a fourth aspect, the present application provides a training apparatus for a noise reduction model. The device comprises:
the system comprises a sample acquisition module, a sampling module and a sampling module, wherein the sample acquisition module is used for generating a sample noisy audio set of a sample user, the sample noisy audio set comprises sample noisy audio and clean audio corresponding to the sample noisy audio, and the sample noisy audio is obtained by superposing noise audio and/or accompaniment audio with different signal to noise ratios on the basis of the clean audio;
the information extraction module is used for carrying out information extraction processing on the sample noisy frequency through a first cavity convolution module in the noise reduction model to be trained to obtain time-frequency information of the sample noisy frequency; the first hole convolution module comprises a hole convolution network of a frequency domain shaft and a hole convolution network of a time domain shaft, and the time-frequency information comprises information of the sample with noise frequency on the time domain and the frequency domain;
the voiceprint extraction module is used for inputting the time-frequency information of the sample noisy audio into the self-adaptive voiceprint module in the noise reduction model to be trained to obtain voiceprint characteristics of the sample user;
the audio prediction module is used for inputting the voiceprint characteristics of the sample user and the time-frequency information of the sample noisy audio into a second cavity convolution module in the noise reduction model to be trained to obtain predicted noise reduction audio corresponding to the sample noisy audio;
The model training module is used for training the noise reduction model to be trained based on the difference information between the predicted noise reduction audio and the clean audio to obtain a noise reduction model after training.
In a fifth aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor that, when executing the computer program, implements the audio noise reduction method or the noise reduction model training method of any one of the above.
In a sixth aspect, the present application also provides a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the audio noise reduction method or the noise reduction model training method of any one of the above.
In a seventh aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the audio noise reduction method or the noise reduction model training method of any one of the above.
According to the above audio noise reduction method, noise reduction model training method and apparatus, computer device, storage medium, and computer program product, a dilated convolution network along the frequency axis and a dilated convolution network along the time axis are constructed as a first dilated convolution module to extract time-frequency information of the noisy audio in the time domain and the frequency domain; an adaptive voiceprint module then extracts the user's voiceprint features based on the time-frequency information; finally, a second dilated convolution module denoises the noisy audio according to the user's voiceprint features to obtain the noise-reduced audio. Because this approach extracts the user's voiceprint features and denoises on that basis, rather than estimating the noise spectrum in the noisy audio as traditional denoising does, it does not depend on whether the noise is stationary, and can therefore effectively denoise noisy audio containing non-stationary noise.
Drawings
FIG. 1 is a flow chart of an audio noise reduction method according to one embodiment;
FIG. 2 is a schematic diagram of audio framing in one embodiment;
FIG. 3 is a schematic diagram of a noise reduction model in one embodiment;
FIG. 4 is a flow chart illustrating steps for extracting voiceprint features of a user in one embodiment;
FIG. 5 is a flow chart of a training method of a noise reduction model in one embodiment;
FIG. 6 is a flow chart of a training method of a noise reduction model according to another embodiment;
FIG. 7 is a flow chart of a training method of a noise reduction model according to yet another embodiment;
FIG. 8 is a schematic diagram of a K-song work singing enhancement and mixing process according to one embodiment;
FIG. 9 is a flow chart of an audio noise reduction method according to another embodiment;
FIG. 10 is a block diagram of an audio noise reduction device in one embodiment;
FIG. 11 is a block diagram of a training device for a noise reduction model in one embodiment;
fig. 12 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein.
In one embodiment, as shown in fig. 1, an audio noise reduction method is provided. The method is described here as applied to a server for illustration; it is understood that it may also be applied to a terminal, or to a system including a terminal and a server and implemented through their interaction. The terminal can be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, Internet-of-Things device, or portable wearable device; the Internet-of-Things device may be a smart speaker, smart television, smart air conditioner, smart vehicle-mounted device, and the like, and the portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server may be implemented as a stand-alone server or as a server cluster composed of multiple servers. In this embodiment, the method includes the steps of:
Step S110, obtaining the noisy audio of the user.
Noisy audio refers to audio containing noise. The noise may be stationary or non-stationary and comes in many types: for example, accompaniment, environmental sounds, and other people's voices in a karaoke scene, or car sounds, loudspeaker sounds, and street music in a road scene.
It is understood that the environments people live in are noisy everywhere, so audio signals are typically mixed with varying degrees of noise. For example, when making a voice call on a road, in a park, in a square, or in similar places, the recorded voice carries vehicle sounds, loudspeaker sounds, and so on, which interfere with the call to different degrees. For another example, in a karaoke scene, the user's singing is generally picked up by a microphone; limited by non-professional equipment and environment, noise such as microphone fricatives, environmental background noise, and accompaniment easily mixes into the picked-up singing. Similarly, audio containing noise recorded in such non-professional, non-quiet environments may be regarded as noisy audio.
Step S120, performing information extraction on the noisy audio through a first dilated convolution module in the noise reduction model to obtain time-frequency information of the noisy audio; the first dilated convolution module comprises a dilated convolution network along the frequency axis and a dilated convolution network along the time axis, and the time-frequency information comprises information of the noisy audio in the time domain and the frequency domain.
The dilated convolution network along the frequency axis is used to learn the intra-frame information of each audio frame of the noisy audio as frequency-domain information. Because audio is a time sequence, the dilated convolution network along the time axis is mainly used to learn the inter-frame information between the audio frames of the noisy audio as time-domain information.
It will be appreciated that each audio signal may be divided into a plurality of audio frames. For example, fig. 2 is a schematic diagram of audio framing with a frame shift of 10 ms and a frame length of 20 ms: the 1st frame of 20 ms is taken from the audio start time, the window is then shifted by 10 ms and another 20 ms frame is taken as the 2nd frame, and so on, yielding a plurality of audio frames. The frequency-axis and time-axis dilated convolution networks can thus learn the intra-frame information and inter-frame information, respectively, of each audio frame of the noisy audio.
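As an illustration of this framing scheme, the following minimal sketch splits a signal into 20 ms frames with a 10 ms shift; the 16 kHz sample rate and the function name are assumptions for the example, not values from this application:

```python
import numpy as np

def frame_audio(signal, sr=16000, frame_ms=20, hop_ms=10):
    # Split a 1-D signal into overlapping frames (frame length 20 ms,
    # frame shift 10 ms, as in fig. 2); sr=16000 is an assumed sample rate.
    frame_len = int(sr * frame_ms / 1000)   # 320 samples at 16 kHz
    hop_len = int(sr * hop_ms / 1000)       # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len:i * hop_len + frame_len]
                     for i in range(n_frames)])

frames = frame_audio(np.random.randn(16000))  # 1 s of audio
print(frames.shape)                           # (99, 320)
```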
In a specific implementation, referring to fig. 3, which is a schematic diagram of the noise reduction model in one embodiment, the noise reduction model includes a first dilated convolution module 31, an adaptive voiceprint module 32, and a second dilated convolution module 33. The first dilated convolution module 31 includes a frequency-axis dilated convolution network 310 and a time-axis dilated convolution network 320. After the noisy audio of the user is obtained, it may be input into the first dilated convolution module 31 to extract the time-frequency information of the noisy audio in the time domain and the frequency domain. Specifically, the noisy audio may be input into the frequency-axis dilated convolution network 310 to extract frequency-domain information, and the extracted frequency-domain information is then input into the time-axis dilated convolution network 320 to obtain time-frequency information containing both frequency-domain and time-domain information.
It will be appreciated that, because audio contains various harmonic components and overtones, a larger receptive field along the frequency axis helps capture their correlation. The present application therefore uses dilated convolution (also called hole or atrous convolution) along the frequency axis. As shown in fig. 3, a 6-layer dilated convolution network may be used, with the dilation rates of the layers set to 1, 2, 4, 8, 16, 32 in turn, i.e., powers of 2. In addition, a normalization layer (BatchNorm) is used along the frequency axis, with PReLU as the activation layer. Dilated convolution along the frequency axis preserves the rich overtone harmonics of singing well, i.e., it models the harmonics better. At the same time, dilated convolution strongly suppresses transient/non-stationary noise (such as door-opening sounds, keyboard sounds, and mouth noises), i.e., a preliminary noise reduction is already applied to the noisy audio while the time-frequency information is being extracted.
In addition, audio has strong context: a singer stretches vowels, for example, and singing is rhythmic. The present application therefore also builds a larger receptive field along the time axis to capture correlation over time. Specifically, the time-axis modeling uses the same dilated convolution network as the frequency axis, with parameters identical to those of the frequency-axis network. It will be appreciated that when a word is elongated during singing and thus persists for a longer time, the larger temporal receptive field helps preserve its integrity and avoid swallowed syllables.
In one embodiment, the dilated convolution network of each dilated convolution module may use a residual network (ResNet) structure to improve convergence during model training. Meanwhile, the whole dilated convolution network adopts a convolutional neural network (CNN) structure, can be parallelized, and can also be implemented with separable convolutions, so the whole network is lightweight and can run in real time.
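A minimal PyTorch sketch of such a frequency-axis stack is shown below. The residual wiring and padding choices are assumptions where the text does not fix them, and the 8-channel width follows the C=4 sub-band configuration described later; the time-axis network would be built the same way with the dilated axis swapped:

```python
import torch
import torch.nn as nn

class DilatedBlock(nn.Module):
    # One layer: 3x3 conv dilated along the frequency axis, BatchNorm,
    # PReLU, and a ResNet-style residual connection (assumed wiring).
    def __init__(self, channels=8, dilation=1):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=(3, 3),
                              stride=(1, 1), padding=(1, dilation),
                              dilation=(1, dilation))
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.PReLU()

    def forward(self, x):
        return x + self.act(self.norm(self.conv(x)))

# 6 layers with dilation rates 1, 2, 4, 8, 16, 32 (powers of 2)
freq_axis_net = nn.Sequential(*[DilatedBlock(8, 2 ** i) for i in range(6)])
y = freq_axis_net(torch.randn(1, 8, 100, 257))  # (batch, ch, time, freq)
print(y.shape)  # torch.Size([1, 8, 100, 257])
```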
In one embodiment, before the noisy audio is processed by the noise reduction model, a time-frequency transformation may first be performed on the noisy audio to facilitate processing and analysis, i.e., the noisy audio is converted from the time domain to the frequency domain to extract the audio features of the noisy audio. The audio features of the noisy audio are then input into the noise reduction model for time-frequency information extraction.
Step S130, inputting the time-frequency information of the noisy audio into the adaptive voiceprint module in the noise reduction model to obtain the voiceprint features of the user.
The adaptive voiceprint module is used to extract voiceprint features.
In a specific implementation, the adaptive voiceprint module can analyze the time-frequency information of the noisy audio to determine its cleanliness value, and can perform voiceprint extraction on the time-frequency information to obtain the user's current voiceprint features. Based on the cleanliness value and the current voiceprint features, the way the user's voiceprint features are produced is determined.
It will be appreciated that, since the voiceprint features of the same user characterize that user's vocal traits and generally do not vary significantly, the voiceprint features extracted each time the user's noisy audio is denoised may be saved for the next use. Thus, when determining the user's voiceprint features, it may first be checked whether historical voiceprint features of the user are pre-stored. If not, i.e., the user's voiceprint features are being extracted for the first time, the extracted current voiceprint features can be used directly as the output of the adaptive voiceprint module. Otherwise, if historical voiceprint features of the user are pre-stored, the user's voiceprint features can be determined jointly from the cleanliness value, the current voiceprint features, and the historical voiceprint features.
Step S140, inputting the voiceprint features of the user and the time-frequency information of the noisy audio into the second dilated convolution module in the noise reduction model to obtain the noise-reduced audio corresponding to the noisy audio.
The second dilated convolution module comprises a dilated convolution network similar to the frequency-axis and time-axis dilated convolution networks in the first dilated convolution module; apart from the number of layers, which is 4, its other parameters may be set identically.
In a specific implementation, after the voiceprint features of the user and the time-frequency information of the noisy audio are input into the second dilated convolution module, the module can, guided by the user's voiceprint features, remove from the time-frequency information of the noisy audio the components that differ greatly from those voiceprint features, and determine the noise-reduced audio from the remaining time-frequency information.
If the noisy audio was time-frequency transformed before being input into the noise reduction model, the model outputs the spectrum of the noise-reduced audio in the frequency domain; an inverse transform is then needed to convert this spectrum from the frequency domain back to the time domain, yielding the noise-reduced audio.
In this audio noise reduction method, a frequency-axis dilated convolution network and a time-axis dilated convolution network are constructed as the first dilated convolution module to extract time-frequency information of the noisy audio in the time domain and the frequency domain; the adaptive voiceprint module then extracts the user's voiceprint features based on the time-frequency information; finally, the second dilated convolution module denoises the noisy audio according to the user's voiceprint features to obtain the noise-reduced audio. Because this method extracts the user's voiceprint features and denoises on that basis, rather than estimating the noise spectrum in the noisy audio as traditional methods do, it does not depend on whether the noise is stationary, and can therefore effectively denoise noisy audio containing non-stationary noise.
In an exemplary embodiment, as shown in fig. 4, step S130 of inputting the time-frequency information of the noisy audio into the adaptive voiceprint module in the noise reduction model to obtain the voiceprint features of the user specifically includes:
Step S410, performing voiceprint extraction on the time-frequency information of the noisy audio through the adaptive voiceprint module to obtain the current voiceprint features of the user, and determining the cleanliness value of the time-frequency information;
Step S420, in the case that historical voiceprint features of the user are pre-stored, determining the voiceprint features of the user based on the cleanliness value, the current voiceprint features, and the historical voiceprint features.
The cleanliness value characterizes how much noise the time-frequency information contains. It can be computed in various ways; for example, it may be the ratio of signal energy to noise energy, in which case a higher cleanliness value means the time-frequency information contains less noise and is purer.
Referring to the schematic diagram of the noise reduction model shown in fig. 3, the adaptive voiceprint module may include a voiceprint extraction module, a signal-to-noise ratio (SNR) module, and a voiceprint update module. The voiceprint extraction module performs voiceprint extraction on the time-frequency information of the noisy audio to obtain the user's current voiceprint features. The SNR module performs signal-to-noise estimation on the time-frequency information of the noisy audio to characterize its cleanliness. The voiceprint update module updates the pre-stored historical voiceprint features of the user, and the adaptive voiceprint module controls this update according to the cleanliness value produced by the SNR module. The SNR module uses a 2-layer bidirectional long short-term memory (LSTM) network followed by a 1-dimensional convolutional network (CNN) with sigmoid as the activation function, which maps a real number into the interval (0, 1) to reflect the cleanliness of the audio signal: the higher the value, the higher the signal-to-noise ratio and the cleaner the audio signal. Denoting the output of the SNR module as β, the value of β lies in the interval (0, 1).
In a specific implementation, when the time-frequency information of the noisy audio enters the adaptive voiceprint module, it is fed into both the voiceprint extraction module and the SNR module: the voiceprint extraction module extracts the user's current voiceprint features, and the SNR module produces the cleanliness value of the time-frequency information. The adaptive voiceprint module then decides whether the voiceprint update module performs an update according to how the cleanliness value compares with a preset threshold.
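A sketch of such an SNR head follows; the feature dimension, hidden size, and the mean-pooling of per-frame scores into a single β are assumptions for illustration:

```python
import torch
import torch.nn as nn

class SNRModule(nn.Module):
    # Cleanliness head: 2-layer BiLSTM -> 1-D conv -> sigmoid, producing
    # a value beta in (0, 1). Sizes here are assumed example values.
    def __init__(self, feat_dim=257, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.conv = nn.Conv1d(2 * hidden, 1, kernel_size=1)

    def forward(self, tf_info):               # (batch, time, feat_dim)
        h, _ = self.lstm(tf_info)             # (batch, time, 2*hidden)
        s = self.conv(h.transpose(1, 2))      # (batch, 1, time)
        return torch.sigmoid(s).mean(dim=-1)  # pool frames -> (batch, 1)

beta = SNRModule()(torch.randn(2, 100, 257))
print(beta.shape)  # torch.Size([2, 1])
```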
Further, in an exemplary embodiment, step S420 of determining the voiceprint features of the user based on the cleanliness value, the current voiceprint features, and the historical voiceprint features specifically includes:
Step S420a, in the case that the cleanliness value is greater than or equal to the threshold, updating the historical voiceprint features of the user based on the current voiceprint features to obtain the voiceprint features of the user;
Step S420b, in the case that the cleanliness value is smaller than the threshold, determining the historical voiceprint features of the user as the voiceprint features of the user.
The threshold value is an empirical value, and may be set according to actual conditions, for example, the threshold value may be set to 0.5.
Specifically, when the cleanliness value of the time-frequency information of the noisy audio is greater than or equal to the threshold, the time-frequency information is relatively clean and contains little noise, so a more accurate voiceprint can be extracted; the current voiceprint features extracted by the voiceprint extraction module can therefore be used to update the user's historical voiceprint features. Conversely, when the cleanliness value is smaller than the threshold, the time-frequency information contains more noise, and it is difficult to extract an accurate voiceprint.
In this embodiment, the adaptive voiceprint module of the noise reduction model extracts the user's current voiceprint features and the cleanliness value of the time-frequency information of the noisy audio, and decides, according to how the cleanliness value compares with the threshold, whether to update the historical voiceprint features with the current voiceprint features before outputting them. Performing the update only when the cleanliness value is greater than or equal to the threshold guarantees the cleanliness of the voiceprint features output by the adaptive voiceprint module, which in turn safeguards the noise reduction subsequently performed by the second dilated convolution module based on those features and improves the quality of the noise-reduced audio.
In an exemplary embodiment, in step S420a, updating the historical voiceprint features of the user based on the current voiceprint features to obtain the voiceprint features of the user includes: acquiring a first weight preset for the historical voiceprint features of the user; determining a second weight for the current voiceprint features based on the first weight; and fusing the current voiceprint features and the historical voiceprint features according to the first weight and the second weight to obtain the voiceprint features of the user.
Specifically, the first weight may be set empirically. After the first weight for the historical voiceprint features is determined, subtracting it from 1 gives the second weight for the current voiceprint features, i.e., the first weight and the second weight sum to 1. Once both weights are determined, the current and historical voiceprint features can be combined by a weighted sum according to the first and second weights, realizing the fusion of the two, and the weighted sum is output as the user's voiceprint features.
For example, if the first weight is denoted α, the second weight may be denoted 1 - α, and the fusion of the current voiceprint feature E1 with the historical voiceprint feature E2 can be expressed as:
E2 ← α × E2 + (1 - α) × E1
This relation can be understood as updating the historical voiceprint feature E2 of the previous time point with the voiceprint feature E1 of the current time point; the updated feature is output as the user's voiceprint feature and replaces the previous historical voiceprint feature as the new historical voiceprint feature for the next determination.
The first weight α may take a value such as 0.9, 0.95, or 0.98.
The first weight needs to be greater than the second weight so that the historical voiceprint feature E2 is smoothed gradually, updating slowly and without abrupt changes.
It will be appreciated that there are generally two ways to obtain a user's historical voiceprint features. One is to extract them from audio recorded by the user in a quiet environment; the other is to select relatively clean audio from the user's history (such as previously sung works) and extract the voiceprint features from it, so the user does not need to record a new audio segment in this scenario. However, when the user has no works, the voiceprint features cannot be obtained this way, and asking the user to record a segment of voice imposes a series of requirements, such as recording for several minutes in a quiet environment, to guarantee recording quality. Moreover, a user's voiceprint also changes: different capture hardware and recording environments cause channel differences, so the voiceprint features will differ slightly. In particular, when a user interprets different songs in different styles, the voiceprint can differ considerably. In this embodiment, the current and historical voiceprint features are fused based on the first weight of the historical voiceprint features and the second weight of the current voiceprint features, so the historical voiceprint features are updated and the output voiceprint features of the user gradually approach the voiceprint of the user's real-time audio, allowing the user's noisy audio to be denoised better; on top of making the noise-reduced audio clearer and cleaner, this also avoids the voiceprint mutation that would result from directly replacing the historical voiceprint features with the current ones.
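The update logic described above can be summarized in a few lines; the default values (α = 0.95, threshold 0.5) are taken from the examples in the text, while the function name and calling convention are assumptions:

```python
def update_voiceprint(e_hist, e_curr, beta, alpha=0.95, thresh=0.5):
    # e_hist: stored historical voiceprint feature (None if none is stored)
    # e_curr: current voiceprint feature extracted from the noisy audio
    # beta:   cleanliness value from the SNR module, in (0, 1)
    if e_hist is None:        # first extraction: output the current feature
        return e_curr
    if beta >= thresh:        # clean enough: slow exponential update
        return alpha * e_hist + (1.0 - alpha) * e_curr
    return e_hist             # too noisy: keep the stored voiceprint

# works elementwise on numpy arrays or torch tensors alike
e = update_voiceprint(0.8, 0.6, beta=0.7)  # 0.95*0.8 + 0.05*0.6 = 0.79
```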
In an exemplary embodiment, before step S120 of performing information extraction on the noisy audio through the first dilated convolution module in the noise reduction model to obtain the time-frequency information of the noisy audio, the method further includes: performing sub-band decomposition on the noisy audio to obtain a plurality of sub-bands of the noisy audio; and performing time-frequency transformation on each sub-band to obtain the audio features of each sub-band.
Correspondingly, step S120 of performing information extraction on the noisy audio through the first dilated convolution module in the noise reduction model to obtain its time-frequency information is implemented as: performing information extraction on the audio features of each sub-band of the noisy audio through the first dilated convolution module in the noise reduction model to obtain the time-frequency information of each sub-band.
The time-frequency transform may be a Fourier transform, in particular a short-time Fourier transform.
In a specific implementation, a pseudo-quadrature mirror filter bank (PQMF) may be used to perform the sub-band decomposition of the noisy audio; other similar methods may also be used, which this application does not specifically limit. The purposes of decomposing the noisy audio into sub-bands include: (1) splitting the noisy audio into multiple sub-band signals through the pseudo-QMF bank realizes frequency division of the signal and reduces computational complexity; (2) the energy of the human voice is concentrated in the lower frequency bands, while different instruments in the accompaniment occupy different bands, so splitting the noisy audio into sub-bands allows accompaniment and noise to be removed more effectively, improving performance.
To facilitate the noise reduction model's analysis of the noisy audio, after the sub-band decomposition yields the sub-bands of the noisy audio, a time-frequency transform such as the Fourier transform may further be applied to each sub-band, converting each sub-band's signal from the time domain to the frequency domain and giving the audio features of each sub-band. Correspondingly, when sub-bands are used, the information extraction performed by the first dilated convolution module in the noise reduction model operates on the audio features of each sub-band of the noisy audio and yields the time-frequency information of each sub-band.
In practice, experiments show that splitting the noisy audio into 4 sub-bands works well. Taking C=4 as an example, in the model structure shown in fig. 3, the first Conv applied to the input is a two-dimensional convolution with 4 input channels, 8 output channels, kernel_size (1, 1), and stride (1, 1). The two dilated convolution networks of the first dilated convolution module each have 8 input channels and 8 output channels, with kernel_size (3, 3) and stride (1, 1). The second Conv is also a two-dimensional convolution, with 8 input channels, 4 output channels, kernel_size (1, 1), and stride (1, 1).
In this embodiment, sub-band decomposition of the noisy audio reduces the complexity of processing it directly and allows different types of noise to be removed more effectively, improving the subsequent noise reduction; applying a time-frequency transform to each decomposed sub-band then gives the noise reduction model the audio features of each sub-band, which better serves the noise reduction needs of different frequencies and improves the noise reduction effect.
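As a rough illustration of the sub-band split (not the PQMF analysis bank itself: the Butterworth filters below are a simple stand-in, and a real implementation would use the pseudo-QMF prototype filter with decimation):

```python
import numpy as np
from scipy.signal import butter, sosfilt

def split_subbands(x, sr=16000, c=4):
    # Split audio into c contiguous frequency bands covering 0..sr/2.
    # Stand-in for the PQMF bank described above, for illustration only.
    edges = np.linspace(0, sr / 2, c + 1)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        lo = max(lo, 1.0)            # avoid the 0 Hz corner
        hi = min(hi, sr / 2 - 1.0)   # avoid the Nyquist corner
        sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        bands.append(sosfilt(sos, x))
    return bands                     # list of c full-rate signals

subbands = split_subbands(np.random.randn(16000))
print(len(subbands))  # 4
```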
In this embodiment, what the first dilated convolution module outputs and feeds into the adaptive voiceprint module is the time-frequency information of each sub-band of the noisy audio, so the voiceprint features output by the adaptive voiceprint module are, correspondingly, the user's voiceprint features for each sub-band.
Accordingly, in an exemplary embodiment, after the voiceprint features of the user corresponding to each sub-band are obtained, step S140 of inputting the voiceprint features of the user and the time-frequency information of the noisy audio into the second dilated convolution module in the noise reduction model to obtain the noise-reduced audio corresponding to the noisy audio further includes:
Step S141, respectively inputting the voiceprint features corresponding to each sub-band and the time-frequency information of each sub-band into the second dilated convolution module in the noise reduction model to obtain the noise-reduced audio spectrum corresponding to each sub-band;
Step S142, applying the inverse of the time-frequency transform to the noise-reduced audio spectrum of each sub-band to obtain the noise-reduced audio segment of each sub-band;
Step S143, synthesizing the noise-reduced audio segments corresponding to the sub-bands to obtain the noise-reduced audio corresponding to the noisy audio.
In a specific implementation, as shown in the model structure of fig. 3, the scale of the voiceprint feature E2 output by the adaptive voiceprint module is inconsistent with the scale of the time-frequency information Y output by the first dilated convolution module. A scale conversion therefore has to be applied to the voiceprint feature E2: specifically, a convolution layer (Conv) B may be placed after the adaptive voiceprint module to convert the scale of E2 to match the time-frequency information Y. The converted voiceprint features and the time-frequency information Y are then input into the second dilated convolution module.
Because the noise reduction model processes each sub-band separately, the voiceprint features of each sub-band undergo this scale conversion before entering the second dilated convolution module, which yields the noise-reduced audio spectrum of each sub-band. And because each sub-band was time-frequency transformed before entering the model to facilitate signal processing, once the model outputs the noise-reduced spectrum of each sub-band, an inverse transform is needed to convert each spectrum from the frequency domain back to the time domain, giving the noise-reduced audio segment of each sub-band. Finally, the noise-reduced audio segments of the sub-bands are synthesized; in particular, a parallel reconstruction method may be used, yielding the noise-reduced audio corresponding to the noisy audio.
In this embodiment, the second dilated convolution network performs noise reduction on each sub-band; the inverse of the time-frequency transform is then applied to each noise-reduced spectrum to obtain the noise-reduced audio segment of each sub-band, and these segments are synthesized into the noise-reduced audio corresponding to the noisy audio. Taking a single sub-band as the processing unit, denoising each sub-band inside the noise reduction model, and synthesizing the sub-band segments after denoising greatly reduces the difficulty of denoising the noisy audio for the noise reduction model.
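A simplified sketch of steps S142 and S143 using torch.stft/torch.istft is shown below; summing the full-rate sub-band signals is a stand-in for the parallel reconstruction of the synthesis filter bank, and all sizes are example values:

```python
import torch

def synthesize(denoised_specs, n_fft=1024, hop=512):
    # denoised_specs: list of complex per-sub-band spectra, as output by
    # the second dilated convolution module (here produced by torch.stft).
    window = torch.hann_window(n_fft)
    segments = [torch.istft(s, n_fft=n_fft, hop_length=hop, window=window)
                for s in denoised_specs]     # inverse transform (S142)
    return torch.stack(segments).sum(dim=0)  # synthesis (S143)

specs = [torch.stft(torch.randn(16000), 1024, 512,
                    window=torch.hann_window(1024), return_complex=True)
         for _ in range(4)]                  # 4 sub-bands
print(synthesize(specs).shape)               # ~16000 samples
```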
In an exemplary embodiment, performing time-frequency transformation on each sub-band to obtain the audio features of each sub-band includes: performing time-frequency transformation on each sub-band to obtain the amplitude features and phase features of the sub-band; and determining the amplitude features and phase features as the audio features of the sub-band.
Specifically, the time-frequency transform may be a Fourier transform; this embodiment is described in detail below taking the Fourier transform as an example.
First, before the time-frequency transform, the waveform signal of the sub-band needs to be framed. The frame length may be denoted L and the frame shift P, where L is typically a power of 2, e.g., L = 1024, and the frame shift may be 0.5L. After framing, each audio frame may be windowed, i.e., the signal of each frame is multiplied by a window function to reduce the spectral leakage caused by discontinuities between audio frames, and then Fourier transformed. With frame length L, the Fourier transform yields L frequency bins in total; because the bins are conjugate-symmetric, generally L/2 + 1 bins are used.
Suppose the Fourier transform of a sub-band gives the complex-domain result X = Xr + j·Xi. The corresponding amplitude is |X| = sqrt(Xr^2 + Xi^2), and the corresponding phase is α = arctan(Xi / Xr), where arctan is the arctangent function.
The time domain waveform of the sub-band is subjected to short-time Fourier transform, and a transformed initial frequency spectrum can be obtained. The |x| is the amplitude characteristic of the extracted initial spectrum, and α is the phase characteristic of the initial spectrum. The amplitude and phase characteristics may be considered as audio characteristics of the sub-band.
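The framing, windowing and transform just described may be sketched as follows; the Hann window and the frame length of 1024 follow the example values in this embodiment, while the function name and interface are illustrative assumptions:

    import numpy as np

    def subband_features(x, frame_len=1024):
        """Frame a subband waveform, window it, and return the amplitude
        and phase features of its one-sided spectrum (L/2 + 1 bins)."""
        hop = frame_len // 2                    # frame shift P = 0.5 L
        window = np.hanning(frame_len)          # reduces spectral leakage
        n_frames = 1 + (len(x) - frame_len) // hop
        frames = np.stack(
            [x[i * hop : i * hop + frame_len] for i in range(n_frames)])
        spec = np.fft.rfft(frames * window, axis=1)  # L/2 + 1 bins kept
        amplitude = np.abs(spec)                     # |X| = sqrt(Xr^2 + Xi^2)
        phase = np.angle(spec)                       # alpha = arctan(Xi / Xr)
        return amplitude, phase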
In this embodiment, the time-frequency transform applied to each subband yields its amplitude feature and phase feature, which serve as the subband's audio features; this representation makes the noisy audio easier to model and facilitates the subsequent extraction of the time-frequency information.
In one embodiment, as shown in fig. 5, a training method of a noise reduction model is provided. The method is described here as applied to a server; it is understood that it may also be applied to a terminal, or to a system comprising a terminal and a server and implemented through their interaction. In this embodiment, the method includes the following steps:
Step S510, generating a sample noisy audio set of a sample user, wherein the sample noisy audio set comprises sample noisy audio and the clean audio corresponding to the sample noisy audio, the sample noisy audio being obtained by superimposing noise audio and/or accompaniment audio at different signal-to-noise ratios on the clean audio.
In a specific implementation, the process of acquiring the sample noisy audio set may include:
(1) A large amount of clean audio of users is collected, as well as various types of noise audio, where the noise may be scene noise such as squares, roads, conference rooms, restaurants, cafes, and keyboard typing.
(2) A large amount of accompaniment audio is collected, for example accompaniment of instrument types such as piano, guitar and drums, as well as original song accompaniments. Accompaniment can be regarded here as noise in a broad sense, because it also needs to be eliminated.
(3) The clean audio is mixed with the various types of noise and accompaniment to obtain noisy audio. The mixing may be performed at different signal-to-noise ratios so as to cover scenes with different noise levels. For example, the noisy audio may be mixed as: a) clean audio + noise audio; b) clean audio + accompaniment audio; c) clean audio + noise audio + accompaniment audio. The signal-to-noise ratio at which the noise and the accompaniment are superimposed may be chosen according to the application scene, for example in the range of -15 dB to 20 dB; this application does not limit the choice, which may follow the actual demand.
Through steps (1)-(3), multiple sample noisy audios with different noise levels or noise types can be obtained, together with the clean audio of the user corresponding to each sample noisy audio; each sample noisy audio and its corresponding clean audio form the sample noisy audio set.
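The SNR-controlled mixing in step (3) may be sketched as follows; the scaling follows the standard definition of mixing at a target signal-to-noise ratio, and the function name and interface are illustrative assumptions:

    import numpy as np

    def mix_at_snr(clean, interference, snr_db):
        """Superimpose noise and/or accompaniment on clean audio at a
        chosen SNR (e.g. drawn from the -15 dB to 20 dB range)."""
        n = min(len(clean), len(interference))
        clean, interference = clean[:n], interference[:n]
        p_clean = np.mean(clean ** 2)
        p_noise = np.mean(interference ** 2) + 1e-12   # avoid divide-by-zero
        scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
        return clean + scale * interference            # sample noisy audio

    # e.g. clean + noise + accompaniment, each at its own SNR:
    # noisy = mix_at_snr(mix_at_snr(clean, noise, 5.0), accompaniment, 0.0)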
Step S520, performing information extraction processing on the sample noisy audio through the first dilated convolution module in the noise reduction model to be trained, to obtain the time-frequency information of the sample noisy audio; the first dilated convolution module comprises a dilated convolution network along the frequency axis and a dilated convolution network along the time axis, and the time-frequency information comprises the information of the sample noisy audio in the time domain and the frequency domain.
Step S530, inputting the time-frequency information of the sample noisy audio into the adaptive voiceprint module in the noise reduction model to be trained, to obtain the voiceprint features of the sample user.
Step S540, inputting the voiceprint features of the sample user and the time-frequency information of the sample noisy audio into the second dilated convolution module in the noise reduction model to be trained, to obtain the predicted noise-reduced audio corresponding to the sample noisy audio.
Specifically, similar to the denoising of noisy audio by the noise reduction model in the foregoing embodiments, during training the sample noisy audio is input into the first dilated convolution module of the model to be trained, and its time-frequency information is extracted through the frequency-axis and time-axis dilated convolution networks; the time-frequency information is then input into the adaptive voiceprint module, which produces the voiceprint features of the sample user; finally, the voiceprint features are processed by the second dilated convolution module to obtain the predicted noise-reduced audio corresponding to the sample noisy audio.
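At the block-diagram level, the forward pass just described may be sketched as follows; the four sub-modules are stand-ins passed in from outside, since this application specifies them only as a first dilated convolution module, an adaptive voiceprint module, a scale conversion layer and a second dilated convolution module, so every layer choice behind them is an assumption:

    import torch.nn as nn

    class NoiseReductionModel(nn.Module):
        def __init__(self, first_module, voiceprint_module, align, second_module):
            super().__init__()
            self.first = first_module        # dilated convs on frequency and time axes
            self.voiceprint = voiceprint_module
            self.align = align               # scale conversion ("convolution layer B")
            self.second = second_module      # predicts the noise-reduced spectrum

        def forward(self, audio_features):
            y = self.first(audio_features)   # time-frequency information Y
            e2 = self.voiceprint(y)          # voiceprint feature E2
            z_in = self.align(e2, y)         # E2 rescaled and joined with Y
            return self.second(z_in)         # predicted noise-reduced spectrum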
Step S550, training the noise reduction model to be trained based on the difference information between the predicted noise-reduced audio and the clean audio, to obtain the trained noise reduction model.
Specifically, there are multiple sample noisy audios, and each training step uses one pair, i.e., one sample noisy audio together with its clean audio. In each step, the sample noisy audio is input into the model to be trained to obtain the predicted noise-reduced audio; a loss value between the predicted noise-reduced audio and the clean audio corresponding to the input sample noisy audio is then computed as the difference information; and the model is updated until the loss converges or a preset number of training steps is reached, at which point training ends and the trained noise reduction model is obtained.
Before the noise reduction model is trained on the sample noisy audio, feature extraction may be applied to obtain the audio features of the sample noisy audio. Specifically, the Fourier transform is used to convert the sample noisy audio from the time domain to the frequency domain, and its amplitude and phase features are taken as the audio features. The audio features of the sample noisy audio then serve as the input variable, the clean audio serves as the supervision information, and the model is trained under the constraint of a loss function.
For example, let the sample noisy audio be x, the corresponding clean audio be s, and the predicted noise-reduced audio of the model to be trained be z, and let Xf, Sf and Zf be the spectra obtained by applying the Fourier transform to the three signals. The error between Sf and Zf can then be computed, and the parameters of the noise reduction model adjusted based on this error so that the weights in the model are sufficiently trained. Through continued iteration, training completes once the error falls below a preset value, yielding the trained noise reduction model.
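A minimal sketch of one training step under these definitions is given below, assuming the model maps a noisy waveform to a noise-reduced waveform (i.e., it wraps the feature extraction and the inverse transform) and using the mean squared error between the spectra Zf and Sf; the STFT parameters and the optimizer are illustrative assumptions:

    import torch

    def training_step(model, optimizer, noisy, clean, n_fft=1024):
        """One update: predict noise-reduced audio z for sample noisy
        audio x, then minimize the spectral error between Zf and Sf."""
        z = model(noisy)                           # predicted noise-reduced audio
        window = torch.hann_window(n_fft, device=z.device)
        zf = torch.stft(z, n_fft, window=window, return_complex=True)
        sf = torch.stft(clean, n_fft, window=window, return_complex=True)
        loss = (zf - sf).abs().pow(2).mean()       # error between Zf and Sf
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()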
In the noise reduction model constructed by this training method, a dilated convolution network along the frequency axis and a dilated convolution network along the time axis form the first dilated convolution module, which extracts the time-frequency information of the noisy audio in the time and frequency domains; based on this time-frequency information, the adaptive voiceprint module further extracts the user's voiceprint features; and finally the second dilated convolution module denoises the noisy audio according to the user's voiceprint features, to obtain the noise-reduced audio. Because the method extracts the user's voiceprint features and uses them as the basis for denoising, it does not depend on whether the noise is stationary, unlike traditional denoising based on the noise spectrum of the noisy audio, and can therefore effectively denoise noisy audio containing non-stationary noise.
In one embodiment, the sample noisy audio set comprises a first noisy audio set and a second noisy audio set; each first sample noisy audio in the first set comprises target clean audio and at least one kind of noise; each second sample noisy audio in the second set comprises target clean audio, interfering clean audio and at least one kind of noise; the target clean audio is the voice of the sample user.
As shown in fig. 6, the method further includes:
Step S610, training the first dilated convolution module in the noise reduction model to be trained based on the first noisy audio set, to obtain the first-stage-trained noise reduction model.
Step S620, keeping the parameters of the first dilated convolution module unchanged, training the adaptive voiceprint module and the second dilated convolution module in the noise reduction model to be trained based on the second noisy audio set, to obtain the second-stage-trained noise reduction model as the trained noise reduction model.
Specifically, the training of the noise reduction model in this application comprises two stages. First stage: train the first dilated convolution module with the first noisy audio set. Second stage: freeze the parameters of the first dilated convolution module, and train the adaptive voiceprint module and the second dilated convolution module with the second noisy audio set. The two stages are described below in connection with fig. 3.
The first stage:
A first noisy audio set is constructed. The noisy audio is obtained by mixing pre-collected clean audio with noise and accompaniment, in the same manner as in step S510. For example, a first sample noisy audio A may be: a) target clean audio + noise audio; b) target clean audio + accompaniment audio; c) target clean audio + noise audio + accompaniment audio. It should be noted that each first sample noisy audio contains the clean audio of only one human voice, referred to as the target clean audio.
The first dilated convolution module is trained with the first noisy audio set. As shown in fig. 3, the first stage takes a first sample noisy audio X1 from the first noisy audio set as the input variable, the output Y of the first dilated convolution module as the prediction, and the target clean audio corresponding to X1 as the supervision information. Specifically, X1 is input into the noise reduction model to obtain the output Y of the first dilated convolution module; Y and the target clean audio corresponding to X1 are substituted into the loss function to obtain a loss value; and the parameters of the first dilated convolution module are adjusted based on this loss value until the loss converges or the preset number of training steps is reached, ending the training and yielding the first-stage-trained noise reduction model. The loss function may be the mean squared error between the output of the first dilated convolution module and the target clean audio.
Second stage:
A second noisy audio set is constructed. The noisy audio is again obtained by mixing pre-collected clean audio with noise and accompaniment, in the same manner as in step S510. The difference from the first stage is that the clean audio of the first stage contains only one voice, whereas the second stage adds an interfering voice, so that the trained noise reduction model can suppress interfering voices and keep only the target voice. For example, a second sample noisy audio B in the second noisy audio set may be: a) target clean audio + interfering clean audio + noise audio; b) target clean audio + interfering clean audio + accompaniment audio; c) target clean audio + interfering clean audio + noise audio + accompaniment audio.
The adaptive voiceprint module and the second dilated convolution module are trained with the second noisy audio set. As shown in fig. 3, the second stage takes a second sample noisy audio X2 from the second noisy audio set as the input variable, the output of the whole noise reduction model, i.e., the output Z of the second dilated convolution module, as the prediction, and the target clean audio corresponding to X2 as the supervision information. Specifically, X2 is input into the noise reduction model to obtain the predicted noise-reduced audio; the predicted noise-reduced audio and the target clean audio are substituted into the loss function to obtain a loss value; and the parameters of the adaptive voiceprint module and the second dilated convolution module are adjusted based on this loss value until the loss converges or the preset number of training steps is reached, ending the training and yielding the second-stage-trained noise reduction model. The loss function may be the mean squared error between the output of the noise reduction model and the target clean audio.
It will be appreciated that, since the first-stage training has already been completed, the parameters of the first stage may be frozen during the second stage so that only the second-stage modules are updated.
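In PyTorch terms, freezing the first-stage parameters may be sketched as follows; the attribute names mirror the model sketch above and are assumptions:

    import torch

    # Stage 2: keep the first dilated convolution module fixed and update
    # only the adaptive voiceprint module and the second module.
    for p in model.first.parameters():   # 'model' as in the sketch above
        p.requires_grad = False

    optimizer = torch.optim.Adam(
        [p for p in model.parameters() if p.requires_grad], lr=1e-4)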
In this embodiment, the adaptive voiceprint module is mainly expected to distinguish interfering voices, while the first dilated convolution module is expected to distinguish the other noises that contain no voice. Given this division of tasks between the two modules, this application adopts a staged training method and trains the noise reduction model in two stages on different noisy audio sets. On the one hand, this enables the model to effectively remove interfering voices on top of removing accompaniment audio and noise audio; on the other hand, it reduces the training complexity, saves computing resources, and speeds up training.
It should be noted that, as shown in fig. 7, the training method of the noise reduction model provided in this application may also incorporate the schemes of the foregoing audio noise reduction embodiments: performing subband decomposition on the sample noisy audio to obtain multiple subbands, applying the time-frequency transform to each subband, inputting the resulting audio features of each subband into the noise reduction model to obtain each subband's noise-reduced spectrum, obtaining each subband's noise-reduced audio segment through the inverse transform, and finally synthesizing the segments of all subbands into the noise-reduced audio; this is not repeated here.
In one embodiment, to help those skilled in the art understand this application, the audio noise reduction method provided herein is described in detail below for a karaoke scenario, taking the user's voice picked up by a microphone as the noisy audio.
It should be noted that singing voice varies widely in intensity and spans a wider frequency range than speech. When singing, vowels are sometimes prolonged, and singing follows a specific rhythm and melody; sustained notes and vibrato distinguish it from speech. Although speaking and singing originate from the same human vocal system, they differ essentially in phoneme usage, pitch, breathing and volume: for example, singing has a higher average intensity level than speech, and is typically accompanied by background music. Since the microphone also picks up the accompaniment while the user sings, later processing of the voice becomes difficult, which makes removing the accompaniment and noise from the mixed recording important.
In particular, the surrounding environment during singing may contain voices other than the user's, such as passers-by on the street, or family and friends talking during karaoke at home; even the accompaniment in the karaoke scene may contain harmony vocals. Ordinary noise reduction can remove noise and even accompaniment, but it keeps all human voices. The purpose of this application is to remove, besides noise and accompaniment, the voices other than the user's, keeping only the user's own voice as far as possible, so as to better enhance the user's singing.
Therefore, the user's singing voice is treated as the target voice, and all other voices are removed. The flow is shown in fig. 8: the user's voice is processed with effects such as equalization (EQ) and reverberation, loudness equalization is then applied between the voice and the accompaniment, mixing is performed, and a complete musical work is finally produced. If the microphone picks up accompaniment, environmental noise and interfering voices in addition to the user's singing, an ideal result is hard to achieve, because the accompaniment and noise are also affected when EQ, reverberation and other effects are applied; in particular, accompaniment picked up in a karaoke scene has passed through the loudspeaker and the microphone and thus differs from the well-produced original accompaniment. Moreover, the final work is mixed with the accompaniment again, so two accompaniments are superimposed; because the picked-up voice already contains accompaniment, the accompaniment proportion cannot be computed accurately, which makes the final loudness equalization between voice and accompaniment difficult and the result hard to guarantee. By removing noise, accompaniment and interfering voices, the various processing steps become more controllable, and the quality of the synthesized audio improves.
Specifically, as shown in fig. 9, a specific flow of the audio noise reduction method provided in this application, which removes noise, accompaniment and interfering voices, is as follows:
Step S910, acquiring the noisy audio of the user, and performing subband decomposition on the noisy audio to obtain a plurality of subbands of the noisy audio;
Step S920, performing the Fourier transform on each subband to obtain the amplitude feature and the phase feature of each subband as its audio features;
Step S930, inputting the audio features of each subband into the noise reduction model to obtain the noise-reduced audio segment of each subband;
the processing of each subband by the noise reduction model includes the following steps:
Step S9301, performing information extraction on the audio features of a single subband through the first dilated convolution module in the noise reduction model, to obtain the time-frequency information of the subband;
Step S9302, performing voiceprint extraction on the time-frequency information of the subband through the adaptive voiceprint module, to obtain the current voiceprint feature corresponding to the subband, and determining the cleanliness value of the time-frequency information;
Step S9303, in the case where a historical voiceprint feature of the user is pre-stored, judging whether the cleanliness value is greater than or equal to a threshold;
Step S9304, if not, determining the historical voiceprint feature of the user as the voiceprint feature of the subband;
Step S9305, if yes, determining a second weight for the current voiceprint feature based on a first weight preset for the historical voiceprint feature of the user;
Step S9306, fusing the current voiceprint feature and the historical voiceprint feature according to the first weight and the second weight, to obtain the voiceprint feature corresponding to the subband (a sketch of this fusion is given after this flow);
Step S9307, inputting the voiceprint feature corresponding to the subband and the time-frequency information of the subband into the second dilated convolution module in the noise reduction model, to obtain the noise-reduced spectrum corresponding to the subband;
Step S9308, applying the inverse of the time-frequency transform to the noise-reduced spectrum of the subband, to obtain the noise-reduced audio segment of the subband;
Step S940, synthesizing the noise-reduced audio segments corresponding to all subbands, to obtain the noise-reduced audio corresponding to the noisy audio.
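Steps S9303 to S9306 amount to a cleanliness-gated weighted fusion of the stored and the freshly extracted voiceprint. A minimal sketch follows, assuming the two weights sum to one; the threshold and the first weight are illustrative values, not values fixed by this application:

    def fuse_voiceprint(current, historical, cleanliness,
                        threshold=0.5, w_hist=0.7):
        """Return the voiceprint feature used for noise reduction.

        If the time-frequency information is too contaminated (cleanliness
        below the threshold), fall back on the stored historical voiceprint;
        otherwise fuse the two with preset weights (steps S9305-S9306)."""
        if historical is None:            # no stored voiceprint yet
            return current
        if cleanliness < threshold:       # step S9304
            return historical
        w_cur = 1.0 - w_hist              # second weight derived from the first
        return w_hist * historical + w_cur * current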
In the above audio noise reduction method, the user's voiceprint features are extracted and used as the basis for noise reduction. Unlike traditional denoising based on the noise spectrum of the noisy audio, the method does not depend on whether the noise is stationary, and can therefore effectively denoise noisy audio containing non-stationary noise.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, the execution order of these steps is not strictly limited, and they may be executed in other orders. Moreover, at least some of the steps in the above flowcharts may comprise multiple sub-steps or stages; these need not be completed at the same moment but may be executed at different moments, and need not be executed sequentially but may be executed in turn or alternately with at least some of the other steps or sub-steps.
Based on the same inventive concept, an embodiment of this application further provides an audio noise reduction device for implementing the above audio noise reduction method, and a training device for implementing the above training method of the noise reduction model. The solutions provided by these two devices are similar to those described in the method embodiments above, so for the specific limitations in the device embodiments below, reference may be made to the limitations of the corresponding methods, which are not repeated here.
In one embodiment, as shown in fig. 10, there is provided an audio noise reduction device including: an audio acquisition module 1001, an information extraction module 1002, a voiceprint extraction module 1003, and an audio noise reduction module 1004, wherein:
an audio acquisition module 1001, configured to acquire noisy audio of a user;
the information extraction module 1002 is configured to perform information extraction processing on the noisy audio through the first dilated convolution module in the noise reduction model, to obtain the time-frequency information of the noisy audio; the first dilated convolution module comprises a dilated convolution network along the frequency axis and a dilated convolution network along the time axis, and the time-frequency information comprises the information of the noisy audio in the time domain and the frequency domain;
the voiceprint extraction module 1003 is configured to input the time-frequency information of the noisy audio into the adaptive voiceprint module in the noise reduction model, to obtain the voiceprint features of the user;
the audio noise reduction module 1004 is configured to input the voiceprint features of the user and the time-frequency information of the noisy audio into the second dilated convolution module in the noise reduction model, to obtain the noise-reduced audio corresponding to the noisy audio.
In one embodiment, the voiceprint extraction module 1003 is further configured to perform voiceprint extraction on the time-frequency information of the noisy audio through the adaptive voiceprint module, to obtain the current voiceprint feature of the user and determine the cleanliness value of the time-frequency information; and, in the case where a historical voiceprint feature of the user is pre-stored, to determine the voiceprint feature of the user based on the cleanliness value, the current voiceprint feature and the historical voiceprint feature.
In one embodiment, the voiceprint extraction module 1003 is further configured to update, based on the current voiceprint feature, the historical voiceprint feature of the user to obtain the voiceprint feature of the user if the cleanliness value is greater than or equal to the threshold; and under the condition that the cleanliness value is smaller than the threshold value, determining the historical voiceprint characteristics of the user as the voiceprint characteristics of the user.
In one embodiment, the voiceprint extraction module 1003 is further configured to obtain a first weight preset for a historical voiceprint feature of the user; determining a second weight of the current voiceprint feature based on the first weight; and according to the first weight and the second weight, fusing the current voiceprint characteristics and the historical voiceprint characteristics to obtain the voiceprint characteristics of the user.
In one embodiment, the device further includes an audio processing module configured to perform subband decomposition processing on the noisy audio to obtain a plurality of subbands of the noisy audio, and to perform time-frequency transform processing on each subband to obtain the audio features of each subband;
correspondingly, the information extraction module 1002 is further configured to perform information extraction processing on the audio features of each subband of the noisy audio through the first dilated convolution module in the noise reduction model, to obtain the time-frequency information of each subband.
In one embodiment, the voiceprint features of the user include a voiceprint feature corresponding to each subband; the audio noise reduction module 1004 is further configured to input the voiceprint feature corresponding to each subband and the time-frequency information of each subband into the second dilated convolution module in the noise reduction model, to obtain the noise-reduced spectrum corresponding to each subband; to apply the inverse of the time-frequency transform to each subband's noise-reduced spectrum, to obtain each subband's noise-reduced audio segment; and to synthesize the noise-reduced audio segments corresponding to the subbands, to obtain the noise-reduced audio corresponding to the noisy audio.
In one embodiment, as shown in fig. 11, there is provided a training apparatus of a noise reduction model, including: a sample acquisition module 1101, an information extraction module 1102, a voiceprint extraction module 1103, an audio prediction module 1104, and a model training module 1105, wherein:
the sample obtaining module 1101 is configured to generate a sample noisy audio set of a sample user, the set comprising sample noisy audio and the clean audio corresponding to the sample noisy audio, the sample noisy audio being obtained by superimposing noise audio and/or accompaniment audio at different signal-to-noise ratios on the clean audio;
the information extraction module 1102 is configured to perform information extraction processing on the sample noisy audio through the first dilated convolution module in the noise reduction model to be trained, to obtain the time-frequency information of the sample noisy audio; the first dilated convolution module comprises a dilated convolution network along the frequency axis and a dilated convolution network along the time axis, and the time-frequency information comprises the information of the sample noisy audio in the time domain and the frequency domain;
the voiceprint extraction module 1103 is configured to input the time-frequency information of the sample noisy audio into the adaptive voiceprint module in the noise reduction model to be trained, to obtain the voiceprint features of the sample user;
the audio prediction module 1104 is configured to input the voiceprint features of the sample user and the time-frequency information of the sample noisy audio into the second dilated convolution module in the noise reduction model to be trained, to obtain the predicted noise-reduced audio corresponding to the sample noisy audio;
the model training module 1105 is configured to train the noise reduction model to be trained based on the difference information between the predicted noise-reduced audio and the clean audio, to obtain the trained noise reduction model.
In one embodiment, the sample noisy audio set comprises a first noisy audio set and a second noisy audio set; each first sample noisy audio in the first set comprises target clean audio and at least one kind of noise; each second sample noisy audio in the second set comprises target clean audio, interfering clean audio and at least one kind of noise; and the target clean audio is the voice of the sample user. The model training module 1105 is further configured to train the first dilated convolution module in the noise reduction model to be trained based on the first noisy audio set, to obtain the first-stage-trained noise reduction model; and to train the adaptive voiceprint module and the second dilated convolution module based on the second noisy audio set, to obtain the second-stage-trained noise reduction model as the trained noise reduction model.
All or part of the modules in the above audio noise reduction device and training device of the noise reduction model may be implemented by software, hardware or a combination thereof. The modules may be embedded in hardware in, or independent of, a processor in a computer device, or stored in the form of software in a memory in the computer device, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 12. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used to store data during the audio noise reduction process. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an audio noise reduction method.
It will be appreciated by those skilled in the art that the structure shown in fig. 12 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that, the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data are required to comply with the related laws and regulations and standards of the related countries and regions.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer-readable storage medium, which, when executed, may include the procedures of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory can include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM is available in a variety of forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases; non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum-computing-based data processing logic units, and the like, without being limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination contains no contradiction, it should be considered to be within the scope of this specification.
The above embodiments merely express several implementations of this application, and their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of the application. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the appended claims.

Claims (10)

1. A method of audio noise reduction, the method comprising:
acquiring noisy audio of a user;
performing information extraction processing on the noisy audio through a first dilated convolution module in a noise reduction model, to obtain time-frequency information of the noisy audio; the first dilated convolution module comprising a dilated convolution network along a frequency axis and a dilated convolution network along a time axis, and the time-frequency information comprising information of the noisy audio in the time domain and the frequency domain;
inputting the time-frequency information of the noisy audio into an adaptive voiceprint module in the noise reduction model, to obtain voiceprint features of the user;
and inputting the voiceprint features of the user and the time-frequency information of the noisy audio into a second dilated convolution module in the noise reduction model, to obtain noise-reduced audio corresponding to the noisy audio.
2. The method of claim 1, wherein inputting the time-frequency information of the noisy audio into the adaptive voiceprint module in the noise reduction model to obtain the voiceprint features of the user comprises:
performing voiceprint extraction on the time-frequency information of the noisy audio through the adaptive voiceprint module, to obtain a current voiceprint feature of the user, and determining a cleanliness value of the time-frequency information;
and, in the case where a historical voiceprint feature of the user is pre-stored, determining the voiceprint feature of the user based on the cleanliness value, the current voiceprint feature of the user, and the historical voiceprint feature.
3. The method of claim 2, wherein determining the voiceprint feature of the user based on the cleanliness value, the current voiceprint feature of the user, and the historical voiceprint feature comprises:
updating the historical voiceprint feature of the user based on the current voiceprint feature, to obtain the voiceprint feature of the user, in the case where the cleanliness value is greater than or equal to a threshold;
and determining the historical voiceprint feature of the user as the voiceprint feature of the user, in the case where the cleanliness value is less than the threshold.
4. The method of claim 3, wherein updating the historical voiceprint feature of the user based on the current voiceprint feature to obtain the voiceprint feature of the user comprises:
acquiring a first weight preset for the historical voiceprint feature of the user;
determining a second weight for the current voiceprint feature based on the first weight;
and fusing the current voiceprint feature and the historical voiceprint feature according to the first weight and the second weight, to obtain the voiceprint feature of the user.
5. The method of claim 1, wherein, before the information extraction processing is performed on the noisy audio through the first dilated convolution module in the noise reduction model to obtain the time-frequency information of the noisy audio, the method further comprises:
performing subband decomposition processing on the noisy audio, to obtain a plurality of subbands of the noisy audio;
and performing time-frequency transform processing on each subband, to obtain audio features of each subband;
and wherein performing the information extraction processing on the noisy audio through the first dilated convolution module in the noise reduction model to obtain the time-frequency information of the noisy audio comprises:
performing information extraction processing on the audio features of each subband of the noisy audio through the first dilated convolution module in the noise reduction model, to obtain time-frequency information of each subband.
6. The method of claim 5, wherein the voiceprint features of the user comprise a voiceprint feature of the user corresponding to each subband;
and wherein inputting the voiceprint features of the user and the time-frequency information of the noisy audio into the second dilated convolution module in the noise reduction model to obtain the noise-reduced audio corresponding to the noisy audio comprises:
inputting the voiceprint feature corresponding to each subband and the time-frequency information of each subband into the second dilated convolution module in the noise reduction model, to obtain a noise-reduced audio spectrum corresponding to each subband;
performing an inverse of the time-frequency transform on the noise-reduced audio spectrum corresponding to each subband, to obtain a noise-reduced audio segment of each subband;
and synthesizing the noise-reduced audio segments corresponding to the subbands, to obtain the noise-reduced audio corresponding to the noisy audio.
7. A method of training a noise reduction model, the method comprising:
generating a sample noisy audio set of a sample user, wherein the sample noisy audio set comprises sample noisy audio and clean audio corresponding to the sample noisy audio, the sample noisy audio being obtained by superimposing noise audio and/or accompaniment audio at different signal-to-noise ratios on the clean audio;
performing information extraction processing on the sample noisy audio through a first dilated convolution module in a noise reduction model to be trained, to obtain time-frequency information of the sample noisy audio; the first dilated convolution module comprising a dilated convolution network along a frequency axis and a dilated convolution network along a time axis, and the time-frequency information comprising information of the sample noisy audio in the time domain and the frequency domain;
inputting the time-frequency information of the sample noisy audio into an adaptive voiceprint module in the noise reduction model to be trained, to obtain voiceprint features of the sample user;
inputting the voiceprint features of the sample user and the time-frequency information of the sample noisy audio into a second dilated convolution module in the noise reduction model to be trained, to obtain predicted noise-reduced audio corresponding to the sample noisy audio;
and training the noise reduction model to be trained based on difference information between the predicted noise-reduced audio and the clean audio, to obtain a trained noise reduction model.
8. The method of claim 7, wherein the sample noisy audio set comprises a first noisy audio set and a second noisy audio set; each first sample noisy audio in the first noisy audio set comprises target clean audio and at least one kind of noise; each second sample noisy audio in the second noisy audio set comprises target clean audio, interfering clean audio, and at least one kind of noise; and the target clean audio is the voice of the sample user;
the method further comprising:
training the first dilated convolution module in the noise reduction model to be trained based on the first noisy audio set, to obtain a first-stage-trained noise reduction model;
and, keeping the parameters of the first dilated convolution module unchanged, training the adaptive voiceprint module and the second dilated convolution module in the noise reduction model to be trained based on the second noisy audio set, to obtain a second-stage-trained noise reduction model as the trained noise reduction model.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 8 when executing the computer program.
10. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 8.