CN112820315A - Audio signal processing method, audio signal processing device, computer equipment and storage medium

Info

Publication number
CN112820315A
Authority
CN
China
Prior art keywords
audio signal
signal
frequency domain
audio
sub
Legal status
Granted
Application number
CN202010670626.9A
Other languages
Chinese (zh)
Other versions
CN112820315B (en)
Inventor
梁俊斌
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010670626.9A
Publication of CN112820315A
Priority to PCT/CN2021/097663 (WO2022012195A1)
Application granted
Publication of CN112820315B

Classifications

    • G10L 25/30 - Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N 3/08 - Learning methods
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering
    • G10L 25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters


Abstract

The application relates to an audio signal processing method, an audio signal processing device, computer equipment and a storage medium in the technical field of artificial intelligence. The method comprises: obtaining a first audio signal; processing the first audio signal through a spectrum compensation model to obtain a prediction result for compensating the distorted spectrum in the first audio signal; and reconstructing the first audio signal according to the prediction result to obtain a target audio signal in which the distorted spectrum has been repaired. Based on the idea of artificial intelligence (AI), a neural network model obtained through model training can perform compensation prediction uniformly across different device models and software versions, which solves the problem of limited application scenarios for voice signal repair and improves the universality of voice signal repair.

Description

Audio signal processing method, audio signal processing device, computer equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of artificial intelligence, in particular to an audio signal processing method, an audio signal processing device, computer equipment and a storage medium.
Background
In voice communication applications, a voice receiving device needs to perform voice enhancement processing on the received voice signal. The enhancement algorithms used by devices from different manufacturers, with different kernel versions and different application software, may differ, and the voice signal after enhancement may suffer varying degrees of voice impairment.
In the related art, after voice impairment is caused by voice enhancement processing, the impaired voice signal must be measured offline and spectrally analyzed for each combination of device model and enhancement software version, to obtain the spectral distortion distribution and distortion quantization values. A compensation strategy is then set according to the measured distortion results, and spectral gain compensation is applied to devices of the corresponding model and software version using that strategy.
However, in the related art, a compensation strategy must be set separately for each device model and software version. This approach generalizes poorly, so the application scenarios of voice signal repair are limited.
Disclosure of Invention
The embodiments of the application provide an audio signal processing method, an audio signal processing device, computer equipment and a storage medium, which provide a general audio signal repair scheme across different device models and software versions and expand the application scenarios of audio signal repair. The technical scheme is as follows:
in one aspect, an audio signal processing method is provided, the method including:
acquiring a first audio signal;
processing the first audio signal through a spectrum compensation model to obtain a prediction result of performing prediction compensation on a distortion spectrum in the first audio signal; the spectrum compensation model is a neural network model obtained by training a spectrum distortion audio sample and an original audio sample corresponding to the spectrum distortion audio sample;
and reconstructing the first audio signal according to the prediction result to obtain a target audio signal obtained by repairing the distortion frequency spectrum in the first audio signal.
In another aspect, an audio signal processing apparatus is provided, the apparatus comprising:
the signal acquisition module is used for acquiring a first audio signal;
the result obtaining module is used for processing the first audio signal through a spectrum compensation model to obtain a prediction result of prediction compensation on a distortion spectrum in the first audio signal; the spectrum compensation model is a neural network model obtained by training a spectrum distortion audio sample and an original audio sample corresponding to the spectrum distortion audio sample;
and the target acquisition module is used for reconstructing the first audio signal according to the prediction result to obtain a target audio signal after the distortion frequency spectrum in the first audio signal is repaired.
In one possible implementation, the result obtaining module includes:
a frequency domain conversion sub-module for converting the first audio signal into a corresponding frequency domain signal;
a sub-band division sub-module for dividing the frequency domain signal into at least one sub-band frequency domain signal;
the amplitude determining submodule is used for determining the amplitude of each frequency point in the at least one sub-band frequency domain signal;
the power spectrum value determining submodule is used for determining the power spectrum value of the at least one sub-band frequency domain signal according to the amplitude value of each frequency point;
and the sub-band result acquisition sub-module is used for inputting the power spectrum value of each sub-band frequency domain signal into the spectrum compensation model and acquiring the sub-band prediction result of each sub-band frequency domain signal.
In one possible implementation, the frequency domain converting sub-module includes:
a windowing processing unit, configured to perform frame-wise windowing on the first audio signal, and determine a processed time-domain signal;
and the signal acquisition unit is used for carrying out frequency domain conversion on the processed time domain signal to obtain the corresponding frequency domain signal.
In one possible implementation, the signal obtaining unit is configured to,
performing discrete Fourier transform on the processed time domain signal to obtain a corresponding frequency domain signal;
alternatively,
and performing improved discrete cosine transform on the processed time domain signal to obtain the corresponding frequency domain signal.
In one possible implementation, the sub-band division sub-module includes:
and the sub-band division unit is used for dividing the frequency domain signal into the at least one sub-band frequency domain signal by taking a Bark domain as a scale.
In a possible implementation manner, the sub-band result obtaining sub-module includes:
and the sub-band result acquisition unit is used for inputting the power spectrum value corresponding to each sub-band frequency domain signal into the spectrum compensation model to obtain a predicted power spectrum value corresponding to each sub-band frequency domain signal, as the sub-band prediction result of that sub-band frequency domain signal.
In one possible implementation manner, the target obtaining module includes:
a power spectrum value obtaining submodule, configured to obtain a power spectrum value obtained after the first audio signal is reconstructed, according to the prediction result;
and the target generation submodule is used for generating the target audio signal corresponding to the power spectrum value after the first audio signal is reconstructed.
In one possible implementation manner, the target obtaining module includes:
the power spectrum value generation submodule is used for taking the sum of the power spectrum value corresponding to the first audio signal and the frequency band damage rate as a reconstructed power spectrum value; the frequency band damage rate is a historical smooth value of the difference between a predicted power spectrum value corresponding to each of the at least one sub-band frequency domain signal and a power spectrum value corresponding to the first audio signal;
alternatively,
and the power spectrum value determining submodule is used for taking the predicted power spectrum value corresponding to each sub-band frequency domain signal as the reconstructed power spectrum value.
In one possible implementation, the target generation sub-module includes:
and the time domain conversion unit is used for carrying out time domain transformation on the frequency domain signal corresponding to the reconstructed power spectrum value to obtain the target audio signal.
In one possible implementation manner, the signal obtaining module includes:
and the signal acquisition submodule is used for acquiring the first audio signal after audio enhancement processing.
In one possible implementation, the apparatus further includes:
a sample obtaining module, configured to obtain the original audio sample before processing the first audio signal through a spectrum compensation model and obtaining a prediction result of performing prediction compensation on a distortion spectrum in the first audio signal;
a distorted sample obtaining module, configured to perform suppression processing on a power spectrum value of a partial frequency band in a frequency domain signal corresponding to the original audio sample, to obtain a frequency spectrum distorted audio sample corresponding to the original audio sample;
and the model acquisition module is used for performing machine learning training by taking the frequency spectrum distortion audio sample as input and taking the original audio sample as a training target to obtain the frequency spectrum compensation model.
In one possible implementation, the spectrum compensation model is a recurrent neural network (RNN) model or a long short-term memory (LSTM) network model.
In another aspect, a computer device is provided, comprising a processor and a memory, in which at least one instruction, at least one program, set of codes, or set of instructions is stored, which is loaded and executed by the processor to implement the audio signal processing method as described above.
In another aspect, there is provided a computer readable storage medium having stored therein at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by a processor to implement the audio signal processing method as described above.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the audio signal processing method provided in the various alternative implementations of the above aspect.
The technical scheme provided by the application can comprise the following beneficial effects:
the neural network model obtained through model training can be used for carrying out compensation prediction aiming at different equipment models and software versions in a unified mode, the problem that the application scene of voice signal restoration is limited is solved, and therefore the universality of voice signal restoration is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
FIG. 1 is a block diagram illustrating a model training and prediction compensation in accordance with an exemplary embodiment;
FIG. 2 is a model architecture diagram of a machine learning model, shown in accordance with an exemplary embodiment;
FIG. 3 is a schematic diagram illustrating an audio signal processing method according to an exemplary embodiment;
FIG. 4 is a schematic diagram illustrating an audio signal processing method according to an exemplary embodiment;
FIG. 5 is an architectural diagram illustrating a method of audio signal processing according to an exemplary embodiment;
FIG. 6 is a schematic diagram illustrating voice remediation applied in a voice telephony system, according to an example embodiment;
FIG. 7 is a schematic diagram illustrating another example of voice remediation applied in a voice telephony system in accordance with an illustrative embodiment;
FIG. 8 is a block diagram illustrating the structure of an audio signal processing apparatus according to an exemplary embodiment;
FIG. 9 is a schematic diagram illustrating a configuration of a computer device, according to an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It is to be understood that reference herein to "a number" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects, meaning that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
In the schemes shown in the subsequent embodiments of the present application, Artificial Intelligence (AI) can be used to extract, from the spectrogram of an audio signal, the feature points that users care most about. For ease of understanding, the terms used in the embodiments of the present disclosure are explained below.
1) Artificial intelligence AI
AI is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, at both the hardware level and the software level. Its basic infrastructure generally includes sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software technology mainly comprises computer vision, speech processing, natural language processing, and machine learning/deep learning.
With the research and progress of artificial intelligence technology, AI has been researched and applied in many fields, such as smart homes, intelligent wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, robots, smart medical treatment, smart customer service, and smart video services.
2) Machine Learning (Machine Learning, ML)
ML is a multi-field interdiscipline involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specifically studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of AI. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
The scheme provided by the application relates to the technologies such as machine learning of artificial intelligence.
3) Speech enhancement processing
In the application scenario of a voice call, the recording signal acquired by a device needs to undergo voice enhancement processing, which mainly comprises: echo cancellation, noise suppression, automatic volume adjustment, frequency response equalization, and the like.
Voice enhancement processing can be implemented in hardware, for example by certain audio chips, or corresponding voice enhancement processing modules can be added at the application layer.
In practical applications, the voice enhancement algorithms of different manufacturers, kernel versions and application software perform differently, and some of them noticeably damage the acquired recording signal. For example, the damage types may include high-frequency damage, in which noise reduction or echo cancellation algorithms impair the high frequencies of the voice signal: the original high-frequency information is noticeably weakened, so that the corresponding sound becomes muffled and unclear. The damage types may also include frequency band damage, in which certain fixed frequency bands are attenuated because the equalization processing is not done well, resulting in noticeable distortion in the audible perception of the voice signal.
The scheme of the embodiment of the application comprises a model training stage and a prediction stage. FIG. 1 is a block diagram illustrating model training and prediction compensation in accordance with an exemplary embodiment. As shown in fig. 1, in the model training stage, the model training device 110 trains an end-to-end machine learning model on a pre-prepared sample set of original audio samples and corresponding spectrally distorted audio samples. In the prediction compensation stage, the prediction device 120 uses the trained machine learning model to predict, from an input first audio signal, the prediction result for compensating the distorted spectrum in that signal.
The model training device 110 and the prediction device 120 may be computer devices with machine learning capability, for example, the computer devices may be stationary computer devices such as a personal computer and a server, or the computer devices may also be mobile computer devices such as a tablet computer or an e-book reader.
Optionally, the model training device 110 and the prediction device 120 may be the same device, or the model training device 110 and the prediction device 120 may be different devices. Also, when the model training device 110 and the prediction device 120 are different devices, the model training device 110 and the prediction device 120 may be the same type of device, for example, the model training device 110 and the prediction device 120 may both be personal computers; alternatively, the model training device 110 and the prediction device 120 may be different types of devices, for example, the model training device 110 may be a server, and the prediction device 120 may be a mobile terminal device, etc. The embodiment of the present application is not limited to the specific types of the model training device 110 and the prediction device 120.
FIG. 2 is a model architecture diagram illustrating a machine learning model in accordance with an exemplary embodiment. As shown in fig. 2, the machine learning model 20 in the embodiment of the present application may include a sample set, in which samples are generated and stored, and a spectrum compensation model 210. The sample set stores collected original audio samples and their corresponding spectrally distorted audio samples, or artificially constructed original audio samples and their corresponding spectrally distorted audio samples. Each spectrally distorted audio sample stored in the sample set is input into the spectrum compensation model 210, and the spectrum compensation model 210 is trained with the corresponding original audio sample as the output target. The spectrum compensation model 210 is configured to output a prediction result, that is, a predicted spectrum-repaired audio signal corresponding to an input first audio signal.
Reference is now made to fig. 3, which is a diagram illustrating an audio signal processing method that may be performed by an audio processing device, according to an exemplary embodiment. The audio processing device may be the prediction device 120 in the system shown in fig. 1. As shown in fig. 3, the audio signal processing method may include the steps of:
step 301, a first audio signal is obtained.
In a possible implementation manner, the first audio signal is either an original audio signal acquired by the audio processing device and then passed through the speech enhancement processing inherent to the device's hardware or software, or an audio signal that the audio processing device obtains after speech enhancement processing elsewhere. The first audio signal may exhibit different distortion situations because different algorithms performed the speech enhancement.
The algorithm that the audio processing device uses for speech enhancement, in its inherent hardware or software, may differ depending on the device manufacturer, the kernel version and the type of application software, resulting in different distortion situations in the acquired first audio signal.
The voice enhancement processing may include echo cancellation, noise suppression, automatic volume adjustment, frequency response equalization, and other processing. Audio processing devices from different manufacturers or with different kernel versions, and different types of application software, place different emphasis on these aspects when performing the enhancement, so the distortion in the first audio signal obtained after enhancement also differs.
For example, the distortion in the first audio signal may appear as high-frequency damage, which may be caused by noise suppression or echo cancellation: the original high-frequency information in the audio signal is significantly attenuated, so that the sound becomes muffled and unclear. It may also appear as frequency band damage, in which certain fixed frequency bands are fixedly attenuated, possibly because the frequency response equalization is not handled well, resulting in noticeable audible distortion.
Step 302, processing the first audio signal through a spectrum compensation model to obtain a prediction result of performing prediction compensation on a distortion spectrum in the first audio signal; the spectrum compensation model is a neural network model obtained by training the spectrum distortion audio samples and the original audio samples corresponding to the spectrum distortion audio samples.
In one possible implementation, the first audio signal is input to the spectrum compensation model, which outputs the predicted undistorted version of the first audio signal as the prediction result.
The spectrum compensation model is a neural network model obtained by training a spectrum distortion audio sample and an original audio sample corresponding to the spectrum distortion audio sample and updating relevant parameters.
Step 303, reconstructing the first audio signal according to the prediction result, and obtaining a target audio signal obtained by repairing the distortion frequency spectrum in the first audio signal.
In a possible implementation manner, the first audio signal may be reconstructed through a prediction result of the first audio signal output by the neural network model, and an audio signal obtained by repairing the first audio signal is generated as the target audio signal.
The repaired audio signal, used as the target audio signal, can solve the problem of audio distortion in practical applications.
For example, during a voice call made through social software, the terminal on the voice sending side receives the user's voice signal; the received voice signal first undergoes voice enhancement processing, the enhanced voice signal is then reconstructed and repaired, and the repaired voice signal is sent to the terminal on the voice receiving side and played, so that the user on the receiving side hears the voice content clearly. Alternatively, reconstructing and repairing the audio signal may also be used to process recorded audio, optimize audio during live streaming, process audio with impaired sound quality in music playing software, and optimize audio in video playing software.
In conclusion, the neural network model obtained through model training can perform compensation prediction uniformly across different device models and software versions, which solves the problem of limited application scenarios for voice signal repair and improves the universality of voice signal repair.
Reference is made to fig. 4, which is a schematic diagram illustrating an audio signal processing method that may be performed by an audio processing device, according to an example embodiment. The audio processing device may comprise the model training device 110 and the prediction device 120 in the system shown in fig. 1. As shown in fig. 4, the audio signal processing method may comprise two stages: a model training stage, performed by the model training device 110, and a model application stage. The offline model training stage may comprise the following steps:
in step 401, raw audio samples are obtained.
In the embodiment of the present application, the obtained original audio sample may be collected from the outside or obtained by manual construction.
In one possible implementation, the original audio samples may be stored in advance in the model training device or collected and stored by the model training device.
In one possible implementation, the original audio samples are constructed artificially by first preparing a batch of noise-free audio signals and a batch of noise sequences of different types, and then, by configuring different SNR (signal-to-noise ratio) combinations, linearly superimposing the noise-free audio signals and the noise sequences according to the configured signal-to-noise ratios to generate noisy original audio signals. To make the original audio samples cover a wider range of energy levels, the noisy original audio signals are also multiplied as a whole by different gain values.
The prepared noise sequences of different types may be babble noise, street noise, office noise, white noise, or the like.
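As an illustrative sketch only (not part of the patent text), sample construction along these lines could look as follows in Python; the SNR values, gain values, and the assumption that each noise sequence is at least as long as the clean signal are illustrative choices:

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Linearly superimpose a noise sequence on a clean signal at a target SNR."""
    noise = noise[:len(clean)]  # assumes the noise sequence is long enough
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(clean_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

def build_original_samples(clean_signals, noise_sequences,
                           snrs_db=(0, 5, 10, 20), gains=(0.25, 0.5, 1.0)):
    """Generate noisy 'original' samples over a range of SNRs and energy levels."""
    samples = []
    for clean in clean_signals:
        for noise in noise_sequences:
            for snr in snrs_db:
                noisy = mix_at_snr(clean, noise, snr)
                for g in gains:
                    samples.append(g * noisy)  # widen the energy-level coverage
    return samples
```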
In step 402, a power spectrum value of a partial frequency band in the frequency domain signal corresponding to the original audio sample is suppressed to obtain a spectrum distortion audio sample corresponding to the original audio sample.
In a possible implementation manner, after the original audio sample is converted from a time domain signal to a frequency domain signal, some frequency bands of the frequency domain signal are randomly extracted for suppression.
Multiple spectrally distorted audio samples corresponding to one original audio sample are obtained by multiplying the power spectrum values of the randomly extracted frequency bands by random values smaller than or equal to 1.
In a possible implementation, the extraction of the partial frequency band on the frequency domain signal may also be selected according to the actual application.
For example, when the original audio sample is an audio signal with an 8 kHz spectrum, the signal impairment of the part below 2 kHz is relatively small, while the impairment of the part above 4 kHz is relatively serious, so frequency band extraction can follow the distribution of the band impairment probability.
The selected samples should include a certain number of samples with different frequency band damage structures, to ensure the accuracy of training.
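A minimal sketch of this band suppression step, under the assumption that it operates on a per-frame power spectrum and that the damage probability rises with frequency as described above (the band count and probability values are illustrative):

```python
import numpy as np

def make_distorted_sample(power_spectrum: np.ndarray, n_bands: int = 16,
                          rng=None) -> np.ndarray:
    """Randomly suppress some frequency bands of a frame's power spectrum."""
    rng = rng or np.random.default_rng()
    distorted = power_spectrum.copy()
    edges = np.linspace(0, len(distorted), n_bands + 1, dtype=int)
    # Assumed damage-probability distribution: higher bands damaged more often.
    damage_prob = np.linspace(0.1, 0.7, n_bands)
    for b in range(n_bands):
        if rng.random() < damage_prob[b]:
            distorted[edges[b]:edges[b + 1]] *= rng.random()  # random factor <= 1
    return distorted
```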
In step 403, a spectrum compensation model is obtained by performing machine learning training with the spectrum distortion audio sample as input and the original audio sample as a training target.
In the embodiment of the present application, a spectrum compensation model is trained and a model update is performed through the obtained original audio samples and the corresponding spectrum distortion audio samples.
In one possible implementation, the spectrum compensation model is a neural network model trained by the spectrum distortion audio samples and the original audio samples corresponding to the spectrum distortion audio samples.
The spectrum compensation model is a recurrent neural network (RNN) model or a long short-term memory (LSTM) network model.
In one possible implementation, the spectrally distorted audio samples on the input side used for training the spectral compensation model and the original audio samples on the target output side need to be subjected to a logarithmic process.
The power spectrum values corresponding to the spectrally distorted audio samples on the input side and those corresponding to the original audio samples need to be processed logarithmically; the calculation formula can be:

S_dB(i,k) = 20*log10(S(i,k))

where S_dB(i,k) is the log power spectrum value, i is the corresponding frame number, and k is the corresponding frequency point index value.
In a possible implementation manner, the spectrum compensation model trained on the logarithmically processed sample set still uses logarithmic values on its input and output sides in practical applications.
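The patent does not fix a concrete network topology, so the following PyTorch sketch is only one plausible reading: an LSTM that maps per-frame log power spectra of the distorted sample to those of the original sample, trained with a mean-squared-error loss. The sub-band count, hidden size, layer count and optimizer are assumptions:

```python
import torch
import torch.nn as nn

class SpectrumCompensationLSTM(nn.Module):
    """Maps log power spectra of distorted frames to predicted original spectra."""

    def __init__(self, n_subbands: int = 24, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_subbands, hidden_size=hidden,
                            num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_subbands)

    def forward(self, log_power: torch.Tensor) -> torch.Tensor:
        # log_power: (batch, frames, n_subbands) of log-domain power values
        h, _ = self.lstm(log_power)
        return self.out(h)

model = SpectrumCompensationLSTM()
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(distorted_log, original_log):
    """One update: the distorted sample is the input, the original the target."""
    optimizer.zero_grad()
    loss = loss_fn(model(distorted_log), original_log)
    loss.backward()
    optimizer.step()
    return loss.item()
```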
Through steps 401 to 403, the model training device may complete training and updating of the spectrum compensation model.
Next, a model application phase may be performed by the prediction device 120, wherein the model application phase may comprise the steps of:
in step 404, a first audio signal is acquired.
In one possible implementation, a first audio signal is obtained after audio enhancement processing.
Wherein the first audio signal may be a time domain signal with partial distortion.
In step 405, the first audio signal is converted into a corresponding frequency domain signal.
In the embodiment of the present application, the first audio signal is a time domain signal, and it is converted into the corresponding frequency domain signal by an operation, in order to perform the subsequent calculations.
In a possible implementation manner, before the conversion into the corresponding frequency domain signal, the first audio signal needs to undergo frame-division and windowing, producing a processed time domain signal.
In signal processing, because computer equipment can only process signals of finite length, the first audio signal needs framing and windowing, which cuts the signal by sampling time so that finite-length segments can be processed.
The window function may be a rectangular window, a triangular window, a Hanning window, a Hamming window, a Kaiser window, or the like. Windowing the first audio signal may use the square root of the Hamming window.
For example, when the window length is 20 ms and the frame data length is 10 ms, the window function corresponding to the square root of the Hamming window can be:

win(n) = sqrt(0.54 - 0.46*cos(2*pi*n/(N-1)))

where n is an integer value in [0, N-1], and N is the window sample length corresponding to 20 ms.
The windowed first audio signal is obtained by multiplying the first audio signal by the window function:

x_w(n) = x_in(n)*win(n)

where x_in(n) is composed of the previous frame's 10 ms time domain signal followed by the current frame's 10 ms time domain signal.
In one possible implementation manner, after the windowing processing is performed on the first audio signal, the frequency domain conversion is performed on the processed time domain signal, so as to obtain a corresponding frequency domain signal.
The manner of converting the time domain signal into the frequency domain signal may include different algorithms.
In one possible implementation, the corresponding frequency domain signal is obtained by performing Discrete Fourier Transform (DFT) on the processed time domain signal.
The calculation formula of the discrete Fourier transform is:

X(i,k) = sum_{n=0}^{N-1} x_w(n) * e^(-j*2*pi*n*k/N)

The amplitude |X(i,k)| of each frequency point in the frequency domain signal can be obtained from the discrete Fourier transform.
In another possible implementation manner, the modified discrete cosine transform is performed on the processed time domain signal to obtain a corresponding frequency domain signal.
The Modified Discrete Cosine Transform (MDCT) is a Fourier-related transform based on the type-IV discrete cosine transform (DCT-IV). The MDCT is similar to the discrete Fourier transform but uses only real numbers, and its calculation is similar to the discrete Fourier transform algorithm.
In step 406, the frequency domain signal is divided into at least one sub-band frequency domain signal.
In the embodiment of the present application, the entire segment of frequency domain signal corresponding to the first audio signal is divided into at least one sub-band frequency domain signal.
The division into at least one sub-band frequency domain signal may include division in a linear frequency domain transform and division in a non-linear frequency domain transform.
In one possible implementation, when performing the linear frequency domain transform, the frequency domain signal is divided into at least one sub-band frequency domain signal, and the frequency domain signal may be equally divided into at least one sub-band frequency domain signal.
In another possible implementation manner, when performing the non-linear frequency domain transform, the frequency domain signal is divided into at least one sub-band frequency domain signal, and the frequency domain signal may be divided into at least one sub-band frequency domain signal by taking a Bark domain as a scale.
The Bark domain models the 24 critical bands of hearing simulated by auditory filters and can be used to describe signals. Dividing the frequency domain signal into at least one sub-band frequency domain signal by the Bark domain produces sub-bands of unequal width.
Dividing the frequency domain signal into at least one sub-band frequency domain signal with the Bark domain as the scale can be carried out on the basis of a uniform division into equally sized sub-bands: the frequency points in each uniformly divided sub-band are mapped to the Bark-domain sub-bands, and the serial number corresponding to each frequency point is obtained.
In step 407, the amplitudes of the frequency points in at least one sub-band frequency domain signal are determined.
In the embodiment of the application, the amplitudes of the frequency points contained in each sub-band frequency domain signal are determined according to how the at least one sub-band frequency domain signal was divided.
In a possible implementation manner, the frequency point serial numbers contained in each sub-band frequency domain signal are determined according to the division of the at least one sub-band frequency domain signal, and the corresponding frequency point amplitudes are determined from those serial numbers.
The frequency point amplitudes may be calculated through the discrete Fourier transform formula shown in step 405.
For example, if the 0th sub-band of the Bark domain corresponds to the uniformly divided frequency points 0 to 2, then after converting the uniform division to the Bark-domain scale, the frequency point amplitude of the 0th sub-band can be mapped as X_b[0] = 1/3*(x[0] + x[1] + x[2]).
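A sketch of this mapping; the Bark band-edge table is an assumption here and would normally come from a Bark-scale lookup for the given sample rate:

```python
import numpy as np

def bark_subband_amplitudes(x: np.ndarray, band_edges) -> np.ndarray:
    """Average uniformly divided frequency-point amplitudes into Bark sub-bands.

    Sub-band b covers frequency points [band_edges[b], band_edges[b+1]).
    """
    xb = np.empty(len(band_edges) - 1)
    for b in range(len(band_edges) - 1):
        lo, hi = band_edges[b], band_edges[b + 1]
        xb[b] = np.mean(np.abs(x[lo:hi]))  # e.g. Xb[0] = 1/3*(x[0]+x[1]+x[2])
    return xb
```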
In step 408, a power spectrum value of at least one sub-band frequency domain signal is determined according to the magnitude of each frequency point.
In the embodiment of the application, the power spectrum value corresponding to each sub-band frequency domain signal is determined by obtaining each frequency point amplitude value contained in each sub-band frequency domain signal.
In a possible implementation manner, the square of each frequency point amplitude obtained after the Fourier transform may be calculated as the corresponding power spectrum value.
The formula for calculating the corresponding power spectrum value from the frequency point amplitudes may be:

S(i,k) = |X(i,k)|^2, k = 1, 2, 3, …, N

where i is the corresponding frame number, and k is the corresponding frequency point index value.
In a possible implementation manner, for the frequency point amplitudes obtained after the modified discrete cosine transform, the square of each amplitude may likewise be calculated as the corresponding power spectrum value.
In step 409, the power spectrum value of each sub-band frequency domain signal is input into the spectrum compensation model, and a sub-band prediction result of each sub-band frequency domain signal is obtained.
In the embodiment of the application, the power spectrum value corresponding to each sub-band frequency domain signal is input into the spectrum compensation model, and the predicted power spectrum value corresponding to each sub-band frequency domain signal is obtained as the sub-band prediction result of that sub-band frequency domain signal.
In a possible implementation manner, the power spectrum value corresponding to each sub-band frequency domain signal is processed logarithmically, the logarithmic power spectrum value is input into the spectrum compensation model, and the logarithm of the predicted power spectrum value is obtained as the prediction result.
In step 410, a reconstructed power spectrum value of the first audio signal is obtained according to the prediction result.
In the embodiment of the application, the reconstructed power spectrum value corresponding to each sub-band frequency domain signal is obtained according to the prediction result corresponding to each sub-band frequency domain signal output by the spectrum compensation model.
Wherein the reconstructed power spectrum value may be a logarithmic value.
In a possible implementation manner, the reconstructed power spectrum value of the first audio signal may be directly obtained according to the prediction result.
That is, the predicted power spectrum value corresponding to the at least one sub-band frequency domain signal is taken directly as the reconstructed power spectrum value.
For example, if the prediction result output by the spectrum compensation model is the log power spectrum value Ŝ_dB(i,k), then Ŝ_dB(i,k) can be taken directly as the reconstructed power spectrum value.
In another possible implementation manner, the reconstructed power spectrum value is determined according to the prediction result together with the power spectrum value corresponding to each input sub-band frequency domain signal.
The sum of the power spectrum value corresponding to the first audio signal and the frequency band damage rate may be used as the reconstructed power spectrum value.
The frequency band damage rate is a historical smooth value of the difference between a predicted power spectrum value corresponding to each of at least one sub-band frequency domain signal and a power spectrum value corresponding to the first audio signal.
For example, if the prediction result output by the spectrum compensation model is the log power spectrum value Ŝ_dB(i,k), the difference between the actual log power spectrum value and the predicted log power spectrum value at each frequency point is:

d(i,k) = Ŝ_dB(i,k) - S_dB(i,k)

Then, the historical smooth value of the difference between the actual and predicted log power spectrum values is calculated from the per-frequency-point differences and determined as the frequency band damage rate. The historical smooth value can be calculated by:

D(i,k) = α*D(i-1,k) + (1-α)*d(i,k)

where α is a parameter with value range (0, 1).
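A sketch of this running estimate, keeping D(i, k) across frames; the value of α is an assumption:

```python
import numpy as np

ALPHA = 0.9          # smoothing parameter in (0, 1); exact value is an assumption
damage_rate = None   # D(i, k), carried over from frame to frame

def update_damage_rate(pred_log: np.ndarray, actual_log: np.ndarray) -> np.ndarray:
    """Update the historical smooth value of d(i,k) = predicted - actual log power."""
    global damage_rate
    d = pred_log - actual_log
    damage_rate = d if damage_rate is None else ALPHA * damage_rate + (1 - ALPHA) * d
    return damage_rate
```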
In a possible implementation manner, if the sub-band frequency domain signals were divided by a nonlinear frequency domain transform, the difference between the predicted and the actual log power spectrum value of a sub-band is used as the gain of its frequency points, and the reconstructed frequency point amplitudes are determined from it.
For example, if X_b[0] yields the corresponding prediction result X_b′[0] through the neural network model, then gain[0] = X_b′[0] - X_b[0] is the gain of the 0th Bark-domain sub-band, and the reconstructed frequency points are X′[0] = X[0] + gain[0]; X′[1] = X[1] + gain[0]; X′[2] = X[2] + gain[0].
In step 411, a target audio signal corresponding to the reconstructed power spectrum value of the first audio signal is generated.
In the embodiment of the application, the logarithmic reconstructed power spectrum value is converted into a linear value, the frequency point amplitude corresponding to that power spectrum value is determined, and the corresponding time domain signal is generated from the frequency point amplitudes as the target audio signal.
In a possible implementation manner, when the reconstructed power spectrum value of the first audio signal is obtained directly from the prediction result, the predicted power spectrum value corresponding to each sub-band frequency domain signal, serving as the reconstructed power spectrum value, is converted into a linear value, and the square root of the linear value is taken to determine the amplitude of each frequency point.
For example, Ŝ_dB(i,k) can be taken directly as the reconstructed power spectrum value S′_dB(i,k).
In another possible implementation manner, when the reconstructed power spectrum value is determined from the prediction result together with the power spectrum value corresponding to each input sub-band frequency domain signal, the reconstructed power spectrum value is obtained by adding the frequency band damage rate to the power spectrum value corresponding to each sub-band frequency domain signal. The reconstructed power spectrum value is then converted into a linear value, and the square root of the linear value is taken to determine the amplitude of each frequency point.
The formula for calculating the reconstructed power spectrum value may be:

S′_dB(i,k) = S_dB(i,k) + D(i,k)

The calculation formula for converting the log power spectrum value into a linear value can be:

S′(i,k) = power(10, 0.05*S′_dB(i,k))

The amplitude of each frequency point in the discrete Fourier transform domain can then be calculated as:

|X′(i,k)| = sqrt(S′(i,k))
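A sketch of the reconstruction chain above; reusing the phase of the input spectrum is an assumption, since the patent text only describes the magnitude/power path:

```python
import numpy as np

def reconstruct_spectrum(actual_log: np.ndarray, damage_rate: np.ndarray,
                         phase: np.ndarray) -> np.ndarray:
    """Rebuild a complex spectrum from the compensated log power spectrum."""
    recon_log = actual_log + damage_rate   # S'_dB(i,k) = S_dB(i,k) + D(i,k)
    recon_lin = 10 ** (0.05 * recon_log)   # back to the linear scale
    magnitude = np.sqrt(recon_lin)         # power -> amplitude
    return magnitude * np.exp(1j * phase)  # assumed: keep the original phase
```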
In a possible implementation manner, the frequency domain signal corresponding to the reconstructed power spectrum value is subjected to time domain transformation to obtain a target audio signal.
The conversion from the frequency domain signal to the time domain signal may be by an inverse discrete Fourier transform or an inverse modified discrete cosine transform.
The calculation formula of the inverse discrete Fourier transform is:

x_idft(n) = (1/N) * sum_{k=0}^{N-1} X′(i,k) * e^(j*2*pi*n*k/N)

Then, the time domain signal obtained through the inverse discrete Fourier transform is windowed:

x_out(n) = x_idft(n)*win(n)

For example, x_out may be 20 ms of data, of which the current frame occupies 10 ms. The output of the current frame is obtained by adding the second-half 10 ms of the previous frame's frequency-domain-to-time-domain conversion result to the first-half 10 ms of the current frame's result, and this overlap-add result is taken as the target audio signal output for the current frame.
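A sketch of the inverse transform, synthesis windowing and overlap-add described above, consistent with the 20 ms window / 10 ms frame layout (the 160-sample half-frame assumes a 16 kHz sample rate):

```python
import numpy as np

prev_tail = np.zeros(160)  # second half of the previous frame's windowed output

def synthesize_frame(x_recon: np.ndarray, win: np.ndarray) -> np.ndarray:
    """Inverse-transform, window, and overlap-add one reconstructed frame."""
    global prev_tail
    x_idft = np.fft.ifft(x_recon).real    # inverse discrete Fourier transform
    x_out = x_idft * win                  # x_out(n) = x_idft(n) * win(n)
    half = len(x_out) // 2
    frame_out = prev_tail + x_out[:half]  # overlap-add with the previous tail
    prev_tail = x_out[half:]
    return frame_out                      # 10 ms of repaired target audio
```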
Through steps 404 to 411, the prediction apparatus may perform compensation reconstruction on the first audio signal through the spectrum compensation model to obtain a target audio signal.
In conclusion, the neural network model obtained through model training can perform compensation prediction uniformly across different device models and software versions, which solves the problem of limited application scenarios for voice signal repair and improves the universality of voice signal repair.
Reference is now made to fig. 5, which is an architectural diagram illustrating an audio signal processing method that may be performed by an audio processing device, according to an exemplary embodiment. The audio processing device may be the model training device 110 and the prediction device 120 in the system shown in fig. 1. As shown in fig. 5, the audio signal processing method may include two stages, a model training stage and a model application stage.
The model training stage may be performed offline by the model training device and may comprise the following step 51:
in step 51, the spectrum damage audio is used as an input end, and the corresponding original audio is used as an output target, the deep neural network model is trained offline, and the trained neural network model is updated.
The spectrally damaged audio and the corresponding original audio can be sample data stored in a sample set in advance. The model training device takes the spectrally damaged audio as the input and the corresponding original audio as the output target to train and update the spectrum compensation model, which can be a recurrent neural network (RNN) model or a long short-term memory (LSTM) network model.
The model application phase may be performed online by the predictive device, and may include the following steps 52 through 59:
in step 52, the prediction device captures an audio signal of the outside world through an audio capture function.
The module in charge of audio acquisition in the audio processing device may acquire external audio, for example through the microphone module, to obtain the audio signal.
In step 53, the prediction device performs speech enhancement processing on the acquired audio signal with its inherent software and hardware to generate an enhanced audio signal.
The inherent software and hardware speech enhancement may apply different speech enhancement algorithms to the audio signal, and the generated enhanced audio signal may carry signal impairments of different degrees.
In step 54, the audio signal after the speech enhancement is a time domain signal, and the time domain signal is transformed into a frequency domain signal through an operation.
The operation mode of transforming the time domain signal into the frequency domain signal may be discrete fourier transform or discrete cosine transform.
In step 55, the frequency domain signal is input to a neural network model for prediction.
Since the samples at the input end and the target output end during offline training of the spectrum compensation model are logarithmic power spectrum values, the logarithmic power spectrum value corresponding to the frequency domain signal needs to be calculated and input into the spectrum compensation model for prediction.
In step 56, the prediction result output by the neural network model is the predicted audio signal, i.e., the predicted repair of the distorted, speech-enhanced signal. Sub-band damage rate analysis is performed between the obtained predicted audio signal and the previously input frequency domain signal to obtain the sub-band spectrum damage rate of the current signal.
The prediction result output by the spectrum compensation model is the log power spectrum value corresponding to the predicted audio signal. Sub-band damage rate analysis can be performed between the log power spectrum value of the input frequency domain signal and that of the predicted audio signal; the sub-band damage rate is the historical smooth value of the difference between the two.
This step may be omitted, in which case the following steps proceed directly from the obtained predicted audio signal.
In step 57, the damaged frequency band is reconstructed to generate a reconstructed frequency domain signal.
Here, each input sub-band frequency domain signal is treated as a damaged frequency band, and the logarithmic power spectrum value of the reconstructed frequency domain signal is computed from the sub-band damage rate; alternatively, when step 56 is omitted, the logarithmic power spectrum value of the directly obtained predicted audio signal is used as the logarithmic power spectrum value of the reconstructed frequency domain signal.
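A sketch of steps 56 and 57 under the assumption that all quantities are per-sub-band log power vectors; the smoothing factor alpha is an illustrative choice, as the embodiment does not fix the smoothing constant:

import numpy as np

def update_damage_rate(prev_rate, predicted_logp, input_logp, alpha=0.9):
    # Historical smoothing of the difference between the predicted and the
    # input log power spectra: the sub-band spectrum signal damage rate.
    return alpha * prev_rate + (1.0 - alpha) * (predicted_logp - input_logp)

def reconstruct_logp(input_logp, damage_rate, predicted_logp, use_rate=True):
    # Reconstructed log power spectrum: the input value plus the damage rate,
    # or, when the analysis of step 56 is omitted, the prediction itself.
    return input_logp + damage_rate if use_rate else predicted_logp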
In step 58, the reconstructed frequency domain signal is converted into a time domain signal.
If the time-to-frequency transform in step 54 was a discrete Fourier transform, an inverse discrete Fourier transform is used to convert the reconstructed frequency domain signal back to the time domain; if the transform in step 54 was a discrete cosine transform, an inverse discrete cosine transform is used instead.
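Continuing the numpy/scipy sketch above, the inverse transform mirrors the forward one:

import numpy as np
from scipy.fft import idct

def to_time_domain(spectrum, use_dct=False):
    # Invert the transform chosen in step 54 back to a time-domain frame.
    if use_dct:
        return idct(spectrum, norm="ortho")  # inverse DCT path
    return np.fft.irfft(spectrum)            # inverse DFT path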
In step 59, the time domain signal is output as the repaired target audio.
The time domain signal converted from the reconstructed frequency domain signal is output as the target audio signal and may be played through an audio playing module of the prediction device, for example through a speaker module.
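Chaining the sketches above gives an end-to-end, per-frame picture of steps 54 through 59. Here model_predict is assumed to wrap the trained spectrum compensation model, and reusing the phase of the input spectrum when rebuilding the waveform is an assumption of this sketch, not something the embodiment prescribes:

import numpy as np

def repair_frame(frame, model_predict, damage_rate):
    spectrum = to_frequency_domain(frame)                      # step 54
    input_logp = log_power_spectrum(spectrum)                  # step 55
    predicted_logp = model_predict(input_logp)                 # step 56
    damage_rate = update_damage_rate(damage_rate,
                                     predicted_logp, input_logp)
    recon_logp = reconstruct_logp(input_logp, damage_rate,
                                  predicted_logp)              # step 57
    gain = np.sqrt(10.0 ** recon_logp /
                   (np.abs(spectrum) ** 2 + 1e-12))            # power -> gain
    return to_time_domain(spectrum * gain), damage_rate        # steps 58-59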
In conclusion, the neural network model obtained through model training can perform compensation prediction uniformly across different device models and software versions, which removes the limitation on the application scenarios of speech signal restoration and thereby improves the universality of speech signal restoration.
Please refer to fig. 6, which is a schematic diagram illustrating voice restoration applied in a voice call system according to an exemplary embodiment; the voice call system may include a voice sending end 620, a voice receiving end 640, and a server end 630 performing data transmission. As shown in fig. 6, voice restoration in the voice call system may proceed as follows:
in the course of a call, the voice sending end 620 collects a voice signal from the outside and performs its inherent software and hardware voice enhancement processing on the collected signal. The voice signal, which carries a certain amount of damage after the voice enhancement processing, is then repaired through the spectrum compensation model of the above embodiment, and the repaired enhanced voice signal is sent through the server end 630 to the voice receiving end 640, which plays it through its own voice playing module.
The spectrum compensation model at the voice sending end 620 is obtained through offline model training by the model training device 610, and the voice sending end 620 may install and update the trained spectrum compensation model by downloading it within an application program.
Please refer to fig. 7, which is a schematic diagram illustrating another voice restoration scheme applied in a voice call system according to an exemplary embodiment; the voice call system may include a voice sending end 720, a voice receiving end 740, and a server end 730 performing data transmission. As shown in fig. 7, voice restoration in the voice call system may proceed as follows:
in the course of a call, the voice sending end 720 collects a voice signal from the outside, performs its inherent software and hardware voice enhancement processing on the collected signal, and then sends the voice signal, which carries a certain amount of damage after the voice enhancement processing, to the server end 730. The server end 730 performs voice restoration on it through the spectrum compensation model of the above embodiment and sends the repaired enhanced voice signal to the voice receiving end 740, which plays it through its own voice playing module.
The spectrum compensation model in the server end 730 is obtained through offline model training by the model training device 710, and the server end 730 can perform voice restoration processing on the damaged, speech-enhanced voice signal by obtaining the trained spectrum compensation model.
In conclusion, the neural network model obtained through model training can perform compensation prediction uniformly across different device models and software versions, which removes the limitation on the application scenarios of speech signal restoration and thereby improves the universality of speech signal restoration.
Fig. 8 is a block diagram illustrating the structure of an audio signal processing apparatus according to an exemplary embodiment. The apparatus may be implemented in an audio processing device to perform all or part of the steps of the method shown in the embodiment corresponding to fig. 3 or fig. 4. The audio signal processing apparatus may include:
a signal obtaining module 810, configured to obtain a first audio signal;
a result obtaining module 820, configured to process the first audio signal through a spectrum compensation model, and obtain a prediction result of performing prediction compensation on a distortion spectrum in the first audio signal; the spectrum compensation model is a neural network model obtained by training a spectrum distortion audio sample and an original audio sample corresponding to the spectrum distortion audio sample;
the target obtaining module 830 is configured to reconstruct the first audio signal according to the prediction result to obtain a target audio signal in which the distortion frequency spectrum in the first audio signal has been repaired.
In one possible implementation, the result obtaining module 820 includes:
a frequency domain conversion sub-module for converting the first audio signal into a corresponding frequency domain signal;
a sub-band division sub-module for dividing the frequency domain signal into at least one sub-band frequency domain signal;
the amplitude determining submodule is used for determining the amplitude of each frequency point in the at least one sub-band frequency domain signal;
the power spectrum value determining submodule is used for determining the power spectrum value of the at least one sub-band frequency domain signal according to the amplitude value of each frequency point;
and the sub-band result acquisition sub-module is used for inputting the power spectrum value of each sub-band frequency domain signal into the spectrum compensation model and acquiring the sub-band prediction result of each sub-band frequency domain signal.
In one possible implementation, the frequency domain converting sub-module includes:
a windowing processing unit, configured to perform frame-wise windowing on the first audio signal, and determine a processed time-domain signal;
and the signal acquisition unit is used for carrying out frequency domain conversion on the processed time domain signal to obtain the corresponding frequency domain signal.
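A minimal sketch of the windowing unit, assuming 16 kHz audio, 20 ms frames with 50% overlap, and a Hann window; these are typical choices rather than values fixed by this implementation:

import numpy as np

def frame_and_window(signal, frame_len=320, hop=160):
    # Split the first audio signal into overlapping Hann-windowed frames
    # (assumes len(signal) >= frame_len).
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len] * window
                     for i in range(n_frames)])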
In one possible implementation, the signal obtaining unit is configured to,
performing discrete Fourier transform on the processed time domain signal to obtain a corresponding frequency domain signal;
alternatively,
and performing a modified discrete cosine transform (MDCT) on the processed time domain signal to obtain the corresponding frequency domain signal.
In one possible implementation, the sub-band division sub-module includes:
and the sub-band division unit is used for dividing the frequency domain signal into the at least one sub-band frequency domain signal by taking a Bark domain as a scale.
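As an illustrative sketch of Bark-scale division, assuming 16 kHz sampling, DFT bins from np.fft.rfft, and the Traunmüller (1990) approximation of the Bark scale (the embodiment does not prescribe a particular Bark formula):

import numpy as np

def bark(f_hz):
    # Frequency in Hz -> critical-band rate in Bark (Traunmüller, 1990).
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def split_into_bark_bands(spectrum, sr=16000, n_fft=320):
    # Group the DFT bins of one frame so each group spans about one Bark.
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    band_idx = np.floor(bark(freqs)).astype(int)
    return [spectrum[band_idx == b] for b in np.unique(band_idx)]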
In a possible implementation manner, the sub-band result obtaining sub-module includes:
and the sub-band result acquisition unit is used for inputting the power spectrum value corresponding to each sub-band frequency domain signal into the spectrum compensation model to obtain a predicted power spectrum value corresponding to each sub-band frequency domain signal as the sub-band prediction result of that sub-band frequency domain signal.
In a possible implementation manner, the target obtaining module 830 includes:
a power spectrum value obtaining submodule, configured to obtain a power spectrum value obtained after the first audio signal is reconstructed, according to the prediction result;
and the target generation submodule is used for generating the target audio signal corresponding to the power spectrum value after the first audio signal is reconstructed.
In a possible implementation manner, the target obtaining module 830 includes:
the power spectrum value generation submodule is used for taking the sum of the power spectrum value corresponding to the first audio signal and the frequency band damage rate as a reconstructed power spectrum value; the frequency band damage rate is a historical smooth value of the difference between a predicted power spectrum value corresponding to each of the at least one sub-band frequency domain signal and a power spectrum value corresponding to the first audio signal;
alternatively,
and the power spectrum value determining submodule is used for taking the predicted power spectrum value corresponding to each sub-band frequency domain signal as the reconstructed power spectrum value.
In one possible implementation, the target generation sub-module includes:
and the time domain conversion unit is used for carrying out time domain transformation on the frequency domain signal corresponding to the reconstructed power spectrum value to obtain the target audio signal.
In one possible implementation, the signal obtaining module 810 includes:
and the signal acquisition submodule is used for acquiring the first audio signal after audio enhancement processing.
In one possible implementation, the apparatus further includes:
a sample obtaining module, configured to obtain the original audio sample before processing the first audio signal through a spectrum compensation model and obtaining a prediction result of performing prediction compensation on a distortion spectrum in the first audio signal;
a distorted sample obtaining module, configured to perform suppression processing on a power spectrum value of a partial frequency band in a frequency domain signal corresponding to the original audio sample, to obtain a frequency spectrum distorted audio sample corresponding to the original audio sample;
and the model acquisition module is used for performing machine learning training by taking the frequency spectrum distortion audio sample as input and taking the original audio sample as a training target to obtain the frequency spectrum compensation model.
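A sketch of how spectrum distortion audio samples might be synthesized from original samples, assuming damage is simulated by attenuating the power of randomly chosen sub-bands; the damage probability and the suppression depth are illustrative values:

import numpy as np

def suppress_bands(spectrum, band_slices, rng, p=0.3):
    # Suppress the power spectrum of some frequency bands of an original
    # audio frame, yielding a spectrum-distortion training sample.
    damaged = spectrum.copy()
    for sl in band_slices:
        if rng.random() < p:                         # damage a fraction of bands
            gain_db = -rng.uniform(10.0, 30.0)       # 10-30 dB suppression
            damaged[sl] *= 10.0 ** (gain_db / 20.0)  # apply as amplitude gain
    return damaged

rng = np.random.default_rng(0)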
In one possible implementation, the spectral compensation model is a recurrent neural network model RNN or a long-short term memory network model LSTM.
In conclusion, the neural network model obtained through model training can perform compensation prediction uniformly across different device models and software versions, which removes the limitation on the application scenarios of speech signal restoration and thereby improves the universality of speech signal restoration.
FIG. 9 is a schematic diagram illustrating a configuration of a computer device, according to an example embodiment. The computer device may be implemented as an audio processing device. The audio processing device may comprise the model training device 110 and the prediction device 120 shown in fig. 1. The computer device 900 includes a Central Processing Unit (CPU) 901, a system Memory 904 including a Random Access Memory (RAM) 902 and a Read-Only Memory (ROM) 903, and a system bus 905 connecting the system Memory 904 and the CPU 901. The computer device 900 also includes a basic Input/Output system (I/O system) 906 for facilitating information transfer between devices within the computer, and a mass storage device 907 for storing an operating system 913, application programs 914, and other program modules 915.
The mass storage device 907 is connected to the central processing unit 901 through a mass storage controller (not shown) connected to the system bus 905. The mass storage device 907 and its associated computer-readable media provide non-volatile storage for the computer device 900. That is, the mass storage device 907 may include a computer-readable medium (not shown) such as a hard disk or Compact Disc-Only Memory (CD-ROM) drive.
Without loss of generality, the computer-readable media may comprise computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media include RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), CD-ROM, Digital Versatile Disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media are not limited to the foregoing. The system memory 904 and the mass storage device 907 described above may be collectively referred to as memory.
The computer device 900 may be connected to the internet or other network device through a network interface unit 911 connected to the system bus 905.
The memory further includes one or more programs, the one or more programs are stored in the memory, and the central processor 901 implements all or part of the steps of the method shown in fig. 3 or fig. 4 by executing the one or more programs.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as a memory comprising computer programs (instructions), executable by a processor of a computer device to perform all or part of the steps of the methods shown in the various embodiments of the present application, is also provided. For example, the non-transitory computer readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the audio signal processing method provided in the various alternative implementations of the above aspect.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (15)

1. A method of audio signal processing, the method comprising:
acquiring a first audio signal;
processing the first audio signal through a spectrum compensation model to obtain a prediction result of performing prediction compensation on a distortion spectrum in the first audio signal; the spectrum compensation model is a neural network model obtained by training a spectrum distortion audio sample and an original audio sample corresponding to the spectrum distortion audio sample;
and reconstructing the first audio signal according to the prediction result to obtain a target audio signal obtained by repairing the distortion frequency spectrum in the first audio signal.
2. The method of claim 1, wherein the processing the first audio signal through a spectral compensation model to obtain a prediction result for performing predictive compensation on a distortion spectrum in the first audio signal comprises:
converting the first audio signal into a corresponding frequency domain signal;
dividing the frequency domain signal into at least one sub-band frequency domain signal;
determining the amplitude of each frequency point in the at least one sub-band frequency domain signal;
determining a power spectrum value of the at least one sub-band frequency domain signal according to the amplitude value of each frequency point;
and inputting the power spectrum value of each sub-band frequency domain signal into the spectrum compensation model to obtain the sub-band prediction result of each sub-band frequency domain signal.
3. The method of claim 2, wherein said converting the first audio signal into a corresponding frequency domain signal comprises:
performing frame windowing on the first audio signal, and determining a processed time domain signal;
and performing frequency domain conversion on the processed time domain signal to obtain the corresponding frequency domain signal.
4. The method of claim 3, wherein the frequency-domain converting the processed time-domain signal to obtain the corresponding frequency-domain signal comprises:
performing discrete Fourier transform on the processed time domain signal to obtain a corresponding frequency domain signal;
alternatively,
and performing a modified discrete cosine transform (MDCT) on the processed time domain signal to obtain the corresponding frequency domain signal.
5. The method of claim 2, wherein the dividing the frequency domain signal into at least one sub-band frequency domain signal comprises:
and dividing the frequency domain signal into the at least one sub-band frequency domain signal by taking a Bark domain as a scale.
6. The method according to claim 2, wherein the inputting the power spectrum value of each sub-band frequency domain signal into the spectrum compensation model to obtain the sub-band prediction result of each sub-band frequency domain signal comprises:
inputting the power spectrum value corresponding to each sub-band frequency domain signal into the spectrum compensation model to obtain a predicted power spectrum value corresponding to each sub-band frequency domain signal as the sub-band prediction result of that sub-band frequency domain signal.
7. The method according to claim 6, wherein the reconstructing the first audio signal according to the prediction result to obtain a target audio signal obtained by repairing the distortion frequency spectrum in the first audio signal comprises:
acquiring a power spectrum value after the first audio signal is reconstructed according to the prediction result;
and generating the target audio signal corresponding to the reconstructed power spectrum value of the first audio signal.
8. The method according to claim 7, wherein said obtaining the reconstructed power spectrum value of the first audio signal according to the prediction result comprises:
taking the sum of the power spectrum value corresponding to the first audio signal and the frequency band damage rate as a reconstructed power spectrum value; the frequency band damage rate is a historical smooth value of the difference between a predicted power spectrum value corresponding to each of the at least one sub-band frequency domain signal and a power spectrum value corresponding to the first audio signal;
alternatively,
and taking the predicted power spectrum value corresponding to each sub-band frequency domain signal as the reconstructed power spectrum value.
9. The method according to claim 7, wherein said generating the target audio signal corresponding to the reconstructed power spectrum value of the first audio signal comprises:
and performing time domain transformation on the frequency domain signal corresponding to the reconstructed power spectrum value to obtain the target audio signal.
10. The method of claim 1, wherein the obtaining the first audio signal comprises:
and acquiring the first audio signal after audio enhancement processing.
11. The method of claim 1, wherein before the processing the first audio signal by the spectral compensation model to obtain the prediction result for predictively compensating the distortion spectrum in the first audio signal, further comprising:
obtaining the original audio sample;
performing suppression processing on a power spectrum value on a partial frequency band in a frequency domain signal corresponding to the original audio sample to obtain a frequency spectrum distortion audio sample corresponding to the original audio sample;
and performing machine learning training by taking the frequency spectrum distortion audio sample as input and the original audio sample as a training target to obtain the frequency spectrum compensation model.
12. The method according to any of the claims 1 to 11, wherein the spectral compensation model is a recurrent neural network model RNN or a long short term memory network model LSTM.
13. An audio signal processing apparatus, characterized in that the apparatus comprises:
the signal acquisition module is used for acquiring a first audio signal;
the result obtaining module is used for processing the first audio signal through a spectrum compensation model to obtain a prediction result of prediction compensation on a distortion spectrum in the first audio signal; the spectrum compensation model is a neural network model obtained by training a spectrum distortion audio sample and an original audio sample corresponding to the spectrum distortion audio sample;
and the target acquisition module is used for reconstructing the first audio signal according to the prediction result to obtain a target audio signal after the distortion frequency spectrum in the first audio signal is repaired.
14. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the audio signal processing method of any of claims 1 to 12.
15. A computer-readable storage medium, having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the audio signal processing method according to any one of claims 1 to 12.
GR01 Patent grant