CN114242110A - Model training method, audio processing method, device, equipment, medium and product - Google Patents

Model training method, audio processing method, device, equipment, medium and product Download PDF

Info

Publication number
CN114242110A
CN114242110A (application CN202111549625.XA)
Authority
CN
China
Prior art keywords
audio
environment
reverberation
recording
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111549625.XA
Other languages
Chinese (zh)
Inventor
任新蕾
郑羲光
张晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202111549625.XA priority Critical patent/CN114242110A/en
Publication of CN114242110A publication Critical patent/CN114242110A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 - Changing voice quality, e.g. pitch or formants
    • G10L21/007 - Changing voice quality, e.g. pitch or formants characterised by the process used
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00 - Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/08 - Feature extraction

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Management Or Editing Of Information On Record Carriers (AREA)

Abstract

The present disclosure relates to a training method for an audio processing model, an audio processing method, and a corresponding apparatus, device, medium, and product. The training method for the audio processing model includes: acquiring an audio training sample, wherein the audio training sample comprises a first audio, a second audio, and a third audio; obtaining a reverberation feature of a first environment based on the first audio by using a pre-trained feature extraction model; obtaining an estimated third audio by using the audio processing model based on the second audio and the reverberation feature of the first environment; calculating a value of a first loss function based on the estimated third audio and the third audio; and adjusting parameters of the audio processing model according to the calculated value of the first loss function to train the audio processing model. The training method and the audio processing method can largely eliminate the audible mismatch that arises when audio clips recorded in different environments are combined.

Description

Model training method, audio processing method, device, equipment, medium and product
Technical Field
The present disclosure relates to the field of audio processing and, more particularly, to a training method for an audio processing model, an audio processing method, and a corresponding apparatus, device, medium, and product.
Background
When editing audio and video, users frequently need to combine audio recorded in different environments. For example, while editing a clip it often turns out that part of the content must be re-recorded; for various reasons the user usually cannot return to the original recording environment, so the supplementary material has to be recorded somewhere else. Because the two recording environments differ, the newly recorded content sounds noticeably out of place next to the original audio, and mitigating this mismatch is the problem addressed here.
Disclosure of Invention
The present disclosure provides a training method of an audio processing model, an audio processing method, an apparatus, a device, a medium, and a product, to at least solve the above-mentioned problems in the related art.
According to a first aspect of the embodiments of the present disclosure, there is provided a training method for an audio processing model, including: obtaining an audio training sample, wherein the audio training sample comprises a first audio, a second audio, and a third audio, the first audio being a reverberant audio obtained based on a first pure speech and a first environment, the second audio being a reverberant audio obtained based on a second pure speech and a second environment, and the third audio being a reverberant audio obtained based on the second pure speech and the first environment; obtaining a reverberation feature of the first environment based on the first audio by using a pre-trained feature extraction model; obtaining an estimated third audio by using the audio processing model based on the second audio and the reverberation feature of the first environment; calculating a value of a first loss function based on the estimated third audio and the third audio; and adjusting parameters of the audio processing model according to the calculated value of the first loss function to train the audio processing model.
Optionally, the first audio, the second audio, and the third audio are obtained by: convolving the first pure speech with the reverberation feature of the first environment to obtain the first audio; convolving the second pure speech with the reverberation feature of the second environment to obtain the second audio; and convolving the second pure speech with the reverberation feature of the first environment to obtain the third audio.
Optionally, the reverberation feature of the first environment is a first room impulse response, the reverberation feature of the second environment is a second room impulse response, and the first room impulse response and the second room impulse response are generated by the mirror sound source method (also known as the image source method).
Optionally, the feature extraction model is implemented by an auto-encoder comprising an encoder and a decoder, and obtaining the reverberation feature of the first environment based on the first audio by using the pre-trained feature extraction model comprises: obtaining, with the encoder, the reverberation feature of the first environment based on the first audio.
Optionally, the feature extraction model is pre-trained by: obtaining, with the encoder, the reverberation feature of the first environment based on the first audio; obtaining, with the decoder, an estimated first audio based on the reverberation feature of the first environment and the first pure speech; calculating a value of a second loss function from the estimated first audio and the first audio; and adjusting parameters of the encoder and the decoder according to the calculated value of the second loss function to train the feature extraction model.
According to a second aspect of the embodiments of the present disclosure, there is provided an audio processing method, including: acquiring a first audio obtained by recording a first pure speech in a first recording environment and a second audio obtained by recording a second pure speech in a second recording environment; obtaining a reverberation feature of the first recording environment based on the first audio by using a pre-trained feature extraction model; and obtaining, by using a pre-trained audio processing model, an estimated third audio corresponding to recording the second pure speech in the first recording environment, based on the second audio and the reverberation feature of the first recording environment.
Optionally, the feature extraction model is implemented by an auto-encoder comprising an encoder and a decoder, and obtaining the reverberation feature of the first recording environment based on the first audio by using the pre-trained feature extraction model comprises: obtaining, with the encoder, the reverberation feature of the first recording environment based on the first audio.
According to a third aspect of the embodiments of the present disclosure, there is provided an apparatus for training an audio processing model, including: a sample acquisition unit configured to obtain an audio training sample, wherein the audio training sample comprises a first audio, a second audio, and a third audio, the first audio being a reverberant audio obtained based on a first pure speech and a first environment, the second audio being a reverberant audio obtained based on a second pure speech and a second environment, and the third audio being a reverberant audio obtained based on the second pure speech and the first environment; a feature extraction unit configured to obtain a reverberation feature of the first environment based on the first audio by using a pre-trained feature extraction model; an audio estimation unit configured to obtain an estimated third audio by using the audio processing model based on the second audio and the reverberation feature of the first environment; a loss function calculation unit configured to calculate a value of a first loss function based on the estimated third audio and the third audio; and a training unit configured to adjust parameters of the audio processing model according to the calculated value of the first loss function to train the audio processing model.
Optionally, the first audio, the second audio, and the third audio are obtained by: convolving the first pure speech with the reverberation feature of the first environment to obtain the first audio; convolving the second pure speech with the reverberation feature of the second environment to obtain the second audio; and convolving the second pure speech with the reverberation feature of the first environment to obtain the third audio.
Optionally, the reverberation feature of the first environment is a first room impulse response, the reverberation feature of the second environment is a second room impulse response, and the first room impulse response and the second room impulse response are generated by the mirror sound source method.
Optionally, the feature extraction model is implemented by an auto-encoder comprising an encoder and a decoder;
the feature extraction unit is configured to obtain, with the encoder, the reverberation feature of the first environment based on the first audio.
Optionally, the feature extraction model is pre-trained by: obtaining, with the encoder, the reverberation feature of the first environment based on the first audio; obtaining, with the decoder, an estimated first audio based on the reverberation feature of the first environment and the first pure speech; calculating a value of a second loss function from the estimated first audio and the first audio; and adjusting parameters of the encoder and the decoder according to the calculated value of the second loss function to train the feature extraction model.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an audio processing apparatus, including: an audio acquisition unit configured to acquire a first audio obtained by recording a first pure speech in a first recording environment and a second audio obtained by recording a second pure speech in a second recording environment; a feature acquisition unit configured to obtain a reverberation feature of the first recording environment based on the first audio by using a pre-trained feature extraction model; and a third audio acquisition unit configured to obtain, by using a pre-trained audio processing model, an estimated third audio corresponding to recording the second pure speech in the first recording environment, based on the second audio and the reverberation feature of the first recording environment.
Optionally, the feature extraction model is implemented by an auto-encoder comprising an encoder and a decoder, and the feature acquisition unit is configured to obtain, with the encoder, the reverberation feature of the first recording environment based on the first audio.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic apparatus including: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform a training method or an audio processing method of an audio processing model according to the present disclosure.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform a method of training an audio processing model or a method of audio processing according to the present disclosure.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product, instructions in which are executable by a processor of a computer device to perform a method of training an audio processing model or a method of audio processing according to the present disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
according to the training method of the audio processing model, the audio processing method, the apparatus, the device, the medium, and the product provided by the present disclosure, a pre-trained feature extraction model extracts a first environmental reverberation feature that characterizes the reverberation of the recording environment of the first audio, and the audio processing model, given this feature and the second audio, converts the recording environment of the second audio into that of the first audio. This largely eliminates the audible mismatch that arises when audio recorded in different environments is combined, so that in re-recording scenarios a user who finds that part of the content needs to be recorded again does not have to return to the original recording environment, which greatly reduces the cost of editing audio and video.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a schematic diagram illustrating a training process and an inference process of an audio processing model according to an exemplary embodiment of the present disclosure.
Fig. 2 is a flowchart illustrating a training method of an audio processing model according to an exemplary embodiment of the present disclosure.
Fig. 3 is an overall schematic diagram illustrating a structure and training process of an auto-encoder according to an exemplary embodiment of the present disclosure.
Fig. 4 is a schematic diagram illustrating a U-Net structure according to an exemplary embodiment of the present disclosure.
Fig. 5 is a flowchart illustrating an audio processing method according to an exemplary embodiment of the present disclosure.
Fig. 6 is a block diagram illustrating a training apparatus of an audio processing model according to an exemplary embodiment of the present disclosure.
Fig. 7 is a block diagram illustrating an audio processing apparatus according to an exemplary embodiment of the present disclosure.
Fig. 8 is a block diagram illustrating an electronic device 800 according to an example embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The embodiments described in the following examples do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In the present disclosure, the expression "at least one of the items" covers three parallel cases: any one of the items, any combination of several of the items, and all of the items. For example, "including at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; (3) including A and B. Likewise, "performing at least one of step one and step two" covers the following three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
When editing audio and video, users often find that part of the content needs to be re-recorded, yet for various reasons they cannot return to the original recording environment, so the supplementary material can only be recorded somewhere else; the difference between the two recording environments then makes the re-recorded content sound clearly out of place next to the original audio. To reduce this mismatch and lower the cost of editing audio and video, the present disclosure provides a training method for an audio processing model, an audio processing method, an apparatus, a device, a medium, and a product. Specifically, a pre-trained feature extraction model extracts a first environmental reverberation feature that characterizes the reverberation of the recording environment of the first audio, and the audio processing model, given this feature and the second audio, converts the recording environment of the second audio into that of the first audio. This largely eliminates the audible mismatch caused by combining audio recorded in different environments, so that when a user finds that part of the content needs to be re-recorded, there is no need to return to the original recording environment, which greatly reduces the cost of editing audio and video. Hereinafter, a training method for an audio processing model, an audio processing method, an apparatus, a device, a medium, and a product according to exemplary embodiments of the present disclosure are described in detail with reference to Figs. 1 to 8.
Fig. 1 is a schematic diagram illustrating a training process and an inference process of an audio processing model according to an exemplary embodiment of the present disclosure. The audio processing model may be implemented by an artificial neural network (e.g., a Deep Neural Network (DNN), a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), etc.).
Referring to Fig. 1(a), the audio processing model requires three training samples: audio a recorded in environment A, audio b recorded in environment B, and audio b' recorded in environment A (used as the training label). These can be obtained by actually recording audio a and audio b in environment A and environment B, respectively, and recording in environment A an audio b' with the same sound source and content as audio b; alternatively, all three training samples can be generated by simulation, and the present disclosure is not limited in this respect. The simulation route is described here as an example. First, the clean (pure) speech corresponding to audio a and audio b is collected, for example by downloading from an open-source, high-quality Text-To-Speech (TTS) database. Then, room impulse responses RIR_A and RIR_B, carrying the reverberation information of environment A and environment B respectively, are generated with the mirror (image) sound source method. The clean speech corresponding to audio a is convolved with RIR_A to obtain audio a as if recorded in environment A, the clean speech corresponding to audio b is convolved with RIR_B to obtain audio b as if recorded in environment B, and the clean speech corresponding to audio b is convolved with RIR_A to obtain audio b' as if recorded in environment A. During training, audio a recorded in environment A is input into a feature extraction model (for example, a pre-trained auto-encoder, or any other neural network such as a Deep Neural Network (DNN), a Convolutional Neural Network (CNN), or a Recurrent Neural Network (RNN)) to obtain the reverberation feature of environment A; this feature, audio b recorded in environment B, and audio b' recorded in environment A are then fed to the audio processing model for training, yielding the required audio processing model. Referring to Fig. 1(b), at inference time audio a recorded in environment A is input into the feature extraction model to obtain the reverberation feature of environment A, and audio b recorded in environment B is combined with this reverberation feature and fed to the trained audio processing model, which outputs the estimated audio b'' as if recorded in environment A.
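For concreteness, the simulation route above can be sketched in a few lines of Python. This is an illustrative sketch only: the use of the pyroomacoustics library (one open-source implementation of the image/mirror source method), the room geometries, the RT60 value, and the variable names clean_a / clean_b (clean-speech arrays assumed to be loaded elsewhere, e.g. from a TTS corpus) are all assumptions rather than details fixed by this disclosure.

```python
import numpy as np
import pyroomacoustics as pra          # assumption: one library implementing the image source method
from scipy.signal import fftconvolve

def simulate_rir(room_dim, src_pos, mic_pos, rt60=0.4, fs=16000):
    """Simulate a room impulse response with the image (mirror) source method."""
    absorption, max_order = pra.inverse_sabine(rt60, room_dim)
    room = pra.ShoeBox(room_dim, fs=fs,
                       materials=pra.Material(absorption), max_order=max_order)
    room.add_source(src_pos)
    room.add_microphone(mic_pos)
    room.compute_rir()
    return np.asarray(room.rir[0][0])

# Illustrative room geometries (metres) for environment A and environment B.
rir_a = simulate_rir([6.0, 5.0, 3.0], [4.0, 3.0, 1.6], [2.0, 2.5, 1.5])
rir_b = simulate_rir([3.0, 4.0, 2.6], [2.0, 2.5, 1.4], [1.0, 1.5, 1.2])

# clean_a / clean_b: 16 kHz mono clean-speech waveforms (assumed loaded elsewhere).
audio_a  = fftconvolve(clean_a, rir_a)[: len(clean_a)]   # audio a "recorded" in environment A
audio_b  = fftconvolve(clean_b, rir_b)[: len(clean_b)]   # audio b "recorded" in environment B
audio_bp = fftconvolve(clean_b, rir_a)[: len(clean_b)]   # audio b', the training label
```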
With the audio processing model trained in this way, the recording environment of audio b can be converted into the recording environment of audio a, so that when audio b is appended to or inserted into audio a the auditory mismatch is greatly reduced. In a scenario that requires additional recording, the user therefore does not need to return to environment A to re-record, which is convenient for the user and lowers the cost of editing audio and video.
Fig. 2 is a flowchart illustrating a training method of an audio processing model according to an exemplary embodiment of the present disclosure. Referring to Fig. 2, in step 201, an audio training sample may be obtained, where the audio training sample includes a first audio, a second audio, and a third audio; the first audio is a reverberant audio obtained based on a first pure speech and a first environment, the second audio is a reverberant audio obtained based on a second pure speech and a second environment, and the third audio is a reverberant audio obtained based on the second pure speech and the first environment. Here, pure speech is speech that contains no noise (e.g., ambient noise or other interfering speech). Reverberation is the mixture of reflections produced when sound is reflected by obstacles in the environment; it differs with the size, shape, and spatial position of the obstacles in different environments, so it reflects the characteristics of the environment in which the speech is produced or propagated. According to an exemplary embodiment of the present disclosure, the audio training samples may be recorded in actual recording environments or generated by simulation, and the present disclosure is not limited in this respect. For example, when generating audio training samples by simulation, the first pure speech corresponding to the first audio and the second pure speech corresponding to the second audio may first be gathered, for example, but not limited to, by downloading from an open-source, high-quality TTS database or from a dedicated clean-speech corpus. Next, the reverberation feature of the first environment and the reverberation feature of the second environment are obtained; the reverberation feature of the first environment may be a first room impulse response, the reverberation feature of the second environment may be a second room impulse response, and the two room impulse responses may, for example, but not limited to, be generated with the mirror (image) sound source method or collected in real environments. Then, first-environment reverberation processing may be applied to the first pure speech to obtain the first audio, second-environment reverberation processing may be applied to the second pure speech to obtain the second audio, and first-environment reverberation processing may be applied to the second pure speech to obtain the third audio, where first-environment reverberation processing means mixing the reverberation feature of the first environment into the pure speech and second-environment reverberation processing means mixing the reverberation feature of the second environment into the pure speech. Specifically, the first pure speech may be convolved with the reverberation feature of the first environment (e.g., the first room impulse response) to obtain the first audio, the second pure speech may be convolved with the reverberation feature of the second environment (e.g., the second room impulse response) to obtain the second audio, and the second pure speech may be convolved with the reverberation feature of the first environment (e.g., the first room impulse response) to obtain the third audio.
In step 202, a reverberation feature of the first environment may be obtained based on the first audio by using a pre-trained feature extraction model. For example, time-domain and/or frequency-domain features of the first audio may be extracted and input to the pre-trained feature extraction model to obtain the reverberation feature of the first environment.
According to an exemplary embodiment of the present disclosure, the feature extraction model is trained in advance. In some embodiments, the feature extraction model may be implemented by an auto-encoder comprising an encoder and a decoder, and the reverberation feature of the first environment may be obtained from the first audio by using the encoder of the auto-encoder. Fig. 3 is an overall schematic diagram illustrating the structure and training process of an auto-encoder according to an exemplary embodiment of the present disclosure. Referring to Fig. 3, training the auto-encoder requires two kinds of training data: the first audio and the first pure speech corresponding to the first audio (obtained, for example, by downloading from an open-source, high-quality TTS database, or by applying preset processing to the first audio recorded in the first environment). The reverberation feature of the first environment is obtained with the encoder based on the first audio (for example, by feeding the first audio into the encoder directly, or after a time-frequency transform such as the short-time Fourier transform (STFT)); that is, the embedding output by the encoder, which carries the reverberation information of the first environment, can be used as the reverberation feature of the first environment. Then, an estimated first audio is obtained with the decoder based on the reverberation feature of the first environment and the first pure speech (for example, but not limited to, by concatenating the reverberation feature of the first environment with the time-domain and/or frequency-domain information of the first pure speech and inputting the result to the decoder). Next, the value of a second loss function is calculated from the estimated first audio and the first audio, and the parameters of the encoder and the decoder are adjusted according to the calculated value to train the feature extraction model. Here, the second loss function may be implemented by, but is not limited to, the Mean Square Error (MSE), the Mean Absolute Error (MAE), and the like. For example, the time-domain or frequency-domain MAE may be used as the second loss function, which may be expressed, for example, but not limited to, as:
loss_{t\_mae} = \frac{1}{T} \sum_{t=1}^{T} \left| s(t) - \hat{s}(t) \right|    (1)

where loss_{t_mae} denotes the time-domain MAE, s denotes the first audio used as the label, and \hat{s} denotes the estimated first audio. Of course, the present disclosure is not limited to this second loss function, and any suitable loss function may be used to train the feature extraction model of the present disclosure.
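As a concrete illustration of the auto-encoder described above, the PyTorch sketch below encodes the magnitude STFT of the reverberant first audio into an utterance-level embedding and decodes it together with the clean-speech spectrogram. The layer sizes, the use of magnitude spectrograms of shape (batch, frames, 257), and the variable names spec_audio_a / spec_clean_a are illustrative assumptions rather than details specified by this disclosure.

```python
import torch
import torch.nn as nn

class ReverbAutoEncoder(nn.Module):
    """Encoder maps a reverberant spectrogram to a reverberation embedding;
    the decoder reconstructs the reverberant spectrogram from that embedding
    plus the clean-speech spectrogram."""
    def __init__(self, n_freq=257, emb_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_freq, 512), nn.ReLU(),
                                     nn.Linear(512, emb_dim))
        self.decoder = nn.Sequential(nn.Linear(emb_dim + n_freq, 512), nn.ReLU(),
                                     nn.Linear(512, n_freq))

    def forward(self, reverb_spec, clean_spec):               # shapes: (B, T, n_freq)
        emb = self.encoder(reverb_spec).mean(dim=1)           # (B, emb_dim): utterance-level reverb feature
        cond = emb.unsqueeze(1).expand(-1, clean_spec.size(1), -1)
        est = self.decoder(torch.cat([cond, clean_spec], dim=-1))  # estimated first-audio spectrogram
        return est, emb

model = ReverbAutoEncoder()
est_spec, reverb_emb = model(spec_audio_a, spec_clean_a)      # magnitude STFTs, assumed precomputed
loss = nn.functional.l1_loss(est_spec, spec_audio_a)          # second loss function: MAE, as in equation (1)
```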
In some embodiments, the feature extraction model may instead be implemented by another neural network (e.g., a Deep Neural Network (DNN), a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), etc.). In that case, the feature extraction model may be trained by extracting features from the first audio with the neural network to obtain an estimated reverberation feature of the first environment, and calculating the value of a loss function between the estimated reverberation feature and the real reverberation feature of the first environment to adjust the corresponding parameters of the network. Of course, the feature extraction model may also adopt other model structures, which are not enumerated here.
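A minimal sketch of this supervised alternative is given below, under the assumption that the "real" reverberation feature is represented by a fixed-length vector derived from the room impulse response (here simply its first 128 samples, an arbitrary illustrative choice); the convolutional extractor, its sizes, and the reuse of rir_a and spec_audio_a from the earlier sketches are likewise assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical supervised variant: regress an RIR-derived descriptor directly
# from the reverberant magnitude spectrogram of shape (B, T, 257).
extractor = nn.Sequential(
    nn.Conv1d(257, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(128, 128))

target = torch.as_tensor(rir_a[:128], dtype=torch.float32)    # "real" reverberation feature (assumed representation)
target = target.expand(spec_audio_a.size(0), -1)
pred = extractor(spec_audio_a.transpose(1, 2))                # (B, 257, T) -> (B, 128)
loss = nn.functional.mse_loss(pred, target)                   # drives the extractor toward the true feature
```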
In step 203, an estimated third audio may be obtained using the audio processing model based on the second audio and the reverberation feature of the first environment.
According to an exemplary embodiment of the present disclosure, the time-domain and/or frequency-domain information of the second audio and the reverberation feature of the first environment may be input to the audio processing model to obtain an estimated time-domain or frequency-domain third audio. Here, the time-domain and/or frequency-domain information of the second audio serves as the input of the audio processing model, while the reverberation feature of the first environment may be injected into the audio processing model at any position chosen by the user. For example, the time-domain and/or frequency-domain information of the second audio may be concatenated with the reverberation feature of the first environment and then input to the audio processing model. As another example, the reverberation feature of the first environment may be fed into an intermediate layer or the output layer of the audio processing model.
According to an exemplary embodiment of the present disclosure, the feature extraction model and the audio processing model may each be assembled from different network layers; for example, each may adopt the U-Net structure shown in Fig. 4. Fig. 4 is a schematic diagram of a U-Net structure according to an exemplary embodiment of the present disclosure, in which the shaded portion denotes the encoder part, mainly composed of convolutional layers, the unshaded portion denotes the decoder part, mainly composed of transposed convolutional layers, and the numbers in Fig. 4 represent the number of channels of each convolutional layer. Of course, the feature extraction model and the audio processing model according to the present disclosure may also adopt any other feasible network structure.
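A minimal two-level variant of such a U-Net, conditioned on the reverberation embedding at the bottleneck, could look as follows. The channel counts, the two-level depth, and the choice to inject the embedding by concatenation at the bottleneck are illustrative assumptions and do not reproduce the exact structure of Fig. 4.

```python
import torch
import torch.nn as nn

class CondUNet(nn.Module):
    """Small U-Net over (B, 1, F, T) spectrograms, conditioned on a
    reverberation embedding of shape (B, emb_dim)."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.fuse = nn.Conv2d(32 + emb_dim, 32, kernel_size=1)   # inject the reverb condition
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU())
        self.dec1 = nn.ConvTranspose2d(16 + 16, 1, 4, stride=2, padding=1)

    def forward(self, spec, emb):
        e1 = self.enc1(spec)                                     # encoder path (convolutions)
        e2 = self.enc2(e1)
        cond = emb[:, :, None, None].expand(-1, -1, e2.size(2), e2.size(3))
        b = self.fuse(torch.cat([e2, cond], dim=1))              # bottleneck with condition
        d2 = self.dec2(b)                                        # decoder path (transposed convolutions)
        return self.dec1(torch.cat([d2, e1], dim=1))             # skip connection from enc1
```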
In step 204, a value of the first loss function may be calculated based on the estimated third audio and the third audio.
According to an exemplary embodiment of the present disclosure, the first loss function may be implemented by, but is not limited to, the MSE, the MAE, and the like. For example, the loss value may be computed as the MSE or MAE between the estimated third audio and the third audio in the time domain, or between the estimated third audio and the third audio in the frequency domain. For instance, the time-domain or frequency-domain MAE may be used as the first loss function, which may be expressed, for example, but not limited to, by equation (1), except that in this case s denotes the third audio used as the label and \hat{s} denotes the estimated third audio. Of course, the present disclosure is not limited to this first loss function, and any suitable loss function may be used to train the audio processing model of the present disclosure.
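For reference, the two MAE variants mentioned above can be written directly on waveforms as below; the STFT parameters used for the frequency-domain version are assumptions.

```python
import torch

def mae_time(est, ref):
    """Time-domain MAE between estimated and reference waveforms (cf. equation (1))."""
    return (est - ref).abs().mean()

def mae_freq(est, ref, n_fft=512, hop=128):
    """Frequency-domain MAE computed on magnitude STFTs of the two waveforms."""
    win = torch.hann_window(n_fft, device=est.device)
    E = torch.stft(est, n_fft, hop, window=win, return_complex=True).abs()
    R = torch.stft(ref, n_fft, hop, window=win, return_complex=True).abs()
    return (E - R).abs().mean()
```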
In step 205, the parameters of the audio processing model are adjusted according to the calculated value of the first loss function to train the audio processing model. That is, the loss computed with the first loss function is back-propagated to update the parameters of the audio processing model. In practice, the parameters may be adjusted (or updated) over batches of audio training samples and updated iteratively, with the goal of minimizing the loss function, until the audio processing model converges.
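Putting the pieces together, one training loop could be sketched as below; encoder, audio_model, loader, num_epochs, and mae_time are placeholder names for the pre-trained feature extractor, the audio processing model, a data loader yielding (first, second, third) audio batches, the epoch count, and the loss defined above, and any STFT pre-/post-processing is omitted for brevity.

```python
import torch

opt = torch.optim.Adam(audio_model.parameters(), lr=1e-4)    # learning rate is an illustrative choice

for epoch in range(num_epochs):                               # iterate until the model converges
    for audio_a, audio_b, audio_b_label in loader:            # batch of (first, second, third) audio
        with torch.no_grad():                                 # feature extraction model is already pre-trained
            reverb_emb = encoder(audio_a)                     # reverberation feature of the first environment
        est = audio_model(audio_b, reverb_emb)                # estimated third audio
        loss = mae_time(est, audio_b_label)                   # first loss function (time-domain MAE here)
        opt.zero_grad()
        loss.backward()                                       # back-propagate the loss
        opt.step()                                            # adjust the audio processing model's parameters
```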
Fig. 5 is a flowchart illustrating an audio processing method according to an exemplary embodiment of the present disclosure.
In step 501, a first audio obtained by recording a first pure speech in a first recording environment and a second audio obtained by recording a second pure speech in a second recording environment may be acquired.
According to an exemplary embodiment of the present disclosure, the second pure speech may be a segment of the first pure speech. Specifically, while editing audio and video a user often finds that part of the content needs to be re-recorded, yet for various reasons cannot return to the first recording environment; the re-recording can then only be done in a second recording environment, and because it supplements the first audio, the second pure speech corresponding to the second audio is a segment of the first pure speech corresponding to the first audio. In other embodiments, the second pure speech may be independent of the first pure speech; for example, the user may need to add a separate second audio to the first audio to achieve a desired editing effect, in which case the second pure speech corresponding to the second audio may differ from the first pure speech corresponding to the first audio.
In step 502, a reverberation feature of the first recording environment may be obtained based on the first audio by using a pre-trained feature extraction model.
According to an exemplary embodiment of the present disclosure, the feature extraction model may be implemented by an auto-encoder including an encoder and a decoder, and the reverberation feature of the first recording environment may be obtained based on the first audio using the encoder.
In step 503, a third audio estimated from recording a second pure speech in the first recording environment may be obtained by using a pre-trained audio processing model based on the second audio and the reverberation characteristics of the first recording environment. Here, the audio processing model may be trained in advance using the aforementioned training method of the audio processing model, for example.
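At inference time the two trained models are simply chained, as in the sketch below; the function and variable names are placeholders, and waveform/spectrogram conversions are again omitted.

```python
import torch

@torch.no_grad()
def match_environment(audio_a, audio_b, encoder, audio_model):
    """Re-render audio_b as if it had been recorded in audio_a's environment."""
    reverb_emb = encoder(audio_a)            # reverberation feature of the first recording environment
    return audio_model(audio_b, reverb_emb)  # estimated third audio: b recorded "in" environment A

# estimated_b_in_env_a = match_environment(audio_a, audio_b, encoder, audio_model)
```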
With the audio processing method of the present disclosure, the environmental reverberation characteristics of audio clips recorded in different recording environments are brought into agreement, so that when a user combines audio obtained in different recording environments, the auditory mismatch is eliminated to the greatest extent.
Fig. 6 is a block diagram illustrating a training apparatus of an audio processing model according to an exemplary embodiment of the present disclosure.
Referring to fig. 6, the training apparatus 600 of an audio processing model according to an exemplary embodiment of the present disclosure may include a sample acquisition unit 601, a feature extraction unit 602, an audio estimation unit 603, a loss function calculation unit 604, and a training unit 605.
The sample obtaining unit 601 may obtain an audio training sample, where the audio training sample includes a first audio obtained by applying first-environment reverberation processing to a first pure speech, a second audio obtained by applying second-environment reverberation processing to a second pure speech, and a third audio obtained by applying first-environment reverberation processing to the second pure speech. The feature extraction unit 602 may obtain a first environmental reverberation feature based on the first audio by using the feature extraction model, where the first environmental reverberation feature represents the reverberation introduced by the first-environment reverberation processing. The audio estimation unit 603 may obtain an estimated third audio by using the aforementioned audio processing model based on the second audio and the first environmental reverberation feature obtained by the feature extraction unit 602. The loss function calculation unit 604 may calculate a value of the first loss function based on the estimated third audio and the third audio. The training unit 605 may adjust the parameters of the audio processing model according to the calculated value of the first loss function to train the audio processing model.
Since the training method of the audio processing model shown in fig. 2 can be performed by the training apparatus 600 of the audio processing model shown in fig. 6, and the sample obtaining unit 601, the feature extracting unit 602, the audio estimating unit 603, the loss function calculating unit 604, and the training unit 605 can respectively perform operations corresponding to step 201, step 202, step 203, step 204, and step 205 in fig. 2, any relevant details related to the operations performed by the units in fig. 6 can be referred to the corresponding description about fig. 2, and are not repeated here.
Fig. 7 is a block diagram illustrating an audio processing apparatus according to an exemplary embodiment of the present disclosure.
Referring to fig. 7, an audio processing apparatus 700 according to an exemplary embodiment of the present disclosure may include an audio acquisition unit 701, a feature acquisition unit 702, and a third audio acquisition unit 703.
The audio acquiring unit 701 may acquire a first audio obtained by recording a first pure speech in a first recording environment and a second audio obtained by recording a second pure speech in a second recording environment. The feature obtaining unit 702 may obtain a first environmental reverberation feature based on the first audio by using the pre-trained feature extraction model, where the first environmental reverberation feature represents the reverberation feature of the first recording environment. The third audio obtaining unit 703 may obtain, based on the second audio and the first environmental reverberation feature, an estimated third audio corresponding to recording the second pure speech in the first recording environment, by using the audio processing model trained with the aforementioned training method.
Since the audio processing method shown in fig. 5 can be executed by the audio processing apparatus 700 shown in fig. 7, and the audio obtaining unit 701, the feature obtaining unit 702, and the third audio obtaining unit 703 can respectively execute operations corresponding to step 501, step 502, and step 503 in fig. 5, any relevant details related to the operations executed by the units in fig. 7 can be referred to in the corresponding description of fig. 5, and are not repeated here.
Fig. 8 is a block diagram of an electronic device 800 according to an example embodiment of the present disclosure.
Referring to fig. 8, an electronic device 800 includes at least one memory 801 and at least one processor 802, the at least one memory 801 having stored therein a set of computer-executable instructions that, when executed by the at least one processor 802, perform a method of training an audio processing model or a method of audio processing according to an exemplary embodiment of the present disclosure.
By way of example, the electronic device 800 may be a PC, a tablet device, a personal digital assistant, a smartphone, or another device capable of executing the above set of instructions. The electronic device 800 need not be a single electronic device; it can be any collection of devices or circuits that can execute the above instructions (or instruction sets) individually or jointly. The electronic device 800 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In the electronic device 800, the processor 802 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
The processor 802 may execute instructions or code stored in the memory 801, wherein the memory 801 may also store data. The instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.
The memory 801 may be integrated with the processor 802, for example, with RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, memory 801 may include a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory 801 and the processor 802 may be operatively coupled or may communicate with each other, such as through I/O ports, network connections, etc., so that the processor 802 can read files stored in the memory.
Further, the electronic device 800 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 800 may be connected to each other via a bus and/or a network.
According to an exemplary embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the training method of the audio processing model or the audio processing method according to the present disclosure. Examples of the computer-readable storage medium include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc memory, a hard disk drive (HDD), a solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide them to a processor or computer so that the processor or computer can execute the computer program. The computer program in the computer-readable storage medium described above can run in an environment deployed on computer equipment such as a client, a host, a proxy device, or a server; furthermore, in one example, the computer program and any associated data, data files, and data structures are distributed across networked computer systems so that they are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an exemplary embodiment of the present disclosure, a computer program product may also be provided, in which instructions are executable by a processor of a computer device to perform a training method of an audio processing model or an audio processing method according to an exemplary embodiment of the present disclosure.
According to the training method of the audio processing model, the audio processing method, the apparatus, the device, the medium, and the product provided by the present disclosure, a pre-trained feature extraction model extracts a first environmental reverberation feature that characterizes the reverberation of the recording environment of the first audio, and the audio processing model, given this feature and the second audio, converts the recording environment of the second audio into that of the first audio. This largely eliminates the audible mismatch that arises when audio recorded in different environments is combined, so that in re-recording scenarios a user who finds that part of the content needs to be recorded again does not have to return to the original recording environment, which greatly reduces the cost of editing audio and video.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method for training an audio processing model, comprising:
obtaining an audio training sample, wherein the audio training sample comprises a first audio, a second audio and a third audio, the first audio is a reverberation audio obtained based on a first pure voice and a first environment, the second audio is a reverberation audio obtained based on a second pure voice and a second environment, and the third audio is a reverberation audio obtained based on the second pure voice and the first environment;
based on the first audio, obtaining reverberation characteristics of the first environment by utilizing a pre-trained feature extraction model;
obtaining an estimated third audio using the audio processing model based on the second audio and reverberation characteristics of the first environment;
calculating a value of a first loss function based on the estimated third audio and the third audio;
and adjusting parameters of the audio processing model according to the calculated value of the first loss function, and training the audio processing model.
2. The method of claim 1, wherein the first audio, the second audio, and the third audio are obtained by:
convolving the first pure voice with the reverberation characteristics of the first environment to obtain the first audio;
convolving the second pure voice with the reverberation characteristics of the second environment to obtain the second audio;
and convolving the second pure voice with the reverberation characteristics of the first environment to obtain the third audio.
3. The method of claim 2, wherein the reverberation characteristic of the first environment is a first room impulse response and the reverberation characteristic of the second environment is a second room impulse response, and wherein the first room impulse response and the second room impulse response are generated by a mirror sound source method.
4. The method of claim 1, wherein the feature extraction model is implemented by a self-encoder comprising an encoder and a decoder;
the obtaining the reverberation characteristic of the first environment by using the pre-trained feature extraction model based on the first audio comprises:
based on the first audio, obtaining, with the encoder, reverberation characteristics of the first environment.
5. An audio processing method, comprising:
acquiring a first audio obtained by recording a first pure voice in a first recording environment and a second audio obtained by recording a second pure voice in a second recording environment;
based on the first audio, obtaining reverberation characteristics of the first recording environment by utilizing a pre-trained feature extraction model;
and obtaining a third audio which is obtained by recording second pure voice in the first recording environment and is estimated based on the second audio and the reverberation characteristics of the first recording environment by utilizing a pre-trained audio processing model.
6. An apparatus for training an audio processing model, comprising:
a sample acquisition unit configured to: obtaining an audio training sample, wherein the audio training sample comprises a first audio, a second audio and a third audio, the first audio is a reverberation audio obtained based on a first pure voice and a first environment, the second audio is a reverberation audio obtained based on a second pure voice and a second environment, and the third audio is a reverberation audio obtained based on the second pure voice and the first environment;
a feature extraction unit configured to: based on the first audio, obtaining reverberation characteristics of the first environment by utilizing a pre-trained feature extraction model;
an audio estimation unit configured to: obtaining an estimated third audio using the audio processing model based on the second audio and reverberation characteristics of the first environment;
a loss function calculation unit configured to: calculating a value of a first loss function based on the estimated third audio and the third audio;
a training unit configured to: and adjusting parameters of the audio processing model according to the calculated value of the first loss function, and training the audio processing model.
7. An audio processing apparatus, comprising:
an audio acquisition unit configured to: acquiring a first audio obtained by recording a first pure voice in a first recording environment and a second audio obtained by recording a second pure voice in a second recording environment;
a feature acquisition unit configured to: based on the first audio, obtaining reverberation characteristics of the first recording environment by utilizing a pre-trained feature extraction model;
a third audio acquisition unit configured to: and obtaining a third audio which is obtained by recording second pure voice in the first recording environment and is estimated based on the second audio and the reverberation characteristics of the first recording environment by utilizing a pre-trained audio processing model.
8. An electronic device, comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform a method of training an audio processing model as claimed in any one of claims 1 to 4 or a method of audio processing as claimed in claim 5.
9. A computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform a method of training an audio processing model according to any one of claims 1 to 4 or a method of audio processing according to claim 5.
10. A computer program product comprising computer instructions, characterized in that the computer instructions, when executed by at least one processor, implement a method of training an audio processing model according to any of claims 1 to 4 or a method of audio processing according to claim 5.
CN202111549625.XA 2021-12-17 2021-12-17 Model training method, audio processing method, device, equipment, medium and product Pending CN114242110A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111549625.XA CN114242110A (en) 2021-12-17 2021-12-17 Model training method, audio processing method, device, equipment, medium and product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111549625.XA CN114242110A (en) 2021-12-17 2021-12-17 Model training method, audio processing method, device, equipment, medium and product

Publications (1)

Publication Number Publication Date
CN114242110A true CN114242110A (en) 2022-03-25

Family

ID=80757974

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111549625.XA Pending CN114242110A (en) 2021-12-17 2021-12-17 Model training method, audio processing method, device, equipment, medium and product

Country Status (1)

Country Link
CN (1) CN114242110A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116052452A (en) * 2023-04-03 2023-05-02 江西方兴科技股份有限公司 Data processing method and lane early warning method for wireless communication

Similar Documents

Publication Publication Date Title
CN112927707B (en) Training method and device for voice enhancement model and voice enhancement method and device
US11812254B2 (en) Generating scene-aware audio using a neural network-based acoustic analysis
CN112289333A (en) Training method and device of voice enhancement model and voice enhancement method and device
US11514925B2 (en) Using a predictive model to automatically enhance audio having various audio quality issues
CN113284507B (en) Training method and device for voice enhancement model and voice enhancement method and device
CN113241088B (en) Training method and device of voice enhancement model and voice enhancement method and device
US9767846B2 (en) Systems and methods for analyzing audio characteristics and generating a uniform soundtrack from multiple sources
US20230395087A1 (en) Machine Learning for Microphone Style Transfer
CN114121029A (en) Training method and device of speech enhancement model and speech enhancement method and device
CN114242110A (en) Model training method, audio processing method, device, equipment, medium and product
CN113593594B (en) Training method and equipment for voice enhancement model and voice enhancement method and equipment
CN113035221B (en) Training method and device for voice processing model and voice processing method and device
CN112652290B (en) Method for generating reverberation audio signal and training method of audio processing model
CN114038476A (en) Audio signal processing method and device
KR20220053475A (en) Electronic apparatus and method for controlling thereof
WO2023226572A1 (en) Feature representation extraction method and apparatus, device, medium and program product
CN113555031B (en) Training method and device of voice enhancement model, and voice enhancement method and device
CN114283833A (en) Speech enhancement model training method, speech enhancement method, related device and medium
CN111048065B (en) Text error correction data generation method and related device
JP6912780B2 (en) Speech enhancement device, speech enhancement learning device, speech enhancement method, program
CN113470616B (en) Speech processing method and device, vocoder and training method of vocoder
CN114822492B (en) Speech synthesis method and device, electronic equipment and computer readable storage medium
CN113436644B (en) Sound quality evaluation method, device, electronic equipment and storage medium
CN113707163B (en) Speech processing method and device and model training method and device
CN113990343A (en) Training method and device of voice noise reduction model and voice noise reduction method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination