CN113963686A - Audio processing method and device, audio model training method and device, electronic equipment and computer readable storage medium - Google Patents

Audio processing method and device, audio model training method and device, electronic equipment and computer readable storage medium Download PDF

Info

Publication number
CN113963686A
Authority
CN
China
Prior art keywords
audio
training
model
predetermined
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110932183.0A
Other languages
Chinese (zh)
Inventor
王子腾
纳跃跃
刘章
田彪
付强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Damo Institute Hangzhou Technology Co Ltd
Original Assignee
Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Damo Institute Hangzhou Technology Co Ltd filed Critical Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority to CN202110932183.0A
Publication of CN113963686A
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10K SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
    • G10K 11/00 Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K 11/18 Methods or devices for transmitting, conducting or directing sound
    • G10K 11/20 Reflecting arrangements
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The application discloses an audio processing method and apparatus, an audio model training method and apparatus, an electronic device, and a computer-readable storage medium. The method comprises the following steps: acquiring audio to be processed; extracting a feature vector of the audio to be processed; and calculating the feature vector using a predetermined model trained with reverberant training audio generated from predetermined sampled audio, so as to obtain processed audio. In the embodiments of the application, the model is trained with target audio generated from the direct sound together with the early reflections, and the model trained in this way is used to process the mixed audio in actual use. Choosing the early reflections, rather than the direct sound alone, as the training and restoration target effectively protects the original target audio and ensures that the processed audio sounds natural and clear.

Description

Audio processing method and device, audio model training method and device, electronic equipment and computer readable storage medium
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to an audio processing method and apparatus, an audio model training method and apparatus, an electronic device, and a computer-readable storage medium.
Background
With the development of cloud technology, more and more users choose to hold conference discussions and classroom lessons over the Internet using cloud services. In such cloud conferences or cloud classrooms, the user's voice is picked up near the user's terminal, transmitted over the Internet to the other participants, and played back. Because users usually speak indoors, the speech audio collected by the capture device at the user terminal is inevitably a mixture of direct audio that travels straight from the speaker to the capture device, reflected audio that reaches the device after being reflected once or twice by objects such as walls, and late reverberation produced by many further reflections. Such mixed audio severely degrades the intelligibility of the user's speech and greatly impairs the listening experience of the other users.
Disclosure of Invention
The embodiments of the application provide an audio processing method and apparatus, an audio model training method and apparatus, an electronic device, and a computer-readable storage medium, so as to address the defect of the prior art that reverberant audio processing yields unnatural results.
To achieve the above object, an embodiment of the present application provides an audio processing method, including:
acquiring audio to be processed;
extracting a feature vector of the audio to be processed;
calculating the feature vector using a predetermined model trained with reverberant training audio generated from predetermined sampled audio, so as to obtain processed audio.
The embodiment of the application further provides an audio model training method, which comprises the following steps:
generating reverberation training audio for predetermined sample audio using a predetermined algorithm;
generating training target audio from the predetermined sampled audio and at least a portion of the reverberant training audio;
training a predetermined model using the reverberant training audio as an input and the training target audio as validation data.
The embodiment of the application further provides a conference audio processing method, which comprises the following steps:
acquiring, through an audio acquisition device, speech audio uttered at a conference-participating terminal taking part in a conference;
extracting a feature vector of the speech audio;
calculating the feature vector using a predetermined model trained with reverberant training audio generated from predetermined sampled audio, so as to obtain processed audio;
and sending the processed audio to other conference participating terminals participating in the conference.
The embodiment of the application also provides a classroom audio processing method, which comprises the following steps:
acquiring, through an audio acquisition device arranged in a classroom, teaching audio uttered by a teacher during teaching;
extracting a feature vector of the teaching audio;
calculating the feature vector using a predetermined model trained with reverberant training audio generated from predetermined sampled audio, so as to obtain processed audio;
and transmitting the processed audio to a terminal for listening to classroom teaching through a network.
An embodiment of the present application further provides an audio processing apparatus, including:
the acquisition module is used for acquiring audio to be processed;
the extraction module is used for extracting the characteristic vector of the audio to be processed;
and the processing module is used for calculating the feature vector using a predetermined model trained with reverberant training audio generated from predetermined sampled audio, so as to obtain processed audio.
The embodiment of the present application further provides an audio model training device, including:
a first generation module for generating a reverberant training audio for a predetermined sample audio using a predetermined algorithm;
a second generation module for generating training target audio from the predetermined sampled audio and at least a portion of the reverberant training audio;
a training module to train a predetermined model using the reverberation training audio as an input and the training target audio as validation data.
An embodiment of the present application further provides an electronic device, including:
a memory for storing a program;
and the processor is used for running the program stored in the memory, where the program, when running, performs the audio processing method or the audio model training method provided by the embodiments of the application.
Embodiments of the present application further provide a computer-readable storage medium on which a computer program executable by a processor is stored, where the program, when executed by the processor, implements an audio processing method or an audio model training method as provided by embodiments of the present application.
An embodiment of the present application further provides a computer program product, including: a computer-readable storage medium storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform steps in an audio processing method or an audio model training method as provided by embodiments of the present application.
According to the audio processing method and apparatus, the audio model training method and apparatus, the electronic device, and the computer-readable storage medium of the embodiments of the application, the model is trained with target audio generated from the direct sound together with the early reflections, and the model trained in this way is used to process the mixed audio in actual use. Choosing the early reflections, rather than the direct sound alone, as the training and restoration target effectively protects the original target audio and ensures that the processed audio sounds natural and clear.
The foregoing description is only an overview of the technical solutions of the present application, and the present application can be implemented according to the content of the description in order to make the technical means of the present application more clearly understood, and the following detailed description of the present application is given in order to make the above and other objects, features, and advantages of the present application more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a schematic view of an application scenario of an audio processing scheme provided in an embodiment of the present application;
FIG. 2 is a flow diagram of one embodiment of an audio processing method provided herein;
FIG. 3 is a flow diagram of one embodiment of an audio processing method provided herein;
fig. 4a is a schematic structural diagram of an embodiment of an audio processing apparatus provided in the present application;
FIG. 4b is a schematic structural diagram of an embodiment of an audio model training apparatus provided in the present application;
fig. 5 is a schematic structural diagram of an embodiment of an electronic device provided in the present application.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Example one
The scheme provided by the embodiments of the application can be applied to any system with audio data processing capability, for example a server system or a chip component with audio processing functions. Fig. 1 is a schematic diagram of an application scenario of the audio processing scheme provided in an embodiment of the application; the scenario shown in fig. 1 is only one example to which the technical scheme of the application is applicable.
With the development of cloud technology, more and more users choose to hold conference discussions and classroom lessons over the Internet using cloud services. In such cloud conferences or cloud classrooms, the user's voice is picked up near the user's terminal, transmitted over the Internet to the other participants, and played back. Because users usually speak indoors, the speech audio collected by the capture device at the user terminal is inevitably a mixture of direct audio that travels straight from the speaker to the capture device, reflected audio that reaches the device after being reflected once or twice by objects such as walls, and late reverberation produced by many further reflections. Such mixed audio severely degrades the intelligibility of the user's speech and greatly impairs the listening experience of the other users. For this reason, a technical solution capable of suppressing reverberation in the acquired audio is needed.
For example, in an audio capture scene such as the classroom shown in fig. 1, when a teacher speaks at the lecture table on the left side of the classroom, the uttered voice propagates to the right and is captured by an audio capture device on the right. Reverberation occurs during this capture: in the field of audio processing, reverberation refers to the process by which sound persists and decays in a space after the sound source stops emitting it. In the classroom capture scene shown in fig. 1, the audio finally captured by the capture device can therefore be considered to consist of three parts: audio 1, the direct speech that travels from the teacher to the capture device without any reflection; audio 2, the early reflected audio that reaches the capture device after being reflected once or twice by, for example, the classroom walls; and audio 3, the late reflected audio that reaches the capture device after three or more reflections off, for example, the classroom walls. The finally captured signal is thus a mixture of audio 1 (direct speech), audio 2 (early reflections), and audio 3 (late reflections). Normally only the late reflections have a large impact on the intelligibility of the speech in the captured mixture; early reflections can in fact even reinforce the speech energy when the direct sound is weak. Therefore, in mixed-audio processing it is generally the late reflected audio in the captured mixture that needs to be suppressed.
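In signal terms (the notation below is introduced here purely for illustration and does not appear in the original filing), the captured mixture can be written as the clean speech convolved with a room impulse response that splits into direct, early and late parts:

```latex
y(t) = (s * h)(t), \qquad h(t) = h_{\text{direct}}(t) + h_{\text{early}}(t) + h_{\text{late}}(t)
```

so that audio 1, audio 2 and audio 3 above correspond to s * h_direct, s * h_early and s * h_late, respectively.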
To address this, signal-processing-based schemes have been proposed in the prior art, for example computing a Wiener gain by estimating the energy of the late reverberation in the mixed audio with a pre-assumed statistical model of reverberation in a single-channel pickup scene, but the reverberation suppression achieved by this scheme is not ideal. For multi-channel pickup scenes, a WPE (Weighted Prediction Error) method has also been proposed, which estimates the reverberation tail of the audio signal and subtracts it from the signal to obtain a maximum-likelihood estimate of a weakly reverberant signal, but when few microphone channels are available the method does not noticeably improve the listening experience.
In addition, a reverberation suppression algorithm based on a deep learning model has also been proposed in the prior art, which uses the direct audio in the mixed audio as the processing target during training. However, because the degree of reverberation suppression of such a model is not sufficiently smooth over time, the processed audio exhibits obvious energy fluctuations and sounds unnatural.
For this reason, in the embodiments of the application, when training the model, speech audio with a sampling rate of, for example, 16 kHz or 48 kHz is prepared as the sampled speech data, and the positions of the sound source and the acquisition device can be set at random in a randomly generated simulated room. For example, the sound source may be placed in the middle of the left side of the room and the acquisition device in the middle of the right side, as shown in fig. 1. Room impulse response (RIR) data, which describes the reverberation characteristics of the simulated room, can then be generated using, for example, the image source method (IMAGE) proposed by Allen and Berkley in 1979. The model can then be configured and initialized according to actual requirements; for example, in the embodiments of the application, the model may be a deep neural network built from linear transformations, DFSMN (Deep Feedforward Sequential Memory Network) layers, and nonlinear activation functions. After the model parameters have been initialized, mixed audio can be obtained by convolving the sampled speech data with the RIR data, and the training target audio can be obtained by convolving the sampled speech with the early-reflection part of the RIR data, i.e. the part corresponding to audio 2 in fig. 1. In the embodiments of the application, the early-reflection part may be taken as the portion of the RIR data within 50 ms or 100 ms after the direct sound.
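The following Python sketch illustrates this data-generation step under simplified assumptions: the room impulse response here is a toy exponentially decaying noise sequence standing in for one produced by the image source method, and the variable names and parameter values (16 kHz sampling rate, a 50 ms early-reflection window) are illustrative choices rather than the patent's reference implementation.

```python
import numpy as np
from scipy.signal import fftconvolve

fs = 16000                        # sampling rate of the prepared speech audio (16 kHz)
speech = np.random.randn(fs * 3)  # placeholder for 3 s of clean sampled speech

# Toy stand-in for a simulated room impulse response (RIR): decaying noise.
rir_len = int(0.6 * fs)           # 600 ms RIR
rir = np.random.randn(rir_len) * np.exp(-np.arange(rir_len) / (0.1 * fs))

# Locate the direct-sound peak and keep only the first 50 ms after it,
# giving the "direct sound + early reflections" part of the RIR.
direct_idx = int(np.argmax(np.abs(rir)))
early_window = int(0.05 * fs)     # 50 ms; 100 ms is the other option mentioned above
rir_early = rir[: direct_idx + early_window]

# Reverberant training input: speech convolved with the full RIR.
mixed_audio = fftconvolve(speech, rir)[: len(speech)]

# Training target: speech convolved with only the early part of the RIR.
target_audio = fftconvolve(speech, rir_early)[: len(speech)]
```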
A short-time Fourier transform (STFT) algorithm may then be used to extract energy feature vectors from the mixed audio previously obtained by convolving the sampled speech data with the RIR data. In the embodiment of the application, the extracted energy features may be filterbank feature vectors. The extracted feature vectors can be input into the model and passed through its forward computation to obtain, for example, time-frequency masking data. This time-frequency masking data can then be compared against the target audio previously obtained from the sampled speech and audio 2 of fig. 1 to calculate a loss function; for example, the mean square error between the time-frequency masking data output by the model and the ideal masking data of the target audio may be calculated. A gradient back-propagation algorithm may then be used to adjust the model parameters according to the calculation result, and the model computation performed again, until the loss function of the current round is no longer significantly lower than the loss function calculated in the previous round, which indicates that the model has converged, i.e. that training of the model is complete.
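A minimal training-step sketch of the loss computation and back-propagation described above might look as follows. The small feed-forward network is only a stand-in for the linear-transform/DFSMN model, the mask-based mean-square-error loss mirrors the description, and mixed_audio and target_audio are assumed to come from the previous sketch.

```python
import torch
import torch.nn as nn

n_fft, hop = 512, 256
window = torch.hann_window(n_fft)

def mag_spec(x):
    # Magnitude spectrogram, shape (frames, bins); a filterbank projection could follow.
    spec = torch.stft(x, n_fft, hop_length=hop, window=window, return_complex=True)
    return spec.abs().T

mixed_mag = mag_spec(torch.as_tensor(mixed_audio, dtype=torch.float32))
target_mag = mag_spec(torch.as_tensor(target_audio, dtype=torch.float32))

# "Ideal" time-frequency mask of the target relative to the mixture.
ideal_mask = torch.clamp(target_mag / (mixed_mag + 1e-8), 0.0, 1.0)

model = nn.Sequential(              # illustrative stand-in for the DFSMN-based model
    nn.Linear(n_fft // 2 + 1, 256), nn.ReLU(),
    nn.Linear(256, n_fft // 2 + 1), nn.Sigmoid(),
)
optim = torch.optim.Adam(model.parameters(), lr=1e-3)

prev_loss = float("inf")
for epoch in range(100):
    pred_mask = model(mixed_mag)                        # forward computation
    loss = nn.functional.mse_loss(pred_mask, ideal_mask)
    optim.zero_grad()
    loss.backward()                                     # gradient back-propagation
    optim.step()
    if prev_loss - loss.item() < 1e-5:                  # loss no longer clearly decreasing,
        break                                           # treat the model as converged
    prev_loss = loss.item()
```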
The model trained in this way can then be used, in the embodiments of the application, to process the captured mixed audio. Similarly to the training process, filterbank feature vectors, for example, may be extracted from the mixed audio collected by the acquisition device, and the extracted feature vectors input into the trained model for processing. The masking data output by the model is multiplied by the time-frequency spectrum of the captured mixed audio, and an inverse Fourier transform then yields the time-domain signal with the reverberation suppressed.
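Correspondingly, the inference path could be sketched as below, reusing the assumed model and STFT settings from the training sketch: the captured mixture is masked in the time-frequency domain and transformed back to the time domain.

```python
import torch

# `mixed_audio`, `model`, `n_fft`, `hop` and `window` are taken from the sketches above;
# `captured` stands for the mixed audio picked up by the acquisition device.
captured = torch.as_tensor(mixed_audio, dtype=torch.float32)
spec = torch.stft(captured, n_fft, hop_length=hop, window=window, return_complex=True)

with torch.no_grad():
    mask = model(spec.abs().T).T        # model output mask, shape (bins, frames)

enhanced_spec = spec * mask             # multiply the mask with the time-frequency spectrum
enhanced = torch.istft(enhanced_spec, n_fft, hop_length=hop, window=window,
                       length=captured.shape[0])   # inverse transform back to time domain
```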
Therefore, according to the audio processing scheme provided by the embodiments of the application, the model is trained with target audio generated from the direct sound together with the early reflections, and the model trained in this way is used to process the mixed audio in actual use. Choosing the early reflections, rather than the direct sound alone, as the training and restoration target effectively protects the original target audio and ensures that the processed audio sounds natural and clear.
The above embodiments are illustrations of technical principles and exemplary application frameworks of the embodiments of the present application, and specific technical solutions of the embodiments of the present application are further described in detail below through a plurality of embodiments.
Example two
Fig. 2 is a flowchart of an embodiment of an audio processing method provided in the present application, and an execution subject of the method may be various terminal or server devices with audio processing capability, or may be a device or chip integrated on these devices. As shown in fig. 2, the audio processing method may include the steps of:
s201, obtaining the audio to be processed.
In the embodiment of the application, the voice uttered by a voice source can be collected in the same space as the voice source. When the uttered voice propagates in the capture space, the part that travels along the line from the voice source to the capture device is captured directly, for example as direct audio 1 in fig. 1; another part propagates in other directions and reaches the capture device after being reflected once or twice by, for example, a wall as shown in fig. 1, forming early reflected audio 2; and a further part undergoes many reflections before reaching the capture device, forming late reflected audio 3 as shown in fig. 1. The audio to be processed, composed of direct audio 1, early reflected audio 2, and late reflected audio 3, is finally acquired in step S201.
S202, extracting the feature vector of the audio to be processed.
After the audio to be processed in which the direct audio, the early reflected audio, and the late reflected audio are mixed is acquired in step S201, a feature vector may be extracted for the audio to be processed first in step S202. For example, filter bank feature vectors can be extracted.
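As an illustration of this feature-extraction step, filterbank-style features could be computed as follows. torchaudio and the 80-band mel configuration are assumptions made for the example (the description only states that filterbank feature vectors may be extracted), and the waveform is assumed to be audio such as the mixed audio from the earlier sketch.

```python
import torch
import torchaudio

# Log mel-filterbank features from the audio to be processed; configuration values are
# illustrative, not specified by the embodiment.
fbank_extractor = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=512, hop_length=256, n_mels=80)

waveform = torch.as_tensor(mixed_audio, dtype=torch.float32)   # audio to be processed
features = torch.log(fbank_extractor(waveform) + 1e-8).T       # (frames, 80) log-filterbank
```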
S203, calculating the feature vector using a predetermined model to obtain the processed audio.
The feature vector extracted in step S202 may be input to a predetermined model for processing in step S203. For example, such a model may be a deep neural network built from linear transformations, DFSMN (Deep Feedforward Sequential Memory Network) layers, and nonlinear activation functions. In the embodiment of the application, in step S203 the feature vector extracted in step S202 may be calculated using a predetermined model trained with reverberant training audio generated from predetermined sampled audio, so as to obtain the processed audio.
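To make the model description concrete, the following is a simplified, illustrative FSMN-style masking network. It is a sketch under stated assumptions, not the DFSMN implementation referred to in the embodiment, and the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class MemoryBlock(nn.Module):
    """Simplified FSMN-style block: linear projection plus a learned memory over time."""
    def __init__(self, in_dim, hidden_dim, context=10):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden_dim)      # linear transformation
        # Depthwise 1-D convolution over time acts as the sequential "memory".
        self.memory = nn.Conv1d(hidden_dim, hidden_dim, kernel_size=2 * context + 1,
                                padding=context, groups=hidden_dim, bias=False)
        self.act = nn.ReLU()                           # nonlinear activation

    def forward(self, x):                              # x: (batch, frames, in_dim)
        h = self.proj(x)
        m = self.memory(h.transpose(1, 2)).transpose(1, 2)
        return self.act(h + m)

class MaskNet(nn.Module):
    """Stacks memory blocks and outputs a time-frequency mask in [0, 1]."""
    def __init__(self, n_bins=257, hidden=256, n_blocks=3):
        super().__init__()
        self.blocks = nn.Sequential(
            MemoryBlock(n_bins, hidden),
            *[MemoryBlock(hidden, hidden) for _ in range(n_blocks - 1)])
        self.out = nn.Sequential(nn.Linear(hidden, n_bins), nn.Sigmoid())

    def forward(self, feats):                          # feats: (batch, frames, n_bins)
        return self.out(self.blocks(feats))
```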
For example, the predetermined model of the embodiment of the present application may be trained via the following training manner according to the embodiment of the present application.
S204, generating reverberation training audio aiming at the preset sampling audio by using a preset algorithm.
In the embodiment of the application, when training the model, speech audio with a sampling rate of, for example, 16 kHz or 48 kHz may be prepared as the sampled speech data, and the positions of the sound source and the acquisition device may be set at random in a randomly generated simulated room. For example, the sound source may be placed in the middle of the left side of the room and the acquisition device in the middle of the right side, as shown in fig. 1. Various predetermined algorithms may then be used to generate reverberant training audio for the sampled audio. For example, the image source method (IMAGE) may be used to generate room impulse response (RIR) data that describes the reverberation characteristics of the simulated room. The model can then be configured and initialized according to actual requirements; for example, in the embodiment of the application, the model may be a deep neural network built from linear transformations, DFSMN (Deep Feedforward Sequential Memory Network) layers, and nonlinear activation functions. After the model parameters have been initialized, mixed audio may be obtained by convolving the sampled speech data with the RIR data.
S205, generating training target audio from the predetermined sampled audio and at least a portion of the reverberant training audio.
In the embodiment of the application, the sampled speech used in step S204 may be convolved with the early-reflection part of the RIR data obtained in step S204, i.e. the part corresponding to audio 2 in fig. 1, to obtain the training target audio. In the embodiment of the application, the early-reflection part used in step S205 may be taken as the portion of the RIR data within 50 ms or 100 ms after the direct sound.
S206, training the predetermined model by using the reverberation training audio as input and using the training target audio as verification data.
The model used in step S203 may be trained in step S206 using the reverberant training audio obtained in step S204 as the input to the model and the training target audio obtained in step S205 as the verification data of the model.
For example, a short-time Fourier transform (STFT) algorithm may be used to extract energy feature vectors from the mixed audio previously obtained by convolving the sampled speech data with the RIR data. In the embodiment of the application, the extracted energy features may be filterbank feature vectors. The extracted feature vectors can be input into the model and passed through its forward computation to obtain, for example, time-frequency masking data. This time-frequency masking data can then be compared against the target audio previously obtained from the sampled speech and audio 2 of fig. 1 to calculate a loss function; for example, the mean square error between the time-frequency masking data output by the model and the ideal masking data of the target audio may be calculated. A gradient back-propagation algorithm may then be used to adjust the model parameters according to the calculation result, and the model computation performed again, until the loss function of the current round is no longer significantly lower than the loss function calculated in the previous round, which indicates that the model has converged, i.e. that training of the model is complete.
Therefore, according to the audio processing scheme provided by the embodiments of the application, the model is trained with target audio generated from the direct sound together with the early reflections, and the model trained in this way is used to process the mixed audio in actual use. Choosing the early reflections, rather than the direct sound alone, as the training and restoration target effectively protects the original target audio and ensures that the processed audio sounds natural and clear.
EXAMPLE III
Fig. 3 is a flowchart of an embodiment of an audio processing method provided in the present application, and an execution subject of the method may be various terminal or server devices with audio processing capability, or may be a device or chip integrated on these devices. As shown in fig. 3, the audio processing method may include the steps of:
s301, obtaining the audio to be processed.
In the embodiment of the application, the voice uttered by a voice source can be collected in the same space as the voice source. When the uttered voice propagates in the capture space, the part that travels along the line from the voice source to the capture device is captured directly, for example as direct audio 1 in fig. 1; another part propagates in other directions and reaches the capture device after being reflected once or twice by, for example, a wall as shown in fig. 1, forming early reflected audio 2; and a further part undergoes many reflections before reaching the capture device, forming late reflected audio 3 as shown in fig. 1. The audio to be processed, composed of direct audio 1, early reflected audio 2, and late reflected audio 3, is finally acquired in step S301.
S302, extracting the feature vector of the audio to be processed.
After the audio to be processed in which the direct audio, the early reflected audio, and the late reflected audio are mixed is acquired in step S301, a feature vector may be extracted for the audio to be processed first in step S302. For example, filter bank feature vectors can be extracted.
S303, forward-computing the feature vector using a predetermined model to obtain masking data.
S304, multiplying the masking data by the time-frequency spectrum of the audio to be processed, and performing an inverse Fourier transform to obtain the processed audio.
In step S303, the feature vector extracted in step S302 may be input to a predetermined model, which performs a forward computation on it and outputs, for example, time-frequency masking data. In step S304, the masking data thus obtained may be multiplied by the time-frequency spectrum of the audio to be processed acquired in step S301, and an inverse Fourier transform may then be performed to obtain a processed time-domain signal, which can be used as the processed audio for speech recognition, playback, or other processing.
In particular, in the embodiment of the application, such a model may be a deep neural network built from linear transformations, DFSMN (Deep Feedforward Sequential Memory Network) layers, and nonlinear activation functions. For example, in the embodiment of the application, the feature vector extracted in step S302 may be forward-computed in step S303 using a predetermined model trained with reverberant training audio generated from predetermined sampled audio, so as to obtain the masking data.
For example, the predetermined model of the embodiment of the present application may be trained via the following training manner according to the embodiment of the present application.
S305, performing a convolution calculation using the predetermined sampled audio and predetermined room impulse response data to obtain the reverberant training audio.
In the embodiment of the application, when training the model, speech audio with a sampling rate of, for example, 16 kHz or 48 kHz may be prepared as the sampled speech data, and the positions of the sound source and the acquisition device may be set at random in a randomly generated simulated room. For example, the sound source may be placed in the middle of the left side of the room and the acquisition device in the middle of the right side, as shown in fig. 1. Various predetermined algorithms may then be used to generate reverberant training audio for the sampled audio. For example, the image source method (IMAGE) may be used to generate room impulse response (RIR) data that describes the reverberation characteristics of the simulated room. The model can then be configured and initialized according to actual requirements; for example, in the embodiment of the application, the model may be a deep neural network built from linear transformations, DFSMN (Deep Feedforward Sequential Memory Network) layers, and nonlinear activation functions. After the model parameters have been initialized, mixed audio may be obtained by convolving the sampled speech data with the RIR data.
S306, performing a convolution calculation using the predetermined sampled audio and the early reflected audio to obtain the training target audio.
In the embodiment of the application, the sampled speech data used in step S305 may be convolved with the early-reflection part of the RIR data obtained in step S305, i.e. the part corresponding to audio 2 in fig. 1, to obtain the training target audio. In the embodiment of the application, the early-reflection part used in step S306 may be taken as the portion of the RIR data within 50 ms or 100 ms after the direct sound.
S307, the predetermined model is trained using the reverberation training audio as an input and the training target audio as verification data.
The model used in step S303 may be trained in step S307 using the reverberation training audio obtained in step S305 as an input to the model and the training target audio obtained in step S306 as verification data of the model.
For example, a short-time Fourier transform (STFT) algorithm may be used to extract energy feature vectors from the mixed audio previously obtained by convolving the sampled speech data with the RIR data. In the embodiment of the application, the extracted energy features may be filterbank feature vectors. The extracted feature vectors can be input into the model and passed through its forward computation to obtain, for example, time-frequency masking data. This time-frequency masking data can then be compared against the target audio previously obtained from the sampled speech and audio 2 of fig. 1 to calculate a loss function; for example, the mean square error between the time-frequency masking data output by the model and the ideal masking data of the target audio may be calculated. A gradient back-propagation algorithm may then be used to adjust the model parameters according to the calculation result, and the model computation performed again, until the loss function of the current round is no longer significantly lower than the loss function calculated in the previous round, which indicates that the model has converged, i.e. that training of the model is complete.
Therefore, according to the audio processing scheme provided by the embodiments of the application, the model is trained with target audio generated from the direct sound together with the early reflections, and the model trained in this way is used to process the mixed audio in actual use. Choosing the early reflections, rather than the direct sound alone, as the training and restoration target effectively protects the original target audio and ensures that the processed audio sounds natural and clear.
Example four
Fig. 4a is a schematic structural diagram of an embodiment of an audio processing apparatus provided in the present application, which can be used to execute the audio processing method shown in fig. 2 or fig. 3. As shown in fig. 4a, the audio processing apparatus may include: an acquisition module 41, an extraction module 42, and a processing module 43.
The acquisition module 41 may be configured to acquire the audio to be processed.
In the embodiment of the application, the acquisition module 41 may collect the voice uttered by a voice source in the same space as the voice source. When the uttered voice propagates in the capture space, the part that travels along the line from the voice source to the acquisition module 41 is captured directly, for example as direct audio 1 in fig. 1; another part propagates in other directions and reaches the acquisition module 41 after being reflected once or twice by, for example, a wall, forming early reflected audio 2 as shown in fig. 1; and a further part undergoes many reflections before reaching the acquisition module 41, forming late reflected audio 3 as shown in fig. 1. The audio to be processed, composed of direct audio 1, early reflected audio 2, and late reflected audio 3, is finally acquired by the acquisition module 41.
The extraction module 42 may be configured to extract feature vectors of the audio to be processed.
After the acquisition module 41 acquires the audio to be processed in which the direct audio, the early reflected audio, and the late reflected audio are mixed, the extraction module 42 may first extract a feature vector from the audio to be processed. For example, filterbank feature vectors can be extracted.
The processing module 43 may be configured to calculate the feature vector using a predetermined model to obtain the processed audio. For example, in the embodiment of the application, the processing module 43 may calculate the feature vector extracted by the extraction module 42 using a predetermined model trained with reverberant training audio generated from predetermined sampled audio, so as to obtain the processed audio.
The processing module 43 may input the feature vector extracted by the extraction module 42 into a predetermined model for processing. For example, the processing module 43 may input the feature vector into the predetermined model, which performs a forward computation on it and outputs, for example, time-frequency masking data; the masking data thus obtained may then be multiplied by the time-frequency spectrum of the audio to be processed acquired by the acquisition module 41, and an inverse Fourier transform may be performed to obtain a processed time-domain signal, which can be used as the processed audio for speech recognition, playback, or other processing. For example, such a model may be a deep neural network built from linear transformations, DFSMN (Deep Feedforward Sequential Memory Network) layers, and nonlinear activation functions. For example, the predetermined model of the embodiment of the application may be trained by an audio model training apparatus according to the embodiment of the application as shown in fig. 4b.
Fig. 4b is a schematic structural diagram of an embodiment of an audio model training apparatus provided in the present application.
For example, an audio model training apparatus according to an embodiment of the present application may include: a first generation module 44, a second generation module 45, and a training module 46.
The first generation module 44 may be used to generate reverberant training audio for predetermined sampled audio using a predetermined algorithm. In the embodiment of the application, when the audio model training apparatus trains the model, speech audio with a sampling rate of, for example, 16 kHz or 48 kHz may be prepared as the sampled speech data, and the positions of the sound source and the acquisition device may be set at random in a randomly generated simulated room. For example, the sound source may be placed in the middle of the left side of the room and the acquisition device in the middle of the right side, as shown in fig. 1. The first generation module 44 may then use various predetermined algorithms to generate reverberant training audio for the sampled audio. For example, the image source method (IMAGE) may be used to generate room impulse response (RIR) data that describes the reverberation characteristics of the simulated room. The model can then be configured and initialized according to actual requirements; for example, in the embodiment of the application, the model may be a deep neural network built from linear transformations, DFSMN (Deep Feedforward Sequential Memory Network) layers, and nonlinear activation functions. After the model parameters have been initialized, the first generation module 44 may obtain mixed audio by convolving the sampled speech data with the RIR data.
The second generation module 45 may be used to generate the training target audio from the predetermined sampled audio and at least a portion of the reverberant training audio.
In the embodiment of the application, the second generation module 45 may convolve the same sampled speech data with the early-reflection part of the RIR data, i.e. the part corresponding to audio 2 in fig. 1, to obtain the training target audio. In the embodiment of the application, the early-reflection part used by the second generation module 45 may be taken as the portion of the RIR data within 50 ms or 100 ms after the direct sound.
The training module 46 may be used to train the predetermined model used by the processing module 43 using the reverberant training audio as input and the training target audio as validation data.
For example, a short-time Fourier transform (STFT) algorithm may be used to extract energy feature vectors from the mixed audio previously obtained by convolving the sampled speech data with the RIR data. In the embodiment of the application, the extracted energy features may be filterbank feature vectors. The extracted feature vectors can be input into the model and passed through its forward computation to obtain, for example, time-frequency masking data. This time-frequency masking data can then be compared against the target audio previously obtained from the sampled speech and audio 2 of fig. 1 to calculate a loss function; for example, the mean square error between the time-frequency masking data output by the model and the ideal masking data of the target audio may be calculated. A gradient back-propagation algorithm may then be used to adjust the model parameters according to the calculation result, and the model computation performed again, until the loss function of the current round is no longer significantly lower than the loss function calculated in the previous round, which indicates that the model has converged, i.e. that training of the model is complete.
Therefore, the audio processing apparatus provided by the embodiment of the application trains the model with target audio generated from the direct sound together with the early reflections, and uses the model trained in this way to process the mixed audio in actual use. Choosing the early reflections, rather than the direct sound alone, as the training and restoration target effectively protects the original target audio and ensures that the processed audio sounds natural and clear.
EXAMPLE five
The internal functions and structure of the data processing apparatus, which can be implemented as an electronic device, are described above. Fig. 5 is a schematic structural diagram of an embodiment of an electronic device provided in the present application. As shown in fig. 5, the electronic device includes a memory 51 and a processor 52.
The memory 51 stores programs. In addition to the above-described programs, the memory 51 may also be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and so forth.
The memory 51 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The processor 52 is not limited to a Central Processing Unit (CPU); it may also be another processing chip such as a Graphics Processing Unit (GPU), a Field Programmable Gate Array (FPGA), an embedded Neural-network Processing Unit (NPU), or an Artificial Intelligence (AI) chip. The processor 52 is coupled to the memory 51 and executes the program stored in the memory 51 to perform the audio processing methods of the second and third embodiments.
Further, as shown in fig. 5, the electronic device may further include: communication components 53, power components 54, audio components 55, display 56, and other components. Only some of the components are schematically shown in fig. 5, and it is not meant that the electronic device comprises only the components shown in fig. 5.
The communication component 53 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device may access a wireless network based on a communication standard, such as WiFi, 3G, 4G, or 5G, or a combination thereof. In an exemplary embodiment, the communication component 53 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 53 further comprises a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
A power supply component 54 provides power to the various components of the electronic device. The power components 54 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for an electronic device.
The audio component 55 is configured to output and/or input audio signals. For example, the audio component 55 includes a Microphone (MIC) configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 51 or transmitted via the communication component 53. In some embodiments, audio assembly 55 also includes a speaker for outputting audio signals.
The display 56 includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (13)

1. An audio model training method, comprising:
generating reverberation training audio for predetermined sample audio using a predetermined algorithm;
generating a training target audio from the predetermined sampled audio and at least a portion of the reverberant training audio;
training a predetermined model using the reverberant training audio as an input and the training target audio as validation data.
2. The audio model training method of claim 1, wherein said generating reverberation training audio for predetermined sample audio using a predetermined algorithm comprises:
and performing convolution calculation by using the preset sampling audio and preset room impact response data to obtain reverberation training audio.
3. The audio model training method of claim 1, wherein at least a portion of the reverberation training audio is early reflection audio of the predetermined sample audio within a predetermined time, and the generating training target audio from the predetermined sample audio and the at least a portion of the reverberation training audio comprises:
performing a convolution calculation using the predetermined sampled audio and the early reflection audio to obtain the training target audio.
4. The audio model training method of claim 1, wherein the training the predetermined model using the reverberation training audio as an input and the training target audio as validation data further comprises:
calculating a loss function from the output data of the predetermined model and the verification data;
adjusting parameters of the predetermined model according to the loss function;
determining that the model training has converged according to a difference between the loss function value and a loss function value obtained in a previous round of training.
5. The audio model training method of claim 4, wherein said calculating a loss function from the output data of the predetermined model and the validation data comprises:
calculating the mean square error between the output mask and the ideal mask; and
Said adjusting parameters of said predetermined model according to said loss function comprises:
and adjusting the parameters through a gradient back-propagation algorithm according to the mean square error.
6. An audio processing method, comprising:
acquiring audio to be processed;
extracting a feature vector of the audio to be processed;
calculating the feature vector using a predetermined model trained with reverberant training audio generated from predetermined sampled audio, so as to obtain processed audio.
7. The audio processing method according to claim 6, wherein the calculating the feature vector using a predetermined model to obtain the processed audio comprises:
forward computing the feature vectors using the predetermined model to obtain masking data;
and multiplying the masking data by the time-frequency spectrum of the audio to be processed and performing inverse Fourier transform to obtain the processed audio.
8. The audio processing method of claim 6, wherein at least a portion of the reverberation training audio is early reflected audio of the predetermined sampled audio within a predetermined time.
9. A conference audio processing method, comprising:
acquiring speaking audio sent by a conference participating terminal participating in a conference through an audio acquisition device;
extracting a feature vector of the speech audio;
calculating the feature vector using a predetermined model trained with reverberant training audio generated from predetermined sampled audio, so as to obtain processed audio;
and sending the processed audio to other conference participating terminals participating in the conference.
10. A classroom audio processing method comprising:
acquiring teaching audio sent by a teacher during teaching through an audio acquisition device arranged in a classroom;
extracting a feature vector of the teaching audio;
calculating the feature vector using a predetermined model trained with reverberant training audio generated from predetermined sampled audio, so as to obtain processed audio;
and transmitting the processed audio to a terminal for listening to classroom teaching through a network.
11. An electronic device, comprising:
a memory for storing a program;
a processor for executing the program stored in the memory to perform the audio model training method of any one of claims 1 to 3 or the audio processing method of any one of claims 4 to 8.
12. A computer-readable storage medium, on which a computer program is stored which is executable by a processor, wherein the program, when executed by the processor, implements the audio model training method of any one of claims 1 to 3 or the audio processing method of any one of claims 4 to 8.
13. A computer program product, comprising: a computer-readable storage medium storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the method of any one of claims 1-10.
CN202110932183.0A 2021-08-13 2021-08-13 Audio processing method and device, audio model training method and device, electronic equipment and computer readable storage medium Pending CN113963686A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110932183.0A CN113963686A (en) 2021-08-13 2021-08-13 Audio processing method and device, audio model training method and device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110932183.0A CN113963686A (en) 2021-08-13 2021-08-13 Audio processing method and device, audio model training method and device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN113963686A true CN113963686A (en) 2022-01-21

Family

ID=79460551

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110932183.0A Pending CN113963686A (en) 2021-08-13 2021-08-13 Audio processing method and device, audio model training method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113963686A (en)

Similar Documents

Publication Publication Date Title
US11894014B2 (en) Audio-visual speech separation
CN111756942B (en) Communication device and method for performing echo cancellation and computer readable medium
US10872602B2 (en) Training of acoustic models for far-field vocalization processing systems
Zhao et al. Monaural speech dereverberation using temporal convolutional networks with self attention
Qian et al. Speech Enhancement Using Bayesian Wavenet.
CN111161752B (en) Echo cancellation method and device
Luo et al. Real-time single-channel dereverberation and separation with time-domain audio separation network.
JP7258182B2 (en) Speech processing method, device, electronic device and computer program
KR20200115107A (en) System and method for acoustic echo cancelation using deep multitask recurrent neural networks
EP4394761A1 (en) Audio signal processing method and apparatus, electronic device, and storage medium
US20230298593A1 (en) Method and apparatus for real-time sound enhancement
CN113228162A (en) Context-based speech synthesis
US20240177726A1 (en) Speech enhancement
Shankar et al. Efficient two-microphone speech enhancement using basic recurrent neural network cell for hearing and hearing aids
US11790930B2 (en) Method and system for dereverberation of speech signals
Abel et al. Novel two-stage audiovisual speech filtering in noisy environments
US20220254332A1 (en) Method and apparatus for normalizing features extracted from audio data for signal recognition or modification
Chen et al. CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application
Leutnant et al. Bayesian feature enhancement for reverberation and noise robust speech recognition
WO2023287782A1 (en) Data augmentation for speech enhancement
CN113963686A (en) Audio processing method and device, audio model training method and device, electronic equipment and computer readable storage medium
Marti et al. Automatic speech recognition in cocktail-party situations: A specific training for separated speech
CN113516995B (en) Sound processing method and device
US20240055012A1 (en) Method and System for Reverberation Modeling of Speech Signals
US20230306980A1 (en) Method and System for Audio Signal Enhancement with Reduced Latency

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination