WO2023061258A1 - Audio processing method and apparatus, storage medium and computer program - Google Patents

Audio processing method and apparatus, storage medium and computer program Download PDF

Info

Publication number
WO2023061258A1
WO2023061258A1 · PCT/CN2022/123819 · CN2022123819W
Authority
WO
WIPO (PCT)
Prior art keywords
audio
target
reverberation
tested
time
Prior art date
Application number
PCT/CN2022/123819
Other languages
French (fr)
Chinese (zh)
Inventor
Wang Ziteng (王子腾)
Na Yueyue (纳跃跃)
Liu Zhang (刘章)
Tian Biao (田彪)
Fu Qiang (付强)
Original Assignee
Alibaba DAMO Academy (Hangzhou) Technology Co., Ltd. (阿里巴巴达摩院(杭州)科技有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba DAMO Academy (Hangzhou) Technology Co., Ltd.
Publication of WO2023061258A1 publication Critical patent/WO2023061258A1/en

Links

Images

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 — Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 — Noise filtering
    • G10L2021/02082 — Noise filtering, the noise being echo or reverberation of the speech

Definitions

  • the present application relates to the technical field of audio processing, and in particular, to an audio processing method, device, storage medium and computer program.
  • Reverberation is an acoustic phenomenon in which the sound continues to exist after the sound source in the space stops.
  • the existence of reverberation reduces the clarity of the speech collected by audio collection equipment, impairing the intelligibility of the collected speech.
  • Embodiments of the present application provide an audio processing method, device, storage medium, and computer program to at least solve the technical problem of low clarity of audio collected by a sound pickup device due to the existence of reverberation in a space.
  • an audio processing method, including: obtaining the feature vector of the audio to be tested; inputting the feature vector of the audio to be tested into the target model for processing to obtain target time-frequency masking information, wherein the target model is used to determine the time-frequency masking information corresponding to reverberation audio, the time-frequency masking information is used to process the reverberation audio into target type audio, and the target type audio includes the direct sound and early reflections of the sound source corresponding to the reverberation audio; and processing the audio to be tested according to the target time-frequency masking information to obtain the target audio.
  • another audio processing method, including: the cloud server receives the audio to be tested; the cloud server obtains the feature vector of the audio to be tested, uses the target model to process the feature vector to obtain the target time-frequency masking information, and processes the audio to be tested according to the target time-frequency masking information to obtain the target audio, where the target model is used to determine the time-frequency masking information corresponding to reverberation audio, the time-frequency masking information is used to process the reverberation audio into target type audio, and the target type audio includes the direct sound and early reflections of the sound source corresponding to the reverberation audio; and the cloud server returns the target audio to the client.
  • another audio processing method, including: collecting the audio to be tested and playing the audio to be tested on the audio player; and playing the target audio corresponding to the audio to be tested on the audio player, where the target audio is the audio obtained by processing the audio to be tested with the target time-frequency masking information, the target time-frequency masking information is the information obtained by processing the feature vector of the audio to be tested with the target model, and the target model is used to determine the time-frequency masking information corresponding to reverberation audio.
  • another audio processing method, including: collecting the audio generated in a teaching space through at least two collectors to obtain the first audio; obtaining the feature vector of the first audio and inputting it into the target model for processing to obtain the target time-frequency masking information, wherein the target model is used to determine the time-frequency masking information corresponding to reverberation audio, the time-frequency masking information is used to process the reverberation audio into target type audio, and the target type audio contains the direct sound and early reflections of the sound source corresponding to the reverberation audio; processing the first audio according to the target time-frequency masking information to obtain the second audio; and sending the second audio to the remote classroom corresponding to the teaching space.
  • an audio processing device, including: a first acquisition unit, configured to acquire the feature vector of the audio to be tested; a first processing unit, configured to input the feature vector of the audio to be tested into the target model for processing to obtain the target time-frequency masking information, wherein the target model is used to determine the time-frequency masking information corresponding to reverberation audio, the time-frequency masking information is used to process the reverberation audio into target type audio, and the target type audio includes the direct sound and early reflections of the sound source corresponding to the reverberation audio; and a second processing unit, configured to process the audio to be tested according to the target time-frequency masking information to obtain the target audio.
  • the storage medium includes a stored program, wherein when the program is running, the device where the storage medium is located is controlled to execute any one of the above audio processing methods.
  • a computer program which is characterized in that, when the computer program is executed by a processor, any one of the above audio processing methods is implemented.
  • the feature vector of the audio to be tested is input into the target model to obtain the target time-frequency masking information, wherein the target model is used to determine the time-frequency masking information corresponding to the reverberation audio; the time-frequency masking information is used to process the reverberation audio into the target type audio, which contains the direct sound and early reflections of the sound source corresponding to the reverberation audio; and the audio to be tested is processed according to the target time-frequency masking information to obtain the target audio.
  • the audio to be tested is processed by the target model to obtain the target time-frequency masking information, and the audio to be tested is then processed using the target time-frequency masking information to obtain the target audio, which achieves the purpose of suppressing the reverberation in the audio to be tested, thereby improving the clarity of the audio collected by the pickup device.
  • FIG. 1 is a block diagram of a hardware structure of a computer terminal according to an embodiment of the present application
  • FIG. 2 is a flowchart of an audio processing method provided according to Embodiment 1 of the present application.
  • Fig. 3 is a schematic diagram of the magnitude of the room impulse response according to the embodiment of the present application.
  • FIG. 4 is a schematic diagram of a signal of a room impulse response according to an embodiment of the present application.
  • FIG. 5 is a flowchart of an audio processing method provided according to Embodiment 2 of the present application.
  • FIG. 6 is a flowchart of an audio processing method provided according to Embodiment 3 of the present application.
  • FIG. 7 is a flowchart of an audio processing method provided according to Embodiment 4 of the present application.
  • FIG. 8 is a flowchart of an audio processing method provided according to Embodiment 5 of the present application.
  • FIG. 9 is a schematic diagram of an audio processing device provided according to Embodiment 6 of the present application.
  • Fig. 10 is a structural block diagram of an optional computer terminal according to an embodiment of the present application.
  • existing reverberation suppression algorithms based on deep learning models use the direct sound as the training and recovery target. Because the degree of reverberation suppression is not smooth enough over time, the processed audio exhibits obvious energy fluctuations and sounds unnatural. In addition, current deep-learning reverberation suppression algorithms consider only a single pickup device and do not fully utilize the complementary information of multiple pickup devices, so the reverberation suppression effect is poor.
  • Reverberation is the acoustic phenomenon in which sound continues after the sound source stops producing sound.
  • RIR (Room Impulse Response): the room impulse response can be used to describe the reverberation characteristics of a room. An RIR contains three parts that are consecutive in time: the direct sound, the early reflections, and the late reverberation.
  • DFSMN (Deep Feedforward Sequential Memory Network), a deep feedforward memory network, is a neural network model structure.
  • an embodiment of an audio processing method is provided. It should be noted that the steps shown in the flowcharts of the accompanying drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that, although the flowcharts show a logical order, in some cases the steps shown or described may be performed in an order different from that shown or described herein.
  • FIG. 1 shows a block diagram of a hardware structure of a computer terminal (or mobile device) for implementing an audio processing method.
  • the computer terminal 10 may include one or more processors 102 (shown as 102a, 102b, ..., 102n in the figure; the processors 102 may include, but are not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission device for communication functions.
  • FIG. 1 is only a schematic diagram, and it does not limit the structure of the above-mentioned electronic device.
  • computer terminal 10 may also include more or fewer components than shown in FIG. 1 , or have a different configuration than that shown in FIG. 1 .
  • the one or more processors 102 and/or other data processing circuits described above may generally be referred to herein as "data processing circuits".
  • the data processing circuit may be implemented in whole or in part as software, hardware, firmware or other arbitrary combinations.
  • the data processing circuit can be a single independent processing module, or be fully or partially integrated into any of the other elements in the computer terminal 10 (or mobile device).
  • the data processing circuit acts as a kind of processor control (for example, selecting the terminal path of a variable resistor connected to an interface).
  • the memory 104 can be used to store the software programs and modules of application software, such as the program instructions/data storage device corresponding to the audio processing method in the embodiment of the present application. The processor 102 executes various functional applications and data processing by running the software programs and modules stored in the memory 104, that is, implements the audio processing method of the above-mentioned application programs.
  • the memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • the memory 104 may further include a memory that is remotely located relative to the processor 102 , and these remote memories may be connected to the computer terminal 10 through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the transmission device 106 is used to receive or transmit data via a network.
  • the specific example of the above-mentioned network may include a wireless network provided by the communication provider of the computer terminal 10 .
  • the transmission device 106 includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices through a base station so as to communicate with the Internet.
  • the transmission device 106 may be a radio frequency (Radio Frequency, RF) module, which is used to communicate with the Internet in a wireless manner.
  • the display may be, for example, a touchscreen liquid crystal display (LCD), which may enable a user to interact with the user interface of the computer terminal 10 (or mobile device).
  • FIG. 2 is a flowchart of an audio processing method according to Embodiment 1 of the present application.
  • the audio to be tested can be the audio obtained by the pickup collecting the sound from the sound source.
  • the pickup and the sound source are in the same target space, and the target space can be a room. Because of the reverberation in the room, the audio to be tested is reverberant audio.
  • the feature vector of the audio to be tested can be extracted first.
  • obtaining the feature vector of the audio to be tested includes: performing a Fourier transform on the audio to be tested to obtain the frequency domain information of the audio to be tested, and obtaining the feature vector of the audio to be tested from the frequency domain information.
  • the audio to be tested can be converted from the time domain to the frequency domain by a short-time Fourier transform (STFT) to obtain the time spectrum corresponding to the audio to be tested, where the time spectrum represents the frequency domain information; a frequency-domain feature vector is then obtained from the time spectrum. For example, a filter bank feature can be obtained from the time spectrum, and Mel-Frequency Cepstral Coefficients (MFCC) or Gammatone filter bank feature vectors can also be obtained; the embodiment of the present application does not limit the specific type of feature vector obtained.
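As an illustrative sketch (not the patent's implementation; the function names, FFT size, and hop length are our own assumptions), the STFT-based feature extraction described above might look like:

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Naive short-time Fourier transform: frame, window, then FFT each frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)      # time spectrum: (frames, n_fft//2 + 1)

def log_magnitude_features(x, eps=1e-8):
    """One possible frequency-domain feature vector: the log-magnitude spectrum."""
    return np.log(np.abs(stft(x)) + eps)

audio = np.random.randn(16000)              # 1 s of noise standing in for real audio
feats = log_magnitude_features(audio)
print(feats.shape)                          # one feature vector per frame
```

Filter-bank, MFCC, or Gammatone features would be derived from the same time spectrum by further projections.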
  • the target model can be a neural network model
  • the structure of the neural network model can include: a first linear transformation layer, followed by 6 or 9 layers of DFSMN (Deep Feedforward Sequential Memory Network), where each DFSMN layer has an activation function (specifically, ReLU or Sigmoid); a linear transformation layer after the DFSMN layers; and a Sigmoid nonlinear activation function at the output layer.
  • DFSMN can be replaced by model units such as LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit), or a combination of LSTM and GRU; this embodiment does not limit the neural network unit type.
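A minimal, untrained numpy sketch of such a mask-estimation network follows. The memory block is heavily simplified and the weights are random; this only illustrates the data flow (linear layer, DFSMN-style layers with skip connections over past frames, Sigmoid output), not the patent's actual model.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -60.0, 60.0)))

def dfsmn_layer(h, w, v, mem_taps):
    """One heavily simplified DFSMN-style layer: a linear projection plus a
    'memory' term that mixes in weighted copies of past frames (skip connection)."""
    p = h @ w                               # per-frame linear projection
    mem = np.zeros_like(p)
    for k, a_k in enumerate(mem_taps, start=1):
        mem[k:] += a_k * p[:-k]             # look back k frames
    return relu((p + mem) @ v)

rng = np.random.default_rng(0)
T, F, H = 61, 257, 128                      # frames, frequency bins, hidden width
feats = rng.standard_normal((T, F))         # stand-in input feature vectors

h = relu(feats @ (rng.standard_normal((F, H)) * 0.1))   # first linear layer
for _ in range(6):                                      # 6 DFSMN-style layers
    h = dfsmn_layer(h,
                    rng.standard_normal((H, H)) * 0.1,
                    rng.standard_normal((H, H)) * 0.1,
                    mem_taps=[0.5, 0.25])
mask = sigmoid(h @ (rng.standard_normal((H, F)) * 0.1))  # Sigmoid output layer
print(mask.shape)
```

The Sigmoid output keeps every mask value in [0, 1], which is what makes it usable as a time-frequency mask.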
  • the target model can be trained from multiple sets of reverberation audio and their corresponding time-frequency masking information, so that the model outputs the target time-frequency masking information when given the feature vector corresponding to the reverberation audio to be tested. Since the time-frequency masking information can be used to suppress the reverberation features in the reverberation audio, the reverberation audio to be tested is processed using the target time-frequency masking information to obtain the target audio.
  • the target type audio in the embodiment of the present application includes the direct sound and early reflections of the sound source corresponding to the reverberation audio; specifically, it can be the direct sound together with approximately the first 100 ms of early reflections.
  • the time-frequency masking information corresponding to the reverberation audio is determined from the target type audio and the reverberation audio. Therefore, when the time-frequency masking information is used to suppress reverberation in the reverberation audio to be tested, the early reflections can be retained while the mid-term reflections and late reverberation are suppressed, so that the resulting target audio is smoother and more natural.
  • S23: Process the audio to be tested according to the target time-frequency masking information to obtain the target audio.
  • the audio to be tested is masked with the target time-frequency masking information, removing the mid-term reflections and late reverberation in the audio to be tested to obtain the target audio, which contains only the direct sound and the early reflections.
  • processing the audio to be tested according to the target time-frequency masking information to obtain the target audio includes: processing the audio to be tested by using the target time-frequency masking information to obtain the target frequency domain information, and performing inverse Fourier transform on the target frequency domain information to obtain the target audio.
  • the time spectrum of the audio to be tested is obtained as the frequency domain information; the time spectrum is processed using the target time-frequency masking information, and the processed time spectrum is then converted from the frequency domain back to the time domain by an inverse Fourier transform to obtain the time domain information, that is, the target audio that the user can listen to and recognize.
  • when the Fourier transform used to acquire the feature vector of the audio to be tested is the STFT, the inverse Fourier transform performed on the target frequency domain information may be the iSTFT.
  • using the target time-frequency masking information to process the audio to be tested to obtain the target frequency domain information includes: multiplying the target time-frequency masking information with the time spectrum information corresponding to the audio to be tested to obtain the target frequency domain information.
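A sketch of this mask-then-inverse-transform step (the STFT/iSTFT pair is hand-rolled for self-containment, and a random mask stands in for the model output; all parameter values are our own assumptions):

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    w = np.hanning(n_fft)
    n = 1 + (len(x) - n_fft) // hop
    return np.fft.rfft(np.stack([x[i*hop:i*hop+n_fft] * w for i in range(n)]), axis=1)

def istft(spec, n_fft=512, hop=256):
    """Inverse STFT by overlap-add, normalized by the summed analysis windows."""
    w = np.hanning(n_fft)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    out = np.zeros((spec.shape[0] - 1) * hop + n_fft)
    wsum = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i*hop:i*hop + n_fft] += frame
        wsum[i*hop:i*hop + n_fft] += w
    nz = wsum > 1e-8
    out[nz] /= wsum[nz]
    return out

rng = np.random.default_rng(1)
audio = rng.standard_normal(16000)              # stand-in audio to be tested
spec = stft(audio)                              # time spectrum (frequency domain)
mask = rng.uniform(0.0, 1.0, spec.shape)        # stand-in for the model's output mask

target = istft(mask * spec)                     # element-wise product, then iSTFT
passthrough = istft(np.ones_like(mask) * spec)  # an all-ones mask changes nothing
print(np.allclose(passthrough[1:-1], audio[1:len(passthrough) - 1]))
```

The all-ones-mask check illustrates that the mask multiplication alone decides what is kept: values near 1 pass a time-frequency cell through, values near 0 suppress it.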
  • the feature vector of the audio to be tested is input into the target model to obtain the target time-frequency masking information, wherein the target model is used to determine the time-frequency masking information corresponding to the reverberation audio; the time-frequency masking information is used to process the reverberation audio into the target type audio, which contains the direct sound and early reflections of the sound source corresponding to the reverberation audio; and the audio to be tested is processed according to the target time-frequency masking information to obtain the target audio.
  • the audio to be tested is processed by the target model to obtain the target time-frequency masking information, and the audio to be tested is then processed using the target time-frequency masking information to obtain the target audio, which achieves the purpose of suppressing the reverberation in the audio to be tested, thereby improving the clarity of the audio collected by the pickup device.
  • the audio to be tested is the audio collected from the sound source by at least two collectors in the target space.
  • the target model is used to determine the time-frequency masking information corresponding to at least two reverberation audio signals of the same sound source, and obtaining the feature vector of the audio to be tested includes: separately calculating the feature vector of the audio collected by each collector in the target space to obtain at least two feature vectors; and splicing the at least two feature vectors to generate the feature vector of the audio to be tested.
  • multiple feature vectors of the multiple reverberation audio signals collected by multiple collectors from one sound source are concatenated, and the concatenated feature vector and the corresponding time-frequency mask are used as sample data.
  • the feature vectors of the multi-channel reverberation audio collected by multiple collectors in the target space are concatenated as the input of the target model, and the target time-frequency mask is obtained after model processing.
  • the target time-frequency mask can be used to suppress the reverberation features in the reverberation audio captured by the multiple collectors.
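The splicing itself is a plain concatenation along the feature axis; the sizes here are hypothetical:

```python
import numpy as np

# Hypothetical per-channel features: T frames x F bins from each of M pickups.
T, F, M = 61, 257, 4
feats_per_channel = [np.random.randn(T, F) for _ in range(M)]

# Splice (concatenate) along the feature axis so each frame carries the
# complementary information from all M collectors.
joint_feats = np.concatenate(feats_per_channel, axis=1)
print(joint_feats.shape)    # one row per frame, M * F feature columns
```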
  • before the feature vector of the audio to be tested is input into the target model for processing to obtain the target time-frequency masking information, the method also includes: separately obtaining the room impulse response characteristics corresponding to sound sources in different spaces, and obtaining the direct sound and early reflections in the room impulse response characteristics; determining the reverberation audio of each sound source, and determining the target type audio corresponding to the sound source according to the speech emitted by the sound source, the direct sound, and the early reflections; determining the time-frequency masking information corresponding to the reverberation audio according to the reverberation audio of each sound source and the target type audio; determining each reverberation audio and the time-frequency masking information corresponding to that reverberation audio as a set of sample data to obtain multiple sets of sample data; and training a preset neural network model with the multiple sets of sample data to generate the target model.
  • 16 kHz / 48 kHz sampled voice data can be prepared first; a room of random size is simulated, a sound source (which emits the prepared sampled voice data) and multiple receivers (R1, R2, ..., RM) are placed at random positions in the room, and the RIR data of the sampled voice data in the room is generated.
  • the IMAGE method can be used to generate the RIR data.
  • the preset deep network model is a deep neural network model that can include linear transformation units, DFSMN and ReLU (Rectified Linear Unit) units, and a Sigmoid nonlinear activation function.
  • sample data is generated in batches. Specifically, the reverberation audio (x1, x2, ..., xM) is obtained by convolving the sampled speech data with the RIR, and the target audio (s1, s2, ..., sM) is obtained by convolving the sampled speech data with the truncated RIR; specifically, the direct sound plus the first 50 ms or 100 ms of early reflections can be selected.
  • the time-frequency mask corresponding to the reverberation audio, which is the expected time-frequency mask of the reverberation audio, is calculated from the reverberation audio and the target audio; for example, a phase-sensitive mask or a complex ratio mask.
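A hedged sketch of this sample-generation step. A synthetic decaying-noise RIR stands in for an image-method simulation, and a simple magnitude-ratio mask stands in for the phase-sensitive or complex ratio mask; all constants are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
fs = 16000
speech = rng.standard_normal(fs)                 # stand-in for clean sampled speech

# Stand-in RIR: exponentially decaying noise with a direct-sound peak at t=0;
# a real RIR would come from the image-method simulation described above.
rir = rng.standard_normal(fs // 4) * np.exp(-np.arange(fs // 4) / (0.05 * fs))
rir[0] = 1.0

early_len = int(0.05 * fs)                       # direct sound + first 50 ms
reverb = np.convolve(speech, rir)                # reverberation audio (model input)
target = np.convolve(speech, rir[:early_len])    # target type audio (direct + early)

def spec(x, n_fft=512, hop=256):
    w = np.hanning(n_fft)
    n = 1 + (len(x) - n_fft) // hop
    return np.fft.rfft(np.stack([x[i*hop:i*hop+n_fft] * w for i in range(n)]), axis=1)

# Magnitude-ratio mask as a simple stand-in; a phase-sensitive or complex
# ratio mask would be derived from the same pair of spectra.
S_rev, S_tgt = spec(reverb[:fs]), spec(target[:fs])
mask = np.clip(np.abs(S_tgt) / (np.abs(S_rev) + 1e-8), 0.0, 1.0)
print(mask.shape)
```

The pair (features of `reverb`, `mask`) is what one set of sample data would look like.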
  • the reverberation audio corresponding to the same sound source can be multi-channel reverberation audio, and the feature vector of each channel of reverberation audio can be extracted; specifically, it can be the frequency domain feature after the Fourier transform. The extracted feature vectors are spliced to obtain the common feature vector of the multiple channels of reverberation audio, and the common feature vector is paired with the expected time-frequency mask of the multiple channels of reverberation audio.
  • the training of the model is carried out.
  • the expected time-frequency mask in the sample data is used as the optimization target, and the loss function is calculated.
  • the mean square error between the output time-frequency mask and the expected time-frequency mask is calculated, and the gradient backpropagation algorithm is used to adjust the model parameters. The above process is repeated until the loss function of the model on the validation set no longer decreases significantly, indicating that the model has converged, and the target model is determined from the converged model parameters.
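The training criterion can be sketched with a single linear layer in place of the full network (the weights, sizes, and learning rate are arbitrary choices of ours; this only illustrates MSE-on-mask optimization with a stop-when-the-loss-plateaus check, not the patent's training setup):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
T, F = 61, 257
X = rng.standard_normal((T, F))            # spliced input feature vectors
Y = rng.uniform(0.0, 1.0, (T, F))          # expected time-frequency mask (sample data)

W = rng.standard_normal((F, F)) * 0.01     # single linear layer as a stand-in model
lr = 0.5
losses = []
for step in range(300):
    P = sigmoid(X @ W)                     # model's output time-frequency mask
    losses.append(np.mean((P - Y) ** 2))   # mean square error loss
    # Gradient of the MSE loss through the sigmoid, then a descent step
    # (a stand-in for the gradient backpropagation described above).
    grad = X.T @ ((P - Y) * P * (1.0 - P)) * (2.0 / (T * F))
    W -= lr * grad
    # Stop when the loss no longer decreases significantly (convergence).
    if step > 0 and losses[-2] - losses[-1] < 1e-9:
        break
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

In practice the convergence check would be run on a held-out validation set, as the text states, rather than on the training loss.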
  • the trained model can have an obvious reverberation suppression effect, thereby effectively improving the perceived audio quality.
  • the reverberation audio data collected by multiple devices is simulated, and the model is trained on a mixture of simulated data and a small amount of actual data. While reducing sampling costs, the trained model can improve the listening quality of long-distance pickup in real spaces.
  • the method further includes: adding noise information to the reverberation audio of the sound source to obtain the processed reverberation audio; according to the reverberation audio of each sound source and the target type audio Determining the time-frequency masking information corresponding to the reverberation audio includes: determining the time-frequency masking information corresponding to the reverberation audio according to the processed reverberation audio and the target type audio.
  • the sampled speech in the speech library (speech containing neither noise nor reverberation) is combined with the reverberation features in the reverberation feature library and the noise in the noise library to obtain the reverberation audio, and the sampled speech is combined with the early reflections in the reverberation features to obtain the target audio. The time-frequency mask is then calculated from the reverberation audio and the target audio; this time-frequency mask can suppress reverberation and noise at the same time.
  • the sample data is obtained by combining the time-frequency mask and the reverberation audio.
  • the trained model can process audio containing both noise and reverberation features to obtain the corresponding time-frequency mask, and the audio to be tested is processed with the obtained time-frequency mask, achieving the effect of suppressing reverberation and noise simultaneously.
  • processing the audio to be tested according to the target time-frequency masking information to obtain the target audio includes: smoothing the target time-frequency masking information and using the smoothed target time-frequency masking information to process the audio to be tested to obtain the target audio; or using the target time-frequency masking information to process the audio to be tested to obtain the processed audio and smoothing the processed audio to obtain the target audio.
  • the smoothing may be smoothing in the time dimension; temporal smoothing may be applied to the mask output by the model or to the spectrum after the mask is applied, so as to make the obtained target audio smoother and more natural.
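One common form of temporal smoothing is a first-order recursive average along the frame axis; the smoothing constant here is an arbitrary choice of ours, not a value from the patent:

```python
import numpy as np

def smooth_mask(mask, alpha=0.8):
    """First-order recursive (exponential) smoothing along the time axis,
    one common way to suppress frame-to-frame mask fluctuations."""
    out = np.empty_like(mask)
    out[0] = mask[0]
    for t in range(1, len(mask)):
        out[t] = alpha * out[t - 1] + (1 - alpha) * mask[t]
    return out

rng = np.random.default_rng(4)
mask = rng.uniform(0.0, 1.0, (61, 257))    # stand-in for the model's raw mask
smoothed = smooth_mask(mask)

# The smoothed mask varies less between adjacent frames than the raw one,
# which is what reduces audible energy fluctuations in the output.
raw_var = np.mean(np.abs(np.diff(mask, axis=0)))
smo_var = np.mean(np.abs(np.diff(smoothed, axis=0)))
print(smo_var < raw_var)
```

The same recursion could equally be applied to the masked spectrum instead of the mask, matching the second option in the text.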
  • an audio processing method is also provided, as shown in FIG. 5, the method includes:
  • the observation signal and the target speech are each subjected to the STFT to obtain their feature vectors; the feature vectors of the observation signal and of the target speech constitute the training set data, and the model is trained on this training set data. The resulting model can process the feature vector of the audio to be tested, which may contain both reverberation and noise, and the iSTFT is performed on the processed data to obtain the prediction signal, which contains no noise, mid-term reflections, or late reverberation. That is, the method of this embodiment achieves the effect of simultaneously suppressing reverberation and noise in the audio to be tested, greatly improving the perceived audio quality.
  • an audio processing method is also provided, as shown in FIG. 6, the method includes:
  • the cloud server receives audio to be tested.
  • the audio to be tested can be the audio obtained by the pickup collecting the sound from the sound source.
  • the pickup and the sound source are in the same target space, and the target space can be a room. Because of the reverberation in the room, the audio to be tested is reverberant audio.
  • the cloud server obtains the feature vector of the audio to be tested, uses the target model to process the feature vector of the audio to be tested to obtain the target time-frequency masking information, and processes the audio to be tested according to the target time-frequency masking information to obtain the target audio, wherein the target model is used to determine the time-frequency masking information corresponding to reverberation audio, the time-frequency masking information is used to process the reverberation audio into target type audio, and the target type audio includes the direct sound and early reflections of the sound source corresponding to the reverberation audio.
  • the audio to be tested can be converted from the time domain to the frequency domain to obtain the time spectrum corresponding to the audio to be tested.
  • the time spectrum represents the frequency domain information; the frequency-domain feature vector is then obtained from the time spectrum, and the frequency-domain feature vector is input to the target model for processing.
  • the target model can be obtained by training on multiple groups of reverberation audio and their corresponding time-frequency masking information, so that the model outputs the target time-frequency masking information when given the feature vector corresponding to the reverberation audio to be tested. Since the time-frequency masking information can be used to suppress the reverberation features in the reverberation audio, the reverberation audio to be tested is processed using the target time-frequency masking information to obtain the target audio.
  • the target type audio in the embodiment of the present application includes the direct sound and early reflections of the sound source corresponding to the reverberation audio; the time-frequency masking information corresponding to the reverberation audio is determined from the target type audio and the reverberation audio. When the time-frequency masking information is used to perform reverberation suppression on the reverberation audio to be tested, the early reflections can be preserved while the mid-term reflections and late reverberation are suppressed.
  • the cloud server returns the target audio to the client.
  • the early reflections are preserved in the target audio, and the mid-term reflections and late reverberation are suppressed, so that the user's listening experience is smooth and natural.
  • an audio processing method is also provided, as shown in FIG. 7, the method includes:
  • S71: Collect the audio to be tested, and play the audio to be tested on the audio player.
  • the audio to be tested can be the audio obtained by the pickup collecting the sound from the sound source.
  • the pickup and the sound source are in the same target space, and the target space can be a room. Due to the reverberation present in the room, the audio to be tested is reverberant; therefore, the user cannot get a clear sense of hearing when the audio to be tested is played on the audio player.
  • the target audio is the audio obtained after the audio to be tested is processed with the target time-frequency masking information, and the target time-frequency masking information is obtained by processing the feature vector of the audio to be tested with the target model; the target model is used to determine the time-frequency masking information corresponding to the reverberation audio.
  • the target model can be obtained by training on multiple groups of reverberation audio and their corresponding time-frequency masking information, so that when the feature vector corresponding to the reverberation audio to be tested is input, the target type audio can be obtained.
  • the target type audio contains the direct sound and early reflections of the sound source corresponding to the reverberation audio, and the time-frequency masking information corresponding to the reverberation audio is determined from the target type audio and the reverberation audio; when the time-frequency masking information is used to suppress reverberation in the audio to be tested, the early reflections can be preserved while the mid-term reflections and late reverberation are suppressed.
  • the audio player plays the target audio corresponding to the audio to be tested; since the early reflections are preserved in the target audio and the mid-term reflections and late reverberation are suppressed, the user's listening experience is smooth and natural.
  • the method according to the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and contains several instructions to enable a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to execute the methods described in the various embodiments of the present application.
  • an audio processing method is also provided, as shown in FIG. 8, the method includes:
  • S81 Use at least two collectors to collect audio generated in the teaching space to obtain a first audio.
  • the teaching space can be an offline classroom linked to a remote classroom, with two or more collectors distributed in the classroom; the first audio, that is, the audio generated in the teaching space, may be the sound produced when the teacher lectures or the students answer in the classroom, or the sound produced by the multimedia equipment in the teaching space. Due to the reverberation present in the room, the first audio is reverberant audio.
  • the first audio can be converted from the time domain to the frequency domain to obtain the time spectrum corresponding to the first audio; the time spectrum is used to represent the frequency domain information, the frequency domain feature vector is then obtained from the time spectrum, and the frequency domain feature vector is input to the target model for processing.
  • the target model can be obtained by training on multiple groups of reverberation audio and their corresponding time-frequency masking information, so that the model outputs the target time-frequency masking information when the feature vector corresponding to the first audio is input; because the time-frequency masking information can be used to suppress the reverberation features in the reverberation audio, the target time-frequency masking information is used to process the first audio to obtain the second audio.
  • the target type audio in the embodiment of the present application includes the direct sound and early reflections of the sound source corresponding to the reverberation audio; the time-frequency masking information corresponding to the reverberation audio is determined from the target type audio and the reverberation audio, so when the time-frequency masking information is used to suppress reverberation in the first audio, the resulting second audio retains the early reflections while the mid-term reflections and late reverberation are suppressed.
  • the second audio is sent to the remote classroom corresponding to the teaching space, and the second audio is played in the remote classroom. Since the early reflections are retained in the second audio while the mid-term reflections and late reverberation are suppressed, playing the second audio rather than the first audio improves the intelligibility of the audio content for the students in the remote classroom.
  • a device for implementing the above audio processing method is also provided, as shown in FIG. 9 , the device includes:
  • the first obtaining unit 91 is configured to obtain a feature vector of the audio to be tested.
  • the first processing unit 92 is configured to input the feature vector of the audio to be tested into the target model for processing to obtain target time-frequency masking information, wherein the target model is used to determine the time-frequency masking information corresponding to reverberation audio, the time-frequency masking information is used to process the reverberation audio into target type audio, and the target type audio includes the direct sound and early reflections of the sound source corresponding to the reverberation audio.
  • the second processing unit 93 is configured to process the audio to be tested according to the target time-frequency masking information to obtain the target audio.
  • the first acquisition unit 91, the first processing unit 92, and the second processing unit 93 correspond to step S21, step S22, and step S23 in Embodiment 1; the examples and application scenarios implemented by the three units and the corresponding steps are the same, but are not limited to the content disclosed in Embodiment 1 above. It should be noted that, as part of the device, the above modules can run in the computer terminal 10 provided in the first embodiment.
  • the audio to be tested is the audio obtained by at least two collectors in the target space collecting the sound from the sound source, and the target model is used to determine the time-frequency masking information corresponding to at least two reverberation audios of the same sound source.
  • the first acquisition unit 91 includes: a calculation module, configured to separately calculate the feature vector of the audio collected by each collector in the target space to obtain at least two feature vectors; and a concatenation module, configured to concatenate the at least two feature vectors to generate the feature vector of the audio to be tested.
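As a rough sketch of the calculation and concatenation just described, the following computes a log-magnitude time spectrum per collector and splices the per-channel feature vectors frame by frame. The FFT size, hop length, and log-magnitude features are illustrative assumptions rather than values fixed by the application.

```python
import numpy as np

def channel_features(x, n_fft=256, hop=64):
    """Log-magnitude time spectrum of one collector's signal."""
    win = np.hanning(n_fft)
    frames = np.array([x[i:i + n_fft] * win
                       for i in range(0, len(x) - n_fft + 1, hop)])
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-8)

def concat_features(channels):
    """One feature vector per frame, formed by concatenating all channels."""
    return np.concatenate([channel_features(c) for c in channels], axis=1)
```

For two collectors, each frame's feature vector is then twice as long as a single-channel one, giving the model a joint view of both pickups.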
  • the device further includes: a second acquisition unit, configured to, before the feature vector of the audio to be tested is input into the target model for processing to obtain the target time-frequency masking information, respectively obtain the room impulse response characteristics corresponding to sound sources in different spaces, and obtain the direct sound in the room impulse response characteristics; a first determination unit, configured to determine the reverberation audio corresponding to each sound source according to the voice emitted by the sound source and the corresponding room impulse response characteristics, and to determine the target type audio corresponding to the sound source according to the voice emitted by the sound source and the early reflections; a second determination unit, configured to determine the time-frequency masking information corresponding to the reverberation audio according to the reverberation audio of each sound source and the target type audio; a third determination unit, configured to determine each reverberation audio and the time-frequency masking information corresponding to the reverberation audio as a set of sample data to obtain multiple sets of sample data; and a model generation unit, configured to generate the target model through training on the multiple sets of sample data.
  • the device further includes: an audio processing unit, configured to, before the time-frequency masking information corresponding to the reverberation audio is determined according to the reverberation audio of each sound source and the target type audio, add noise information to the reverberation audio of the sound source to obtain processed reverberation audio; and a fourth determination unit, where determining the time-frequency masking information corresponding to the reverberation audio according to the reverberation audio of each sound source and the target type audio includes: a fifth determination unit, configured to determine the time-frequency masking information corresponding to the reverberation audio according to the processed reverberation audio and the target type audio.
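The sample-data construction these units describe can be sketched as below: the dry speech convolved with the full room impulse response gives the reverberation audio, the same speech convolved with only the direct-sound and early-reflection portion of the response gives the target type audio, and a ratio-style mask relates the two. The 50 ms early-reflection boundary, the STFT settings, and the ratio-mask definition are illustrative assumptions; noise addition, as performed by the audio processing unit above, would be applied to `reverb` before the mask is computed.

```python
import numpy as np

def make_training_pair(dry, rir, sr=16000, early_ms=50, n_fft=512, hop=128):
    """Build one (reverberation audio, time-frequency mask) sample from a
    dry utterance and a room impulse response."""
    cut = int(sr * early_ms / 1000)
    reverb = np.convolve(dry, rir)            # full reverberation audio
    target = np.convolve(dry, rir[:cut])      # direct sound + early reflections
    target = np.pad(target, (0, len(reverb) - len(target)))

    def mag_spec(x):
        win = np.hanning(n_fft)
        frames = np.array([x[i:i + n_fft] * win
                           for i in range(0, len(x) - n_fft + 1, hop)])
        return np.abs(np.fft.rfft(frames, axis=1))

    # Ratio mask: how much of each time-frequency bin to keep.
    mask = np.clip(mag_spec(target) / (mag_spec(reverb) + 1e-8), 0.0, 1.0)
    return reverb, mask
```

Each `(reverb, mask)` pair forms one set of sample data; repeating this over many utterances and many measured room impulse responses yields the multiple sets used to train the target model.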
  • the first acquisition unit 91 includes: a first processing module, configured to perform a Fourier transform on the audio to be tested to obtain the frequency domain information of the audio to be tested, and to obtain the feature vector of the audio to be tested from the frequency domain information.
  • the second processing module, which is used to process the audio to be tested according to the target time-frequency masking information to obtain the target audio, includes: a third processing module, configured to process the audio to be tested by using the target time-frequency masking information to obtain the target frequency domain information, and to perform an inverse Fourier transform on the target frequency domain information to obtain the target audio.
  • the third processing module is further configured to multiply the target time-frequency masking information by the time-frequency spectrum information corresponding to the audio to be tested to obtain the target frequency domain information.
  • the second processing module includes: a first processing submodule, configured to perform smoothing processing on the target time-frequency masking information and to process the audio to be tested with the smoothed target time-frequency masking information to obtain the target audio; or a second processing submodule, configured to process the audio to be tested with the target time-frequency masking information to obtain processed audio, and to smooth the processed audio to obtain the target audio.
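The smoothing option in the first processing submodule can be illustrated with a simple moving average of the mask along the time axis; the three-frame window here is an arbitrary illustrative choice, not a value fixed by the application.

```python
import numpy as np

def smooth_mask(mask, n=3):
    """Average each mask value with its n-frame neighbourhood along the
    time axis so the per-frame gain does not jump abruptly."""
    kernel = np.ones(n) / n
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, mask)
```

Applying `smooth_mask(mask)` before multiplying with the time spectrum corresponds to the first submodule; the second submodule would instead smooth the masked audio after re-synthesis.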
  • Embodiments of the present application may provide a computer terminal, and the computer terminal may be any computer terminal device in a group of computer terminals.
  • the foregoing computer terminal may also be replaced with a terminal device such as a mobile terminal.
  • the foregoing computer terminal may be located in at least one network device among multiple network devices of the computer network.
  • the above-mentioned computer terminal can execute the program code of the following steps in the audio processing method of the application: obtain the feature vector of the audio to be tested; input the feature vector of the audio to be tested into the target model for processing to obtain the target time-frequency masking information, wherein the target model is used to determine the time-frequency masking information corresponding to reverberation audio, the time-frequency masking information is used to process the reverberation audio into target type audio, and the target type audio includes the direct sound and early reflections of the sound source corresponding to the reverberation audio; and process the audio to be tested according to the target time-frequency masking information to obtain the target audio.
  • FIG. 10 is a structural block diagram of a computer terminal according to an embodiment of the present application.
  • the computer terminal A may include: one or more (only one is shown in the figure) processors, memory, and transmission means.
  • the memory can be used to store software programs and modules, such as the program instructions/modules corresponding to the audio processing method and device in the embodiment of the present application; by running the software programs and modules stored in the memory, the processor executes various functional applications and data processing, that is, realizes the above-mentioned audio processing method.
  • the memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • the memory may further include memory located remotely relative to the processor, and these remote memories may be connected to terminal A through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the processor can call the information and the application program stored in the memory through the transmission device to perform the following steps: obtain the feature vector of the audio to be tested; input the feature vector of the audio to be tested into the target model for processing to obtain the target time-frequency masking information, wherein the target model is used to determine the time-frequency masking information corresponding to reverberation audio, the time-frequency masking information is used to process the reverberation audio into target type audio, and the target type audio contains the direct sound and early reflections of the sound source corresponding to the reverberation audio; and process the audio to be tested according to the target time-frequency masking information to obtain the target audio.
  • a computer terminal is provided, configured to: obtain the feature vector of the audio to be tested; input the feature vector into the target model for processing to obtain the target time-frequency masking information, wherein the target model is used to determine the time-frequency masking information corresponding to reverberation audio, the information is used to process the reverberation audio into target type audio, and the target type audio includes the direct sound and early reflections of the sound source corresponding to the reverberation audio; and process the audio to be tested according to the target time-frequency masking information to obtain the target audio. Through these steps, the purpose of suppressing the reverberation in the audio to be tested is achieved, thereby achieving the technical effect of improving the clarity of the audio collected by the pickup device, and in turn solving the technical problem that the reverberation phenomenon in the space causes the audio collected by the pickup device to have low clarity.
  • FIG. 10 does not limit the structure of the above-mentioned electronic device.
  • the computer terminal 10 may also include more or fewer components (eg, network interface, display device, etc.) than those shown in FIG. 10 , or have a configuration different from that shown in FIG. 10 .
  • the embodiment of the present application also provides a storage medium.
  • the foregoing storage medium may be used to store program code for executing the audio processing method provided in Embodiment 1 above.
  • the above-mentioned storage medium may be located in any computer terminal in the group of computer terminals in the computer network, or in any mobile terminal in the group of mobile terminals.
  • the storage medium is configured to store program codes for performing the following steps: obtaining the feature vector of the audio to be tested; inputting the feature vector of the audio to be tested into the target model for processing to obtain the target time-frequency masking information, wherein the target model is used to determine the time-frequency masking information corresponding to reverberation audio, the time-frequency masking information is used to process the reverberation audio into target type audio, and the target type audio includes the direct sound and early reflections of the sound source corresponding to the reverberation audio; and processing the audio to be tested according to the target time-frequency masking information to obtain the target audio.
  • the disclosed technical content can be realized in other ways.
  • the device embodiments described above are only illustrative. For example, the division of the units is only a logical function division, and there may be other division methods in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of units or modules may be in electrical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • if the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.

Abstract

The present application discloses an audio processing method and apparatus, a storage medium and a computer program. The method comprises: obtaining a feature vector of an audio to be tested; inputting the feature vector of the audio to be tested into a target model for processing to obtain target time-frequency masking information, wherein the target model is used to determine time-frequency masking information corresponding to reverberation audio, the time-frequency masking information is used to process the reverberation audio into a target type of audio, and the target type of audio comprises a direct sound and early reflection sound of a sound source corresponding to the reverberation audio; and according to the target time-frequency masking information, processing the audio to be tested to obtain a target audio. The present application solves the technical problem that the clarity of an audio collected by a sound pickup device is low due to the existence of reverberation in a space.

Description

Audio processing method, device, storage medium and computer program
This disclosure claims priority to the Chinese patent application with application number CN 202111194926.5, titled "Audio Processing Method, Device, Storage Medium, and Computer Program", filed with the China Patent Office on October 14, 2021, the entire contents of which are incorporated into this disclosure by reference.
Technical Field
The present application relates to the technical field of audio processing, and in particular, to an audio processing method, device, storage medium and computer program.
Background Art
Reverberation is an acoustic phenomenon in which sound persists after the sound source in a space has stopped producing it. The existence of reverberation lowers the clarity of the speech collected by audio collection equipment and affects the intelligibility of the collected speech.
In a larger space, in order to collect the sound produced in every area of the space, two or more sound pickup devices need to be used together to pick up the audio generated in the space. However, because the space is large, the sound collected by the pickup devices has a very noticeable sense of reverberation, which reduces the intelligibility of the collected audio content.
For the above problems, no effective solution has been proposed yet.
Summary of the Invention
Embodiments of the present application provide an audio processing method, device, storage medium, and computer program, to at least solve the technical problem that the reverberation phenomenon in a space causes the audio collected by a sound pickup device to have low clarity.
According to one aspect of the embodiments of the present application, an audio processing method is provided, including: obtaining the feature vector of the audio to be tested; inputting the feature vector of the audio to be tested into the target model for processing to obtain target time-frequency masking information, wherein the target model is used to determine the time-frequency masking information corresponding to reverberation audio, the time-frequency masking information is used to process the reverberation audio into target type audio, and the target type audio includes the direct sound and early reflections of the sound source corresponding to the reverberation audio; and processing the audio to be tested according to the target time-frequency masking information to obtain the target audio.
According to another aspect of the embodiments of the present application, another audio processing method is provided, including: a cloud server receives the audio to be tested; the cloud server obtains the feature vector of the audio to be tested, uses the target model to process the feature vector of the audio to be tested to obtain target time-frequency masking information, and processes the audio to be tested according to the target time-frequency masking information to obtain the target audio, wherein the target model is used to determine the time-frequency masking information corresponding to reverberation audio, the time-frequency masking information is used to process the reverberation audio into target type audio, and the target type audio includes the direct sound and early reflections of the sound source corresponding to the reverberation audio; and the cloud server returns the target audio to the client.
According to another aspect of the embodiments of the present application, another audio processing method is provided, including: collecting the audio to be tested and playing the audio to be tested on an audio player; and playing, on the audio player, the target audio corresponding to the audio to be tested, wherein the target audio is the audio obtained by processing the audio to be tested with the target time-frequency masking information, the target time-frequency masking information is obtained by processing the feature vector of the audio to be tested with the target model, and the target model is used to determine the time-frequency masking information corresponding to reverberation audio.
According to another aspect of the embodiments of the present application, another audio processing method is provided, including: collecting the audio generated in a teaching space through at least two collectors to obtain a first audio; obtaining the feature vector of the first audio and inputting the feature vector of the first audio into the target model for processing to obtain target time-frequency masking information, wherein the target model is used to determine the time-frequency masking information corresponding to reverberation audio, the time-frequency masking information is used to process the reverberation audio into target type audio, and the target type audio includes the direct sound and early reflections of the sound source corresponding to the reverberation audio; processing the first audio according to the target time-frequency masking information to obtain a second audio; and sending the second audio to the remote classroom corresponding to the teaching space.
According to another aspect of the embodiments of the present application, an audio processing device is also provided, including: a first acquisition unit, configured to acquire the feature vector of the audio to be tested; a first processing unit, configured to input the feature vector of the audio to be tested into the target model for processing to obtain target time-frequency masking information, wherein the target model is used to determine the time-frequency masking information corresponding to reverberation audio, the time-frequency masking information is used to process the reverberation audio into target type audio, and the target type audio includes the direct sound and early reflections of the sound source corresponding to the reverberation audio; and a second processing unit, configured to process the audio to be tested according to the target time-frequency masking information to obtain the target audio.
According to another aspect of the embodiments of the present application, a storage medium is also provided. The storage medium includes a stored program, wherein, when the program runs, the device where the storage medium is located is controlled to execute any one of the above audio processing methods.
According to another aspect of the embodiments of the present application, a computer program is also provided, wherein any one of the above audio processing methods is implemented when the computer program is executed by a processor.
In the embodiments of the present application, the feature vector of the audio to be tested is obtained; the feature vector of the audio to be tested is input into the target model for processing to obtain target time-frequency masking information, wherein the target model is used to determine the time-frequency masking information corresponding to reverberation audio, the time-frequency masking information is used to process the reverberation audio into target type audio, and the target type audio includes the direct sound and early reflections of the sound source corresponding to the reverberation audio; and the audio to be tested is processed according to the target time-frequency masking information to obtain the target audio. Processing the audio to be tested through the target model to obtain the target time-frequency masking information, and then processing the audio to be tested with the target time-frequency masking information to obtain the target audio, achieves the purpose of suppressing the reverberation in the audio to be tested, thereby achieving the technical effect of improving the clarity of the audio collected by the pickup equipment, and in turn solving the technical problem that the reverberation phenomenon in the space causes the audio collected by the pickup equipment to have low clarity.
Brief Description of the Drawings
The drawings described here are used to provide further understanding of the application and constitute a part of the application. The schematic embodiments of the application and their descriptions are used to explain the application and do not constitute an improper limitation of the application. In the drawings:
FIG. 1 is a block diagram of the hardware structure of a computer terminal according to an embodiment of the present application;
FIG. 2 is a flowchart of an audio processing method provided according to Embodiment 1 of the present application;
FIG. 3 is a schematic diagram of the magnitude of a room impulse response according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a room impulse response signal according to an embodiment of the present application;
FIG. 5 is a flowchart of an audio processing method provided according to Embodiment 2 of the present application;
FIG. 6 is a flowchart of an audio processing method provided according to Embodiment 3 of the present application;
FIG. 7 is a flowchart of an audio processing method provided according to Embodiment 4 of the present application;
FIG. 8 is a flowchart of an audio processing method provided according to Embodiment 5 of the present application;
FIG. 9 is a schematic diagram of an audio processing device provided according to Embodiment 6 of the present application;
FIG. 10 is a structural block diagram of an optional computer terminal according to an embodiment of the present application.
Detailed Description
To enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the scope of protection of the present application.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present application described here can be implemented in orders other than those illustrated or described. Furthermore, the terms "comprising" and "having" and any variations thereof are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or device.
To solve the technical problem in the related art that reverberation in a space degrades the clarity of the audio captured by a pickup device, the following approaches have appeared in the related art:
1. Reverberation suppression based on signal processing. Specifically, in a single-channel pickup scenario, the late reverberation energy is estimated with a pre-assumed statistical reverberation model to compute a Wiener gain, and the captured audio is then dereverberated according to the Wiener gain; the suppression effect is not obvious. In multi-channel scenarios the WPE method (Weighted Prediction Error for speech dereverberation) is used, but when few microphones are available the perceptual improvement after processing is not obvious.
2. Reverberation suppression based on deep learning models, which uses the direct sound as the training and recovery target. Because the degree of reverberation suppression is not sufficiently smooth over time, the processed audio exhibits noticeable energy fluctuations and sounds unnatural. In addition, current deep-learning reverberation-suppression algorithms consider only a single pickup device and do not make full use of the complementary information of multiple pickup devices, so the suppression effect is poor.
Based on this, the present application intends to provide a solution to the above technical problems, the details of which are set forth in the following embodiments.
First, some of the terms appearing in the description of the embodiments of the present application are explained as follows:
Reverberation: the acoustic phenomenon in which sound persists after the sound source stops emitting.
RIR (Room Impulse Response): describes the reverberation characteristics of a room. An RIR consists of three temporally consecutive parts: the direct sound, the early reflections, and the late reverberation.
DFSMN (Deep Feedforward Sequential Memory Network): a neural network model structure.
Embodiment 1
According to an embodiment of the present application, an embodiment of an audio processing method is provided. It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system such as a set of computer-executable instructions, and although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one here.
The method embodiment provided in Embodiment 1 of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Fig. 1 shows a block diagram of the hardware structure of a computer terminal (or mobile device) for implementing the audio processing method. As shown in Fig. 1, the computer terminal 10 (or mobile device 10) may include one or more processors 102 (shown as 102a, 102b, ..., 102n in the figure; the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory 104 for storing data, and a transmission device for communication functions. In addition, it may also include: a display, an input/output interface (I/O interface), a universal serial bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power supply, and/or a camera. A person of ordinary skill in the art can understand that the structure shown in Fig. 1 is only schematic and does not limit the structure of the above electronic device. For example, the computer terminal 10 may include more or fewer components than shown in Fig. 1, or have a configuration different from that shown in Fig. 1.
It should be noted that the one or more processors 102 and/or other data processing circuits described above may generally be referred to herein as a "data processing circuit". The data processing circuit may be embodied in whole or in part as software, hardware, firmware, or any combination thereof. In addition, the data processing circuit may be a single independent processing module, or be combined in whole or in part into any of the other elements in the computer terminal 10 (or mobile device). As involved in the embodiments of the present application, the data processing circuit acts as a kind of processor control (for example, selection of a variable-resistance terminal path connected to an interface).
The memory 104 may be used to store software programs and modules of application software, such as the program instructions/data storage device corresponding to the audio processing method in the embodiments of the present application. By running the software programs and modules stored in the memory 104, the processor 102 executes various functional applications and data processing, that is, implements the audio processing method of the above application program. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memories remotely located relative to the processor 102, and these remote memories may be connected to the computer terminal 10 through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or send data via a network. A specific example of the above network may include a wireless network provided by the communication provider of the computer terminal 10. In one example, the transmission device 106 includes a network interface controller (NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In one example, the transmission device 106 may be a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.
The display may be, for example, a touch-screen liquid crystal display (LCD) that enables the user to interact with the user interface of the computer terminal 10 (or mobile device).
Under the above operating environment, the present application provides the audio processing method shown in Fig. 2. Fig. 2 is a flowchart of the audio processing method according to Embodiment 1 of the present application.
S21: Acquire a feature vector of the audio to be tested.
Specifically, the audio to be tested may be audio obtained by a pickup capturing the sound emitted by a sound source, where the pickup and the sound source are located in the same target space, and the target space may be a room. Because of the reverberation in the room, the audio to be tested is reverberant audio.
To facilitate processing of the audio to be tested, its feature vector may be extracted first. Optionally, in the audio processing method of the embodiment of the present application, acquiring the feature vector of the audio to be tested includes: performing a Fourier transform on the audio to be tested to obtain frequency-domain information of the audio to be tested, and obtaining the feature vector of the audio to be tested from the frequency-domain information.
Specifically, the audio to be tested may first be converted from the time domain to the frequency domain by a short-time Fourier transform (STFT) to obtain the time-frequency spectrum corresponding to the audio to be tested, the time-frequency spectrum representing the frequency-domain information; frequency-domain feature vectors are then obtained from the time-frequency spectrum. For example, filter-bank features can be obtained from the time-frequency spectrum, and feature vectors such as Mel-frequency cepstral coefficients (MFCC) or Gammatone filter-bank features can also be obtained; the embodiment of the present application does not limit the specific type of the obtained feature vector.
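The windowed-frame analysis behind the STFT described above can be sketched in pure Python. This is an illustrative toy, not the implementation of the present application: the tiny frame length (8 samples), hop size, Hann window, and the 100 Hz test tone are all assumed values chosen so the direct DFT stays readable; a real system would use an FFT with frames of hundreds of samples.

```python
import cmath
import math

def stft_mag(signal, frame_len=8, hop=4):
    """Naive STFT: split the signal into overlapping frames, apply a Hann
    window, and take the magnitude of a direct DFT of each frame.
    Returns a [time x frequency] magnitude matrix."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * (0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1)))
                 for n in range(frame_len)]
        # Keep DFT bins 0..frame_len//2 (real input -> non-negative frequencies)
        spec = []
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            spec.append(abs(acc))
        frames.append(spec)
    return frames

# A 100 Hz tone sampled at 800 Hz: with 8-sample frames the bin spacing is
# 100 Hz, so the energy concentrates in bin 1 of every frame.
tone = [math.sin(2 * math.pi * 100 * t / 800) for t in range(32)]
mags = stft_mag(tone)
```

The magnitude matrix `mags` plays the role of the time-frequency spectrum from which filter-bank or MFCC features would subsequently be computed.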
S22: Input the feature vector of the audio to be tested into the target model for processing to obtain target time-frequency masking information, where the target model is used to determine the time-frequency masking information corresponding to reverberant audio, the time-frequency masking information is used to process the reverberant audio into target-type audio, and the target-type audio contains the direct sound and early reflections of the sound source corresponding to the reverberant audio.
It should be noted that the target model may be a neural network model. The structure of the neural network model may include: a first linear transformation layer, followed by 6 or 9 layers of DFSMN (Deep Feedforward Sequential Memory Network), where each DFSMN layer has an activation function, specifically ReLU or Sigmoid; after the DFSMN layers there is another linear transformation layer, and the output layer has a Sigmoid nonlinear activation function. In addition, the DFSMN can be replaced with model units such as LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit), or a combination of LSTM and GRU; this embodiment does not limit the type of neural network unit.
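The distinguishing component of a DFSMN layer is its memory block, which augments each hidden frame with weighted past (and optionally future) frames. The sketch below illustrates only that idea, under simplifying assumptions not taken from the present application: scalar tap weights are used instead of per-dimension tap vectors, and strides and skip connections between blocks are omitted.

```python
def fsmn_memory_block(hidden, look_back, look_ahead):
    """One FSMN-style memory block (scalar taps for brevity): each output
    frame is the current hidden frame plus weighted past/future frames.
    look_back / look_ahead are lists of tap weights; frames outside the
    sequence are treated as zeros."""
    T = len(hidden)
    out = []
    for t in range(T):
        frame = list(hidden[t])
        for i, a in enumerate(look_back, start=1):   # past context
            if t - i >= 0:
                frame = [f + a * h for f, h in zip(frame, hidden[t - i])]
        for j, c in enumerate(look_ahead, start=1):  # future context
            if t + j < T:
                frame = [f + c * h for f, h in zip(frame, hidden[t + j])]
        out.append(frame)
    return out

seq = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]           # 3 frames x 2 dims
smoothed = fsmn_memory_block(seq, look_back=[0.5], look_ahead=[0.25])
```

In a full layer this memory output would pass through a linear transformation and a ReLU or Sigmoid activation, as described above.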
Specifically, the target model may be obtained by training on multiple sets of reverberant audio and their corresponding time-frequency masking information, so that when the feature vector corresponding to the reverberant audio to be tested is input, the model outputs the target time-frequency masking information. Since the time-frequency masking information can be used to suppress the reverberation components of the reverberant audio, processing the reverberant audio to be tested with the target time-frequency masking information yields the target audio.
It should be noted that the reverberation characteristics of a room can be described by the room impulse response (RIR). Fig. 3 is a schematic diagram of the amplitude of a room impulse response according to an embodiment of the present application, and Fig. 4 is a schematic diagram of the signal of a room impulse response according to an embodiment of the present application. As shown in Fig. 3 and Fig. 4, an RIR consists of three temporally consecutive parts: the direct sound, the early reflections, and the late reverberation.
The target-type audio in the embodiments of the present application contains the direct sound and early reflections of the sound source corresponding to the reverberant audio, specifically the direct sound plus roughly the first 100 ms of early reflections. When acquiring sample data, the time-frequency masking information corresponding to the reverberant audio is determined from the target-type audio and the reverberant audio. Therefore, when the time-frequency masking information is used to suppress reverberation in the reverberant audio to be tested, the early reflections can be retained while the mid-term reflections and late reverberation are suppressed, making the resulting target audio smoother and more natural.
S23: Process the audio to be tested according to the target time-frequency masking information to obtain the target audio.
Specifically, the audio to be tested is masked with the target time-frequency masking information to remove the mid-term reflections and late reverberation, yielding target audio that contains only the direct sound and early reflections.
Since what is input into the target model is the feature vector of the audio to be tested, which is a frequency-domain feature, the time-frequency masking information output by the target model is frequency-domain information. Optionally, in the audio processing method of the embodiment of the present application, processing the audio to be tested according to the target time-frequency masking information to obtain the target audio includes: processing the audio to be tested with the target time-frequency masking information to obtain target frequency-domain information, and performing an inverse Fourier transform on the target frequency-domain information to obtain the target audio.
That is, the time-frequency spectrum of the audio to be tested is obtained as the frequency-domain information; the time-frequency spectrum is processed with the target time-frequency masking information; and the processed time-frequency spectrum is converted from the frequency domain back to the time domain by an inverse Fourier transform, yielding time-domain information, that is, target audio a user can listen to and recognize. For example, if the Fourier transform performed on the audio to be tested when acquiring its feature vector is an STFT, the inverse Fourier transform performed on the target frequency-domain information may be an iSTFT.
Optionally, in the audio processing method of the embodiment of the present application, processing the audio to be tested with the target time-frequency masking information to obtain the target frequency-domain information includes: multiplying the target time-frequency masking information by the time-frequency spectrum information corresponding to the audio to be tested to obtain the target frequency-domain information.
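The multiplication described above is an element-wise product over time-frequency bins. A minimal sketch, with a toy 2x2 complex spectrum and a made-up real-valued mask standing in for the model output:

```python
def apply_mask(mask, spec):
    """Element-wise multiply a [time x freq] real-valued mask with the
    complex STFT spectrum of the audio under test."""
    return [[m * s for m, s in zip(mrow, srow)]
            for mrow, srow in zip(mask, spec)]

spec = [[4 + 0j, 2 - 2j], [1 + 1j, 0 + 3j]]   # toy complex time-frequency spectrum
mask = [[0.5, 1.0], [0.0, 0.5]]               # 1.0 keeps a bin, 0.0 removes it
masked = apply_mask(mask, spec)
```

The masked spectrum `masked` is the target frequency-domain information, which an iSTFT would then convert back to a time-domain waveform.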
In the embodiments of the present application, a feature vector of the audio to be tested is acquired; the feature vector of the audio to be tested is input into a target model for processing to obtain target time-frequency masking information, where the target model is used to determine the time-frequency masking information corresponding to reverberant audio, the time-frequency masking information is used to process the reverberant audio into target-type audio, and the target-type audio contains the direct sound and early reflections of the sound source corresponding to the reverberant audio; the audio to be tested is then processed according to the target time-frequency masking information to obtain the target audio. Processing the audio to be tested with the target model yields the target time-frequency masking information, and processing the audio to be tested with that information yields the target audio, which achieves the purpose of suppressing the reverberation in the audio to be tested, thereby realizing the technical effect of improving the clarity of the audio captured by a sound-pickup device and solving the technical problem that reverberation in a space degrades the clarity of the audio captured by a sound-pickup device.
There are cases where the space contains at least two collectors. To improve the reverberation-suppression effect, optionally, in the audio processing method of the embodiment of the present application, the audio to be tested is audio obtained by at least two collectors in the target space capturing the sound source, and the target model is used to determine the time-frequency masking information corresponding to at least two reverberant audio signals of the same sound source. Acquiring the feature vector of the audio to be tested includes: separately computing the feature vector of the audio captured by each collector in the target space to obtain at least two feature vectors; and concatenating the at least two feature vectors to generate the feature vector of the audio to be tested.
Specifically, in the training phase of the target model, the multiple feature vectors of the multi-channel reverberant audio of one sound source captured by multiple collectors are concatenated, and the concatenated feature vector and the corresponding time-frequency mask are used as sample data. Likewise, in the test phase, the feature vectors of the multi-channel reverberant audio captured by the multiple collectors in the target space are concatenated as the input of the target model, and the target time-frequency mask is obtained after model processing; this target time-frequency mask can be used to suppress the reverberation components in the reverberant audio captured by the multiple collectors.
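The concatenation step above can be sketched as a frame-wise join of per-channel features; the two-microphone, two-feature toy data below is invented for illustration:

```python
def concat_features(per_channel_feats):
    """Frame-wise concatenation of features from M pickups: each output
    frame is the features of all channels for that frame, joined end to end."""
    n_frames = len(per_channel_feats[0])
    assert all(len(ch) == n_frames for ch in per_channel_feats)
    return [[v for ch in per_channel_feats for v in ch[t]]
            for t in range(n_frames)]

mic1 = [[0.1, 0.2], [0.3, 0.4]]   # 2 frames x 2 features per channel
mic2 = [[0.5, 0.6], [0.7, 0.8]]
joint = concat_features([mic1, mic2])
```

The joint vector gives the model access to the complementary information of all pickups in a single input, which is the motivation stated above for the multi-collector setup.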
Before the target model is used, it must be trained. Optionally, in the audio processing method of the embodiment of the present application, before the feature vector of the audio to be tested is input into the target model for processing to obtain the target time-frequency masking information, the method further includes: separately acquiring the room-impulse-response characteristics corresponding to sound sources in different spaces and obtaining the direct sound in the room-impulse-response characteristics; determining the reverberant audio corresponding to each sound source from the speech emitted by the sound source and the corresponding room impulse response, and determining the target-type audio corresponding to the sound source from the speech emitted by the sound source and the early reflections; determining the time-frequency masking information corresponding to the reverberant audio from the reverberant audio and the target-type audio of each sound source; determining each reverberant audio and its corresponding time-frequency masking information as one set of sample data to obtain multiple sets of sample data; and training a preset neural network model with the multiple sets of sample data to generate the target model.
In an optional embodiment, 16k/48k sampled speech data may first be prepared; a room of random size is simulated; a sound source (emitting the prepared sampled speech data) and multiple receivers (R1, R2, ..., RM) are placed at random positions in the room; and the RIR data of the sampled speech data in the room is generated. Specifically, the IMAGE method may be used to generate the RIR data.
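The reverberant observations described in the next steps are produced by convolving the dry speech with the RIR. The sketch below shows that convolution only; the IMAGE-method simulation itself is omitted and replaced by an assumed toy RIR (a unit direct-sound tap followed by exponentially decaying random reflections), which is not a physically simulated response:

```python
import math
import random

def convolve(signal, rir):
    """Reverberant audio = dry speech convolved with the room impulse
    response (full convolution, length len(signal) + len(rir) - 1)."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out

random.seed(0)
# Stand-in for an image-method RIR: direct tap then decaying reflections.
rir = [1.0] + [random.uniform(-1, 1) * math.exp(-0.5 * k) for k in range(1, 8)]
dry = [1.0, 0.0, -1.0, 0.0]   # toy dry "speech"
wet = convolve(dry, rir)
```

Repeating this per receiver position yields the multi-channel observations (x1, x2, ..., xM) used below.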
Then, the parameters of a preset deep network model are configured and initialized according to actual requirements; the preset deep network model is a deep neural network model that may contain linear transformation units, DFSMN and ReLU (Rectified Linear Unit) units, and a Sigmoid nonlinear activation function.
Meanwhile, sample data are generated in batches. Specifically, the reverberant audio (x1, x2, ..., xM) is obtained by convolving the sampled speech data with the RIRs; the target audio (s1, s2, ..., sM) is obtained by convolving the same sampled speech data with the early-reflection part of the RIRs, where the early-reflection part may be taken as the direct sound plus the subsequent 50 ms or 100 ms of early reflections. The time-frequency mask corresponding to the reverberant audio, that is, the expected time-frequency mask, is then computed from the reverberant audio and the target audio, for example a phase sensitive mask or a complex ratio mask. It should be noted that the reverberant audio corresponding to the same sound source may be multi-channel reverberant audio; the feature vector of each channel, specifically the frequency-domain features after the Fourier transform, is extracted, and the extracted feature vectors are concatenated to obtain a joint feature vector of the multi-channel reverberant audio, and the joint feature vector together with the expected time-frequency mask forms one sample.
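For the phase sensitive mask named above, the commonly used per-bin definition is |S|/|X| * cos(angle(X) - angle(S)), clipped to a bounded range so it can match a Sigmoid output; the clip range and the toy spectra below are illustrative assumptions, not values specified in the present application:

```python
import cmath
import math

def phase_sensitive_mask(target_spec, mixture_spec, clip=(0.0, 1.0)):
    """Expected phase-sensitive mask per time-frequency bin, from the
    complex STFT bins of the early-reflection target (S) and the
    reverberant mixture (X)."""
    lo, hi = clip
    mask = []
    for s, x in zip(target_spec, mixture_spec):
        if abs(x) == 0.0:
            m = lo
        else:
            m = (abs(s) / abs(x)) * math.cos(cmath.phase(x) - cmath.phase(s))
        mask.append(min(hi, max(lo, m)))
    return mask

target = [1 + 0j, 0.5 + 0.5j, 0j]     # early-reflection target bins
mixture = [2 + 0j, 0.5 + 0.5j, 1 + 1j]  # reverberant mixture bins
psm = phase_sensitive_mask(target, mixture)
```

Each mask value, paired with the joint feature vector of the mixture, forms the expected output of one training sample.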
Further, after the sample data are obtained, the model is trained. In this embodiment, the expected time-frequency mask in the sample data is taken as the optimization target and the loss function is computed, specifically the mean squared error between the output time-frequency mask and the expected time-frequency mask; the model parameters are adjusted with the gradient back-propagation algorithm, and the above process is repeated until the loss function of the model on the validation set no longer decreases significantly, indicating that the model has converged. The target model is then determined from the converged model parameters.
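The MSE objective and gradient step above can be illustrated on a deliberately tiny stand-in for the network: a single-weight linear "model" predicting the mask from a feature, with the gradient written out by hand. The features, expected mask, and learning rate are all invented for the example:

```python
def mse(pred, target):
    """Mean squared error between the output mask and the expected mask,
    the training objective described above."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

feats = [1.0, 2.0, 0.5]
expect = [0.4, 0.8, 0.2]      # expected mask (here exactly 0.4 * feats)
w, lr = 0.0, 0.1              # one-weight model: mask_hat = w * feat
for _ in range(50):           # gradient descent on the MSE loss
    grad = sum(2 * (w * f - t) * f for f, t in zip(feats, expect)) / len(feats)
    w -= lr * grad
```

After the loop the loss has essentially stopped decreasing, the analogue of the validation-loss stopping criterion; a real run back-propagates through all DFSMN layers instead of a single weight.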
Through this embodiment, on the one hand, selecting the early reflections rather than the direct sound as the target of model training and recovery guarantees the naturalness and clarity of the resulting target audio. On the other hand, because the model contains deep-learning network units, the trained model has an obvious reverberation-suppression effect, thereby effectively improving the perceived audio quality. Furthermore, the reverberant audio captured by multiple pickup devices is simulated in a simulated spatial environment, and the model is trained on the simulated mixed data plus a small amount of real data; this reduces sampling cost while the trained model can still improve the perceived quality of long-distance pickup in real spaces.
Besides the reverberation component, the environment also contains a noise component. To improve the reverberation-suppression effect of the model, optionally, in the audio processing method of the embodiment of the present application, before the time-frequency masking information corresponding to the reverberant audio is determined from the reverberant audio and the target-type audio of each sound source, the method further includes: adding noise information to the reverberant audio of the sound source to obtain processed reverberant audio. Determining the time-frequency masking information corresponding to the reverberant audio from the reverberant audio and the target-type audio of each sound source then includes: determining the time-frequency masking information corresponding to the reverberant audio from the processed reverberant audio and the target-type audio.
In an optional implementation, when acquiring the sample data, the sampled speech in a speech library (speech containing neither noise nor reverberation) is combined with the reverberation characteristics in a reverberation-feature library and the noise in a noise library to obtain the reverberant audio, and the sampled speech is combined with the early reflections in the reverberation characteristics to obtain the target audio. The time-frequency mask, which can suppress reverberation and noise simultaneously, is then computed from the reverberant audio and the target audio, and the sample data is obtained by combining this time-frequency mask with the reverberant audio. A model trained with the resulting sample data can process audio containing both noise and reverberation characteristics, produce the corresponding time-frequency mask, and process the audio under test with that mask, achieving simultaneous suppression of reverberation and noise.
To make the resulting target audio sound more natural, optionally, in the audio processing method of the embodiments of the present application, processing the audio to be tested according to the target time-frequency masking information to obtain the target audio includes: smoothing the target time-frequency masking information and processing the audio to be tested with the smoothed masking information to obtain the target audio; or processing the audio to be tested with the target time-frequency masking information to obtain processed audio, and then smoothing the processed audio to obtain the target audio.
Specifically, the smoothing may be performed along the time dimension: either the mask output by the model or the spectrum obtained after the mask is applied may be smoothed over time, which makes the resulting target audio smoother and more natural.
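Temporal smoothing of the mask can be sketched with a first-order recursive filter along the frame axis. The smoothing coefficient `alpha` is a hypothetical choice; the patent only states that temporal smoothing is applied, not the specific method.

```python
import numpy as np

def smooth_mask(mask, alpha=0.6):
    """Recursively smooth a (freq_bins, frames) mask along the time axis.

    alpha is an assumed smoothing coefficient in [0, 1); larger values
    give heavier smoothing and fewer frame-to-frame jumps.
    """
    smoothed = np.empty_like(mask, dtype=float)
    smoothed[:, 0] = mask[:, 0]
    for t in range(1, mask.shape[1]):
        smoothed[:, t] = alpha * smoothed[:, t - 1] + (1 - alpha) * mask[:, t]
    return smoothed

# A rapidly fluctuating one-bin mask becomes a gradual transition after smoothing,
# which avoids audible gating artifacts in the reconstructed audio.
m = np.array([[0.0, 1.0, 0.0, 1.0]])
s = smooth_mask(m)
```

The same filter could equally be applied to the masked spectrum instead of the mask itself, matching the second alternative described above.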
Embodiment 2
According to an embodiment of the present application, an audio processing method is further provided. As shown in FIG. 5, the method includes:
When acquiring sample data, sampled speech from the speech library (speech containing neither noise nor reverberation) is combined with reverberation features from the reverberation feature library and with noise from the noise library to obtain multiple reverberant audio signals, namely observation signal 1 through observation signal M, and the sampled speech is combined with the early reflections in the reverberation features to obtain the target speech.
Further, the observation signals and the target speech are each transformed by the STFT to obtain their feature vectors; the feature vectors of the observation signals and of the target speech constitute the training-set data, and the model is trained on this training set. The trained model can then process the feature vector of the audio to be tested, which may contain both reverberation and noise; applying the iSTFT to the processed data yields a predicted signal that contains no noise, mid-term reflections, or late reverberation. That is, the method of this embodiment achieves simultaneous suppression of reverberation and noise in the audio to be tested and greatly improves the perceived audio quality.
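The STFT/iSTFT framing used in this embodiment can be illustrated with a round trip. The trained model's predicted mask is replaced here by an all-ones placeholder so that the analysis/synthesis chain itself is verifiable; the STFT parameters are assumptions, not values from the patent.

```python
import numpy as np
from scipy import signal

fs = 16000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)   # placeholder audio to be tested

# Analysis: time domain -> time-frequency domain
f, t, X = signal.stft(x, fs=fs, nperseg=512)

# A trained model would predict a mask from features of X; an all-ones
# mask stands in here, so the output should reproduce the input.
predicted_mask = np.ones(X.shape)
Y = predicted_mask * X

# Synthesis: time-frequency domain -> predicted time-domain signal
_, y = signal.istft(Y, fs=fs, nperseg=512)
y = y[:len(x)]
```

With a Hann window and 50% overlap the COLA condition holds, so the iSTFT inverts the STFT essentially exactly; any audible change in the predicted signal therefore comes from the mask alone.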
Embodiment 3
According to an embodiment of the present application, an audio processing method is further provided. As shown in FIG. 6, the method includes:
S61: a cloud server receives the audio to be tested.
Specifically, the audio to be tested may be audio captured by a pickup from a sound source, where the pickup and the sound source are located in the same target space. The target space may be a room; because of reverberation in the room, the audio to be tested is reverberant audio.
S62: the cloud server obtains the feature vector of the audio to be tested, processes the feature vector with a target model to obtain target time-frequency masking information, and processes the audio to be tested according to the target time-frequency masking information to obtain target audio, where the target model is used to determine the time-frequency masking information corresponding to reverberant audio, the time-frequency masking information is used to process the reverberant audio into target-type audio, and the target-type audio contains the direct sound and the early reflections of the sound source corresponding to the reverberant audio.
Specifically, the audio to be tested may be converted from the time domain to the frequency domain to obtain its time-frequency spectrum, which characterizes the frequency-domain information; a frequency-domain feature vector is then extracted from the time-frequency spectrum and input into the target model for processing.
The target model may be trained on multiple groups of reverberant audio and their corresponding time-frequency masking information, so that when the feature vector of the reverberant audio to be tested is input, the model outputs the target time-frequency masking information. Since the time-frequency masking information can suppress the reverberation components in reverberant audio, processing the reverberant audio to be tested with the target time-frequency masking information yields the target audio.
It should be noted that the target-type audio in the embodiments of the present application contains the direct sound and the early reflections of the sound source corresponding to the reverberant audio. The time-frequency masking information corresponding to the reverberant audio is determined from the target-type audio and the reverberant audio; when this masking information is used to suppress reverberation in the audio to be tested, the early reflections are preserved while the mid-term reflections and late reverberation are suppressed.
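The distinction between the preserved and suppressed portions of the reverberation can be made concrete by splitting a room impulse response at an early-reflection boundary. The 50 ms cut-off and the synthetic exponential response are assumptions for illustration; the patent does not specify either.

```python
import numpy as np

fs = 16000
# Synthetic decaying room impulse response (placeholder for a measured one)
rir = np.exp(-np.arange(int(0.3 * fs)) / (0.05 * fs))

cut = int(0.05 * fs)                       # assumed early/late boundary at 50 ms
early = np.zeros_like(rir)
early[:cut] = rir[:cut]                    # direct sound + early reflections: preserved
late = np.zeros_like(rir)
late[cut:] = rir[cut:]                     # mid-term reflections + late reverb: suppressed
```

A mask derived from target-type audio (speech convolved with `early` only) therefore attenuates exactly the energy contributed by `late`, rather than removing all reverberation indiscriminately.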
S63: the cloud server returns the target audio to the client.
Specifically, the early reflections are preserved in the target audio while the mid-term reflections and late reverberation are suppressed, so the user's listening experience is smooth and natural.
Embodiment 4
According to an embodiment of the present application, an audio processing method is further provided. As shown in FIG. 7, the method includes:
S71: collect the audio to be tested and play the audio to be tested on an audio player.
Specifically, the audio to be tested may be audio captured by a pickup from a sound source, where the pickup and the sound source are located in the same target space. The target space may be a room; because of reverberation in the room, the audio to be tested is reverberant audio, so when the audio to be tested is played on the audio player the user cannot obtain a clear listening experience.
S72: play on the audio player the target audio corresponding to the audio to be tested, where the target audio is audio obtained by processing the audio to be tested with target time-frequency masking information, the target time-frequency masking information is information obtained by processing the feature vector of the audio to be tested with a target model, and the target model is used to determine the time-frequency masking information corresponding to reverberant audio.
It should be noted that the target model may be trained on multiple groups of reverberant audio and their corresponding time-frequency masking information, so that inputting the feature vector corresponding to the reverberant audio to be tested yields the target-type audio. The target-type audio in the embodiments of the present application contains the direct sound and the early reflections of the sound source corresponding to the reverberant audio; the time-frequency masking information corresponding to the reverberant audio is determined from the target-type audio and the reverberant audio, and when this masking information is used to suppress reverberation in the audio to be tested, the early reflections are preserved while the mid-term reflections and late reverberation are suppressed.
When the audio player plays the target audio corresponding to the audio to be tested, the early reflections are preserved in the target audio and the mid-term reflections and late reverberation are suppressed, so the user's listening experience is smooth and natural.
It should be noted that, for brevity, the foregoing method embodiments are all described as series of action combinations. Those skilled in the art should understand, however, that the present application is not limited by the described order of actions, since according to the present application some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in this specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.
From the description of the above implementations, those skilled in the art can clearly understand that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present application.
Embodiment 5
According to an embodiment of the present application, an audio processing method is further provided. As shown in FIG. 8, the method includes:
S81: collect the audio generated in a teaching space through at least two collectors to obtain first audio.
Specifically, the teaching space may be the on-site classroom of a remote class, with two or more collectors distributed in the classroom. The first audio, that is, the audio generated in the teaching space, may be the sound produced when a teacher lectures or when students answer in class, or the sound produced by multimedia equipment in the teaching space. Because of reverberation in the room, the first audio is reverberant audio.
S82: obtain the feature vector of the first audio and input the feature vector of the first audio into a target model for processing to obtain target time-frequency masking information, where the target model is used to determine the time-frequency masking information corresponding to reverberant audio, the time-frequency masking information is used to process the reverberant audio into target-type audio, and the target-type audio contains the direct sound and the early reflections of the sound source corresponding to the reverberant audio.
Specifically, the first audio may be converted from the time domain to the frequency domain to obtain its time-frequency spectrum, which characterizes the frequency-domain information; a frequency-domain feature vector is then extracted from the time-frequency spectrum and input into the target model for processing.
The target model may be trained on multiple groups of reverberant audio and their corresponding time-frequency masking information, so that when the feature vector corresponding to the first audio is input, the model outputs the target time-frequency masking information. Since the time-frequency masking information can suppress the reverberation components in reverberant audio, processing the first audio with the target time-frequency masking information yields the second audio.
S83: process the first audio according to the target time-frequency masking information to obtain second audio.
It should be noted that the target-type audio in the embodiments of the present application contains the direct sound and the early reflections of the sound source corresponding to the reverberant audio. The time-frequency masking information corresponding to the reverberant audio is determined from the target-type audio and the reverberant audio; when this masking information is used to suppress reverberation in the first audio, the resulting second audio preserves the early reflections while the mid-term reflections and late reverberation are suppressed.
S84: send the second audio to the remote classroom corresponding to the teaching space.
Specifically, the second audio is sent to the remote classroom corresponding to the teaching space and played there. Since the second audio preserves the early reflections and suppresses the mid-term reflections and late reverberation, playing it instead of the first audio improves the intelligibility of the audio content for the students in the remote classroom.
Embodiment 6
According to an embodiment of the present application, an apparatus for implementing the above audio processing method is further provided. As shown in FIG. 9, the apparatus includes:
a first obtaining unit 91, configured to obtain the feature vector of the audio to be tested;
a first processing unit 92, configured to input the feature vector of the audio to be tested into a target model for processing to obtain target time-frequency masking information, where the target model is used to determine the time-frequency masking information corresponding to reverberant audio, the time-frequency masking information is used to process the reverberant audio into target-type audio, and the target-type audio contains the direct sound and the early reflections of the sound source corresponding to the reverberant audio; and
a second processing unit 93, configured to process the audio to be tested according to the target time-frequency masking information to obtain the target audio.
It should be noted here that the first obtaining unit 91, the first processing unit 92, and the second processing unit 93 correspond to steps S21, S22, and S23 in Embodiment 1; the examples and application scenarios implemented by these modules and their corresponding steps are the same, but are not limited to the content disclosed in Embodiment 1 above. It should be noted that, as part of the apparatus, the above modules may run in the computer terminal 10 provided in Embodiment 1.
Optionally, in the audio processing apparatus of the embodiments of the present application, the audio to be tested is audio obtained by at least two collectors in the target space collecting a sound source, and the target model is used to determine the time-frequency masking information corresponding to at least two reverberant audio signals of the same sound source. The first obtaining unit 91 includes: a computing module, configured to separately compute the feature vector of the audio collected by each collector in the target space, obtaining at least two feature vectors; and a concatenation module, configured to concatenate the at least two feature vectors to generate the feature vector of the audio to be tested.
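The per-collector feature computation and concatenation described above can be sketched as follows. The log-magnitude feature, the STFT parameters, and the choice of concatenating along the frequency axis are all illustrative assumptions; the patent specifies only that the per-collector feature vectors are concatenated.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(2)
fs = 16000
# Two collectors observing the same source (placeholder signals)
mics = [rng.standard_normal(fs) for _ in range(2)]

def feature_vector(x):
    # Assumed feature: log-magnitude spectrogram, shape (freq_bins, frames)
    _, _, Z = signal.stft(x, fs=fs, nperseg=512)
    return np.log(np.abs(Z) + 1e-8)

per_mic = [feature_vector(x) for x in mics]     # one feature map per collector
# Concatenate along the feature (frequency) axis so each frame carries
# information from every collector
stacked = np.concatenate(per_mic, axis=0)
```

Feeding the stacked features to the model lets it exploit inter-channel differences of the same source, which is why the target model here is described as handling at least two reverberant signals of one source.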
Optionally, in the audio processing apparatus of the embodiments of the present application, the apparatus further includes: a second obtaining unit, configured to, before the feature vector of the audio to be tested is input into the target model for processing to obtain the target time-frequency masking information, separately obtain the room impulse response features corresponding to sound sources in different spaces and obtain the direct sound in the room impulse response features; a first determining unit, configured to determine the reverberant audio corresponding to each sound source according to the speech emitted by the sound source and the corresponding room impulse response feature, and determine the target-type audio corresponding to the sound source according to the speech emitted by the sound source and the early reflections; a second determining unit, configured to determine the time-frequency masking information corresponding to the reverberant audio according to the reverberant audio of each sound source and the target-type audio; a third determining unit, configured to determine each reverberant audio and its corresponding time-frequency masking information as one group of sample data, obtaining multiple groups of sample data; and a model generating unit, configured to train a preset neural network model with the multiple groups of sample data to generate the target model.
Optionally, in the audio processing apparatus of the embodiments of the present application, the apparatus further includes: an audio processing unit, configured to, before the time-frequency masking information corresponding to the reverberant audio is determined according to the reverberant audio of each sound source and the target-type audio, add noise information to the reverberant audio of the sound source to obtain processed reverberant audio; and a fourth determining unit, where determining the time-frequency masking information corresponding to the reverberant audio according to the reverberant audio of each sound source and the target-type audio includes: a fifth determining unit, configured to determine the time-frequency masking information corresponding to the reverberant audio according to the processed reverberant audio and the target-type audio.
Optionally, in the audio processing apparatus of the embodiments of the present application, the first obtaining unit 91 includes: a first processing module, configured to perform a Fourier transform on the audio to be tested to obtain frequency-domain information of the audio to be tested and obtain the feature vector of the audio to be tested from the frequency-domain information; and a second processing module, where processing the audio to be tested according to the target time-frequency masking information to obtain the target audio includes: a third processing module, configured to process the audio to be tested with the target time-frequency masking information to obtain target frequency-domain information, and perform an inverse Fourier transform on the target frequency-domain information to obtain the target audio.
Optionally, in the audio processing apparatus of the embodiments of the present application, the third processing module is further configured to multiply the target time-frequency masking information by the time-frequency spectrum information corresponding to the audio to be tested to obtain the target frequency-domain information.
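The third processing module's element-wise multiplication followed by the inverse transform can be sketched as follows. The model-predicted mask is replaced by a constant half-gain mask purely so that the effect of the multiplication is observable; the STFT parameters are likewise assumptions.

```python
import numpy as np
from scipy import signal

fs = 16000
x = np.random.default_rng(3).standard_normal(fs)   # placeholder audio to be tested

# Forward transform: time-frequency spectrum information of the audio
f, t, X = signal.stft(x, fs=fs, nperseg=512)

# Stand-in for the target time-frequency masking information output by the model
mask = np.full(X.shape, 0.5)

# Element-wise multiplication gives the target frequency-domain information
Y = mask * X

# Inverse Fourier transform yields the target audio
_, y = signal.istft(Y, fs=fs, nperseg=512)
y = y[:len(x)]
```

Since the transform pair is linear, the constant 0.5 mask simply halves the waveform, confirming that the mask acts as a per-bin gain on the spectrum.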
Optionally, in the audio processing apparatus of the embodiments of the present application, the second processing module includes: a first processing submodule, configured to smooth the target time-frequency masking information and process the audio to be tested with the smoothed target time-frequency masking information to obtain the target audio; or a second processing submodule, configured to process the audio to be tested with the target time-frequency masking information to obtain processed audio, and smooth the processed audio to obtain the target audio.
Embodiment 7
An embodiment of the present application may provide a computer terminal, which may be any computer terminal device in a computer terminal group. Optionally, in this embodiment, the computer terminal may also be replaced with a terminal device such as a mobile terminal.
Optionally, in this embodiment, the computer terminal may be located in at least one of multiple network devices of a computer network.
In this embodiment, the computer terminal may execute the program code of the following steps in the audio processing method of an application program: obtaining the feature vector of the audio to be tested; inputting the feature vector of the audio to be tested into a target model for processing to obtain target time-frequency masking information, where the target model is used to determine the time-frequency masking information corresponding to reverberant audio, the time-frequency masking information is used to process the reverberant audio into target-type audio, and the target-type audio contains the direct sound and the early reflections of the sound source corresponding to the reverberant audio; and processing the audio to be tested according to the target time-frequency masking information to obtain the target audio.
Optionally, FIG. 10 is a structural block diagram of a computer terminal according to an embodiment of the present application. As shown in FIG. 10, the computer terminal A may include one or more processors (only one is shown in the figure), a memory, and a transmission apparatus.
The memory may be configured to store software programs and modules, such as the program instructions/modules corresponding to the audio processing method and apparatus in the embodiments of the present application. The processor executes various functional applications and data processing, that is, implements the above audio processing method, by running the software programs and modules stored in the memory. The memory may include a high-speed random access memory and may also include a non-volatile memory, such as one or more magnetic storage devices, a flash memory, or another non-volatile solid-state memory. In some examples, the memory may further include memories disposed remotely relative to the processor, and these remote memories may be connected to the terminal A through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor may call the information and application programs stored in the memory through the transmission apparatus to perform the following steps: obtaining the feature vector of the audio to be tested; inputting the feature vector of the audio to be tested into a target model for processing to obtain target time-frequency masking information, where the target model is used to determine the time-frequency masking information corresponding to reverberant audio, the time-frequency masking information is used to process the reverberant audio into target-type audio, and the target-type audio contains the direct sound and the early reflections of the sound source corresponding to the reverberant audio; and processing the audio to be tested according to the target time-frequency masking information to obtain the target audio.
This embodiment of the present application provides a computer terminal. By performing the steps of obtaining the feature vector of the audio to be tested; inputting the feature vector of the audio to be tested into a target model for processing to obtain target time-frequency masking information, where the target model is used to determine the time-frequency masking information corresponding to reverberant audio, the time-frequency masking information is used to process the reverberant audio into target-type audio, and the target-type audio contains the direct sound and the early reflections of the sound source corresponding to the reverberant audio; and processing the audio to be tested according to the target time-frequency masking information to obtain the target audio, the purpose of suppressing reverberation in the audio to be tested is achieved, thereby realizing the technical effect of improving the clarity of the audio collected by a pickup device and solving the technical problem that the clarity of the audio collected by the pickup device is low due to reverberation in the space.
Those of ordinary skill in the art can understand that the structure shown in FIG. 10 is merely illustrative. The computer terminal may also be a terminal device such as a smartphone (for example, an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (MID), or a PAD. FIG. 10 does not limit the structure of the above electronic apparatus. For example, the computer terminal 10 may further include more or fewer components than those shown in FIG. 10 (such as a network interface or a display apparatus), or have a configuration different from that shown in FIG. 10.
Those of ordinary skill in the art can understand that all or some of the steps in the various methods of the above embodiments may be completed by a program instructing hardware related to a terminal device. The program may be stored in a computer-readable storage medium, and the storage medium may include a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
实施例8Example 8
本申请的实施例还提供了一种存储介质。可选地,在本实施例中,上述存储介质可以用于保存上述实施例一所提供的音频处理方法所执行的程序代码。The embodiment of the present application also provides a storage medium. Optionally, in this embodiment, the foregoing storage medium may be used to store program codes executed by the audio processing method provided in Embodiment 1 above.
可选地,在本实施例中,上述存储介质可以位于计算机网络中计算机终端群中的任意一个计算机终端中,或者位于移动终端群中的任意一个移动终端中。Optionally, in this embodiment, the above-mentioned storage medium may be located in any computer terminal in the group of computer terminals in the computer network, or in any mobile terminal in the group of mobile terminals.
可选地,在本实施例中,存储介质被设置为存储用于执行以下步骤的程序代码:获取待测试音频的特征向量;将待测试音频的特征向量输入目标模型进行处理,得到目标时频掩蔽信息,其中,目标模型用于确定混响音频对应的时频掩蔽信息,时频掩蔽信息用于将 混响音频处理为目标类型音频,目标类型音频中包含混响音频对应的声源的直达声和早期反射声;根据目标时频掩蔽信息处理待测试音频,得到目标音频。Optionally, in this embodiment, the storage medium is configured to store program codes for performing the following steps: obtaining the feature vector of the audio to be tested; inputting the feature vector of the audio to be tested into the target model for processing to obtain the target time-frequency Masking information, wherein the target model is used to determine the time-frequency masking information corresponding to the reverberation audio, and the time-frequency masking information is used to process the reverberation audio into a target type audio, and the target type audio includes direct access to the sound source corresponding to the reverberation audio sound and early reflections; the audio to be tested is processed according to the time-frequency masking information of the target to obtain the target audio.
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。The serial numbers of the above embodiments of the present application are for description only, and do not represent the advantages and disadvantages of the embodiments.
在本申请的上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。In the above-mentioned embodiments of the present application, the descriptions of each embodiment have their own emphases, and for parts not described in detail in a certain embodiment, reference may be made to relevant descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and there may be other ways of division in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, units, or modules, and may be electrical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present application. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present application, and these improvements and refinements shall also be regarded as falling within the protection scope of the present application.

Claims (14)

  1. An audio processing method, comprising:
    obtaining a feature vector of audio to be tested;
    inputting the feature vector of the audio to be tested into a target model for processing to obtain target time-frequency masking information, wherein the target model is used to determine time-frequency masking information corresponding to reverberant audio, the time-frequency masking information is used to process the reverberant audio into target-type audio, and the target-type audio contains the direct sound and early reflections of the sound source corresponding to the reverberant audio; and
    processing the audio to be tested according to the target time-frequency masking information to obtain target audio.
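Purely as an illustration (not part of the claims), the masking pipeline of claim 1 can be sketched end to end. The STFT parameters, the magnitude-spectrum features, and the identity stand-in for the trained target model are all assumptions made for this sketch:

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    # Frame the signal with a Hann window and take the FFT of each frame.
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(spec, n_fft=256, hop=128):
    # Overlap-add reconstruction (window normalization omitted for brevity).
    win = np.hanning(n_fft)
    out = np.zeros(hop * (spec.shape[0] - 1) + n_fft)
    for t, frame in enumerate(np.fft.irfft(spec, n=n_fft, axis=1)):
        out[t * hop:t * hop + n_fft] += frame * win
    return out

def dereverberate(audio, model):
    spec = stft(audio)         # time-frequency representation of the audio
    feats = np.abs(spec)       # feature vector: per-frame magnitude spectrum
    mask = model(feats)        # target time-frequency masking information
    return istft(mask * spec)  # apply the mask, transform back to time domain

# Stand-in for the trained target model: passes every bin through unchanged.
identity_model = lambda feats: np.ones_like(feats)
y = dereverberate(np.random.randn(4000), identity_model)
```

In practice the stand-in model would be replaced by the trained network of claim 3, whose output mask attenuates the late-reverberation energy in each time-frequency bin.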
  2. The audio processing method according to claim 1, wherein the audio to be tested is audio obtained by at least two collectors in a target space collecting a sound source, the target model is used to determine time-frequency masking information corresponding to at least two reverberant audios of the same sound source, and the obtaining a feature vector of audio to be tested comprises:
    separately computing a feature vector of the audio collected by each collector in the target space to obtain at least two feature vectors; and
    concatenating the at least two feature vectors to generate the feature vector of the audio to be tested.
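As an illustrative sketch of the multi-collector case in claim 2 (not part of the claims), one feature vector is computed per collector and the results are concatenated into a single model input. The log-magnitude extractor and the two-channel setup are assumptions for the sketch:

```python
import numpy as np

def multi_collector_features(channels, extract):
    # One feature vector per collector, concatenated into a single
    # input vector for the target model (collector count assumed fixed).
    return np.concatenate([extract(ch) for ch in channels])

# Hypothetical feature extractor: log-magnitude spectrum of one frame.
extract = lambda ch: np.log1p(np.abs(np.fft.rfft(ch)))

mics = [np.random.randn(256) for _ in range(2)]  # two collectors
feats = multi_collector_features(mics, extract)
```

Concatenation preserves inter-channel level and phase-derived cues, which is what lets a single model exploit the spatial diversity of the collectors.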
  3. The audio processing method according to claim 1 or 2, wherein before the inputting the feature vector of the audio to be tested into a target model for processing to obtain target time-frequency masking information, the method further comprises:
    separately obtaining room impulse response characteristics corresponding to sound sources in different spaces, and obtaining the direct sound and early reflections in the room impulse response characteristics;
    determining, according to the speech emitted by each sound source and the corresponding room impulse response characteristics, the reverberant audio corresponding to the sound source, and determining, according to the speech emitted by the sound source and the early reflections, the target-type audio corresponding to the sound source;
    determining, according to the reverberant audio and the target-type audio of each sound source, time-frequency masking information corresponding to the reverberant audio;
    determining each reverberant audio and the time-frequency masking information corresponding to the reverberant audio as one set of sample data, to obtain multiple sets of sample data; and
    training a preset neural network model with the multiple sets of sample data to generate the target model.
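Purely as an illustrative sketch (not part of the claims), the training-data recipe of claim 3 can be written out: convolve speech with the full room impulse response for the reverberant input, convolve with only the direct-sound and early-reflection segment for the target-type audio, and derive a mask from the two. The 50 ms early-reflection cutoff, the ideal-ratio-style mask, and all signal lengths are assumptions for the sketch:

```python
import numpy as np

def make_sample(speech, rir, sr=16000, early_ms=50, n_fft=256, hop=128):
    # Reverberant audio: speech convolved with the full room impulse response.
    reverb = np.convolve(speech, rir)
    # Target-type audio: speech convolved with only the direct sound and
    # early reflections (here: the first `early_ms` of the RIR, an assumption).
    early = np.convolve(speech, rir[:int(sr * early_ms / 1000)])
    early = np.pad(early, (0, len(reverb) - len(early)))
    # Ideal-ratio-style mask: target magnitude over reverberant magnitude.
    S = lambda x: np.abs(np.array([np.fft.rfft(x[i:i + n_fft])
                                   for i in range(0, len(x) - n_fft + 1, hop)]))
    mask = S(early) / (S(reverb) + 1e-8)
    return reverb, np.clip(mask, 0.0, 1.0)

speech = np.random.randn(8000)                               # dry speech stand-in
rir = np.exp(-np.arange(3200) / 800.0) * np.random.randn(3200)  # synthetic RIR
reverb, mask = make_sample(speech, rir)                      # one (input, label) pair
```

Each such (reverberant audio, mask) pair forms one set of sample data; a preset neural network is then trained to regress the mask from features of the reverberant input.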
  4. The audio processing method according to claim 3, wherein before the determining, according to the reverberant audio and the target-type audio of each sound source, the time-frequency masking information corresponding to the reverberant audio, the method further comprises:
    adding noise information to the reverberant audio of the sound source to obtain processed reverberant audio;
    and the determining, according to the reverberant audio and the target-type audio of each sound source, the time-frequency masking information corresponding to the reverberant audio comprises:
    determining, according to the processed reverberant audio and the target-type audio, the time-frequency masking information corresponding to the reverberant audio.
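As an illustration of the noise-augmentation step in claim 4 (not part of the claims), noise can be mixed into the reverberant audio at a controlled level. The SNR-based scaling is an assumption; the claim only states that noise information is added:

```python
import numpy as np

def add_noise(reverb, noise, snr_db=10.0):
    # Tile or truncate the noise to the signal length, then scale it so the
    # mixture has the requested signal-to-noise ratio (snr_db is an
    # illustrative choice, not specified by the claim).
    noise = np.resize(noise, reverb.shape)
    gain = np.sqrt(np.sum(reverb ** 2) /
                   (np.sum(noise ** 2) * 10 ** (snr_db / 10)))
    return reverb + gain * noise

noisy = add_noise(np.random.randn(16000), np.random.randn(4000))
```

Training on such noisy reverberant inputs makes the learned mask suppress both late reverberation and additive noise.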
  5. The audio processing method according to any one of claims 1 to 4, wherein the obtaining a feature vector of audio to be tested comprises:
    performing a Fourier transform on the audio to be tested to obtain frequency-domain information of the audio to be tested, and obtaining the feature vector of the audio to be tested from the frequency-domain information;
    and the processing the audio to be tested according to the target time-frequency masking information to obtain target audio comprises:
    processing the audio to be tested with the target time-frequency masking information to obtain target frequency-domain information, and performing an inverse Fourier transform on the target frequency-domain information to obtain the target audio.
  6. The audio processing method according to claim 5, wherein the processing the audio to be tested with the target time-frequency masking information to obtain target frequency-domain information comprises:
    multiplying the target time-frequency masking information by the time-frequency spectrum information corresponding to the audio to be tested to obtain the target frequency-domain information.
  7. The audio processing method according to any one of claims 1 to 6, wherein the processing the audio to be tested according to the target time-frequency masking information to obtain target audio comprises:
    smoothing the target time-frequency masking information, and processing the audio to be tested with the smoothed target time-frequency masking information to obtain the target audio; or
    processing the audio to be tested with the target time-frequency masking information to obtain processed audio, and smoothing the processed audio to obtain the target audio.
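As an illustrative sketch of the first branch of claim 7 (not part of the claims), the mask can be smoothed along the frame axis before it is applied. The first-order recursive filter and its smoothing constant are assumptions; the claim does not specify the smoothing method:

```python
import numpy as np

def smooth_mask(mask, alpha=0.6):
    # First-order recursive smoothing along the time (frame) axis; reduces
    # frame-to-frame fluctuation of the mask before it is applied
    # (alpha is an assumed smoothing constant, not given by the claim).
    out = np.empty_like(mask)
    out[0] = mask[0]
    for t in range(1, len(mask)):
        out[t] = alpha * out[t - 1] + (1 - alpha) * mask[t]
    return out

raw = np.random.rand(50, 129)  # 50 frames x 129 frequency bins
smoothed = smooth_mask(raw)
```

Smoothing the mask (or, per the second branch, the masked audio) suppresses musical-noise artifacts caused by rapidly varying per-bin gains.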
  8. An audio processing method, comprising:
    receiving, by a cloud server, audio to be tested;
    obtaining, by the cloud server, a feature vector of the audio to be tested, processing the feature vector of the audio to be tested with a target model to obtain target time-frequency masking information, and processing the audio to be tested according to the target time-frequency masking information to obtain target audio, wherein the target model is used to determine time-frequency masking information corresponding to reverberant audio, the time-frequency masking information is used to process the reverberant audio into target-type audio, and the target-type audio contains the direct sound and early reflections of the sound source corresponding to the reverberant audio; and
    returning, by the cloud server, the target audio to a client.
  9. An audio processing method, comprising:
    collecting audio to be tested, and playing the audio to be tested on an audio player; and
    playing, on the audio player, target audio corresponding to the audio to be tested, wherein the target audio is audio obtained by processing the audio to be tested with target time-frequency masking information, the target time-frequency masking information is information obtained by processing a feature vector of the audio to be tested with a target model, and the target model is used to determine time-frequency masking information corresponding to reverberant audio.
  10. An audio processing method, comprising:
    collecting, by at least two collectors, audio generated in a teaching space to obtain first audio;
    obtaining a feature vector of the first audio, and inputting the feature vector of the first audio into a target model for processing to obtain target time-frequency masking information, wherein the target model is used to determine time-frequency masking information corresponding to reverberant audio, the time-frequency masking information is used to process the reverberant audio into target-type audio, and the target-type audio contains the direct sound and early reflections of the sound source corresponding to the reverberant audio;
    processing the first audio according to the target time-frequency masking information to obtain second audio; and
    sending the second audio to a remote classroom corresponding to the teaching space.
  11. An audio processing apparatus, comprising:
    a first obtaining unit, configured to obtain a feature vector of audio to be tested;
    a first processing unit, configured to input the feature vector of the audio to be tested into a target model for processing to obtain target time-frequency masking information, wherein the target model is used to determine time-frequency masking information corresponding to reverberant audio, the time-frequency masking information is used to process the reverberant audio into target-type audio, and the target-type audio contains the direct sound and early reflections of the sound source corresponding to the reverberant audio; and
    a second processing unit, configured to process the audio to be tested according to the target time-frequency masking information to obtain target audio.
  12. A storage medium, wherein the storage medium comprises a stored program, and when the program runs, a device where the storage medium is located is controlled to execute the audio processing method according to any one of claims 1 to 10.
  13. A computer program, wherein when the computer program is executed by a processor, the audio processing method according to any one of claims 1 to 10 is implemented.
  14. A computer terminal, comprising a processor and a memory, wherein
    the memory is configured to store software programs and modules; and
    the processor, coupled to the memory, is configured to run the software programs and modules stored in the memory to implement the audio processing method according to any one of claims 1 to 10.
PCT/CN2022/123819 2021-10-14 2022-10-08 Audio processing method and apparatus, storage medium and computer program WO2023061258A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111194926.5 2021-10-14
CN202111194926.5A CN113643714B (en) 2021-10-14 2021-10-14 Audio processing method, device, storage medium and computer program

Publications (1)

Publication Number Publication Date
WO2023061258A1

Family

ID=78426739

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/123819 WO2023061258A1 (en) 2021-10-14 2022-10-08 Audio processing method and apparatus, storage medium and computer program

Country Status (2)

Country Link
CN (1) CN113643714B (en)
WO (1) WO2023061258A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113643714B (en) * 2021-10-14 2022-02-18 阿里巴巴达摩院(杭州)科技有限公司 Audio processing method, device, storage medium and computer program
CN114446316B (en) * 2022-01-27 2024-03-12 腾讯科技(深圳)有限公司 Audio separation method, training method, device and equipment of audio separation model
CN117746828B (en) * 2024-02-20 2024-04-30 华侨大学 Noise masking control method, device, equipment and medium for open office

Citations (8)

Publication number Priority date Publication date Assignee Title
JP2011154126A (en) * 2010-01-26 2011-08-11 Yamaha Corp Apparatus and program for performing sound masking
CN110808063A (en) * 2019-11-29 2020-02-18 北京搜狗科技发展有限公司 Voice processing method and device for processing voice
CN111312273A (en) * 2020-05-11 2020-06-19 腾讯科技(深圳)有限公司 Reverberation elimination method, apparatus, computer device and storage medium
CN111341303A (en) * 2018-12-19 2020-06-26 北京猎户星空科技有限公司 Acoustic model training method and device and voice recognition method and device
CN112201276A (en) * 2020-11-11 2021-01-08 东南大学 TC-ResNet network-based microphone array voice separation method
CN112652290A (en) * 2020-12-14 2021-04-13 北京达佳互联信息技术有限公司 Method for generating reverberation audio signal and training method of audio processing model
CN113488066A (en) * 2021-06-18 2021-10-08 北京小米移动软件有限公司 Audio signal processing method, audio signal processing apparatus, and storage medium
CN113643714A (en) * 2021-10-14 2021-11-12 阿里巴巴达摩院(杭州)科技有限公司 Audio processing method, device, storage medium and computer program

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
CN106531179B (en) * 2015-09-10 2019-08-20 中国科学院声学研究所 A kind of multi-channel speech enhancement method of the selective attention based on semantic priori
US10726857B2 (en) * 2018-02-23 2020-07-28 Cirrus Logic, Inc. Signal processing for speech dereverberation
CN109523999B (en) * 2018-12-26 2021-03-23 中国科学院声学研究所 Front-end processing method and system for improving far-field speech recognition
CN110970046B (en) * 2019-11-29 2022-03-11 北京搜狗科技发展有限公司 Audio data processing method and device, electronic equipment and storage medium
CN111239686B (en) * 2020-02-18 2021-12-21 中国科学院声学研究所 Dual-channel sound source positioning method based on deep learning
CN111768796B (en) * 2020-07-14 2024-05-03 中国科学院声学研究所 Acoustic echo cancellation and dereverberation method and device

Patent Citations (8)

Publication number Priority date Publication date Assignee Title
JP2011154126A (en) * 2010-01-26 2011-08-11 Yamaha Corp Apparatus and program for performing sound masking
CN111341303A (en) * 2018-12-19 2020-06-26 北京猎户星空科技有限公司 Acoustic model training method and device and voice recognition method and device
CN110808063A (en) * 2019-11-29 2020-02-18 北京搜狗科技发展有限公司 Voice processing method and device for processing voice
CN111312273A (en) * 2020-05-11 2020-06-19 腾讯科技(深圳)有限公司 Reverberation elimination method, apparatus, computer device and storage medium
CN112201276A (en) * 2020-11-11 2021-01-08 东南大学 TC-ResNet network-based microphone array voice separation method
CN112652290A (en) * 2020-12-14 2021-04-13 北京达佳互联信息技术有限公司 Method for generating reverberation audio signal and training method of audio processing model
CN113488066A (en) * 2021-06-18 2021-10-08 北京小米移动软件有限公司 Audio signal processing method, audio signal processing apparatus, and storage medium
CN113643714A (en) * 2021-10-14 2021-11-12 阿里巴巴达摩院(杭州)科技有限公司 Audio processing method, device, storage medium and computer program

Also Published As

Publication number Publication date
CN113643714B (en) 2022-02-18
CN113643714A (en) 2021-11-12

Similar Documents

Publication Publication Date Title
WO2023061258A1 (en) Audio processing method and apparatus, storage medium and computer program
Zhao et al. Monaural speech dereverberation using temporal convolutional networks with self attention
Li et al. On the importance of power compression and phase estimation in monaural speech dereverberation
Luo et al. Real-time single-channel dereverberation and separation with time-domain audio separation network.
CN111489760B (en) Speech signal dereverberation processing method, device, computer equipment and storage medium
US20110096915A1 (en) Audio spatialization for conference calls with multiple and moving talkers
Zhang et al. Multi-channel multi-frame ADL-MVDR for target speech separation
Chakrabarty et al. Time-frequency masking based online speech enhancement with multi-channel data using convolutional neural networks
CN112634923B (en) Audio echo cancellation method, device and storage medium based on command scheduling system
CN114283795A (en) Training and recognition method of voice enhancement model, electronic equipment and storage medium
CN112820315A (en) Audio signal processing method, audio signal processing device, computer equipment and storage medium
Küçük et al. Real-time convolutional neural network-based speech source localization on smartphone
Hussain et al. Ensemble hierarchical extreme learning machine for speech dereverberation
Lugasi et al. Speech enhancement using masking for binaural reproduction of ambisonics signals
Rao et al. Interspeech 2021 conferencingspeech challenge: Towards far-field multi-channel speech enhancement for video conferencing
CN114121031A (en) Device voice noise reduction, electronic device, and storage medium
CN106161820B (en) A kind of interchannel decorrelation method for stereo acoustic echo canceler
JP2021167977A (en) Voice signal processing method, voice signal processing device, electronic apparatus and storage medium
CN110491409B (en) Method and device for separating mixed voice signal, storage medium and electronic device
Luo et al. Implicit filter-and-sum network for multi-channel speech separation
CN113241085B (en) Echo cancellation method, device, equipment and readable storage medium
CN114758668A (en) Training method of voice enhancement model and voice enhancement method
Westhausen et al. Low bit rate binaural link for improved ultra low-latency low-complexity multichannel speech enhancement in Hearing Aids
Lee et al. Improved mask-based neural beamforming for multichannel speech enhancement by snapshot matching masking
Zhang et al. LCSM: A lightweight complex spectral mapping framework for stereophonic acoustic echo cancellation

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22880197

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE