WO2024055751A1 - Audio data processing method, apparatus, device, storage medium and program product - Google Patents

Audio data processing method, apparatus, device, storage medium and program product

Info

Publication number
WO2024055751A1
WO2024055751A1 (PCT/CN2023/108796, CN2023108796W)
Authority
WO
WIPO (PCT)
Prior art keywords
audio data
target
data frame
mask
historical
Prior art date
Application number
PCT/CN2023/108796
Other languages
English (en)
French (fr)
Inventor
黄代玉
鲍枫
李岳鹏
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2024055751A1 publication Critical patent/WO2024055751A1/zh


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique, using neural networks
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 - Television systems
    • H04N7/14 - Systems for two-way working
    • H04N7/15 - Conference systems

Definitions

  • the present application relates to the field of computer technology, and in particular to an audio data processing method, device, equipment, storage medium and program product.
  • Non-stationary noise (Non-Stationary Noise)
  • Among non-stationary background noise there is a type called Babble Noise, which is composed of the conversation sounds of multiple speakers. Since the components of this noise are similar to the components of the speech data of the target speech, when speech enhancement is performed on target speech containing such non-stationary background noise, speech data whose components resemble the noise is easily cancelled by mistake, which reduces the speech fidelity of the audio data after noise suppression.
  • Embodiments of the present application provide an audio data processing method, device, equipment, storage medium and program product, which can effectively suppress noise data in audio data and improve speech fidelity.
  • embodiments of the present application provide an audio data processing method, which is executed by a computer device, including:
  • obtaining a target audio data frame and K historical audio data frames associated with original audio data; the target audio data frame and the K historical audio data frames are all spectrum frames, each historical audio data frame among the K historical audio data frames is a spectrum frame located before the target audio data frame, and K is a positive integer;
  • when N target cepstrum coefficients of the target audio data frame are obtained, obtaining M first-order time derivatives and M second-order time derivatives associated with the target audio data frame based on the N target cepstrum coefficients; N is a positive integer greater than 1, and M is a positive integer less than N;
  • obtaining N historical cepstrum coefficients corresponding to each historical audio data frame, and determining the spectral dynamic features associated with the target audio data frame based on the obtained K*N historical cepstrum coefficients;
  • inputting the N target cepstrum coefficients, the M first-order time derivatives, the M second-order time derivatives and the spectral dynamic features into a target mask estimation model, and outputting, by the target mask estimation model, the target mask corresponding to the target audio data frame;
  • the target mask is used to suppress noise data in the original audio data to obtain enhanced audio data corresponding to the original audio data.
  • embodiments of the present application provide an audio data processing method, which is executed by a computer device, including:
  • obtaining a target sample audio data frame and K historical sample audio data frames associated with sample audio data, and obtaining the sample mask corresponding to the target sample audio data frame;
  • the target sample audio data frame and the K historical sample audio data frames are all spectrum frames, each historical sample audio data frame among the K historical sample audio data frames is a spectrum frame located before the target sample audio data frame, and K is a positive integer;
  • when N target sample cepstrum coefficients of the target sample audio data frame are obtained, obtaining M sample first-order time derivatives and M sample second-order time derivatives associated with the target sample audio data frame based on the N target sample cepstrum coefficients; N is a positive integer greater than 1, and M is a positive integer less than N;
  • obtaining N historical sample cepstrum coefficients corresponding to each historical sample audio data frame, and determining the sample spectral dynamic features associated with the target sample audio data frame based on the obtained K*N historical sample cepstrum coefficients;
  • inputting the N target sample cepstrum coefficients, the M sample first-order time derivatives, the M sample second-order time derivatives and the sample spectral dynamic features into an initial mask estimation model, and outputting, by the initial mask estimation model, the prediction mask corresponding to the target sample audio data frame;
  • iteratively training the initial mask estimation model based on the prediction mask and the sample mask to obtain a target mask estimation model; the target mask estimation model is used to output the target mask corresponding to the target audio data frame associated with original audio data, and the target mask is used to suppress the noise data in the original audio data to obtain the enhanced audio data corresponding to the original audio data.
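  • As a rough illustration of the training procedure described above (not the network structure of Figure 7), the following minimal PyTorch-style sketch treats the initial mask estimation model as a small feed-forward network and uses the mean squared error between the prediction mask and the sample mask as the training loss; the feature and mask dimensions, the architecture and the loss are assumptions made only for this sketch.

    import torch
    import torch.nn as nn

    feature_dim = 56 + 6 + 6 + 56   # illustrative sizes: cepstra + 1st/2nd derivatives + dynamic features
    mask_dim = 56                   # assumed: one mask value per acoustic band

    # Stand-in for the "initial mask estimation model"; the actual architecture is described in the patent.
    model = nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU(),
                          nn.Linear(128, mask_dim), nn.Sigmoid())
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    def training_step(sample_features: torch.Tensor, sample_mask: torch.Tensor) -> float:
        # sample_features: (batch, feature_dim) audio features of target sample audio data frames
        # sample_mask:     (batch, mask_dim)    masks derived from clean/noisy sample pairs
        prediction_mask = model(sample_features)
        loss = loss_fn(prediction_mask, sample_mask)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()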
  • an audio data processing device including:
  • the first acquisition module is used to obtain the target audio data frame and K historical audio data frames associated with the original audio data;
  • the target audio data frame and the K historical audio data frames are both spectrum frames, each historical audio data frame among the K historical audio data frames is a spectrum frame located before the target audio data frame, and K is a positive integer;
  • the second acquisition module is used to acquire, when the N target cepstrum coefficients of the target audio data frame are acquired, M first-order time derivatives and M second-order time derivatives associated with the target audio data frame based on the N target cepstrum coefficients; N is a positive integer greater than 1, and M is a positive integer less than N;
  • the third acquisition module is used to obtain N historical cepstrum coefficients corresponding to each historical audio data frame, and determine the spectrum dynamic characteristics associated with the target audio data frame based on the obtained K*N historical cepstrum coefficients;
  • the mask estimation module is used to input the N target cepstrum coefficients, M first-order time derivatives, M second-order time derivatives and spectral dynamic features into the target mask estimation model, and to output, by the target mask estimation model, the target mask corresponding to the target audio data frame; the target mask is used to suppress the noise data in the original audio data to obtain the enhanced audio data corresponding to the original audio data.
  • an audio data processing device including:
  • the first acquisition module is used to obtain the target sample audio data frame and K historical sample audio data frames associated with the sample audio data, and to obtain the sample mask corresponding to the target sample audio data frame; the target sample audio data frame and the K historical sample audio data frames are all spectrum frames, each historical sample audio data frame among the K historical sample audio data frames is a spectrum frame located before the target sample audio data frame, and K is a positive integer;
  • the second acquisition module is used to obtain, when the N target sample cepstrum coefficients of the target sample audio data frame are obtained, M sample first-order time derivatives and M sample second-order time derivatives associated with the target sample audio data frame based on the N target sample cepstrum coefficients; N is a positive integer greater than 1, and M is a positive integer less than N;
  • the third acquisition module is used to obtain N historical sample cepstrum coefficients corresponding to each historical sample audio data frame, and to determine the sample spectral dynamic features associated with the target sample audio data frame based on the obtained K*N historical sample cepstrum coefficients;
  • the mask prediction module is used to input the N target sample cepstrum coefficients, M sample first-order time derivatives, M sample second-order time derivatives and sample spectral dynamic features into the initial mask estimation model, and to output, by the initial mask estimation model, the prediction mask corresponding to the target sample audio data frame; and
  • a model training module, used to iteratively train the initial mask estimation model based on the prediction mask and the sample mask to obtain a target mask estimation model; the target mask estimation model is used to output the target mask corresponding to the target audio data frame associated with the original audio data, and the target mask is used to suppress the noise data in the original audio data to obtain the enhanced audio data corresponding to the original audio data.
  • embodiments of the present application provide a computer device, including: a processor and a memory;
  • the processor is connected to a memory, where the memory is used to store a computer program.
  • when the computer program is executed by the processor, the computer device executes the method provided by the embodiments of the present application.
  • embodiments of the present application provide a computer-readable storage medium.
  • the computer-readable storage medium stores a computer program.
  • the computer program is adapted to be loaded and executed by a processor, so that a computer device having the processor executes the method provided by the embodiments of the present application.
  • embodiments of the present application provide a computer program product or computer program.
  • the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the method provided by the embodiment of the present application.
  • Figure 1 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • Figure 2 is a schematic diagram of an audio data processing scenario provided by an embodiment of the present application.
  • Figure 3 is a schematic flow chart of an audio data processing method provided by an embodiment of the present application.
  • Figure 4a is a schematic diagram of an audio preprocessing scenario provided by an embodiment of the present application.
  • Figure 4b is a schematic diagram of an audio preprocessing scenario provided by an embodiment of the present application.
  • Figure 5 is a schematic diagram of a scenario of differential operation of cepstrum coefficients provided by an embodiment of the present application.
  • Figure 6 is a schematic diagram of a scene for obtaining inter-frame difference values provided by an embodiment of the present application.
  • Figure 7 is a schematic network structure diagram of a mask estimation model provided by an embodiment of the present application.
  • Figure 8 is a schematic flow chart of an audio data processing method provided by an embodiment of the present application.
  • Figure 9 is a schematic flow chart of model training provided by an embodiment of the present application.
  • Figure 10 is a schematic diagram of a noise reduction effect provided by an embodiment of the present application.
  • Figure 11 is a schematic structural diagram of an audio data processing device provided by an embodiment of the present application.
  • Figure 12 is a schematic structural diagram of an audio data processing device provided by an embodiment of the present application.
  • Figure 13 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • Figure 14 is a schematic structural diagram of an audio data processing system provided by an embodiment of the present application.
  • AI (Artificial Intelligence) is a theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Speech enhancement (SE) technology refers to a technology that extracts useful speech signals from the noise background to suppress and reduce noise interference when the speech signal is interfered or even overwhelmed by various noises.
  • Speech enhancement technology can separate speech from non-speech noise to ensure speech intelligibility, that is, extract the purest possible original speech from noisy speech.
  • Speech enhancement involves a wide range of application fields, including voice calls, telephone conferencing, real-time audio and video conferencing, scene recording, hearing aid equipment and speech recognition equipment, etc., and has become a pre-processing module for many speech coding and recognition systems.
  • DSP (digital signal processing) refers to converting analog information such as audio, video and pictures into digital form, and using computers or special-purpose processing equipment to collect, transform, filter, evaluate, enhance, compress and recognize the signals, so as to obtain a signal form that meets people's needs.
  • digital signal processing technology can be used to extract target audio features including target cepstral coefficients, first-order time derivatives, second-order time derivatives, and spectral dynamic features from the target audio data frame.
  • Machine Learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve their performance.
  • Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent. Its applications cover all fields of artificial intelligence. Machine learning usually includes artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, teaching learning and other technologies.
  • the target mask estimation model is an AI model based on machine learning technology, which can be used to estimate the corresponding mask from the input audio features.
  • Figure 1 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • the system architecture may include a service server 100 and a user terminal cluster, where the user terminal cluster may include one or more user terminals.
  • the number of user terminals in the user terminal cluster will not be limited here.
  • multiple user terminals in the user terminal cluster may specifically include: user terminal 200a, user terminal 200b, user terminal 200c,..., user terminal 200n.
  • any user terminal in the user terminal cluster can have a communication connection with the service server 100, so that each user terminal in the user terminal cluster can interact with the service server 100 through the communication connection.
  • the above-mentioned communication connection is not limited to a particular connection method; it can be established directly or indirectly through wired communication, directly or indirectly through wireless communication, or through other methods, which is not limited in this application.
  • each user terminal in the user terminal cluster as shown in Figure 1 can be installed with an application client.
  • the application client can exchange data with the business server 100 shown in Figure 1 above.
  • the application client can be a social client, an instant messaging client (for example, a conference client), an entertainment client (for example, a game client, a live broadcast client), a multimedia client (for example, a video client), Information clients (for example, news clients), shopping clients, car clients, smart home clients, and other clients that have the function of displaying data information such as text, images, audio, and video.
  • the application client can be a client with an audio and video communication function.
  • the audio and video communication function here can be a pure audio communication function or a video communication function, and it can be widely used in business scenarios involving audio and video collection, such as audio and video conferencing, audio and video calls, and audio and video live broadcast, in fields such as corporate office, instant communication, online education, telemedicine and digital finance.
  • the application client can be an independent client, or it can be an embedded sub-client integrated into another client (such as a social client, instant messaging client or video client), which is not limited here.
  • the business server 100 can be a collection of multiple servers, including a backend server, a data processing server, etc., corresponding to the instant messaging client. Therefore, each user terminal can transmit data to the business server 100 through the instant messaging client; for example, each user terminal can collect relevant audio and video data in real time and send the collected audio and video data to other user terminals through the business server 100 to achieve audio and video communication (for example, to carry out a remote real-time audio and video conference).
  • embodiments of the present application provide a method for real-time noise suppression of audio data. This method effectively combines digital signal processing with neural networks that have non-linear fitting capabilities, so as to suppress noise data (such as Babble Noise) during audio and video communication while maintaining extremely high voice fidelity.
  • the speech enhancement technology based on digital signal processing can be further divided into single-channel speech enhancement technology and microphone array speech enhancement technology according to the number of channels.
  • the speech enhancement technology of digital signal processing can better cope with stationary noise in real-time online speech enhancement, but its ability to suppress non-stationary noise is poor.
  • speech enhancement technology based on machine learning/deep learning has unique characteristics in noise suppression. It can reduce highly non-stationary noise and background sounds, and can be used for commercial applications in real-time communication. Therefore, the effective combination of digital signal processing technology and machine learning/deep learning technology can simultaneously meet the needs of non-stationary noise suppression and real-time communication.
  • non-stationary noise refers to noise whose statistical characteristics change over time; for example, when collecting audio and video, the barking of dogs, the banging of kitchen utensils, the crying of babies, construction noise or traffic noise may be collected together with the target speech.
  • the embodiments of this application may collectively refer to the source object of the target speech as a business object (for example, the user who speaks during the audio and video communication process, who may also be called the speaker), and the to-be-processed audio data associated with the business object is collectively referred to as original audio data.
  • the original audio data here can be obtained by collecting, through an audio device, the sound in the real environment where the business object is located, and may include both the voice data generated by the business object (that is, the voice data of the target speech) and the noise data in the environment.
  • the noise data in the embodiments of this application refers to the noise data of non-stationary background noise, which may include real conversation sounds around the business object (i.e., Babble Noise), singing or speaking sounds carried by multimedia files being played, and other similar non-stationary background noise.
  • the multimedia files here can be video files that carry both image data and audio data, such as short videos, TV series, movies, music videos (Music Video, MV) and animations, or audio files mainly composed of audio data, such as songs, audio books, radio dramas and radio programs; the embodiments of this application do not limit the type, content, source or format of multimedia files.
  • embodiments of the present application may refer to a neural network model used for mask estimation of audio features extracted from original audio data as a target mask estimation model.
  • the audio device may be a hardware component installed in the user terminal.
  • the audio device may be a microphone of the user terminal; or, optionally, the audio device may also be a hardware device connected to the user terminal, for example, a microphone connected to the user terminal, which is used to provide the user terminal with original audio data acquisition services.
  • the audio device may include an audio sensor, a microphone, etc.
  • the methods provided by the embodiments of the present application can be executed by computer equipment, including but not limited to user terminals (for example, any user terminal in the user terminal cluster shown in Figure 1) or business servers (for example, The business server 100 shown in Figure 1).
  • the business server can be an independent physical server, a server cluster or a distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud databases, cloud services, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.
  • User terminals can be smartphones, tablets, laptops, desktop computers, handheld computers, wearable devices (such as smart watches, smart bracelets, smart hearing aids, etc.), smart computers, smart cars and other intelligent terminals that can run the above application clients.
  • The user terminal and the service server may be connected directly or indirectly through wired or wireless methods, and the embodiments of the present application do not limit this.
  • the user terminal 200a and the user terminal 200b are used as examples for description here.
  • business object 1 conducts audio and video communication with business object 2 corresponding to user terminal 200b through the application client on user terminal 200a (for example, business object 1 and business object 2 are having a pure voice conference)
  • the user terminal 200a can obtain the original audio data associated with the business object 1 through a related audio device (such as a microphone on the user terminal 200a).
  • the original audio data may be a mixed audio signal in the time domain. Since it is difficult to directly extract the pure speech signal from the mixed audio signal in the time domain, this application will start from the frequency domain to solve the speech separation problem.
  • audio preprocessing can be performed on the original audio data to obtain multiple spectrum frames (also referred to as audio data frames in the embodiments of this application) in the frequency domain. Each spectrum frame contains the frequency-domain information of one segment of the original audio data.
  • the embodiments of the present application can refer to any spectrum frame to be processed among the multiple spectrum frames as the target audio data frame.
  • a spectrum frame located before the target audio data frame in the frequency domain can be called a historical audio data frame; that is to say, a historical audio data frame is a spectrum frame obtained before the target audio data frame.
  • the user terminal 200a can obtain the target audio data frame associated with the original audio data and K historical audio data frames, where K is a positive integer.
  • the specific number of historical audio data frames is not limited in the embodiments of this application.
  • the target audio data frame and the K historical audio data frames here are both spectrum frames, and each of the K historical audio data frames is the spectrum frame before the target audio data frame.
  • the target audio data frame is taken as an example for explanation.
  • the processing process of other spectrum frames is consistent with the processing process of the target audio data frame.
  • the user terminal 200a can obtain N target cepstrum coefficients of the target audio data frame, and can then obtain M first-order time derivatives and M second-order time derivatives associated with the target audio data frame based on the N target cepstrum coefficients, where N is a positive integer greater than 1 and M is a positive integer less than N.
  • the embodiments of this application do not limit the specific number of target cepstrum coefficients, the specific number of first-order time derivatives, or the specific number of second-order time derivatives.
  • the user terminal 200a can also obtain N historical cepstrum coefficients corresponding to each historical audio data frame, and can determine the spectral dynamic features associated with the target audio data frame based on the obtained K*N historical cepstrum coefficients.
  • the cepstrum coefficients, first-order time derivatives, second-order time derivatives and spectral dynamic features related to each spectrum frame can be collectively referred to as audio features.
  • the audio features corresponding to the target audio data frame can be called the target audio features.
  • the target cepstrum coefficients can be used to characterize the acoustic characteristics of the target audio data frame, while the associated first-order time derivatives, second-order time derivatives and spectral dynamic features can characterize the time correlation characteristics between audio signals (or the stationarity characteristics of the audio signal).
  • the user terminal 200a can input the above-mentioned N target cepstral coefficients, M first-order time derivatives, M second-order time derivatives and spectrum dynamic characteristics into the trained target mask estimation model, and use the target mask estimation model to A target mask corresponding to the target audio data frame is output.
  • the target mask can be used to suppress noise data in the original audio data to obtain enhanced audio data corresponding to the original audio data.
  • in this way, the voice data and noise data in the original audio data can be effectively separated, and speech enhancement during the audio and video communication process can be realized.
  • It can be understood that a mask obtained by jointly modeling multiple audio features is more accurate, so the speech fidelity of the enhanced audio data obtained by using this mask is also very high.
  • the target mask may include but is not limited to an Ideal Ratio Mask (IRM), an Ideal Binary Mask (IBM), an Optimal Ratio Mask (ORM), etc.
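  • For reference only, the forms of these masks commonly used in the speech-enhancement literature are given below in LaTeX notation, where S(t,f) and D(t,f) denote the clean-speech and noise components of time-frequency bin (t,f), SNR(t,f) is the local signal-to-noise ratio and theta is a chosen threshold; the patent does not commit to these exact definitions here.

    \mathrm{IRM}(t,f) = \left( \frac{|S(t,f)|^{2}}{|S(t,f)|^{2} + |D(t,f)|^{2}} \right)^{1/2}

    \mathrm{IBM}(t,f) = \begin{cases} 1, & \text{if } \mathrm{SNR}(t,f) > \theta \\ 0, & \text{otherwise} \end{cases}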
  • the subsequent user terminal 200a can send the obtained enhanced audio data to the service server 100, and then the service server 100 delivers the enhanced audio data to the terminal device 200b.
  • In this way, during the audio and video communication process, business object 1 and business object 2 can always hear the high-quality voice sent by the other party, so that high-quality audio and video communication can be achieved and the user experience can be improved.
  • the above-mentioned application client can also be a client with an audio and video editing function.
  • This function can perform voice enhancement processing on the original audio data to be processed.
  • This function can be applied to audio and video production, audio and video recording, and other business scenarios involving audio and video collection.
  • In such scenarios, the original audio data can be obtained by recording, in real time through an audio device, the sound of the business object (here, the target user who needs speech enhancement) in the real environment, or the original audio data can be obtained from the multimedia file to be processed (which may include video files and audio files); the embodiments of the present application do not limit this.
  • the original audio data can also be an audio signal mixed with the voice data of the business object and the noise data in the environment where the business object is located.
  • the process of performing voice enhancement on the original audio data is similar to the speech enhancement process in the above-mentioned audio and video communication scenario, and the resulting clearer enhanced audio data can be directly stored or sent, or can be used to replace the original audio data in the multimedia file to be processed.
  • the real-time requirements for voice enhancement are lower, but it can still meet users' needs for high-quality voice.
  • the service server can also obtain the original audio data sent by the user terminal, and obtain the target mask corresponding to the target audio data frame associated with the original audio data by loading the trained target mask estimation model, thereby realizing speech enhancement.
  • the number of business servers in the system architecture shown in Figure 1 can be one or more.
  • One user terminal can be connected to one business server, and each business server can obtain the original audio data uploaded by its connected user terminals and perform speech enhancement on it.
  • the above system architecture is suitable for a variety of business scenarios involving audio and video collection, which may specifically include: real-time noise reduction scenarios such as audio and video conferences, audio and video calls, audio and video live broadcasts, audio and video exclusive interviews, remote visits, hearing aid speech enhancement and speech recognition, as well as non-real-time noise reduction scenarios such as audio and video recording and audio and video post-production, or other business scenarios that require speech enhancement processing of the collected audio data, especially business scenarios that require real-time suppression of Babble Noise.
  • the specific business scenarios will not be listed here.
  • Figure 2 is a schematic diagram of an audio data processing scenario provided by an embodiment of the present application.
  • the computer device 20 shown in Figure 2 can be the business server 100 in the embodiment corresponding to Figure 1, or any user terminal in the user terminal cluster (for example, user terminal 200a), which is not limited here.
  • the original audio data 201 can be a mixed audio signal containing the voice data of the business object and the noise data in the environment.
  • the original audio data 201 can be audio data collected in real time by the computer device 20 through the relevant audio equipment, audio data obtained by the computer device 20 from a multimedia file to be processed, or audio data sent by other computer devices to the computer device 20 for audio processing, which is not limited in the embodiments of the present application.
  • the computer device 20 can suppress the noise data therein to obtain audio data with better voice quality.
  • the computer device 20 can first use digital signal processing technology to extract the audio features of the original audio data 201.
  • the computer device 20 can first perform audio preprocessing on the original audio data 201, which specifically includes operations such as framing-and-windowing preprocessing and time-frequency transformation, so as to obtain an audio data frame set 202 associated with the original audio data 201.
  • the audio data frame set 202 may include multiple audio data frames located in the frequency domain (that is, spectrum frames); the number of audio data frames included in the audio data frame set 202 is not limited here.
  • the computer device 20 can perform processing operations such as audio feature extraction, mask estimation, noise suppression, etc. on each audio data frame in the audio data frame set 202.
  • the embodiments of the present application do not limit the processing order of the audio data frames; for example, multiple audio data frames can be processed in parallel, or each audio data frame can be processed serially in order of acquisition time.
  • any audio data frame to be processed in the audio data frame set 202 may be used as the target audio data frame.
  • for example, the audio data frame 203 in the audio data frame set 202 may be used as the target audio data frame; when other audio data frames are used as the target audio data frame, the corresponding processing process is consistent with the processing process for the audio data frame 203.
  • the computer device 20 can also obtain a historical audio data frame set 204 related to the audio data frame 203 in the audio data frame set 202.
  • the historical audio data frame set 204 can include K historical audio data frames before the target audio data frame.
  • these K historical audio data frames may be audio data frame A_1, ..., audio data frame A_K, where K is a positive integer, and the specific value of K is not limited here. It can be understood that audio data frames A_1 to A_K are all spectrum frames before the audio data frame 203.
  • the computer device 20 can perform audio feature extraction on the target audio data frame. Taking the audio data frame 203 as an example, the computer device 20 can obtain the cepstrum coefficient set 205 corresponding to the audio data frame 203.
  • the cepstrum coefficient set 205 can be used to characterize the acoustic characteristics of the audio data frame 203, where the cepstrum coefficient set 205 may include N target cepstrum coefficients of the audio data frame 203.
  • the N target cepstrum coefficients may specifically include cepstrum coefficient B_1, cepstrum coefficient B_2, ..., cepstrum coefficient B_N, where N is a positive integer greater than 1.
  • the specific value of N is not limited here.
  • the computer device 20 may obtain M first-order time derivatives and M second-order time derivatives associated with the audio data frame 203 based on the N target cepstrum coefficients in the cepstrum coefficient set 205, where M is a positive integer less than N.
  • the specific value of M is not limited here.
  • the first-order time derivatives can be obtained by performing a difference operation on the above-mentioned cepstrum coefficient B_1, cepstrum coefficient B_2, ..., cepstrum coefficient B_N, and the second-order time derivatives can be obtained by performing a further difference operation on the obtained first-order time derivatives.
  • for the specific operation process, please refer to the relevant description of step S102 in the embodiment corresponding to Figure 3.
  • in this way, the computer device 20 can obtain a first-order time derivative set 206 and a second-order time derivative set 207. The first-order time derivative set 206 can include the M first-order time derivatives associated with the audio data frame 203, for example, first-order time derivative C_1, ..., first-order time derivative C_M; similarly, the second-order time derivative set 207 can include the M second-order time derivatives associated with the audio data frame 203, for example, second-order time derivative D_1, ..., second-order time derivative D_M.
  • the computer device 20 may also obtain the spectral dynamic features associated with the target audio data frame. Still taking the audio data frame 203 as an example, after obtaining the above-mentioned historical audio data frame set 204, the computer device 20 can obtain the N historical cepstrum coefficients corresponding to each historical audio data frame in the historical audio data frame set 204. For example, the N historical cepstrum coefficients corresponding to audio data frame A_1 can be obtained, including cepstrum coefficient A_11, cepstrum coefficient A_12, ..., cepstrum coefficient A_1N; ...; the N historical cepstrum coefficients corresponding to audio data frame A_K can be obtained, including cepstrum coefficient A_K1, cepstrum coefficient A_K2, ..., cepstrum coefficient A_KN.
  • the obtained K*N historical cepstrum coefficients can be used as a cepstrum coefficient set 208. It can be understood that the acquisition process of the N historical cepstrum coefficients corresponding to each historical audio data frame is similar to the acquisition process of the N target cepstrum coefficients corresponding to the audio data frame 203, and will not be described again here.
  • the computer device 20 can determine the spectrum dynamic characteristics 209 associated with the audio data frame 203 based on the K*N historical cepstral coefficients in the cepstral coefficient set 208.
  • for the specific process, please refer to the relevant description of step S103 in the embodiment corresponding to Figure 3.
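  • The exact construction of the spectral dynamic features is given in step S103 and Figure 6 (inter-frame difference values), which are not reproduced in this section. Purely to illustrate the idea of summarizing K*N historical cepstrum coefficients into a dynamic feature, one simple reading based on inter-frame differences is sketched below; the function name and the aggregation choice are assumptions, not the patent's definition.

    import numpy as np

    def spectral_dynamic_features(history_cepstra: np.ndarray) -> np.ndarray:
        # history_cepstra has shape (K, N): N historical cepstrum coefficients for
        # each of the K historical audio data frames.
        if history_cepstra.shape[0] < 2:
            return np.zeros(history_cepstra.shape[1])
        diffs = np.diff(history_cepstra, axis=0)   # (K-1, N) frame-to-frame changes
        return np.mean(np.abs(diffs), axis=0)      # one dynamic value per coefficient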
  • the computer device 20 can load a pre-trained target mask estimation model (for example, the mask estimation model 210), and then the above-mentioned cepstrum coefficient set 205, the first-order time derivative set 206, the second-order time derivative set 207 and the spectral dynamic features 209 are jointly input into the mask estimation model 210.
  • in the mask estimation model 210, mask estimation is performed on the input audio features, and the target mask corresponding to the audio data frame 203 (for example, mask 211) can be obtained.
  • the computer device 20 can apply the obtained mask 211 to the audio data frame 203 to suppress the noise data therein.
  • the function of the mask is to retain the voice data of the business object in the original audio data as much as possible, and eliminate the interference caused by noisy data (such as the sound of other people talking near the business object).
  • the processing process of other audio data frames (such as audio data frame A_1, ..., audio data frame A_K, etc.) by the computer device 20 is similar to the processing process of the audio data frame 203 and will not be described again here.
  • the enhanced audio data 212 corresponding to the original audio data 201 can be obtained.
  • the noise data content in the enhanced audio data 212 is extremely low, and the voice data of the business object is effectively retained, with extremely high voice fidelity.
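  • As a minimal sketch of how an estimated mask suppresses noise in a spectrum frame (assuming, for illustration, that the mask has one value per frequency point or has already been expanded from per-band values), the element-wise application looks like this; the enhanced spectrum frames would then be converted back to the time domain using the inverse transform and the synthesis window mentioned in the embodiment corresponding to Figure 3.

    import numpy as np

    def apply_mask(spectrum_frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
        # Mask values close to 1 keep the business object's speech;
        # values close to 0 suppress the noise data at that frequency point.
        return spectrum_frame * mask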
  • the computer device 20 can use an audio database with massive audio data to train a neural network to obtain the above-mentioned mask estimation model 210.
  • For the specific training process please refer to the embodiment corresponding to subsequent Figure 8.
  • the sources of the above-mentioned original audio data may be different, and accordingly, the uses of the ultimately obtained enhanced audio data may also be different.
  • for example, in an audio and video communication scenario, the computer device 20 can perform real-time speech enhancement processing on the original audio data E1 to obtain enhanced audio data F1, which is sent to the user terminals of other users who are in audio and video communication with business object 1; for another example, in a hearing aid speech enhancement scenario, the computer device 20 can perform speech enhancement processing on the original audio data E2 associated with business object 2 and collected by the hearing aid device, so that the enhanced audio data F2 containing the clear voice data of business object 2 can be returned to the hearing aid device for playback; for another example, in a speech recognition scenario, after obtaining the original audio data input by business object 3, the computer device 20 can first perform speech enhancement processing on it to obtain enhanced audio data F3, and then perform speech recognition on the high-quality speech data contained in the enhanced audio data F3, thereby improving the accuracy of speech recognition.
  • for another example, the computer device 20 can perform speech enhancement processing on the original audio data E4 entered by business object 4, and can store the obtained enhanced audio data F4 (for example, cache it in the local memory of the computer device 20 or upload it to cloud storage) or send it (for example, send it to other user terminals for playback as an audio and video conversation message during instant messaging); for another example, in an audio and video production scenario, the computer device 20 can obtain the original audio data E5 from the multimedia file to be processed and perform speech enhancement processing on it, and the original audio data E5 in the multimedia file to be processed can then be replaced with the obtained enhanced audio data F5, thereby improving the audio quality of the multimedia file.
  • the computer device 20 obtains the target mask estimation model by training the initial mask estimation model, obtains the target audio features of the target audio data frame associated with the original audio data, and performs mask estimation on the target audio features through the target mask estimation model, so that the noise data in the original audio data can be suppressed while the speech fidelity of the resulting enhanced audio data remains high.
  • Figure 3 is a schematic flow chart of an audio data processing method provided by an embodiment of the present application. It can be understood that the method provided by the embodiment of the present application can be executed by a computer device, where the computer device includes but is not limited to a user terminal or a business server running a target mask estimation model.
  • this embodiment of the present application takes the computer device as a user terminal as an example to illustrate the specific process of performing audio processing (such as speech enhancement) on original audio data in the user terminal.
  • the method may at least include the following steps S101 to S104:
  • Step S101: Obtain the target audio data frame and K historical audio data frames associated with the original audio data.
  • the user terminal can obtain the original audio data including the voice data of the business object and the noise data in the environment.
  • the original audio data can be audio data collected by the user terminal in real time through an audio device, audio data obtained from a multimedia file to be processed, or audio data sent by other associated user terminals, which is not limited here.
  • speech data has certain stable properties; for example, within a pronunciation unit that lasts from tens to hundreds of milliseconds, speech data can show obvious stability and regularity. Based on this, when performing speech enhancement processing on a piece of audio data, speech enhancement can be performed based on smaller pronunciation units (such as phonemes, words, syllables, etc.). Therefore, before performing audio feature extraction on the original audio data, the user terminal can perform audio preprocessing on the original audio data to obtain multiple spectrum frames in the frequency domain.
  • sliding windows can be used to extract short segments from the raw audio data.
  • the user terminal can perform frame-based windowing preprocessing on the original audio data to obtain H audio data segments, where H is a positive integer greater than 1.
  • the specific number of audio data segments is not limited in the embodiments of the present application.
  • the frame-based windowing preprocessing may include a frame-based operation and a windowing operation.
  • the user terminal can perform a frame segmentation operation on the original audio data to obtain H audio signal frames located in the time domain. Since there will be discontinuities at the beginning and end of each audio signal frame after framing, the more audio signal frames the data is divided into, the greater the error relative to the original audio data. Therefore, the embodiments of the present application solve this problem through a windowing operation, which makes the framed signal continuous so that each frame of the signal exhibits the characteristics of a periodic function. That is to say, the user terminal can perform a windowing operation on each of the H audio signal frames obtained above, thereby obtaining H audio data segments with continuous signals.
  • each audio signal frame is multiplied by the window function in sequence to obtain the corresponding audio data segment.
  • the window function includes but is not limited to Vorbis window, Hamming window, rectangular window, Hanning window, etc. In actual applications, an appropriate window function can be selected according to needs, and the embodiment of the present application does not limit this.
  • the user terminal can jointly determine the number of audio signal frames that can be divided based on the length of the original audio data, the frame length used in the framing operation, and the frame shift.
  • the frame length refers to the length of an audio signal frame.
  • the "length" here can be expressed in a variety of ways, for example, by time or by the number of sampling points. Optionally, if expressed in time, the length of an audio signal frame can usually be between 15 ms and 30 ms. In actual applications, an appropriate frame length can be selected as needed, and this is not limited in the embodiments of the present application. For example, in some embodiments, the frame length may be set to 20 ms, and an audio signal frame with a frame length of 20 ms refers to an audio signal with a duration of 20 ms.
  • the frame shift refers to the distance moved each time it is divided into frames. It starts from the starting point of the first audio signal frame and moves one frame shift until the starting point of the next audio signal frame.
  • The frame shift can also be expressed in two ways. For example, in some embodiments, it can be expressed by time, and the frame shift is set to 12 ms; for another example, in some embodiments, it can be expressed by the number of sampling points, and for raw audio data with a sampling rate of 16 kHz, the frame shift can be set to 192 sampling points.
  • Figures 4a-4b are schematic diagrams of audio preprocessing scenarios provided by embodiments of the present application.
  • the frame length can be set to T1 (for example, set to 20ms), and the frame shift can be set to T2 (for example, set to 12ms).
  • the user terminal can apply the window function to each audio signal frame in turn, so as to obtain the corresponding audio data segment. For example, by multiplying audio signal frame 1 by the window function, audio data segment 1 can be obtained; by multiplying audio signal frame 2 by the window function, audio data segment 2 can be obtained; ...; by multiplying audio signal frame H by the window function, audio data segment H can be obtained. It can be understood that audio data segment 1 to audio data segment H here are arranged in time sequence.
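  • As a minimal numpy sketch of the framing-and-windowing preprocessing described above, the following assumes a 16 kHz signal, a 20 ms frame length (320 sampling points) and a 12 ms frame shift (192 sampling points); the function and variable names are illustrative and not taken from the patent.

    import numpy as np

    def vorbis_window(win_len: int) -> np.ndarray:
        # Vorbis window: w(n) = sin(pi/2 * sin^2(pi * (n + 0.5) / N)), used here as the analysis window.
        n = np.arange(win_len)
        return np.sin(0.5 * np.pi * np.sin(np.pi * (n + 0.5) / win_len) ** 2)

    def frame_and_window(signal: np.ndarray, frame_len: int = 320, frame_shift: int = 192) -> np.ndarray:
        # Slide over the time-domain signal; each row of the result is one windowed audio data segment.
        window = vorbis_window(frame_len)
        num_frames = max(0, 1 + (len(signal) - frame_len) // frame_shift)
        segments = np.zeros((num_frames, frame_len))
        for h in range(num_frames):
            start = h * frame_shift
            segments[h] = signal[start:start + frame_len] * window
        return segments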
  • the analysis window refers to the window function used in frame-based windowing preprocessing.
  • so that the speech spectrum in the frequency domain can subsequently be restored to a speech signal in the time domain, a synthesis window is applied during the restoration process.
  • both the analysis window and the synthesis window can use the Vorbis window, and this window function satisfies the Princen-Bradley criterion.
  • the specific implementation process will not be elaborated in the embodiments of this application.
  • the definition of the Vorbis window can be found in the following formula (1):
  • n refers to the index of the sampling point acted on by the current Vorbis window
  • N is the window length, and 0 ≤ n ≤ N-1.
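  • Formula (1) itself is not reproduced in this text; for reference, the standard definition of the Vorbis window, which satisfies the Princen-Bradley criterion as stated above, is:

    w(n) = \sin\!\left( \frac{\pi}{2} \, \sin^{2}\!\left( \frac{\pi \,(n + 0.5)}{N} \right) \right), \quad 0 \le n \le N-1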
  • the user terminal can perform time-frequency transformation on each audio data segment respectively, so as to obtain the audio data frame corresponding to each audio data segment. That is to say, the audio data segments in the time domain can be transformed into audio data frames in the frequency domain, in order to obtain spectrum frames on which noise suppression is easier to perform.
  • this embodiment of the present application will take any audio data segment among H audio data segments as an example to illustrate the specific process of time-frequency transformation.
  • the H audio data segments include audio data segment i, and i is a positive integer less than or equal to H.
  • the user terminal can first perform a time-frequency transformation on audio data segment i, for example, a Fourier transform (such as the Fast Fourier Transform, FFT), to obtain the DC component frequency point and 2S frequency points of audio data segment i in the frequency domain. That is to say, a total of (1+2S) frequency points can be obtained after the Fourier transform, where S is a positive integer.
  • the embodiment of the present application does not limit the number of frequency points.
  • the number of sampling points of the audio signal frame corresponding to each audio data segment and the number of frequency points corresponding to the audio data segment may be the same or different. In actual applications, the number of frequency points obtained after Fourier transformation can be set as needed. For example, in some embodiments, the number of sampling points corresponding to each audio signal frame is 320. When performing time-frequency conversion, the number of frequency points corresponding to each audio data segment can be set to 512.
  • the (1+2S) frequency points obtained after Fourier transform are all complex numbers.
  • Each complex number corresponds to a frequency.
  • the modulus of the complex number can represent the amplitude characteristics of the frequency.
  • the amplitude characteristics are related to the corresponding audio frequency.
  • the embodiment of the present application can determine the first S frequency points among the 2S frequency points as frequency points related to the first frequency point type.
  • the last S frequency points among the 2S frequency points can be determined as frequency points related to the second frequency point type; that is, the 2S frequency points may include S frequency points related to the first frequency point type and S frequency points related to the second frequency point type. It can be understood that the S frequency points related to the first frequency point type and the S frequency points related to the second frequency point type are conjugate-symmetric about the center.
  • the user terminal can obtain the S frequency points related to the first frequency point type among the above 2S frequency points, and can determine the audio data frame corresponding to audio data segment i based on these S frequency points and the DC component frequency point; optionally, the audio data frame corresponding to audio data segment i can also be determined based on the S frequency points related to the second frequency point type and the DC component frequency point, which is not limited in the embodiments of the present application.
  • the audio data frame corresponding to the audio data segment i is a spectrum frame in the frequency domain.
  • For example, if each audio data segment corresponds to 513 frequency points, including 1 DC component frequency point and 512 frequency points with a conjugate-symmetric relationship, then the first half of the 512 frequency points (that is, the frequency points related to the first frequency point type) and the DC component frequency point can be used to form the corresponding audio data frame.
  • For another example, suppose 5 frequency points are obtained, including a DC component frequency point (a+bi), frequency point (c+di), frequency point (e+fi), frequency point (c-di) and frequency point (e-fi), where frequency point (c+di) and frequency point (c-di) are a pair of conjugate complex numbers, and frequency point (e+fi) and frequency point (e-fi) are also a pair of conjugate complex numbers. Then frequency point (c+di) and frequency point (e+fi) can be regarded as frequency points related to the first frequency point type, and frequency point (c-di) and frequency point (e-fi) as frequency points related to the second frequency point type. In this case, the audio data frame corresponding to audio data segment i can be determined based on the DC component frequency point (a+bi), frequency point (c+di) and frequency point (e+fi), or based on the DC component frequency point (a+bi), frequency point (c-di) and frequency point (e-fi).
  • Further, the user terminal can determine the target audio data frame and the K historical audio data frames before the target audio data frame among the obtained H audio data frames, where the target audio data frame and the K historical audio data frames are all spectrum frames. The target audio data frame can be any audio data frame to be processed among the H audio data frames, each of the K historical audio data frames is a spectrum frame before the target audio data frame, and K is a positive integer less than H. The value of K is not limited in this embodiment of the application. Please refer to Figure 4b again: assuming that audio data frame 4 is the target audio data frame, the spectrum frames before audio data frame 4 include audio data frame 1, audio data frame 2 and audio data frame 3. When the number of historical audio data frames before the current audio data frame is less than K, the number of historical audio data frames before that audio data frame can be padded to K through a zero-padding operation.
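  • As an illustrative sketch only (not the authoritative implementation), the framing, windowing and transform steps described above can be expressed in Python roughly as follows. The Hann window is an assumption; the 20 ms frame length (320 samples at 16 kHz), the 12 ms frame shift (192 samples) and the 512-point FFT follow the example values given in this description, and rfft is used so that only the DC component and the non-redundant half of the conjugate-symmetric frequency points are kept:

```python
import numpy as np

def frames_to_spectra(signal, frame_len=320, hop=192, n_fft=512):
    """Split a time-domain signal into windowed frames and return, for each
    frame, the non-redundant half of its spectrum (DC bin + S bins)."""
    window = np.hanning(frame_len)                      # assumed analysis window
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        segment = signal[start:start + frame_len] * window
        # An n_fft-point FFT yields 1 DC bin plus 2S conjugate-symmetric bins;
        # rfft keeps only the DC bin and the first half of the symmetric bins.
        spectra.append(np.fft.rfft(segment, n=n_fft))
    return np.stack(spectra)                            # shape: (H, n_fft // 2 + 1)
```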
  • Step S102 When N target cepstrum coefficients of the target audio data frame are obtained, M first-order time derivatives and M second-order time derivatives associated with the target audio data frame are obtained based on the N target cepstrum coefficients.
  • N cepstrum coefficients that can characterize the acoustic features of the target audio data frame can be obtained, and these N cepstrum coefficients can be collectively referred to as the target cepstrum coefficients. Further, M first-order time derivatives, M second-order time derivatives and spectrum dynamic characteristics that can characterize the time correlation characteristics between different speech signals can be obtained, where N is a positive integer greater than 1 and M is a positive integer less than N. The following will elaborate on the acquisition process of the target cepstrum coefficients, first-order time derivatives and second-order time derivatives associated with the target audio data frame.
  • the specific process of obtaining the N target cepstrum coefficients of the target audio data frame can be: assume that the target audio data frame contains a total of S1 frequency points, and these S1 frequency points include a DC component frequency point and S2 frequency points related to a frequency point type. The user terminal can map these S1 (for example, 256+1) frequency points to N (for example, 56) acoustic frequency bands, where S1 is greater than or equal to N. That is to say, the frequencies of the S1 frequency points are divided into coarser frequency scales (i.e., the acoustic frequency bands in the embodiments of the present application), thereby reducing the complexity of subsequent calculations.
  • cepstrum processing can be performed on each acoustic frequency band separately to obtain the target cepstrum coefficient corresponding to each acoustic frequency band.
  • N acoustic frequency bands include acoustic frequency band j, and j is a positive integer less than or equal to N.
  • the specific process of performing cepstrum processing on acoustic frequency band j can be:
  • First, the frequency band energy of the acoustic frequency band j can be obtained. Specifically, triangular filtering can be performed on the frequency point data mapped to the acoustic frequency band j to obtain the frequency band energy corresponding to that acoustic frequency band. For example, a triangular filter (for example, triangular filter j) corresponding to the acoustic frequency band j can be obtained from a triangular filter group containing N triangular filters, and then each filter point in the triangular filter j can be applied to the frequency point at the corresponding position in the acoustic frequency band j, so that the frequency band energy of the acoustic frequency band j can be obtained. Subsequently, logarithmic transformation can be performed on the frequency band energy of the acoustic frequency band j to obtain the logarithmic frequency band energy of the acoustic frequency band j, and then a Discrete Cosine Transform (DCT) can be performed on the logarithmic band energy of the acoustic frequency band j, so that the target cepstrum coefficient corresponding to the acoustic frequency band j can be obtained.
  • N target cepstrum coefficients of the target audio data frame can be obtained.
  • the process of obtaining the cepstrum coefficients of other audio data frames is the same as the process of obtaining the target cepstral coefficients, and will not be described again here.
  • In contrast, the existing technology directly estimates a value for each frequency bin (which can be understood as the interval between samples in the frequency domain) through a neural network, which leads to great computational complexity. Therefore, this application does not deal directly with individual samples or spectrum bins. Assuming that the spectral envelopes of speech and noise are flat enough, a coarser resolution than the frequency bin can be used, that is, the frequencies of the frequency points are divided into a coarser frequency scale to reduce computational complexity. In the embodiment of the present application, this coarser frequency scale is called the acoustic frequency band, and different acoustic frequency bands can be used to characterize the nonlinear characteristics of the human ear's perception of sound.
  • the acoustic frequency band in the embodiment of the present application may be Bark frequency scale, Mel frequency scale or other frequency scale, which is not limited here.
  • The Bark frequency scale is a frequency scale that matches the human ear's perception of sound: it maps frequencies in Hz to 24 critical frequency bands of psychoacoustics (the 25th critical frequency band occupies frequencies of about 16 kHz to 20 kHz), and the width of one critical frequency band is equal to one Bark. In other words, the Bark frequency scale converts physical frequencies into psychoacoustic frequencies.
  • If the (complex) spectrum values of all S1 (for example, 257) frequency points need to be considered, the amount of data subsequently sent to the target mask estimation model will be very large. Therefore, the embodiment of this application uses the characteristics of the frequency band envelope to re-divide the S1 frequency points into N acoustic frequency bands (bands), thereby reducing the amount of calculation.
  • the Bark domain may be approximated by various approximation functions. Assume that the sampling rate is 16kHz, the window length is 512, the frame length is 20ms, and the frame shift is 12ms.
  • For example, an audio data frame after the Fourier transform contains 257 frequency points. Based on the set frequency band approximation function, these 257 frequency points can be divided into 56 acoustic frequency bands, for example: frequency point 0 (i.e., the first frequency point, which is the DC component frequency point) is divided into the 1st acoustic frequency band, frequency point 1 (i.e., the second frequency point) is divided into the 2nd acoustic frequency band, ..., frequency points 232 to 255 are divided into the 55th acoustic frequency band, and frequency point 256 is divided into the 56th acoustic frequency band. After cepstrum processing (that is, performing logarithmic transformation on the frequency band energy of each acoustic frequency band and then performing DCT), the cepstrum coefficients obtained on the Bark frequency scale are Bark-Frequency Cepstrum Coefficients (BFCC).
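  • The band-energy, logarithm and DCT steps can be sketched as follows. This is a simplified illustration: the 56 band edges are placeholder values (the actual edges depend on the band approximation function used), and the triangular-filter weighting is replaced by a plain per-band power sum for brevity:

```python
import numpy as np
from scipy.fftpack import dct

# Hypothetical band edges: 257 FFT bins grouped into 56 Bark-like bands.
# The real edges follow the chosen band approximation function.
BAND_EDGES = np.linspace(0, 257, 57).astype(int)

def bfcc_from_spectrum(spectrum_frame):
    """Map one 257-bin spectrum frame to 56 band energies and return the
    Bark-frequency cepstral coefficients (log band energy -> DCT)."""
    power = np.abs(spectrum_frame) ** 2
    band_energy = np.array([power[BAND_EDGES[j]:BAND_EDGES[j + 1]].sum()
                            for j in range(56)])
    log_energy = np.log(band_energy + 1e-10)         # avoid log(0)
    return dct(log_energy, type=2, norm='ortho')     # 56 BFCCs
```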
  • the first-order time derivative and the second-order time derivative of these N target cepstrum coefficients are also considered.
  • the specific process for the user terminal to obtain M first-order time derivatives and M second-order time derivatives associated with the target audio data frame based on the N target cepstrum coefficients can be as follows: a difference operation is performed on the N target cepstrum coefficients to obtain (N-1) difference operation values, each of which can be used as a first-order time derivative, and the M first-order time derivatives associated with the target audio data frame can be obtained from these (N-1) first-order time derivatives. Similarly, a second difference operation can be performed on the obtained (N-1) first-order time derivatives to obtain (N-2) difference operation values, each of which can be used as a second-order time derivative, and the M second-order time derivatives associated with the target audio data frame can then be obtained from the (N-2) second-order time derivatives. For example, M can be set to 6.
  • Figure 5 is a schematic diagram of a cepstrum coefficient difference operation scenario provided by an embodiment of the present application.
  • Assume that an audio data frame corresponds to 56 cepstrum coefficients (such as BFCC), which are cepstrum coefficient 1, cepstrum coefficient 2, cepstrum coefficient 3, ..., cepstrum coefficient 54, cepstrum coefficient 55 and cepstrum coefficient 56. By performing a difference operation on cepstrum coefficient 1 and cepstrum coefficient 2, the first-order time derivative 1 can be obtained; by performing a difference operation on cepstrum coefficient 2 and cepstrum coefficient 3, the first-order time derivative 2 can be obtained; ...; by performing a difference operation on cepstrum coefficient 54 and cepstrum coefficient 55, the first-order time derivative 54 can be obtained; and by performing a difference operation on cepstrum coefficient 55 and cepstrum coefficient 56, the first-order time derivative 55 can be obtained. Further, a second difference operation can be performed on the obtained first-order time derivative 1 to first-order time derivative 55. For example, a second difference operation can be performed on the first-order time derivative 1 and the first-order time derivative 2 (for example, first-order time derivative 2 - first-order time derivative 1) to obtain the second-order time derivative 1; ...; and a second difference operation can be performed on the first-order time derivative 54 and the first-order time derivative 55 to obtain the second-order time derivative 54.
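  • A minimal sketch of these difference operations, assuming the M derivatives that are kept are simply the first M of the (N-1) and (N-2) difference values:

```python
import numpy as np

def derivative_features(bfcc, m=6):
    """Compute M first-order and M second-order time derivatives from the
    N cepstral coefficients of a single frame via neighbouring differences."""
    first = np.diff(bfcc)           # (N-1) first-order difference values
    second = np.diff(first)         # (N-2) second-order difference values
    return first[:m], second[:m]    # keep the first M of each (assumption)
```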
  • Step S103 Obtain N historical cepstrum coefficients corresponding to each historical audio data frame, and determine the spectrum dynamic characteristics associated with the target audio data frame based on the obtained K*N historical cepstral coefficients.
  • In other words, based on the frequency band difference values corresponding to the past K historical audio data frames (i.e., the audio data frames other than the current audio data frame), a stationarity measure associated with the current audio data frame can be obtained.
  • the user terminal can use its cache (such as a ring buffer structure) to store the N historical cepstrum coefficients corresponding to each of the latest K historical audio data frames before the current target audio data frame. Each time the target audio data frame is updated to a subsequent audio data frame, the historical cepstrum coefficients in the cache are updated accordingly.
  • Specifically, any two adjacent historical audio data frames among the K historical audio data frames can be obtained as the first historical audio data frame and the second historical audio data frame, where the second historical audio data frame is the spectrum frame obtained after the first historical audio data frame. Then, the N historical cepstrum coefficients corresponding to the first historical audio data frame and the N historical cepstrum coefficients corresponding to the second historical audio data frame can be obtained from the cache related to the target audio data frame (for example, the local cache of the user terminal).
  • When determining the spectrum dynamic characteristics associated with the target audio data frame, the method specifically includes: using the N coefficient difference values between the N historical cepstrum coefficients corresponding to the first historical audio data frame and the N historical cepstrum coefficients corresponding to the second historical audio data frame as the inter-frame difference value between the first historical audio data frame and the second historical audio data frame; and determining the spectrum dynamic characteristics associated with the target audio data frame based on the K-1 inter-frame difference values between adjacent historical audio data frames in the K historical audio data frames.
  • Specifically, the N historical cepstrum coefficients corresponding to the acquired first historical audio data frame can be used as the first historical cepstrum coefficients, and the N historical cepstrum coefficients corresponding to the acquired second historical audio data frame can be used as the second historical cepstrum coefficients. Then, the frequency band difference value between the first historical cepstrum coefficients and the second historical cepstrum coefficients can be used as the inter-frame difference value between the first historical audio data frame and the second historical audio data frame. For example, for the historical cepstrum coefficient Lp in the first historical cepstrum coefficients and the historical cepstrum coefficient Lq in the second historical cepstrum coefficients, the coefficient difference value between them can be obtained (for example, historical cepstrum coefficient Lp - historical cepstrum coefficient Lq), and the N coefficient difference values obtained in this way can be determined as the frequency band difference value between the first historical cepstrum coefficients and the second historical cepstrum coefficients. That is, the frequency band difference value includes N coefficient difference values, and this frequency band difference value can then be used as the inter-frame difference value between the first historical audio data frame and the second historical audio data frame.
  • Further, the sum of the N coefficient difference values contained in all the inter-frame difference values can be obtained, and the sum is then averaged (for example, the sum of the difference values / K), so that the corresponding spectrum dynamic characteristics are obtained.
  • Figure 6 is a schematic diagram of a scenario for obtaining inter-frame difference values provided by an embodiment of the present application.
  • Assume that there are K (for example, 8) historical audio data frames, and each historical audio data frame corresponds to N (for example, 56) historical cepstrum coefficients: historical audio data frame 1 corresponds to cepstrum coefficient A1 to cepstrum coefficient A56, historical audio data frame 2 corresponds to cepstrum coefficient B1 to cepstrum coefficient B56, ..., historical audio data frame 7 corresponds to cepstrum coefficient C1 to cepstrum coefficient C56, and historical audio data frame 8 corresponds to cepstrum coefficient D1 to cepstrum coefficient D56. Taking historical audio data frame 1 and historical audio data frame 2 as an example, the first historical cepstrum coefficients include cepstrum coefficient A1 to cepstrum coefficient A56, and the second historical cepstrum coefficients include cepstrum coefficient B1 to cepstrum coefficient B56. The coefficient difference value AB1 between cepstrum coefficient A1 and cepstrum coefficient B1 can be obtained (for example, cepstrum coefficient A1 - cepstrum coefficient B1), the coefficient difference value between cepstrum coefficient A2 and cepstrum coefficient B2 can be obtained, and so on, until the N coefficient difference values between the two frames are obtained.
  • It should be noted that when the number of historical audio data frames is insufficient, the embodiment of the present application can obtain K historical audio data frames through a zero-padding operation, where the historical audio data frames obtained by the zero-padding operation are all-zero spectrum frames. Correspondingly, the N cepstrum coefficients corresponding to this type of all-zero spectrum frame can also be set to zero values.
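  • The ring-buffer bookkeeping and the averaged inter-frame difference can be sketched as follows; K=8 and N=56 follow the Figure 6 example, the zero-initialized history mirrors the zero-padding behaviour described above, and the normalization (sum of all coefficient differences divided by K) follows the description of the spectrum dynamic characteristics:

```python
import collections
import numpy as np

class SpectralDynamics:
    """Keep the BFCCs of the latest K frames and compute the averaged
    inter-frame difference used as the spectral dynamic feature."""
    def __init__(self, k=8, n=56):
        # Frames that do not exist yet behave like all-zero spectrum frames
        # whose cepstral coefficients are all zero.
        self.history = collections.deque([np.zeros(n)] * k, maxlen=k)

    def update(self, bfcc):
        self.history.append(np.asarray(bfcc))
        frames = list(self.history)
        # K-1 inter-frame difference values, each holding N coefficient differences.
        diffs = [frames[t] - frames[t + 1] for t in range(len(frames) - 1)]
        # Sum all coefficient differences, then average over K.
        return float(np.sum(diffs) / len(frames))
```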
  • Step S104 Input the N target cepstrum coefficients, M first-order time derivatives, M second-order time derivatives and spectrum dynamic characteristics into the target mask estimation model, and output the target mask corresponding to the target audio data frame through the target mask estimation model.
  • At present, intelligent speech enhancement technology based on machine learning/deep learning is inspired by the concept of Time-Frequency (T-F) masking in Computational Auditory Scene Analysis (CASA), and the training target is defined on the T-F representation of the speech signal, such as the spectrogram calculated from the short-time Fourier transform.
  • These training targets are mainly divided into two categories: one is based on masking, such as the IRM, which describes the time-frequency relationship between pure speech and background noise; the other is based on mapping, such as the logarithmic power spectrum, which corresponds to the spectral representation of clean speech. In the embodiment of the present application, the former method is adopted: the nonlinear fitting ability of the neural network is used to estimate the mask from the input features, the mask is then multiplied with the spectrum of the noisy speech signal (i.e., the original audio data in the embodiment of the present application), and the time-domain waveform is reconstructed to achieve the purpose of enhancement.
  • the user terminal can use the N target cepstrum coefficients, M first-order time derivatives, M second-order time derivatives and spectrum dynamic characteristics obtained in the above steps, a total of (N+2M+1) features, as the target audio features of the target audio data frame, and the target audio features can be input to the target mask estimation model for mask estimation.
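  • Assembling the (N+2M+1)-dimensional target audio feature (for example, 56+6+6+1 = 69 features) can be as simple as a concatenation, as in this illustrative snippet:

```python
import numpy as np

def assemble_features(bfcc, d1, d2, spectral_dynamics):
    """Concatenate N cepstral coefficients, M first-order derivatives,
    M second-order derivatives and the spectral dynamic scalar into one
    (N + 2M + 1)-dimensional feature vector."""
    return np.concatenate([bfcc, d1, d2, [spectral_dynamics]])
```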
  • the target mask estimation model based on neural networks has extremely strong nonlinear fitting capabilities. Therefore, the initial mask estimation model can be trained to learn how to calculate the mask from the noisy audio features.
  • The specific process of model training can be seen in the embodiment corresponding to Figure 8 below.
  • In some embodiments, the target mask estimation model may include a mask estimation network layer and a mask output layer. The obtained target audio features may first be input to the mask estimation network layer, which performs mask estimation on the input target audio features to obtain the hidden features corresponding to the target audio features. Further, the hidden features can be input to the mask output layer, and the hidden features can be merged through the mask output layer, so that a target mask corresponding to the target audio data frame can be obtained, where the length of the target mask is N (the same as the number of acoustic frequency bands divided above).
  • the target mask can be used to suppress noise data in the original audio data to obtain enhanced audio data corresponding to the original audio data.
  • the mask may also be called gain or band gain, and may include but is not limited to the Ideal Ratio Mask (IRM), the Ideal Binary Mask (IBM), the Optimal Ratio Mask (ORM), etc. Among them, the IRM directly depicts the ratio of pure speech energy to noisy speech energy in a time-frequency unit, and the value of the IRM can be between 0 and 1; the larger the value, the higher the proportion of the target speech in the time-frequency unit.
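  • For illustration only, a per-band IRM computed as the ratio of clean-speech band energy to noisy-speech band energy could look like the sketch below; note that a square-root variant of the IRM is also commonly used, and the exact definition adopted for training is not limited here:

```python
import numpy as np

def ideal_ratio_mask(clean_band_energy, noisy_band_energy, eps=1e-10):
    """Per-band IRM: ratio of clean-speech energy to noisy-speech energy,
    clipped to the range (0, 1)."""
    irm = clean_band_energy / (noisy_band_energy + eps)
    return np.clip(irm, 0.0, 1.0)
```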
  • the above-mentioned mask estimation network layer may include a first mask estimation network layer, a second mask estimation network layer and a third mask estimation network layer with skip connections, wherein the skip connections between the first mask estimation network layer, the second mask estimation network layer and the third mask estimation network layer can avoid network overfitting.
  • the specific process of mask estimation of the input target audio features through the mask estimation network layer can be: first, the target audio features are input to the first mask estimation network layer, and the first intermediate feature is output through the first mask estimation network layer; then, the first intermediate feature and the target audio feature are feature-spliced according to the skip connection between the first mask estimation network layer and the second mask estimation network layer to obtain the second intermediate feature, the second intermediate feature is input to the second mask estimation network layer, and the third intermediate feature is output through the second mask estimation network layer; finally, according to the skip connection between the first mask estimation network layer and the third mask estimation network layer and the skip connection between the second mask estimation network layer and the third mask estimation network layer, the third intermediate feature, the target audio feature and the first intermediate feature are spliced to obtain the fourth intermediate feature, the fourth intermediate feature is input to the third mask estimation network layer, and the hidden features corresponding to the target audio features are output through the third mask estimation network layer.
  • the target mask estimation model may also use more or fewer mask estimation network layers.
  • the embodiments of this application do not limit the specific number of mask estimation network layers.
  • In some embodiments, the first mask estimation network layer, the second mask estimation network layer and the third mask estimation network layer may use network structures such as Gated Recurrent Units (GRU) or Long Short-Term Memory (LSTM) networks, and the mask output layer may use a fully connected layer or other network structures. A GRU is a gating mechanism in recurrent neural networks. It is similar to an LSTM with a forget gate: a GRU contains an update gate and a reset gate, but has no output gate and therefore has fewer parameters than an LSTM. Consequently, if GRUs are used to design the mask estimation network layer, a lightweight mask estimation model can be obtained.
  • Figure 7 is a schematic network structure diagram of a mask estimation model provided by an embodiment of the present application.
  • As shown in Figure 7, the mask estimation model 70 (i.e., the target mask estimation model) can include gated recurrent network layer 1 (i.e., the first mask estimation network layer), gated recurrent network layer 2 (i.e., the second mask estimation network layer), gated recurrent network layer 3 (i.e., the third mask estimation network layer) and a fully connected layer (i.e., the mask output layer). Three layers of simple GRU neural networks are used to model the audio features, and the last fully connected layer is used to output the gain (i.e., the mask). The number of features corresponding to each audio data frame input to the model can be 69, and the numbers of nodes (also called neurons or perceptrons) of the three gated recurrent network layers can be 64, 96 and 96, respectively.
  • That is, the feature dimension of the first intermediate feature output by gated recurrent network layer 1 is 64, the feature dimension of the third intermediate feature output by gated recurrent network layer 2 is 96, and the feature dimension of the hidden feature output by gated recurrent network layer 3 is 96. The number of nodes in the fully connected layer can be 56, so the dimension of the finally output mask is 56 (that is, 56 mask values are output).
  • In addition, each network layer can use an appropriate activation function. For example, gated recurrent network layer 1 can use the ReLU function (Rectified Linear Unit), gated recurrent network layer 2 can also use the ReLU function, gated recurrent network layer 3 can use the tanh function (hyperbolic tangent function), and the fully connected layer can use the sigmoid function to ensure that the value range of the output mask is (0, 1).
  • Each of the above network layers may also use other functions as activation functions, which are not limited in the embodiments of the present application.
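  • A sketch of the described three-layer GRU structure with skip connections, written with the Keras API purely for illustration. The 64/96/96 node counts, the ReLU/ReLU/tanh/sigmoid activations, the 69 input features and the 56 output gains follow the Figure 7 example; the framework choice and variable names are assumptions, not part of the original disclosure:

```python
from tensorflow.keras import layers, Model

def build_mask_estimator(n_features=69, n_bands=56):
    """Three stacked GRU layers with skip connections and a dense sigmoid
    output, mirroring the 64/96/96-unit structure described above."""
    x_in = layers.Input(shape=(None, n_features))            # (batch, time, 69)
    h1 = layers.GRU(64, activation='relu', return_sequences=True)(x_in)
    h2 = layers.GRU(96, activation='relu', return_sequences=True)(
        layers.Concatenate()([h1, x_in]))                     # skip: input -> layer 2
    h3 = layers.GRU(96, activation='tanh', return_sequences=True)(
        layers.Concatenate()([h2, x_in, h1]))                 # skips: input, layer 1 -> layer 3
    gains = layers.Dense(n_bands, activation='sigmoid')(h3)   # 56 band gains in (0, 1)
    return Model(x_in, gains)
```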
  • It can be seen that the target mask estimation model used in the embodiments of this application is a lightweight neural network: the three-layer mask estimation network layer can achieve good mask estimation effects with a small number of network parameters and low network complexity, which can reduce calculation time and CPU consumption.
  • It can be understood that the energy of noisy speech must be greater than the energy of pure speech. In the embodiment of the present application, the frequency domain is divided into N acoustic frequency bands to calculate the energy. For each acoustic frequency band, the less noise the frequency band contains and the purer the speech, the greater the frequency band gain. Based on this, for noisy speech, each acoustic frequency band is multiplied by a gain. The physical meaning is that when an acoustic frequency band is noisy, it can be multiplied by a smaller gain, and when it is clean, by a larger gain, which can enhance speech and suppress noise.
  • Therefore, after obtaining the target mask (i.e., the frequency band gain) through the above steps, the user terminal can use the target mask to perform noise suppression. The specific process can be: when the length of the target mask (i.e., N) is less than the length of the target audio data frame (i.e., S1), the target mask needs to be interpolated to obtain the corresponding interpolation mask; at this time, the length of the obtained interpolation mask is the same as the length of the target audio data frame (for example, 257 frequency points). Then, the interpolation mask can be multiplied by the target audio data frame, that is, each mask value in the interpolation mask is applied to the corresponding frequency point obtained by the Fourier transform in the target audio data frame, and the multiplication result can then be subjected to an inverse Fourier transform, thereby obtaining the target audio data after noise suppression of the target audio data frame, that is, the enhanced time-domain speech signal is restored.
  • When noise suppression is performed in this way on each audio data frame associated with the original audio data, the enhanced audio data corresponding to the original audio data can be obtained. The noise content in the enhanced audio data obtained at this time is very low, and the voice data of the business object is not mistakenly eliminated, so the enhanced audio data has extremely high voice fidelity.
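  • The interpolation of the band gains up to the frequency points, the multiplication with the noisy spectrum and the inverse transform can be sketched as follows. Linear interpolation at assumed band centres is used here purely for illustration, and the overlap-add of consecutive enhanced frames back into a continuous signal is omitted:

```python
import numpy as np

def apply_band_gains(spectrum_frame, band_gains, band_edges, n_fft=512):
    """Interpolate the 56 band gains up to the 257 FFT bins, multiply the
    noisy spectrum by the interpolated mask and return one enhanced
    time-domain frame."""
    band_centres = (band_edges[:-1] + band_edges[1:]) / 2.0
    bin_gains = np.interp(np.arange(len(spectrum_frame)),
                          band_centres, band_gains)       # one gain per bin
    enhanced_spectrum = spectrum_frame * bin_gains
    return np.fft.irfft(enhanced_spectrum, n=n_fft)        # back to time domain
```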
  • It can be seen that the embodiment of the present application can comprehensively consider a variety of audio features, including the target cepstrum coefficients, first-order time derivatives, second-order time derivatives and spectrum dynamic characteristics, so that the mask can be estimated more accurately. In this way, the noise data in the audio data can be effectively suppressed and the speech fidelity can be improved; especially in real-time audio and video communication scenarios (for example, real-time audio and video conference scenarios), it can provide users with high-quality and high-definition voice to enhance the user experience.
  • In addition, the embodiment of the present application first uses digital signal processing technology to extract the corresponding audio features from the noisy audio data, and then inputs the extracted audio features into a lightweight neural network model (i.e., the target mask estimation model) to quickly perform mask estimation. Therefore, the network complexity required by the embodiment of the present application is lower, which can reduce the computational complexity and the consumption of the CPU (Central Processing Unit), thereby improving audio data processing efficiency.
  • FIG. 8 is a schematic flowchart of an audio data processing method provided by an embodiment of the present application.
  • the computer equipment here includes but is not limited to a user terminal or a service server, such as the user terminal 200a, user terminal 200b, user terminal 200c, ..., user terminal 200n or the service server 100 shown in Figure 1. The embodiment of the present application takes the computer device as a user terminal as an example to illustrate the specific process of model training of the initial mask estimation model in the user terminal.
  • the method may at least include the following steps S201 to S205:
  • Step S201 Obtain the target sample audio data frame and K historical sample audio data frames associated with the sample audio data, and obtain the sample mask corresponding to the target sample audio data frame.
  • the user terminal can obtain sample audio data from an audio database with massive audio data.
  • the sample audio data here can be a noisy speech signal (such as audio data carrying Babble Noise and speech data of a sample object).
  • the target sample audio data frame and K historical sample audio data frames associated with the sample audio data can be obtained by performing operations such as framing and windowing preprocessing and time-frequency transformation on the sample audio data.
  • the target sample audio data frame and K historical sample audio data frames are both spectrum frames, and each historical sample audio data frame among the K historical sample audio data frames is the spectrum frame before the target sample audio data frame, K is a positive integer.
  • For the specific implementation, please refer to step S101 in the embodiment corresponding to Figure 3, which will not be described again here.
  • the user terminal can also obtain the sample mask corresponding to the target sample audio data frame.
  • Step S202 When the N target sample cepstrum coefficients of the target sample audio data frame are obtained, M sample first-order time derivatives and M sample second-order time derivatives associated with the target sample audio data frame are obtained based on the N target sample cepstrum coefficients.
  • Specifically, the user terminal can map the multiple frequency points contained in the target sample audio data frame to the divided N sample acoustic frequency bands, and perform cepstrum processing on each sample acoustic frequency band to obtain the target sample cepstrum coefficient corresponding to each sample acoustic frequency band. N is a positive integer greater than 1, and M is a positive integer less than N.
  • Step S203 Obtain N historical sample cepstral coefficients corresponding to each historical sample audio data frame, and determine the sample spectrum dynamic characteristics associated with the target sample audio data frame based on the obtained K*N historical sample cepstrum coefficients.
  • Specifically, the user terminal can obtain the N historical sample cepstrum coefficients corresponding to any two adjacent historical sample audio data frames among the K historical sample audio data frames, then determine the sample inter-frame difference value between these two adjacent historical sample audio data frames based on the two sets of historical sample cepstrum coefficients obtained, and finally determine the sample spectrum dynamic characteristics associated with the target sample audio data frame based on the K-1 sample inter-frame difference values.
  • For the specific implementation of this step, please refer to step S103 in the embodiment corresponding to Figure 3, which will not be described again here.
  • Step S204 Input the N target sample cepstrum coefficients, M sample first-order time derivatives, M sample second-order time derivatives and sample spectrum dynamic characteristics into the initial mask estimation model, and output the prediction mask corresponding to the target sample audio data frame through the initial mask estimation model.
  • The user terminal can use the obtained N target sample cepstrum coefficients, M sample first-order time derivatives, M sample second-order time derivatives and sample spectrum dynamic characteristics as the sample audio features of the target sample audio data frame, and the sample audio features can then be input to the initial mask estimation model, which outputs the prediction mask corresponding to the target sample audio data frame. For the specific implementation of this step, please refer to step S104 in the embodiment corresponding to Figure 3, which will not be described again here.
  • Step S205 Iteratively train the initial mask estimation model based on the prediction mask and the sample mask to obtain a target mask estimation model.
  • The target mask estimation model is used to output the target mask corresponding to the target audio data frame associated with the original audio data. Specifically, the user terminal can generate a loss function based on the prediction mask and the sample mask, and then modify the model parameters in the initial mask estimation model based on the loss function. Through multiple iterative trainings, a target mask estimation model that can accurately output the mask corresponding to the audio data frame associated with the original audio data can finally be obtained.
  • In some embodiments, the loss function used in model training can be the Huber loss (a parameterized loss function used for regression problems). In its standard form, the loss can be described as follows: loss = 0.5·(g_true − g_pred)² when |g_true − g_pred| ≤ d, and loss = d·(|g_true − g_pred| − 0.5·d) otherwise, where g_true refers to the sample mask, g_pred refers to the prediction mask, and the hyperparameter d in the loss can be set to 0.1.
  • Other loss functions can also be used, which is not limited in the embodiments of this application.
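  • A minimal sketch of the Huber loss in its standard form with d = 0.1, applied elementwise to the sample mask and the prediction mask and averaged over all band gains:

```python
import numpy as np

def huber_loss(g_true, g_pred, d=0.1):
    """Elementwise Huber loss between the sample mask and the predicted
    mask, averaged over all band gains (d is the hyperparameter above)."""
    err = np.abs(g_true - g_pred)
    quadratic = 0.5 * err ** 2
    linear = d * (err - 0.5 * d)
    return np.mean(np.where(err <= d, quadratic, linear))
```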
  • Figure 9 is a schematic flowchart of model training provided by an embodiment of the present application.
  • a neural network model can be designed to model the extracted audio features to obtain the corresponding mask, which is used to suppress the background noise in the audio data.
  • Specifically, speech preprocessing 902 (i.e., audio preprocessing) can be performed on the sample audio data to obtain multiple associated sample audio data frames, and audio feature extraction 903 can be performed on each sample audio data frame in turn to obtain the sample audio features corresponding to each sample audio data frame. These sample audio features can then be input into the initial mask estimation model for network model training 904, so as to obtain the intermediate mask estimation model.
  • Further, voice preprocessing 906 can be performed on the obtained test voice 905 (which can also be called test audio data and can be obtained together with the sample audio data) to obtain multiple test audio data frames, and audio feature extraction 907 can be performed on each test audio data frame in turn to obtain the test audio features corresponding to each test audio data frame. These test audio features can be input into the intermediate mask estimation model respectively, the corresponding masks can be output through the intermediate mask estimation model, and the obtained masks can be applied to the corresponding test audio data frames to obtain the spectra after background noise suppression. The resulting spectra can then be subjected to an inverse Fourier transform, and the time-domain speech signal 908 can be reconstructed, thereby achieving the purpose of speech enhancement 909.
  • the intermediate mask estimation model can be used as a target mask estimation model that can be used directly in the future.
  • FIG. 10 is a schematic diagram of a noise reduction effect provided by an embodiment of the present application.
  • For example, the spectrum corresponding to a piece of speech containing Babble Noise is the spectrum 10A in Figure 10. After the speech is enhanced by the method provided in this application, the spectrum corresponding to the enhanced speech obtained is the spectrum 10B in Figure 10. From the comparison of the two spectra before and after, it can be seen that the method provided by this application can effectively suppress background noise such as Babble Noise while retaining relatively complete speech.
  • It can be seen that the embodiment of the present application can obtain the target mask estimation model for outputting the mask corresponding to the audio data frame by training the initial mask estimation model. Since the model is a lightweight neural network model, the computational complexity in the speech enhancement process can be reduced, the size of the installation package in the implementation scenario can be controlled, and CPU consumption can be reduced. In addition, the trained target mask estimation model can automatically and quickly output the estimated mask, which can improve the efficiency of speech enhancement processing of audio data.
  • the audio data processing device 1 can be a computer program (including program code) running on a computer device.
  • For example, the audio data processing device 1 may be application software; the device can be used to execute the corresponding steps in the audio data processing method provided by the embodiments of the present application.
  • the audio data processing device 1 may include: a first acquisition module 11, a second acquisition module 12, a third acquisition module 13, a mask estimation module 14, a frequency band mapping module 15, a cepstrum processing module 16, Noise suppression module 17;
  • the first acquisition module 11 is used to acquire the target audio data frame and K historical audio data frames associated with the original audio data; the target audio data frame and the K historical audio data frames are both spectrum frames, and the K historical audio data frames Each historical audio data frame in the frame is the spectrum frame before the target audio data frame, and K is a positive integer.
  • the first acquisition module 11 may include: audio preprocessing unit 111, time-frequency transformation unit 112, and data frame determination unit 113;
  • the audio preprocessing unit 111 is used to perform frame-based windowing preprocessing on the original audio data to obtain H audio data segments; H is a positive integer greater than 1;
  • the time-frequency transformation unit 112 is used to perform time-frequency transformation on each audio data segment separately to obtain the audio data frame corresponding to each audio data segment;
  • the H audio data segments include audio data segment i, where i is a positive integer less than or equal to H;
  • the time-frequency transformation unit 112 is specifically used to perform Fourier transform on the audio data segment i to obtain the DC component frequency point and 2S frequency points of the audio data segment i in the frequency domain, where the 2S frequency points include S frequency points related to the first frequency point type and S frequency points related to the second frequency point type, and S is a positive integer; and to determine, based on the S frequency points related to the first frequency point type and the DC component frequency point, the audio data frame corresponding to the audio data segment i;
  • the data frame determination unit 113 is used to determine the target audio data frame and the K historical audio data frames before the target audio data frame among the obtained H audio data frames; K is smaller than H.
  • For the specific functional implementation of the audio preprocessing unit 111, the time-frequency transformation unit 112 and the data frame determination unit 113, please refer to step S101 in the embodiment corresponding to Figure 3, which will not be described again here.
  • the second acquisition module 12 is configured to acquire, when the N target cepstrum coefficients of the target audio data frame are acquired, M first-order time derivatives and M second-order time derivatives associated with the target audio data frame based on the N target cepstrum coefficients; N is a positive integer greater than 1, and M is a positive integer less than N;
  • the second acquisition module 12 may include: a first difference unit 121 and a second difference unit 122;
  • the first difference unit 121 is used to perform a difference operation on the N target cepstrum coefficients to obtain (N-1) difference operation values, and use each difference operation value among the (N-1) difference operation values as a First-order time derivative, obtain M first-order time derivatives associated with the target audio data frame among (N-1) first-order time derivatives;
  • the second difference unit 122 is used to perform secondary difference operations on (N-1) first-order time derivatives to obtain (N-2) difference operation values, and convert (N-2) Each of the differential operation values is used as a second-order time derivative, and M second-order time derivatives associated with the target audio data frame are obtained among the (N-2) second-order time derivatives.
  • For the specific functional implementation of the first difference unit 121 and the second difference unit 122, please refer to step S102 in the embodiment corresponding to Figure 3, which will not be described again here.
  • the third acquisition module 13 is used to obtain N historical cepstrum coefficients corresponding to each historical audio data frame, and determine the spectrum dynamic characteristics associated with the target audio data frame based on the acquired K*N historical cepstral coefficients;
  • the third acquisition module 13 may include: a data frame acquisition unit 131, a coefficient acquisition unit 132, a difference determination unit 133, and a feature determination unit 134;
  • the data frame acquisition unit 131 is used to acquire any two adjacent historical audio data frames among the K historical audio data frames as the first historical audio data frame and the second historical audio data frame; the second historical audio data frame is The spectrum frame obtained after the first historical audio data frame;
  • the coefficient acquisition unit 132 is configured to acquire the N historical cepstrum coefficients corresponding to the first historical audio data frame and the N historical cepstral coefficients corresponding to the second historical audio data frame in the cache related to the target audio data frame;
  • the difference determination unit 133 is configured to determine the N coefficient difference values between the N historical cepstral coefficients corresponding to the first historical audio data frame and the N historical cepstral coefficients corresponding to the second historical audio data frame, As the inter-frame difference value between the first historical audio data frame and the second historical audio data frame;
  • the difference determination unit 133 may include: a coefficient difference acquisition subunit 1331 and a difference value determination subunit 1332;
  • the coefficient difference acquisition subunit 1331 is used to obtain the historical cepstrum coefficient Lp from the N historical cepstrum coefficients included in the first historical cepstrum coefficients, and obtain the historical cepstrum coefficient Lq from the N historical cepstrum coefficients included in the second historical cepstrum coefficients, where p and q are both positive integers less than or equal to N and p is equal to q; and to obtain the coefficient difference value between the historical cepstrum coefficient Lp and the historical cepstrum coefficient Lq;
  • Difference value determination subunit 1332 configured to determine the frequency band difference value between the first historical cepstral coefficient and the second historical cepstral coefficient based on the coefficient difference value, and use the frequency band difference value as the first historical audio data frame and the second historical audio The inter-frame difference value between data frames.
  • For the specific functional implementation of the coefficient difference acquisition subunit 1331 and the difference value determination subunit 1332, please refer to step S103 in the embodiment corresponding to Figure 3 above, which will not be described again here.
  • the feature determination unit 134 is configured to determine the spectrum dynamic features associated with the target audio data frame based on the K-1 inter-frame difference values between adjacent historical audio data frames in the K historical audio data frames.
  • For the specific functional implementation of the data frame acquisition unit 131, the coefficient acquisition unit 132, the difference determination unit 133 and the feature determination unit 134, please refer to step S103 in the embodiment corresponding to Figure 3 above, which will not be described again here.
  • the mask estimation module 14 is used to input the N target cepstrum coefficients, M first-order time derivatives, M second-order time derivatives and spectrum dynamic characteristics to the target mask estimation model, and output the target mask corresponding to the target audio data frame from the target mask estimation model; the target mask is used to suppress the noise data in the original audio data to obtain the enhanced audio data corresponding to the original audio data;
  • the target mask estimation model includes a mask estimation network layer and a mask output layer;
  • the mask estimation module 14 may include: a mask estimation unit 141 and a mask output unit 142;
  • the mask estimation unit 141 is used to use the N target cepstrum coefficients, M first-order time derivatives, M second-order time derivatives and spectrum dynamic features as the target audio features of the target audio data frame, input the target audio features into the mask estimation network layer, and perform mask estimation on the target audio features through the mask estimation network layer to obtain the hidden features corresponding to the target audio features;
  • the mask estimation network layer includes a first mask estimation network layer, a second mask estimation network layer, and a third mask estimation network layer with skip connections;
  • the mask estimation unit 141 may include: The first estimation sub-unit 1411, the second estimation sub-unit 1412, the third estimation sub-unit 1413;
  • the first estimation subunit 1411 is used to input the target audio features to the first mask estimation network layer, and output the first intermediate features through the first mask estimation network layer;
  • the second estimation subunit 1412 is used to perform feature splicing on the first intermediate feature and the target audio feature according to the skip connection between the first mask estimation network layer and the second mask estimation network layer to obtain the second intermediate feature, Input the second intermediate feature to the second mask estimation network layer, and output the third intermediate feature through the second mask estimation network layer;
  • the third estimation subunit 1413 is used to perform feature splicing on the third intermediate feature, the target audio feature and the first intermediate feature according to the skip connection between the first mask estimation network layer and the third mask estimation network layer and the skip connection between the second mask estimation network layer and the third mask estimation network layer, to obtain the fourth intermediate feature, input the fourth intermediate feature to the third mask estimation network layer, and output the hidden features corresponding to the target audio features through the third mask estimation network layer.
  • For the specific functional implementation of the first estimation subunit 1411, the second estimation subunit 1412 and the third estimation subunit 1413, please refer to step S104 in the embodiment corresponding to Figure 3, which will not be described again here.
  • the mask output unit 142 is used to input the hidden features to the mask output layer, and perform feature merging on the hidden features through the mask output layer, to obtain the target mask corresponding to the target audio data frame.
  • For the specific functional implementation of the mask estimation unit 141 and the mask output unit 142, please refer to step S104 in the embodiment corresponding to Figure 3, which will not be described again here.
  • In some embodiments, the target audio data frame contains S1 frequency points, where the S1 frequency points include a DC component frequency point and S2 frequency points related to a frequency point type, and S1 and S2 are both positive integers;
  • the frequency band mapping module 15 is used to map S1 frequency points to N acoustic frequency bands; S1 is greater than or equal to N;
  • the cepstrum processing module 16 is used to perform cepstrum processing on each acoustic frequency band separately to obtain the target cepstrum coefficient corresponding to each acoustic frequency band;
  • the N acoustic frequency bands include acoustic frequency band j, where j is a positive integer less than or equal to N;
  • the cepstrum processing module 16 may include: an energy acquisition unit 161 and a cosine transform unit 162;
  • the energy acquisition unit 161 is used to acquire the frequency band energy of the acoustic frequency band j, perform logarithmic transformation on the frequency band energy of the acoustic frequency band j, and obtain the logarithmic band energy of the acoustic frequency band j;
  • the cosine transform unit 162 is used to perform discrete cosine transform on the logarithmic band energy of the acoustic frequency band j to obtain the target cepstrum coefficient corresponding to the acoustic frequency band j.
  • For the specific functional implementation of the energy acquisition unit 161 and the cosine transform unit 162, please refer to step S102 in the embodiment corresponding to Figure 3, which will not be described again here.
  • the noise suppression module 17 is used to perform interpolation processing on the target mask to obtain an interpolation mask, where the length of the interpolation mask is the same as the length of the target audio data frame; multiply the interpolation mask and the target audio data frame, and perform an inverse Fourier transform on the multiplication result to obtain the target audio data after noise suppression of the target audio data frame; and, when noise suppression has been performed on each audio data frame associated with the original audio data, obtain the enhanced audio data corresponding to the original audio data based on the target audio data corresponding to each audio data frame.
  • For details, please refer to steps S101 to S104 in the embodiment corresponding to Figure 3 above, which will not be described again here. In addition, the description of the beneficial effects of using the same method will not be described again.
  • the audio data processing device 2 may be a computer program (including program code) running on a computer device.
  • For example, the audio data processing device 2 may be application software; the device can be used to execute the corresponding steps in the audio data processing method provided by the embodiments of the present application.
  • the audio data processing device 2 may include: a first acquisition module 21, a second acquisition module 22, a third acquisition module 23, a mask prediction module 24, and a model training module 25;
  • the first acquisition module 21 is used to obtain the target sample audio data frame and K historical sample audio data frames associated with the sample audio data, and obtain the sample mask corresponding to the target sample audio data frame; the target sample audio data frame and the K historical sample audio data frames are all spectrum frames, each of the K historical sample audio data frames is a spectrum frame before the target sample audio data frame, and K is a positive integer;
  • the second acquisition module 22 is configured to acquire, when the N target sample cepstrum coefficients of the target sample audio data frame are acquired, M sample first-order time derivatives and M sample second-order time derivatives associated with the target sample audio data frame based on the N target sample cepstrum coefficients; N is a positive integer greater than 1, and M is a positive integer less than N;
  • the third acquisition module 23 is used to obtain the N historical sample cepstrum coefficients corresponding to each historical sample audio data frame, and determine the sample spectrum dynamic characteristics associated with the target sample audio data frame based on the obtained K*N historical sample cepstrum coefficients;
  • the mask prediction module 24 is used to input the N target sample cepstrum coefficients, M sample first-order time derivatives, M sample second-order time derivatives and sample spectrum dynamic characteristics into the initial mask estimation model, and output the prediction mask corresponding to the target sample audio data frame through the initial mask estimation model;
  • the model training module 25 is configured to iteratively train the initial mask estimation model based on the prediction mask and the sample mask to obtain a target mask estimation model, where the target mask estimation model is used to output the target mask corresponding to a target audio data frame associated with the original audio data; the target mask is used to suppress noise data in the original audio data to obtain enhanced audio data corresponding to the original audio data.
  • For the specific functional implementation of the first acquisition module 21, the second acquisition module 22, the third acquisition module 23, the mask prediction module 24 and the model training module 25, please refer to steps S201 to S205 in the embodiment corresponding to Figure 8 above, which will not be described again here.
  • the description of the beneficial effects of using the same method will not be described again.
  • the computer device 1000 may include a processor 1001 , a network interface 1004 and a memory 1005 .
  • the computer device 1000 may further include a user interface 1003 and at least one communication bus 1002 .
  • the communication bus 1002 is used to realize connection communication between these components.
  • the user interface 1003 may include a display screen (Display) and a keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface and a wireless interface.
  • The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface).
  • the memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory.
  • the memory 1005 may optionally be at least one storage device located remotely from the aforementioned processor 1001.
  • The memory 1005, which is a computer-readable storage medium, may include an operating system, a network communication module, a user interface module and a device control application program.
  • In the computer device 1000, the network interface 1004 can provide network communication functions, the user interface 1003 is mainly used to provide an input interface for the user, and the processor 1001 can be used to call the device control application program stored in the memory 1005 to execute the description of the audio data processing method in any of the embodiments corresponding to Figure 3 and Figure 8, which will not be described again here. In addition, the description of the beneficial effects of using the same method will not be described again.
  • The embodiment of the present application also provides a computer-readable storage medium, and the computer-readable storage medium stores a computer program executed by the audio data processing device 1 and the audio data processing device 2 mentioned above, where the computer program includes program instructions. When the processor executes the program instructions, it can execute the description of the audio data processing method in any of the embodiments corresponding to Figure 3 and Figure 8, which will therefore not be described again here.
  • the description of the beneficial effects of using the same method will not be described again.
  • For technical details not disclosed in the computer-readable storage medium embodiments involved in this application, please refer to the description of the method embodiments of this application.
  • the above-mentioned computer-readable storage medium may be the audio data processing device provided in any of the foregoing embodiments or an internal storage unit of the above-mentioned computer device, such as a hard disk or memory of the computer device.
  • the computer-readable storage medium can also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device.
  • the computer-readable storage medium may also include both an internal storage unit of the computer device and an external storage device.
  • the computer-readable storage medium is used to store the computer program and other programs and data required by the computer device.
  • the computer-readable storage medium can also be used to temporarily store data that has been output or is to be output.
  • embodiments of the present application also provide a computer program product or computer program.
  • the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the method provided by any one of the corresponding embodiments of FIG. 3 and FIG. 8 .
  • the description of the beneficial effects of using the same method will not be described again.
  • the audio data processing system 3 may include an audio data processing device 1a and an audio data processing device 2a.
  • the audio data processing device 1a can be the audio data processing device 1 in the embodiment corresponding to the above-mentioned Figure 11. It can be understood that the audio data processing device 1a can be integrated in the computer device 20 in the embodiment corresponding to the above-mentioned Figure 2, and therefore will not be described in detail here.
  • the audio data processing device 2a can be the audio data processing device 2 in the embodiment corresponding to the above-mentioned Figure 12. It can be understood that the audio data processing device 2a can be integrated in the computer device 20 in the embodiment corresponding to the above-mentioned Figure 2, and therefore no further details will be given here.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

一种音频数据处理方法、装置、设备、存储介质及程序产品,该方法包括:获取与原始音频数据相关联的目标音频数据帧和K个历史音频数据帧(S101);基于获取到的目标音频数据帧的N个目标倒频谱系数,获取M个一阶时间导数和M个二阶时间导数(S102);获取每个历史音频数据帧对应的N个历史倒频谱系数,基于获取到的K*N个历史倒频谱系数,确定频谱动态特征(S103);将N个目标倒频谱系数、M个一阶时间导数、M个二阶时间导数以及频谱动态特征输入至目标掩码估计模型,得到目标音频数据帧对应的目标掩码(S104);目标掩码用于抑制原始音频数据中的噪声数据,以得到原始音频数据对应的增强音频数据。

Description

音频数据处理方法、装置、设备、存储介质及程序产品
本申请要求于2022年9月13日提交中国专利局、申请号为202211110666.3、申请名称为“一种音频数据处理方法、装置以及可读存储介质”的中国专利申请的优先权。
技术领域
本申请涉及计算机技术领域,尤其涉及一种音频数据处理方法、装置、设备、存储介质及程序产品。
发明背景
目前,在一些音视频采集业务场景(例如,音视频会议场景)下,需要对音频数据进行采集,然而,在采集到的这些音频数据中,极易存在非平稳噪声(Non-Stationary Noise),对当前音频数据中的目标语音造成干扰,从而降低了目标语音的采集质量。
然而,在这些非平稳噪声中,有一种非平稳背景噪声Babble Noise,由多个说话人的交谈声组成。由于这种非平稳背景噪声的噪声数据的成分与目标语音的语音数据的成分类似,在对包含该非平稳背景噪声的目标语音进行语音增强处理时,容易将该目标语音中与该非平稳背景噪声具有相似语音成分的语音数据进行误消,从而会降低对音频数据进行噪声抑制后的语音保真度。
发明内容
本申请实施例提供了一种音频数据处理方法、装置、设备、存储介质及程序产品,可以有效抑制音频数据中的噪声数据,且提升语音保真度。
本申请实施例一方面提供了一种音频数据处理方法,由计算机设备执行,包括:
获取与原始音频数据相关联的目标音频数据帧和K个历史音频数据帧;目标音频数据帧和K个历史音频数据帧均为频谱帧,且K个历史音频数据帧中的每个历史音频数据帧均为目标音频数据帧之前的频谱帧,K为正整数;
在获取到目标音频数据帧的N个目标倒频谱系数时,基于N个目标倒频谱系数,获取与目标音频数据帧相关联的M个一阶时间导数和M个二阶时间导数;N为大于1的正整数,M为小于N的正整数;
获取每个历史音频数据帧对应的N个历史倒频谱系数,基于获取到的K*N个历史倒频谱系数,确定与目标音频数据帧相关联的频谱动态特征;及,
将N个目标倒频谱系数、M个一阶时间导数、M个二阶时间导数以及频谱动态特征输入至目标掩码估计模型,由目标掩码估计模型输出目标音频数据帧对应的目标掩码;目标掩码用于抑制原始音频数据中的噪声数据,以得到原始音频数据对应的增强音频数据。
本申请实施例一方面提供了一种音频数据处理方法,由计算机设备执行,包括:
获取与样本音频数据相关联的目标样本音频数据帧和K个历史样本音频数据,且获取目标样本音频数据帧对应的样本掩码;目标样本音频数据帧和K个历史样本音频数据帧均为频谱帧,且K个历史样本音频数据帧中的每个历史样本音频数据帧均为目标样本音频数据帧之前的频谱帧,K为正整数;
在获取到目标样本音频数据帧的N个目标样本倒频谱系数时,基于N个目标样本倒频谱系数,获取与目标样本音频数据帧相关联的M个样本一阶时间导数和M个样本二阶时间导数;N为大于1的正整数,M为小于N的正整数;
获取每个历史样本音频数据帧分别对应的N个历史样本倒频谱系数,基于获取到的K*N个历史样本倒频谱系数确定与目标样本音频数据帧相关联的样本频谱动态特征;
将N个目标样本倒频谱系数、M个样本一阶时间导数、M个样本二阶时间导数以及样本频谱动态特征输入至初始掩码估计模型,由初始掩码估计模型输出目标样本音频数据帧对应的预测掩码;及,
基于预测掩码和样本掩码对初始掩码估计模型进行迭代训练,得到目标掩码估计模型,所述目标掩码估计模型用于输出与原始音频数据相关联的目标音频数据帧所对应的目标掩码;目标掩码用于抑制原始音频数据中的噪声数据,以得到原始音频数据对应的增强音频数据。
本申请实施例一方面提供了一种音频数据处理装置,包括:
第一获取模块,用于获取与原始音频数据相关联的目标音频数据帧和K个历史音频数据帧;目标音频数据帧和K个历史音频数据帧均为频谱帧,且K个历史音频数据帧中的每个历史音频数据帧均为目标音频数据帧之前的频谱帧,K为正整数;
第二获取模块,用于在获取到目标音频数据帧的N个目标倒频谱系数时,基于N个目标倒频谱系数,获取与目标音频数据帧相关联的M个一阶时间导数和M个二阶时间导数;N为大于1的正整数,M为小于N的正整数;
第三获取模块,用于获取每个历史音频数据帧对应的N个历史倒频谱系数,基于获取到的K*N个历史倒频谱系数确定与目标音频数据帧相关联的频谱动态特征;及,
掩码估计模块,用于将N个目标倒频谱系数、M个一阶时间导数、M个二阶时间导数以及频谱动态特征输入至目标掩码估计模型,由目标掩码估计模型输出目标音频数据帧对应的目标掩码;目标掩码用于抑制原始音频数据中的噪声数据,以得到原始音频数据对应的增强音频数据。
本申请实施例一方面提供了一种音频数据处理装置,包括:
第一获取模块,用于获取与样本音频数据相关联的目标样本音频数据帧和K个历史样本音频数据,且获取目标样本音频数据帧对应的样本掩码;目标样本音频数据帧和K个历史样本音频数据帧均为频谱帧,且K个历史样本音频数据帧中的每个历史样本音频数据帧均为目标样本音频数据帧之前的频谱帧,K为正整数;
第二获取模块,用于在获取到目标样本音频数据帧的N个目标样本倒频谱系数时,基于N个目标样本倒频谱系数,获取与目标样本音频数据帧相关联的M个样本一阶时间导数和M个样本二阶时间导数;N为大于1的正整数,M为小于N的正整数;
第三获取模块,用于获取每个历史样本音频数据帧分别对应的N个历史样本倒频谱系数,基于获取到的K*N个历史样本倒频谱系数确定与目标样本音频数据帧相关联的样本频谱动态特征;
掩码预测模块,用于将N个目标样本倒频谱系数、M个样本一阶时间导数、M个样本二阶时间导数以及样本频谱动态特征输入至初始掩码估计模型,由初始掩码估计模型输出目标样本音频数据帧对应的预测掩码;及,
模型训练模块,用于基于预测掩码和样本掩码对初始掩码估计模型进行迭代训练,得到用于输出与原始音频数据相关联的目标音频数据帧所对应的目标掩码的目标掩码估计模型;目标掩码用于抑制原始音频数据中的噪声数据,以得到原始音频数据对应的增强音频数据。
本申请实施例一方面提供了一种计算机设备,包括:处理器和存储器;
处理器与存储器相连,其中,存储器用于存储计算机程序,计算机程序被处理器执行时,使得该计算机设备执行本申请实施例提供的方法。
本申请实施例一方面提供了一种计算机可读存储介质,计算机可读存储介质存储有计算机程序,该计算机程序适于由处理器加载并执行,以使得具有该处理器的计算机设备执行本申请实施例提供的方法。
本申请实施例一方面提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行本申请实施例提供的方法。
附图简要说明
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本申请实施例提供的一种***架构示意图;
图2是本申请实施例提供的一种音频数据处理的场景示意图;
图3是本申请实施例提供的一种音频数据处理方法的流程示意图;
图4a是本申请实施例提供的一种音频预处理的场景示意图;
图4b是本申请实施例提供的一种音频预处理的场景示意图;
图5是本申请实施例提供的一种倒频谱系数差分运算的场景示意图;
图6是本申请实施例提供的一种获取帧间差异值的场景示意图;
图7是本申请实施例提供的一种掩码估计模型的网络结构示意图;
图8是本申请实施例提供的一种音频数据处理方法的流程示意图;
图9是本申请实施例提供的一种模型训练的流程示意图;
图10是本申请实施例提供的一种降噪效果示意图;
图11是本申请实施例提供的一种音频数据处理装置的结构示意图;
图12是本申请实施例提供的一种音频数据处理装置的结构示意图;
图13是本申请实施例提供的一种计算机设备的结构示意图;
图14是本申请实施例提供的一种音频数据处理***的结构示意图。
实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用***。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。
语音增强(Speech Enhancement,SE)技术是指当语音信号被各种各样的噪声干扰、甚至淹没后,从噪声背景中提取有用的语音信号,抑制、降低噪声干扰的技术。语音增强技术可以将语音噪声与非语音噪声分离以保证语音的可懂度,也就是说,从含噪语音中提取尽可能纯净的原始语音。
语音增强涉及的应用领域十分广泛,包括语音通话、电话会议、实时音视频会议、场景录音、助听器设备和语音识别设备等,并成为许多语音编码和识别***的预处理模块。
本申请实施例提供的方案涉及数字信号处理技术。可以理解的是,数字信号处理(Digital Signal Processing,DSP)是将模拟信息(如音频、视频、图片等)转换为数字信息的技术,其利用计算机或专用处理设备,以数字形式对信号进行采集、变换、滤波、估值、增强、压缩、识别等处理,以得到符合人们需要的信号形式。在本申请实施例中,数字信号处理技术可用于对目标音频数据帧提取包含目标倒频谱系数、一阶时间导数、二阶时间导数以及频谱动态特征在内的目标音频特征。
本申请实施例提供的方案还涉及人工智能领域下的机器学习技术。可以理解的是,机器学习(Machine Learning,ML)是一门多领域交叉学科,涉及概率论、统计学、逼近论、凸分析、算法复杂度理论等多门学科。专门研究计算机怎样模拟或实现人类的学习行为,以获取新的知识或技能,重新组织已有的知识结构使之不断改善自身的性能。机器学习是人工智能的核心,是使计算机具有智能的根本途径,其应用遍及人工智能的各个领域。机器学习通常包括人工神经网络、置信网络、强化学习、迁移学习、归纳学习、示教学习等技术。在本申请实施例中,目标掩码估计模型是基于机器学习技术的AI模型,可用于从输入的音频特征中估计相应的掩码。
请参见图1,图1是本申请实施例提供的一种系统架构示意图。如图1所示,该系统架构可以包括业务服务器100以及用户终端集群,其中,用户终端集群可以包括一个或多个用户终端,这里将不对用户终端集群中的用户终端的数量进行限定。如图1所示,用户终端集群中的多个用户终端具体可以包括:用户终端200a、用户终端200b、用户终端200c、…、用户终端200n。
其中,用户终端集群之间可以存在通信连接,例如用户终端200a与用户终端200b之间存在通信连接,用户终端200a与用户终端200c之间存在通信连接。同时,用户终端集群中的任一用户终端可以与业务服务器100存在通信连接,以便于用户终端集群中的每个用户终端均可以通过该通信连接与业务服务器100进行数据交互,例如用户终端200a与业务服务器100之间存在通信连接。其中,上述通信连接不限定连接方式,可以通过有线通信方式进行直接或间接地连接,也可以通过无线通信方式进行直接或间接地连接,还可以通过其它方式,本申请在此不做限制。
应该理解,如图1所示的用户终端集群中的每个用户终端均可以安装有应用客户端,当该应用客户端运行于各用户终端中时,可以分别与上述图1所示的业务服务器100之间进行数据交互。其中,该应用客户端可以为社交客户端、即时通信客户端(例如,会议客户端)、娱乐客户端(例如,游戏客户端、直播客户端)、多媒体客户端(例如,视频客户端)、资讯类客户端(例如,新闻客户端)、购物客户端、车载客户端、智能家居客户端等具有显示文字、图像、音频以及视频等数据信息功能的客户端。
例如,在一些实施例中,该应用客户端可以为具有音视频通讯功能的客户端,需要说明的是,这里的音视频通讯功能可以为单纯的音频通讯功能或视频通讯功能,该功能可广泛应用于企业办公、即时交流、在线教育、远程医疗、数字金融等不同领域中的音视频会议、音视频通话、音视频直播等多种涉及音视频采集的业务场景。其中,该应用客户端可以为独立的客户端,也可以为集成在某客户端(例如社 交客户端、视频客户端等)中的嵌入式子客户端,在此不做限定。
以即时通信客户端为例,业务服务器100可以为包括即时通信客户端对应的后台服务器、数据处理服务器等多个服务器的集合,因此,每个用户终端均可以通过该即时通信客户端与业务服务器100进行数据传输,如每个用户终端均可以实时采集相关的音视频数据,并通过业务服务器100将采集到的音视频数据发送至其它用户终端,以实现音视频通讯(例如,开展远程的实时音视频会议)。
应当理解,在实际应用场景(例如实时的音视频通讯场景)下,通过音视频采集过程所采集到的音频数据难以避免地会被外界噪声所干扰,尤其是背景人声所组成的噪声(即Babble Noise),这种类型的噪声对当前音频数据中的目标语音所造成的干扰更加难以消除。
为了提升目标语音的采集质量,需要抑制此类噪声,基于此,本申请实施例提供了一种针对音频数据的噪声实时抑制的方法,该方法将数字信号处理与具有非线性拟合能力的神经网络进行有效结合,以对音视频通讯过程中的噪声数据(例如Babble Noise)进行抑制,同时保有极高的语音保真度。
其中,在基于数字信号处理的语音增强技术中,按照其通道数目的不同,又可以进一步划分为单通道语音增强技术和麦克风阵列的语音增强技术。数字信号处理的语音增强技术在实时在线语音增强中可较好地应对平稳噪声,但对非平稳噪声的抑制能力较差。而另一方面,基于机器学习的语音增强技术可以更好地抑制非平稳噪声和背景声音,并可以针对实时通信进行商业应用。因此,将数字信号处理技术与机器学习技术进行有效结合,可以同时满足对非平稳噪声的抑制和实时通信的需求。
应当理解,这里的非平稳噪声是指其统计特性随时间变化的噪声,例如,在进行音视频采集时,随目标语音而一并采集到的狗叫声、厨房用具的砰砰声、婴儿哭声、建筑或交通噪音等。
为了便于后续的理解和说明,本申请实施例可以将目标语音的来源对象统称为业务对象(例如,音视频通讯过程中发表讲话的用户,也可称为主讲人),将与该业务对象相关联的待处理的音频数据统称为原始音频数据。
可以理解,这里的原始音频数据可以通过音频设备采集业务对象所处现实环境中的声音所获得,可能会同时包含有业务对象产生的语音数据(即目标语音的语音数据)以及环境中的噪声数据。
本申请实施例中的噪声数据是指非平稳背景噪声的噪声数据,可以包括业务对象周围的真实交谈声(即Babble Noise)、正在播放的多媒体文件所携带的歌声或说话声以及其它类似的非平稳背景噪声。
其中,这里的多媒体文件可以为同时携带图像数据和音频数据的视频类文件,例如短视频、电视剧集、电影、音乐短片(Music Video,MV)、动画等;或者可以为主要由音频数据组成的音频类文件,例如歌曲、有声读物、广播剧、电台节目等,本申请实施例将不对多媒体文件的类型、内容、来源和格式进行限制。
此外,本申请实施例可将用于对从原始音频数据中提取到的音频特征进行掩码估计的神经网络模型称为目标掩码估计模型。
其中,可选的,上述音频设备可以是设于用户终端中的硬件组件,例如音频设备可以为用户终端的麦克风;或者,可选的,该音频设备也可以是与用户终端相连接的硬件装置,例如与用户终端相连接的麦克风,用于为用户终端提供原始音频数据的获取服务,该音频设备可以包括音频传感器、麦克风等。
可以理解的是,本申请实施例提供的方法可以由计算机设备执行,计算机设备包括但不限于用户终端(例如,图1所示的用户终端集群中的任意一个用户终端)或业务服务器(例如,图1所示的业务服务器100)。其中,业务服务器可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式***,还可以是提供云数据库、云服务、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、CDN、以及大数据和人工智能平台等基础云计算服务的云服务器。用户终端可以是智能手机、平板电脑、笔记本电脑、台式计算机、掌上电脑、可穿戴设备(例如智能手表、智能手环、智能助听设备等)、智能电脑、智能车载等可以运行上述应用客户端的智能终端。其中,用户终端和业务服务器可以通过有线或无线方式进行直接或间接地连接,本申请实施例对此不做限制。
为便于理解,这里以用户终端200a和用户终端200b为例进行说明。假设业务对象1通过用户终端200a上的应用客户端与用户终端200b对应的业务对象2进行音视频通讯(例如,业务对象1与业务对象2在进行纯语音会议),当业务对象1进行讲话时,用户终端200a可以通过相关的音频设备(例如用户终端200a上的麦克风)获取到与业务对象1相关联的原始音频数据。
需要说明的是,在本申请实施例中,原始音频数据可以是时域中的混合音频信号。由于在时域上直接将纯净语音信号从混合音频信号中抽离出来难度较大,因此本申请将从频域入手解决语音分离问题。
具体来说,可以对原始音频数据进行音频预处理,以得到频域上的多个频谱帧(在本申请实施例也可称为音频数据帧),每一个频谱帧均包含有原始音频数据在频域上的部分频谱,为便于区分,本申请实施例可以将多个频谱帧中任意一个待处理的频谱帧称为目标音频数据帧,相应的,可以将在频域上位于该目标音频数据帧之前的频谱帧称为历史音频数据帧,也就是说,历史音频数据帧为在目标音频数据 帧之前得到的频谱帧。
基于此,进一步地,用户终端200a可以获取与原始音频数据相关联的目标音频数据帧和K个历史音频数据帧,其中K为正整数,本申请实施例对历史音频数据帧的具体数量不进行限定。这里的目标音频数据帧和K个历史音频数据帧均为频谱帧,且K个历史音频数据帧中的每个历史音频数据帧均为目标音频数据帧之前的频谱帧。
为了能够提升语音分离的准确性和可懂度,本申请将利用多种音频特征来进行掩码估计。为便于理解,这里以目标音频数据帧为例进行说明,其它频谱帧的处理过程与目标音频数据帧的处理过程一致。具体来说,用户终端200a可以获取目标音频数据帧的N个目标倒频谱系数,进而可以基于这N个目标倒频谱系数,获取与目标音频数据帧相关联的M个一阶时间导数和M个二阶时间导数,其中,N为大于1的正整数,M为小于N的正整数,本申请实施例对目标倒频谱系数的具体数量、一阶时间导数的具体数量以及二阶时间导数的具体数量均不进行限定。此外,用户终端200a还可以获取每个历史音频数据帧对应的N个历史倒频谱系数,并可以基于获取到的K*N个历史倒频谱系数,确定与目标音频数据帧相关联的频谱动态特征。
本申请实施例可以将每个频谱帧相关的倒频谱系数、一阶时间导数、二阶时间导数和频谱动态特征统称为音频特征,其中,为便于区分,目标音频数据帧对应的音频特征可称为目标音频特征。可以理解,目标倒频谱系数可用于表征目标音频数据帧的声学特征,相关的一阶时间导数、二阶时间导数以及频谱动态特征则可以表征音频信号间的时间相关性特征(或音频信号的稳定特性)。因此,用户终端200a可以将上述N个目标倒频谱系数、M个一阶时间导数、M个二阶时间导数以及频谱动态特征输入至训练好的目标掩码估计模型,由该目标掩码估计模型输出目标音频数据帧对应的目标掩码,该目标掩码可用于抑制原始音频数据中的噪声数据,以得到原始音频数据对应的增强音频数据。
也就是说,利用得到的每个频谱帧对应的掩码(即Mask,也可称为掩蔽、掩模等),可以有效分离原始音频数据中的语音数据和噪声数据,实现音视频通讯过程中的语音增强。可以理解,综合多种音频特征进行建模所得到的掩码的准确度更高,因此,利用该掩码得到的增强音频数据的语音保真度也很高。
在本申请实施例中,目标掩码可以包括但不限于理想比率掩码(Ideal Ratio Mask,简称IRM,也可称为理想比值掩蔽)、理想二值掩码(Ideal Binary Mask,简称IBM,也可称为理想二值掩蔽)、最佳比率掩码(Optimal Ratio Mask,ORM,也可称为最佳比例掩模)等,这里将不对目标掩码的类型进行限定。
可以理解,后续用户终端200a可以将得到的增强音频数据发送至业务服务器100,再由业务服务器100将该增强音频数据下发给终端设备200b。相应的,可以理解,当业务对象2进行发言时,终端设备200b同样可以执行与上述描述类似的语音增强过程,以将得到的业务对象2相关的增强音频数据发送至终端设备200a,这样,业务对象1和业务对象2在进行音视频通讯的过程中,始终可以收听到对方发送的高质量的语音,从而可以实现高质量的音视频通讯,提升用户体验。
可选的,在一些实施例中,上述应用客户端还可以为具有音视频编辑功能的客户端,通过该功能可以对待处理的原始音频数据进行语音增强处理,该功能可应用于音视频制作、音视频录制等涉及音视频采集的业务场景。其中,该原始音频数据可以通过音频设备实时录制业务对象(此处可指需要执行语音增强的目标用户)所处现实环境中的声音所获得,或者,该原始音频数据也可以从待处理多媒体文件(可包括视频类文件和音频类文件)中获得,本申请实施例对此不做限制。类似的,该原始音频数据也可以为混合有业务对象的语音数据以及该业务对象所处环境中的噪声数据的音频信号,对该原始音频数据进行语音增强处理的过程与上述针对音视频通讯场景的语音增强处理过程类似,最终得到的更清晰的增强音频数据可以直接进行存储或发送,或者可用于替换待处理多媒体文件中的原始音频数据。与实时的音视频通讯场景相比,在非实时的音视频编辑场景下,对语音增强的实时性要求较低,但仍可以满足用户对高质量语音的需求。
可选的,业务服务器也可以获取用户终端发送的原始音频数据,并通过加载训练好的目标掩码估计模型,获取与原始音频数据相关联的目标音频数据帧对应的目标掩码,从而实现语音增强。其中,图1所示的系统架构中的业务服务器的数量可以为一个或多个,一个用户终端可以与一个业务服务器相连接,每个业务服务器均可以获取到与之相连接的用户终端所上传的原始音频数据,并对其进行语音增强。
其中,可以理解的是,上述系统架构适用于多种涉及音视频采集的业务场景,具体可以包括:音视频会议场景、音视频通话场景、音视频直播场景、音视频专访场景、远程探视场景、助听装置语音增强、语音识别等实时降噪场景,也可以为音视频录制、音视频后期制作等非实时降噪场景,或者其它需要对采集到的音频数据进行语音增强处理的业务场景,尤其是需要对Babble Noise进行实时抑制的业务场景,这里将不对具体的业务场景进行一一列举。
为便于理解,请一并参见图2,图2是本申请实施例提供的一种音频数据处理的场景示意图。其中,如图2所示的计算机设备20可以为上述图1所对应实施例中的业务服务器100或者用户终端集群中的 任意一个用户终端(例如,用户终端200a),这里不做限定。
如图2所示,原始音频数据201可以为包含有业务对象的语音数据和环境中的噪声数据的混合音频信号,该原始音频数据201可以为计算机设备20通过相关音频设备实时采集到的音频数据,也可以为计算机设备20从待处理多媒体文件中获取到的音频数据,还可以为其它计算机设备发送至计算机设备20进行音频处理的音频数据,本申请实施例对此不进行限定。
可以理解,计算机设备20在获取到原始音频数据201后,可以对其中的噪声数据进行抑制,以得到语音质量更好的音频数据。为实现此目标,计算机设备20首先可以利用数字信号处理技术提取原始音频数据201的音频特征,在此之前,计算机设备20可以先对原始音频数据201进行音频预处理,具体包括对原始音频数据201进行分帧加窗预处理、时频变换等操作,从而可以得到与原始音频数据201相关联的音频数据帧集合202,该音频数据帧集合202可以包括多个位于频域上的音频数据帧(即频谱帧),此处对音频数据帧集合202所包含的音频数据帧的数量不进行限定。
随后计算机设备20可以对音频数据帧集合202中的每个音频数据帧均进行音频特征提取、掩码估计、噪声抑制等处理操作,本申请实施例对每个音频数据帧处理的先后顺序不进行限制,例如可以并行处理多个音频数据帧,也可以按照获取时间先后顺序依次对每个音频数据帧进行串行处理。为便于理解和说明,本申请实施例可以将上述音频数据帧集合202中的任意一个待处理的音频数据帧作为目标音频数据帧,例如,可以将音频数据帧集合202中的音频数据帧203作为目标音频数据帧,当其它音频数据帧作为目标音频数据帧时,对应的处理过程与对音频数据帧203的处理过程一致。
此外,计算机设备20还可以在音频数据帧集合202中获取与音频数据帧203相关的历史音频数据帧集合204,该历史音频数据帧集合204可以包括目标音频数据帧之前的K个历史音频数据帧,例如这K个历史音频数据帧可依次为音频数据帧A1、…、音频数据帧AK,其中K为正整数,这里对K的具体取值不进行限定。可以理解,音频数据帧A1~音频数据帧AK均为音频数据帧203之前的频谱帧。
进一步,计算机设备20可以对目标音频数据帧进行音频特征提取。以音频数据帧203为例,计算机设备20可以获取音频数据帧203对应的倒频谱系数集合205,该倒频谱系数集合205可以用于表征音频数据帧203的声学特征,其中,该倒频谱系数集合205可以包括音频数据帧203的N个目标倒频谱系数,例如,N个目标倒频谱系数具体可以包括倒频谱系数B1、倒频谱系数B2、…、倒频谱系数BN,N为大于1的正整数,这里对N的具体取值不进行限定。
随后,计算机设备20可以基于倒频谱系数集合205中的N个目标倒频谱系数,获取与音频数据帧203相关联的M个一阶时间导数和M个二阶时间导数,M为小于N的正整数,这里对M的具体取值不进行限定。其中,一阶时间导数可以通过对上述倒频谱系数B1、倒频谱系数B2、…、倒频谱系数BN进行差分运算得到,二阶时间导数则可以通过对得到的一阶时间导数再进行二次差分运算得到,其具体的运算过程可以参见后续图3所对应实施例中步骤S102的相关描述。
如图2所示,经过相应的运算后,计算机设备20可以获取到一阶时间导数集合206和二阶时间导数集合207,其中,一阶时间导数集合206可以包括与音频数据帧203相关联的M个一阶时间导数,例如,M个一阶时间导数具体可以包括一阶时间导数C1、…、一阶时间导数CM;类似的,二阶时间导数集合207可以包括与音频数据帧203相关联的M个二阶时间导数,例如,M个二阶时间导数具体可以包括二阶时间导数D1、…、二阶时间导数DM
此外,为了能够更准确地表征原始音频数据的稳定特性,计算机设备20还可以获取与目标音频数据帧相关联的频谱动态特征。仍以音频数据帧203为例,计算机设备20在获取到上述历史音频数据帧集合204后,可以获取该历史音频数据帧集合204中的每个历史音频数据帧对应的N个历史倒频谱系数,例如,可以获取音频数据帧A1对应的N个历史倒频谱系数,包括倒频谱系数A11、倒频谱系数A12、…、倒频谱系数A1N;…;获取音频数据帧AK对应的N个历史倒频谱系数,包括倒频谱系数AK1、倒频谱系数AK2、…、倒频谱系数AKN,本申请实施例可以将获取到的K*N个历史倒频谱系数作为倒频谱系数集合208。可以理解,每个历史音频数据帧对应的N个历史倒频谱系数的获取过程与上述音频数据帧203对应的N个目标倒频谱系数的获取过程是类似的,这里不再进行赘述。
进一步,计算机设备20可以基于倒频谱系数集合208中的K*N个历史倒频谱系数确定与音频数据帧203相关联的频谱动态特征209,其具体过程可以参见后续图3所对应实施例中步骤S103的相关描述。
进一步地,在获取到音频数据帧203的音频特征后,计算机设备20可以加载预先训练好的目标掩码估计模型(例如,掩码估计模型210),进而可以将上述倒频谱系数集合205、一阶时间导数集合206、二阶时间导数集合207以及频谱动态特征209共同输入至掩码估计模型210,通过掩码估计模型210在输入的音频特征中进行掩码估计,可以得到音频数据帧203对应的目标掩码(例如,掩码211)。
随后,计算机设备20可以将得到的掩码211作用于音频数据帧203,以对其中的噪声数据进行抑制。可以理解,掩码的作用相当于尽可能保留原始音频数据中业务对象的语音数据,而消除造成干扰的 噪声数据(例如业务对象附近其他人的交谈声)。此外,计算机设备20对其它音频数据帧(例如音频数据帧A1、…、音频数据帧AK等)的处理过程与对音频数据帧203的处理过程类似,这里不再进行赘述。
最终,当计算机设备20对音频数据帧集合202中的每个音频数据帧均进行噪声抑制后,可以得到原始音频数据201对应的增强音频数据212。此时,增强音频数据212中的噪声数据含量极低,且业务对象的语音数据得到了有效保留,具有极高的语音保真度。
可以理解,计算机设备20可以利用具有海量音频数据的音频数据库,训练神经网络得到上述掩码估计模型210,具体训练过程可以参见后续图8所对应的实施例。
可以理解的是,在不同的业务场景中,上述原始音频数据的来源可以不同,相应的,最终得到的增强音频数据的用途也可以不同。例如,在音视频会议、音视频通话、音视频直播、音视频专访、远程探视等实时音视频通讯场景中,计算机设备20可以将对原始音频数据E1进行实时语音增强处理后得到的增强音频数据F1发送至与业务对象1进行音视频通讯的其他用户的用户终端上;又例如,在助听装置语音增强场景中,计算机设备20可以对助听装置获取到的与业务对象2相关联的原始音频数据E2进行语音增强处理,从而可以将包含业务对象2的清晰语音数据的增强音频数据F2返回至助听装置进行播放;又例如,在语音识别场景中,在获取到业务对象3输入的原始音频数据E3后,计算机设备20可以先对其进行语音增强处理,以得到增强音频数据F3,随后可以对增强音频数据F3中所包含的高质量语音数据进行语音识别,从而可以提升语音识别的准确性。又例如,在音视频录制场景中,计算机设备20可以对业务对象4录入的原始音频数据E4进行语音增强处理,并可对得到的增强音频数据F4进行存储(例如可以存储至计算机设备20的本地缓存或者上传至云端存储)或者发送(例如,可作为即时通信过程中的音视频会话消息发送给其它用户终端进行播放);又例如,在音视频制作场景中,计算机设备20可以从待处理多媒体文件中获取原始音频数据E5并对其进行语音增强处理,随后可以用得到的增强音频数据F5替换该待处理多媒体文件中的原始音频数据E5,从而可以提升多媒体文件中的音频质量。
其中,计算机设备20通过训练初始掩码估计模型得到目标掩码估计模型,获取与原始音频数据相关联的目标音频数据帧的目标音频特征,并通过目标掩码估计模型对该目标音频特征进行掩码估计,输出目标音频数据帧所对应的目标掩码以及利用目标掩码进行噪声抑制的具体实现方式,可以参见下述图3-图10所对应实施例中的描述。
请参见图3,图3是本申请实施例提供的一种音频数据处理方法的流程示意图。其中,可以理解的是,本申请实施例提供的方法可以由计算机设备执行,这里的计算机设备包括但不限于运行有目标掩码估计模型的用户终端或业务服务器。为便于理解,本申请实施例以该计算机设备为用户终端为例,以阐述在该用户终端中对原始音频数据进行音频处理(如语音增强)的具体过程。如图3所示,该方法至少可以包括下述步骤S101-步骤S104:
步骤S101,获取与原始音频数据相关联的目标音频数据帧和K个历史音频数据帧。
具体的,用户终端可以获取包含有业务对象的语音数据和环境中的噪声数据的原始音频数据,该原始音频数据可以为用户终端通过音频设备实时采集得到的音频数据,也可以是从待处理多媒体文件中获取到的音频数据,还可以为其它相关联的用户终端发送过来的音频数据,这里不进行限定。
可以理解,从统计学的角度看,语音数据具有一定的平稳属性,例如,一个持续几十毫秒到几百毫秒的发音单元里,语音数据可表现出明显的稳定性和规律性,基于此,在进行语音增强处理时,对于一段音频数据,可以以较小的发音单元(例如音素、字、字节等)为基础进行语音增强。因此,在对原始音频数据进行音频特征提取之前,用户终端可以对该原始音频数据进行音频预处理,以得到多个位于频域的频谱帧。
在一种实施方式中,可以利用滑动窗从原始音频数据中提取短时片段。具体来说,用户终端可以对原始音频数据进行分帧加窗预处理,从而得到H个音频数据段,其中,H为大于1的正整数,本申请实施例对音频数据段的具体数量不进行限定。
需要说明的是,分帧加窗预处理可以包括分帧操作和加窗操作。首先,用户终端可以对原始音频数据进行分帧操作,得到位于时域上的H个音频信号帧。由于分帧后每一个音频信号帧的开始和结束都会出现间断,以至于分割的音频信号帧越多,与原始音频数据的误差就越大,因此,本申请实施例可以通过加窗操作解决这个问题,以使成帧后的信号变得连续,并且每一帧信号都可以表现出周期函数的特性。也就是说,用户终端可以对上述得到的H个音频信号帧中的每个音频信号帧均进行加窗操作,从而得到信号连续的H个音频数据段。在本申请实施例中,进行加窗操作时,将每一个音频信号帧依次与窗函数进行相乘即可得到对应的音频数据段。其中,窗函数包括但不限于Vorbis窗、汉明窗、矩形窗、汉宁窗等,实际应用中可以根据需要选取合适的窗函数,本申请实施例对此不做限定。
需要说明的是,用户终端可以根据原始音频数据的长度、分帧操作所采用的帧长以及帧移,共同确定可以划分的音频信号帧的数量。其中,帧长是指一个音频信号帧的长度,这里的"长度"可以用多种方式表示,例如可以用时间或者采样点数来表示。可选的,如果用时间表示,一个音频信号帧的长度通常可以取在15ms-30ms之间,实际应用中可以根据需要选取合适的帧长,本申请实施例对此不进行限定,例如,在一些实施例中,可以将帧长设置为20ms,帧长为20ms的一个音频信号帧是指时长为20ms的音频信号。
可选的,也可以用采样点数来表示,例如,在一些实施例中,假设原始音频数据的采样率为16kHz,帧长为20ms,则一个音频信号帧可以由16kHz*20ms=320个采样点组成。
其中,帧移是指每次分帧时移动的距离,以第一个音频信号帧的起始点开始,移动一个帧移,直到下一个音频信号帧的起始点。这里同样也可以用两种方式表示,例如,在一些实施例中,可以用时间表示,将帧移设置为12ms;又例如,在一些实施例中,可以用采样点数表示,对于采样率为16kHz的原始音频数据,可以将帧移设置为192个采样点。
为便于理解,请参见图4a-图4b,图4a-图4b是本申请实施例提供的一种音频预处理的场景示意图。如图4a所示,在对长度为T的原始音频数据进行分帧操作时,可以将帧长设置为T1(例如,设置为20ms),将帧移设置为T2(例如,设置为12ms),则从原始音频数据的起始位置开始,取帧长为T1的音频信号,得到第一个音频信号帧,即音频信号帧1;随后,移动一个长度为T2的帧移,并从移动后的位置开始再取帧长为T1的音频信号,得到第二个音频信号帧,即音频信号帧2;依此类推,最终可以得到H个音频信号帧,其中,H=(T-T1)÷T2+1。
特别的,在分帧操作时,可能会遇到最后剩下的信号长度不够一帧的情况,此时可以将对这剩余的一段信号进行补零操作,使之达到一帧的长度(即T1),或者可以直接将之抛弃,因为最后一帧处于原始音频数据最末尾部分,大部分为静音片段。
进一步地,请参见图4b,如图4b所示,在经过上述分帧操作得到H个音频信号帧后,用户终端可以将窗函数依次作用于每一个音频信号帧,从而可以得到相应的音频数据段。例如,将音频信号帧1与窗函数相乘,可以得到音频数据段1;将音频信号帧2与窗函数相乘,可以得到音频数据段2;…;将音频信号帧H与窗函数相乘,可以得到音频数据段H。可以理解,这里的音频数据段1~音频数据段H是按照时间顺序排列的。
在一些实施例中,分析窗是指分帧加窗预处理时所使用的窗函数,而为了实现语音信号的完美重构、降低语音信号的失真程度,可以在后续将频域的语音频谱还原到时域的语音信号的过程中再加上一个合成窗。其中,分析窗和合成窗都可以采用Vorbis窗,此窗函数满足Princen-Bradley准则。具体实现过程在本申请实施例中不进行展开。Vorbis窗的定义可参见下述公式(1):
w(n)=sin((π/2)·sin²(π·(n+0.5)/N))        (1)
其中,n是指当前Vorbis窗所作用的采样点的索引,N为窗长,0≤n≤N-1。
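为便于理解上述分帧加窗与Vorbis窗的处理过程,下面给出一段示意性的Python(NumPy)代码草图,其中的函数名、默认帧长(320个采样点)与帧移(192个采样点)均为结合前述示例引入的假设,并非对本申请实施方式的限定:
import numpy as np

def vorbis_window(win_len):
    # 公式(1)所示的Vorbis窗,满足Princen-Bradley准则
    n = np.arange(win_len)
    return np.sin(0.5 * np.pi * np.sin(np.pi * (n + 0.5) / win_len) ** 2)

def frame_and_window(x, frame_len=320, hop=192):
    # 按帧长320个采样点(16kHz下20ms)、帧移192个采样点(12ms)对原始音频数据分帧,
    # 末尾不足一帧的部分进行补零,再逐帧乘以分析窗得到各音频数据段
    num_frames = 1 + max(0, int(np.ceil((len(x) - frame_len) / hop)))
    pad_len = (num_frames - 1) * hop + frame_len - len(x)
    x = np.concatenate([x, np.zeros(pad_len)])
    win = vorbis_window(frame_len)
    return np.stack([x[i * hop: i * hop + frame_len] * win for i in range(num_frames)])
该草图中未展开合成窗的处理(在将增强后的频谱还原为时域信号时使用),其同样可以采用上述Vorbis窗实现。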
在获取到H个音频数据段后,进一步地,用户终端可以分别对每个音频数据段进行时频变换,从而可以得到每个音频数据段对应的音频数据帧,也就是说,可以将时域上的音频数据段变换为频域上的音频数据帧,以便获得更易于进行噪声抑制的频谱帧。
为便于理解和说明,本申请实施例将以H个音频数据段中的任意一个音频数据段为例,来阐述时频变换的具体过程。假设H个音频数据段中包括音频数据段i,i为小于或等于H的正整数,用户终端可以先对音频数据段i进行时频变换,例如进行傅立叶变换,如快速傅立叶变换(Fast Flourier Transformation,简称FFT),以得到音频数据段i在频域中的直流分量频点和2S个频点,也就是说,傅立叶变换后一共可以得到(1+2S)个频点,S为正整数,本申请实施例对频点的数量不进行限定。每个音频数据段对应的音频信号帧的采样点数与对该音频数据段对应的频点数可以相同,也可以不相同,实际应用中可以根据需要设置傅立叶变换后得到的频点数。例如,在一些实施例中,每个音频信号帧对应的采样点数为320,在进行时频变换时,每个音频数据段对应的频点数则可以设置为512。
由傅立叶变换的性质可知,傅立叶变换后得到的(1+2S)个频点均为复数,每个复数对应一个频率,该复数的模值可以表示该频率的振幅特征,该振幅特征与对应音频信号的振幅之间具有特定的比例关系。需要说明的是,除了第一个复数(即直流分量频点)外,其余的2S个复数是关于其中心共轭对称的,而共轭对称的两个复数的模值(或振幅)相同,因此实际上只需选取这2S个频点中的一半频点的频谱即可。为便于区分,本申请实施例可以将这2S个频点中的前S个频点确定为与第一频点类型相关的频点,相应的,可以将这2S个频点中的后S个频点确定为与第二频点类型相关的频点,即这2S个频点可以包括与第一频点类型相关的S个频点和与第二频点类型相关的S个频点。可以理解,与第一频点类型相关的S个频点和与第二频点类型相关的S个频点是关于其中心共轭对称的。
随后,用户终端可以在上述2S个频点中获取与第一频点类型相关的S个频点,并可以基于与第一频点类型相关的S个频点和直流分量频点,确定音频数据段i对应的音频数据帧。又或者,可选的,由 于共轭对称的特点,也可以基于与第二频点类型相关的S个频点和直流分量频点,确定音频数据段i对应的音频数据帧,本申请实施例对此不进行限定。
可以理解,音频数据段i对应的音频数据帧为频域上的频谱帧。例如,在一些实施例中,时频变换后,每个音频数据段对应得到513个频点,其中包括1个直流分量频点和512个具有共轭对称关系的频点,则可以取512个频点中的前一半频点(即与第一频点类型相关的频点)和该直流分量频点组成对应的音频数据帧。
例如,假设对音频数据段i进行傅立叶变换后得到5个频点(即S=2),包括一个直流分量频点(a+bi),以及频点(c+di)、频点(e+fi)、频点(c-di)、频点(e-fi),其中,频点(c+di)与频点(c-di)是一对共轭复数,频点(e+fi)与频点(e-fi)也是一对共轭复数,因此,可以将频点(c+di)和频点(e+fi)作为与第一频点类型相关的频点,将频点(c-di)和频点(e-fi)作为与第二频点类型相关的频点。进一步,可以基于直流分量频点(a+bi)、频点(c+di)以及频点(e+fi)确定音频数据段i对应的音频数据帧,或者,可以基于直流分量频点(a+bi)、频点(c-di)以及频点(e-fi)确定音频数据段i对应的音频数据帧。
为便于理解,请再次参见图4b,如图4b所示,对音频数据段1进行时频变换后,可以得到音频数据帧1;对音频数据段2进行时频变换后,可以得到音频数据帧2;…;对音频数据段H进行时频变换后,可以得到音频数据帧H。可以理解,这H个音频数据帧在频域上的先后顺序与H个音频数据段在时域上的先后顺序是一致的。
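下面给出与上述时频变换过程对应的一段示意性Python代码草图(假设FFT长度为512,stft_frame为本说明引入的示例名称):
import numpy as np

def stft_frame(segment, fft_size=512):
    # 对单个加窗后的音频数据段补零至fft_size后做快速傅立叶变换;
    # rfft只返回直流分量频点和前S个频点(共1 + fft_size/2 = 257个频点),
    # 由于共轭对称,其余S个频点无需保留
    padded = np.zeros(fft_size)
    padded[:len(segment)] = segment
    return np.fft.rfft(padded)   # 长度为257的复数数组,即一个频域上的音频数据帧

# 用法示例:对分帧加窗得到的每个音频数据段逐个变换,得到H个音频数据帧
# frames = frame_and_window(x)          # 见前述分帧加窗草图
# spectra = [stft_frame(f) for f in frames]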
可以理解,在得到H个音频数据帧后,用户终端可以在得到的这H个音频数据帧中,确定目标音频数据帧以及目标音频数据帧之前的K个历史音频数据帧,其中,目标音频数据帧和K个历史音频数据帧均为频谱帧,目标音频数据帧可以为H个音频数据帧中任意一个待处理的音频数据帧,且K个历史音频数据帧中的每个历史音频数据帧均为目标音频数据帧之前的频谱帧,K为小于H的正整数,本申请实施例对K的取值不进行限定。请再次参见图4b,假设将音频数据帧4作为目标音频数据帧,则在音频数据帧4之前的频谱帧有音频数据帧1、音频数据帧2以及音频数据帧3,例如,当K=2时,可以将与音频数据帧4最相近的音频数据帧2和音频数据帧3作为所需的历史音频数据帧。
需要说明的是,在对每个音频数据帧进行处理时,均需要获取该音频数据帧之前的K个历史音频数据帧,特别的,对于当前音频数据帧之前的历史音频数据帧的数量不满足指定K个的情况,可以通过补零操作使该音频数据帧之前的历史音频数据帧的数量达到K个。例如,结合上述图4b,假设将音频数据帧1作为目标音频数据帧,且K=2,可以看到时频变换后音频数据帧1之前是不存在频谱帧的,因此可以在音频数据帧1前面补上2个全零的频谱帧,作为音频数据帧1之前的历史音频数据帧。
步骤S102,在获取到目标音频数据帧的N个目标倒频谱系数时,基于N个目标倒频谱系数,获取与目标音频数据帧相关联的M个一阶时间导数和M个二阶时间导数。
可以理解,通过音频预处理获取到目标音频数据帧后,用户终端可以对该目标音频数据帧进行音频特征提取,具体的,可以获取可表征目标音频数据帧的声学特征的N个倒频谱系数,为便于区分,可以将这N个倒频谱系数统称为目标倒频谱系数,并且可以进一步获取可表征不同语音信号间的时间相关性特征的M个一阶时间导数、M个二阶时间导数和频谱动态特征,其中,N为大于1的正整数,M为小于N的正整数。下面将详细阐述目标音频数据帧相关联的目标倒频谱系数、一阶时间导数和二阶时间导数的获取过程。
其中,获取目标音频数据帧的N个目标倒频谱系数的具体过程可以为:假设目标音频数据帧一共包含有S1个频点,这S1个频点包括一个直流分量频点以及与一种频点类型相关的S2个频点,S1和S2均为正整数,且S1=1+S2。结合上述步骤S101对时频变换过程的描述,与一种频点类型相关的S2个频点可以为与第一频点类型相关的S个频点或者与第二频点类型相关的S个频点,这里S2=S。基于此,用户终端可以将这S1个(例如,256+1个)频点映射到N个(例如,56个)声学频带上,其中S1大于或等于N,也就是说,可以将S1个频点的频率划分为更粗糙的频率尺度(即本申请实施例中的声学频带),从而降低后续计算的复杂度。
进一步,可以分别对每个声学频带进行倒谱处理,以得到每个声学频带对应的目标倒频谱系数。为便于理解,这里假设N个声学频带包括声学频带j,j为小于或等于N的正整数,对声学频带j进行倒谱处理的具体过程可以为:
首先,可以获取声学频带j的频带能量。
在一些实施例中,可以对频点数据进行三角滤波以得到其对应每个声学频带的频带能量。例如,可以在包含N个三角滤波器的三角滤波器组中,获取与声学频带j相关联的三角滤波器(例如三角滤波器j),进而可以将该三角滤波器j中的每个滤波点分别作用于声学频带j中对应位置上的频点,即可得到声学频带j的频带能量。
在一些实施例中,当声学频带的数量为56个时,相应的,也需要56个三角滤波器。
进一步地,可以对声学频带j的频带能量进行对数变换,得到声学频带j的对数频带能量,进而可 以对声学频带j的对数频带能量进行离散余弦变换(Discrete Cosine Transform,DCT),从而可以得到声学频带j对应的目标倒频谱系数。
可以理解,在对每个声学频带均进行倒谱处理后,可以得到目标音频数据帧的N个目标倒频谱系数。获取其它音频数据帧的倒频谱系数的过程与获取目标倒频谱系数的过程一致,这里不再进行赘述。
其中,需要说明的是,现有技术通过神经网络直接估计frequency bin的大小(可理解为频域中样本之间的间隔),会导致极大的计算复杂性。为了解决这个问题,本申请不直接处理样本或频谱。假设语音和噪声的频谱包络足够平坦,可以使用比frequency bin更粗糙的分辨率,即将每个频点的频率划分为更粗糙的频率尺度,以降低计算复杂性,本申请实施例可以将这种更粗糙的频率尺度称为声学频带,不同的声学频带可用于刻画人耳对声音的感知所具备的非线性特征。
本申请实施例中的声学频带可以为Bark频率尺度、Mel频率尺度或者其它频率尺度,这里不进行限定。例如,以Bark频率尺度为例,这是一种与人耳对声音感知相匹配的频率刻度。Bark频率尺度是以Hz为单位,把频率映射到心理声学的24个临界频带上,第25个临界频带占据约16kHz~20kHz的频率,1个临界频带的宽度等于一个Bark。简单的说,Bark频率尺度是把物理频率转换到心理声学的频率。
结合上述步骤,如果直接使用frequency bin,需要考虑S1个(例如,257个)频点的(复数)频谱值,那么后续送入目标掩码估计模型的数据量会很大,因此本申请实施例利用频段包络的特征,将S1个频点重新划分为N个声学频带(band),从而可以达到减少计算量的目的。
例如,在一些实施例中,可以通过各种近似函数来近似表示Bark域。假设采样率为16kHz,窗长为512,帧长为20ms,帧移为12ms,傅立叶变换后的一个音频数据帧包含有257个频点,则可以基于设置的频带近似函数将这257个频点划分到56个声学频带上,划分依据的代码如下:
static const opus_int16 eband5ms[] = { // eband20ms --- 56 ok
  0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,34,36,38,40,42,44,46,48,56,64,72,80,92,104,116,128,144,160,176,192,208,232,256
};
也就是说,可以将频点0(即第一个频点,为直流分量频点)划分到第1个声学频带,将频点1(即第二个频点)划分到第2个声学频带,…,将频点232~频点255划分到第55个声学频带,将频点256划分到第56个声学频带。随后可以分别对这56个声学频带进行倒谱处理(即对每个声学频带的频带能量进行对数变换再做DCT),最终得到56个Bark-frequency倒频谱系数(Bark Frequency Cepstrum Coefficient,BFCC)。
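结合上述划分依据,下面给出一段示意性的Python代码草图,演示如何将257个频点映射到56个声学频带并计算对应的倒频谱系数;其中以矩形求和近似代替三角滤波,EBAND、bfcc等名称均为示例假设:
import numpy as np
from scipy.fft import dct

# 56个声学频带的起始频点,与上述eband5ms表一致
EBAND = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,
         25,26,27,28,29,30,31,32,34,36,38,40,42,44,46,48,56,64,72,80,92,
         104,116,128,144,160,176,192,208,232,256]

def bfcc(spectrum, num_bands=56):
    # spectrum: 长度为257的复数频谱(一个音频数据帧)
    power = np.abs(spectrum) ** 2
    band_energy = np.zeros(num_bands)
    for b in range(num_bands):
        lo = EBAND[b]
        hi = EBAND[b + 1] if b + 1 < num_bands else len(power)
        band_energy[b] = np.sum(power[lo:hi]) + 1e-10   # 频带能量(此处用矩形求和近似三角滤波)
    log_energy = np.log(band_energy)                    # 对数频带能量
    return dct(log_energy, type=2, norm='ortho')        # 离散余弦变换,得到56个倒频谱系数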
除了上述N个目标倒频谱系数外,还考虑这N个目标倒频谱系数的一阶时间导数和二阶时间导数。用户终端基于N个目标倒频谱系数,获取与目标音频数据帧相关联的M个一阶时间导数和M个二阶时间导数的具体过程可以为:
首先,对N个目标倒频谱系数进行差分运算,可以得到(N-1)个差分运算值,进而可以将(N-1)个差分运算值中的每个差分运算值作为一个一阶时间导数,随后可以在这(N-1)个一阶时间导数中获取与目标音频数据帧相关联的M个一阶时间导数;类似的,可以对得到的(N-1)个一阶时间导数进行二次差分运算,以得到(N-2)个差分运算值,进而可以将(N-2)个差分运算值中的每个差分运算值作为一个二阶时间导数,随后可以在(N-2)个二阶时间导数中获取与目标音频数据帧相关联的M个二阶时间导数,本申请实施例对M的取值不进行限定,例如,在一些实施例中,可以设置M=6。
为便于理解,请一并参见图5,图5是本申请实施例提供的一种倒频谱系数差分运算的场景示意图。如图5所示,假设一个音频数据帧对应有56个倒频谱系数(例如BFCC),分别为倒频谱系数1、倒频谱系数2、倒频谱系数3、…、倒频谱系数54、倒频谱系数55、倒频谱系数56。其中,对倒频谱系数1和倒频谱系数2进行差分运算(例如倒频谱系数2-倒频谱系数1),可以得到一阶时间导数1;对倒频谱系数2和倒频谱系数3进行差分运算,可以得到一阶时间导数2;…;对倒频谱系数54和倒频谱系数55进行差分运算,可以得到一阶时间导数54;对倒频谱系数55和倒频谱系数56进行差分运算,可以得到一阶时间导数55。
进而可以对得到的一阶时间导数1~一阶时间导数55再进行二次差分运算,例如,对一阶时间导数1和一阶时间导数2进行二次差分运算(例如一阶时间导数2-一阶时间导数1),可以得到二阶时间导数1;…;对一阶时间导数54和一阶时间导数55进行二次差分运算,可以得到二阶时间导数54。
在一些实施例中,可以设置M=6,取前6个一阶时间导数和前6个二阶时间导数,即取一阶时间导数1~一阶时间导数6以及二阶时间导数1~二阶时间导数6。
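上述差分运算可以用如下示意性的Python代码草图表示(delta_features为示例名称,M默认取6):
import numpy as np

def delta_features(cepstrum, m=6):
    # cepstrum: 当前音频数据帧的N个目标倒频谱系数
    d1 = np.diff(cepstrum)        # 一阶时间导数,共N-1个(相邻倒频谱系数作差)
    d2 = np.diff(d1)              # 二阶时间导数,共N-2个(对一阶时间导数再作差)
    return d1[:m], d2[:m]         # 取前M个(例如M=6)作为与目标音频数据帧相关联的导数特征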
步骤S103,获取每个历史音频数据帧对应的N个历史倒频谱系数,基于获取到的K*N个历史倒频谱系数,确定与目标音频数据帧相关联的频谱动态特征。
除了上述步骤S102中提及的目标倒频谱系数、一阶时间导数和二阶时间导数之外,还可以考虑过去音频数据帧对当前音频数据帧的平稳性度量,即频谱动态特征,该特征可以基于过去K个历史音频数据帧对应的频带差异值得到。
在一些实施例中,由于每个历史音频数据帧对应的N个历史倒频谱系数的获取均可以在对该历史音频数据帧进行处理时得到(即该音频数据帧作为目标音频数据帧时,其具体过程可以参见上述步骤S102中N个目标倒频谱系数的获取过程),因此,用户终端可以利用其缓存(例如环形buffer结构)对当前目标音频数据帧之前的最新K个历史音频数据帧分别对应的N个历史倒频谱系数进行存储。当每一个目标音频数据帧更新为其后一个音频数据帧时,相应的,也要对缓存中的历史倒频谱系数进行更新。
具体的,在目标音频数据帧之前的K个历史音频数据帧中,可以获取任意两个相邻的历史音频数据帧作为第一历史音频数据帧和第二历史音频数据帧,其中,第二历史音频数据帧为在第一历史音频数据帧之后得到的频谱帧。进而可以在与目标音频数据帧相关的缓存(例如,用户终端的本地缓存)中,获取第一历史音频数据帧对应的N个历史倒频谱系数以及第二历史音频数据帧对应的N个历史倒频谱系数。
基于获取到的K*N个历史倒频谱系数,确定与所述目标音频数据帧相关联的频谱动态特征时,具体包括:将所述第一历史音频数据帧对应的N个历史倒频谱系数与所述第二历史音频数据帧对应的N个历史倒频谱系数之间的N个系数差异值,作为所述第一历史音频数据帧和所述第二历史音频数据帧之间的帧间差异值;基于所述K个历史音频数据帧中各个相邻的历史音频数据帧之间的K-1帧间差异值,确定与所述目标音频数据帧相关联的频谱动态特征。
在本申请实施例中,为便于区分,可以将获取到的第一历史音频数据帧对应的N个历史倒频谱系数作为第一历史倒频谱系数,且可以将获取到的第二历史音频数据帧对应的N个历史倒频谱系数作为第二历史倒频谱系数。
进一步,可以将第一历史倒频谱系数与第二历史倒频谱系数之间的频带差异值,作为第一历史音频数据帧和第二历史音频数据帧之间的帧间差异值,具体过程可以为:在第一历史倒频谱系数所包含的N个历史倒频谱系数中,获取历史倒频谱系数Lp,同时,可以在第二历史倒频谱系数所包含的N个历史倒频谱系数中,获取历史倒频谱系数Lq,其中,p和q均为正整数,p=1,…,N,q=1,…,N,且p=q。进一步,可以获取历史倒频谱系数Lp与历史倒频谱系数Lq之间的系数差异值(例如,历史倒频谱系数Lp-历史倒频谱系数Lq),随后可以将N个系数差异值,确定为第一历史倒频谱系数与第二历史倒频谱系数之间的频带差异值。即,频带差异值包括N个系数差异值。随后可以将该频带差异值作为第一历史音频数据帧和第二历史音频数据帧之间的帧间差异值。
可选的,在一些实施例中,可以获取所有帧间差异值所包含的N个系数差异值的差异值总和,再对该差异值总和取平均(如该差异值总和/K),即可得到对应的频谱动态特征。
为便于理解,请一并参见图6,图6是本申请实施例提供的一种获取帧间差异值的场景示意图。如图6所示,假设当前有8个历史音频数据帧(即K=8),依次为历史音频数据帧1、历史音频数据帧2、…、历史音频数据帧7、历史音频数据帧8,每个历史音频数据帧对应有56个历史倒频谱系数(即N=56),例如,历史音频数据帧1对应有倒频谱系数A1~倒频谱系数A56,历史音频数据帧2对应有倒频谱系数B1~倒频谱系数B56,…,历史音频数据帧7对应有倒频谱系数C1~倒频谱系数C56,历史音频数据帧8对应有倒频谱系数D1~倒频谱系数D56。
当历史音频数据帧1作为第一历史音频数据帧,历史音频数据帧2作为第二历史音频数据帧时,第一历史倒频谱系数包括倒频谱系数A1~倒频谱系数A56,第二历史倒频谱系数包括倒频谱系数B1~倒频谱系数B56,此时可以获取倒频谱系数A1与倒频谱系数B1之间的系数差异值AB1(例如,倒频谱系数A1-倒频谱系数B1),获取倒频谱系数A2与倒频谱系数B2之间的系数差异值AB2,…,获取倒频谱系数A55与倒频谱系数B55之间的系数差异值AB55,获取倒频谱系数A56与倒频谱系数B56之间的系数差异值AB56,进而可以将56个系数差异值(即系数差异值AB1~系数差异值AB56)作为历史音频数据帧1与历史音频数据帧2之间的帧间差异值1。
依此类推,当历史音频数据帧7作为第一历史音频数据帧,历史音频数据帧8作为第二历史音频数据帧时,同理可以得到56个系数差异值(即系数差异值CD1~系数差异值CD56)作为历史音频数据帧7与历史音频数据帧8之间的帧间差异值7。
随后,可以基于上述8个历史音频数据帧之间的帧间差异值(即帧间差异值1~帧间差异值7),确定与当前音频数据帧相关联的频谱动态特征。例如,可以将每个帧间差异值所包含的56个系数差异值全部相加,得到对应的差异值总和,再对该8个差异值总和取平均即可得到对应的频谱动态特征,即频谱动态特征=(系数差异值AB1+…+系数差异值AB56+…+系数差异值CD1+…+系数差异值CD56)÷8。
需要说明的是,对于当前音频数据帧之前的历史音频数据帧的数量不满足指定K个(例如,8个)的情况,本申请实施例可以通过补零操作得到K个历史音频数据帧,其中,通过补零操作所得到的历 史音频数据帧为全零的频谱帧,相应的,这类全零的频谱帧对应的N个倒频谱系数也均可设置为零值。
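上述频谱动态特征的计算可以用如下示意性的Python代码草图表示,其中history_bfcc表示按时间顺序缓存的K个历史音频数据帧的历史倒频谱系数(形状为K×N的数组,历史帧不足K个时按上述方式补零):
import numpy as np

def spectral_dynamic_feature(history_bfcc):
    # history_bfcc: 形状为(K, N)的数组,history_bfcc[t]为第t个历史音频数据帧的N个历史倒频谱系数
    k = history_bfcc.shape[0]
    diffs = history_bfcc[:-1] - history_bfcc[1:]   # K-1组帧间差异值,每组包含N个系数差异值
    return np.sum(diffs) / k                       # 所有系数差异值求和后除以K,得到频谱动态特征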
步骤S104,将N个目标倒频谱系数、M个一阶时间导数、M个二阶时间导数以及频谱动态特征输入至目标掩码估计模型,由目标掩码估计模型输出目标音频数据帧对应的目标掩码。
可以理解,基于机器学习/深度学习的智能语音增强技术,受计算听觉场景分析(Computational Auditory Scene Analysis,CASA)中的时频(Time Frequency,简称T-F)掩蔽概念的启发,对于有监督的语音增强,训练目标的选择对于学习和泛化都很重要。
其中,训练目标是在语音信号的T-F表示上定义的,比如从短时傅立叶变换计算出来的谱图。这些训练目标主要分为两类:一种是基于掩蔽的目标,如IRM,它描述了纯净语音和背景噪声之间的时频关系;另一种是基于映射的目标,如对数功率谱,它对应于干净语音的频谱表示。
在本申请一实施例中,采用前一种方式,利用神经网络的非线性拟合能力从输入特征中估计掩码,然后将掩码与带噪语音信号(即本申请实施例中的原始音频数据)的频谱相乘后,重建时域波形,以实现增强的目的。
具体的,为便于区分,用户终端可以将上述步骤获取到的N个目标倒频谱系数、M个一阶时间导数、M个二阶时间导数以及频谱动态特征共(N+2M+1)个特征,均作为目标音频数据帧的目标音频特征,并可将该目标音频特征输入至目标掩码估计模型进行掩码估计。例如,在一些实施例中,目标倒频谱系数的数量为56个,一阶时间导数、二阶时间导数的数量均为6个,则输入目标掩码估计模型的目标音频特征的大小为56+6*2+1=69。
基于神经网络的目标掩码估计模型具有极强的非线性拟合能力,因此可以通过训练初始掩码估计模型学习如何从带噪音频特征中计算得到掩码,模型训练的具体过程可以参见后续图8所对应的实施例。
在一种实施方式中,目标掩码估计模型可以包括掩码估计网络层和掩码输出层,则首先可以将得到的目标音频特征输入至掩码估计网络层,通过该掩码估计网络层对输入的目标音频特征进行掩码估计,可以得到目标音频特征对应的隐藏特征。进一步,可以将该隐藏特征输入至掩码输出层,通过该掩码输出层对该隐藏特征进行特征合并,从而可以得到目标音频数据帧对应的目标掩码,其中,目标掩码的长度为N(与上述划分的声学频带的数量相同)。在本申请实施例中,目标掩码可用于抑制原始音频数据中的噪声数据,以得到原始音频数据对应的增强音频数据。
在本申请实施例中,掩码也可称为增益或频带增益,可以包括但不限于理想比率掩码(IRM)、理想二值掩码(IBM)、最佳比率掩码(ORM)等,这里将不对目标掩码的类型进行限定。
其中,基于目标语音和背景噪声正交,即目标语音和背景噪声不相关的假设下,IRM直接刻画了时频单元内纯净语音能量和带噪语音能量的比值。IRM取值可在0到1之间,值越大代表时频单元内目标语音占的比重越高。
由于目标语音在时频域上是稀疏分布的,对于一个具体的时频单元,目标语音和背景噪声的能量差异通常比较大,因此大多数时频单元上的信噪比极高或极低,IBM是对这种现实情况的简化描述,将连续的时频单元信噪比离散化为两种状态1和0。在一个时频单元内,如果目标语音占主导(高信噪比),则被标记为1;反之,如果背景噪声占主导(低信噪比),则标记为0,最后将IBM和带噪语音相乘,实际上就是将低信噪比的时频单元置零,以此达到消除背景噪声的目的。因此,IBM可以认为是IRM的二值版本。其中,ORM的定义是通过最小化纯净语音和估计目标语音的均方误差来导出的,与IRM非常相似。
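以IRM为例,训练阶段所需的掩码可以按声学频带由纯净语音能量与噪声能量计算得到,下面给出一种常见写法的示意性Python代码草图(此处采用能量比的平方根形式,仅为一种示例,具体形式可按所选掩码类型调整):
import numpy as np

def ideal_ratio_mask(clean_band_energy, noise_band_energy):
    # clean_band_energy / noise_band_energy: 每个声学频带上纯净语音与噪声的能量(长度为N的数组)
    # 基于语音与噪声不相关的假设,IRM取值范围为(0,1),值越大表示该频带中目标语音占比越高
    return np.sqrt(clean_band_energy / (clean_band_energy + noise_band_energy + 1e-10))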
可选的,在一些实施例中,上述掩码估计网络层可以包括存在跳跃连接的第一掩码估计网络层、第二掩码估计网络层以及第三掩码估计网络层,其中,第一掩码估计网络层、第二掩码估计网络层以及第三掩码估计网络层之间的跳跃连接可以避免网络过拟合。
通过该掩码估计网络层对输入的目标音频特征进行掩码估计的具体过程可以为:
将目标音频特征输入至第一掩码估计网络层,通过第一掩码估计网络层输出第一中间特征;
进而可以根据第一掩码估计网络层与第二掩码估计网络层之间的跳跃连接,对第一中间特征和目标音频特征进行特征拼接,以得到第二中间特征,随后可以将得到的第二中间特征输入至第二掩码估计网络层,通过第二掩码估计网络层可以输出第三中间特征;
进一步,可以根据第一掩码估计网络层与第三掩码估计网络层之间的跳跃连接以及第二掩码估计网络层与第三掩码估计网络层之间的跳跃连接,对第三中间特征、目标音频特征以及第一中间特征进行特征拼接,以得到第四中间特征;
将第四中间特征输入至第三掩码估计网络层,通过第三掩码估计网络层可以输出目标音频特征对应的隐藏特征。
可选的,目标掩码估计模型还可以采用更多或更少的掩码估计网络层,本申请实施例对掩码估计网络层的具体数量不进行限定。
其中,掩码估计网络层中的第一掩码估计网络层、第二掩码估计网络层以及第三掩码估计网络层可 以采用门控循环单元(Gated Recurrent Units,GRU)或者长短期记忆网络(Long short-term memory,LSTM)等网络结构,掩码输出层则可以采用全连接层或其它网络结构,本申请实施例对掩码估计网络层和掩码输出层的具体结构不进行限定。
其中,GRU是循环神经网络中的一种门控机制,它与具有遗忘门的LSTM相类似,GRU包含更新门、重置门,相比于LSTM少了输出门,其参数比LSTM少,因此,若采用GRU来设计掩码估计网络层,则可以得到轻量级的掩码估计模型。
为便于理解,请一并参见图7,图7是本申请实施例提供的一种掩码估计模型的网络结构示意图。如图7所示,在获得相应的音频特征(例如大小为69的音频特征)后,可以将其输入掩码估计模型70(即目标掩码估计模型),该模型可以包括门控循环网络层1(即第一掩码估计网络层)、门控循环网络层2(即第二掩码估计网络层)、门控循环网络层3(即第三掩码估计网络层)以及全连接层(即掩码输出层),其采用三层简单的GRU神经网络对音频特征进行建模,最后一层全连接层用于输出增益(即掩码)。
可选的,在该实施例中,输入模型的每个音频数据帧对应的特征数量可以为69,经过的三层门控循环网络层的节点(也称为神经元或感知器)数可依次为64、96、96,相应的,门控循环网络层1输出的第一中间特征的特征维度为64,门控循环网络层2输出的第三中间特征的特征维度为96,门控循环网络层3输出的隐藏特征的特征维度为96,此外,全连接层的节点数可以为56,则最终输出的掩码的维度为56(即输出56个掩码值)。
其中,每一网络层均可以使用合适的激活函数,例如,门控循环网络层1可以使用ReLu函数(Rectified Linear Unit,线性整流函数,又称修正线性单元),门控循环网络层2可以使用ReLu函数,门控循环网络层3可以使用tanh函数(hyperbolic tangent function,双曲正切函数);若掩码采用IRM,则全连接层可以使用sigmoid函数,以保证输出掩码的取值范围为(0,1)。上述每一网络层还可以采用其它函数作为激活函数,本申请实施例对此不进行限定。
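按照上述网络结构,目标掩码估计模型可以用如下示意性的Python(tf.keras)代码草图搭建,其中的层数、节点数与激活函数与前述示例一致,build_mask_model为示例名称,并非对本申请实施方式的限定:
import tensorflow as tf

def build_mask_model(feat_dim=69, num_bands=56):
    # 三层GRU加跳跃连接,最后由全连接层输出56维掩码的轻量级网络草图
    inp = tf.keras.Input(shape=(None, feat_dim))                                  # 逐帧输入的目标音频特征
    g1 = tf.keras.layers.GRU(64, return_sequences=True, activation='relu')(inp)   # 第一掩码估计网络层
    c1 = tf.keras.layers.Concatenate()([g1, inp])                                 # 跳跃连接:拼接第一中间特征与输入特征
    g2 = tf.keras.layers.GRU(96, return_sequences=True, activation='relu')(c1)    # 第二掩码估计网络层
    c2 = tf.keras.layers.Concatenate()([g2, inp, g1])                             # 跳跃连接:拼接第三中间特征、输入特征与第一中间特征
    g3 = tf.keras.layers.GRU(96, return_sequences=True, activation='tanh')(c2)    # 第三掩码估计网络层,输出隐藏特征
    out = tf.keras.layers.Dense(num_bands, activation='sigmoid')(g3)              # 掩码输出层,sigmoid保证输出取值范围(0,1)
    return tf.keras.Model(inp, out)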
需要说明的是,本申请实施例所采用的目标掩码估计模型是一个轻量级神经网络,其中的三层掩码估计网络层可以实现很好的掩码估计效果,且网络参数量较少,网络复杂度低,从而可以减少计算时间和CPU消耗。
可以理解,在假设噪声和语音不相关时,带噪语音的能量必然大于纯净语音的能量,将频域划分为N个声学频带计算能量,对于每个声学频带,该频带所含的噪声越少,语音越纯净,频带增益越大。基于此,对于含噪语音,给每个声学频带乘以增益,其物理意义即为当该声学频带噪声较大时,可将其乘以一个较小的增益,反之则可乘以一个较大的增益,这样便可以增强语音、抑制噪声。
在本申请实施例中,经过上述步骤得到目标掩码(即频带增益)后,用户终端可以利用该目标掩码进行噪声抑制,具体过程可以为:
当目标掩码的长度(即N)小于目标音频数据帧的长度(即S1)时,需要对目标掩码进行插值处理,以得到对应的插值掩码,此时得到的插值掩码的长度与目标音频数据帧的长度(例如,257个频点)相同。
进一步,可以将该插值掩码与目标音频数据帧相乘,即将插值掩码中的每个掩码值对应作用于目标音频数据帧中通过傅立叶变换所得的每一个频点,进而可以对相乘结果进行傅立叶逆变换,从而得到对目标音频数据帧进行噪声抑制后的目标音频数据,即还原为增强后的时域语音信号。
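上述插值与噪声抑制过程可以用如下示意性的Python代码草图表示(沿用前述草图中的EBAND频带表,apply_mask为示例名称,加合成窗与叠接相加等重构细节未展开):
import numpy as np

def apply_mask(spectrum, band_gains, fft_size=512, frame_len=320):
    # spectrum: 长度为257的复数频谱(目标音频数据帧);band_gains: 长度为56的目标掩码
    bins = np.arange(fft_size // 2 + 1)
    gains = np.interp(bins, np.array(EBAND, dtype=float), band_gains)   # 将56个频带增益插值到257个频点
    enhanced = np.fft.irfft(spectrum * gains, n=fft_size)               # 与频谱相乘后做傅立叶逆变换
    return enhanced[:frame_len]                                         # 得到噪声抑制后的时域目标音频数据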
可以理解,对其它音频数据进行噪声抑制的过程与对目标音频数据帧进行噪声抑制的过程类似,这里不再进行赘述。
最终,当对与原始音频数据相关联的每个音频数据帧均进行噪声抑制后,基于每个音频数据帧对应的目标音频数据,可以得到原始音频数据对应的增强音频数据,此时得到的增强音频数据中的噪声含量很低,且没有误消除业务对象的语音数据,因此该增强音频数据具有极高的语音保真度。
上述可知,本申请实施例在对原始音频数据进行语音增强时,可以综合考虑包括目标倒频谱系数、一阶时间导数、二阶时间导数以及频谱动态特征在内的多种音频特征,从而可以更准确地描述业务对象的语音数据与背景的噪声数据之间的时频关系,即可以得到准确度更高的目标掩码,而不会在对背景噪声进行抑制的同时对所需语音也进行抑制。因此,将输出的每一组掩码作用于相应的音频数据帧时,可以有效抑制音频数据中的噪声数据,且提升语音保真度,尤其在实时音视频通讯场景(例如,实时音视频会议场景)下可以为用户提供高质量和高清晰度的语音,提升用户的体验感。
此外,本申请实施例先利用数字信号处理技术对带噪音频数据提取相应的音频特征,再将提取到的音频特征输入轻量化的神经网络模型(即目标掩码估计模型)中,快速进行掩码估计,因此本申请实施例所需的网络复杂度更低,从而可以减少计算复杂度和CPU(Central Processing Unit,中央处理器)消耗,进而提高音频数据处理效率。
请参见图8,图8是本申请实施例提供的一种音频数据处理方法的流程示意图。其中,可以理解的 是,本申请实施例提供的方法可以由计算机设备执行,这里的计算机设备包括但不限于用户终端或业务服务器,例如图1中所示的用户终端200a、用户终端200b、用户终端200c、…、用户终端200n或业务服务器100。为便于理解,本申请实施例以该计算机设备为用户终端为例,以阐述在该用户终端中对初始掩码估计模型进行模型训练的具体过程。如图8所示,该方法至少可以包括下述步骤S201-步骤S205:
步骤S201,获取与样本音频数据相关联的目标样本音频数据帧和K个历史样本音频数据,且获取目标样本音频数据帧对应的样本掩码。
用户终端可以从具有海量音频数据的音频数据库中获取样本音频数据,这里的样本音频数据可以为带噪语音信号(例如携带Babble Noise和样本对象的语音数据的音频数据)。
随后,可以通过对样本音频数据进行分帧加窗预处理、时频变换等操作,获取与该样本音频数据相关联的目标样本音频数据帧和K个历史样本音频数据。其中,目标样本音频数据帧和K个历史样本音频数据帧均为频谱帧,且K个历史样本音频数据帧中的每个历史样本音频数据帧均为目标样本音频数据帧之前的频谱帧,K为正整数,其具体过程可以参见上述图3所对应实施例中的步骤S101,这里不再进行赘述。
此外,为了后续计算损失函数,用户终端还可以获取目标样本音频数据帧对应的样本掩码。
步骤S202,在获取到目标样本音频数据帧的N个目标样本倒频谱系数时,基于N个目标样本倒频谱系数,获取与目标样本音频数据帧相关联的M个样本一阶时间导数和M个样本二阶时间导数。
用户终端可以将目标样本音频数据帧包含的多个频点映射到划分的N个样本声学频带,并通过对每个样本声学频带进行倒谱处理,得到每个样本声学频带分别对应的目标样本倒频谱系数。
进而可以基于获取到的N个目标样本倒频谱系数,获取与目标样本音频数据帧相关联的M个样本一阶时间导数和M个样本二阶时间导数,其中,N为大于1的正整数,M为小于N的正整数。该步骤的具体实现方式可以参见上述图3所对应实施例中的步骤S102,这里不再进行赘述。
步骤S203,获取每个历史样本音频数据帧分别对应的N个历史样本倒频谱系数,基于获取到的K*N个历史样本倒频谱系数确定与目标样本音频数据帧相关联的样本频谱动态特征。
用户终端可以获取K个历史样本音频数据帧中任意两个相邻的历史样本音频数据帧分别对应的N个历史样本倒频谱系数,随后可以基于获取到的两组历史样本倒频谱系数,确定这两个相邻的历史样本音频数据帧之间的样本帧间差异值,最终可以基于K-1个样本帧间差异值,确定与目标样本音频数据帧相关联的样本频谱动态特征。该步骤的具体实现方式可以参见上述图3所对应实施例中的步骤S103,这里不再进行赘述。
步骤S204,将N个目标样本倒频谱系数、M个样本一阶时间导数、M个样本二阶时间导数以及样本频谱动态特征输入至初始掩码估计模型,由初始掩码估计模型输出目标样本音频数据帧对应的预测掩码。
用户终端可以将得到的N个目标样本倒频谱系数、M个样本一阶时间导数、M个样本二阶时间导数以及样本频谱动态特征作为目标样本音频数据帧的样本音频特征,随后可以将该样本音频特征输入至初始掩码估计模型,由初始掩码估计模型输出目标样本音频数据帧对应的预测掩码。
初始掩码估计模型的一种示例性的网络结构可以参见上述图7所对应的实施例。该步骤的具体实现方式可以参见上述图3所对应实施例中的步骤S104,这里不再进行赘述。
步骤S205,基于预测掩码和样本掩码,对初始掩码估计模型进行迭代训练,得到目标掩码估计模型,所述目标掩码估计模型用于输出与原始音频数据相关联的目标音频数据帧所对应的目标掩码。
用户终端可以基于预测掩码和样本掩码生成损失函数,进而可以基于该损失函数对初始掩码估计模型中的模型参数进行修正,通过多次的迭代训练,最终可以得到用于输出与原始音频数据相关联的目标音频数据帧所对应的目标掩码的目标掩码估计模型。其中,目标掩码可用于抑制原始音频数据中的噪声数据,以得到原始音频数据对应的增强音频数据。
在一种可选的实施方式中,模型训练使用的损失函数可以为Huber loss(是一个用于回归问题的带参损失函数),其公式描述如下:
当|g_true−g_pred|≤d时,loss=0.5·(g_true−g_pred)²;当|g_true−g_pred|>d时,loss=d·(|g_true−g_pred|−0.5·d)
其中,g_true是指样本掩码,g_pred是指预测掩码,loss中的超参数d可以设置为0.1。
此外,还可以使用其他形式的损失函数,本申请实施例对此不进行限定。
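上述Huber loss的计算可以用如下示意性的Python代码草图表示(d为超参数,示例中取0.1):
import numpy as np

def huber_loss(g_true, g_pred, d=0.1):
    # g_true: 样本掩码;g_pred: 初始掩码估计模型输出的预测掩码
    err = np.abs(g_true - g_pred)
    quadratic = 0.5 * err ** 2                  # 误差不超过d时按平方项计算
    linear = d * (err - 0.5 * d)                # 误差超过d时按线性项计算
    return np.mean(np.where(err <= d, quadratic, linear))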
为便于理解,请一并参见图9,图9是本申请实施例提供的一种模型训练的流程示意图。如图9所示,可以设计一个神经网络模型对提取的音频特征进行建模以得到相应的掩码,用以对音频数据中的背景噪声进行抑制。
具体的,在获取到训练语音901(即样本音频数据)后,可以对其进行语音预处理902(即音频预处理),从而得到相关联的多个样本音频数据帧,进而可以依次对每个样本音频数据帧进行音频特征提取903,以得到每个样本音频数据帧分别对应的样本音频特征,随后可以将这些样本音频特征分别输入到初始掩码估计模型中去进行网络模型训练904,从而可得到中间掩码估计模型。
此时需要进一步对得到的中间掩码估计模型验证其泛化性能。类似的,可以对获取到的测试语音905(也可称为测试音频数据,可与样本音频数据一起获取得到)进行语音预处理906,以得到多个测试音频数据帧,进而可以依次对每个测试音频数据帧进行音频特征提取907,以得到每个测试音频数据帧分别对应的测试音频特征.
进一步,可以将这些测试音频特征分别输入到中间掩码估计模型中,通过中间掩码估计模型输出相应的掩码,并可以使用得到的掩码作用于对应的测试音频数据帧,得到抑制背景噪声后的频谱,随后可以对得到的频谱进行傅立叶逆变换并重建时域语音信号908,从而实现语音增强909的目的。
当得到的测试结果满足预期时,可以将该中间掩码估计模型作为后续可直接使用的目标掩码估计模型。
进一步地,请一并参见图10,图10是本申请实施例提供的一种降噪效果示意图。如图10所示,一段含有Babble Noise的语音所对应的频谱为图10中的频谱10A,采用本申请提供的方法和目标掩码估计模型对该带噪语音进行噪声抑制后,得到的增强语音对应的频谱为图10中的频谱10B。由前后两个频谱对比可知,本申请提供的方法可以有效抑制Babble Noise等背景噪声,同时保留了较为完整的语音。
上述可知,本申请实施例通过对初始掩码估计模型进行训练,可以得到用于输出音频数据帧所对应的掩码的目标掩码估计模型,由于该模型是一个轻量级的神经网络模型,因此可以降低语音增强过程中的计算复杂度,保证落地场景时的安装包的大小并减少CPU消耗。此外,训练好的目标掩码估计模型可以自动化地快速输出估计出的掩码,从而可以提升对音频数据进行语音增强处理的效率。
请参见图11,是本申请实施例提供的一种音频数据处理装置的结构示意图。该音频数据处理装置1可以是运行于计算机设备的一个计算机程序(包括程序代码),例如该音频数据处理装置1为一个应用软件;该装置可以用于执行本申请实施例提供的音频数据处理方法中的相应步骤。如图11所示,该音频数据处理装置1可以包括:第一获取模块11、第二获取模块12、第三获取模块13、掩码估计模块14、频带映射模块15、倒谱处理模块16、噪声抑制模块17;
第一获取模块11,用于获取与原始音频数据相关联的目标音频数据帧和K个历史音频数据帧;目标音频数据帧和K个历史音频数据帧均为频谱帧,且K个历史音频数据帧中的每个历史音频数据帧均为目标音频数据帧之前的频谱帧,K为正整数。
其中,该第一获取模块11可以包括:音频预处理单元111、时频变换单元112、数据帧确定单元113;
音频预处理单元111,用于对原始音频数据进行分帧加窗预处理,得到H个音频数据段;H为大于1的正整数;
时频变换单元112,用于分别对每个音频数据段进行时频变换,得到每个音频数据段对应的音频数据帧;
在一种实施方式中,H个音频数据段包括音频数据段i,i为小于或等于H的正整数;
该时频变换单元112,具体用于对音频数据段i进行傅立叶变换,得到音频数据段i在频域中的直流分量频点和2S个频点;2S个频点包括与第一频点类型相关的S个频点和与第二频点类型相关的S个频点;S为正整数;基于与第一频点类型相关的S个频点和直流分量频点,确定音频数据段i对应的音频数据帧;
数据帧确定单元113,用于在得到的H个音频数据帧中,确定目标音频数据帧以及目标音频数据帧之前的K个历史音频数据帧;K小于H。
其中,音频预处理单元111、时频变换单元112、数据帧确定单元113的具体功能实现方式可以参见上述图3所对应实施例中的步骤S101,这里不再进行赘述。
第二获取模块12,用于在获取到目标音频数据帧的N个目标倒频谱系数时,基于N个目标倒频谱系数,获取与目标音频数据帧相关联的M个一阶时间导数和M个二阶时间导数;N为大于1的正整数,M为小于N的正整数;
其中,该第二获取模块12可以包括:第一差分单元121、第二差分单元122;
第一差分单元121,用于对N个目标倒频谱系数进行差分运算,得到(N-1)个差分运算值,将(N-1)个差分运算值中的每个差分运算值作为一个一阶时间导数,在(N-1)个一阶时间导数中获取与目标音频数据帧相关联的M个一阶时间导数;
第二差分单元122,用于对(N-1)个一阶时间导数进行二次差分运算,得到(N-2)个差分运算值,将(N-2) 个差分运算值中的每个差分运算值作为一个二阶时间导数,在(N-2)个二阶时间导数中获取与目标音频数据帧相关联的M个二阶时间导数。
其中,第一差分单元121、第二差分单元122的具体功能实现方式可以参见上述图3所对应实施例中的步骤S102,这里不再进行赘述。
第三获取模块13,用于获取每个历史音频数据帧对应的N个历史倒频谱系数,基于获取到的K*N个历史倒频谱系数确定与目标音频数据帧相关联的频谱动态特征;
其中,该第三获取模块13可以包括:数据帧获取单元131、系数获取单元132、差异确定单元133、特征确定单元134;
数据帧获取单元131,用于在K个历史音频数据帧中,获取任意两个相邻的历史音频数据帧作为第一历史音频数据帧和第二历史音频数据帧;第二历史音频数据帧为在第一历史音频数据帧之后得到的频谱帧;
系数获取单元132,用于在与目标音频数据帧相关的缓存中,获取第一历史音频数据帧对应的N个历史倒频谱系数以及第二历史音频数据帧对应的N个历史倒频谱系数;
差异确定单元133,用于将所述第一历史音频数据帧对应的N个历史倒频谱系数与所述第二历史音频数据帧对应的N个历史倒频谱系数之间的N个系数差异值,作为所述第一历史音频数据帧和所述第二历史音频数据帧之间的帧间差异值;
其中,该差异确定单元133可以包括:系数差异获取子单元1331、差异值确定子单元1332;
系数差异获取子单元1331,用于在第一历史倒频谱系数所包含的N个历史倒频谱系数中,获取历史倒频谱系数Lp,且在第二历史倒频谱系数所包含的N个历史倒频谱系数中,获取历史倒频谱系数Lq;p和q均为小于或等于N的正整数,且p=q;获取历史倒频谱系数Lp与历史倒频谱系数Lq之间的系数差异值;
差异值确定子单元1332,用于基于系数差异值确定第一历史倒频谱系数与第二历史倒频谱系数之间的频带差异值,将频带差异值作为第一历史音频数据帧和第二历史音频数据帧之间的帧间差异值。
其中,系数差异获取子单元1331、差异值确定子单元1332的具体功能实现方式可以参见上述图3所对应实施例中的步骤S103,这里不再进行赘述。
特征确定单元134,用于基于所述K个历史音频数据帧中各个相邻的历史音频数据帧之间的K-1帧间差异值,确定与目标音频数据帧相关联的频谱动态特征。
其中,数据帧获取单元131、系数获取单元132、差异确定单元133、特征确定单元134的具体功能实现方式可以参见上述图3所对应实施例中的步骤S103,这里不再进行赘述。
掩码估计模块14,用于将N个目标倒频谱系数、M个一阶时间导数、M个二阶时间导数以及频谱动态特征输入至目标掩码估计模型,由目标掩码估计模型输出目标音频数据帧对应的目标掩码;目标掩码用于抑制原始音频数据中的噪声数据,以得到原始音频数据对应的增强音频数据;
在一种实施方式中,目标掩码估计模型包括掩码估计网络层和掩码输出层;该掩码估计模块14可以包括:掩码估计单元141、掩码输出单元142;
掩码估计单元141,用于将N个目标倒频谱系数、M个一阶时间导数、M个二阶时间导数以及频谱动态特征作为目标音频数据帧的目标音频特征,将目标音频特征输入至掩码估计网络层,通过掩码估计网络层对目标音频特征进行掩码估计,得到目标音频特征对应的隐藏特征;
在一种实施方式中,掩码估计网络层包括存在跳跃连接的第一掩码估计网络层、第二掩码估计网络层以及第三掩码估计网络层;该掩码估计单元141可以包括:第一估计子单元1411、第二估计子单元1412、第三估计子单元1413;
第一估计子单元1411,用于将目标音频特征输入至第一掩码估计网络层,通过第一掩码估计网络层输出第一中间特征;
第二估计子单元1412,用于根据第一掩码估计网络层与第二掩码估计网络层之间的跳跃连接,对第一中间特征和目标音频特征进行特征拼接,得到第二中间特征,将第二中间特征输入至第二掩码估计网络层,通过第二掩码估计网络层输出第三中间特征;
第三估计子单元1413,用于根据第一掩码估计网络层与第三掩码估计网络层之间的跳跃连接以及第二掩码估计网络层与第三掩码估计网络层之间的跳跃连接,对第三中间特征、目标音频特征以及第一中间特征进行特征拼接,得到第四中间特征,将第四中间特征输入至第三掩码估计网络层,通过第三掩码估计网络层输出目标音频特征对应的隐藏特征。
其中,第一估计子单元1411、第二估计子单元1412、第三估计子单元1413的具体功能实现方式可以参见上述图3所对应实施例中的步骤S104,这里不再进行赘述。
掩码输出单元142,用于将隐藏特征输入至掩码输出层,通过掩码输出层对隐藏特征进行特征合并, 得到目标音频数据帧对应的目标掩码。
其中,掩码估计单元141、掩码输出单元142的具体功能实现方式可以参见上述图3所对应实施例中的步骤S104,这里不再进行赘述。
在一种实施方式中,目标音频数据帧包含有S1个频点,S1个频点包括一个直流分量频点以及与一种频点类型相关的S2个频点,S1和S2均为正整数;
频带映射模块15,用于将S1个频点映射到N个声学频带上;S1大于或等于N;
倒谱处理模块16,用于分别对每个声学频带进行倒谱处理,得到每个声学频带对应的目标倒频谱系数;
在一种实施方式中,N个声学频带包括声学频带j,j为小于或等于N的正整数;
该倒谱处理模块16可以包括:能量获取单元161、余弦变换单元162;
能量获取单元161,用于获取声学频带j的频带能量,对声学频带j的频带能量进行对数变换,得到声学频带j的对数频带能量;
余弦变换单元162,用于对声学频带j的对数频带能量进行离散余弦变换,得到声学频带j对应的目标倒频谱系数。
其中,能量获取单元161、余弦变换单元162的具体功能实现方式可以参见上述图3所对应实施例中的步骤S102,这里不再进行赘述。
噪声抑制模块17,用于对目标掩码进行插值处理,得到插值掩码;插值掩码的长度与目标音频数据帧的长度相同;将插值掩码与目标音频数据帧相乘,对相乘结果进行傅立叶逆变换,得到对目标音频数据帧进行噪声抑制后的目标音频数据;当对与原始音频数据相关联的每个音频数据帧均进行噪声抑制后,基于所述每个音频数据帧对应的目标音频数据,得到原始音频数据对应的增强音频数据。
其中,第一获取模块11、第二获取模块12、第三获取模块13、掩码估计模块14、频带映射模块15、倒谱处理模块16、噪声抑制模块17的具体功能实现方式可以参见上述图3所对应实施例中的步骤S101-步骤S104,这里不再进行赘述。另外,对采用相同方法的有益效果描述,也不再进行赘述。
请参见图12,是本申请实施例提供的一种音频数据处理装置的结构示意图。该音频数据处理装置2可以是运行于计算机设备的一个计算机程序(包括程序代码),例如该音频数据处理装置2为一个应用软件;该装置可以用于执行本申请实施例提供的音频数据处理方法中的相应步骤。如图12所示,该音频数据处理装置2可以包括:第一获取模块21、第二获取模块22、第三获取模块23、掩码预测模块24、模型训练模块25;
第一获取模块21,用于获取与样本音频数据相关联的目标样本音频数据帧和K个历史样本音频数据,且获取目标样本音频数据帧对应的样本掩码;目标样本音频数据帧和K个历史样本音频数据帧均为频谱帧,且K个历史样本音频数据帧中的每个历史样本音频数据帧均为目标样本音频数据帧之前的频谱帧,K为正整数;
第二获取模块22,用于在获取到目标样本音频数据帧的N个目标样本倒频谱系数时,基于N个目标样本倒频谱系数,获取与目标样本音频数据帧相关联的M个样本一阶时间导数和M个样本二阶时间导数;N为大于1的正整数,M为小于N的正整数;
第三获取模块23,用于获取每个历史样本音频数据帧分别对应的N个历史样本倒频谱系数,基于获取到的K*N个历史样本倒频谱系数确定与目标样本音频数据帧相关联的样本频谱动态特征;
掩码预测模块24,用于将N个目标样本倒频谱系数、M个样本一阶时间导数、M个样本二阶时间导数以及样本频谱动态特征输入至初始掩码估计模型,由初始掩码估计模型输出目标样本音频数据帧对应的预测掩码;
模型训练模块25,用于基于所述预测掩码和所述样本掩码,对所述初始掩码估计模型进行迭代训练,得到目标掩码估计模型,所述目标掩码估计模型用于输出与原始音频数据相关联的目标音频数据帧所对应的目标掩码;所述目标掩码用于抑制所述原始音频数据中的噪声数据,以得到所述原始音频数据对应的增强音频数据。
其中,第一获取模块21、第二获取模块22、第三获取模块23、掩码预测模块24、模型训练模块25的具体功能实现方式可以参见上述图8所对应实施例中的步骤S201-步骤S205,这里不再进行赘述。另外,对采用相同方法的有益效果描述,也不再进行赘述。
请参见图13,是本申请实施例提供的一种计算机设备的结构示意图。如图13所示,该计算机设备1000可以包括:处理器1001,网络接口1004和存储器1005,此外,上述计算机设备1000还可以包括:用户接口1003,和至少一个通信总线1002。其中,通信总线1002用于实现这些组件之间的连接通信。其中,用户接口1003可以包括显示屏(Display)、键盘(Keyboard),可选用户接口1003还可以包括标准的有线接口、无线接口。网络接口1004可选的可以包括标准的有线接口、无线接口(如WI-FI接口)。存储器1005可以是高速RAM存储器,也可以是非不稳定的存储器(non-volatile memory),例如至少一个磁盘存储器。存储器1005可选的还可以是至少一个位于远离前述处理器1001的存储装置。如图13所示,作为一种计算机可读存储介质的存储器1005中可以包括操作系统、网络通信模块、用户接口模块以及设备控制应用程序。
在如图13所示的计算机设备1000中,网络接口1004可提供网络通讯功能;而用户接口1003主要用于为用户提供输入的接口;而处理器1001可以用于调用存储器1005中存储的设备控制应用程序,以执行前文图3、图8任一个所对应实施例中对该音频数据处理方法的描述,在此不再赘述。另外,对采用相同方法的有益效果描述,也不再进行赘述。
此外,这里需要指出的是:本申请实施例还提供了一种计算机可读存储介质,且计算机可读存储介质中存储有前文提及的音频数据处理装置1和音频数据处理装置2所执行的计算机程序,且计算机程序包括程序指令,当处理器执行程序指令时,能够执行前文图3、图8任一个所对应实施例中对音频数据处理方法的描述,因此,这里将不再进行赘述。另外,对采用相同方法的有益效果描述,也不再进行赘述。对于本申请所涉及的计算机可读存储介质实施例中未披露的技术细节,请参照本申请方法实施例的描述。
上述计算机可读存储介质可以是前述任一实施例提供的音频数据处理装置或者上述计算机设备的内部存储单元,例如计算机设备的硬盘或内存。该计算机可读存储介质也可以是该计算机设备的外部存储设备,例如该计算机设备上配备的插接式硬盘,智能存储卡(smart media card,SMC),安全数字(secure digital,SD)卡,闪存卡(flash card)等。进一步地,该计算机可读存储介质还可以既包括该计算机设备的内部存储单元也包括外部存储设备。该计算机可读存储介质用于存储该计算机程序以及该计算机设备所需的其他程序和数据。该计算机可读存储介质还可以用于暂时地存储已经输出或者将要输出的数据。
此外,这里需要指出的是:本申请实施例还提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行前文图3、图8任一个所对应实施例提供的方法。另外,对采用相同方法的有益效果描述,也不再进行赘述。对于本申请所涉及的计算机程序产品或者计算机程序实施例中未披露的技术细节,请参照本申请方法实施例的描述。
进一步的,请参见图14,图14是本申请实施例提供的一种音频数据处理***的结构示意图。该音频数据处理***3可以包含音频数据处理装置1a和音频数据处理装置2a。
其中,音频数据处理装置1a可以为上述图11所对应实施例中的音频数据处理装置1,可以理解的是,该音频数据处理装置1a可以集成在上述图2所对应实施例中的计算机设备20,因此,这里将不再进行赘述。
其中,音频数据处理装置2a可以为上述图12所对应实施例中的音频数据处理装置2,可以理解的是,该音频数据处理装置2a可以集成在上述图2对应实施例中的计算机设备20,因此,这里将不再进行赘述。
另外,对采用相同方法的有益效果描述,也不再进行赘述。对于本申请所涉及的音频数据处理***实施例中未披露的技术细节,请参照本申请方法实施例的描述。
本申请实施例的说明书和权利要求书及附图中的术语“第一”、“第二”等是用于区别不同对象,而非用于描述特定顺序。此外,术语“包括”以及它们任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、装置、产品或设备没有限定于已列出的步骤或模块,而是可选地还包括没有列出的步骤或模块,或可选地还包括对于这些过程、方法、装置、产品或设备固有的其他步骤单元。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、计算机软件或者二者的结合来实现,为了清楚地说明硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
以上所揭露的仅为本申请较佳实施例而已,当然不能以此来限定本申请之权利范围,因此依本申请权利要求所作的等同变化,仍属本申请所涵盖的范围。

Claims (20)

  1. 一种音频数据处理方法,由计算机设备执行,包括:
    获取与原始音频数据相关联的目标音频数据帧和K个历史音频数据帧;所述目标音频数据帧和所述K个历史音频数据帧均为频谱帧,且所述K个历史音频数据帧中的每个历史音频数据帧均为所述目标音频数据帧之前的频谱帧,K为正整数;
    在获取到所述目标音频数据帧的N个目标倒频谱系数时,基于所述N个目标倒频谱系数,获取与所述目标音频数据帧相关联的M个一阶时间导数和M个二阶时间导数;N为大于1的正整数,M为小于N的正整数;
    获取所述每个历史音频数据帧对应的N个历史倒频谱系数,基于获取到的K*N个历史倒频谱系数,确定与所述目标音频数据帧相关联的频谱动态特征;及,
    将所述N个目标倒频谱系数、所述M个一阶时间导数、所述M个二阶时间导数以及所述频谱动态特征输入至目标掩码估计模型,由所述目标掩码估计模型输出所述目标音频数据帧对应的目标掩码;所述目标掩码用于抑制所述原始音频数据中的噪声数据,以得到所述原始音频数据对应的增强音频数据。
  2. 根据权利要求1所述的方法,其中,所述获取与原始音频数据相关联的目标音频数据帧和K个历史音频数据帧,包括:
    对所述原始音频数据进行分帧加窗预处理,得到H个音频数据段;H为大于1的正整数;
    分别对每个音频数据段进行时频变换,得到所述每个音频数据段对应的音频数据帧;
    在得到的H个音频数据帧中,确定所述目标音频数据帧以及所述目标音频数据帧之前的K个历史音频数据帧;K小于H。
  3. 根据权利要求2所述的方法,其中,所述H个音频数据段包括音频数据段i,i为小于或等于H的正整数;
    所述分别对每个音频数据段进行时频变换,得到所述每个音频数据段对应的音频数据帧,包括:
    对所述音频数据段i进行傅立叶变换,得到所述音频数据段i在频域中的直流分量频点和2S个频点;所述2S个频点包括与第一频点类型相关的S个频点和与第二频点类型相关的S个频点;S为正整数;
    基于与所述第一频点类型相关的S个频点和所述直流分量频点,确定所述音频数据段i对应的音频数据帧。
  4. 根据权利要求1所述的方法,其中,所述目标音频数据帧包含有S1个频点,所述S1个频点包括一个直流分量频点以及与一种频点类型相关的S2个频点,S1和S2均为正整数;所述获取到所述目标音频数据帧的N个目标倒频谱系数,包括:
    将所述S1个频点映射到N个声学频带上;S1大于或等于N;
    分别对每个声学频带进行倒谱处理,得到所述每个声学频带对应的目标倒频谱系数。
  5. 根据权利要求4所述的方法,其中,所述N个声学频带包括声学频带j,j为小于或等于N的正整数;
    所述分别对每个声学频带进行倒谱处理,得到所述每个声学频带对应的目标倒频谱系数,包括:
    获取所述声学频带j的频带能量,对所述声学频带j的频带能量进行对数变换,得到所述声学频带j的对数频带能量;
    对所述声学频带j的对数频带能量进行离散余弦变换,得到所述声学频带j对应的目标倒频谱系数。
  6. 根据权利要求1所述的方法,其中,所述基于所述N个目标倒频谱系数,获取与所述目标音频数据帧相关联的M个一阶时间导数和M个二阶时间导数,包括:
    对所述N个目标倒频谱系数进行差分运算,得到(N-1)个差分运算值,将所述(N-1)个差分运算值中 的每个差分运算值作为一个一阶时间导数,在(N-1)个一阶时间导数中获取与所述目标音频数据帧相关联的M个一阶时间导数;
    对所述(N-1)个一阶时间导数进行二次差分运算,得到(N-2)个差分运算值,将所述(N-2)个差分运算值中的每个差分运算值作为一个二阶时间导数,在(N-2)个二阶时间导数中获取与所述目标音频数据帧相关联的M个二阶时间导数。
  7. 根据权利要求1所述的方法,其中,所述获取所述每个历史音频数据帧对应的N个历史倒频谱系数,包括:
    在所述K个历史音频数据帧中,获取任意两个相邻的历史音频数据帧作为第一历史音频数据帧和第二历史音频数据帧;所述第二历史音频数据帧为在所述第一历史音频数据帧之后得到的频谱帧;
    在与所述目标音频数据帧相关的缓存中,获取所述第一历史音频数据帧对应的N个历史倒频谱系数以及所述第二历史音频数据帧对应的N个历史倒频谱系数。
  8. 根据权利要求7所述的方法,其中,所述基于获取到的K*N个历史倒频谱系数,确定与所述目标音频数据帧相关联的频谱动态特征,包括:
    将所述第一历史音频数据帧对应的N个历史倒频谱系数与所述第二历史音频数据帧对应的N个历史倒频谱系数之间的N个系数差异值,作为所述第一历史音频数据帧和所述第二历史音频数据帧之间的帧间差异值;
    基于所述K个历史音频数据帧中各个相邻的历史音频数据帧之间的K-1帧间差异值,确定与所述目标音频数据帧相关联的频谱动态特征。
  9. 根据权利要求1所述的方法,其中,所述目标掩码估计模型包括掩码估计网络层和掩码输出层;
    所述将所述N个目标倒频谱系数、所述M个一阶时间导数、所述M个二阶时间导数以及所述频谱动态特征输入至目标掩码估计模型,由所述目标掩码估计模型输出所述目标音频数据帧对应的目标掩码,包括:
    将所述N个目标倒频谱系数、所述M个一阶时间导数、所述M个二阶时间导数以及所述频谱动态特征作为所述目标音频数据帧的目标音频特征,将所述目标音频特征输入至所述掩码估计网络层,通过所述掩码估计网络层对所述目标音频特征进行掩码估计,得到所述目标音频特征对应的隐藏特征;
    将所述隐藏特征输入至所述掩码输出层,通过所述掩码输出层对所述隐藏特征进行特征合并,得到所述目标音频数据帧对应的目标掩码。
  10. 根据权利要求9所述的方法,其中,所述掩码估计网络层包括存在跳跃连接的第一掩码估计网络层、第二掩码估计网络层以及第三掩码估计网络层;
    所述将所述目标音频特征输入至所述掩码估计网络层,通过所述掩码估计网络层对所述目标音频特征进行掩码估计,得到所述目标音频特征对应的隐藏特征,包括:
    将所述目标音频特征输入至所述第一掩码估计网络层,通过所述第一掩码估计网络层输出第一中间特征;
    根据所述第一掩码估计网络层与所述第二掩码估计网络层之间的跳跃连接,对所述第一中间特征和所述目标音频特征进行特征拼接,得到第二中间特征,将所述第二中间特征输入至所述第二掩码估计网络层,通过所述第二掩码估计网络层输出第三中间特征;
    根据所述第一掩码估计网络层与所述第三掩码估计网络层之间的跳跃连接以及所述第二掩码估计网络层与所述第三掩码估计网络层之间的跳跃连接,对所述第三中间特征、所述目标音频特征以及所述第一中间特征进行特征拼接,得到第四中间特征;
    将所述第四中间特征输入至所述第三掩码估计网络层,通过所述第三掩码估计网络层输出所述目标音频特征对应的隐藏特征。
  11. 根据权利要求1所述的方法,还包括:
    对所述目标掩码进行插值处理,得到插值掩码;所述插值掩码的长度与所述目标音频数据帧的长度相同;
    将所述插值掩码与所述目标音频数据帧相乘,对相乘结果进行傅立叶逆变换,得到对所述目标音频数据帧进行噪声抑制后的目标音频数据;
    当对与所述原始音频数据相关联的每个音频数据帧均进行噪声抑制后,基于所述每个音频数据帧对应的目标音频数据,得到所述原始音频数据对应的增强音频数据。
  12. 一种音频数据处理方法,由计算机设备执行,包括:
    获取与样本音频数据相关联的目标样本音频数据帧和K个历史样本音频数据,且获取所述目标样本音频数据帧对应的样本掩码;所述目标样本音频数据帧和所述K个历史样本音频数据帧均为频谱帧,且所述K个历史样本音频数据帧中的每个历史样本音频数据帧均为所述目标样本音频数据帧之前的频谱帧,K为正整数;
    在获取到所述目标样本音频数据帧的N个目标样本倒频谱系数时,基于所述N个目标样本倒频谱系数,获取与所述目标样本音频数据帧相关联的M个样本一阶时间导数和M个样本二阶时间导数;N为大于1的正整数,M为小于N的正整数;
    获取所述每个历史样本音频数据帧分别对应的N个历史样本倒频谱系数,基于获取到的K*N个历史样本倒频谱系数确定与所述目标样本音频数据帧相关联的样本频谱动态特征;
    将所述N个目标样本倒频谱系数、所述M个样本一阶时间导数、所述M个样本二阶时间导数以及所述样本频谱动态特征输入至初始掩码估计模型,由所述初始掩码估计模型输出所述目标样本音频数据帧对应的预测掩码;及,
    基于所述预测掩码和所述样本掩码,对所述初始掩码估计模型进行迭代训练,得到目标掩码估计模型,所述目标掩码估计模型用于输出与原始音频数据相关联的目标音频数据帧所对应的目标掩码;所述目标掩码用于抑制所述原始音频数据中的噪声数据,以得到所述原始音频数据对应的增强音频数据。
  13. 一种音频数据处理装置,其特征在于,包括:
    第一获取模块,用于获取与原始音频数据相关联的目标音频数据帧和K个历史音频数据帧;所述目标音频数据帧和所述K个历史音频数据帧均为频谱帧,且所述K个历史音频数据帧中的每个历史音频数据帧均为所述目标音频数据帧之前的频谱帧,K为正整数;
    第二获取模块,用于在获取到所述目标音频数据帧的N个目标倒频谱系数时,基于所述N个目标倒频谱系数,获取与所述目标音频数据帧相关联的M个一阶时间导数和M个二阶时间导数;N为大于1的正整数,M为小于N的正整数;
    第三获取模块,用于获取所述每个历史音频数据帧对应的N个历史倒频谱系数,基于获取到的K*N个历史倒频谱系数,确定与所述目标音频数据帧相关联的频谱动态特征;及,
    掩码估计模块,用于将所述N个目标倒频谱系数、所述M个一阶时间导数、所述M个二阶时间导数以及所述频谱动态特征输入至目标掩码估计模型,由所述目标掩码估计模型输出所述目标音频数据帧对应的目标掩码;所述目标掩码用于抑制所述原始音频数据中的噪声数据,以得到所述原始音频数据对应的增强音频数据。
  14. 根据权利要求13所述的装置,其中,所述目标音频数据帧包含有S1个频点,所述S1个频点包括一个直流分量频点以及与一种频点类型相关的S2个频点,S1和S2均为正整数;所述装置还包括:
    频带映射模块,用于将S1个频点映射到N个声学频带上;S1大于或等于N;
    倒谱处理模块,用于分别对每个声学频带进行倒谱处理,得到所述每个声学频带对应的目标倒频谱系数。
  15. 根据权利要求13所述的装置,其中,所述目标掩码估计模型包括掩码估计网络层和掩码输出层;
    所述掩码估计模块包括:
    掩码估计单元,用于将所述N个目标倒频谱系数、所述M个一阶时间导数、所述M个二阶时间导数以及所述频谱动态特征作为所述目标音频数据帧的目标音频特征,将所述目标音频特征输入至所述掩码估计网络层,通过所述掩码估计网络层对所述目标音频特征进行掩码估计,得到所述目标音频特征对应的隐藏特征;
    掩码输出单元,用于将所述隐藏特征输入至所述掩码输出层,通过所述掩码输出层对所述隐藏特征进行特征合并,得到所述目标音频数据帧对应的目标掩码。
  16. 根据权利要求13所述的装置,还包括:
    噪声抑制模块,用于对所述目标掩码进行插值处理,得到插值掩码;所述插值掩码的长度与所述目标音频数据帧的长度相同;将所述插值掩码与所述目标音频数据帧相乘,对相乘结果进行傅立叶逆变换,得到对所述目标音频数据帧进行噪声抑制后的目标音频数据;基于对与所述原始音频数据相关联的每个音频数据帧进行噪声抑制后的目标音频数据,得到所述原始音频数据对应的增强音频数据。
  17. 一种音频数据处理装置,其特征在于,包括:
    第一获取模块,用于获取与样本音频数据相关联的目标样本音频数据帧和K个历史样本音频数据,且获取所述目标样本音频数据帧对应的样本掩码;所述目标样本音频数据帧和所述K个历史样本音频数据帧均为频谱帧,且所述K个历史样本音频数据帧中的每个历史样本音频数据帧均为所述目标样本音频数据帧之前的频谱帧,K为正整数;
    第二获取模块,用于在获取到所述目标样本音频数据帧的N个目标样本倒频谱系数时,基于所述N个目标样本倒频谱系数,获取与所述目标样本音频数据帧相关联的M个样本一阶时间导数和M个样本二阶时间导数;N为大于1的正整数,M为小于N的正整数;
    第三获取模块,用于获取所述每个历史样本音频数据帧分别对应的N个历史样本倒频谱系数,基于获取到的K*N个历史样本倒频谱系数确定与所述目标样本音频数据帧相关联的样本频谱动态特征;
    掩码预测模块,用于将所述N个目标样本倒频谱系数、所述M个样本一阶时间导数、所述M个样本二阶时间导数以及所述样本频谱动态特征输入至初始掩码估计模型,由所述初始掩码估计模型输出所述目标样本音频数据帧对应的预测掩码;及,
    模型训练模块,用于基于所述预测掩码和所述样本掩码,对所述初始掩码估计模型进行迭代训练,得到目标掩码估计模型,所述目标掩码估计模型用于输出与原始音频数据相关联的目标音频数据帧所对应的目标掩码;所述目标掩码用于抑制所述原始音频数据中的噪声数据,以得到所述原始音频数据对应的增强音频数据。
  18. 一种计算机设备,其特征在于,包括:处理器和存储器;
    所述处理器与所述存储器相连,其中,所述存储器用于存储计算机程序,所述处理器用于调用所述计算机程序,以使所述计算机设备执行权利要求1-12任一项所述的方法。
  19. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中存储有计算机程序,该计算机程序适于由处理器加载并执行,以使具有所述处理器的计算机设备执行权利要求1-12任一项所述的方法。
  20. 一种计算机程序产品,其特征在于,所述计算机程序产品包括计算机指令,所述计算机指令存储在计算机可读存储介质中,该计算机指令适于由处理器读取并执行,以使具有所述处理器的计算机设备执行权利要求1-12任一项所述的方法。
PCT/CN2023/108796 2022-09-13 2023-07-24 音频数据处理方法、装置、设备、存储介质及程序产品 WO2024055751A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211110666.3 2022-09-13
CN202211110666.3A CN117746874A (zh) 2022-09-13 2022-09-13 一种音频数据处理方法、装置以及可读存储介质

Publications (1)

Publication Number Publication Date
WO2024055751A1 true WO2024055751A1 (zh) 2024-03-21

Family

ID=90274176

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/108796 WO2024055751A1 (zh) 2022-09-13 2023-07-24 音频数据处理方法、装置、设备、存储介质及程序产品

Country Status (2)

Country Link
CN (1) CN117746874A (zh)
WO (1) WO2024055751A1 (zh)

Citations (7)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1737906A * (zh) 2004-03-23 2006-02-22 哈曼贝克自动系统-威美科公司 利用中枢网络分离语音信号
JP2006349840A (ja) * 2005-06-14 2006-12-28 Mitsubishi Electric Corp マスク作成装置、雑音スペクトル推定装置及び音声認識装置
US20190318755A1 (en) * 2018-04-13 2019-10-17 Microsoft Technology Licensing, Llc Systems, methods, and computer-readable media for improved real-time audio processing
CN113823313A (zh) * 2021-07-12 2021-12-21 腾讯科技(深圳)有限公司 语音处理方法、装置、设备以及存储介质
CN113870893A (zh) 2021-09-27 2021-12-31 中国科学院声学研究所 一种多通道双说话人分离方法及系统
CN113963715A (zh) * 2021-11-09 2022-01-21 清华大学 语音信号的分离方法、装置、电子设备及存储介质
CN114203163A (zh) * 2022-02-16 2022-03-18 荣耀终端有限公司 音频信号处理方法及装置

Also Published As

Publication number Publication date
CN117746874A (zh) 2024-03-22

Similar Documents

Publication Publication Date Title
WO2021196905A1 (zh) 语音信号去混响处理方法、装置、计算机设备和存储介质
Xu et al. Listening to sounds of silence for speech denoising
CN112102846B (zh) 音频处理方法、装置、电子设备以及存储介质
Zhang et al. Sensing to hear: Speech enhancement for mobile devices using acoustic signals
CN114203163A (zh) 音频信号处理方法及装置
WO2022166710A1 (zh) 语音增强方法、装置、设备及存储介质
Shankar et al. Efficient two-microphone speech enhancement using basic recurrent neural network cell for hearing and hearing aids
CN111883107A (zh) 语音合成、特征提取模型训练方法、装置、介质及设备
WO2023216760A1 (zh) 语音处理方法、装置、存储介质、计算机设备及程序产品
CN111883135A (zh) 语音转写方法、装置和电子设备
JP2024507916A (ja) オーディオ信号の処理方法、装置、電子機器、及びコンピュータプログラム
CN114898762A (zh) 基于目标人的实时语音降噪方法、装置和电子设备
CN115602165A (zh) 基于金融***的数字员工智能***
US11996114B2 (en) End-to-end time-domain multitask learning for ML-based speech enhancement
CN114783459A (zh) 一种语音分离方法、装置、电子设备和存储介质
CN114333893A (zh) 一种语音处理方法、装置、电子设备和可读介质
WO2024027295A1 (zh) 语音增强模型的训练、增强方法、装置、电子设备、存储介质及程序产品
WO2024055751A1 (zh) 音频数据处理方法、装置、设备、存储介质及程序产品
CN116403594A (zh) 基于噪声更新因子的语音增强方法和装置
WO2022166738A1 (zh) 语音增强方法、装置、设备及存储介质
CN114333891A (zh) 一种语音处理方法、装置、电子设备和可读介质
CN112750456A (zh) 即时通信应用中的语音数据处理方法、装置及电子设备
Lee et al. Speech Enhancement Using Phase‐Dependent A Priori SNR Estimator in Log‐Mel Spectral Domain
WO2024082928A1 (zh) 语音处理方法、装置、设备和介质
Xiang et al. A two-stage deep representation learning-based speech enhancement method using variational autoencoder and adversarial training

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23864485

Country of ref document: EP

Kind code of ref document: A1