WO2024041512A1

WO2024041512A1 - Audio noise reduction method and apparatus, and electronic device and readable storage medium

Info

Publication number: WO2024041512A1
Application number: PCT/CN2023/114193
Authority: WO
Inventors: 王少华
Original assignee: 维沃移动通信有限公司
Priority date: 2022-08-25
Filing date: 2023-08-22
Publication date: 2024-02-29
Also published as: CN115995234A

Abstract

An audio noise reduction method and apparatus, and an electronic device and a readable storage medium. The method comprises: calculating a target long-time signal-to-noise ratio and a target long-time stability index, which correspond to a target audio signal, wherein the target long-time stability index is used for indicating the stability level of noise in the target audio signal (101); according to the target long-time signal-to-noise ratio and the target long-time stability index, determining a target acoustic scene corresponding to the target audio signal (102); and performing noise reduction processing on the target audio signal on the basis of the target acoustic scene (103).

Description

Audio noise reduction method and apparatus, electronic device, and readable storage medium

Cross-reference to related applications

This application claims priority to Chinese patent application No. 202211028582.5, filed in China on August 25, 2022, the entire contents of which are incorporated herein by reference.

Technical field

This application belongs to the field of audio processing technology, and specifically relates to an audio noise reduction method and apparatus, an electronic device, and a readable storage medium.

Background

Acoustic scene classification is widely used in daily life. It refers to the process of analyzing the acoustic content contained in audio in order to identify the acoustic scene to which that audio corresponds.

In the related art, acoustic scene classification is mainly implemented in one of two ways. Method 1 is the traditional approach: the signal characteristics of a specific scene are observed, and corresponding features are extracted to classify the acoustic scene. Method 2 is based on a deep learning model: speech features such as Mel-frequency cepstral coefficients, the log-magnitude spectrum, and the phase spectrum are extracted from the input speech signal; a suitable deep classification model is selected according to the extracted features and trained by supervised learning; and the trained model is then used to classify the acoustic scene of the audio.

However, the traditional acoustic scene classification method only picks out special acoustic scenes, while the method based on a deep learning model is too complex and difficult to deploy in line with actual noise reduction requirements. As a result, the acoustic scene classification methods in the related art have poor generality and practicality.

Summary of the invention

The purpose of the embodiments of this application is to provide an audio noise reduction method and apparatus, an electronic device, and a readable storage medium that can solve the problem of poor generality and practicality of audio noise reduction methods in the related art.

In a first aspect, an embodiment of this application provides an audio noise reduction method. The method includes: calculating a target long-term signal-to-noise ratio and a target long-term stationarity index corresponding to a target audio signal, where the target long-term stationarity index indicates how stationary the noise in the target audio signal is; determining, according to the target long-term signal-to-noise ratio and the target long-term stationarity index, a target acoustic scene corresponding to the target audio signal; and performing noise reduction on the target audio signal based on the target acoustic scene.

In a second aspect, an embodiment of this application provides an audio noise reduction apparatus that includes a processing module and a determination module. The processing module is configured to calculate a target long-term signal-to-noise ratio and a target long-term stationarity index corresponding to a target audio signal, where the target long-term stationarity index indicates how stationary the noise in the target audio signal is. The determination module is configured to determine, according to the target long-term signal-to-noise ratio and the target long-term stationarity index calculated by the processing module, a target acoustic scene corresponding to the target audio signal. The processing module is further configured to perform noise reduction on the target audio signal based on the target acoustic scene determined by the determination module.

In a third aspect, an embodiment of this application provides an electronic device including a processor and a memory. The memory stores a program or instructions executable on the processor, and the program or instructions, when executed by the processor, implement the steps of the method described in the first aspect.

In a fourth aspect, an embodiment of this application provides a readable storage medium storing a program or instructions that, when executed by a processor, implement the steps of the method described in the first aspect.

In a fifth aspect, an embodiment of this application provides a chip. The chip includes a processor and a communication interface coupled to the processor, and the processor is configured to run a program or instructions to implement the method described in the first aspect.

In a sixth aspect, an embodiment of this application provides a computer program product stored in a storage medium, the program product being executed by at least one processor to implement the method described in the first aspect.

In the embodiments of this application, a target long-term signal-to-noise ratio and a target long-term stationarity index corresponding to a target audio signal are calculated, where the target long-term stationarity index indicates how stationary the noise in the target audio signal is; a target acoustic scene corresponding to the target audio signal is determined according to the target long-term signal-to-noise ratio and the target long-term stationarity index; and noise reduction is performed on the target audio signal based on the target acoustic scene. Because the long-term signal-to-noise ratio and the stationarity index of an audio signal are two essential characteristics of the noise in that signal, the target acoustic scene can be determined more accurately and quickly from these two quantities. This improves the accuracy of noise reduction performed on the target audio based on the target acoustic scene, and gives the noise reduction method better generality and practicality.

Brief description of the drawings

Figure 1 is a first schematic flowchart of the audio noise reduction method provided by an embodiment of this application;

Figure 2 is a second schematic flowchart of the audio noise reduction method provided by an embodiment of this application;

Figure 3 is a schematic structural diagram of the audio noise reduction apparatus provided by an embodiment of this application;

Figure 4 is a first schematic structural diagram of the electronic device provided by an embodiment of this application;

Figure 5 is a second schematic structural diagram of the electronic device provided by an embodiment of this application.

Detailed description

The technical solutions in the embodiments of this application are described clearly below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this application fall within the scope of protection of this application.

The terms "first", "second", and the like in the description and claims of this application are used to distinguish similar objects, not to describe a specific order or sequence. It should be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments of this application can be implemented in orders other than those illustrated or described here. Objects distinguished by "first", "second", and the like are usually of one type, and their number is not limited; for example, there may be one first object or several. In addition, "and/or" in the description and claims denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the objects before and after it.

For users of electronic devices, call sound quality is a very important indicator of device performance. To improve sound quality, noise reduction can be applied to the speech signal during a call.

At present, the acoustic scene to which a speech signal belongs can be classified, so that noise reduction targeted at that scene can be applied to the speech signal.

In the related art, acoustic scene classification mainly relies on either the traditional audio noise reduction method or a scene classification method based on a deep learning model.

Specifically, the traditional audio noise reduction method observes the signal characteristics of a specific scene, extracts corresponding features to classify the acoustic scene, and then applies the corresponding noise suppression. Examples include detection of wind noise scenes, of keyboard sound scenes, and of mobile phone motor vibration scenes. The traditional acoustic scene classification method can therefore only classify special acoustic scenes.

The general framework of a scene classification method based on a deep learning model has two steps. The first step extracts speech features from the input speech signal, such as Mel-frequency cepstral coefficients, the log-magnitude spectrum, and the phase spectrum. The second step selects a suitable deep classification model according to the extracted features, trains it by supervised learning, and then uses the trained model to classify the acoustic scene of the audio.

However, scene classification methods based on deep learning have the following drawbacks: (1) the network is generally large, which makes real-time deployment difficult in scenarios with strict power consumption requirements; (2) labeling noise scenes one by one is a considerable amount of work; (3) the scene classes are too fine-grained to be useful in practice, for example dividing scenes into subway, bus, cafe, canteen, car, airport, and so on.

It follows from the above that the traditional acoustic scene classification method only picks out special scenes, while the deep-learning-based method is too complex and difficult to deploy in combination with actual noise reduction. As a result, the acoustic scene classification methods in the related art have poor generality and practicality.

The audio noise reduction method provided by the embodiments of this application is built on the mainstream noise reduction algorithm framework. It determines the acoustic scene to which an audio signal belongs from the long-term signal-to-noise ratio and the stationarity index of that signal, and therefore has better generality and practicality.

Common acoustic scene classification either divides scenes into an overly diverse set of classes, such as subway, bus, cafe, canteen, car, and airport, or simply performs a binary classification that picks out a special scene such as wind noise. In contrast, the embodiments of this application start from two essential characteristics of an acoustic scene, namely the long-term signal-to-noise ratio and a measure of noise stationarity, and classify acoustic scenes by combinations of the long-term signal-to-noise ratio and the degree of noise stationarity. Acoustic scenes are divided into a first acoustic scene, a second acoustic scene, a third acoustic scene, and a fourth acoustic scene.

In the first acoustic scene, the long-term signal-to-noise ratio of the audio signal is greater than or equal to the signal-to-noise ratio threshold, and the long-term stationarity index is greater than or equal to the stationarity index threshold. In the second acoustic scene, the long-term signal-to-noise ratio is greater than or equal to the signal-to-noise ratio threshold, and the long-term stationarity index is less than the stationarity index threshold. In the third acoustic scene, the long-term signal-to-noise ratio is less than the signal-to-noise ratio threshold, and the long-term stationarity index is greater than or equal to the stationarity index threshold. In the fourth acoustic scene, the long-term signal-to-noise ratio is less than the signal-to-noise ratio threshold, and the long-term stationarity index is less than the stationarity index threshold. Because acoustic scenes can be divided into these four classes on the basis of two essential characteristics, the audio noise reduction method provided by the embodiments of this application is practical and general.
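The four-way split above can be sketched as a pair of threshold comparisons. This is a minimal illustration, not the patented implementation: the names `AcousticScene` and `classify_scene` are hypothetical, and the threshold values are supplied by the caller because the text does not give them.

```python
from enum import Enum


class AcousticScene(Enum):
    FIRST = 1   # SNR >= SNR threshold, index >= stationarity threshold
    SECOND = 2  # SNR >= SNR threshold, index <  stationarity threshold
    THIRD = 3   # SNR <  SNR threshold, index >= stationarity threshold
    FOURTH = 4  # SNR <  SNR threshold, index <  stationarity threshold


def classify_scene(long_term_snr, long_term_stationarity,
                   snr_threshold, stationarity_threshold):
    """Map the two long-term features to one of the four acoustic scenes."""
    snr_ok = long_term_snr >= snr_threshold
    index_ok = long_term_stationarity >= stationarity_threshold
    if snr_ok:
        return AcousticScene.FIRST if index_ok else AcousticScene.SECOND
    return AcousticScene.THIRD if index_ok else AcousticScene.FOURTH
```

Both comparisons use "greater than or equal to" on the upper side, matching the scene definitions, so a value exactly at a threshold falls into the first or third scene rather than the second or fourth.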

Optionally, once the acoustic scene of the audio signal has been determined, noise suppression is performed on the audio signal using the noise suppression strategy corresponding to that scene. This accommodates users' different noise reduction needs in different acoustic scenes. In a noisy environment, for example, users want the noise to be suppressed more strongly. In a scene with a high signal-to-noise ratio, users want the original speech quality to be preserved as far as possible and do not want much noise reduction. In a non-stationary scene, users want sudden noise to be suppressed effectively. In a stationary scene, users want the noise suppression to sound more natural.

The audio noise reduction method and apparatus, electronic device, and readable storage medium provided by the embodiments of this application are described in detail below with reference to the accompanying drawings, through specific embodiments and their application scenarios.

An embodiment of this application provides an audio noise reduction method. Figure 1 shows a possible flowchart of the method. As shown in Figure 1, the method may include steps 101 to 103 below. The description takes an electronic device performing the method as an example.

Step 101: The electronic device calculates the target long-term signal-to-noise ratio and the target long-term stationarity index corresponding to the target audio signal.

The target long-term stationarity index can be used to indicate how stationary the noise in the target audio signal is. The target long-term signal-to-noise ratio characterizes the relative signal-to-noise ratio of the target audio signal; in other words, it characterizes the noise level of the target audio signal relative to the audio signal over a period of time.

Optionally, the target audio signal may be one or more frames of audio signal captured by the electronic device, or one or more frames of audio signal in an audio file, as determined by actual usage requirements.

For example, take the case in which the target audio signal is captured by the electronic device. During a call, the electronic device can divide the captured audio signal into frames. In real-time processing, the speech signal captured by the microphone of the electronic device is sent to the device's digital processing chip in real time, for example 10 ms of audio at a time. A speech signal is stationary over short intervals (it can be treated as approximately stationary within about 30 ms) and non-stationary over long intervals, so the speech signal within a certain duration can be treated as one frame of audio signal; for example, every 30 ms of speech signal forms one frame, that is, one processing frame. Specifically, the digital processing chip reads 10 ms of audio at a time, merges it with the previously buffered audio and, once about 30 ms of audio has accumulated, analyzes and processes that audio (one frame of audio signal) once.
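The buffering described above can be sketched as follows. This is an illustration under stated assumptions: the class name `FrameAssembler` and the 16 kHz sample rate are hypothetical, and the sketch assumes consecutive 30 ms frames do not overlap, which the text does not specify.

```python
class FrameAssembler:
    """Collects fixed-size chunks until one processing frame is available."""

    def __init__(self, sample_rate_hz=16000, chunk_ms=10, frame_ms=30):
        # The sample rate is an assumed value; 10 ms and 30 ms follow the text.
        self.chunk_len = sample_rate_hz * chunk_ms // 1000
        self.frame_len = sample_rate_hz * frame_ms // 1000
        self._buffer = []

    def push(self, chunk):
        """Add one chunk; return a full frame when ready, otherwise None."""
        if len(chunk) != self.chunk_len:
            raise ValueError("unexpected chunk length")
        self._buffer.extend(chunk)
        if len(self._buffer) < self.frame_len:
            return None
        frame = self._buffer[:self.frame_len]
        # Assumption: frames do not overlap, so consumed samples are dropped.
        self._buffer = self._buffer[self.frame_len:]
        return frame
```

With these defaults, every third call to `push` returns a 480-sample frame. A sliding-window variant would keep the most recent samples in the buffer instead of dropping the whole frame.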

For ease of description, unless otherwise stated, the following embodiments take as an example the case in which the target audio signal is an audio signal captured by the electronic device.

Optionally, "the electronic device calculates the target long-term signal-to-noise ratio corresponding to the target audio signal" can be implemented through steps A and B below.

Step A: The electronic device determines N first instantaneous signal-to-noise ratios based on the instantaneous signal-to-noise ratios of N groups of historical audio signals.

Each of the N groups of historical audio signals may include M historical audio signals, and the N first instantaneous signal-to-noise ratios correspond one-to-one to the N groups of historical audio signals. M and N are both positive integers.

For example, M may be an integer greater than 1, and N may be any integer from 5 to 10.

In the embodiments of this application, the electronic device can divide every M adjacent captured frames of audio signal into one group: frames 1 to M form group 1, frames M+1 to 2M form group 2, and so on.

Optionally, the N groups of historical audio signals may be the N most recently captured groups. Suppose that before capturing the target audio signal the electronic device has already captured Q frames of audio signal, where Q = W1*M + W2, W1 is an integer greater than or equal to N, and W2 is a positive integer less than M. The electronic device has then captured W1 groups of historical audio signals, and "the N groups of historical audio signals" are the N most recently captured groups among those W1 groups.

For example, suppose N is 10 and that, before capturing the target audio signal, the electronic device has captured 20 groups of historical audio signals (W1 = 20), which in chronological order are group 1 to group 20. The N groups of historical audio signals are then groups 11 to 20.
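The grouping and the selection of the latest N groups can be sketched with a little index arithmetic. The helper name `latest_group_ranges` is hypothetical, and the value M = 8 in the usage note is an arbitrary illustration, since the text only requires M to be a positive integer.

```python
def latest_group_ranges(q, m, n):
    """Return 1-based (first_frame, last_frame) ranges of the latest n complete groups.

    q is the number of frames captured so far, m the number of frames per group.
    """
    w1 = q // m  # number of complete groups (W1 in the text)
    if w1 < n:
        raise ValueError("fewer than n complete groups are available")
    # Group g covers frames (g - 1) * m + 1 through g * m.
    return [((g - 1) * m + 1, g * m) for g in range(w1 - n + 1, w1 + 1)]
```

With Q = 163 frames and M = 8 there are W1 = 20 complete groups and W2 = 3 leftover frames; asking for N = 10 yields groups 11 to 20, that is, frames 81 to 160.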

In the embodiments of this application, "the instantaneous signal-to-noise ratios of the N groups of historical audio signals" may include the instantaneous signal-to-noise ratios of the M frames of historical audio signal in each of the N groups, that is, the instantaneous signal-to-noise ratios of M*N frames of historical audio signal, M*N values in total.

Optionally, in one possible implementation, for each of the N groups of historical audio signals, the electronic device can determine the first instantaneous signal-to-noise ratio corresponding to that group through step 1 below. After performing step 1 N times, the electronic device obtains N first instantaneous signal-to-noise ratios.

Step 1: The electronic device determines the maximum among the instantaneous signal-to-noise ratios of each group of historical audio signals, and takes that maximum as the first instantaneous signal-to-noise ratio corresponding to the group.

In the embodiments of this application, the electronic device can determine one first instantaneous signal-to-noise ratio for every M frames of audio signal it captures. Specifically, suppose the most recently captured group of historical audio signals is the T-th group, where T is a positive integer. The electronic device can determine the first instantaneous signal-to-noise ratio snr_M(T) corresponding to the T-th group in either of the two ways described below.

In the first way, the first instantaneous signal-to-noise ratio snr_M(T) corresponding to the T-th group of historical audio signals can be expressed by formula (1):

$$snr_{M}(T) = \max\{\, snr(f-M+1),\ snr(f-M+2),\ \ldots,\ snr(f) \,\} \tag{1}$$

In formula (1), f is a multiple of M, and snr(f) is the instantaneous signal-to-noise ratio (specifically, the smoothed instantaneous signal-to-noise ratio) of the f-th frame of historical audio signal in the T-th group.

As formula (1) shows, in the first way the electronic device needs to store the instantaneous signal-to-noise ratio of every frame in a group before it can determine the group's maximum instantaneous signal-to-noise ratio.

In the second way, while capturing the T-th group of historical audio signals, the electronic device compares the instantaneous signal-to-noise ratio of the j-th frame in the group with that of the (j-1)-th frame, discards the smaller value, and keeps the larger one. It then compares the retained value with the instantaneous signal-to-noise ratio of the (j+1)-th frame, and so on, until the most recently retained value has been compared with the instantaneous signal-to-noise ratio of the M-th frame in the group. The larger of those two values is the maximum instantaneous signal-to-noise ratio of the T-th group of historical audio signals.

In the second way, because the instantaneous signal-to-noise ratios of adjacent frames within a group are compared as they arrive and only the larger value is kept, fewer instantaneous signal-to-noise ratios need to be buffered.
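The second way amounts to a running maximum that holds a single value per group instead of M values. A minimal sketch follows; the class name `GroupMaxTracker` is hypothetical.

```python
class GroupMaxTracker:
    """Keeps only the largest instantaneous SNR seen in the current group."""

    def __init__(self, m):
        self.m = m          # frames per group (M in the text)
        self._count = 0     # frames seen in the current group
        self._best = None   # largest instantaneous SNR retained so far

    def push(self, inst_snr):
        """Feed one frame's SNR; return the group maximum after the M-th frame."""
        if self._best is None or inst_snr > self._best:
            self._best = inst_snr  # the smaller value is discarded
        self._count += 1
        if self._count < self.m:
            return None
        result = self._best
        self._count, self._best = 0, None  # start the next group
        return result
```

The result equals `max()` over the stored list used in the first way, while only one value is buffered at any time.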

It can be understood that the electronic device updates snr_M(T) once for every M frames of audio signal it captures; that is, each M captured frames yield one first instantaneous signal-to-noise ratio.

Optionally, if the N first instantaneous signal-to-noise ratios form an N-dimensional array snr_matrix, they can be expressed by formula (2):

$$snr_{matrix} = [\, snr_{M}(T-N+1),\ snr_{M}(T-N+2),\ \ldots,\ snr_{M}(T) \,] \tag{2}$$

In formula (2), snr_M(T) is the first instantaneous signal-to-noise ratio corresponding to the T-th group of historical audio signals.

Optionally, the electronic device can construct an N-dimensional array that stores the first instantaneous signal-to-noise ratios corresponding to the most recent N groups of historical speech signals. Each time the electronic device determines a new first instantaneous signal-to-noise ratio, it updates the array once: it removes the oldest entry (the first element of the array) and appends the most recently determined first instantaneous signal-to-noise ratio. The electronic device can then use the N first instantaneous signal-to-noise ratios in the array directly to determine the target long-term signal-to-noise ratio of the current frame of speech signal.
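The first-in, first-out update of the N-dimensional array maps naturally onto a bounded double-ended queue. The sketch below uses Python's `collections.deque` with `maxlen`; the class name `SnrHistory` is hypothetical.

```python
from collections import deque


class SnrHistory:
    """Holds the first instantaneous SNRs of the most recent N groups."""

    def __init__(self, n):
        self._values = deque(maxlen=n)

    def update(self, first_inst_snr):
        # With maxlen set, appending to a full deque discards the oldest entry.
        self._values.append(first_inst_snr)

    def is_full(self):
        return len(self._values) == self._values.maxlen

    def snapshot(self):
        """Return the stored values, oldest first."""
        return list(self._values)
```

After N updates the history is full, and each further update drops the oldest value, so the stored values always cover groups T-N+1 to T.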

The embodiment above takes as an example the case in which the electronic device uses the maximum of the M instantaneous signal-to-noise ratios of each group of historical audio signals as the first instantaneous signal-to-noise ratio corresponding to that group. In practice, the second-largest value or the average of the M instantaneous signal-to-noise ratios of each group can be used instead.
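The three choices of per-group statistic can be sketched behind one helper. The function name `first_instantaneous_snr` and the `mode` strings are hypothetical, and the fallback for a single-frame group is an assumption of this sketch rather than something the text specifies.

```python
import statistics


def first_instantaneous_snr(group_snrs, mode="max"):
    """Reduce one group's instantaneous SNRs to a single representative value."""
    if not group_snrs:
        raise ValueError("group must contain at least one value")
    if mode == "max":
        return max(group_snrs)
    if mode == "second_max":
        # Second element from the top in sorted order; with a single frame
        # there is no second value, so that value is returned (assumption).
        return sorted(group_snrs)[-2] if len(group_snrs) > 1 else group_snrs[0]
    if mode == "mean":
        return statistics.fmean(group_snrs)
    raise ValueError(f"unknown mode: {mode}")
```

Only the maximum can be computed with the single-value running comparison; the second-largest value needs two retained values, and the mean needs a running sum and a count.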

In this way, the maximum instantaneous signal-to-noise ratio of each group of historical audio signals characterizes the relative strength of the speech signal and the noise signal in that group, and the N first instantaneous signal-to-noise ratios are the maximum instantaneous signal-to-noise ratios of the N groups of historical audio signals. The N first instantaneous signal-to-noise ratios can therefore accurately characterize the quality of the speech signal in the audio sequence to which the target audio signal belongs.

The following takes the determination of the instantaneous signal-to-noise ratio of the target audio signal as an example to describe how the electronic device determines the instantaneous signal-to-noise ratio of an audio signal.

Specifically, the electronic device can determine the instantaneous signal-to-noise ratio of the target audio signal through the following steps i to iv.

Step i. The electronic device first performs a fast Fourier transform (FFT) on the target audio signal, that is, transforms the target audio signal into the frequency domain, to obtain the target time-frequency signal X(t,k) of the target audio signal, where t denotes the time frame of the target audio signal and k denotes the k-th frequency point in the target audio signal.

Step ii. The electronic device determines the total signal energy Esignal(t) of the target audio signal based on the target time-frequency signal X(t,k). Esignal(t) can be expressed by the following formula (3):

In formula (3), B is the number of frequency points in the target audio signal, t denotes the time frame of the target time-frequency signal, and k denotes the k-th frequency point in the target time-frequency signal X(t,k).

Step iii. The electronic device calculates the noise signal Noise(t,k) of the target audio signal based on the target time-frequency signal X(t,k) and, based on Noise(t,k), determines the total noise energy Enoise(t) of the target audio signal, where k denotes the k-th frequency point in the noise signal Noise(t,k).

For the specific method by which the electronic device determines the noise signal Noise(t,k) and the total noise energy Enoise(t), refer to the descriptions in the related art. For example, the electronic device may determine the noise signal Noise(t,k) of the target audio signal using a recursive averaging algorithm based on the signal presence probability.

Step iv. The electronic device determines the instantaneous signal-to-noise ratio snr_c(t) of the target audio signal based on the noise signal Noise(t,k) of the target audio signal and the total signal energy Esignal(t) of the target audio signal.

The instantaneous signal-to-noise ratio snr_c(t) of the target audio signal can be expressed by the following formula (4):

At this point, the electronic device has obtained the instantaneous signal-to-noise ratio of the t-th frame audio signal.
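The energy and signal-to-noise ratio steps can be sketched as follows. Formulas (3) and (4) are not reproduced in this text, so the sketch assumes the common forms: total energy as the sum of squared magnitudes over the B frequency points, and the instantaneous signal-to-noise ratio as 10*log10(Esignal(t)/Enoise(t)) in dB. The function names, the `eps` guard and the sample spectra are hypothetical.

```python
import math


def total_energy(spectrum):
    """Assumed form of formula (3): sum of |X(t,k)|^2 over all frequency points."""
    return sum(abs(x) ** 2 for x in spectrum)


def instantaneous_snr_db(signal_spectrum, noise_spectrum, eps=1e-12):
    """Assumed form of formula (4): 10*log10(Esignal(t) / Enoise(t)).

    eps is a hypothetical guard against a zero energy, not part of the source.
    """
    e_signal = total_energy(signal_spectrum)
    e_noise = total_energy(noise_spectrum)
    return 10.0 * math.log10((e_signal + eps) / (e_noise + eps))


# Hypothetical one-frame spectra with B = 4 frequency points.
X = [2 + 0j, 0 + 2j, -2 + 0j, 0 - 2j]   # total energy 16
noise = [0.2 + 0j] * 4                  # total energy 0.16

snr_c = instantaneous_snr_db(X, noise)
```

With these sample values the energy ratio is 100, so `snr_c` is approximately 20 dB.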

Further, the electronic device may smooth the instantaneous signal-to-noise ratio snr_c(t) of the target audio signal to obtain the final instantaneous signal-to-noise ratio of the target audio signal, which can be expressed by the following formula (5):

In formula (5), α is the smoothing factor, and the previous-frame term is the final instantaneous signal-to-noise ratio of the (t-1)-th frame audio signal (that is, the frame preceding the target audio signal). For example, the value range of α may be 0 to 0.3.
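Formula (5) is not reproduced in this text. Judging from the description of its terms and from the recursive form of formulas (6) and (8), one plausible reading is first-order recursive smoothing. The overlined symbol below is a placeholder for the final instantaneous signal-to-noise ratio and is an assumption, not notation taken from the application.

```latex
\overline{snr}_c(t) = (1-\alpha)\,\overline{snr}_c(t-1) + \alpha\, snr_c(t),
\qquad \alpha \in [0,\ 0.3]
```

Under this reading, a smaller α gives more weight to the previous frame and therefore a smoother, slower-reacting estimate.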

It can be understood that the instantaneous signal-to-noise ratios of each group of historical audio signals may specifically be the final instantaneous signal-to-noise ratios of the M frames of audio signals in that group.

Step B. The electronic device determines the target long-term signal-to-noise ratio based on the N first instantaneous signal-to-noise ratios.

Optionally, the electronic device may determine the average of the N first instantaneous signal-to-noise ratios and determine that average as the target long-term signal-to-noise ratio.

Specifically, after determining the average of the N first instantaneous signal-to-noise ratios, the electronic device may first smooth the average and then determine the smoothed value as the target long-term signal-to-noise ratio. The target long-term signal-to-noise ratio snr_l(t) can be expressed by the following formula (6):
snr_l(t) = (1-μ)*snr_l(t-1) + μ*mean(snr_matrix(T)) (6);

In formula (6), snr_l(t-1) is the long-term signal-to-noise ratio of the frame of speech signal preceding the current frame, snr_matrix(T) denotes the N first instantaneous signal-to-noise ratios, and μ is the smoothing factor, whose value range is 0 to 0.1.
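Formula (6) translates directly into a few lines of code. The sketch below is illustrative only; the function name, the default `mu=0.05` and the sample values are hypothetical choices within the stated range.

```python
def long_term_snr(prev_long_term_snr, snr_matrix, mu=0.05):
    """Formula (6): snr_l(t) = (1 - mu) * snr_l(t-1) + mu * mean(snr_matrix(T))."""
    mean_snr = sum(snr_matrix) / len(snr_matrix)
    return (1.0 - mu) * prev_long_term_snr + mu * mean_snr


# Hypothetical values: previous long-term SNR of 10 dB, three per-group SNRs.
snr_l = long_term_snr(prev_long_term_snr=10.0, snr_matrix=[18.0, 20.0, 22.0], mu=0.05)
```

Here the mean of the array is 20 dB, so the estimate moves from 10 dB to about 10.5 dB in one update, which shows how slowly the long-term value follows the array when μ is small.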

Of course, the electronic device may also use any other feasible method to determine the target long-term signal-to-noise ratio based on the N first instantaneous signal-to-noise ratios. For example, the electronic device may determine the square root of the signal-to-noise ratios in the first signal-to-noise ratio set as the target long-term signal-to-noise ratio.

Optionally, the electronic device determines the target long-term signal-to-noise ratio based on the N first instantaneous signal-to-noise ratios and a second instantaneous signal-to-noise ratio, where the second instantaneous signal-to-noise ratio may be the instantaneous signal-to-noise ratio of the target audio signal. Specifically, the electronic device may determine the average of the N first instantaneous signal-to-noise ratios and the second instantaneous signal-to-noise ratio, smooth that average, and determine the smoothed average as the target long-term signal-to-noise ratio.

In this way, because the electronic device determines the target long-term signal-to-noise ratio corresponding to the target audio signal based on the N first instantaneous signal-to-noise ratios, the target long-term signal-to-noise ratio can better characterize the relative stationarity of the noise in the target audio signal, and the acoustic scene corresponding to the target audio signal can be determined more accurately.

Optionally, "the electronic device estimates the target long-term stationarity index corresponding to the target audio signal" may be specifically implemented through the following steps C and D.

Step C. The electronic device determines the signal energy difference between the first signal energy and the second signal energy.

Step D. The electronic device smooths the signal energy difference to obtain the target long-term stationarity index.

Here, the first signal energy is the signal energy after stationary noise reduction processing is performed on the target audio signal, and the second signal energy is the signal energy after deep learning noise reduction processing is performed on the target audio signal.

Optionally, the signal energy difference M_st(t) in step C can be expressed by the following formula (7):
M_st(t) = max(10*log10(Es(t)) - 10*log10(Et(t)), 0) (7);

In formula (7), Es(t) denotes the first signal energy and Et(t) denotes the second signal energy.

It should be noted that, for stationary noise, both stationary noise reduction processing (also called traditional signal processing) and deep learning noise reduction processing suppress the noise well. The difference between the two noise reduction methods is therefore small for stationary noise, and M_st(t) is close to 0. For non-stationary noise, stationary noise reduction processing has weak suppression capability, whereas deep learning noise reduction processing has strong suppression capability. In that case the first signal energy is greater than the second signal energy, so M_st(t) is larger, and its value depends on the difference in noise energy suppression between the two methods.

Normally, if the target audio signal is a stationary speech signal, M_st(t) is generally close to 0; if the target audio signal is a non-stationary speech signal, M_st(t) may be several decibels (dB).
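Formula (7) and the behaviour just described can be illustrated with a short sketch. The function name and the energy values are hypothetical, and the energies are assumed to be strictly positive so that the logarithms are defined.

```python
import math


def energy_difference_db(es, et):
    """Formula (7): M_st(t) = max(10*log10(Es(t)) - 10*log10(Et(t)), 0).

    Both energies are assumed to be strictly positive.
    """
    return max(10.0 * math.log10(es) - 10.0 * math.log10(et), 0.0)


# Stationary noise: both methods leave the same energy, so the difference is 0 dB.
m_stationary = energy_difference_db(es=1.0, et=1.0)

# Non-stationary noise: deep learning processing removes more, so Es > Et.
m_non_stationary = energy_difference_db(es=2.0, et=1.0)

# If Es < Et, the max(..., 0) clamp keeps the measure at 0 dB.
m_clamped = energy_difference_db(es=0.5, et=1.0)
```

In the second case the first signal energy is twice the second, which corresponds to roughly 3 dB, in line with the "several decibels" mentioned above.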

In this embodiment of the present application, because M_st(t) represents the instantaneous stationarity of the target audio signal, the electronic device can smooth M_st(t) to obtain the target long-term stationarity index. This index indicates how stationary the noise is over a period of time, that is, the relative stationarity of the noise in the target audio signal.

The electronic device can smooth M_st(t) through the following formula (8):
MS_stp(t) = (1-β)*MS_stp(t-1) + β*M_st(t) (8);

In formula (8), MS_stp(t) is the target long-term stationarity index, MS_stp(t-1) is the long-term stationarity index corresponding to the (t-1)-th frame audio signal (that is, the frame preceding the target audio signal), and β is the smoothing factor, whose value range may be 0 to 0.1.

It should be noted that the larger MS_stp(t) is, the less stationary the noise in the target audio signal is.
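A minimal sketch of formula (8) follows. The function name, the initial value of 0.0 and the input sequence are hypothetical; β = 0.1 is the upper end of the stated range.

```python
def smooth_stationarity(prev_ms_stp, m_st, beta=0.05):
    """Formula (8): MS_stp(t) = (1 - beta) * MS_stp(t-1) + beta * M_st(t)."""
    return (1.0 - beta) * prev_ms_stp + beta * m_st


# Hypothetical sequence of M_st(t) values in dB with one isolated 6 dB burst.
ms_stp = 0.0          # assumed initial value
history = []
for m_st in [0.0, 6.0, 0.0, 0.0]:
    ms_stp = smooth_stationarity(ms_stp, m_st, beta=0.1)
    history.append(ms_stp)
```

The single 6 dB burst raises the index to only about 0.6 dB, after which it decays, so a one-frame outlier alone would not push the index past a threshold of a few dB.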

The following describes, by way of example, how the electronic device determines the first signal energy.

First, the electronic device may determine the stationary noise floor of the target audio signal using a stationary noise reduction processing method (also called a traditional signal processing method), such as the minimum tracking method or the histogram method. Then, based on the stationary noise floor, the electronic device determines the stationary noise reduction gain Gs(t,k) corresponding to the target audio signal (hereinafter referred to as the first frequency point gain). Because Gs(t,k) is calculated from the stationary noise floor, it can only suppress the stationary noise in the target audio signal.

It can be understood that there are many methods for determining Gs(t,k), such as Wiener filtering and minimum mean square error (MMSE) estimation. For details, refer to the related art; they are not described in detail here.

Next, the electronic device determines the first signal energy Es(t) based on the target time-frequency signal X(t,k) and the first frequency point gain Gs(t,k). Specifically, the first signal energy Es(t) can be expressed by the following formula (9):

In formula (9), B is the number of frequency points in the target audio signal.

The following describes, by way of example, how the electronic device determines the second signal energy.

First, the electronic device may determine the second frequency point gain G_mask(t,k) based on a deep learning mask noise reduction algorithm.

It can be understood that noise reduction based on deep learning mask calculation is the current mainstream noise reduction method, and it has a certain capability to suppress both stationary and non-stationary noise.

For a detailed description of the noise reduction algorithm based on deep learning mask calculation, refer to the related art.

Next, the electronic device determines the second signal energy Et(t) based on the target time-frequency signal X(t,k) and the second frequency point gain G_mask(t,k). Specifically, the second signal energy Et(t) can be expressed by the following formula (10):

In formula (10), B is the number of frequency points in the target audio signal.
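Formulas (9) and (10) are not reproduced in this text. The sketch below assumes that each energy is the sum, over the B frequency points, of the squared magnitude of the gain-weighted spectrum, that is, the sum of |G(t,k)*X(t,k)|^2. The function name and all sample gains are hypothetical; in practice Gs(t,k) and G_mask(t,k) would come from the stationary and deep learning noise reduction algorithms respectively.

```python
def gained_energy(spectrum, gains):
    """Energy of a spectrum after applying one real gain per frequency point.

    Assumed common form of formulas (9) and (10): sum over k of |G(t,k) * X(t,k)|^2.
    """
    if len(spectrum) != len(gains):
        raise ValueError("spectrum and gains must have the same length")
    return sum(abs(g * x) ** 2 for g, x in zip(gains, spectrum))


# Hypothetical frame with B = 4 frequency points.
X = [1 + 1j, 2 + 0j, 0 + 1j, 1 + 0j]
Gs = [1.0, 0.5, 1.0, 1.0]       # hypothetical stationary noise reduction gain
G_mask = [1.0, 0.5, 0.5, 0.0]   # hypothetical deep learning mask gain

Es = gained_energy(X, Gs)       # first signal energy
Et = gained_energy(X, G_mask)   # second signal energy
```

In this example the mask gain attenuates two additional frequency points, so Es exceeds Et, which is the situation the text associates with non-stationary noise.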

Step 102. The electronic device determines the target acoustic scene to which the target audio signal belongs based on the target long-term signal-to-noise ratio and the target long-term stationarity index.

Optionally, in this embodiment of the present application, step 102 may be specifically implemented through one of the following steps 102a to 102d.

Step 102a. When the target long-term signal-to-noise ratio is greater than or equal to the signal-to-noise ratio threshold and the target long-term stationarity index is greater than or equal to the stationarity index threshold, the electronic device determines that the target acoustic scene is the first acoustic scene.

Step 102b. When the target long-term signal-to-noise ratio is greater than or equal to the signal-to-noise ratio threshold and the target long-term stationarity index is less than the stationarity index threshold, the electronic device determines that the target acoustic scene is the second acoustic scene.

Step 102c. When the target long-term signal-to-noise ratio is less than the signal-to-noise ratio threshold and the target long-term stationarity index is greater than or equal to the stationarity index threshold, the electronic device determines that the target acoustic scene is the third acoustic scene.

Step 102d. When the target long-term signal-to-noise ratio is less than the signal-to-noise ratio threshold and the target long-term stationarity index is less than the stationarity index threshold, the electronic device determines that the target acoustic scene is the fourth acoustic scene.

For example, the signal-to-noise ratio threshold may be 15 dB, and the stationarity index threshold may be 2 dB.

Optionally, both the signal-to-noise ratio threshold and the stationarity index threshold are adjustable.

For example, assuming that the signal-to-noise ratio threshold is thr_snr and the stationarity index threshold is thr_ms, the acoustic scenes are as shown in Table 1:

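Table 1
Target long-term signal-to-noise ratio | Target long-term stationarity index | Acoustic scene
snr_l(t) ≥ thr_snr | MS_stp(t) ≥ thr_ms | First acoustic scene: high signal-to-noise ratio, non-stationary noise
snr_l(t) ≥ thr_snr | MS_stp(t) < thr_ms | Second acoustic scene: high signal-to-noise ratio, stationary noise
snr_l(t) < thr_snr | MS_stp(t) ≥ thr_ms | Third acoustic scene: low signal-to-noise ratio, non-stationary noise
snr_l(t) < thr_snr | MS_stp(t) < thr_ms | Fourth acoustic scene: low signal-to-noise ratio, stationary noise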

As shown in Table 1, when the target long-term signal-to-noise ratio snr_l(t) ≥ thr_snr and the target long-term stationarity index MS_stp(t) ≥ thr_ms, the acoustic scene has a high signal-to-noise ratio and non-stationary noise; this is the first acoustic scene.

When snr_l(t) ≥ thr_snr and MS_stp(t) < thr_ms, the acoustic scene has a high signal-to-noise ratio and stationary noise; this is the second acoustic scene.

When snr_l(t) < thr_snr and MS_stp(t) ≥ thr_ms, the acoustic scene has a low signal-to-noise ratio and non-stationary noise; this is the third acoustic scene.

When snr_l(t) < thr_snr and MS_stp(t) < thr_ms, the acoustic scene has a low signal-to-noise ratio and stationary noise; this is the fourth acoustic scene.

Here, thr_snr is the adjustable signal-to-noise ratio threshold, and thr_ms is the adjustable stationarity index threshold.

For example, the signal-to-noise ratio threshold may be any value within 15 dB ± c, and the stationarity index threshold may be any value within 2 dB ± d, where c and d are determined according to actual usage requirements.
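The four-way decision of steps 102a to 102d can be sketched as a small function. The function name and the integer scene labels are hypothetical; the default thresholds are the example values of 15 dB and 2 dB given above.

```python
def classify_scene(snr_l, ms_stp, thr_snr=15.0, thr_ms=2.0):
    """Return 1, 2, 3 or 4 for the first to fourth acoustic scene.

    snr_l  : target long-term signal-to-noise ratio in dB
    ms_stp : target long-term stationarity index in dB
    """
    high_snr = snr_l >= thr_snr
    non_stationary = ms_stp >= thr_ms
    if high_snr and non_stationary:
        return 1  # high SNR, non-stationary noise
    if high_snr:
        return 2  # high SNR, stationary noise
    if non_stationary:
        return 3  # low SNR, non-stationary noise
    return 4      # low SNR, stationary noise


# Hypothetical measurements, one per quadrant.
scenes = [
    classify_scene(20.0, 3.0),
    classify_scene(20.0, 1.0),
    classify_scene(10.0, 3.0),
    classify_scene(10.0, 1.0),
]
```

Because both comparisons use "greater than or equal to", a value exactly on a threshold falls into the high signal-to-noise ratio or non-stationary branch, matching the conditions of step 102a.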

In this way, because the first, second, third and fourth acoustic scenes together cover all acoustic scenes encountered in reality, the generality and applicability of acoustic scene classification are improved.

Step 103. The electronic device performs noise reduction processing on the target audio signal based on the target acoustic scene.

It can be understood that, in the audio noise reduction method provided by this embodiment of the present application, acoustic scenes are classified into four types: the first, second, third and fourth acoustic scenes. Because there are few classification types, the electronic device can apply noise reduction processing tailored to the audio signals in each acoustic scene.

For example, assuming that the noise reduction strategy corresponding to each acoustic scene includes both deep learning noise reduction processing and stationary noise reduction processing, then:

In the noise reduction strategy corresponding to the first acoustic scene, the weight of stationary noise reduction processing < the weight of deep learning noise reduction processing, and the noise suppression ratio is the first ratio;

In the noise reduction strategy corresponding to the second acoustic scene, the weight of stationary noise reduction processing > the weight of deep learning noise reduction processing, and the noise suppression ratio is the second ratio;

In the noise reduction strategy corresponding to the third acoustic scene, the weight of stationary noise reduction processing < the weight of deep learning noise reduction processing, and the noise suppression ratio is the third ratio;

In the noise reduction strategy corresponding to the fourth acoustic scene, the weight of stationary noise reduction processing > the weight of deep learning noise reduction processing, and the noise suppression ratio is the fourth ratio.

Here, the first ratio < the third ratio and the first ratio < the fourth ratio; correspondingly, the second ratio < the third ratio and the second ratio < the fourth ratio.

Optionally, the first ratio and the second ratio may be the same, and the third ratio and the fourth ratio may be the same.
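The per-scene strategies can be represented as a lookup table. The text fixes only the ordering of the weights and ratios, not their values or how the two processing results are combined, so every number below is a hypothetical placeholder chosen to satisfy those orderings, and the names `Strategy`, `STRATEGIES` and `strategy_for` are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    stationary_weight: float     # weight of stationary noise reduction processing
    deep_learning_weight: float  # weight of deep learning noise reduction processing
    suppression_ratio: float     # noise suppression ratio for the scene


# Hypothetical values only; the source constrains orderings, not magnitudes.
STRATEGIES = {
    1: Strategy(0.3, 0.7, 0.5),  # high SNR, non-stationary noise
    2: Strategy(0.7, 0.3, 0.5),  # high SNR, stationary noise
    3: Strategy(0.3, 0.7, 0.8),  # low SNR, non-stationary noise
    4: Strategy(0.7, 0.3, 0.8),  # low SNR, stationary noise
}


def strategy_for(scene):
    """Look up the noise reduction strategy for scene 1, 2, 3 or 4."""
    return STRATEGIES[scene]


selected = strategy_for(3)
```

In this sketch the first and second ratios are equal, as are the third and fourth, which is the optional case mentioned above; the low signal-to-noise ratio scenes get the larger suppression ratio.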

In this way, because noise reduction processing can be performed on the target audio signal according to the noise reduction strategy corresponding to the target acoustic scene, the effect of the noise reduction processing is improved, and so is the sound quality of the electronic device.

Optionally, after performing noise reduction processing on the target audio signal, the electronic device may output the processed target audio signal. For example, when the electronic device is in a call with a target device, that is, when the target audio signal is a speech signal acquired by the electronic device during the call, the electronic device may send the processed target audio signal to the target device after performing noise reduction processing on it.

It should be noted that the electronic device may perform steps 101 to 103 above for each frame of audio in an audio file or for each frame of audio acquired by the electronic device.

In the audio noise reduction method provided by this embodiment of the present application, the long-term signal-to-noise ratio and the stationarity index corresponding to an audio signal are two essential characteristics of the noise in that signal. Therefore, the target acoustic scene corresponding to the target audio signal can be determined more accurately and quickly based on the target long-term signal-to-noise ratio and the target long-term stationarity index, which in turn improves the accuracy of noise reduction performed on the target audio based on the target acoustic scene.

The audio noise reduction method provided by this embodiment of the present application is described below by way of example with reference to FIG. 2.

For example, taking noise reduction of audio signals during a call as an example, the electronic device may use the audio noise reduction method provided by this embodiment of the present application to perform noise reduction processing on each frame of audio signal during the call. As shown in FIG. 2, the audio signal noise reduction processing method may include the following steps 201 to 219.

Step 201. The electronic device reads in the speech signal and divides the read-in speech signal into frames.

For example, in real-time processing, the speech signal collected by the microphone is sent to the digital processing chip of the electronic device in real time, for example 10 ms of data at a time. A speech signal is stationary over short durations (for example, it can be approximately regarded as stationary within 30 ms) and non-stationary over long durations, so the electronic device analyzes relatively short segments of the speech signal, for example taking about 30 ms of signal as one processing frame. That is, 10 ms of data is read in at a time, and the previously read speech data is buffered until about 30 ms of speech data is available for one round of analysis and processing.

It can be seen that the purpose of framing is to treat each fixed-duration segment (for example, 30 ms) of the speech signal as one processing frame, also called a frame of speech signal.
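The buffering described above can be sketched as follows. The sketch assumes a sliding window over the three most recent 10 ms chunks, so that consecutive 30 ms frames overlap; the text does not state whether frames overlap, so this is one possible reading. The class name `FrameAssembler` and the two-sample chunks are hypothetical (a real 10 ms chunk would hold far more samples).

```python
from collections import deque


class FrameAssembler:
    """Buffer fixed-size chunks and emit a frame spanning the most recent chunks."""

    def __init__(self, chunks_per_frame=3):
        # With 10 ms chunks, three chunks give a frame of about 30 ms.
        self._chunks = deque(maxlen=chunks_per_frame)

    def push(self, chunk):
        """Add one chunk; return a full frame, or None while still filling up."""
        self._chunks.append(list(chunk))
        if len(self._chunks) < self._chunks.maxlen:
            return None
        frame = []
        for buffered in self._chunks:
            frame.extend(buffered)
        return frame


assembler = FrameAssembler(chunks_per_frame=3)
# Tiny hypothetical chunks of two samples each, labelled by arrival order.
outputs = [assembler.push([i, i]) for i in range(4)]
```

The first two pushes return None because fewer than three chunks are buffered; from the third chunk onward each push yields a frame built from the latest three chunks.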

Further, each frame of speech signal obtained by framing is a time-domain signal.

Step 202. The electronic device performs an FFT on the time-domain signal of the current frame speech signal to obtain the time-frequency signal of the current frame speech signal.

It can be understood that the current frame speech signal is the target audio signal in the above embodiment.

Step 203. The electronic device determines the total signal energy of the current frame speech signal based on the time-frequency signal of the current frame speech signal.

Step 204. The electronic device may calculate the noise signal of the current frame speech signal based on the target time-frequency signal and, based on that noise signal, determine the total noise energy of the current frame speech signal.

For example, the electronic device calculates the noise signal of the current frame speech signal using a recursive averaging algorithm based on the signal presence probability.

Step 205. The electronic device determines the instantaneous signal-to-noise ratio of the current frame speech signal based on the total signal energy and the total noise energy of the current frame audio signal. The instantaneous signal-to-noise ratio does not reflect the signal and noise energy levels over a period of time, whereas the long-term signal-to-noise ratio is better suited to acoustic scene classification. The following describes how the long-term signal-to-noise ratio is calculated from the instantaneous signal-to-noise ratio.

Step 206. The electronic device smooths the instantaneous signal-to-noise ratio of the current frame speech signal to obtain the final instantaneous signal-to-noise ratio of the current frame speech signal.

In this embodiment of the present application, the electronic device may divide every M (for example, 100) frames of speech signals during the call into one group of historical speech signals.

It can be understood that the instantaneous signal-to-noise ratio of the current frame speech signal may be used in determining the long-term signal-to-noise ratio of the current frame speech signal, or in determining the long-term signal-to-noise ratio of speech signals read in after the current frame speech signal.

The long-term signal-to-noise ratio of a speech signal characterizes the noise level of the speech signal relative to the audio signals over a period of time (for example, multiple frames of historical speech signals).

It should be noted that the electronic device may perform steps 202 to 206 on each frame of speech signal during the call to obtain the final instantaneous signal-to-noise ratio of each frame of speech signal.

Step 207. The electronic device determines the maximum instantaneous signal-to-noise ratio among the instantaneous signal-to-noise ratios of each group of historical speech signals and determines that maximum as the first instantaneous signal-to-noise ratio corresponding to the group.

For the method by which the electronic device determines the maximum instantaneous signal-to-noise ratio among the instantaneous signal-to-noise ratios of each group of historical speech signals, refer to the related description in the above embodiment.

Step 208. The electronic device constructs an N-dimensional array.

The N-dimensional array contains N first instantaneous signal-to-noise ratios, which are the first instantaneous signal-to-noise ratios corresponding to the most recent N groups of historical speech signals in the current call.

In this embodiment of the present application, the electronic device may update the N-dimensional array once every time it collects M frames of speech signals.

Step 209. The electronic device determines the average of the N first instantaneous signal-to-noise ratios in the N-dimensional array, smooths that average, and determines the smoothed average as the target long-term signal-to-noise ratio.
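Steps 207 to 209 can be combined into one small tracker, sketched below under stated assumptions: the initial long-term value is 0.0, and while fewer than N groups have been collected the average is taken over the groups available so far. Neither detail is specified in the text. The class name `LongTermSnrTracker` and the small values of M and N are hypothetical.

```python
from collections import deque


class LongTermSnrTracker:
    """Track the long-term SNR from per-frame final instantaneous SNR values."""

    def __init__(self, m=100, n=5, mu=0.05):
        self.m = m                          # frames per group
        self.mu = mu                        # smoothing factor of formula (6)
        self.group = []                     # SNRs of the group being collected
        self.snr_matrix = deque(maxlen=n)   # most recent N per-group maxima
        self.snr_l = 0.0                    # assumed initial long-term SNR

    def push(self, final_instantaneous_snr):
        """Feed one frame and return the updated long-term SNR."""
        self.group.append(final_instantaneous_snr)
        if len(self.group) == self.m:
            # Step 207: the group maximum is the first instantaneous SNR.
            # Step 208: storing it updates the array once every M frames.
            self.snr_matrix.append(max(self.group))
            self.group = []
        if self.snr_matrix:
            # Step 209: average the array, then smooth as in formula (6).
            mean_snr = sum(self.snr_matrix) / len(self.snr_matrix)
            self.snr_l = (1.0 - self.mu) * self.snr_l + self.mu * mean_snr
        return self.snr_l


tracker = LongTermSnrTracker(m=2, n=2, mu=0.1)
values = [tracker.push(v) for v in [4.0, 8.0, 2.0, 6.0]]
```

With M = 2, the array receives its first entry (8.0) after the second frame and its second entry (6.0) after the fourth, and the returned long-term value rises gradually toward the array average.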

Generally speaking, current mainstream noise reduction algorithms combine deep learning methods with traditional noise reduction methods. Mask estimation based on deep learning can suppress non-stationary noise, whereas noise reduction algorithms based on traditional methods have very limited capability to suppress non-stationary noise. This embodiment of the present application therefore proposes using the difference in noise suppression between these two kinds of noise reduction algorithm to determine the stationarity index of the speech signal during a call.

步骤210、电子设备估计当前帧语音信号的平稳底噪。Step 210: The electronic device estimates the stationary noise floor of the current frame speech signal.

其中，具体确定语音信号的平稳底噪的方法很多，比如最小值跟踪方法，直方图方法等。Among them, there are many methods to specifically determine the smooth noise floor of speech signals, such as the minimum value tracking method, histogram method, etc.
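
As an illustration of the minimum tracking idea mentioned above, the following simplified sketch takes the noise floor in each frequency bin to be the minimum of the recursively smoothed power over a sliding window of recent frames. It is an assumption-laden example, not the estimator used in this application: the class name, the window length, and the smoothing factor are hypothetical, and bias compensation used by practical minimum-statistics estimators is omitted.

```python
from collections import deque


class MinimumTrackingNoiseFloor:
    """Hypothetical per-bin noise floor estimate via a sliding-window minimum."""

    def __init__(self, window=50, smoothing=0.8):
        self.smoothing = smoothing            # recursive smoothing factor (assumed)
        self.smoothed = None                  # smoothed power spectrum, one value per bin
        self.recent = deque(maxlen=window)    # last `window` smoothed spectra

    def update(self, power_spectrum):
        """Take one frame's power spectrum and return the per-bin noise floor."""
        if self.smoothed is None:
            self.smoothed = list(power_spectrum)
        else:
            a = self.smoothing
            self.smoothed = [
                a * s + (1.0 - a) * p for s, p in zip(self.smoothed, power_spectrum)
            ]
        self.recent.append(list(self.smoothed))
        # Speech and transient noise raise the power only briefly, so the
        # windowed minimum in each bin follows the stationary background.
        return [min(bin_values) for bin_values in zip(*self.recent)]
```

A longer window makes the estimate more robust to long speech bursts but slower to follow a rising noise level.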

Step 211: The electronic device determines the stationary noise reduction gain corresponding to the current frame of the speech signal based on the stationary noise floor of that frame.

The stationary noise reduction gain is used to suppress stationary noise in the speech signal.

From the stationary noise floor of the current frame, the electronic device derives the corresponding per-frequency-bin gain, that is, the stationary noise reduction gain G_s(t, k). Because this gain is calculated from the stationary noise floor, it can suppress only stationary noise. There are many methods for calculating the stationary noise reduction gain, such as Wiener filtering and MMSE; since these are not the focus of this application, they are not described in detail.
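
For readers unfamiliar with such gains, one common Wiener-style form is sketched below. It is only an example of the family of methods named above and is not claimed to be the gain used in this application. The function name, the lower bound `gain_floor`, and the use of the a posteriori power ratio are assumptions of the sketch.

```python
def wiener_gain(noisy_power, noise_floor, gain_floor=0.1):
    """Hypothetical per-bin gain G_s(t, k) = max(1 - N(t, k) / |Y(t, k)|^2, gain_floor).

    noisy_power: power of the noisy frame per frequency bin, |Y(t, k)|^2
    noise_floor: estimated stationary noise power per bin, N(t, k)
    """
    gains = []
    for y, n in zip(noisy_power, noise_floor):
        if y <= 0.0:
            # An empty bin carries no signal; fall back to the floor.
            gains.append(gain_floor)
            continue
        # Bins dominated by noise get a small gain, bins dominated by speech
        # get a gain close to 1. The floor limits musical-noise artifacts.
        gains.append(max(1.0 - n / y, gain_floor))
    return gains
```

Because `noise_floor` tracks only the slowly varying background, a sudden noise burst looks like signal to this gain and passes through largely unattenuated, which is the limitation the paragraph above describes.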

Step 212: The electronic device applies the stationary noise reduction gain to the current frame of the speech signal and obtains the first signal energy, that is, the signal energy after stationary noise reduction.

Step 213: The electronic device estimates the non-stationary noise floor of the current frame of the speech signal.

For example, the electronic device processes the current frame of the speech signal with a deep learning noise reduction algorithm to obtain the non-stationary noise floor of that frame.

Step 214: The electronic device determines the non-stationary noise reduction gain corresponding to the current frame of the speech signal based on the non-stationary noise floor of that frame.

The non-stationary noise reduction gain is able, to some extent, to suppress both the non-stationary noise and the stationary noise in the current frame of the speech signal.

Step 215: The electronic device applies the non-stationary noise reduction gain to the current frame of the speech signal and obtains the second signal energy, that is, the signal energy after non-stationary noise reduction.

Step 216: The electronic device determines the signal energy difference between the first signal energy and the second signal energy.

The signal energy difference can serve as the instantaneous stationarity index of the current frame of the speech signal.

It can be understood that, for stationary noise, both stationary noise reduction and deep learning noise reduction perform well, so the two methods differ little and the signal energy difference is close to 0. For non-stationary noise, stationary noise reduction is weak while deep learning noise reduction is strong; that is, the energy Et(t) after deep learning noise reduction is smaller than the energy Es(t) after stationary noise suppression, so the signal energy difference is greater than 0. Its value depends on the difference between the noise energy suppressed by the two methods. Typically it is close to 0 for speech in stationary noise and a few dB for speech in non-stationary noise.
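
Steps 212, 215 and 216 can be illustrated as follows. This is a sketch under stated assumptions: energies are expressed in dB (consistent with the "few dB" remark above), the gains are amplitude gains so that power scales with the gain squared, and the function names are hypothetical.

```python
import math


def frame_energy_db(power_spectrum, gains, eps=1e-12):
    """Energy in dB of one frame after per-bin amplitude gains are applied."""
    energy = sum(g * g * p for g, p in zip(gains, power_spectrum))
    return 10.0 * math.log10(energy + eps)  # eps avoids log10(0) on silent frames


def instantaneous_stationarity(power_spectrum, stationary_gains, dl_gains):
    """Signal energy difference Es(t) - Et(t) for the current frame, in dB."""
    e_s = frame_energy_db(power_spectrum, stationary_gains)  # first signal energy
    e_t = frame_energy_db(power_spectrum, dl_gains)          # second signal energy
    return e_s - e_t
```

When both gain sets remove the same noise the result is near 0 dB; when the deep learning gains remove noise that the stationary gains leave in place, the result is positive.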

Because the signal energy difference reflects only the instantaneous stationarity of the current frame, it is inconvenient to use directly. It can therefore be smoothed to obtain the noise stationarity of the speech signal over a period of time up to the current frame.

Step 217: The electronic device smooths the signal energy difference to obtain the long-term stationarity index of the current frame of the speech signal.

At this point the stationarity metric has been obtained. If its value is close to 0, the acoustic scene corresponding to the current frame of the speech signal is of the stationary noise type; if its value is large, the acoustic scene corresponding to the current frame is of the non-stationary noise type.
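
The smoothing in step 217 could, for example, be a first-order recursive average. The text above does not specify the smoother, so the form below, the factor `alpha`, and the function name are assumptions made purely for illustration.

```python
def long_term_stationarity(energy_diffs_db, alpha=0.95):
    """Smooth per-frame energy differences (dB) into a long-term index.

    Returns the smoothed index after each frame so its evolution can be inspected.
    """
    smoothed = None
    history = []
    for diff in energy_diffs_db:
        if smoothed is None:
            smoothed = diff  # initialise with the first observation
        else:
            smoothed = alpha * smoothed + (1.0 - alpha) * diff
        history.append(smoothed)
    return history
```

A value of `alpha` close to 1 gives a slowly varying index that ignores isolated frames, at the cost of reacting more slowly when the noise type really changes.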

Step 218: The electronic device determines the target acoustic scene corresponding to the current frame of the speech signal based on the target long-term signal-to-noise ratio and the long-term stationarity index of that frame.

Step 219: The electronic device performs noise reduction on the current frame of the speech signal based on the target acoustic scene.

For other details of steps 201 to 219, refer to the relevant descriptions in the above embodiments; to avoid repetition, they are not described again here.

The audio noise reduction method provided by the embodiments of the present application may be executed by an audio noise reduction apparatus, or by a control module in the audio noise reduction apparatus that is used to execute the audio noise reduction method. In the embodiments of the present application, the audio noise reduction apparatus is described by taking the case in which the audio noise reduction apparatus executes the audio noise reduction method as an example.

An embodiment of the present application provides an audio noise reduction apparatus. FIG. 3 shows a possible schematic structural diagram of the audio noise reduction apparatus provided by the embodiment of the present application. As shown in FIG. 3, the audio noise reduction apparatus 300 may include a processing module 301 and a determination module 302. The processing module is configured to calculate a target long-term signal-to-noise ratio and a target long-term stationarity index corresponding to a target audio signal, where the target long-term stationarity index indicates how stationary the noise in the target audio signal is. The determination module is configured to determine the target acoustic scene corresponding to the target audio signal according to the target long-term signal-to-noise ratio and the target long-term stationarity index calculated by the processing module. The processing module is further configured to perform noise reduction on the target audio signal based on the target acoustic scene determined by the determination module.

In a possible implementation, the determination module is specifically configured to: when the target long-term signal-to-noise ratio is greater than or equal to a signal-to-noise ratio threshold and the target long-term stationarity index is greater than or equal to a stationarity index threshold, determine that the target acoustic scene is a first acoustic scene;

when the target long-term signal-to-noise ratio is greater than or equal to the signal-to-noise ratio threshold and the target long-term stationarity index is less than the stationarity index threshold, determine that the target acoustic scene is a second acoustic scene;

when the target long-term signal-to-noise ratio is less than the signal-to-noise ratio threshold and the target long-term stationarity index is greater than or equal to the stationarity index threshold, determine that the target acoustic scene is a third acoustic scene;

when the target long-term signal-to-noise ratio is less than the signal-to-noise ratio threshold and the target long-term stationarity index is less than the stationarity index threshold, determine that the target acoustic scene is a fourth acoustic scene.
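
The four cases above amount to a two-threshold decision, which can be sketched as below. The enum and function names are hypothetical, and the thresholds are left as parameters because no numeric values are given in this passage.

```python
from enum import Enum


class AcousticScene(Enum):
    FIRST = 1    # SNR >= threshold, stationarity index >= threshold
    SECOND = 2   # SNR >= threshold, stationarity index <  threshold
    THIRD = 3    # SNR <  threshold, stationarity index >= threshold
    FOURTH = 4   # SNR <  threshold, stationarity index <  threshold


def classify_scene(long_term_snr, stationarity_index,
                   snr_threshold, stationarity_threshold):
    """Map the two long-term measures to one of the four acoustic scenes."""
    high_snr = long_term_snr >= snr_threshold
    high_index = stationarity_index >= stationarity_threshold
    if high_snr:
        return AcousticScene.FIRST if high_index else AcousticScene.SECOND
    return AcousticScene.THIRD if high_index else AcousticScene.FOURTH
```

Since a large stationarity index indicates non-stationary noise, the first and third scenes correspond to non-stationary noise at high and low signal-to-noise ratio respectively, and the second and fourth scenes to stationary noise.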

In a possible implementation, the processing module is specifically configured to determine N first instantaneous signal-to-noise ratios based on the instantaneous signal-to-noise ratios of N groups of historical audio signals, and to determine the target long-term signal-to-noise ratio based on the N first instantaneous signal-to-noise ratios;

where each group of historical audio signals includes M historical audio signals, and the N first instantaneous signal-to-noise ratios correspond one-to-one to the N groups of historical audio signals; M and N are both positive integers.

In a possible implementation, the processing module is specifically configured to determine the maximum instantaneous signal-to-noise ratio among the instantaneous signal-to-noise ratios of each group of historical audio signals, and to take that maximum as the first instantaneous signal-to-noise ratio corresponding to that group of historical audio signals.

In a possible implementation, the processing module is specifically configured to determine the target long-term signal-to-noise ratio based on the N first instantaneous signal-to-noise ratios and a second instantaneous signal-to-noise ratio;

where the second instantaneous signal-to-noise ratio is the instantaneous signal-to-noise ratio of the target audio signal.

In a possible implementation, the processing module is specifically configured to determine the average of the N first instantaneous signal-to-noise ratios and to take that average as the target long-term signal-to-noise ratio.

In a possible implementation, the processing module is specifically configured to determine the signal energy difference between a first signal energy and a second signal energy, and to smooth the signal energy difference to obtain the target long-term stationarity index;

where the first signal energy is the signal energy after stationary noise reduction is performed on the target audio signal, and the second signal energy is the signal energy after deep learning noise reduction is performed on the target audio signal.

In the embodiments of the present application, the long-term signal-to-noise ratio and the stationarity index corresponding to an audio signal are two essential characteristics of the noise in that signal. Using the target long-term signal-to-noise ratio and the target long-term stationarity index of the target audio signal therefore allows the corresponding target acoustic scene to be determined more accurately and quickly, which in turn improves the accuracy of the noise reduction performed on the target audio based on the target acoustic scene.

The audio noise reduction apparatus in the embodiments of the present application may be an electronic device, or a component in an electronic device, such as an integrated circuit or a chip. The electronic device may be a terminal or a device other than a terminal. For example, the electronic device may be a mobile phone, a tablet computer, a notebook computer, a handheld computer, a vehicle-mounted electronic device, a mobile Internet device (MID), an augmented reality (AR)/virtual reality (VR) device, a robot, a wearable device, an ultra-mobile personal computer (UMPC), a netbook, or a personal digital assistant (PDA); it may also be a server, a network attached storage (NAS) device, a personal computer (PC), a television (TV), a teller machine, or a self-service machine. The embodiments of the present application impose no specific limitation in this respect.

The audio noise reduction apparatus in the embodiments of the present application may be an apparatus with an operating system. The operating system may be the Android operating system, the iOS operating system, or another possible operating system; the embodiments of the present application impose no specific limitation in this respect.

The audio noise reduction apparatus provided by the embodiments of the present application can implement each process implemented by the method embodiments of FIG. 1 and FIG. 2; to avoid repetition, details are not described again here.

Optionally, as shown in FIG. 4, an embodiment of the present application further provides an electronic device 400, including a processor 401 and a memory 402. The memory 402 stores programs or instructions that can run on the processor 401. When executed by the processor 401, the programs or instructions implement each step of the above audio noise reduction method embodiments and achieve the same technical effect; to avoid repetition, details are not described again here.

It should be noted that the electronic devices in the embodiments of the present application include the mobile electronic devices and non-mobile electronic devices described above.

FIG. 5 is a schematic diagram of the hardware structure of an electronic device implementing an embodiment of the present application.

The electronic device 500 includes, but is not limited to, at least some of the following components: a radio frequency unit 501, a network module 502, an audio output unit 503, an input unit 504, a sensor 505, a display unit 506, a user input unit 507, an interface unit 508, a memory 509, and a processor 510.

Those skilled in the art will understand that the terminal 500 may further include a power supply (such as a battery) that supplies power to the components. The power supply may be logically connected to the processor 510 through a power management system, so that charging, discharging, and power consumption are managed through the power management system. The terminal structure shown in FIG. 5 does not limit the terminal; the terminal may include more or fewer components than shown, combine certain components, or arrange the components differently, which is not described further here.

The processor 510 is configured to calculate a target long-term signal-to-noise ratio and a target long-term stationarity index corresponding to a target audio signal, where the target long-term stationarity index indicates how stationary the noise in the target audio signal is;

the processor 510 is further configured to determine the target acoustic scene corresponding to the target audio signal according to the target long-term signal-to-noise ratio and the target long-term stationarity index;

the processor 510 is further configured to perform noise reduction on the target audio signal based on the target acoustic scene determined by the processor 510.

In a possible implementation, the processor 510 is specifically configured to: when the target long-term signal-to-noise ratio is greater than or equal to a signal-to-noise ratio threshold and the target long-term stationarity index is greater than or equal to a stationarity index threshold, determine that the target acoustic scene is a first acoustic scene;

In a possible implementation, the processor 510 is specifically configured to determine N first instantaneous signal-to-noise ratios based on the instantaneous signal-to-noise ratios of N groups of historical audio signals, and to determine the target long-term signal-to-noise ratio based on the N first instantaneous signal-to-noise ratios;

In a possible implementation, the processor 510 is specifically configured to determine the maximum instantaneous signal-to-noise ratio among the instantaneous signal-to-noise ratios of each group of historical audio signals, and to take that maximum as the first instantaneous signal-to-noise ratio corresponding to that group of historical audio signals.

In a possible implementation, the processor 510 is specifically configured to determine the target long-term signal-to-noise ratio based on the N first instantaneous signal-to-noise ratios and a second instantaneous signal-to-noise ratio;

In a possible implementation, the processor 510 is specifically configured to determine the average of the N first instantaneous signal-to-noise ratios and to take that average as the target long-term signal-to-noise ratio.

In a possible implementation, the processor 510 is specifically configured to determine the signal energy difference between a first signal energy and a second signal energy, and to smooth the signal energy difference to obtain the target long-term stationarity index;

It should be understood that, in the embodiments of the present application, the input unit 504 may include a graphics processing unit (GPU) 5041 and a microphone 5042. The graphics processing unit 5041 processes image data of still pictures or video obtained by an image capture device (such as a camera) in video capture mode or image capture mode. The display unit 506 may include a display panel 5061, which may be configured in the form of a liquid crystal display, an organic light-emitting diode, or the like. The user input unit 507 includes at least one of a touch panel 5071 and other input devices 5072. The touch panel 5071 is also called a touch screen and may include two parts: a touch detection device and a touch controller. The other input devices 5072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, and a joystick, which are not described further here.

In the embodiments of the present application, after receiving downlink data from a network-side device, the radio frequency unit 501 may pass it to the processor 510 for processing; in addition, the radio frequency unit 501 may send uplink data to the network-side device. Generally, the radio frequency unit 501 includes, but is not limited to, an antenna, an amplifier, a transceiver, a coupler, a low-noise amplifier, and a duplexer.

The memory 509 may be used to store software programs or instructions as well as various data. The memory 509 may mainly include a first storage area for storing programs or instructions and a second storage area for storing data, where the first storage area may store an operating system and the application programs or instructions required for at least one function (such as a sound playback function or an image playback function). In addition, the memory 509 may include volatile memory or non-volatile memory, or both. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (Programmable ROM, PROM), erasable programmable read-only memory (Erasable PROM, EPROM), electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), or flash memory. The volatile memory may be random access memory (RAM), static random access memory (Static RAM, SRAM), dynamic random access memory (Dynamic RAM, DRAM), synchronous dynamic random access memory (Synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDRSDRAM), enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM), synchronous link dynamic random access memory (Synch link DRAM, SLDRAM), or direct Rambus random access memory (Direct Rambus RAM, DRRAM). The memory 509 in the embodiments of the present application includes, but is not limited to, these and any other suitable types of memory.

The processor 510 may include one or more processing units. Optionally, the processor 510 integrates an application processor and a modem processor, where the application processor mainly handles operations involving the operating system, the user interface, and application programs, and the modem processor, such as a baseband processor, mainly handles wireless communication signals. It can be understood that the modem processor may also not be integrated into the processor 510.

An embodiment of the present application further provides a readable storage medium storing programs or instructions. When executed by a processor, the programs or instructions implement each process of the above audio noise reduction method embodiments and achieve the same technical effect; to avoid repetition, details are not described again here.

The processor is the processor in the electronic device described in the above embodiments. The readable storage medium includes a computer-readable storage medium, such as a computer read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

An embodiment of the present application further provides a chip. The chip includes a processor and a communication interface coupled to the processor. The processor is configured to run programs or instructions to implement each process of the above audio noise reduction method embodiments and achieve the same technical effect; to avoid repetition, details are not described again here.

An embodiment of the present application provides a computer program product. The program product is stored in a storage medium and is executed by at least one processor to implement each process of the above audio noise reduction method embodiments and achieve the same technical effect; to avoid repetition, details are not described again here.

It should be understood that the chip mentioned in the embodiments of the present application may also be referred to as a system-level chip, a system chip, a chip system, or a system-on-chip.

It should be noted that, in this document, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or apparatus that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes that element. In addition, it should be pointed out that the scope of the methods and apparatuses in the embodiments of the present application is not limited to performing functions in the order shown or discussed; functions may also be performed substantially simultaneously or in the reverse order, depending on the functions involved. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. Furthermore, features described with reference to certain examples may be combined in other examples.

From the description of the above embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, and can of course also be implemented by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a computer software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present application.

The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the specific implementations described, which are merely illustrative rather than restrictive. Inspired by the present application, those of ordinary skill in the art can devise many other forms without departing from the purpose of the present application and the scope protected by the claims, all of which fall within the protection of the present application.

Claims

一种音频降噪方法，所述方法包括：An audio noise reduction method, the method includes:

计算目标音频信号对应的目标长时信噪比和目标长时平稳度指标，所述目标长时平稳度指标用于指示所述目标音频信号中噪声的平稳程度；Calculate the target long-term signal-to-noise ratio and the target long-term stationarity index corresponding to the target audio signal. The target long-term stationarity index is used to indicate the smoothness of the noise in the target audio signal;

根据所述目标长时信噪比和所述目标长时平稳度指标，确定所述目标音频信号对应的目标声学场景；Determine the target acoustic scene corresponding to the target audio signal according to the target long-term signal-to-noise ratio and the target long-term stationarity index;

基于所述目标声学场景，对所述目标音频信号进行降噪处理。Based on the target acoustic scene, the target audio signal is subjected to noise reduction processing.
The method according to claim 1, wherein determining the target acoustic scene corresponding to the target audio signal according to the target long-term signal-to-noise ratio and the target long-term stationarity index comprises:

when the target long-term signal-to-noise ratio is greater than or equal to a signal-to-noise ratio threshold and the target long-term stationarity index is greater than or equal to a stationarity index threshold, determining that the target acoustic scene is a first acoustic scene;

when the target long-term signal-to-noise ratio is greater than or equal to the signal-to-noise ratio threshold and the target long-term stationarity index is less than the stationarity index threshold, determining that the target acoustic scene is a second acoustic scene;

when the target long-term signal-to-noise ratio is less than the signal-to-noise ratio threshold and the target long-term stationarity index is greater than or equal to the stationarity index threshold, determining that the target acoustic scene is a third acoustic scene; and

when the target long-term signal-to-noise ratio is less than the signal-to-noise ratio threshold and the target long-term stationarity index is less than the stationarity index threshold, determining that the target acoustic scene is a fourth acoustic scene.
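The four-way decision recited above can be sketched as a pair of threshold comparisons. This is only an illustrative sketch: the names `AcousticScene` and `classify_scene` are hypothetical, and the claim does not fix any numeric threshold values, so the thresholds are passed in as parameters.

```python
from enum import Enum


class AcousticScene(Enum):
    """Hypothetical labels for the four scenes named in the claim."""
    FIRST = 1
    SECOND = 2
    THIRD = 3
    FOURTH = 4


def classify_scene(long_term_snr: float,
                   long_term_stationarity: float,
                   snr_threshold: float,
                   stationarity_threshold: float) -> AcousticScene:
    """Map the two long-term measures to one of four scenes.

    Both comparisons are "greater than or equal to", as in the claim,
    so a value exactly at a threshold falls on the high side.
    """
    high_snr = long_term_snr >= snr_threshold
    high_stationarity = long_term_stationarity >= stationarity_threshold

    if high_snr and high_stationarity:
        return AcousticScene.FIRST
    if high_snr:
        return AcousticScene.SECOND
    if high_stationarity:
        return AcousticScene.THIRD
    return AcousticScene.FOURTH
```

Because the two tests are independent, every pair of input values maps to exactly one scene, which matches the exhaustive, non-overlapping conditions in the claim.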
The method according to claim 1, wherein calculating the target long-term signal-to-noise ratio corresponding to the target audio signal comprises:

determining N first instantaneous signal-to-noise ratios based on instantaneous signal-to-noise ratios of N groups of historical audio signals; and

determining the target long-term signal-to-noise ratio based on the N first instantaneous signal-to-noise ratios;

wherein each group of historical audio signals includes M historical audio signals, and the N first instantaneous signal-to-noise ratios are in one-to-one correspondence with the N groups of historical audio signals; and

M and N are both positive integers.
The method according to claim 3, wherein determining the N first instantaneous signal-to-noise ratios based on the instantaneous signal-to-noise ratios of the N groups of historical audio signals comprises:

determining a maximum instantaneous signal-to-noise ratio among the instantaneous signal-to-noise ratios of each group of historical audio signals, and determining the maximum instantaneous signal-to-noise ratio as the first instantaneous signal-to-noise ratio corresponding to that group of historical audio signals.
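The per-group maximum recited above can be sketched as follows. The sketch assumes that the N groups are consecutive, non-overlapping runs of M per-signal instantaneous SNR values held in a flat list; the claim does not say how the groups are formed, and the function name `first_instantaneous_snrs` is hypothetical.

```python
from typing import List, Sequence


def first_instantaneous_snrs(history_snrs: Sequence[float], m: int) -> List[float]:
    """Return one value per group: the largest instantaneous SNR in that group.

    history_snrs holds N * M instantaneous SNR values, oldest first.
    Grouping into consecutive blocks of M is an assumption of this sketch.
    """
    if m <= 0:
        raise ValueError("m must be a positive integer")
    if len(history_snrs) == 0 or len(history_snrs) % m != 0:
        raise ValueError("history_snrs must contain a positive multiple of m values")

    return [max(history_snrs[start:start + m])
            for start in range(0, len(history_snrs), m)]
```

For example, with M = 3 the six values `[3.0, 9.0, 4.0, 1.0, 2.0, 8.0]` form N = 2 groups, whose maxima are 9.0 and 8.0.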
The method according to claim 3, wherein determining the target long-term signal-to-noise ratio based on the N first instantaneous signal-to-noise ratios comprises:

determining the target long-term signal-to-noise ratio based on the N first instantaneous signal-to-noise ratios and a second instantaneous signal-to-noise ratio;

wherein the second instantaneous signal-to-noise ratio is the instantaneous signal-to-noise ratio of the target audio signal.
The method according to claim 3, wherein determining the target long-term signal-to-noise ratio based on the N first instantaneous signal-to-noise ratios comprises:

determining an average signal-to-noise ratio of the N first instantaneous signal-to-noise ratios, and determining the average signal-to-noise ratio as the target long-term signal-to-noise ratio.
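As a small worked example of the averaging step, the sketch below takes the N first instantaneous SNRs and returns their arithmetic mean. The function name `long_term_snr` is hypothetical, and the sketch assumes a plain unweighted mean taken directly over the given values, since the claim does not specify weighting or the unit in which the average is formed.

```python
from statistics import fmean
from typing import Sequence


def long_term_snr(first_snrs: Sequence[float]) -> float:
    """Average the N first instantaneous SNRs into one long-term SNR."""
    if len(first_snrs) == 0:
        raise ValueError("at least one first instantaneous SNR is required")
    return fmean(first_snrs)
```

With two group maxima of 9.0 and 8.0, the long-term SNR is (9.0 + 8.0) / 2 = 8.5.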
The method according to any one of claims 1 to 6, wherein calculating the target long-term stationarity index corresponding to the target audio signal comprises:

determining a signal energy difference between a first signal energy and a second signal energy; and

smoothing the signal energy difference to obtain the target long-term stationarity index;

wherein the first signal energy is the signal energy after stationary-noise reduction processing is performed on the target audio signal, and the second signal energy is the signal energy after deep-learning noise reduction processing is performed on the target audio signal.
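The energy-difference-and-smoothing computation above can be sketched as follows. Several details are assumptions of this sketch and are not stated in the claim: energy is taken as the sum of squared samples of a frame, the difference is first energy minus second energy in the linear domain, and smoothing is first-order recursive (exponential) smoothing with a hypothetical factor `alpha`. The two noise reducers themselves are out of scope, so their output frames are passed in directly.

```python
from typing import Optional, Sequence


def frame_energy(frame: Sequence[float]) -> float:
    """Energy of one frame, taken here as the sum of squared samples."""
    return sum(sample * sample for sample in frame)


class StationarityTracker:
    """Keeps a smoothed energy difference across successive frames."""

    def __init__(self, alpha: float = 0.9) -> None:
        # alpha is a hypothetical smoothing factor; larger means slower tracking.
        if not 0.0 <= alpha < 1.0:
            raise ValueError("alpha must be in [0, 1)")
        self.alpha = alpha
        self.value: Optional[float] = None

    def update(self,
               stationary_nr_frame: Sequence[float],
               deep_learning_nr_frame: Sequence[float]) -> float:
        """Fold one frame pair into the long-term index and return it."""
        # First signal energy: output of stationary-noise reduction.
        # Second signal energy: output of deep-learning noise reduction.
        difference = (frame_energy(stationary_nr_frame)
                      - frame_energy(deep_learning_nr_frame))
        if self.value is None:
            # The first frame initialises the smoothed value.
            self.value = difference
        else:
            self.value = self.alpha * self.value + (1.0 - self.alpha) * difference
        return self.value
```

Calling `update` once per frame yields a slowly varying value that can be compared against a stationarity index threshold; how the resulting value is scaled or normalised before that comparison is left open here.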
An acoustic scene classification apparatus, comprising a processing module and a determination module, wherein:

the processing module is configured to calculate a target long-term signal-to-noise ratio and a target long-term stationarity index corresponding to a target audio signal, wherein the target long-term stationarity index indicates how stationary the noise in the target audio signal is;

the determination module is configured to determine a target acoustic scene corresponding to the target audio signal according to the target long-term signal-to-noise ratio and the target long-term stationarity index calculated by the processing module; and

the processing module is further configured to perform noise reduction processing on the target audio signal based on the target acoustic scene determined by the determination module.
The apparatus according to claim 8, wherein the determination module is specifically configured to:

when the target long-term signal-to-noise ratio is greater than or equal to a signal-to-noise ratio threshold and the target long-term stationarity index is greater than or equal to a stationarity index threshold, determine that the target acoustic scene is a first acoustic scene;

when the target long-term signal-to-noise ratio is greater than or equal to the signal-to-noise ratio threshold and the target long-term stationarity index is less than the stationarity index threshold, determine that the target acoustic scene is a second acoustic scene;

when the target long-term signal-to-noise ratio is less than the signal-to-noise ratio threshold and the target long-term stationarity index is greater than or equal to the stationarity index threshold, determine that the target acoustic scene is a third acoustic scene; and

when the target long-term signal-to-noise ratio is less than the signal-to-noise ratio threshold and the target long-term stationarity index is less than the stationarity index threshold, determine that the target acoustic scene is a fourth acoustic scene.
The apparatus according to claim 9, wherein the processing module is specifically configured to determine N first instantaneous signal-to-noise ratios based on instantaneous signal-to-noise ratios of N groups of historical audio signals, and to determine the target long-term signal-to-noise ratio based on the N first instantaneous signal-to-noise ratios;

wherein each group of historical audio signals includes M historical audio signals, and the N first instantaneous signal-to-noise ratios are in one-to-one correspondence with the N groups of historical audio signals; M and N are both positive integers.
The apparatus according to claim 10, wherein the processing module is specifically configured to determine a maximum instantaneous signal-to-noise ratio among the instantaneous signal-to-noise ratios of each group of historical audio signals, and to determine the maximum instantaneous signal-to-noise ratio as the first instantaneous signal-to-noise ratio corresponding to that group of historical audio signals.
The apparatus according to claim 10, wherein the processing module is specifically configured to determine the target long-term signal-to-noise ratio based on the N first instantaneous signal-to-noise ratios and a second instantaneous signal-to-noise ratio;

wherein the second instantaneous signal-to-noise ratio is the instantaneous signal-to-noise ratio of the target audio signal.
The apparatus according to claim 10, wherein the processing module is specifically configured to determine an average signal-to-noise ratio of the N first instantaneous signal-to-noise ratios, and to determine the average signal-to-noise ratio as the target long-term signal-to-noise ratio.
The apparatus according to any one of claims 8 to 13, wherein the processing module is specifically configured to determine a signal energy difference between a first signal energy and a second signal energy, and to smooth the signal energy difference to obtain the target long-term stationarity index;

wherein the first signal energy is the signal energy after stationary-noise reduction processing is performed on the target audio signal, and the second signal energy is the signal energy after deep-learning noise reduction processing is performed on the target audio signal.
An electronic device, comprising a processor and a memory, wherein the memory stores a program or instructions executable on the processor, and the program or instructions, when executed by the processor, implement the steps of the audio noise reduction method according to any one of claims 1 to 7.
A computer-readable storage medium storing a program or instructions, wherein the program or instructions, when executed by a processor, implement the steps of the audio noise reduction method according to any one of claims 1 to 7.
A computer program product, wherein the computer program product is stored in a storage medium and is executed by at least one processor to implement the audio noise reduction method according to any one of claims 1 to 7.
A chip, comprising a processor and a communication interface, wherein the communication interface is coupled to the processor, and the processor is configured to run a program or instructions to implement the audio noise reduction method according to any one of claims 1 to 7.