WO2020151338A1 - Method, device, storage medium and mobile terminal for detecting audio noise - Google Patents

Method, device, storage medium and mobile terminal for detecting audio noise

Info

Publication number
WO2020151338A1
WO2020151338A1 (PCT/CN2019/118544)
Authority
WO
WIPO (PCT)
Prior art keywords
noise detection
convolutional layers
usage rate
voice signal
time
Prior art date
Application number
PCT/CN2019/118544
Other languages
English (en)
French (fr)
Inventor
庞烨
周新宇
王健宗
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020151338A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the field of computer technology, and in particular to a method, device, storage medium and mobile terminal for detecting audio noise.
  • A voice assistant is an intelligent mobile phone application that helps users solve many problems through instant question-and-answer interaction.
  • After obtaining the user's voice, the mobile terminal needs to detect noise in the voice signal. Current approaches typically use classifiers (SVM, random forest, etc.) or a neural network operating on acoustic features such as MFCC.
  • The embodiments of the present application provide a method, device, storage medium, and mobile terminal for detecting audio noise. Even if the mobile terminal is offline, it can detect noise in a voice signal with good real-time performance.
  • the first aspect of the embodiments of the present application provides a method for detecting audio noise, including:
  • the extracted GFCC features and Gabor features are sequentially input into the N convolutional layers, one fully connected layer, and one softmax layer of the CNN neural network model to obtain the noise detection result of the speech signal, 3 ≤ N ≤ 5.
  • a second aspect of the embodiments of the present application provides an audio noise detection device, including:
  • the voice signal acquisition module is used to acquire the input voice signal
  • the framing module is used for framing the voice signal
  • a voice feature extraction module which is used to extract the GFCC feature and Gabor feature of the voice signal after framing
  • the noise detection module is used to sequentially input the extracted GFCC features and Gabor features into the N convolutional layers, one fully connected layer, and one softmax layer of the CNN neural network model to obtain the noise detection result of the voice signal, 3 ≤ N ≤ 5.
  • the third aspect of the embodiments of the present application provides a computer non-volatile readable storage medium, the computer non-volatile readable storage medium stores computer readable instructions, and the computer readable instructions are When executed, the steps of the audio noise detection method proposed in the first aspect of the embodiments of the present application are implemented.
  • a fourth aspect of the embodiments of the present application provides a mobile terminal, including a memory, a processor, and computer-readable instructions stored in the memory and running on the processor, and the processor executes the The computer-readable instructions implement the steps of the audio noise detection method as proposed in the first aspect of the embodiments of the present application.
  • the audio noise detection method proposed in this application includes: acquiring an input voice signal; framing the voice signal; extracting the GFCC features and Gabor features of the framed voice signal; and sequentially inputting the extracted GFCC features and Gabor features into the N convolutional layers, one fully connected layer, and one softmax layer of the CNN neural network model to obtain the noise detection result of the speech signal, 3 ≤ N ≤ 5.
  • the CNN network structure is a simplified structure without pooling layers and with a small number of convolutional layers, which greatly reduces the amount of calculation, so that the mobile terminal's own processor can complete the computation of the CNN model without connecting to a server. Therefore, even if the mobile terminal is offline, it can still detect noise in the voice signal, with good real-time performance.
  • FIG. 1 is a flowchart of a first embodiment of a method for detecting audio noise according to an embodiment of the present application
  • FIG. 2 is a flowchart of a second embodiment of a method for detecting audio noise according to an embodiment of the present application
  • FIG. 3 is a flowchart of a third embodiment of a method for detecting audio noise according to an embodiment of the present application
  • FIG. 4 is a structural diagram of an embodiment of an audio noise detection device provided by an embodiment of the present application.
  • Fig. 5 is a schematic diagram of a mobile terminal provided by an embodiment of the present application.
  • the embodiments of the present application provide a detection method, device, storage medium, and mobile terminal for audio noise. Even if the mobile terminal is in an offline state, it can still detect noise in a voice signal, with good real-time performance.
  • a first embodiment of an audio noise detection method in an embodiment of the present application includes:
  • the input voice signal is acquired.
  • the voice signal can be input by the user in real time, or it can be a pre-recorded voice signal.
  • After the input voice signal is acquired, the voice signal is framed. Framing is the windowing and segmentation of the speech signal: as the window moves to the right (assuming rightward represents forward in time), the windowed signal is processed step by step. Since a speech signal is not a single stationary signal, it needs to be divided into frames so that the length of each frame is between 20 ms and 40 ms, which satisfies the requirements of GFCC and Gabor feature extraction without losing information.
  • it is preferable to frame the speech signal into 25 ms segments, and then extract the GFCC features and Gabor features of each segment.
  • GFCC is an FFT-based feature extraction technique similar to MFCC, but it uses a Gammatone filter bank on the equivalent rectangular bandwidth (ERB) scale instead of the Mel filter bank. Since the Gammatone filter bank is the filter response closest to that of the human cochlea, GFCC features are also called auditory features. As a new kind of auditory cepstral coefficient, they offer a better recognition rate and noise robustness than LPCC and MFCC.
  • the steps for extracting GFCC features belong to the prior art, and can specifically include: sequentially applying signal pre-emphasis, windowing, DFT, Gammatone filtering, cube-root compression, and DCT to the framed speech signal, thereby outputting the GFCC cepstral coefficient features.
  • Gabor is a linear filter used for edge extraction, which can provide good direction selection and scale selection characteristics to improve the robustness of noise recognition.
  • the steps for extracting Gabor features also belong to the prior art, and may specifically include: pre-emphasizing and windowing the framed speech signal, and then passing it through a two-dimensional Gabor filter to obtain the Gabor features.
  • In the spatial domain, a two-dimensional Gabor filter is the product of a sinusoidal plane wave and a Gaussian kernel function; the former is a tuning function and the latter is a window function.
  • the extracted GFCC features and Gabor features are sequentially input into the N convolutional layers, one fully connected layer, and one softmax layer of the CNN neural network model to obtain the noise detection result of the speech signal.
  • the input GFCC features and Gabor features are in the form of a matrix, and the CNN network outputs the probability value of the voice signal containing noise. If the probability value exceeds a certain threshold, it indicates that the voice signal to be detected contains noise.
  • the CNN neural network model does not contain a pooling layer, and the number of convolutional layers is N (3 ≤ N ≤ 5).
  • a certain preferred CNN network structure is shown in Table 1 below:
  • the above CNN network structure is a simplified structure without pooling layers and with a small number of convolutional layers, which greatly reduces the amount of calculation, so that the mobile terminal's own processor can complete the computation of the CNN model without connecting to a server. Therefore, even if the mobile terminal is offline, it can still detect noise in the voice signal, with good real-time performance.
  • experimental data show that extracting GFCC features and Gabor features and inputting these two types of audio features into the simplified CNN network (no pooling layer, 3 to 5 convolutional layers) achieves satisfactory noise detection results.
  • the specific calculation process can include: (1) Input the GFCC features and Gabor features into the first convolutional layer of the CNN network; both are matrices with the same dimensions (for example, 5*8 matrices). Through convolution, the first convolutional layer outputs a first intermediate result in a matrix form that meets the requirements of the next layer (the second convolutional layer);
  • the underlying calculation processes of the convolutional layers, the fully connected layer, and softmax belong to the prior art.
  • a typical CNN network contains multiple convolutional layers and pooling layers. This application reduces the amount of calculation by simplifying the layer structure of the network.
  • the audio noise detection method proposed in the embodiment of this application includes: acquiring an input voice signal; framing the voice signal; extracting the GFCC features and Gabor features of the framed voice signal; and sequentially inputting the extracted GFCC features and Gabor features into the N convolutional layers, one fully connected layer, and one softmax layer of the CNN neural network model to obtain the noise detection result of the speech signal, 3 ≤ N ≤ 5.
  • the CNN network structure is a simplified structure without pooling layers and with a small number of convolutional layers, which greatly reduces the amount of calculation, so that the mobile terminal's own processor can complete the computation of the CNN model without connecting to a server. Therefore, even if the mobile terminal is offline, it can still detect noise in the voice signal, with good real-time performance.
  • a second embodiment of a method for detecting audio noise in an embodiment of the present application includes:
  • For details of steps 201-205, please refer to the first embodiment of this application.
  • when the input voice signal is obtained, the current time is recorded as the start time of noise detection; when the noise detection result of the voice signal is obtained, the current time is recorded as the end time of noise detection; the difference between the end time and the start time then gives the noise detection time; finally, the number of convolutional layers of the CNN network can be adjusted according to the noise detection time, ensuring that the noise detection time stays within an acceptable range and improving the user experience.
  • step 208 may include:
  • For example, when the input voice information is acquired, the current time point t1 is recorded as the start time; after the noise detection result is obtained through the CNN network, the current time point t2 is recorded as the end time; t2 - t1 is then the processing time of noise detection. If this processing time exceeds a certain threshold, the real-time performance of speech recognition is poor; if the number of convolutional layers of the CNN network is above the lower limit of 3, the amount of calculation can be reduced by removing a convolutional layer, thereby reducing processing time and improving the real-time performance of speech recognition.
  • the audio noise detection method proposed in the embodiment of the application includes: when the input voice signal is acquired, recording the current system time as the start time of noise detection; framing the voice signal; extracting the GFCC features and Gabor features of the framed voice signal; sequentially inputting the extracted GFCC features and Gabor features into the N convolutional layers, one fully connected layer, and one softmax layer of the CNN neural network model to obtain the noise detection result of the voice signal; when the noise detection result of the voice signal is obtained, recording the current system time as the end time of noise detection; calculating the noise detection time from the start time and the end time; and adjusting the number N of convolutional layers according to the noise detection time.
  • the CNN network structure is a simplified structure without pooling layers and with a small number of convolutional layers, which greatly reduces the amount of calculation, so that the mobile terminal's own processor can complete the computation of the CNN model without connecting to a server. Therefore, even if the mobile terminal is offline, it can still detect noise in the voice signal, with good real-time performance.
  • this embodiment can adjust the number of convolutional layers of the CNN neural network according to the noise detection time, thereby reducing the amount of calculation, reducing processing time, and improving the real-time performance of speech recognition.
  • a third embodiment of a method for detecting audio noise in an embodiment of the present application includes:
  • Steps 301-304 are the same as steps 101-104. For details, please refer to the relevant descriptions of steps 101-104.
  • the embodiment of this application monitors the CPU usage and memory usage of the mobile terminal in real time, and adjusts the number N of convolutional layers according to the CPU usage and memory usage. Specifically, if the CPU usage or memory usage exceeds a certain threshold, the hardware of the mobile terminal is computationally overloaded, which causes adverse effects such as application stuttering; in that case, the number N of convolutional layers of the CNN network can be appropriately reduced to lower the computational load of the system.
  • step 306 may include:
  • the amount of calculation can be reduced by removing a convolutional layer, thereby lowering the computational load of the hardware and avoiding stuttering.
  • the audio noise detection method proposed in this application includes: acquiring an input voice signal; framing the voice signal; extracting the GFCC features and Gabor features of the framed voice signal; sequentially inputting the extracted GFCC features and Gabor features into the N convolutional layers, one fully connected layer, and one softmax layer of the CNN neural network model to obtain the noise detection result of the voice signal, 3 ≤ N ≤ 5; monitoring the system's CPU usage and memory usage; and adjusting the number N of convolutional layers according to the CPU usage and the memory usage.
  • the CNN network structure is a simplified structure without pooling layers and with a small number of convolutional layers, which greatly reduces the amount of calculation, so that the mobile terminal's own processor can complete the computation of the CNN model without connecting to a server.
  • this embodiment can adjust the number of convolutional layers of the CNN network according to the CPU usage and memory usage of the mobile terminal, thereby reducing the amount of calculation and the computational load of the hardware, and avoiding stuttering.
  • an embodiment of an audio noise detection device in an embodiment of the present application includes:
  • the voice signal acquisition module 401 is used to acquire the input voice signal
  • the framing module 402 is used for framing the voice signal
  • the voice feature extraction module 403 is configured to extract the GFCC feature and Gabor feature of the voice signal after frame division;
  • the noise detection module 404 is used to sequentially input the extracted GFCC features and Gabor features into the N convolutional layers, one fully connected layer, and one softmax layer of the CNN neural network model to obtain the noise detection result of the voice signal, 3 ≤ N ≤ 5.
  • the audio noise detection device may further include:
  • the start time recording module is used to record the current system time as the start time of noise detection when the input voice signal is acquired;
  • the end time recording module is used to record the current system time as the end time of the noise detection when the noise detection result of the voice signal is obtained;
  • a noise detection time calculation module configured to calculate the noise detection time according to the start time and the end time
  • the first convolutional layer adjustment module is configured to adjust the number N of convolutional layers according to the noise detection time.
  • the first convolutional layer adjustment module may include:
  • the first determining unit is configured to determine whether the noise detection time exceeds a preset first threshold
  • a second judgment unit configured to further judge whether the number of convolutional layers N is greater than 3 if the noise detection time exceeds the first threshold
  • the first convolutional layer adjustment unit is configured to adjust the number N of convolutional layers to N-1 if the number N of convolutional layers is greater than 3.
  • the audio noise detection device may further include:
  • the system performance monitoring module is used to monitor the system's CPU usage and memory usage;
  • the second convolutional layer adjustment module is configured to adjust the size of the number N of convolutional layers according to the CPU usage rate and the memory usage rate.
  • the second convolutional layer adjustment module may include:
  • the third determining unit is configured to determine whether the CPU usage rate or the memory usage rate exceeds a preset second threshold
  • a fourth determining unit configured to further determine whether the number N of convolutional layers is greater than 3 if the CPU usage rate or the memory usage rate exceeds the second threshold;
  • the second convolutional layer adjustment unit is configured to adjust the number N of convolutional layers to N-1 if the number N of convolutional layers is greater than 3.
  • An embodiment of the present application also provides a mobile terminal, including a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor.
  • When the processor executes the computer-readable instructions, the steps of any one of the audio noise detection methods shown in Figures 1 to 3 are implemented.
  • Fig. 5 is a schematic diagram of a mobile terminal provided by an embodiment of the present application.
  • the mobile terminal 5 of this embodiment includes a processor 50, a memory 51, and computer-readable instructions 52 stored in the memory 51 and running on the processor 50.
  • When the processor 50 executes the computer-readable instructions 52, the steps in the foregoing embodiments of the audio noise detection method are implemented, such as steps 101 to 104 shown in FIG. 1.
  • Alternatively, when the processor 50 executes the computer-readable instructions 52, the functions of the modules/units in the foregoing device embodiments, such as the functions of the modules 401 to 404 shown in FIG. 4, are implemented.
  • the computer-readable instructions 52 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 51 and executed by the processor 50 to complete this application.
  • the one or more modules/units may be a series of computer-readable instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer-readable instructions 52 in the mobile terminal 5.
  • the processor 50 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the memory 51 may be an internal storage unit of the mobile terminal 5, such as a hard disk or a memory of the mobile terminal 5.
  • the memory 51 may also be an external storage device of the mobile terminal 5, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the mobile terminal 5.
  • the memory 51 may also include both an internal storage unit of the mobile terminal 5 and an external storage device.
  • the memory 51 is used to store the computer-readable instructions and other instructions and data required by the mobile terminal 5.
  • the memory 51 can also be used to temporarily store data that has been output or will be output.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer non-volatile readable storage medium.
  • the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several computer-readable instructions to enable a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method described in each embodiment of the present application.
  • the aforementioned storage media include: USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and other media that can store computer-readable instructions.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephone Function (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

A method, device, storage medium, and mobile terminal for detecting audio noise, relating to the field of computer technology. The audio noise detection method includes: acquiring an input voice signal (101); framing the voice signal (102); extracting the GFCC features and Gabor features of the framed voice signal (103); and sequentially inputting the extracted GFCC features and Gabor features into the N convolutional layers, one fully connected layer, and one softmax layer of a CNN neural network model to obtain the noise detection result of the voice signal (104), where 3 ≤ N ≤ 5. The CNN network structure is a simplified structure without pooling layers and with a small number of convolutional layers, which greatly reduces the amount of computation, so that the mobile terminal's own processor can complete the computation of the CNN model without connecting to a server. Therefore, even when the mobile terminal is offline, it can still detect noise in the voice signal, with good real-time performance.

Description

Method, device, storage medium and mobile terminal for detecting audio noise
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on January 23, 2019, with application number 201910064238.3 and titled "Method, device, storage medium and mobile terminal for detecting audio noise", the entire content of which is incorporated into this application by reference.
Technical field
This application relates to the field of computer technology, and in particular to a method, device, storage medium and mobile terminal for detecting audio noise.
Background
A voice assistant is an intelligent mobile phone application that helps users solve many problems through instant question-and-answer interaction. When a voice assistant is used, the mobile terminal needs to detect noise in the voice signal after obtaining the user's voice. Current approaches typically use classifiers (SVM, random forest, etc.) or a neural network operating on acoustic features such as MFCC. However, because the amount of computation required by these approaches is very large, after collecting the voice signal the mobile terminal has to send it over the network to a more powerful server for processing, which leads to a long response time for speech recognition; moreover, if the mobile terminal is offline, detection cannot be performed at all.
Technical problem
In view of this, the embodiments of this application provide a method, device, storage medium and mobile terminal for detecting audio noise, so that even when the mobile terminal is offline, it can still detect noise in a voice signal, with good real-time performance.
Technical solution
A first aspect of the embodiments of this application provides a method for detecting audio noise, including:
acquiring an input voice signal;
framing the voice signal;
extracting the GFCC features and Gabor features of the framed voice signal;
and sequentially inputting the extracted GFCC features and Gabor features into the N convolutional layers, one fully connected layer and one softmax layer of a CNN neural network model to obtain the noise detection result of the voice signal, where 3 ≤ N ≤ 5.
A second aspect of the embodiments of this application provides a device for detecting audio noise, including:
a voice signal acquisition module, configured to acquire an input voice signal;
a framing module, configured to frame the voice signal;
a voice feature extraction module, configured to extract the GFCC features and Gabor features of the framed voice signal;
and a noise detection module, configured to sequentially input the extracted GFCC features and Gabor features into the N convolutional layers, one fully connected layer and one softmax layer of a CNN neural network model to obtain the noise detection result of the voice signal, where 3 ≤ N ≤ 5.
A third aspect of the embodiments of this application provides a computer non-volatile readable storage medium storing computer-readable instructions which, when executed by a processor, implement the steps of the audio noise detection method proposed in the first aspect of the embodiments of this application.
A fourth aspect of the embodiments of this application provides a mobile terminal, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor, when executing the computer-readable instructions, implements the steps of the audio noise detection method proposed in the first aspect of the embodiments of this application.
Beneficial effects
The audio noise detection method proposed in this application includes: acquiring an input voice signal; framing the voice signal; extracting the GFCC features and Gabor features of the framed voice signal; and sequentially inputting the extracted GFCC features and Gabor features into the N convolutional layers, one fully connected layer and one softmax layer of a CNN neural network model to obtain the noise detection result of the voice signal, where 3 ≤ N ≤ 5. The CNN network structure is a simplified structure without pooling layers and with a small number of convolutional layers, which greatly reduces the amount of computation, so that the mobile terminal's own processor can complete the computation of the CNN model without connecting to a server. Therefore, even when the mobile terminal is offline, it can still detect noise in the voice signal, with good real-time performance.
Brief description of the drawings
FIG. 1 is a flowchart of a first embodiment of a method for detecting audio noise provided by an embodiment of this application;
FIG. 2 is a flowchart of a second embodiment of a method for detecting audio noise provided by an embodiment of this application;
FIG. 3 is a flowchart of a third embodiment of a method for detecting audio noise provided by an embodiment of this application;
FIG. 4 is a structural diagram of an embodiment of a device for detecting audio noise provided by an embodiment of this application;
FIG. 5 is a schematic diagram of a mobile terminal provided by an embodiment of this application.
Embodiments of the invention
The embodiments of this application provide a method, device, storage medium and mobile terminal for detecting audio noise, so that even when the mobile terminal is offline, it can still detect noise in a voice signal, with good real-time performance.
Referring to FIG. 1, a first embodiment of a method for detecting audio noise in the embodiments of this application includes:
101. Acquire an input voice signal.
This application is applied to a mobile terminal, which first acquires an input voice signal. The voice signal can be input by the user in real time, or it can be a pre-recorded voice signal.
102. Frame the voice signal.
After the input voice signal is acquired, the voice signal is framed. Framing is the windowing and segmentation of the speech signal: as the window moves to the right (assuming that rightward represents forward in time), the windowed signal is processed step by step. Since a speech signal is not a single stationary signal, it needs to be divided into frames so that the length of each frame is between 20 ms and 40 ms, which satisfies the requirements of GFCC and Gabor feature extraction without losing information.
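The framing step described above can be sketched in Python as follows. This is an illustrative implementation, not code from the application: the 16 kHz sample rate, 10 ms hop, and Hamming window are assumptions, chosen so that the 25 ms frame length falls inside the 20-40 ms range the text requires.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a 1-D speech signal into overlapping windowed frames.

    The frame length defaults to 25 ms, within the 20-40 ms range the
    text recommends; a Hamming window is applied to each frame.
    """
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    hop_len = int(sample_rate * hop_ms / 1000)       # samples between frame starts
    window = np.hamming(frame_len)

    num_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    frames = np.empty((num_frames, frame_len))
    for i in range(num_frames):
        start = i * hop_len
        frames[i] = signal[start:start + frame_len] * window
    return frames

# one second of a 440 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000.0
frames = frame_signal(np.sin(2 * np.pi * 440 * t))
print(frames.shape)  # (98, 400): each frame is 400 samples, i.e. 25 ms at 16 kHz
```

Each row of the returned matrix is one windowed 25 ms frame, ready for the feature extraction of step 103.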
103. Extract the GFCC features and Gabor features of the framed voice signal.
In the embodiments of this application, it is preferable to frame the voice signal into 25 ms segments and then extract the GFCC features and Gabor features of each segment.
GFCC is an FFT-based feature extraction technique similar to MFCC, but it uses a Gammatone filter bank on the equivalent rectangular bandwidth (ERB) scale instead of the Mel filter bank. Since the Gammatone filter bank is the filter response closest to that of the human cochlea, GFCC features are also called auditory features. As a new kind of auditory cepstral coefficient, they offer a better recognition rate and noise robustness than LPCC and MFCC. The steps for extracting GFCC features belong to the prior art and may specifically include: sequentially applying signal pre-emphasis, windowing, DFT, Gammatone filtering, cube-root compression, and DCT to the framed speech signal, thereby outputting the GFCC cepstral coefficient features.
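A rough sketch of this GFCC pipeline for a single frame is shown below. It is a simplified illustration, not the patent's implementation: the Gaussian-shaped filterbank stands in for true Gammatone magnitude responses, and the band count (32), coefficient count (13), and pre-emphasis factor (0.97) are assumptions not taken from the text.

```python
import numpy as np

def erb_space(low_hz, high_hz, n):
    """Center frequencies equally spaced on the ERB-rate scale
    (ERB-rate = 21.4 * log10(4.37e-3 * f + 1))."""
    erb = lambda f: 21.4 * np.log10(4.37e-3 * f + 1)
    erb_inv = lambda e: (10 ** (e / 21.4) - 1) / 4.37e-3
    return erb_inv(np.linspace(erb(low_hz), erb(high_hz), n))

def gfcc_like(frame, sample_rate=16000, n_bands=32, n_coeffs=13):
    """Sketch of the pipeline described in the text: pre-emphasis ->
    window -> DFT -> (simplified) Gammatone-style filterbank on the
    ERB scale -> cube-root compression -> DCT."""
    # pre-emphasis and Hamming window
    emphasized = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])
    spectrum = np.abs(np.fft.rfft(emphasized * np.hamming(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sample_rate)

    # Gaussian bumps at ERB-spaced centers stand in for the true
    # Gammatone magnitude responses (a deliberate simplification)
    centers = erb_space(50, sample_rate / 2 - 1, n_bands)
    bandwidths = 24.7 * (4.37e-3 * centers + 1)          # ERB bandwidth
    fb = np.exp(-0.5 * ((freqs[None, :] - centers[:, None])
                        / bandwidths[:, None]) ** 2)
    band_energy = fb @ spectrum

    compressed = np.cbrt(band_energy)                    # cube-root compression
    # DCT-II to decorrelate, keeping the first n_coeffs coefficients
    k = np.arange(n_bands)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * k + 1)
                   / (2 * n_bands))
    return basis @ compressed

coeffs = gfcc_like(np.random.randn(400))   # one 25 ms frame at 16 kHz
print(coeffs.shape)  # (13,)
```

A production implementation would replace the Gaussian bumps with a real Gammatone filter response, but the sequence of stages matches the one the text lists.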
Gabor is a linear filter used for edge extraction that provides good orientation and scale selectivity, which improves the robustness of noise recognition. The steps for extracting Gabor features also belong to the prior art and may specifically include: pre-emphasizing and windowing the framed speech signal, and then passing it through a two-dimensional Gabor filter to obtain the Gabor features. In the spatial domain, a two-dimensional Gabor filter is the product of a sinusoidal plane wave and a Gaussian kernel function; the former is a tuning function and the latter is a window function.
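A minimal sketch of such a two-dimensional Gabor kernel is given below, exactly as the text defines it: a sinusoidal plane wave multiplied by a Gaussian window. The kernel size, wavelength, orientation, and Gaussian width are illustrative values; the text does not specify them, nor how the filter bank is applied to the windowed speech.

```python
import numpy as np

def gabor_kernel(size=15, wavelength=4.0, theta=0.0, sigma=3.0):
    """Real part of a 2-D Gabor filter: a sinusoidal plane wave
    (the tuning function) multiplied by a Gaussian (the window
    function)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # rotate the coordinate so the wave propagates along direction theta
    x_t = x * np.cos(theta) + y * np.sin(theta)
    gaussian = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    wave = np.cos(2.0 * np.pi * x_t / wavelength)
    return gaussian * wave

kernel = gabor_kernel()
print(kernel.shape)   # (15, 15)
print(kernel[7, 7])   # 1.0 at the center, where Gaussian and cosine both peak
```

Convolving such kernels (at several orientations and scales) with a spectro-temporal representation of the framed speech yields the Gabor feature matrices used in step 104.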
104. Sequentially input the extracted GFCC features and Gabor features into the N convolutional layers, one fully connected layer, and one softmax layer of the CNN neural network model to obtain the noise detection result of the voice signal.
After the GFCC features and Gabor features of the voice signal are extracted, they are sequentially input into the N convolutional layers, one fully connected layer, and one softmax layer of the CNN neural network model to obtain the noise detection result of the voice signal. Specifically, the input GFCC and Gabor features are in matrix form, and the CNN outputs the probability that the voice signal contains noise; if this probability exceeds a certain threshold, the voice signal under detection is considered to contain noise.
In addition, the CNN neural network model contains no pooling layer, and the number of convolutional layers is N (3 ≤ N ≤ 5). A preferred CNN network structure is shown in Table 1 below:
Table 1
                        Number of kernels   Kernel size
Convolutional layer 1          40               5*5
Convolutional layer 2          20               5*5
Convolutional layer 3          10               5*5
Fully connected layer         100
Softmax layer                   2
The above CNN network structure is a simplified structure without pooling layers and with a small number of convolutional layers, which greatly reduces the amount of computation, so that the mobile terminal's own processor can complete the computation of the CNN model without connecting to a server. Therefore, even when the mobile terminal is offline, it can still detect noise in the voice signal, with good real-time performance. In addition, experimental data show that extracting GFCC and Gabor features and inputting these two types of audio features into this simplified CNN network (no pooling layer, 3 to 5 convolutional layers) achieves satisfactory noise detection results.
Taking the CNN model with the structure of Table 1 as an example, the specific computation process may include: (1) inputting the GFCC features and Gabor features into the first convolutional layer of the CNN network; the GFCC and Gabor features are both matrices of the same dimensions (for example, 5*8 matrices); through convolution, the first convolutional layer outputs a first intermediate result in a matrix form that meets the requirements of the next layer (the second convolutional layer);
(2) inputting the intermediate result output by the first convolutional layer into the second convolutional layer, which through convolution outputs a second intermediate result in a matrix form that meets the requirements of the next layer (the third convolutional layer);
(3) and so on: the output of each convolutional layer serves as the input of the next, until the last convolutional layer produces an output in a matrix form that meets the requirements of the next layer (the fully connected layer);
(4) inputting the output of the last convolutional layer into the fully connected layer and performing the fully connected computation, producing an output in a matrix form that meets the requirements of the next layer (the softmax layer);
(5) inputting the output of the fully connected layer into the softmax layer, which outputs a probability value representing how likely the voice signal is to contain noise.
The underlying computations of the convolutional layers, the fully connected layer, and softmax all belong to the prior art. A typical CNN contains multiple convolutional and pooling layers; this application reduces the amount of computation by simplifying the layer structure of the network.
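The forward pass of steps (1)-(5), with the layer sizes of Table 1, can be sketched as follows. The weights are random because the trained parameters are not given in the text; stacking the GFCC and Gabor matrices as two input channels, the "same" padding, and the ReLU activations are all assumptions made so that the 5*8 example input survives three 5*5 convolutions.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_same(x, kernels):
    """'Same'-padded convolution with ReLU: x is (C_in, H, W),
    kernels is (C_out, C_in, kH, kW); returns (C_out, H, W)."""
    c_out, c_in, kh, kw = kernels.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    _, h, w = x.shape
    out = np.zeros((c_out, h, w))
    for o in range(c_out):
        for i in range(c_in):
            for dy in range(kh):
                for dx in range(kw):
                    out[o] += kernels[o, i, dy, dx] * xp[i, dy:dy + h, dx:dx + w]
    return np.maximum(out, 0.0)  # ReLU (the activation type is an assumption)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Network of Table 1: three 5*5 conv layers (40, 20, 10 kernels),
# a 100-unit fully connected layer, and a 2-way softmax.
x = np.stack([rng.standard_normal((5, 8)),     # GFCC feature matrix
              rng.standard_normal((5, 8))])    # Gabor feature matrix
for c_in, c_out in [(2, 40), (40, 20), (20, 10)]:
    x = conv2d_same(x, 0.1 * rng.standard_normal((c_out, c_in, 5, 5)))
h = 0.1 * rng.standard_normal((100, x.size)) @ x.ravel()   # fully connected layer
p = softmax(0.1 * rng.standard_normal((2, 100)) @ h)       # [P(clean), P(noisy)]
print(p.sum())  # the two softmax outputs form a probability distribution
```

With no pooling layers and only three small convolutions, the whole pass is a few hundred thousand multiply-adds, which is the point of the simplified structure: light enough for a mobile processor.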
The audio noise detection method proposed in this embodiment of the application includes: acquiring an input voice signal; framing the voice signal; extracting the GFCC features and Gabor features of the framed voice signal; and sequentially inputting the extracted GFCC features and Gabor features into the N convolutional layers, one fully connected layer, and one softmax layer of the CNN neural network model to obtain the noise detection result of the voice signal, where 3 ≤ N ≤ 5. The CNN network structure is a simplified structure without pooling layers and with a small number of convolutional layers, which greatly reduces the amount of computation, so that the mobile terminal's own processor can complete the computation of the CNN model without connecting to a server. Therefore, even when the mobile terminal is offline, it can still detect noise in the voice signal, with good real-time performance.
Referring to FIG. 2, a second embodiment of a method for detecting audio noise in the embodiments of this application includes:
201. Acquire an input voice signal;
202. When the input voice signal is acquired, record the current system time as the start time of noise detection;
203. Frame the voice signal;
204. Extract the GFCC features and Gabor features of the framed voice signal;
205. Sequentially input the extracted GFCC features and Gabor features into the N convolutional layers, one fully connected layer, and one softmax layer of the CNN neural network model to obtain the noise detection result of the voice signal;
For details of steps 201-205, refer to the first embodiment of this application.
206. When the noise detection result of the voice signal is obtained, record the current system time as the end time of noise detection;
207. Calculate the noise detection time from the start time and the end time;
208. Adjust the number N of convolutional layers according to the noise detection time.
在本申请实施例中,当获取到输入的语音信号时,记录当前的时间,作为噪音检测的起始时间;当得到所述语音信号的噪声检测结果时,可以记录当前的时间,作为噪音检测的结束时间;然后,计算该结束时间和起始时间的差值,可以得到噪音检测的时间;最后可根据该噪声检测时间调整该CNN网络的卷积层的数量,从而保证噪声检测时间处于一个可接受的范围之内,提升用户体验。
Further, step 208 may include:
(1) Determining whether the noise detection time exceeds a preset first threshold;
(2) If the noise detection time exceeds the first threshold, further determining whether the number N of convolutional layers is greater than 3;
(3) If the number N of convolutional layers is greater than 3, adjusting N to N-1.
For example, when the input voice signal is acquired, the current time point t1 is recorded as the start time; after the noise detection result is obtained through the CNN, the current time point t2 is recorded as the end time, so t2 - t1 is the processing time of noise detection. If this processing time exceeds a certain threshold, the real-time performance of speech recognition is poor; and if the number of convolutional layers of the CNN exceeds the lower bound of 3, the amount of computation can be reduced by removing a convolutional layer, thereby lowering the processing time and improving the real-time performance of speech recognition.
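A minimal sketch of this timing-based adjustment, assuming a hypothetical latency threshold of 0.5 s (the document does not give a value for the first threshold):

```python
import time

FIRST_THRESHOLD_S = 0.5  # hypothetical acceptable detection latency, in seconds
MIN_LAYERS = 3           # lower bound on N stated in the document

def adjust_layers_by_time(n_layers, detection_time_s, threshold_s=FIRST_THRESHOLD_S):
    """Steps (1)-(3): drop one convolutional layer when detection is too slow,
    but never go below the lower bound of 3."""
    if detection_time_s > threshold_s and n_layers > MIN_LAYERS:
        return n_layers - 1
    return n_layers

start = time.monotonic()              # t1: recorded when the voice signal arrives
# ... the CNN noise detection would run here ...
elapsed = time.monotonic() - start    # t2 - t1: the noise detection time

print(adjust_layers_by_time(5, 0.8))  # 4: too slow, so one layer is removed
print(adjust_layers_by_time(3, 0.8))  # 3: already at the lower bound
```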
The audio noise detection method proposed in this embodiment of the present application includes: when an input voice signal is acquired, recording the current system time as the start time of noise detection; dividing the voice signal into frames; extracting the GFCC features and Gabor features of the framed voice signal respectively; inputting the extracted GFCC and Gabor features sequentially into the N convolutional layers, one fully connected layer, and one softmax layer of a CNN neural network model to obtain the noise detection result of the voice signal, where 3 ≤ N ≤ 5; when the noise detection result is obtained, recording the current system time as the end time of noise detection; calculating the noise detection time from the start time and the end time; and adjusting the number N of convolutional layers according to the noise detection time. This CNN structure is a simplified one, with no pooling layer and few convolutional layers, which greatly reduces the amount of computation; the model can therefore be evaluated on the mobile terminal's own processor without connecting to a server, so noise in the voice signal can be detected with good real-time performance even when the terminal is offline. Compared with the first embodiment of the present application, this embodiment can adjust the number of convolutional layers of the CNN according to the noise detection time, reducing computation, lowering the processing time, and improving the real-time performance of speech recognition.
Referring to FIG. 3, a third embodiment of an audio noise detection method in the embodiments of the present application includes:
301. Acquire an input voice signal;
302. Divide the voice signal into frames;
303. Extract the GFCC features and Gabor features of the framed voice signal respectively;
304. Input the extracted GFCC and Gabor features sequentially into the N convolutional layers, one fully connected layer, and one softmax layer of the CNN neural network model to obtain the noise detection result of the voice signal;
Steps 301-304 are the same as steps 101-104; refer to the related description of steps 101-104 for details.
305. Monitor the system's CPU usage rate and memory usage rate;
306. Adjust the number N of convolutional layers according to the CPU usage rate and memory usage rate.
During speech recognition, this embodiment of the present application monitors the CPU usage rate and memory usage rate of the mobile terminal in real time and adjusts the number N of convolutional layers accordingly. Specifically, if the CPU usage rate or memory usage rate exceeds a certain threshold, the terminal's hardware is under an excessive computational load, which can cause adverse effects such as application stuttering; the number N of convolutional layers of the CNN can then be reduced appropriately to lighten the system's computational load.
Further, step 306 may include:
(1) Determining whether the CPU usage rate or memory usage rate exceeds a preset second threshold;
(2) If the CPU usage rate or memory usage rate exceeds the second threshold, further determining whether the number N of convolutional layers is greater than 3;
(3) If the number N of convolutional layers is greater than 3, adjusting N to N-1.
If the number of convolutional layers of the CNN exceeds the lower bound of 3, the amount of computation can be reduced by removing a convolutional layer, lowering the computational load on the hardware and avoiding stuttering.
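A minimal sketch of the load-based adjustment in steps (1)-(3), assuming a hypothetical second threshold of 85% usage (the document gives no value) and taking the usage rates as plain arguments rather than calling any particular monitoring API:

```python
SECOND_THRESHOLD = 0.85  # hypothetical usage-rate threshold (85%)
MIN_LAYERS = 3           # lower bound on N stated in the document

def adjust_layers_by_load(n_layers, cpu_usage, mem_usage, threshold=SECOND_THRESHOLD):
    """Steps (1)-(3): reduce N by one when either usage rate exceeds the
    threshold, as long as N stays above the lower bound of 3."""
    if (cpu_usage > threshold or mem_usage > threshold) and n_layers > MIN_LAYERS:
        return n_layers - 1
    return n_layers

print(adjust_layers_by_load(4, cpu_usage=0.92, mem_usage=0.40))  # 3: CPU overloaded
print(adjust_layers_by_load(3, cpu_usage=0.92, mem_usage=0.40))  # 3: at the lower bound
print(adjust_layers_by_load(5, cpu_usage=0.10, mem_usage=0.20))  # 5: no change needed
```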
The audio noise detection method proposed in the present application includes: acquiring an input voice signal; dividing the voice signal into frames; extracting the GFCC features and Gabor features of the framed voice signal respectively; inputting the extracted GFCC and Gabor features sequentially into the N convolutional layers, one fully connected layer, and one softmax layer of a CNN neural network model to obtain the noise detection result of the voice signal, where 3 ≤ N ≤ 5; monitoring the system's CPU usage rate and memory usage rate; and adjusting the number N of convolutional layers according to the CPU usage rate and memory usage rate. This CNN structure is a simplified one, with no pooling layer and few convolutional layers, which greatly reduces the amount of computation; the model can therefore be evaluated on the mobile terminal's own processor without connecting to a server, so noise in the voice signal can be detected with good real-time performance even when the terminal is offline. Compared with the first embodiment of the present application, this embodiment can adjust the number of convolutional layers of the CNN according to the mobile terminal's CPU usage rate and memory usage rate, reducing computation, lowering the computational load on the hardware, and avoiding stuttering.
It should be understood that the step numbers in the above embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
The above has mainly described an audio noise detection method; an audio noise detection device will now be described in detail.
Referring to FIG. 4, an embodiment of an audio noise detection device in the embodiments of the present application includes:
a voice signal acquisition module 401, configured to acquire an input voice signal;
a framing module 402, configured to divide the voice signal into frames;
a voice feature extraction module 403, configured to extract the GFCC features and Gabor features of the framed voice signal respectively;
a noise detection module 404, configured to input the extracted GFCC and Gabor features sequentially into the N convolutional layers, one fully connected layer, and one softmax layer of a CNN neural network model to obtain the noise detection result of the voice signal, where 3 ≤ N ≤ 5.
Further, the audio noise detection device may also include:
a start time recording module, configured to record the current system time as the start time of noise detection when the input voice signal is acquired;
an end time recording module, configured to record the current system time as the end time of noise detection when the noise detection result of the voice signal is obtained;
a noise detection time calculation module, configured to calculate the noise detection time from the start time and the end time;
a first convolutional layer adjustment module, configured to adjust the number N of convolutional layers according to the noise detection time.
Still further, the first convolutional layer adjustment module may include:
a first determination unit, configured to determine whether the noise detection time exceeds a preset first threshold;
a second determination unit, configured to further determine whether the number N of convolutional layers is greater than 3 if the noise detection time exceeds the first threshold;
a first convolutional layer adjustment unit, configured to adjust N to N-1 if the number N of convolutional layers is greater than 3.
Further, the audio noise detection device may also include:
a system performance monitoring module, configured to monitor the system's CPU usage rate and memory usage rate;
a second convolutional layer adjustment module, configured to adjust the number N of convolutional layers according to the CPU usage rate and memory usage rate.
Still further, the second convolutional layer adjustment module may include:
a third determination unit, configured to determine whether the CPU usage rate or memory usage rate exceeds a preset second threshold;
a fourth determination unit, configured to further determine whether the number N of convolutional layers is greater than 3 if the CPU usage rate or memory usage rate exceeds the second threshold;
a second convolutional layer adjustment unit, configured to adjust N to N-1 if the number N of convolutional layers is greater than 3.
An embodiment of the present application also provides a mobile terminal, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; when the processor executes the computer-readable instructions, the steps of any of the audio noise detection methods shown in FIG. 1 to FIG. 3 are implemented.
FIG. 5 is a schematic diagram of a mobile terminal provided by an embodiment of the present application. As shown in FIG. 5, the mobile terminal 5 of this embodiment includes: a processor 50, a memory 51, and computer-readable instructions 52 stored in the memory 51 and executable on the processor 50. When the processor 50 executes the computer-readable instructions 52, the steps in each of the above audio noise detection method embodiments are implemented, for example steps 101 to 104 shown in FIG. 1. Alternatively, when the processor 50 executes the computer-readable instructions 52, the functions of the modules/units in each of the above device embodiments are implemented, for example the functions of modules 401 to 404 shown in FIG. 4.
Exemplarily, the computer-readable instructions 52 may be divided into one or more modules/units, which are stored in the memory 51 and executed by the processor 50 to complete the present application. The one or more modules/units may be a series of computer-readable instruction segments capable of performing specific functions; these segments describe the execution process of the computer-readable instructions 52 in the mobile terminal 5.
The processor 50 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor, or any conventional processor.
The memory 51 may be an internal storage unit of the mobile terminal 5, such as a hard disk or internal memory of the mobile terminal 5. The memory 51 may also be an external storage device of the mobile terminal 5, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the mobile terminal 5. Further, the memory 51 may include both the internal storage unit and an external storage device of the mobile terminal 5. The memory 51 is used to store the computer-readable instructions and the other instructions and data required by the mobile terminal 5, and may also be used to temporarily store data that has been or will be output.
The functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a non-volatile computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of computer-readable instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing computer-readable instructions, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments may be completed by computer-readable instructions instructing the relevant hardware. The computer-readable instructions may be stored in a non-volatile computer-readable storage medium, and when executed may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

  1. An audio noise detection method, comprising:
    acquiring an input voice signal;
    dividing the voice signal into frames;
    extracting GFCC features and Gabor features of the framed voice signal respectively;
    inputting the extracted GFCC features and Gabor features sequentially into N convolutional layers, one fully connected layer, and one softmax layer of a CNN neural network model to obtain a noise detection result of the voice signal, where 3 ≤ N ≤ 5.
  2. The audio noise detection method according to claim 1, further comprising:
    when the input voice signal is acquired, recording the current system time as a start time of noise detection;
    when the noise detection result of the voice signal is obtained, recording the current system time as an end time of noise detection;
    calculating a noise detection time from the start time and the end time;
    adjusting the number N of convolutional layers according to the noise detection time.
  3. The audio noise detection method according to claim 2, wherein the adjusting the number N of convolutional layers according to the noise detection time comprises:
    determining whether the noise detection time exceeds a preset first threshold;
    if the noise detection time exceeds the first threshold, further determining whether the number N of convolutional layers is greater than 3;
    if the number N of convolutional layers is greater than 3, adjusting the number N of convolutional layers to N-1.
  4. The audio noise detection method according to any one of claims 1 to 3, further comprising:
    monitoring a CPU usage rate and a memory usage rate of the system;
    adjusting the number N of convolutional layers according to the CPU usage rate and the memory usage rate.
  5. The audio noise detection method according to claim 4, wherein the adjusting the number N of convolutional layers according to the CPU usage rate and the memory usage rate comprises:
    determining whether the CPU usage rate or the memory usage rate exceeds a preset second threshold;
    if the CPU usage rate or the memory usage rate exceeds the second threshold, further determining whether the number N of convolutional layers is greater than 3;
    if the number N of convolutional layers is greater than 3, adjusting the number N of convolutional layers to N-1.
  6. An audio noise detection device, comprising:
    a voice signal acquisition module, configured to acquire an input voice signal;
    a framing module, configured to divide the voice signal into frames;
    a voice feature extraction module, configured to extract GFCC features and Gabor features of the framed voice signal respectively;
    a noise detection module, configured to input the extracted GFCC features and Gabor features sequentially into N convolutional layers, one fully connected layer, and one softmax layer of a CNN neural network model to obtain a noise detection result of the voice signal, where 3 ≤ N ≤ 5.
  7. The audio noise detection device according to claim 6, further comprising:
    a start time recording module, configured to record the current system time as a start time of noise detection when the input voice signal is acquired;
    an end time recording module, configured to record the current system time as an end time of noise detection when the noise detection result of the voice signal is obtained;
    a noise detection time calculation module, configured to calculate a noise detection time from the start time and the end time;
    a first convolutional layer adjustment module, configured to adjust the number N of convolutional layers according to the noise detection time.
  8. The audio noise detection device according to claim 7, wherein the first convolutional layer adjustment module comprises:
    a first determination unit, configured to determine whether the noise detection time exceeds a preset first threshold;
    a second determination unit, configured to further determine whether the number N of convolutional layers is greater than 3 if the noise detection time exceeds the first threshold;
    a first convolutional layer adjustment unit, configured to adjust the number N of convolutional layers to N-1 if the number N of convolutional layers is greater than 3.
  9. The audio noise detection device according to any one of claims 6 to 8, further comprising:
    a system performance monitoring module, configured to monitor a CPU usage rate and a memory usage rate of the system;
    a second convolutional layer adjustment module, configured to adjust the number N of convolutional layers according to the CPU usage rate and the memory usage rate.
  10. The audio noise detection device according to claim 9, wherein the second convolutional layer adjustment module comprises:
    a third determination unit, configured to determine whether the CPU usage rate or the memory usage rate exceeds a preset second threshold;
    a fourth determination unit, configured to further determine whether the number N of convolutional layers is greater than 3 if the CPU usage rate or the memory usage rate exceeds the second threshold;
    a second convolutional layer adjustment unit, configured to adjust the number N of convolutional layers to N-1 if the number N of convolutional layers is greater than 3.
  11. A non-volatile computer-readable storage medium storing computer-readable instructions, wherein the computer-readable instructions, when executed by a processor, implement the following steps:
    acquiring an input voice signal;
    dividing the voice signal into frames;
    extracting GFCC features and Gabor features of the framed voice signal respectively;
    inputting the extracted GFCC features and Gabor features sequentially into N convolutional layers, one fully connected layer, and one softmax layer of a CNN neural network model to obtain a noise detection result of the voice signal, where 3 ≤ N ≤ 5.
  12. The non-volatile computer-readable storage medium according to claim 11, wherein the computer-readable instructions, when executed by a processor, further implement the following steps:
    when the input voice signal is acquired, recording the current system time as a start time of noise detection;
    when the noise detection result of the voice signal is obtained, recording the current system time as an end time of noise detection;
    calculating a noise detection time from the start time and the end time;
    adjusting the number N of convolutional layers according to the noise detection time.
  13. The non-volatile computer-readable storage medium according to claim 12, wherein the adjusting the number N of convolutional layers according to the noise detection time comprises:
    determining whether the noise detection time exceeds a preset first threshold;
    if the noise detection time exceeds the first threshold, further determining whether the number N of convolutional layers is greater than 3;
    if the number N of convolutional layers is greater than 3, adjusting the number N of convolutional layers to N-1.
  14. The non-volatile computer-readable storage medium according to any one of claims 11 to 13, wherein the computer-readable instructions, when executed by a processor, further implement the following steps:
    monitoring a CPU usage rate and a memory usage rate of the system;
    adjusting the number N of convolutional layers according to the CPU usage rate and the memory usage rate.
  15. The non-volatile computer-readable storage medium according to claim 14, wherein the adjusting the number N of convolutional layers according to the CPU usage rate and the memory usage rate comprises:
    determining whether the CPU usage rate or the memory usage rate exceeds a preset second threshold;
    if the CPU usage rate or the memory usage rate exceeds the second threshold, further determining whether the number N of convolutional layers is greater than 3;
    if the number N of convolutional layers is greater than 3, adjusting the number N of convolutional layers to N-1.
  16. A server, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps:
    acquiring an input voice signal;
    dividing the voice signal into frames;
    extracting GFCC features and Gabor features of the framed voice signal respectively;
    inputting the extracted GFCC features and Gabor features sequentially into N convolutional layers, one fully connected layer, and one softmax layer of a CNN neural network model to obtain a noise detection result of the voice signal, where 3 ≤ N ≤ 5.
  17. The server according to claim 16, wherein the processor, when executing the computer-readable instructions, further implements the following steps:
    when the input voice signal is acquired, recording the current system time as a start time of noise detection;
    when the noise detection result of the voice signal is obtained, recording the current system time as an end time of noise detection;
    calculating a noise detection time from the start time and the end time;
    adjusting the number N of convolutional layers according to the noise detection time.
  18. The server according to claim 17, wherein the adjusting the number N of convolutional layers according to the noise detection time comprises:
    determining whether the noise detection time exceeds a preset first threshold;
    if the noise detection time exceeds the first threshold, further determining whether the number N of convolutional layers is greater than 3;
    if the number N of convolutional layers is greater than 3, adjusting the number N of convolutional layers to N-1.
  19. The server according to any one of claims 16 to 18, wherein the processor, when executing the computer-readable instructions, further implements the following steps:
    monitoring a CPU usage rate and a memory usage rate of the system;
    adjusting the number N of convolutional layers according to the CPU usage rate and the memory usage rate.
  20. The server according to claim 19, wherein the adjusting the number N of convolutional layers according to the CPU usage rate and the memory usage rate comprises:
    determining whether the CPU usage rate or the memory usage rate exceeds a preset second threshold;
    if the CPU usage rate or the memory usage rate exceeds the second threshold, further determining whether the number N of convolutional layers is greater than 3;
    if the number N of convolutional layers is greater than 3, adjusting the number N of convolutional layers to N-1.
PCT/CN2019/118544 2019-01-23 2019-11-14 Audio noise detection method and device, storage medium, and mobile terminal WO2020151338A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910064238.3A CN109658943B (zh) 2019-01-23 2019-01-23 Audio noise detection method and device, storage medium, and mobile terminal
CN201910064238.3 2019-01-23



