CN112802463B - Audio signal screening method, device and equipment - Google Patents

Audio signal screening method, device and equipment

Info

Publication number
CN112802463B
Authority
CN
China
Prior art keywords
audio signal
frame
signal
noise
noise reduction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011557215.5A
Other languages
Chinese (zh)
Other versions
CN112802463A (en
Inventor
刘鲁鹏
元海明
李贝
王晓红
陈佳路
高强
夏龙
郭常圳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ape Power Future Technology Co Ltd
Original Assignee
Beijing Ape Power Future Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ape Power Future Technology Co Ltd
Priority to CN202011557215.5A
Publication of CN112802463A
Application granted
Publication of CN112802463B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Noise Elimination (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The application relates to an audio signal screening method, device and equipment. The method comprises the following steps: determining the signal-to-noise ratio of each frame of an audio signal; calculating the proportion of frames whose signal-to-noise ratio is greater than a set signal-to-noise ratio threshold in the total number of frames of the audio signal; and determining whether the audio signal is a target audio signal according to the result of comparing the proportion with a set proportion threshold. The scheme provided by the application screens out target audio signals with low background noise simply and effectively, and is broadly applicable.

Description

Audio signal screening method, device and equipment
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a method, an apparatus, and a device for screening audio signals.
Background
In speech recognition, a branch of artificial intelligence, machine learning requires a large number of audio signal samples, and the quality of those samples directly affects the accuracy of the trained model. However, audio signals collected in daily life contain a great deal of noise, which hinders the training of speech models, so audio signals with little noise need to be screened out from a large pool of audio signals. In the audio screening method of the related art, the features of the audio to be screened are compared with the features of target audio (audio that meets the noise requirement), and if the comparison result satisfies a preset condition, the audio to be screened is treated as usable audio or as a training sample.
However, in the related art, features must be extracted from each audio signal before the comparison. Audio feature extraction is difficult, and extraction errors lower the screening accuracy. In addition, a separate feature extraction model must be set up for each category or function being trained, so the feature extraction models generalize poorly and are complex to implement.
Disclosure of Invention
In order to overcome the problems in the related art, the application provides an audio signal screening method, an audio signal screening device and audio signal screening equipment.
A first aspect of the present application provides an audio signal screening method, including:
determining the signal-to-noise ratio of each frame of the audio signal;
calculating the proportion of frames whose signal-to-noise ratio is greater than a set signal-to-noise ratio threshold in the total number of frames of the audio signal;
determining whether the audio signal is a target audio signal according to the result of comparing the proportion with a set proportion threshold, wherein the audio signal is determined to be a target audio signal when the proportion of frames whose signal-to-noise ratio is greater than the set signal-to-noise ratio threshold in the total number of frames of the audio signal is greater than the set proportion threshold;
wherein determining the signal-to-noise ratio of each frame of the audio signal comprises:
framing the audio signal;
performing noise reduction on each frame to obtain the corresponding noise-reduced frame;
determining the signal-to-noise ratio of each frame before noise reduction according to the signal energy of the frame after noise reduction and the noise energy of the frame before noise reduction;
wherein determining the signal-to-noise ratio of each frame before noise reduction according to the signal energy of the frame after noise reduction and the noise energy of the frame before noise reduction comprises:
obtaining the noise energy of each frame before noise reduction by subtracting the signal energy of the frame after noise reduction from the signal energy of the frame before noise reduction;
and performing a logarithmic operation on the ratio of the signal energy of each frame after noise reduction to the noise energy, thereby determining the signal-to-noise ratio of the frame before noise reduction.
In one embodiment, determining whether the audio signal is a target audio signal according to the result of comparing the proportion with a set proportion threshold includes:
determining that the audio signal is a target audio signal when the proportion is greater than the set proportion threshold.
In one embodiment, calculating the proportion of frames whose signal-to-noise ratio is greater than the set signal-to-noise ratio threshold in the total number of frames of the audio signal includes:
traversing the signal-to-noise ratios of the frames and counting the frames whose signal-to-noise ratio is greater than the set signal-to-noise ratio threshold;
and obtaining the proportion from that frame count and the total number of frames of the audio signal.
In one embodiment, framing the audio signal comprises:
framing the audio signal according to a preset time length;
if the length of the audio signal is not an integer multiple of the preset time length, zero-padding the tail of the audio signal so that its length becomes an integer multiple of the preset time length, and then framing.
The second aspect of the present application provides an audio signal screening apparatus, comprising:
a per-frame signal-to-noise ratio module, used for determining the signal-to-noise ratio of each frame of the audio signal;
a proportion module, used for calculating the proportion of frames whose signal-to-noise ratio is greater than a set signal-to-noise ratio threshold in the total number of frames of the audio signal;
a screening module, used for determining whether the audio signal is a target audio signal according to the result of comparing the proportion determined by the proportion module with a set proportion threshold, wherein the audio signal is determined to be a target audio signal when the proportion of frames whose signal-to-noise ratio is greater than the set signal-to-noise ratio threshold in the total number of frames of the audio signal is greater than the set proportion threshold;
the per-frame signal-to-noise ratio module comprises:
a framing submodule, used for framing the audio signal;
a noise reduction submodule, used for performing noise reduction on each frame obtained by the framing submodule to obtain the corresponding noise-reduced frame;
a determining submodule, used for determining the signal-to-noise ratio of each frame before noise reduction according to the signal energy of the noise-reduced frame obtained by the noise reduction submodule and the noise energy of the frame before noise reduction;
wherein determining the signal-to-noise ratio of each frame before noise reduction according to the signal energy of the frame after noise reduction and the noise energy of the frame before noise reduction comprises:
obtaining the noise energy of each frame before noise reduction by subtracting the signal energy of the frame after noise reduction from the signal energy of the frame before noise reduction;
and performing a logarithmic operation on the ratio of the signal energy of each frame after noise reduction to the noise energy, thereby determining the signal-to-noise ratio of the frame before noise reduction.
A third aspect of the present application provides an electronic device comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method as described above.
A fourth aspect of the present application provides a non-transitory machine-readable storage medium having stored thereon executable code that, when executed by a processor of an electronic device, causes the processor to perform a method as described above.
The technical scheme provided by the application can comprise the following beneficial effects:
The technical scheme of the application first determines the signal-to-noise ratio of each frame of the audio signal (namely, the audio signal to be screened); then calculates the proportion of frames whose signal-to-noise ratio is greater than the set signal-to-noise ratio threshold in the total number of frames of the audio signal; and finally determines whether the audio signal is a target audio signal according to the result of comparing the proportion with the set proportion threshold. In other words, the level of background noise in the audio signal to be screened is judged by comparing this proportion with the set proportion threshold, so that target audio signals with low background noise are screened out. The screening method is simple and effective, is broadly applicable, reduces the complexity of audio signal screening, and improves screening efficiency.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The foregoing and other objects, features and advantages of the application will be apparent from the following more particular descriptions of exemplary embodiments of the application as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the application.
Fig. 1 is a schematic flowchart of an audio signal screening method according to an embodiment of the present application;
Fig. 2 is another schematic flowchart of an audio signal screening method according to an embodiment of the present application;
Fig. 3 is a schematic diagram of a framing process of an audio signal according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an audio signal screening apparatus according to an embodiment of the present application;
Fig. 5 is another schematic structural diagram of an audio signal screening apparatus according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Preferred embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present application have been illustrated in the accompanying drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms "first," "second," "third," etc. may be used herein to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
In speech recognition, a branch of artificial intelligence, model training requires a large number of audio signal samples. Audio signals collected in daily life contain a great deal of noise, which hinders the training of speech models, so audio signals with little noise need to be screened out from a large pool of audio signals. In the related art, the features of the audio to be screened are compared with the features of target audio (audio that meets the noise requirement), and if the comparison result satisfies a preset condition, the audio to be screened can be used as usable audio or as a training sample. Before the comparison, features must be extracted from each audio signal; audio feature extraction is difficult, and extraction errors lead to low screening accuracy and low screening efficiency.
In order to solve the above problem, an embodiment of the present application provides an audio signal screening method that can simply and effectively screen out target audio signals with low background noise.
The technical solutions of the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of an audio signal screening method according to an embodiment of the present application.
Referring to Fig. 1, one embodiment of the audio signal screening method of the present application includes:
Step 101, determining the signal-to-noise ratio of each frame of the audio signal.
Signal-to-noise ratio (SNR) refers to the ratio of signal to noise in an electronic device or system. In the embodiments of the present application, the signal-to-noise ratio of a frame refers to the ratio of the effective sound signal to the background noise in that frame.
In this step, the audio signal may be framed; noise reduction may be performed on each frame to obtain the corresponding noise-reduced frame; and the signal-to-noise ratio of each frame before noise reduction may be determined according to the signal energy of the frame after noise reduction and the noise energy of the frame before noise reduction.
In this embodiment of the application, the algorithm for performing noise reduction on the audio signal may be a minimum tracking noise estimation algorithm, a Minima Controlled Recursive Averaging (MCRA) algorithm, or an Improved Minima Controlled Recursive Averaging (IMCRA) algorithm based on Wiener filtering.
It is to be understood that the noise reduction algorithm in the embodiments of the present application is not limited to these, and may be any algorithm capable of reducing the background noise in the audio signal.
Step 102, calculating the proportion of frames whose signal-to-noise ratio is greater than a set signal-to-noise ratio threshold in the total number of frames of the audio signal.
In this step, the signal-to-noise ratios of the frames can be traversed and the frames whose signal-to-noise ratio is greater than the set signal-to-noise ratio threshold counted; the proportion is then obtained from that frame count and the total number of frames of the audio signal.
The set signal-to-noise ratio threshold is an empirical threshold, preset in the embodiments of the present application, for judging the background noise in each frame. In practical applications, the threshold may be set within a range of 15 to 25 dB according to actual requirements, for example 20 dB.
Step 103, determining whether the audio signal is the target audio signal according to the result of comparing the proportion with the set proportion threshold.
In this step, the audio signal is determined to be the target audio signal when the proportion of frames whose signal-to-noise ratio is greater than the set signal-to-noise ratio threshold in the total number of frames of the audio signal is greater than the set proportion threshold.
For example, assume that the set signal-to-noise ratio threshold is 20 dB and the set proportion threshold is 0.8. If the proportion of frames whose signal-to-noise ratio is greater than the threshold in the total number of frames of an audio signal x is greater than 0.8, the signal-to-noise ratio of x is greater than 20 dB for more than 80% of its duration; that is, x contains little noise and is clean audio, so x is screened out as a target audio signal.
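The comparison in this example can be written out as a short worked calculation. The symbols follow the notation used later in the description ($\mathrm{snr}_i$ for the signal-to-noise ratio of frame $i$, $n$ for the total number of frames, $r$ for the proportion); the frame counts (100 frames, 85 of them above 20 dB) are hypothetical numbers chosen only for illustration and do not come from the application.

```latex
r = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\!\left[\mathrm{snr}_i > \mathrm{snr}_{thresh}\right],
\qquad
n = 100,\quad \#\{\, i : \mathrm{snr}_i > 20\,\mathrm{dB} \,\} = 85
\;\Longrightarrow\;
r = \frac{85}{100} = 0.85 > 0.8 .
```

With these assumed counts the signal would be kept; with, say, 70 qualifying frames ($r = 0.7$) it would be rejected.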
The technical scheme of the application first determines the signal-to-noise ratio of each frame of the audio signal (namely, the audio signal to be screened); then calculates the proportion of frames whose signal-to-noise ratio is greater than the set signal-to-noise ratio threshold in the total number of frames of the audio signal; and finally determines whether the audio signal is a target audio signal according to the result of comparing the proportion with the set proportion threshold. In other words, the level of background noise in the audio signal to be screened is judged by comparing this proportion with the set proportion threshold, so that target audio signals with low background noise are screened out. The screening method is simple and effective, is broadly applicable, reduces the complexity of audio signal screening, and improves screening efficiency.
For ease of understanding, an application example of the audio signal screening method of the embodiments of the present application is provided below:
In this example, it is assumed that a speech recognition model must recognize a speaker's voice recorded together with environmental sound, and that the training samples for the model must be audio signals of the speaker's voice with low background noise (that is, signals that meet a low-background-noise requirement). The background noise of the audio signal to be screened may be environmental sound; in other words, this example screens out the audio signals whose environmental sound meets the requirement and uses them as training samples for the model.
Fig. 2 is another schematic flow chart of the audio signal screening method according to the embodiment of the present application.
Referring to Fig. 2, another embodiment of the audio signal screening method of the present application includes:
Step 201, framing the audio signal.
In the embodiment of the present application, the audio signal to be screened is denoted x.
This step may frame the audio signal according to a preset time length; if the length of the audio signal is not an integer multiple of the preset time length, the tail of the audio signal is zero-padded so that its length becomes an integer multiple of the preset time length, and framing is then performed.
For example, the audio signal x is framed with a preset frame length of, for example, 32 ms. If the length of x is not an integer multiple of 32 ms, the tail of x is first zero-padded so that its length reaches an integer multiple of 32 ms, and framing is then performed. As shown in Fig. 3, the frames do not overlap, and the frames obtained can be denoted as:
$x_i,\ i = 1, 2, \ldots, n$, where $n$ is the total number of frames of the audio signal x. Note that 32 ms is an empirical value and can be adjusted as needed.
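The framing rule of step 201 can be sketched as follows. This is a minimal illustration rather than the application's implementation: the function name `frame_signal`, the representation of audio as a list of sample values, and the 16 kHz sample rate are assumptions made for the example.

```python
def frame_signal(samples, frame_len):
    """Split samples into non-overlapping frames of frame_len samples.

    The tail is zero-padded first so that the total length becomes an
    integer multiple of frame_len, as described for step 201.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    padded = list(samples)
    remainder = len(padded) % frame_len
    if remainder:
        padded.extend([0.0] * (frame_len - remainder))
    return [padded[i:i + frame_len] for i in range(0, len(padded), frame_len)]


# Assumed sample rate of 16 kHz: a 32 ms frame then holds 512 samples.
SAMPLE_RATE_HZ = 16000
frame_len = SAMPLE_RATE_HZ * 32 // 1000

frames = frame_signal([0.1] * 1000, frame_len)
print(len(frames), len(frames[-1]))  # prints: 2 512
```

Here 1000 samples are padded with 24 zeros to reach 1024 samples, which yields two full frames.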
Step 202, performing noise reduction processing on each frame of audio signal to obtain each frame of audio signal after noise reduction.
This step performs noise reduction on each frame $x_i$ to obtain the corresponding noise-reduced frame $s_i$.
In the embodiment of the present application, the algorithm for performing noise reduction on the audio signal may be a minimum tracking noise estimation algorithm, a Minima Controlled Recursive Averaging (MCRA) algorithm, or an Improved Minima Controlled Recursive Averaging (IMCRA) algorithm based on Wiener filtering.
It should be noted that the algorithm selected for noise reduction is not limited, as long as it can reduce the background noise in the audio signal.
Step 203, calculating the signal energy of each frame before noise reduction and after noise reduction, respectively.
In the embodiment of the present application, the $M$ sampling points of each frame $x_i$ before noise reduction can be determined, and the signal energy of $x_i$ can be calculated from the sample values at those $M$ sampling points. For example, the signal energy $E_{x_i}$ of each frame $x_i$ before noise reduction can be calculated according to the following formula:
[Formula image GDA0004045711780000091: expression for $E_{x_i}$, not reproduced in this text]
where $E_{x_i}$ is the signal energy of each frame $x_i$ before noise reduction, $M$ is the total number of sampling points in $x_i$, and $x_{i,j}$ is the value of the $j$-th sampling point of $x_i$.
In the embodiment of the present application, the $M$ sampling points of each noise-reduced frame $s_i$ that correspond in position to those of the frame $x_i$ before noise reduction can be determined, and the signal energy of $s_i$ can be calculated from the sample values at those $M$ sampling points. For example, the signal energy $E_{s_i}$ of each noise-reduced frame $s_i$ can be calculated according to the following formula:
[Formula image GDA0004045711780000092: expression for $E_{s_i}$, not reproduced in this text]
where $E_{s_i}$ is the signal energy of each noise-reduced frame $s_i$, $M$ is the total number of sampling points in $s_i$, and $s_{i,j}$ is the value of the $j$-th sampling point of $s_i$.
It will be appreciated that in practical applications, the calculation of the energy of the audio signal may be implemented in other ways, and the above description of the algorithm is only exemplary and should not be taken as the only limitation of the calculation of the energy of the audio signal.
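As one possible way to compute the frame energy of step 203, the sketch below uses the sum of squared sample values over the $M$ sampling points of a frame. The formula images are not reproduced in this text, so the sum-of-squares definition is an assumption (a common convention for discrete signal energy), and the function name `frame_energy` is hypothetical.

```python
def frame_energy(frame):
    """Energy of one frame, assumed here to be the sum of squared samples."""
    return sum(value * value for value in frame)


# The same function applies to a frame before noise reduction (x_i)
# and to the corresponding noise-reduced frame (s_i).
x_frame = [1.0, -2.0, 3.0]
s_frame = [1.0, -2.0, 2.0]
print(frame_energy(x_frame), frame_energy(s_frame))  # prints: 14.0 9.0
```

If the original formula includes a normalization such as $1/M$, the factor cancels in the energy ratio used for the signal-to-noise ratio, because both frames have the same number of sampling points.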
Step 204, obtaining the noise energy of each frame before noise reduction from the signal energy of the frame before noise reduction and the signal energy of the frame after noise reduction.
The signal energy of each frame after noise reduction is subtracted from the signal energy of the frame before noise reduction to obtain the noise energy of the frame before noise reduction.
This step calculates the noise energy of each frame before noise reduction, that is, the noise energy $E_{n_i}$ of $x_i$.
Illustratively, $E_{n_i}$ may be calculated according to the following formula:
$E_{n_i} = E_{x_i} - E_{s_i}$
where $E_{n_i}$ is the noise energy of $x_i$, $E_{s_i}$ is the signal energy of each noise-reduced frame $s_i$, and $E_{x_i}$ is the signal energy of each frame $x_i$ before noise reduction.
Step 205, determining the signal-to-noise ratio of each frame before noise reduction according to the signal energy of the frame after noise reduction and the noise energy of the frame before noise reduction.
A logarithmic operation is performed on the ratio of the signal energy of each frame after noise reduction to the noise energy to determine the signal-to-noise ratio of the frame before noise reduction.
Denote the signal-to-noise ratio of each frame $x_i$ before noise reduction as $\mathrm{snr}_i$. Illustratively, the signal-to-noise ratio may be calculated according to the following formula:
$\mathrm{snr}_i = 10 \log_{10}\left(E_{s_i} / E_{n_i}\right)$
where $\mathrm{snr}_i$ is the signal-to-noise ratio of each frame $x_i$ before noise reduction, $E_{s_i}$ is the signal energy of each frame after noise reduction, and $E_{n_i}$ is the noise energy of each frame before noise reduction.
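Steps 204 and 205 can be sketched together as below. The subtraction and the $10\log_{10}$ ratio follow the formulas in the text; the function name `frame_snr_db` and the handling of degenerate frames are assumptions, since the application does not say what to do when the noise energy or the noise-reduced signal energy is zero or negative.

```python
import math


def frame_snr_db(energy_before, energy_after):
    """SNR in dB of one frame before noise reduction.

    energy_before: signal energy of the frame before noise reduction (E_x).
    energy_after:  signal energy of the noise-reduced frame (E_s).
    """
    noise_energy = energy_before - energy_after  # E_n = E_x - E_s
    if energy_after <= 0.0:
        # Assumption: a frame with no remaining signal gets the lowest SNR.
        return -math.inf
    if noise_energy <= 0.0:
        # Assumption: no removable noise was found, so treat the frame as clean.
        return math.inf
    return 10.0 * math.log10(energy_after / noise_energy)


print(frame_snr_db(101.0, 100.0))  # prints: 20.0
```

A real noise reduction algorithm can occasionally return a frame with more energy than its input, which would make the noise energy negative; the guard above is one way to avoid taking the logarithm of a non-positive value.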
Step 206, counting the proportion value of the number of frames whose signal-to-noise ratio is greater than the set signal-to-noise ratio threshold in the total number of frames of the audio signal.
In this step, the signal-to-noise ratios of the frames are traversed to determine the number of frames whose signal-to-noise ratio is greater than the set signal-to-noise ratio threshold; the proportion value is then obtained from that number of frames and the total number of frames of the audio signal.
In the embodiment of the present application, it is assumed that the set signal-to-noise ratio threshold $snr_{thresh}$ is 20 dB. It should be noted that setting the signal-to-noise ratio threshold to 20 dB is only illustrative and not limiting, and the threshold can be adjusted as needed.
In this step, the signal-to-noise ratio $snr_i$ of each frame is traversed, and the proportion of the number of frames whose $snr_i$ is greater than $snr_{thresh}$ in the total number of frames n of the audio signal x is counted and denoted as r.
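The counting in step 206 can be sketched as follows. The function name `proportion_above_threshold` is hypothetical, and returning 0.0 for an audio signal with no frames is an assumption made here so that the sketch never divides by zero.

```python
def proportion_above_threshold(frame_snrs_db, snr_thresh_db=20.0):
    """Proportion r of frames whose SNR is strictly greater than the threshold."""
    total_frames = len(frame_snrs_db)
    if total_frames == 0:
        return 0.0  # assumption: an empty signal has no qualifying frames
    frames_above = sum(1 for snr in frame_snrs_db if snr > snr_thresh_db)
    return frames_above / total_frames


# Three of the five frames exceed 20 dB, so r is 3 / 5.
r = proportion_above_threshold([25.0, 30.0, 10.0, 21.0, 19.0])
```

The comparison is strict, matching the "greater than" wording of the step, so a frame at exactly 20 dB is not counted.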
Step 207, determining the audio signal as a target audio signal when the proportion value is greater than the set proportion threshold.
For example, assume that the set proportion threshold is 0.8. If the proportion value r is greater than 0.8, the signal-to-noise ratio of the audio signal x is greater than 20 dB for more than 80% of its duration; that is, the noise content of the audio signal x is low and the audio signal x is clean audio. The audio signal x is therefore determined to be a target audio signal and may be selected into a sample library for training a speech recognition model. Otherwise, the audio signal x is discarded. It should be noted that setting the proportion threshold to 0.8 is only an example and is not limiting; the threshold may be adjusted as needed, for example within a range of 0.7 to 0.9.
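The decision in step 207 can be illustrated with the sketch below, which uses the example values of 20 dB and 0.8 as defaults. The names `is_target_audio` and `select_clean_clips`, and the representation of a sample library candidate set as a dictionary from clip name to per-frame signal-to-noise ratios, are assumptions made for illustration only.

```python
def is_target_audio(frame_snrs_db, snr_thresh_db=20.0, ratio_thresh=0.8):
    """Return True when the proportion of frames above the SNR threshold
    is strictly greater than the proportion threshold."""
    if not frame_snrs_db:
        return False  # assumption: a signal with no frames is never selected
    frames_above = sum(1 for snr in frame_snrs_db if snr > snr_thresh_db)
    r = frames_above / len(frame_snrs_db)
    return r > ratio_thresh


def select_clean_clips(clips):
    """Keep the names of clips judged to be target (clean) audio.

    `clips` maps a hypothetical clip name to its list of per-frame SNRs in dB.
    """
    return [name for name, snrs in clips.items() if is_target_audio(snrs)]


candidates = {
    "clip_a": [25.0] * 9 + [5.0],      # r = 0.9, selected
    "clip_b": [25.0] * 8 + [5.0] * 2,  # r = 0.8, not greater than 0.8, discarded
}
selected = select_clean_clips(candidates)
```

Because the comparison with the proportion threshold is strict, a clip whose proportion value equals the threshold exactly is discarded in this sketch.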
In the embodiment of the present application, it is assumed that a sample speech library needs to be constructed. The sample speech library may consist of historical voice data uttered by surrounding users at different distances and different orientations relative to a target user, together with the historical text data corresponding to that voice data. The historical voice data may include voice data of commonly used communication phrases, and the historical text data includes the text data of those phrases. The commonly used communication phrases include names, forms of address, phrases commonly used in chat between surrounding users and the target user, phrases commonly used by surrounding users to call the target user, and the like. All audio signals in the sample speech library have been screened by the audio signal screening method of the embodiment of the present application and therefore have low background noise, so a better training effect can be obtained when the sample speech library is used for model training.
Corresponding to the foregoing method embodiments, the present application also provides an audio signal screening apparatus, an electronic device, and corresponding embodiments.
Fig. 4 is a schematic structural diagram of an audio signal screening apparatus according to an embodiment of the present application.
Referring to fig. 4, the audio signal screening apparatus includes: a per-frame signal-to-noise ratio module 401, a proportion value module 402, and a screening module 403.
The per-frame signal-to-noise ratio module 401 is configured to determine the signal-to-noise ratio of each frame of audio signal in the audio signal.
The proportion value module 402 is configured to count the proportion value of the number of frames whose signal-to-noise ratio is greater than a set signal-to-noise ratio threshold in the total number of frames of the audio signal. The set signal-to-noise ratio threshold is an empirical threshold. In practical applications, it may be set within a range of 15 to 25 dB according to actual requirements, for example, 20 dB.
The proportion value module 402 may traverse the signal-to-noise ratios of the frames and determine the number of frames whose signal-to-noise ratio is greater than the set signal-to-noise ratio threshold, and then obtain the proportion value from that number of frames and the total number of frames of the audio signal.
The screening module 403 is configured to determine whether the audio signal is the target audio signal according to a comparison result between the proportion value determined by the proportion value module 402 and a set proportion threshold.
The screening module 403 may determine that the audio signal is the target audio signal, that is, a clean audio signal with low background noise, when the proportion value is greater than the set proportion threshold. For example, assuming that the set proportion threshold is 0.8, a proportion value greater than 0.8 means that the signal-to-noise ratio of the audio signal x is greater than 20 dB for more than 80% of its duration; that is, the noise content of the audio signal x is low and the audio signal x is clean audio, so the audio signal x is selected.
The technical solution of the present application first determines the signal-to-noise ratio of each frame of audio signal in the audio signal (that is, the audio signal to be screened), then counts the proportion value of the number of frames whose signal-to-noise ratio is greater than a set signal-to-noise ratio threshold in the total number of frames of the audio signal, and finally determines whether the audio signal is a target audio signal according to the comparison result of the proportion value and the set proportion threshold. In other words, the level of background noise in the audio signal to be screened can be judged by comparing this proportion value with the set proportion threshold, so that target audio signals with low background noise can be screened. The screening method is simple, effective, and general; it reduces the complexity of audio signal screening and improves screening efficiency.
Fig. 5 is another schematic structural diagram of an audio signal screening apparatus according to an embodiment of the present application.
Referring to fig. 5, the audio signal screening apparatus includes: a per-frame signal-to-noise ratio module 401, a proportion value module 402, and a screening module 403.
For the functions of the per-frame signal-to-noise ratio module 401, the proportion value module 402, and the screening module 403, refer to the description of fig. 4; they are not repeated here.
The per-frame signal-to-noise ratio module 401 may further include: a framing submodule 4011, a noise reduction submodule 4012, and a determining submodule 4013.
The framing submodule 4011 is configured to frame the audio signal.
Wherein, the framing submodule 4011 frames the audio signal according to a preset time length; if the audio length of the audio signal does not meet the integral multiple of the preset time length, zero filling processing is carried out on the tail part of the audio signal so that the integral multiple of the preset time length is met, and then framing is carried out.
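The framing with tail zero padding described above can be sketched as follows. The function name `split_into_frames` is hypothetical, and the preset time length is expressed here as a number of samples per frame (`frame_len`), which is an assumption; the application does not state how the time length maps to samples.

```python
def split_into_frames(samples, frame_len):
    """Split samples into frames of frame_len samples each.

    If the length is not an integral multiple of frame_len, zeros are
    appended to the tail first so that every frame is complete.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be a positive number of samples")
    padded = list(samples)
    remainder = len(padded) % frame_len
    if remainder:
        padded.extend([0.0] * (frame_len - remainder))  # zero-fill the tail
    return [padded[i:i + frame_len] for i in range(0, len(padded), frame_len)]


# Five samples with two samples per frame: the last frame is padded with one zero.
frames = split_into_frames([1.0, 2.0, 3.0, 4.0, 5.0], 2)
```

Padding with zeros changes the energy of the last frame only by leaving part of it silent, so the remaining steps can treat it like any other frame.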
The noise reduction sub-module 4012 is configured to perform noise reduction processing on each frame of audio signals obtained by the framing sub-module 4011 to obtain each frame of audio signals after noise reduction.
The algorithm selected by the noise reduction sub-module 4012 to perform noise reduction processing on the audio signal is not limited, that is, the noise reduction algorithm is not limited, as long as the background noise in the audio signal can be eliminated.
The determining submodule 4013 is configured to determine, according to the signal energy of each frame of audio signal after noise reduction obtained by the noise reducing submodule 4012 and the noise energy of each frame of audio signal before noise reduction, a signal-to-noise ratio of each frame of audio signal before noise reduction.
The determining submodule 4013 may obtain the noise energy of each frame of audio signal before noise reduction according to the signal energy of each frame of audio signal before noise reduction and the signal energy of each frame of audio signal after noise reduction;
and carrying out logarithmic operation according to the ratio of the signal energy to the noise energy of each frame of audio signal after noise reduction, and determining the signal-to-noise ratio of each frame of audio signal before noise reduction.
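How the three submodules fit together can be illustrated with the sketch below. Since the noise reduction algorithm is not limited, it is passed in as a parameter. Everything named here is hypothetical: `per_frame_snr`, the `denoise` callable, the guard value `eps`, and in particular `toy_denoise`, which simply scales the samples and is a placeholder for a real noise reduction algorithm, not one itself. Frame energy is again assumed to be the sum of squared samples.

```python
import math


def per_frame_snr(samples, frame_len, denoise, eps=1e-12):
    """Return the SNR in dB of every frame of samples before noise reduction."""
    # Framing submodule: zero-pad the tail, then cut into equal frames.
    padded = list(samples)
    remainder = len(padded) % frame_len
    if remainder:
        padded.extend([0.0] * (frame_len - remainder))
    frames = [padded[i:i + frame_len] for i in range(0, len(padded), frame_len)]

    snrs = []
    for frame in frames:
        # Noise reduction submodule: any callable mapping a frame to a frame.
        denoised = denoise(frame)
        # Determining submodule: noise energy by subtraction, then log ratio.
        e_x = sum(v * v for v in frame)
        e_s = sum(v * v for v in denoised)
        e_n = max(e_x - e_s, eps)  # eps guards against zero noise energy
        snrs.append(10.0 * math.log10(max(e_s, eps) / e_n))
    return snrs


def toy_denoise(frame):
    """Placeholder only: scales samples by 0.9 instead of removing real noise."""
    return [0.9 * v for v in frame]


snrs = per_frame_snr([1.0, 1.0, 1.0], 2, toy_denoise)
```

Keeping the noise reduction step behind a plain callable means a different algorithm can be substituted without touching the framing or the signal-to-noise ratio calculation.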
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 6 is a schematic structural diagram of an electronic device shown in an embodiment of the present application. The electronic device may be a mobile terminal device or a server device, etc.
Referring to fig. 6, an electronic device 600 includes a memory 610 and a processor 620.
The processor 620 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 610 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions required by the processor 620 or other modules of the computer. The persistent storage may be a read-write storage device, and may be a non-volatile storage device that does not lose stored instructions and data even after the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is used as the persistent storage. In other embodiments, the persistent storage may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a read-write storage device or a volatile read-write storage device, such as dynamic random access memory. The system memory may store some or all of the instructions and data that the processor requires at run time. In addition, the memory 610 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic and/or optical disks may also be employed. In some embodiments, the memory 610 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., SD card, mini SD card, micro-SD card, etc.), a magnetic floppy disk, or the like. Computer-readable storage media do not include carrier waves or transitory electronic signals transmitted wirelessly or by wire.
The memory 610 has stored thereon executable code that, when processed by the processor 620, may cause the processor 620 to perform some or all of the methods described above.
The aspects of the present application have been described in detail hereinabove with reference to the accompanying drawings. In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments. Those skilled in the art should also appreciate that the acts and modules referred to in the specification are not necessarily required in the present application. In addition, it can be understood that the steps in the method of the embodiment of the present application may be sequentially adjusted, combined, and deleted according to actual needs, and the modules in the device of the embodiment of the present application may be combined, divided, and deleted according to actual needs.
Furthermore, the method according to the present application may also be implemented as a computer program or computer program product comprising computer program code instructions for performing some or all of the steps of the above-described method of the present application.
Alternatively, the present application may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or a computing device, a server, etc.), causes the processor to perform some or all of the steps of the above-described method according to the present application.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the applications disclosed herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of the embodiments of the present application has been presented for purposes of illustration and description and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (7)

1. A method for audio signal screening, comprising:
determining the signal-to-noise ratio of each frame of audio signal in the audio signal;
counting the proportion value of the number of frames of each frame of audio signal with the signal-to-noise ratio larger than the set signal-to-noise ratio threshold value in the total number of frames of the audio signal;
determining whether the audio signal is a target audio signal according to a comparison result of the proportion value and a set proportion threshold value; in the step, according to the condition that the ratio value of the number of frames of each frame of audio signal with the signal-to-noise ratio larger than the set signal-to-noise ratio threshold value to the total number of frames of the audio signal is larger than the set ratio threshold value, determining the audio signal as a target audio signal;
the determining the signal-to-noise ratio of each frame of audio signals in the audio signals comprises:
framing the audio signal;
carrying out noise reduction processing on each frame of audio signal to obtain each frame of audio signal subjected to noise reduction;
determining the signal-to-noise ratio of each frame of audio signal before noise reduction according to the signal energy of each frame of audio signal after noise reduction and the noise energy of each frame of audio signal before noise reduction;
the determining the signal-to-noise ratio of each frame of audio signal before noise reduction according to the signal energy of each frame of audio signal after noise reduction and the noise energy of each frame of audio signal before noise reduction comprises:
obtaining the noise energy of each frame of audio signal before noise reduction according to the signal energy of each frame of audio signal before noise reduction and the signal energy of each frame of audio signal after noise reduction; subtracting the signal energy of each frame of audio signal after noise reduction from the signal energy of each frame of audio signal before noise reduction to obtain the noise energy of each frame of audio signal before noise reduction;
and carrying out logarithmic operation according to the ratio of the signal energy of each frame of audio signal after noise reduction to the noise energy, and determining the signal-to-noise ratio of each frame of audio signal before noise reduction.
2. The method of claim 1, wherein the determining whether the audio signal is a target audio signal according to the comparison of the ratio value and a set ratio threshold comprises:
and determining the audio signal as a target audio signal according to the condition that the proportion value is greater than a set proportion threshold value.
3. The method as claimed in claim 1, wherein said counting the ratio of the number of frames of the audio signal per frame whose snr is greater than the set snr threshold to the total number of frames of the audio signal comprises:
traversing the signal-to-noise ratio of each frame of audio signal, and determining the number of frames of which the signal-to-noise ratio of each frame of audio signal is greater than a set signal-to-noise ratio threshold;
and obtaining the proportion value of the frame number of each frame of audio signal with the signal-to-noise ratio larger than the set signal-to-noise ratio threshold value in the total frame number of the audio signal according to the frame number of each frame of audio signal with the signal-to-noise ratio larger than the set signal-to-noise ratio threshold value and the total frame number of the audio signal.
4. The method of claim 3, wherein the framing the audio signal comprises:
framing the audio signal according to a preset time length;
if the audio length of the audio signal does not meet the integral multiple of the preset time length, zero filling processing is carried out on the tail part of the audio signal so that the integral multiple of the preset time length is met, and then framing is carried out.
5. An audio signal screening apparatus, comprising:
the signal-to-noise ratio module of each frame is used for determining the signal-to-noise ratio of each frame of audio signals in the audio signals;
the proportion value module is used for counting the proportion value of the number of frames of each frame of audio signal, the signal to noise ratio of which is greater than a set signal to noise ratio threshold value, in the total number of frames of the audio signal;
the screening module is used for determining whether the audio signal is a target audio signal according to a comparison result of the proportion value determined by the proportion value module and a set proportion threshold value; in the step, according to the condition that the ratio of the number of frames of each frame of audio signal with the signal-to-noise ratio larger than a set signal-to-noise ratio threshold to the total number of frames of the audio signal is larger than a set ratio threshold, determining the audio signal as a target audio signal;
the per-frame signal-to-noise ratio module comprises:
a framing submodule for framing the audio signal;
the noise reduction submodule is used for carrying out noise reduction processing on each frame of audio signal obtained by the framing submodule to obtain each frame of audio signal subjected to noise reduction;
the determining submodule is used for determining the signal-to-noise ratio of each frame of audio signal before noise reduction according to the signal energy of each frame of audio signal after noise reduction and the noise energy of each frame of audio signal before noise reduction, which are obtained by the noise reduction submodule;
the determining the signal-to-noise ratio of each frame of audio signal before noise reduction according to the signal energy of each frame of audio signal after noise reduction and the noise energy of each frame of audio signal before noise reduction comprises:
obtaining the noise energy of each frame of audio signal before noise reduction according to the signal energy of each frame of audio signal before noise reduction and the signal energy of each frame of audio signal after noise reduction; subtracting the signal energy of each frame of audio signal after noise reduction from the signal energy of each frame of audio signal before noise reduction to obtain the noise energy of each frame of audio signal before noise reduction;
and carrying out logarithmic operation according to the ratio of the signal energy of each frame of audio signal after noise reduction to the noise energy, and determining the signal-to-noise ratio of each frame of audio signal before noise reduction.
6. An electronic device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any one of claims 1-4.
7. A non-transitory machine-readable storage medium having stored thereon executable code that, when executed by a processor of an electronic device, causes the processor to perform the method of any one of claims 1-4.
CN202011557215.5A 2020-12-24 2020-12-24 Audio signal screening method, device and equipment Active CN112802463B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011557215.5A CN112802463B (en) 2020-12-24 2020-12-24 Audio signal screening method, device and equipment

Publications (2)

Publication Number Publication Date
CN112802463A CN112802463A (en) 2021-05-14
CN112802463B CN112802463B (en) 2023-03-31

Family

ID=75804517

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011557215.5A Active CN112802463B (en) 2020-12-24 2020-12-24 Audio signal screening method, device and equipment

Country Status (1)

Country Link
CN (1) CN112802463B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114040309B (en) * 2021-09-24 2024-03-19 北京小米移动软件有限公司 Wind noise detection method and device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108597498A (en) * 2018-04-10 2018-09-28 广州势必可赢网络科技有限公司 Multi-microphone voice acquisition method and device
CN110265052A (en) * 2019-06-24 2019-09-20 秒针信息技术有限公司 The signal-to-noise ratio of radio equipment determines method, apparatus, storage medium and electronic device
CN110706693A (en) * 2019-10-18 2020-01-17 浙江大华技术股份有限公司 Method and device for determining voice endpoint, storage medium and electronic device
CN111833895A (en) * 2019-04-23 2020-10-27 北京京东尚科信息技术有限公司 Audio signal processing method, apparatus, computer device and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9099098B2 (en) * 2012-01-20 2015-08-04 Qualcomm Incorporated Voice activity detection in presence of background noise

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant