WO2014177084A1 - Voice activity detection method and apparatus - Google Patents

Voice activity detection method and apparatus

Info

Publication number
WO2014177084A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
characteristic parameter
tonal
parameter
signal
Prior art date
Application number
PCT/CN2014/077704
Other languages
English (en)
French (fr)
Inventor
朱长宝
袁浩
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Family has litigation
First worldwide family litigation filed ("Global patent litigation dataset" by Darts-ip, licensed under a Creative Commons Attribution 4.0 International License)
Application filed by 中兴通讯股份有限公司 filed Critical 中兴通讯股份有限公司
Priority to US14/915,246 priority Critical patent/US9978398B2/en
Priority to JP2016537092A priority patent/JP6412132B2/ja
Priority to EP14791094.7A priority patent/EP3040991B1/en
Priority to PL14791094T priority patent/PL3040991T3/pl
Priority to KR1020167005654A priority patent/KR101831078B1/ko
Publication of WO2014177084A1 publication Critical patent/WO2014177084A1/zh

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L25/84 - Detection of presence or absence of voice signals for discriminating voice from noise
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/06 - Decision making techniques; Pattern matching strategies
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L2025/783 - Detection of presence or absence of voice signals based on threshold decision
    • G10L2025/786 - Adaptive threshold

Definitions

  • The present invention relates to the field of communications, and in particular to a voice activity detection (VAD) method and apparatus.
  • In a normal voice call, a user sometimes speaks and sometimes listens, so inactive (non-speech) phases occur during the call. Under normal circumstances, the combined inactive phases of the two parties exceed half of the total voice coding duration of the call.
  • During the inactive phase there is only background noise, and background noise usually does not carry any useful information.
  • Active and inactive frames are therefore distinguished by a voice activity detection (VAD) algorithm and processed separately by different methods.
  • Modern speech coding standards, such as AMR and AMR-WB, support the VAD function. However, in terms of efficiency, the VAD of these encoders does not achieve good performance under all typical background noises.
  • For such noises, the VAD efficiency of these encoders is lower.
  • These VADs sometimes make erroneous detections, causing a noticeable quality degradation in the corresponding processing algorithms.
  • In addition, related VAD techniques may make inaccurate decisions; for example, some VAD techniques detect inaccurately in the first few frames of a speech segment, and some in the last few frames.
  • A voice activity detection method includes:
  • obtaining a final joint VAD decision result according to the number of consecutive active frames, the average full-band signal-to-noise ratio, the tonality signal flag, and at least two existing VAD decision results.
  • the method further includes:
  • the frame energy parameter of the current frame is calculated.
  • the values of the characteristic parameters are calculated according to the sub-band signal;
  • the values of the spectral flatness characteristic parameter and the tonal characteristic parameter are calculated according to the spectrum amplitude; the tonal signal flag is calculated according to the tonal characteristic parameter, the spectral center of gravity characteristic parameter, the time domain stability characteristic parameter, and the spectral flatness characteristic parameter.
  • the method further includes:
  • acquiring the background noise energy estimated at the previous frame includes:
  • the frame energy parameter is a weighted or direct superposition of the sub-band signal energies;
  • the spectral center of gravity characteristic parameter is a ratio of a weighted accumulated value of all or part of the subband signal energy to an unweighted accumulated value, or a value obtained by smoothing the ratio;
  • the time domain stability characteristic parameter is the ratio of the variance of several adjacent two-frame energy amplitude superposition values to the expectation of the squares of those superposition values, or that ratio multiplied by a coefficient;
  • the spectral flatness characteristic parameter is the ratio of the geometric mean to the arithmetic mean of certain spectral amplitudes, or that ratio multiplied by a coefficient;
  • the tonality characteristic parameter is obtained by computing the correlation coefficient of the intra-frame spectral difference coefficients of the previous and current frames, or by further smoothing that correlation coefficient.
  • calculating the tonal signal flag according to the tonal characteristic parameter, the spectral gravity center characteristic parameter, the time domain stability characteristic parameter, and the spectral flatness characteristic parameter comprises:
  • step B) by default, the current frame signal is set to be a non-tonal signal, and a tonal frame flag is used to indicate whether the current frame is a tonal frame;
  • perform step C) when either of the following conditions is satisfied, and perform step D) when neither of the following conditions is met:
  • Condition 1 The value of the tonality characteristic parameter tonality_rate1, or its smoothed filtered value, is greater than the correspondingly set first tonality characteristic parameter decision threshold;
  • Condition 2 The value of the tonality characteristic parameter tonality_rate2, or its smoothed filtered value, is greater than the correspondingly set second tonality characteristic parameter decision threshold;
  • step C) determine whether the current frame is a tonal frame, and set the value of the tonal frame flag according to the result: the current frame is determined to be a tonal frame when all of the following conditions are met; when any one or more of the following conditions are not satisfied, the current frame is determined to be a non-tonal frame and step D) is performed:
  • Condition 1 The time domain stability characteristic parameter value is less than a set first time domain stability decision threshold;
  • Condition 2 The spectral center of gravity characteristic parameter value is greater than a set first spectral center of gravity decision threshold;
  • Condition 3 The spectral flatness characteristic parameter of each sub-band is smaller than the corresponding preset spectral flatness decision threshold. When all of the conditions hold, the current frame is determined to be a tonal frame and the value of the tonal frame flag is set accordingly;
  • step D) when the current frame is a tonal frame, update the tonality degree characteristic parameter: tonality_degree = tonality_degree_{-1} * td_scale_A + td_scale_B, where tonality_degree_{-1} is the tonality degree characteristic parameter of the previous frame, whose initial value range is [0, 1], td_scale_A is the attenuation coefficient, and td_scale_B is the accumulation coefficient.
  • When the updated tonality degree characteristic parameter tonality_degree is greater than a set tonality degree threshold, the current frame is determined to be a tonal signal; when tonality_degree is less than or equal to the set threshold, the current frame is determined to be a non-tonal signal.
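The tonal-signal decision of steps B) through D) can be sketched as follows. All threshold values and the td_scale coefficients here are illustrative assumptions (chosen inside the ranges the text allows), not the patent's normative values, and only the first tonality condition of step B) is sketched.

```python
def update_tonality(tonality_rate1, lt_stable_rate0, sp_center, sSMR,
                    tonality_degree_prev,
                    tonal_thr=0.6, stab_thr=0.1, spc_thr=1.2, sfm_thr=0.9,
                    td_scale_A=0.97, td_scale_B=0.03, degree_thr=0.5):
    """Sketch of the tonal-signal decision (steps B-D); thresholds are
    illustrative assumptions, not the patent's normative values."""
    tonality_frame = 0  # step B: default to a non-tonal frame
    # Step B/C: a frame is tonal only if the tonality feature is high AND the
    # signal is time-stable, has a high spectral centroid, and low flatness.
    if tonality_rate1 > tonal_thr:
        if (lt_stable_rate0 < stab_thr and sp_center > spc_thr
                and all(s < sfm_thr for s in sSMR)):
            tonality_frame = 1
    # Step D: when the frame is tonal, update the tonality degree with the
    # attenuate-and-accumulate recursion (initial value in [0, 1]).
    tonality_degree = tonality_degree_prev
    if tonality_frame == 1:
        tonality_degree = tonality_degree_prev * td_scale_A + td_scale_B
    # Final flag: tonal signal when the accumulated degree exceeds a threshold.
    tonality_flag = 1 if tonality_degree > degree_thr else 0
    return tonality_frame, tonality_degree, tonality_flag
```

A steady tone keeps pushing tonality_degree upward, so the flag persists across short dips in the per-frame tonality feature, which is the point of the accumulate/attenuate design.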
  • the method further includes:
  • the current number of consecutive active frames continuous_speech_num2 is calculated from the previous joint VAD decision result:
  • when the current frame is the first frame, the number of consecutive active frames is 0;
  • an existing VAD decision result or joint VAD decision result of 1 indicates an active frame, and 0 indicates an inactive frame; obtaining the final joint VAD decision result according to the number of consecutive active frames, the average full-band signal-to-noise ratio, the tonality signal flag, and the at least two existing VAD decision results includes:
  • selecting the logical operation of the at least two existing VAD decision results as the joint VAD decision result when any of the following conditions is satisfied, and selecting the decision result of one of the at least two existing VADs as the joint VAD decision result when none of the following conditions is satisfied, where the logical operation is an OR operation or an AND operation:
  • Condition 1 The average full-band signal-to-noise ratio is greater than the signal-to-noise ratio threshold.
  • Condition 2 continuous_speech_num2 is greater than a set threshold of the number of consecutive active frames, and the average full-band signal-to-noise ratio is greater than the signal-to-noise ratio threshold.
  • Condition 3 The tonality signal flag is set to 1.
  • Or, obtaining the final joint VAD decision result according to the number of consecutive active frames, the average full-band signal-to-noise ratio, the tonality signal flag, and the at least two existing VAD decision results includes:
  • the joint VAD decision result is 1 when any of the following conditions is satisfied; when none of the following conditions is satisfied, the logical operation of the at least two existing VAD decision results is selected as the output, where the logical operation is an OR operation or an AND operation:
  • Condition 1 The at least two existing VAD decision results are all 1.
  • Condition 2 The sum of the at least two existing VAD decision results is greater than the joint decision threshold, and the tonality signal flag is set to 1.
  • Condition 3 continuous_speech_num2 is greater than a set threshold of the number of consecutive active frames, the average full-band signal-to-noise ratio is greater than the signal-to-noise ratio threshold, and the tonality signal flag is set to 1.
  • An embodiment of the present invention further provides a voice activity detection apparatus, comprising:
  • the joint decision module is configured to obtain a final joint VAD decision result according to the number of consecutively activated audio frames, the average full-band signal-to-noise ratio, the tonal signal flag, and at least two existing VAD decision results.
  • the device further includes a parameter obtaining module, where the parameter obtaining module includes: a first parameter acquiring unit configured to obtain a subband signal and a spectrum amplitude of the current frame; and a second parameter acquiring unit configured to be based on the subband The signal is calculated to obtain a value of a frame energy parameter, a spectral center of gravity characteristic parameter, and a time domain stability characteristic parameter of the current frame;
  • a third parameter obtaining unit is configured to calculate a value of the spectral flatness characteristic parameter and the tonal characteristic parameter according to the spectrum amplitude
  • the fourth parameter acquiring unit is configured to calculate the tonal signal flag according to the tonal feature parameter, the spectral center of gravity feature parameter, the time domain stability feature parameter, and the spectral flatness characteristic parameter.
  • the parameter obtaining module further includes:
  • a fifth parameter obtaining unit is configured to obtain the background noise energy estimated at the previous frame; a sixth parameter obtaining unit is configured to calculate the average full-band signal-to-noise ratio according to the background noise energy estimated at the previous frame and the frame energy parameter of the current frame.
  • The parameter obtaining module further includes: a seventh parameter obtaining unit configured to determine, when the current frame is the first frame, that the number of consecutive active frames is 0.
  • Embodiments of the present invention provide an activation tone detecting method and apparatus, and obtain a final joint VAD decision result according to the number of consecutively activated audio frames, the average full-band signal-to-noise ratio, the tonal signal flag, and at least two existing VAD decision results.
  • the VAD decision is comprehensively based on various parameters, which improves the accuracy of the VAD decision and solves the problem of inaccurate VAD detection.
  • FIG. 1 is a flowchart of a method for detecting an activated sound according to Embodiment 1 of the present invention
  • FIG. 2 is a flowchart of a method for detecting an activated sound according to Embodiment 2 of the present invention
  • FIG. 3 is a schematic structural diagram of a voice activity detection apparatus according to an embodiment of the present invention;
  • FIG. 4 is a schematic structural diagram of the parameter obtaining module 302 of FIG. 3.
  • Embodiments of the present invention provide a voice activity detection method. The embodiments of the present invention are described in detail below with reference to the accompanying drawings. The embodiments in the present application, and the features in the embodiments, may be combined with each other arbitrarily when there is no conflict.
  • Embodiment 1 of the present invention provides a voice activity detection method; the process of performing VAD with this method is shown in FIG. 1 and includes:
  • Step 101 Acquire a judgment result of at least two existing VADs
  • Step 102 Obtain a subband signal and a spectrum amplitude of a current frame.
  • an audio stream with a frame length of 20 ms and a sampling rate of 32 kHz is taken as an example for description.
  • the activation sound detection method provided by the embodiment of the present invention is also applicable under other frame lengths and sampling rates.
  • the current frame time domain signal is input to the filter bank unit, and subband filtering calculation is performed to obtain a filter group subband signal.
  • a 40-channel filter bank is used, and the technical solution provided by the embodiment of the present invention is also applicable to a filter bank using other channel numbers.
  • The current frame time-domain signal is input into the 40-channel filter bank for sub-band filtering, and the filter-bank sub-band signals X[k, l] of the 40 sub-bands over 16 time samples are obtained, 0 ≤ k < 40, 0 ≤ l < 16, where k is the filter-bank sub-band index, whose value indicates the sub-band to which a coefficient corresponds, and l is the time sample index of each sub-band. The implementation steps are as follows:
  • the filter-bank sub-band signals X[k, l] of 16 time samples of the 40 sub-bands are obtained, 0 ≤ k < 40, 0 ≤ l < 16. Then the filter-bank sub-band signals are time-frequency transformed and the spectrum amplitudes are calculated.
  • Embodiments of the present invention can be implemented by performing time-frequency transform on all filter bank sub-bands or partial filter bank sub-bands and calculating spectrum amplitudes.
  • the time-frequency transform method according to the embodiment of the present invention may be DFT, FFT, DCT or DST.
  • the DFT is taken as an example to illustrate the implementation method thereof.
  • the calculation process is as follows:
  • The time-frequency transform calculation expression is as follows: X_DFT[k, j] = Σ_{l=0}^{15} X[k, l] * e^(-2πi*j*l/16); 0 ≤ k < 10; 0 ≤ j < 16.
  • X_DFT_POW[k, j] = (real(X_DFT[k, j]))^2 + (image(X_DFT[k, j]))^2; 0 ≤ k < 10; 0 ≤ j < 16; where real(X_DFT[k, j]) and image(X_DFT[k, j]) denote, respectively, the real and imaginary parts of the spectral coefficient X_DFT[k, j]. If k is even, the spectral amplitude at each frequency point is calculated with the following equation: X_DFT_AMP[8*k + j] = sqrt(X_DFT_POW[k, j] + X_DFT_POW[k, 15 - j]); 0 ≤ j < 8. If k is odd, the spectral amplitude at each frequency point is calculated with the following equation:
  • X_DFT_AMP[8*k + 7 - j] = sqrt(X_DFT_POW[k, j] + X_DFT_POW[k, 15 - j]); 0 ≤ j < 8. X_DFT_AMP is the spectral amplitude after the time-frequency transform.
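The transform and amplitude computation can be sketched as below, assuming a 16-point DFT per sub-band and a mirrored even/odd output mapping; because the source formulas are partly garbled, this indexing should be read as an assumption rather than the patent's exact scheme.

```python
import cmath

def subband_spectrum_amplitude(subband_signals):
    """Sketch: 16-point DFT of each sub-band's 16 time samples, then a
    spectral amplitude per frequency point. subband_signals[k][l] holds
    sub-band k, time sample l (0 <= l < 16)."""
    num_bands = len(subband_signals)
    amp = [0.0] * (num_bands * 8)
    for k in range(num_bands):
        x = subband_signals[k]
        # 16-point DFT of the sub-band's time samples
        X = [sum(x[l] * cmath.exp(-2j * cmath.pi * j * l / 16)
                 for l in range(16))
             for j in range(16)]
        pow_ = [abs(c) ** 2 for c in X]  # per-bin power spectrum
        for j in range(8):
            # combine mirrored bins; even and odd sub-bands are flipped so
            # that frequency increases monotonically across sub-bands
            a = (pow_[j] + pow_[15 - j]) ** 0.5
            if k % 2 == 0:
                amp[8 * k + j] = a
            else:
                amp[8 * k + 7 - j] = a
    return amp
```

Applied to the first 10 sub-bands this yields 80 spectral amplitude points, which is consistent with the tonality computation later using frequency coefficients 3 to 61.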
  • Step 103 Calculate a value of a frame energy parameter, a spectral center of gravity characteristic parameter, and a time domain stability characteristic parameter of the current frame according to the subband signal, and calculate a value of the spectral flatness characteristic parameter and the tonal characteristic parameter according to the spectrum amplitude;
  • each parameter is obtained by the following method:
  • the frame energy parameter is a weighted superposition value or a direct superposition value of each sub-band signal energy:
  • E_sb[k] = Σ_{l=0}^{15} ((real(X[k, l]))^2 + (image(X[k, l]))^2); 0 ≤ k < 40; where E_sb[k] is the energy of the kth filter-bank sub-band.
  • The energies of some auditory-sensitive filter-bank sub-bands, or of all the filter-bank sub-bands, are accumulated to obtain the frame energy parameters.
  • the human ear is relatively insensitive to sounds of extremely low frequencies (e.g., below 100 Hz) and high frequencies (e.g., above 20 kHz).
  • With the filter-bank sub-bands arranged from low frequency to high frequency, the second sub-band through the penultimate sub-band are taken as the auditory-sensitive main filter-bank sub-bands. The energies of some or all of the auditory-sensitive filter-bank sub-bands are accumulated to obtain frame energy parameter 1, whose calculation expression is as follows:
  • E_t1 = Σ_{n=e_sb_start}^{e_sb_end} E_sb[n]
  • where e_sb_start is the start sub-band index, and e_sb_end is the end sub-band index, which takes a value greater than 6 and less than the total number of sub-bands.
  • Frame energy parameter 2 is the value of frame energy parameter 1 plus a weighted value of the energies of the filter-bank sub-bands that are not used in calculating frame energy parameter 1. Its calculation expression is as follows:
  • E_t2 = E_t1 + e_scale1 * Σ_{n=0}^{e_sb_start-1} E_sb[n] + e_scale2 * Σ_{n=e_sb_end+1}^{num_band-1} E_sb[n]
  • where e_scale1 and e_scale2 are weighted scale factors with value ranges of [0, 1], and num_band is the total number of sub-bands.
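A sketch of the two frame energy parameters; the index bounds e_sb_start/e_sb_end and the scale factors below are illustrative defaults (the text only constrains e_sb_end to exceed 6 and the scale factors to lie in [0, 1]).

```python
def frame_energy_params(E_sb, e_sb_start=6, e_sb_end=34,
                        e_scale1=0.5, e_scale2=0.25):
    """Sketch of frame energy parameters 1 and 2 from the 40 sub-band
    energies E_sb. Bounds and scale factors are illustrative assumptions."""
    # Frame energy parameter 1: sum over the auditory-sensitive sub-bands.
    E_t1 = sum(E_sb[e_sb_start:e_sb_end + 1])
    # Frame energy parameter 2: add down-weighted energy of the remaining
    # (less auditory-sensitive) lowest and highest sub-bands.
    E_t2 = (E_t1
            + e_scale1 * sum(E_sb[:e_sb_start])
            + e_scale2 * sum(E_sb[e_sb_end + 1:]))
    return E_t1, E_t2
```

Down-weighting rather than discarding the edge sub-bands keeps E_t2 usable as a full-band energy while still emphasizing the perceptually relevant range.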
  • the spectral center of gravity characteristic parameter is a ratio of a weighted accumulated value of all or part of the subband signal energy to an unweighted accumulated value
  • The spectral center of gravity characteristic parameters are calculated from the energies of the filter-bank sub-bands; a spectral center of gravity characteristic parameter is the ratio of the weighted sum of the filter-bank sub-band energies to the unweighted sum of the sub-band energies, or is obtained by smoothing such a ratio.
  • The spectral center of gravity characteristic parameter values may additionally be smoothed and filtered.
  • the spectral center of gravity feature parameters can be implemented using the following substeps:
  • the subband interval used for the calculation of the spectral center of gravity feature parameters is as shown in Table 1.
  • the two spectral center-of-gravity characteristic parameter values are calculated, which are the first interval spectral center-of-gravity characteristic parameter and the second interval spectral gravity center characteristic parameter, respectively.
  • sp_center[k] = (Σ_n (n + 1) * E_sb[n] + Delta1) / (Σ_n E_sb[n] + Delta2), where the sums run over the sub-band interval given in Table 1, Delta1 and Delta2 are each a small offset value with a range of (0, 1), and k is the spectral center of gravity index.
  • sp_center[2] = sp_center_{-1}[2] * spc_sm_scale + sp_center[0] * (1 - spc_sm_scale)
  • where spc_sm_scale is the spectral center of gravity parameter smoothing-filter scale factor, and sp_center_{-1}[2] is the smoothed spectral center of gravity characteristic parameter value of the previous frame, whose initial value is 1.6.
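The centroid and its smoothing can be sketched as follows; the (n + 1) weighting and the offset values are taken from the formulas above and the concrete interval is an illustrative assumption.

```python
def spectral_centroid(E_sb, start, end, delta1=0.1, delta2=0.1):
    """Sketch of one interval's spectral center of gravity: weighted
    sub-band energy sum over unweighted sum, with small offsets (in (0, 1))
    guarding against division by zero."""
    num = sum((n + 1) * E_sb[n] for n in range(start, end)) + delta1
    den = sum(E_sb[n] for n in range(start, end)) + delta2
    return num / den

def smooth_centroid(sp_center_prev, sp_center_now, spc_sm_scale=0.7):
    """One-pole smoothing of the centroid; the previous smoothed value has
    an initial value of 1.6 per the text. spc_sm_scale is an assumption."""
    return sp_center_prev * spc_sm_scale + sp_center_now * (1 - spc_sm_scale)
```

A high centroid indicates energy concentrated in upper sub-bands, which is one of the cues used later to reject a frame as background noise.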
  • The time domain stability characteristic parameter is the ratio of the variance of several adjacent two-frame energy amplitude superposition values to the expectation of the squares of those superposition values, or that ratio multiplied by a coefficient;
  • the time domain stability characteristic parameter is calculated from the latest frame energy parameters of several frame signals.
  • the time domain stability characteristic parameter is calculated by using the frame energy parameter of the latest 40 frame signal. The calculation steps are:
  • Amp_t1[n] = sqrt(E_t2(n)) + e_offset; 0 ≤ n < 40; where e_offset is an offset value with a range of [0, 0.1], E_t2(0) is frame energy parameter 2 of the current frame, and E_t2(n) is that of the nth previous frame;
  • Amp_t2[n] = Amp_t1(-2n) + Amp_t1(-2n - 1); 0 ≤ n < 20; where Amp_t1(0) corresponds to the current frame and Amp_t1(-n) to the nth previous frame.
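Over the latest 40 frames, the stability feature can be sketched as below; the exact normalization (dividing the variance by the mean square of the pair sums) is an assumption consistent with the definition given above.

```python
def time_domain_stability(frame_energies, e_offset=0.05):
    """Sketch of the time-domain stability feature over the latest 40
    frames: pairwise-summed energy amplitudes, then variance divided by
    the mean square. Near 0 for steady signals, larger for bursty ones."""
    assert len(frame_energies) == 40
    # Energy amplitude per frame: sqrt of frame energy parameter 2 + offset.
    amp1 = [e ** 0.5 + e_offset for e in frame_energies]
    # Superpose adjacent pairs: 20 two-frame amplitude sums.
    amp2 = [amp1[2 * n] + amp1[2 * n + 1] for n in range(20)]
    mean = sum(amp2) / 20
    variance = sum((a - mean) ** 2 for a in amp2) / 20
    mean_square = sum(a * a for a in amp2) / 20
    return variance / mean_square
```

Because the amplitudes are strictly positive, the ratio always lies in [0, 1), so a single threshold inside that range separates stationary noise from fluctuating speech.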
  • The spectral flatness characteristic parameter is the ratio of the geometric mean to the arithmetic mean of certain spectral amplitudes, or that ratio multiplied by a coefficient;
  • the spectral amplitude is divided into several frequency bands, and the spectral flatness of each frequency band of the current frame is calculated, and the spectral flatness characteristic parameter of the current frame is obtained.
  • the spectrum amplitude is divided into three frequency bands, and the spectral flatness characteristics of the three frequency bands are calculated, and the implementation steps are as follows:
  • the spectral flatness of each sub-band is calculated separately, and the spectral flatness characteristic parameter of the current frame is obtained.
  • The calculation expressions of the spectral flatness characteristic parameter values of the current frame are as follows: SMR(k) = GM_k / AM_k; 0 ≤ k < 3; where GM_k and AM_k are, respectively, the geometric mean and the arithmetic mean of the spectral amplitudes of the kth frequency band.
  • the spectral flatness characteristic parameter of the current frame is smoothed to obtain the final spectral flatness characteristic parameter of the current frame.
  • sSMR(k) = smr_scale * sSMR_{-1}(k) + (1 - smr_scale) * SMR(k); 0 ≤ k < 3; where smr_scale is the smoothing factor with a value range of [0.6, 1], and sSMR_{-1}(k) is the value of the kth spectral flatness characteristic parameter of the previous frame.
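The per-band flatness and its smoothing can be sketched as follows; the tiny epsilon inside the logarithm is an added guard against zero amplitudes, not part of the source formulas.

```python
import math

def spectral_flatness(amps):
    """Sketch: geometric mean over arithmetic mean of one band's spectral
    amplitudes. Close to 1.0 for a flat (noise-like) spectrum, small for a
    peaky (tonal) one. The 1e-12 epsilon guards log(0)."""
    n = len(amps)
    geo = math.exp(sum(math.log(a + 1e-12) for a in amps) / n)
    arith = sum(amps) / n
    return geo / arith

def smooth_sfm(sSMR_prev, SMR, smr_scale=0.7):
    """Per-band one-pole smoothing:
    sSMR(k) = smr_scale * sSMR_prev(k) + (1 - smr_scale) * SMR(k)."""
    return [smr_scale * p + (1 - smr_scale) * s
            for p, s in zip(sSMR_prev, SMR)]
```

The geometric mean collapses toward zero whenever a few bins dominate, which is why low flatness in every band is later treated as evidence of a tonal, non-noise frame.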
  • The tonality characteristic parameter is the correlation value of the intra-frame spectral difference coefficients of the previous and current frames, or that correlation value after further smoothing filtering.
  • the tonal characteristic parameters are calculated according to the spectral amplitude, wherein the tonal characteristic parameters can be calculated according to all spectral amplitudes or partial spectral amplitudes.
  • a frequency coefficient of 3 to 61 is selected as an example to calculate a tonal characteristic parameter.
  • the process is as follows:
  • spec_dif[n - 3] = X_DFT_AMP(n + 1) - X_DFT_AMP(n); 3 ≤ n < 62;
  • pre-spec_dif is the non-negative spectral difference coefficient of the previous frame.
  • tonality_rate2 = tonal_scale * tonality_rate2_{-1} + (1 - tonal_scale) * tonality_rate1
  • where tonal_scale is the tonality characteristic parameter smoothing factor, with a value range of [0.1, 1]; tonality_rate2_{-1} is the second tonality characteristic parameter value of the previous frame, and its initial value range is [0, 1].
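The tonality computation can be sketched as follows. Clipping negative spectral differences to zero (to get the "non-negative" coefficients mentioned above) and the normalized cross-correlation form are assumptions consistent with the definitions given.

```python
def spectral_difference(amp, start=3, end=62):
    """Non-negative first difference of the spectral amplitudes over the
    chosen frequency coefficients (negative differences clipped to 0)."""
    return [max(amp[n + 1] - amp[n], 0.0) for n in range(start, end)]

def tonality_rate1(spec_dif, pre_spec_dif):
    """Sketch: correlation between the current and previous frames'
    non-negative intra-frame spectral difference coefficients."""
    num = sum(a * b for a, b in zip(spec_dif, pre_spec_dif))
    den = (sum(a * a for a in spec_dif)
           * sum(b * b for b in pre_spec_dif)) ** 0.5
    return num / den if den > 0 else 0.0

def tonality_rate2(rate2_prev, rate1, tonal_scale=0.8):
    """Smoothed second tonality parameter:
    rate2 = tonal_scale * rate2_prev + (1 - tonal_scale) * rate1."""
    return tonal_scale * rate2_prev + (1 - tonal_scale) * rate1
```

Stationary tones keep the same spectral peaks from frame to frame, so their difference coefficients correlate strongly and tonality_rate1 stays near 1, while noise decorrelates quickly.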
  • Step 104 Calculate the tonality signal flag; for the calculation, refer to the tonality signal calculation flow in Embodiment 3 of the present invention.
  • Step 105 Calculate the average full-band signal-to-noise ratio from the full-band background noise energy estimated at the previous frame and the frame energy parameter of the current frame;
  • SNR2 = log2(E_t1 / E_t_bg); where E_t_bg is the estimated full-band background noise energy of the previous frame.
  • the principle of obtaining the full background noise energy of the previous frame is the same as the principle of obtaining the full-band background noise energy of the current frame.
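A minimal sketch of the average full-band SNR, assuming it is the base-2 logarithm of the ratio of the current frame's energy parameter to the previous frame's background noise estimate (the exact form in the garbled source may differ):

```python
import math

def average_full_band_snr(E_t1, E_t_bg):
    """Sketch of step 105: SNR2 = log2(E_t1 / E_t_bg), with E_t_bg the
    full-band background noise energy estimated at the previous frame.
    The log base is an assumption."""
    return math.log2(E_t1 / E_t_bg)
```

Using a log-domain ratio makes a single SNR threshold meaningful across very different absolute signal levels.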
  • Step 106 Acquire the number of consecutively activated audio frames
  • The number of consecutive active frames continuous_speech_num2 can be calculated from the joint VAD decision result.
  • the initial value is set to 0.
  • When the joint VAD flag vad_flag is 1, continuous_speech_num2 is incremented by 1; when vad_flag is 0, continuous_speech_num2 is set to 0.
  • Step 107 obtain a final joint VAD decision result according to the number of consecutively activated audio frames, the average full-band signal-to-noise ratio, the tonal signal flag, and at least two existing VAD decision results;
  • The following two implementation methods are taken as examples; other joint methods are also possible. In the first method, the logical operation of the decision results of the at least two existing VADs is selected as the joint VAD decision result when any of the following conditions is satisfied, and the decision result of one of the at least two existing VADs is selected as the joint VAD decision result when none of the following conditions is satisfied, where the logical operation is an OR operation or an AND operation:
  • Condition 1 The average full-band signal-to-noise ratio is greater than the signal-to-noise ratio threshold.
  • Condition 2 continuous_speech_num2 is greater than a set threshold of the number of consecutive active frames, and the average full-band signal-to-noise ratio is greater than the signal-to-noise ratio threshold.
  • Condition 3 The tonal signal flag is set to 1.
  • In the second method, obtaining the final joint VAD decision result based on the number of consecutive active frames, the average full-band signal-to-noise ratio, the tonality signal flag, and the at least two existing VAD decision results includes:
  • the joint VAD decision result is 1 when any of the following conditions is satisfied; when none of the following conditions is satisfied, the logical operation of the at least two existing VAD decision results is selected as the output, where the logical operation is an OR operation or an AND operation: Condition 1 The at least two existing VAD decision results are all 1.
  • Condition 2 The sum of at least two existing VAD decisions is greater than the joint decision threshold, and the tonal signal flag is set to 1 .
  • Condition 3 continuous_speech_num2 is greater than a set threshold of the number of consecutive active frames, the average full-band signal-to-noise ratio is greater than the signal-to-noise ratio threshold, and the tonality signal flag is set to 1.
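The second joint-decision method, together with the step 106 counter, can be sketched as follows; the threshold values are illustrative assumptions, and vad_decisions stands for the list of at least two existing VAD results.

```python
def joint_vad(vad_decisions, snr2, continuous_speech_num2, tonality_flag,
              snr_thr=4.0, speech_num_thr=20, joint_thr=1, use_or=True):
    """Sketch of the second joint VAD decision method (step 107).
    Thresholds are illustrative, not the patent's normative values."""
    # Condition 1: all existing VAD decisions are 1.
    cond1 = all(d == 1 for d in vad_decisions)
    # Condition 2: their sum exceeds the joint decision threshold and the
    # frame is flagged as tonal.
    cond2 = sum(vad_decisions) > joint_thr and tonality_flag == 1
    # Condition 3: enough consecutive active frames, high average full-band
    # SNR, and the tonality flag set.
    cond3 = (continuous_speech_num2 > speech_num_thr and snr2 > snr_thr
             and tonality_flag == 1)
    if cond1 or cond2 or cond3:
        return 1
    # Otherwise output a logical OR (or AND) of the existing decisions.
    if use_or:
        return 1 if any(vad_decisions) else 0
    return 1 if all(vad_decisions) else 0

def update_continuous_speech_num2(prev_count, joint_vad_flag):
    """Step 106 counter: consecutive active frames, driven by the previous
    joint VAD decision (0 at the first frame and after any inactive frame)."""
    return prev_count + 1 if joint_vad_flag == 1 else 0
```

The OR combination favors not clipping speech onsets, while the AND combination favors a low active-frame rate; the three conditions force an active decision in the high-confidence cases regardless of that choice.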
  • Steps 101 to 106 in the embodiment of the present invention have no strict timing relationship (except that the order of steps 102, 103, and 104 cannot be reversed). Any scheme that obtains the number of consecutive active
  • frames, the average full-band signal-to-noise ratio, the tonality signal flag, and the at least two existing VAD decision results required for the joint VAD decision of the embodiment of the present invention falls within the scope of protection of the embodiments of the present invention.
  • a calculation method of the background noise energy of the previous frame is given.
  • the background noise energy of the previous frame is used to calculate the average full-band signal-to-noise ratio.
  • the calculation process of the background noise energy of the previous frame is the same as the calculation process of the background noise energy of the current frame.
  • the embodiment of the present invention provides a calculation method of the full-band background noise energy of the current frame.
  • Step 201 Obtain a subband signal and a spectrum amplitude of the current frame. For the calculation method, see step 102.
  • Step 202 Calculate the values of the frame energy parameter, the spectral center of gravity characteristic parameter, and the time domain stability characteristic parameter of the current frame according to the sub-band signal; calculate the values of the spectral flatness characteristic parameter and the tonality characteristic parameter according to the spectrum amplitude. For the calculation method, see step 103.
  • Step 203 Calculate a background noise identifier of the current frame according to the current frame frame energy parameter, the spectral center of gravity feature parameter, the time domain stability feature parameter, the spectral flatness characteristic parameter, and the tonal feature parameter.
  • the background noise flag is used to indicate whether the current frame is a noise signal. If it is a noise signal, the background noise flag is set to 1, otherwise it is set to 0.
  • When determining the background noise signal, if any of the following conditions holds, it is determined that the current frame is not a noise signal:
  • Condition 1 The time domain stability characteristic parameter lt_stable_rate0 is greater than a set time domain stability threshold;
  • Condition 2 The smoothed filter value of the first interval spectral center of gravity feature parameter value is greater than a set spectral centroid threshold value, and the time domain stability feature parameter value is also greater than the set time domain stability threshold value;
  • Condition 3 The tonality characteristic parameter or its smoothed filtered value is greater than a set tonality characteristic parameter threshold, and the value of the time domain stability characteristic parameter lt_stable_rate0 is greater than its set time domain stability threshold;
  • Condition 5 The value of the frame energy parameter E_t1 is greater than a set frame energy threshold E_thr1.
  • A background noise flag background_flag is used to indicate whether the current frame is background noise: if the current frame is determined to be background noise, background_flag is set to 1; otherwise, background_flag is set to 0.
  • Whether the current frame is a noise signal is detected according to the time domain stability characteristic parameter, the spectral center of gravity characteristic parameter, the spectral flatness characteristic parameter, the tonality characteristic parameter, and the current frame energy parameter. If it is not a noise signal, background_flag is set to 0.
  • For condition 1: It is determined whether the time domain stability characteristic parameter lt_stable_rate0 is greater than a set first time domain stability threshold lt_stable_rate_thr1. If so, it is determined that the current frame is not a noise signal, and background_flag is set to 0.
  • In the embodiment of the present invention, the first time domain stability threshold lt_stable_rate_thr1 has a value range of [0.8, 1.6]. For condition 2: it is determined whether the smoothed first-interval spectral center of gravity characteristic parameter value is greater than a set first spectral center of gravity threshold sp_center_thr1, and whether the value of the time domain stability characteristic parameter lt_stable_rate0 is also greater than a second time domain stability threshold lt_stable_rate_thr2. If so, it is determined that the current frame is not a noise signal, and background_flag is set to 0.
  • The value range of sp_center_thr1 is [1.6, 4]; the value range of lt_stable_rate_thr2 is (0, 0.1).
  • condition 4 determining whether the value of the first spectral flatness characteristic parameter 53 ⁇ 4 ⁇ [ 0 ] is less than the set first spectral flatness threshold value ⁇ — ⁇ 1 , determining the second spectral flatness characteristic parameter 5 Whether the value is smaller than the set second spectral flatness threshold value sSMR_thr2, and determining whether the value of the third spectral flatness characteristic parameter ⁇ [ 2 ] is smaller than the set third spectral flatness threshold value ⁇ 3 . If the above conditions are satisfied at the same time, it is determined that the current frame is not background noise. Background—flag is assigned a value of 0.
  • condition 5 it is judged whether the value of the frame energy parameter ⁇ is greater than the set frame energy threshold E - thrl , and if the above condition is satisfied, it is determined that the current frame is not background noise. Background—flag is assigned a value of 0. The value is taken according to the dynamic range of the frame energy parameter.
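The noise-flag decision described above (assume the frame is background noise, then clear the flag as soon as any of conditions 1-5 holds) can be sketched in Python as follows. The function name and the default threshold values are illustrative picks inside the ranges the text gives, not values fixed by the patent:

```python
def is_background_noise(lt_stable_rate0, sp_center_smoothed, tonality_rate2,
                        ssmr, frame_energy,
                        lt_thr1=1.2, spc_thr=2.0, lt_thr2=0.05,
                        tonality_thr=0.5, lt_thr3=0.1,
                        ssmr_thrs=(0.93, 0.93, 0.93), e_thr=50.0):
    """Return background_flag for the current frame (1 = noise, 0 = not noise)."""
    # Condition 1: time-domain stability too high for steady noise.
    if lt_stable_rate0 > lt_thr1:
        return 0
    # Condition 2: high smoothed spectral centroid together with unstable energy.
    if sp_center_smoothed > spc_thr and lt_stable_rate0 > lt_thr2:
        return 0
    # Condition 3: tonal content together with unstable energy.
    if tonality_rate2 > tonality_thr and lt_stable_rate0 > lt_thr3:
        return 0
    # Condition 4: every sub-band spectrum is far from flat (peaky).
    if all(s < t for s, t in zip(ssmr, ssmr_thrs)):
        return 0
    # Condition 5: frame energy above the energy threshold.
    if frame_energy > e_thr:
        return 0
    return 1
```

Each condition is a sufficient reason to declare the frame "not noise"; only a frame that fails all five keeps background_flag = 1.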
  • Step 204: calculate the tonality signal flag according to the tonality feature parameter, the spectral centroid feature parameter, the time-domain stability feature parameter, and the spectral flatness feature parameter.
  • Step 205: calculate the current-frame background noise energy according to the frame energy parameter, the background noise flag, and the tonality signal flag, as follows:
  • If the background noise flag of the current frame is 1, update the accumulator and counter: E_t_sum = E_t_sum_{-1} + E_t1; N_Et_counter = N_Et_counter_{-1} + 1;
  • where E_t_sum_{-1} is the background noise energy accumulator of the previous frame,
  • and N_Et_counter_{-1} is the background noise energy frame count computed at the previous frame.
  • The full-band background noise energy is obtained as the ratio of the background noise energy accumulator E_t_sum to the accumulated frame count N_Et_counter.
  • If the tonality flag tonality_flag equals 1 and the value of the frame energy parameter E_t1 is smaller than the value of the background noise energy feature parameter E_t_sum multiplied by a gain coefficient gain, then
  • E_t_sum = E_t_sum_{-1} * gain + delta, where the value of gain ranges over [0.3, 1].
  • An embodiment of the present invention provides a voice activity detection method, which can be combined with the technical solutions provided by the first and second embodiments of the present invention, to calculate the tonality signal flag, including:
  • Whether the current frame is a tonal signal is determined according to the tonality feature parameter, the time-domain stability feature parameter, the spectral flatness feature parameter, and the spectral centroid feature parameter.
  • When determining whether it is a tonal signal, the following operations are performed: 1. A tonal frame flag tonality_frame is used to indicate whether the current frame is a tonal frame;
  • a tonality_frame value of 1 indicates that the current frame is a tonal frame, and 0 indicates that the current frame is a non-tonal frame;
  • the range of tonality_decision_thr1 is [0.5, 0.7];
  • the range of tonality_decision_thr2 is [0.7, 0.99].
  • If the spectral centroid feature parameter value is greater than a set first spectral centroid decision threshold spc_decision_thr1, and the spectral flatness feature parameter of each subband is smaller than its corresponding preset spectral flatness threshold, that is, the first spectral flatness feature parameter sSMR[0] is smaller than a set first spectral flatness decision threshold sSMF_decision_thr1, or the second spectral flatness parameter sSMR[1] is smaller than a set second spectral flatness decision threshold sSMF_decision_thr2, or the third spectral
  • flatness parameter sSMR[2] is smaller than a set third spectral flatness decision threshold sSMF_decision_thr3, then the current frame is determined to be a tonal frame, and the tonal frame flag is set.
  • lt_stable_decision_thr1 ranges over [0.01, 0.25];
  • spc_decision_thr1 ranges over [1.0, 1.8];
  • sSMF_decision_thr2 ranges over [0.6, 0.9];
  • sSMF_decision_thr3 ranges over [0.7, 0.98].
  • The tonality degree feature parameter tonality_degree is updated, where the initial value of tonality_degree is set when the voice activity detection apparatus starts working, with value range [0, 1].
  • In different cases, the tonality degree feature parameter tonality_degree is calculated differently:
  • tonality_degree = tonality_degree_{-1} * td_scale_A + td_scale_B;
  • where tonality_degree_{-1} is the tonality degree feature parameter of the previous frame, whose initial value ranges over [0, 1];
  • td_scale_A is an attenuation coefficient with value range [0, 1]; td_scale_B is an accumulation coefficient with value range [0, 1].
  • If the updated tonality degree feature parameter tonality_degree is greater than a set tonality degree threshold, the current frame is determined to be a tonal signal; otherwise, the current frame is determined to be a non-tonal signal.
  • An embodiment of the present invention further provides a voice activity detection apparatus.
  • The apparatus includes: a joint decision module 301, configured to obtain the final joint VAD decision result according to the number of consecutive active-sound frames, the average full-band signal-to-noise ratio, the tonality signal flag,
  • and at least two existing VAD decision results.
  • The apparatus further includes a parameter obtaining module 302.
  • The structure of the parameter obtaining module 302 is shown in FIG. 4 and includes:
  • a first parameter obtaining unit 3021, configured to obtain the subband signals and spectral amplitudes of the current frame;
  • a second parameter obtaining unit 3022, configured to calculate values of the frame energy parameter, spectral centroid feature parameter, and time-domain stability feature parameter of the current frame from the subband signals;
  • a third parameter obtaining unit 3023, configured to calculate values of the spectral flatness feature parameter and the tonality feature parameter from the spectral amplitudes (for the calculation method, refer to Embodiment 3 of the present invention);
  • a fourth parameter obtaining unit 3024, configured to calculate the tonality signal flag according to the tonality feature parameter, the spectral centroid feature parameter, the time-domain stability feature parameter, and the spectral flatness feature parameter.
  • The parameter obtaining module 302 further includes:
  • a fifth parameter obtaining unit 3025, configured to obtain the background noise energy estimated for the previous frame (for the calculation method, refer to the second embodiment of the present invention);
  • a sixth parameter obtaining unit 3026, configured to calculate the average full-band signal-to-noise ratio according to the background noise energy estimated for the previous frame
  • and the frame energy parameter of the current frame.
  • The parameter obtaining module 302 further includes:
  • a seventh parameter obtaining unit 3027, configured to determine that the number of consecutive active-sound frames is 0 when the current frame is the first frame;
  • otherwise, the current number of consecutive active-sound frames is calculated from the previous joint VAD decision result.
  • The parameter obtaining module 302 further includes:
  • an eighth parameter obtaining unit 3028, configured to obtain at least two existing VAD decision results.
  • Embodiments of the present invention provide a voice activity detection method and apparatus that obtain a final joint VAD decision result according to the number of consecutive active-sound frames, the average full-band signal-to-noise ratio, the tonality signal flag, and at least two existing VAD decision results.
  • The VAD decision is made comprehensively from multiple parameters, which improves the accuracy of the VAD decision and solves the problem of inaccurate VAD detection.
  • The steps of the above embodiments may also be implemented using integrated circuits: these steps may be separately fabricated as one or more integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module.
  • The embodiments of the invention are not limited to any particular combination of hardware and software.
  • The devices/function modules/functional units in the above embodiments may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed over a network of multiple computing devices.
  • When each device/function module/functional unit in the above embodiments is implemented in the form of a software function module and sold or used as a stand-alone product, it can be stored in a computer-readable storage medium.
  • The above-mentioned computer-readable storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Mathematical Physics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voice activity detection method and apparatus, the method including: obtaining a final joint VAD decision result according to the number of consecutive active-sound frames, the average full-band signal-to-noise ratio, the tonality signal flag, and at least two existing VAD decision results (107). The method and apparatus are applicable to speech services, solve the problem of inaccurate VAD detection in the related art, and achieve a highly accurate VAD decision.

Description

Voice Activity Detection Method and Apparatus

Technical Field

The present invention relates to the field of communications, and in particular to a voice activity detection method and apparatus.

Background

In a normal voice call, the user sometimes speaks and sometimes listens, so inactive-sound periods appear during the call; normally, the total inactive period of both parties exceeds 50% of the total speech-coding duration of the call. During an inactive period there is only background noise, which usually carries no useful information. Exploiting this fact, audio signal processing detects active sound and inactive sound with a voice activity detection (VAD) algorithm and processes them with different methods. Many modern speech coding standards, such as AMR and AMR-WB, support VAD. In terms of efficiency, the VAD of these encoders does not achieve good performance under all typical background noises; their VAD efficiency is especially low under non-stationary noise. For music signals, these VADs sometimes make erroneous detections, causing noticeable quality degradation in the corresponding processing algorithms. In addition, related VAD techniques can make inaccurate decisions: some VAD techniques are inaccurate a few frames before a speech segment, and some are inaccurate a few frames after it.
Summary of the Invention

Embodiments of the present invention provide a voice activity detection method and apparatus, which solve the problem of inaccurate VAD detection. A voice activity detection method includes:

obtaining a final joint VAD decision result according to the number of consecutive active-sound frames, the average full-band signal-to-noise ratio, the tonality signal flag, and at least two existing VAD decision results.

Preferably, the method further includes:

obtaining the subband signals and spectral amplitudes of the current frame;

calculating values of the frame energy parameter, the spectral centroid feature parameter, and the time-domain stability feature parameter of the current frame from the subband signals;

calculating values of the spectral flatness feature parameter and the tonality feature parameter from the spectral amplitudes; and calculating the tonality signal flag according to the tonality feature parameter, the spectral centroid feature parameter, the time-domain stability feature parameter, and the spectral flatness feature parameter.

Preferably, the method further includes:

obtaining the background noise energy estimated for the previous frame;

calculating the average full-band signal-to-noise ratio according to the background noise energy estimated for the previous frame and the frame energy parameter of the current frame.
Preferably, obtaining the background noise energy estimated for the previous frame includes:

obtaining the subband signals and spectral amplitudes of the previous frame;

calculating values of the frame energy parameter, spectral centroid feature parameter, and time-domain stability feature parameter of the previous frame from its subband signals;

calculating the spectral flatness feature parameter and tonality feature parameter of the previous frame from its spectral amplitudes;

calculating the background noise flag of the previous frame according to the previous frame's frame energy parameter, spectral centroid feature parameter, time-domain stability feature parameter, spectral flatness feature parameter, and tonality feature parameter;

calculating the tonality signal flag of the previous frame according to the previous frame's tonality feature parameter, spectral centroid feature parameter, time-domain stability feature parameter, and spectral flatness feature parameter; and obtaining the full-band background noise energy of the previous frame according to the previous frame's background noise flag, frame energy parameter, and tonality signal flag, together with the full-band background noise energy of the frame before the previous frame.

Preferably, the frame energy parameter is a weighted or direct sum of the energies of the subband signals;

the spectral centroid feature parameter is the ratio of a weighted sum to an unweighted sum of the energies of all or some of the subband signals, or a value obtained by smoothing that ratio;

the time-domain stability feature parameter is the ratio of the variance of multiple sums of energy amplitudes of adjacent frame pairs to the expectation of the squares of those sums, or that ratio multiplied by a coefficient; the spectral flatness feature parameter is the ratio of the geometric mean to the arithmetic mean of one or more spectral amplitudes, or that ratio multiplied by a coefficient;

and the tonality feature parameter is obtained by computing the correlation coefficient of the intra-frame spectral difference coefficients of the two most recent frames, or by further smoothing that correlation coefficient.
Preferably, calculating the tonality signal flag according to the tonality feature parameter, the spectral centroid feature parameter, the time-domain stability feature parameter, and the spectral flatness feature parameter includes:

A) when the current frame signal is a non-tonal signal, using a tonal frame flag tonality_frame to indicate whether the current frame is a tonal frame;

B) executing step C) when either of the following conditions is satisfied, and executing step D) when neither is satisfied:

Condition 1: the value of the tonality feature parameter tonality_rate1, or its smoothed value, is greater than a corresponding set first tonality feature parameter decision threshold;

Condition 2: the value of the tonality feature parameter tonality_rate1, or its smoothed value, is greater than a corresponding set second tonality feature parameter threshold;

C) judging whether the current frame is a tonal frame and setting the value of the tonal frame flag according to the result: determining that the current frame is a tonal frame when all of the following conditions are satisfied, and determining that it is a non-tonal frame and executing step D) when any one or more of them are not satisfied:

Condition 1: the time-domain stability feature parameter value is smaller than a set first time-domain stability decision threshold;

Condition 2: the spectral centroid feature parameter value is greater than a set first spectral centroid decision threshold; Condition 3: the spectral flatness feature parameter of each subband is smaller than its corresponding preset spectral flatness decision threshold; in which case the current frame is determined to be a tonal frame and the value of the tonal frame flag is set;

D) updating the tonality degree feature parameter tonality_degree according to the tonal frame flag, where the initial value of the tonality degree parameter tonality_degree is set when voice activity detection starts working;

E) judging whether the current frame is a tonal signal according to the updated tonality degree feature parameter tonality_degree, and setting the value of the tonality flag tonality_flag.

Preferably, when the current tonal frame flag indicates that the current frame is a tonal frame, the tonality degree feature parameter tonality_degree is updated with the expression: tonality_degree = tonality_degree_{-1} * td_scale_A + td_scale_B,

where tonality_degree_{-1} is the tonality degree feature parameter of the previous frame, whose initial value ranges over [0, 1], td_scale_A is an attenuation coefficient, and td_scale_B is an accumulation coefficient.

Preferably, the current frame is determined to be a tonal signal when the tonality degree feature parameter tonality_degree is greater than a set tonality degree threshold;

and a non-tonal signal when tonality_degree is less than or equal to the set tonality degree threshold.
Preferably, the method further includes:

when the current frame is the second frame or a later speech frame, calculating the current number of consecutive active-sound frames continuous_speech_num2 from the previous joint VAD decision result:

when the joint VAD flag vad_flag is 1, continuous_speech_num2 is incremented by 1; when vad_flag is 0, continuous_speech_num2 is set to 0.

Preferably, when the current frame is the first frame, the number of consecutive active-sound frames is 0. When an existing VAD decision result or the joint VAD decision result is 1, it denotes an active-sound frame; when it is 0, it denotes an inactive-sound frame. Obtaining the final joint VAD decision result according to the number of consecutive active-sound frames, the average full-band signal-to-noise ratio, the tonality signal flag, and at least two existing VAD decision results includes:

selecting a logical operation of at least two existing VAD decision results as the joint VAD decision result when any one of the following conditions is satisfied, and selecting one of the at least two existing VAD decision results as the joint VAD decision result when at least one of the following conditions is not satisfied, where the logical operation is an OR operation or an AND operation:

Condition 1: the average full-band signal-to-noise ratio is greater than a signal-to-noise ratio threshold;

Condition 2: continuous_speech_num2 is greater than a threshold on the number of consecutive active-sound frames and the average full-band signal-to-noise ratio is greater than the signal-to-noise ratio threshold;

Condition 3: the tonality signal flag is set to 1.

Alternatively, obtaining the final joint VAD decision result according to the number of consecutive active-sound frames, the average full-band signal-to-noise ratio, the tonality signal flag, and at least two existing VAD decision results includes:

the joint VAD decision result being 1 when any one of the following conditions is satisfied, and a logical operation of the at least two existing VAD decision results being selected as output when at least one of the following conditions is not satisfied, where the logical operation is an OR operation or an AND operation:

Condition 1: the at least two existing VAD decision results are all 1;

Condition 2: the sum of the at least two existing VAD decision results is greater than a joint decision threshold, and the tonality signal flag is set to 1;

Condition 3: continuous_speech_num2 is greater than the threshold on the number of consecutive active-sound frames, the average full-band signal-to-noise ratio is greater than the signal-to-noise ratio threshold, and the tonality signal flag is set to 1.
An embodiment of the present invention further provides a voice activity detection apparatus, including:

a joint decision module configured to obtain the final joint VAD decision result according to the number of consecutive active-sound frames, the average full-band signal-to-noise ratio, the tonality signal flag, and at least two existing VAD decision results.

Preferably, the apparatus further includes a parameter obtaining module, which includes: a first parameter obtaining unit configured to obtain the subband signals and spectral amplitudes of the current frame; a second parameter obtaining unit configured to calculate values of the frame energy parameter, spectral centroid feature parameter, and time-domain stability feature parameter of the current frame from the subband signals;

a third parameter obtaining unit configured to calculate values of the spectral flatness feature parameter and the tonality feature parameter from the spectral amplitudes; and

a fourth parameter obtaining unit configured to calculate the tonality signal flag according to the tonality feature parameter, the spectral centroid feature parameter, the time-domain stability feature parameter, and the spectral flatness feature parameter.

Preferably, the parameter obtaining module further includes:

a fifth parameter obtaining unit configured to obtain the background noise energy estimated for the previous frame; and a sixth parameter obtaining unit configured to calculate the average full-band signal-to-noise ratio according to the background noise energy estimated for the previous frame and the frame energy parameter of the current frame.

Preferably, the parameter obtaining module further includes: a seventh parameter obtaining unit configured to determine that the number of consecutive active-sound frames is 0 when the current frame is the first frame,

and, when the current frame is the second frame or a later speech frame, to calculate the current number of consecutive active-sound frames continuous_speech_num2 from the previous joint VAD decision result:

when the joint VAD flag vad_flag is 1, continuous_speech_num2 is incremented by 1; when vad_flag is 0, continuous_speech_num2 is set to 0.

Embodiments of the present invention provide a voice activity detection method and apparatus that obtain a final joint VAD decision result according to the number of consecutive active-sound frames, the average full-band signal-to-noise ratio, the tonality signal flag, and at least two existing VAD decision results, thereby making the VAD decision comprehensively from multiple parameters, improving the accuracy of the VAD decision, and solving the problem of inaccurate VAD detection.
Brief Description of the Drawings

FIG. 1 is a flowchart of a voice activity detection method provided by Embodiment 1 of the present invention; FIG. 2 is a flowchart of a voice activity detection method provided by Embodiment 2 of the present invention; FIG. 3 is a schematic structural diagram of a voice activity detection apparatus provided by Embodiment 4 of the present invention; FIG. 4 is a schematic structural diagram of the parameter obtaining module 302 in FIG. 3.

Preferred Embodiments of the Invention

To solve the problem of inaccurate VAD detection, embodiments of the present invention provide a voice activity detection method. The embodiments of the present invention are described in detail below with reference to the accompanying drawings. Where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with one another arbitrarily.
Embodiment 1 of the present invention is described below with reference to the accompanying drawings.

An embodiment of the present invention provides a voice activity detection method; the flow of completing VAD with this method is shown in FIG. 1 and includes:

Step 101: obtain at least two existing VAD decision results; Step 102: obtain the subband signals and spectral amplitudes of the current frame.

The embodiments of the present invention are described taking an audio stream with a frame length of 20 ms and a sampling rate of 32 kHz as an example; under other frame lengths and sampling rates, the voice activity detection method provided by the embodiments of the present invention is equally applicable.

The time-domain signal of the current frame is input to a filter bank unit and subband filtering is performed to obtain the filter bank subband signals.

A 40-channel filter bank is used in the embodiments of the present invention; the technical solution provided by the embodiments of the present invention is equally applicable to filter banks with other numbers of channels.

The time-domain signal of the current frame is input to the 40-channel filter bank and subband filtering is performed to obtain the filter bank subband signals X[k, l] of 40 subbands at 16 time samples, 0 <= k < 40, 0 <= l < 16, where k is the index of a filter bank subband, whose value indicates the subband a coefficient corresponds to, and l is the time-sample index within each subband. The implementation steps are as follows:

1: Store the most recent 640 audio signal samples in a data cache.

2: Shift the data in the cache by 40 positions, moving the oldest 40 samples out of the cache, and store 40 new samples at positions 0 to 39.

Multiply the cached data x by the window coefficients to obtain the array z, computed as: z[n] = x[n] * W_qmf[n], 0 <= n < 640, where W_qmf are the filter bank window coefficients.
An 80-point array u is computed with the following pseudocode:

for (n = 0; n < 80; n++) {
    u[n] = 0;
    for (j = 0; j < 8; j++)
        u[n] += z[n + j * 80];
}

The arrays r and i are computed with the equations: r[n] = u[n] - u[79 - n], i[n] = u[n] + u[79 - n], 0 <= n < 40. The 40 complex subband samples at the first time sample are then computed as X[k, l] = R(k) + i * I(k), 0 <= k < 40, where R(k) and I(k) are respectively the real and imaginary parts of the coefficient at the l-th time sample of the filter bank subband signal. (The expressions for R(k) and I(k) are given in the original as an equation image, imgf000010_0001.)

3: Repeat the computation of step 2 until all the data of this frame have been filtered by the filter bank; the final output is the filter bank subband signal X[k, l].

4: After the above computation is completed, the filter bank subband signals of 16 time samples for 40 subbands are obtained: X[k, l], 0 <= k < 40, 0 <= l < 16. Then a time-frequency transform is applied to the filter bank subband signals and the spectral amplitudes are calculated.
Any embodiment of the present invention can be realized by applying the time-frequency transform to all or some of the filter bank subbands and computing the spectral amplitudes. The time-frequency transform method in the embodiments of the present invention may be DFT, FFT, DCT, or DST. This embodiment uses DFT as an example to illustrate the implementation. The computation proceeds as follows:

A 16-point DFT is applied to the 16 time-sample data on each filter bank subband with index 0 to 9 to raise the spectral resolution, the amplitude at each frequency bin is calculated, and the spectral amplitudes X_DFT_AMP are obtained.

The time-frequency transform is computed as:

X_DFT[k, j] = sum_{l=0}^{15} X[k, l] * e^(-2*pi*i*j*l/16), 0 <= k < 10, 0 <= j < 16.

The amplitude at each frequency bin is computed as follows:

First, the energy of the array X_DFT at each point is computed as:

X_DFT_POW[k, j] = (real(X_DFT[k, j]))^2 + (image(X_DFT[k, j]))^2, 0 <= k < 10, 0 <= j < 16,

where real(X_DFT[k, j]) and image(X_DFT[k, j]) denote the real and imaginary parts of the spectral coefficient X_DFT[k, j].

If k is even, the spectral amplitude at each frequency bin is computed as:

X_DFT_AMP[8k + j] = sqrt(X_DFT_POW[k, j] + X_DFT_POW[k, 15 - j]), 0 <= k < 10, 0 <= j < 8.

If k is odd, the spectral amplitude at each frequency bin is computed as:

X_DFT_AMP[8k + 7 - j] = sqrt(X_DFT_POW[k, j] + X_DFT_POW[k, 15 - j]), 0 <= k < 10, 0 <= j < 8.

X_DFT_AMP is the spectral amplitude after the time-frequency transform.
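The per-subband 16-point DFT above can be sketched as follows. This is a minimal naive DFT over one subband's 16 (possibly complex) time samples, returning the bin magnitudes; the patent's mirrored-bin combination for even/odd subband indices is omitted for brevity:

```python
import cmath

def dft16_amplitudes(subband_samples):
    """Naive 16-point DFT of one filter-bank sub-band's 16 time samples,
    returning the magnitude of each of the 16 frequency bins."""
    n = 16
    assert len(subband_samples) == n
    amps = []
    for j in range(n):
        acc = 0j
        for l, x in enumerate(subband_samples):
            # DFT kernel e^(-2*pi*i*j*l/16), matching the expression above.
            acc += x * cmath.exp(-2j * cmath.pi * j * l / n)
        amps.append(abs(acc))
    return amps
```

In practice an FFT would replace the double loop; the naive form is shown only to mirror the formula term by term.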
Step 103: calculate values of the frame energy parameter, spectral centroid feature parameter, and time-domain stability feature parameter of the current frame from the subband signals, and values of the spectral flatness feature parameter and tonality feature parameter from the spectral amplitudes.

The frame energy parameter may be obtained with existing methods; preferably, the parameters are obtained as follows:

The frame energy parameter is a weighted or direct sum of the energies of the subband signals:

1. Compute the energy of each filter bank subband from the filter bank subband signal X[k, l]:

E_sb[k] = sum_{l=0}^{15} ((real(X[k, l]))^2 + (image(X[k, l]))^2), 0 <= k < 40.

2. Accumulate the energies of some perceptually sensitive filter bank subbands, or of all filter bank subbands, to obtain the frame energy parameter.

According to the psychoacoustic model, the human ear is relatively insensitive to very low frequencies (e.g., below 100 Hz) and high frequencies (e.g., above 20 kHz). In the embodiments of the present invention, among the filter bank subbands arranged from low to high frequency, the subbands from the second to the second-to-last are taken as the main perceptually sensitive filter bank subbands, and the energies of some or all of the perceptually sensitive filter bank subbands are accumulated to obtain frame energy parameter 1, computed as:

E_t1 = sum_{n=e_sb_start}^{e_sb_end} E_sb[n],

where e_sb_start is the start subband index, with value range [0, 6], and e_sb_end is the end subband index, with a value greater than 6 and smaller than the total number of subbands. Frame energy parameter 2 is obtained by adding to frame energy parameter 1 the weighted energies of some or all of the filter bank subbands not used in computing frame energy parameter 1:

E_t2 = E_t1 + e_scale1 * sum(E_sb[n] for n below e_sb_start) + e_scale2 * sum(E_sb[n] for n above e_sb_end),

where e_scale1 and e_scale2 are weighting scale factors, each with value range [0, 1].
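The two frame-energy parameters above can be sketched as follows; the default index bounds and scale factors are illustrative picks inside the stated ranges:

```python
def frame_energy(subband_energies, e_sb_start=1, e_sb_end=39,
                 e_scale1=0.5, e_scale2=0.5):
    """Frame-energy parameters E_t1 and E_t2 from per-sub-band energies.

    E_t1 sums the perceptually sensitive sub-bands [e_sb_start, e_sb_end);
    E_t2 adds back a weighted share of the remaining low/high sub-bands.
    """
    e_t1 = sum(subband_energies[e_sb_start:e_sb_end])
    low = sum(subband_energies[:e_sb_start])    # sub-bands below e_sb_start
    high = sum(subband_energies[e_sb_end:])     # sub-bands above e_sb_end
    e_t2 = e_t1 + e_scale1 * low + e_scale2 * high
    return e_t1, e_t2
```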
The spectral centroid feature parameter is the ratio of a weighted sum to an unweighted sum of the energies of all or some of the subband signals.

The spectral centroid feature parameter is computed from the energies of the filter bank subbands: it is the ratio of a weighted sum of the filter bank subband energies to a direct sum of those energies, or is obtained by smoothing other spectral centroid feature parameter values.

The spectral centroid feature parameter can be realized with the following sub-steps:

1. Divide the subband intervals used for the spectral centroid computation as shown in Table 1.

Table 1 [table image in the original, imgf000012_0002: the start and end subband indices of the two spectral centroid intervals]

2. Using the interval division of Table 1 and the following expression, compute two spectral centroid feature parameter values, namely the first-interval and second-interval spectral centroid feature parameters:

sp_center[k] = (sum_n (n + 1) * E_sb[n + spc_start_band(k)] + Delta1) / (sum_n E_sb[n + spc_start_band(k)] + Delta2), 0 <= k < 2,

where Delta1 and Delta2 are small offset values with range (0, 1), and k is the spectral centroid index.

3. Apply a smoothing filter to the first-interval spectral centroid feature parameter sp_center[0] to obtain the smoothed spectral centroid feature parameter value, i.e., the smoothed value of the first-interval spectral centroid feature parameter:

sp_center[2] = sp_center_{-1}[2] * spc_sm_scale + sp_center[0] * (1 - spc_sm_scale),

where spc_sm_scale is the spectral centroid smoothing scale factor and sp_center_{-1}[2] is the smoothed spectral centroid feature parameter value of the previous frame, with initial value 1.6.
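The centroid ratio and its smoothing above can be sketched as follows; the default offsets and smoothing factor are illustrative values inside the stated ranges:

```python
def spectral_centroid(subband_energies, start, end, delta1=0.1, delta2=0.1):
    """Spectral-centroid feature over sub-bands [start, end): ratio of the
    index-weighted energy sum to the plain energy sum.  delta1/delta2 are
    small offsets in (0, 1) that keep the ratio well defined."""
    num = sum((n + 1) * subband_energies[start + n] for n in range(end - start))
    den = sum(subband_energies[start + n] for n in range(end - start))
    return (num + delta1) / (den + delta2)

def smooth_centroid(prev_smoothed, centroid, spc_sm_scale=0.7):
    """One-pole smoothing of the first-interval centroid; prev_smoothed is
    initialised to 1.6 per the text."""
    return prev_smoothed * spc_sm_scale + centroid * (1.0 - spc_sm_scale)
```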
The time-domain stability feature parameter is the ratio of the variance of several sums of energy amplitudes of adjacent frame pairs to the expectation of the squares of those sums, or that ratio multiplied by a coefficient.

The time-domain stability feature parameter is computed from the frame energy parameters of the most recent frames; in this embodiment, the frame energy parameters of the latest 40 frames are used. The computation steps are:

First, compute the energy amplitudes of the latest 40 frames:

Amp_t1[n] = sqrt(E_t2(n)) + e_offset, 0 <= n < 40,

where e_offset is an offset value with range [0, 0.1].

Next, add the energy amplitudes of adjacent frame pairs, from the current frame back to the 40th previous frame, to obtain 20 amplitude sums:

Amp_t2(n) = Amp_t1(-2n) + Amp_t1(-2n - 1), 0 <= n < 20,

where n = 0 denotes the current frame and n < 0 denotes the n-th frame before the current frame.

Finally, the time-domain stability feature parameter lt_stable_rate0 is obtained as the ratio of the variance of the latest 20 amplitude sums to the expectation of their squares (the original gives the expression as an equation image, imgf000013_0001).
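The time-domain stability computation above can be sketched as follows. Since the exact expression is only available as an image in the source, this is an assumption that follows the prose definition literally (variance of the 20 pairwise amplitude sums divided by the mean of their squares):

```python
import math

def lt_stable_rate0(frame_energies, e_offset=0.05):
    """Time-domain stability over the latest 40 frame energies: low for
    steady (noise-like) energy, higher for fluctuating (speech-like) energy."""
    assert len(frame_energies) == 40
    amps = [math.sqrt(e) + e_offset for e in frame_energies]
    sums = [amps[2 * n] + amps[2 * n + 1] for n in range(20)]
    mean = sum(sums) / 20.0
    var = sum((s - mean) ** 2 for s in sums) / 20.0      # variance of the sums
    mean_sq = sum(s * s for s in sums) / 20.0            # expectation of squares
    return var / mean_sq
```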
The spectral flatness feature parameter is the ratio of the geometric mean to the arithmetic mean of certain spectral amplitudes, or that ratio multiplied by a coefficient.

The spectral amplitudes X_DFT_AMP are divided into several frequency bands, and the spectral flatness of each band of the current frame is computed to obtain the spectral flatness feature parameters of the current frame.

The embodiment of the present invention divides the spectral amplitudes into 3 frequency bands and computes the spectral flatness features of these 3 bands, as follows:

First, divide X_DFT_AMP into 3 frequency bands according to the indices of Table 2 below.

Table 2 [table image in the original, imgf000014_0002: the start and end frequency-bin indices freq_band_start(k) and freq_band_end(k) of the three bands]

Next, compute the spectral flatness of each band to obtain the spectral flatness feature parameters of the current frame: each value SMR(k) is the geometric mean of the band's spectral amplitudes divided by their arithmetic mean over the N(k) = freq_band_end(k) - freq_band_start(k) + 1 bins of band k (the original gives the expression as an equation image, imgf000014_0001).

Finally, smooth the spectral flatness feature parameters of the current frame to obtain its final spectral flatness feature parameters:

sSMR(k) = smr_scale * sSMR_{-1}(k) + (1 - smr_scale) * SMR(k), 0 <= k < 3,

where smr_scale is a smoothing factor with value range [0.6, 1] and sSMR_{-1}(k) is the value of the k-th spectral flatness feature parameter of the previous frame.
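The flatness ratio and its smoothing above can be sketched as follows; the default smoothing factor is an illustrative value inside the stated [0.6, 1] range:

```python
import math

def spectral_flatness(amplitudes):
    """Spectral flatness of one frequency band: geometric mean of the
    spectral amplitudes divided by their arithmetic mean.  Close to 1 for a
    flat (noise-like) band, close to 0 for a peaky (tonal) band."""
    n = len(amplitudes)
    geo = math.exp(sum(math.log(a) for a in amplitudes) / n)
    arith = sum(amplitudes) / n
    return geo / arith

def smooth_flatness(prev_ssmr, smr, smr_scale=0.7):
    """One-pole smoothing of the flatness feature across frames."""
    return smr_scale * prev_ssmr + (1.0 - smr_scale) * smr
```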
The tonality feature parameter is obtained by computing the correlation of the intra-frame spectral difference coefficients of the two most recent frames, or by further smoothing that correlation.

The correlation of the intra-frame spectral difference coefficients of the two most recent frames is computed as follows.

The tonality feature parameter is computed from the spectral amplitudes; it may be computed from all spectral amplitudes or from a subset of them.

The computation steps are:

1. Take the difference between some (no fewer than 8 spectral coefficients) or all spectral amplitudes and their adjacent spectral amplitudes, and set difference results smaller than 0 to 0, obtaining a set of non-negative spectral difference coefficients.

The embodiment of the present invention selects the frequency-bin coefficients with position indices 3 to 61 as an example to compute the tonality feature parameter. The process is as follows:

Take the difference of adjacent spectral amplitudes from bin 3 to bin 61:

spec_dif[n - 3] = X_DFT_AMP(n + 1) - X_DFT_AMP(n), 3 <= n < 62,

and set the entries of spec_dif that are smaller than 0 to zero.

2. Compute the correlation coefficient between the non-negative spectral difference coefficients of the current frame obtained in step 1 and those of the previous frame, obtaining the first tonality feature parameter value:

tonality_rate1 = sum_n (spec_dif[n] * pre_spec_dif[n]) / sqrt(sum_n spec_dif[n]^2 * sum_n pre_spec_dif[n]^2),

where pre_spec_dif are the non-negative spectral difference coefficients of the previous frame.

3. Smooth the first tonality feature parameter value to obtain the second tonality feature parameter value:

tonality_rate2 = tonal_scale * tonality_rate2_{-1} + (1 - tonal_scale) * tonality_rate1,

where tonal_scale is the tonality feature parameter smoothing factor, with value range [0.1, 1], and tonality_rate2_{-1} is the second tonality feature parameter value of the previous frame, whose initial value ranges over [0, 1].
Step 104: calculate the tonality signal flag; refer to the tonality signal computation flow in Embodiment 3 of the present invention.

Step 105: calculate the average full-band signal-to-noise ratio from the full-band background noise energy estimated for the previous frame and the frame energy parameter of the current frame.

For the method of obtaining the full-band background noise energy of the previous frame, see Embodiment 2.

From the estimated full-band background noise energy of the previous frame (see Embodiment 2) and the frame energy parameter of the current frame, the full-band SNR SNR2 is computed as the base-2 logarithm of the ratio of the frame energy to E_t_bg, where E_t_bg is the estimated full-band background noise energy of the previous frame; the principle of obtaining the previous frame's full-band background noise energy is the same as that of obtaining the current frame's.
The average of the full-band SNR SNR2 over the most recent frames is computed to obtain the average full-band SNR SNR2_lt_ave.

Step 106: obtain the number of consecutive active-sound frames.
The number of consecutive active-sound frames continuous_speech_num2 can be computed from the VAD decision results, with initial value 0: when the VAD flag vad_flag is 1, continuous_speech_num2 is incremented by 1; when vad_flag is 0, continuous_speech_num2 is set to 0.
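Steps 105 and 106 above can be sketched as follows. The exact argument of the SNR logarithm is garbled in the source text, so the plain frame-energy-to-background-energy ratio used here is an assumption:

```python
import math

def full_band_snr(frame_energy, bg_energy):
    """Full-band SNR SNR2 of one frame: base-2 log of the ratio of the frame
    energy to the estimated full-band background noise energy (assumed form)."""
    return math.log2(frame_energy / bg_energy)

def average_full_band_snr(recent_snrs):
    """Average full-band SNR SNR2_lt_ave over the latest frames."""
    return sum(recent_snrs) / len(recent_snrs)

def update_continuous_speech_num2(prev_count, vad_flag):
    """Consecutive active-frame counter: increment while the (joint) VAD flag
    is 1, reset to 0 when it is 0."""
    return prev_count + 1 if vad_flag == 1 else 0
```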
Step 107: obtain the final joint VAD decision result according to the number of consecutive active-sound frames, the average full-band signal-to-noise ratio, the tonality signal flag, and at least two existing VAD decision results.

It should be noted that using the values 1 and 0 to denote active-sound and inactive-sound frames is only one way of marking; schemes that distinguish the different VAD decision results with other values or in other ways all fall within the protection scope of the embodiments of the present invention.

Two implementation methods are described below as examples; other joint methods are possible in practice. In the first method, when any one of the following conditions is satisfied, a logical operation of at least two existing VAD decision results is selected as the joint VAD decision result; when at least one of the following conditions is not satisfied, one of the at least two existing VAD decision results is selected as the joint VAD decision result, where the logical operation is an OR operation or an AND operation:

Condition 1: the average full-band signal-to-noise ratio is greater than the signal-to-noise ratio threshold;

Condition 2: continuous_speech_num2 is greater than the threshold on the number of consecutive active-sound frames and the average full-band signal-to-noise ratio is greater than the signal-to-noise ratio threshold;

Condition 3: the tonality signal flag is set to 1.

In the second method, obtaining the final joint VAD decision result according to the number of consecutive active-sound frames, the average full-band signal-to-noise ratio, the tonality signal flag, and at least two existing VAD decision results includes:

the joint VAD decision result being 1 when any one of the following conditions is satisfied, and a logical operation of the at least two existing VAD decision results being selected as output when at least one of the following conditions is not satisfied, where the logical operation is an OR operation or an AND operation:

Condition 1: the at least two existing VAD decision results are all 1;

Condition 2: the sum of the at least two existing VAD decision results is greater than a joint decision threshold, and the tonality signal flag is set to 1;

Condition 3: continuous_speech_num2 is greater than the threshold on the number of consecutive active-sound frames, the average full-band signal-to-noise ratio is greater than the signal-to-noise ratio threshold, and the tonality signal flag is set to 1.
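The second joint-decision method above can be sketched for the two-detector case as follows. The function name and the default threshold values are illustrative assumptions, not values fixed by the patent:

```python
def joint_vad_method2(vad_a, vad_b, avg_snr, speech_num2, tonality_flag,
                      snr_thr=4.0, num_thr=20, joint_thr=1, use_or=True):
    """Joint VAD decision from two existing VAD results (second method).

    Returns 1 when any of the three conditions holds; otherwise falls back to
    a logical OR (or AND) of the existing decisions.
    """
    if vad_a == 1 and vad_b == 1:                           # condition 1
        return 1
    if (vad_a + vad_b) > joint_thr and tonality_flag == 1:  # condition 2
        return 1
    if (speech_num2 > num_thr and avg_snr > snr_thr
            and tonality_flag == 1):                        # condition 3
        return 1
    return (vad_a | vad_b) if use_or else (vad_a & vad_b)
```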
Steps 101 to 106 in the embodiments of the present invention have no strict temporal order (although the order of steps 102, 103, and 104 cannot be reversed); any scheme that obtains the number of consecutive active-sound frames, the average full-band signal-to-noise ratio, the tonality signal flag, and the at least two existing VAD decision results required for the joint VAD decision of the embodiments of the present invention falls within the protection scope of the embodiments of the present invention.
Embodiment 2 of the present invention is described below with reference to the accompanying drawings.

As shown in FIG. 2, a method of calculating the background noise energy of the previous frame is given; the background noise energy of the previous frame is used to compute the average full-band signal-to-noise ratio. The computation flow for the previous frame's background noise energy is the same as that for the current frame's, and this embodiment gives the method of computing the full-band background noise energy of the current frame.

Step 201: obtain the subband signals and spectral amplitudes of the current frame; for the computation, see step 102. Step 202: calculate the values of the current frame energy parameter, spectral centroid feature parameter, and time-domain stability feature parameter from the subband signals, and the values of the spectral flatness feature parameter and tonality feature parameter from the spectral amplitudes; for the computation, see step 103.

Step 203: calculate the background noise flag of the current frame from the current frame's frame energy parameter, spectral centroid feature parameter, time-domain stability feature parameter, spectral flatness feature parameter, and tonality feature parameter. The background noise flag indicates whether the current frame is a noise signal: it is set to 1 if the frame is a noise signal and to 0 otherwise.
Preferably, assume first that the current frame is a background noise signal; if any one of the following conditions holds, the current frame is determined not to be a noise signal:

Condition 1: the time-domain stability feature parameter lt_stable_rate0 is greater than a set time-domain stability threshold;

Condition 2: the smoothed value of the first-interval spectral centroid feature parameter is greater than a set spectral centroid threshold, and the time-domain stability feature parameter value is also greater than the set time-domain stability threshold;

Condition 3: the tonality feature parameter or its smoothed value is greater than a set tonality feature parameter threshold, and the time-domain stability feature parameter lt_stable_rate0 is greater than its set time-domain stability threshold;

Condition 4: the spectral flatness feature parameter of each subband, or its smoothed value, is smaller than the corresponding set spectral flatness threshold;

Condition 5: the frame energy parameter E_t1 is greater than a set frame energy threshold E_thr1.

The embodiment of the present invention uses a background noise flag background_flag to indicate whether the current frame is background noise, with the convention that background_flag is set to 1 if the current frame is determined to be background noise and to 0 otherwise.

Whether the current frame is a noise signal is detected according to the time-domain stability feature parameter, the spectral centroid feature parameter, the spectral flatness feature parameter, the tonality feature parameter, and the current frame energy parameter; if it is not a noise signal, background_flag is set to 0.

The process is as follows:
For condition 1: judge whether the time-domain stability feature parameter lt_stable_rate0 is greater than a set first time-domain stability threshold lt_stable_rate_thr1; if so, the current frame is determined not to be a noise signal and background_flag is set to 0. In the embodiment of the present invention, the first time-domain stability threshold lt_stable_rate_thr1 ranges over [0.8, 1.6]. For condition 2: judge whether the first-interval smoothed spectral centroid feature parameter value is greater than a set first spectral centroid threshold sp_center_thr1, and whether the value of the time-domain stability feature parameter lt_stable_rate0 is also greater than the second time-domain stability threshold lt_stable_rate_thr2; if so, the current frame is determined not to be a noise signal and background_flag is set to 0. The range of sp_center_thr1 is [1.6, 4]; the range of lt_stable_rate_thr2 is (0, 0.1]. For condition 3: judge whether the value of the tonality feature parameter tonality_rate2 is greater than a first tonality feature parameter threshold tonality_rate_thr1 and whether the time-domain stability feature parameter lt_stable_rate0 is greater than a set third time-domain stability threshold lt_stable_rate_thr3; if both conditions hold, the current frame is determined not to be background noise and background_flag is set to 0. The range of tonality_rate_thr1 is [0.4, 0.66]; the range of lt_stable_rate_thr3 is [0.06, 0.3].

For condition 4: judge whether the value of the first spectral flatness feature parameter sSMR[0] is smaller than the set first spectral flatness threshold sSMR_thr1, whether the value of the second spectral flatness feature parameter sSMR[1] is smaller than the set second spectral flatness threshold sSMR_thr2, and whether the value of the third spectral flatness feature parameter sSMR[2] is smaller than the set third spectral flatness threshold sSMR_thr3. If these conditions hold simultaneously, the current frame is determined not to be background noise and background_flag is set to 0. The thresholds sSMR_thr1, sSMR_thr2, and sSMR_thr3 range over [0.88, 0.98]. Alternatively, judge whether the value of the first spectral flatness feature parameter sSMR[0] is smaller than a set fourth spectral flatness threshold sSMR_thr4, whether the value of the second spectral flatness feature parameter sSMR[1] is smaller than a set fifth spectral flatness threshold sSMR_thr5, and whether the value of the third spectral flatness feature parameter sSMR[2] is smaller than a set sixth spectral flatness threshold sSMR_thr6; if any one of these conditions holds, the current frame is determined not to be background noise and background_flag is set to 0. The thresholds sSMR_thr4, sSMR_thr5, and sSMR_thr6 range over

[0.80, 0.92].

For condition 5: judge whether the value of the frame energy parameter E_t1 is greater than the set frame energy threshold E_thr1; if so, the current frame is determined not to be background noise and background_flag is set to 0. E_thr1 takes its value according to the dynamic range of the frame energy parameter.
Step 204: calculate the tonality signal flag according to the tonality feature parameter, the spectral centroid feature parameter, the time-domain stability feature parameter, and the spectral flatness feature parameter; for the steps, see the tonality signal computation flow in Embodiment 3 of the present invention. Step 205: calculate the current-frame background noise energy from the frame energy parameter, the background noise flag, and the tonality signal flag, as follows:
1. If the background noise flag of the current frame is 1, update the background noise energy accumulator E_t_sum and the background noise energy frame counter N_Et_counter as:

E_t_sum = E_t_sum_{-1} + E_t1;

N_Et_counter = N_Et_counter_{-1} + 1;

where E_t_sum_{-1} is the background noise energy accumulator of the previous frame and N_Et_counter_{-1} is the background noise energy frame count computed at the previous frame.

2. The full-band background noise energy is obtained as the ratio of the background noise energy accumulator E_t_sum to the accumulated frame count N_Et_counter:

judge whether N_Et_counter equals 64; if N_Et_counter equals 64, multiply both the background noise energy accumulator E_t_sum and the accumulated frame count N_Et_counter by 0.75.

3. Adjust the background noise energy accumulator according to the tonality signal flag, the frame energy parameter, and the value of the full-band background noise energy, as follows:

if the tonality flag tonality_flag equals 1 and the value of the frame energy parameter E_t1 is smaller than the value of the background noise energy feature parameter E_t_sum multiplied by a gain coefficient gain,

then E_t_sum = E_t_sum_{-1} * gain + delta, where the value of gain ranges over [0.3, 1].
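The three bookkeeping steps of step 205 above can be sketched as follows; the default gain and delta are illustrative (gain is inside the stated [0.3, 1] range):

```python
def update_bg_noise(bg_sum_prev, counter_prev, frame_energy, background_flag,
                    tonality_flag, gain=0.6, delta=0.1):
    """Background-noise energy bookkeeping for one frame.

    Accumulates frame energy while the frame is flagged as noise, scales the
    statistics by 0.75 once 64 frames have accumulated, and pulls the
    accumulator down for low-energy tonal frames.
    """
    bg_sum, counter = bg_sum_prev, counter_prev
    if background_flag == 1:
        bg_sum += frame_energy
        counter += 1
    if counter == 64:                      # keep the running statistics bounded
        bg_sum *= 0.75
        counter = int(counter * 0.75)
    bg_energy = bg_sum / counter if counter > 0 else 0.0
    if tonality_flag == 1 and frame_energy < bg_energy * gain:
        bg_sum = bg_sum * gain + delta     # adjust the accumulator downward
    return bg_sum, counter, bg_energy
```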
Embodiment 3 of the present invention is described below.

An embodiment of the present invention provides a voice activity detection method that can be combined with the technical solutions provided by Embodiments 1 and 2 of the present invention, calculating the tonality signal flag, including:

judging whether the current frame is a tonal signal according to the tonality feature parameter, the time-domain stability feature parameter, the spectral flatness feature parameter, and the spectral centroid feature parameter. When judging whether it is a tonal signal, the following operations are performed: 1. A tonal frame flag tonality_frame is used to indicate whether the current frame is a tonal frame; in the embodiment of the present invention, a tonality_frame value of 1 indicates that the current frame is a tonal frame and 0 indicates that it is a non-tonal frame;
2. Judge whether the value of the tonality feature parameter tonality_rate1, or of its smoothed version tonality_rate2, is greater than the corresponding set first or second tonality feature parameter decision threshold tonality_decision_thr1 or tonality_decision_thr2; if either condition holds, execute step 3, otherwise execute step 4;

where the value range of tonality_decision_thr1 is [0.5, 0.7] and the value range of tonality_decision_thr2 is [0.7, 0.99].

3. If the time-domain stability feature parameter value lt_stable_rate0 is smaller than a set first time-domain stability decision threshold lt_stable_decision_thr1, the spectral centroid feature parameter value is greater than a set first spectral centroid decision threshold spc_decision_thr1, and the spectral flatness feature parameter of each subband is smaller than its corresponding preset spectral flatness threshold — that is, the first spectral flatness feature parameter sSMR[0] is smaller than a set first spectral flatness decision threshold sSMF_decision_thr1, or the second spectral flatness parameter sSMR[1] is smaller than a set second spectral flatness decision threshold sSMF_decision_thr2, or the third spectral flatness parameter sSMR[2] is smaller than a set third spectral flatness decision threshold sSMF_decision_thr3 — then the current frame is determined to be a tonal frame and the tonal frame flag tonality_frame is set to 1; otherwise the current frame is determined to be a non-tonal frame and tonality_frame is set to 0. Then step 4 is executed.

Here, lt_stable_decision_thr1 ranges over [0.01, 0.25], spc_decision_thr1 over [1.0, 1.8], sSMF_decision_thr1 over [0.6, 0.9], sSMF_decision_thr2 over [0.6, 0.9], and sSMF_decision_thr3 over [0.7, 0.98].
4. Update the tonality degree feature parameter tonality_degree according to the tonal frame flag tonality_frame, where the initial value of the tonality degree parameter tonality_degree is set when the voice activity detection apparatus starts working, with value range [0, 1]. The tonality degree feature parameter tonality_degree is computed differently in different cases:

if the current tonal frame flag indicates that the current frame is a tonal frame, the tonality degree feature parameter tonality_degree is updated with the expression: tonality_degree = tonality_degree_{-1} * td_scale_A + td_scale_B,

where tonality_degree_{-1} is the tonality degree feature parameter of the previous frame, whose initial value ranges over [0, 1]; td_scale_A is an attenuation coefficient with value range [0, 1]; and td_scale_B is an accumulation coefficient with value range [0, 1].

5. Judge whether the current frame is a tonal signal according to the updated tonality degree feature parameter tonality_degree, and set the value of the tonality flag tonality_flag:

if the tonality degree feature parameter tonality_degree is greater than the set tonality degree threshold, the current frame is determined to be a tonal signal; otherwise, the current frame is determined to be a non-tonal signal.
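Steps 4 and 5 above can be sketched as follows. The coefficient values are illustrative picks inside the stated [0, 1] ranges, and the non-tonal-frame update rule is not spelled out in the source, so the plain decay used here is an assumption:

```python
def update_tonality_degree(prev_degree, tonality_frame,
                           td_scale_a=0.9, td_scale_b=0.2):
    """Tonality-degree update: degree = prev * td_scale_A + td_scale_B for a
    tonal frame; an assumed plain decay is used for non-tonal frames."""
    if tonality_frame == 1:
        return prev_degree * td_scale_a + td_scale_b
    return prev_degree * td_scale_a  # assumed decay for non-tonal frames

def tonality_flag(degree, degree_thr=0.5):
    """Declare the frame a tonal signal when the updated degree exceeds the
    tonality-degree threshold."""
    return 1 if degree > degree_thr else 0
```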
Embodiment 4 of the present invention is described below with reference to the accompanying drawings.

An embodiment of the present invention further provides a voice activity detection apparatus; as shown in FIG. 3, the apparatus includes: a joint decision module 301, configured to obtain the final joint VAD decision result according to the number of consecutive active-sound frames, the average full-band signal-to-noise ratio, the tonality signal flag, and at least two existing VAD decision results.

Preferably, the apparatus further includes a parameter obtaining module 302, whose structure is shown in FIG. 4, including:

a first parameter obtaining unit 3021, configured to obtain the subband signals and spectral amplitudes of the current frame; a second parameter obtaining unit 3022, configured to calculate values of the frame energy parameter, spectral centroid feature parameter, and time-domain stability feature parameter of the current frame from the subband signals;

a third parameter obtaining unit 3023, configured to calculate values of the spectral flatness feature parameter and the tonality feature parameter from the spectral amplitudes (for the computation method, refer to Embodiment 3 of the present invention); and

a fourth parameter obtaining unit 3024, configured to calculate the tonality signal flag according to the tonality feature parameter, the spectral centroid feature parameter, the time-domain stability feature parameter, and the spectral flatness feature parameter.
Preferably, the parameter obtaining module 302 further includes:

a fifth parameter obtaining unit 3025, configured to obtain the background noise energy estimated for the previous frame (for the computation method, refer to Embodiment 2 of the present invention); and

a sixth parameter obtaining unit 3026, configured to calculate the average full-band signal-to-noise ratio according to the background noise energy estimated for the previous frame and the frame energy parameter of the current frame.

Preferably, the parameter obtaining module 302 further includes:

a seventh parameter obtaining unit 3027, configured to determine that the number of consecutive active-sound frames is 0 when the current frame is the first frame,

and, when the current frame is the second frame or a later speech frame, to calculate the current number of consecutive active-sound frames continuous_speech_num2 from the previous joint VAD decision result: when the joint VAD flag vad_flag is 1, continuous_speech_num2 is incremented by 1;

when vad_flag is 0, continuous_speech_num2 is set to 0.

Preferably, the parameter obtaining module 302 further includes:

an eighth parameter obtaining unit 3028, configured to obtain at least two existing VAD decision results.
The embodiments of the present invention provide a voice activity detection method and apparatus that obtain a final joint VAD decision result according to the number of consecutive active-sound frames, the average full-band signal-to-noise ratio, the tonality signal flag, and at least two existing VAD decision results, thereby making the VAD decision comprehensively from multiple parameters, improving the accuracy of the VAD decision, and solving the problem of inaccurate VAD detection.

Those of ordinary skill in the art can understand that all or some of the steps of the above embodiments can be implemented with a computer program flow; the computer program may be stored in a computer-readable storage medium and executed on a corresponding hardware platform (such as a system, device, apparatus, or component), and, when executed, includes one of the steps of the method embodiments or a combination thereof.

Optionally, all or some of the steps of the above embodiments may also be implemented with integrated circuits: these steps may be separately fabricated as one or more integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module. Thus, the embodiments of the present invention are not limited to any particular combination of hardware and software.

The devices/function modules/functional units in the above embodiments may be implemented with general-purpose computing devices; they may be centralized on a single computing device or distributed over a network formed by multiple computing devices. When the devices/function modules/functional units in the above embodiments are implemented in the form of software function modules and sold or used as stand-alone products, they may be stored in a computer-readable storage medium. The computer-readable storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.

Any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed by the embodiments of the present invention, and such changes or substitutions shall fall within the protection scope of the embodiments of the present invention. Therefore, the protection scope of the embodiments of the present invention shall be subject to the protection scope described in the claims.

Industrial Applicability

Embodiments of the present invention provide a voice activity detection method and apparatus that obtain a final joint VAD decision result according to the number of consecutive active-sound frames, the average full-band signal-to-noise ratio, the tonality signal flag, and at least two existing VAD decision results, thereby making the VAD decision comprehensively from multiple parameters, improving the accuracy of the VAD decision, and solving the problem of inaccurate VAD detection.

Claims

1. A voice activity detection method, comprising:

obtaining a final joint VAD decision result according to the number of consecutive active-sound frames, the average full-band signal-to-noise ratio, the tonality signal flag, and at least two existing voice activity detection (VAD) decision results.
2. The voice activity detection method according to claim 1, further comprising: obtaining the subband signals and spectral amplitudes of the current frame;

calculating values of the frame energy parameter, the spectral centroid feature parameter, and the time-domain stability feature parameter of the current frame from the subband signals;

calculating values of the spectral flatness feature parameter and the tonality feature parameter from the spectral amplitudes; and calculating the tonality signal flag according to the tonality feature parameter, the spectral centroid feature parameter, the time-domain stability feature parameter, and the spectral flatness feature parameter.
3. The voice activity detection method according to claim 1, further comprising: obtaining the background noise energy estimated for the previous frame;

calculating the average full-band signal-to-noise ratio according to the background noise energy estimated for the previous frame and the frame energy parameter of the current frame.
4. The voice activity detection method according to claim 3, wherein obtaining the background noise energy estimated for the previous frame comprises:

obtaining the subband signals and spectral amplitudes of the previous frame;

calculating values of the frame energy parameter, spectral centroid feature parameter, and time-domain stability feature parameter of the previous frame from its subband signals;

calculating the spectral flatness feature parameter and tonality feature parameter of the previous frame from its spectral amplitudes;

calculating the background noise flag of the previous frame according to the previous frame's frame energy parameter, spectral centroid feature parameter, time-domain stability feature parameter, spectral flatness feature parameter, and tonality feature parameter;

calculating the tonality signal flag of the previous frame according to the previous frame's tonality feature parameter, spectral centroid feature parameter, time-domain stability feature parameter, and spectral flatness feature parameter; and obtaining the full-band background noise energy of the previous frame according to the previous frame's background noise flag, frame energy parameter, and tonality signal flag, together with the full-band background noise energy of the frame before the previous frame.
5. The voice activity detection method according to claim 4, wherein

the frame energy parameter is a weighted or direct sum of the energies of the subband signals; the spectral centroid feature parameter is the ratio of a weighted sum to an unweighted sum of the energies of all or some of the subband signals, or a value obtained by smoothing that ratio;

the time-domain stability feature parameter is the ratio of the variance of multiple sums of energy amplitudes of adjacent frame pairs to the expectation of the squares of those sums, or that ratio multiplied by a coefficient; the spectral flatness feature parameter is the ratio of the geometric mean to the arithmetic mean of certain spectral amplitudes, or that ratio multiplied by a coefficient;

and the tonality feature parameter is obtained by computing the correlation coefficient of the intra-frame spectral difference coefficients of the two most recent frames, or by further smoothing that correlation coefficient.
6. The voice activity detection method according to claim 2, wherein calculating the tonality signal flag according to the tonality feature parameter, the spectral centroid feature parameter, the time-domain stability feature parameter, and the spectral flatness feature parameter comprises:

A) when the current frame signal is a non-tonal signal, using a tonal frame flag tonality_frame to indicate whether the current frame is a tonal frame;

B) executing step C) when either of the following conditions is satisfied, and executing step D) when neither is satisfied:

Condition 1: the value of the tonality feature parameter tonality_rate1, or its smoothed value, is greater than a corresponding set first tonality feature parameter decision threshold;

Condition 2: the value of the tonality feature parameter tonality_rate1, or its smoothed value, is greater than a corresponding set second tonality feature parameter threshold;

C) judging whether the current frame is a tonal frame and setting the value of the tonal frame flag according to the result: determining that the current frame is a tonal frame when all of the following conditions are satisfied, and determining that it is a non-tonal frame and executing step D) when any one or more of them are not satisfied: Condition 1: the time-domain stability feature parameter value is smaller than a set first time-domain stability decision threshold;

Condition 2: the spectral centroid feature parameter value is greater than a set first spectral centroid decision threshold; Condition 3: the spectral flatness feature parameter of each subband is smaller than its corresponding preset spectral flatness decision threshold; in which case the current frame is determined to be a tonal frame and the value of the tonal frame flag is set;

D) updating the tonality degree feature parameter tonality_degree according to the tonal frame flag, where the initial value of the tonality degree parameter tonality_degree is set when voice activity detection starts working;

E) judging whether the current frame is a tonal signal according to the updated tonality degree feature parameter tonality_degree, and setting the value of the tonality flag tonality_flag.
7. The voice activity detection method according to claim 6, wherein, when the current tonal frame flag indicates that the current frame is a tonal frame, the tonality degree feature parameter tonality_degree is updated with the expression: tonality_degree = tonality_degree_{-1} * td_scale_A + td_scale_B,

where tonality_degree_{-1} is the tonality degree feature parameter of the previous frame, whose initial value ranges over [0, 1], td_scale_A is an attenuation coefficient, and td_scale_B is an accumulation coefficient.
8. The voice activity detection method according to claim 6, wherein

the current frame is determined to be a tonal signal when the tonality degree feature parameter tonality_degree is greater than a set tonality degree threshold;

and a non-tonal signal when the tonality degree feature parameter tonality_degree is less than or equal to the set tonality degree threshold.
9. The voice activity detection method according to claim 1, further comprising: when the current frame is the second frame or a later speech frame, calculating the current number of consecutive active-sound frames continuous_speech_num2 from the previous joint VAD decision result:

when the joint VAD flag vad_flag is 1, continuous_speech_num2 is incremented by 1; when vad_flag is 0, continuous_speech_num2 is set to 0.

10. The voice activity detection method according to claim 9, further comprising: when the current frame is the first frame, the number of consecutive active-sound frames being 0.
11. The voice activity detection method according to claim 1, wherein an existing VAD decision result or joint VAD decision result of 1 denotes an active-sound frame and 0 denotes an inactive-sound frame, and obtaining the final joint VAD decision result according to the number of consecutive active-sound frames, the average full-band signal-to-noise ratio, the tonality signal flag, and at least two existing VAD decision results comprises:

selecting a logical operation of at least two existing VAD decision results as the joint VAD decision result when any one of the following conditions is satisfied, and selecting one of the at least two existing VAD decision results as the joint VAD decision result when at least one of the following conditions is not satisfied, where the logical operation is an OR operation or an AND operation:

Condition 1: the average full-band signal-to-noise ratio is greater than a signal-to-noise ratio threshold;

Condition 2: continuous_speech_num2 is greater than a threshold on the number of consecutive active-sound frames and the average full-band signal-to-noise ratio is greater than the signal-to-noise ratio threshold;

Condition 3: the tonality signal flag is set to 1.
12. The voice activity detection method according to claim 1, wherein an existing VAD decision result or joint VAD decision result of 1 denotes an active-sound frame and 0 denotes an inactive-sound frame, and obtaining the final joint VAD decision result according to the number of consecutive active-sound frames, the average full-band signal-to-noise ratio, the tonality signal flag, and at least two existing VAD decision results comprises:

the joint VAD decision result being 1 when any one of the following conditions is satisfied, and a logical operation of the at least two existing VAD decision results being selected as output when at least one of the following conditions is not satisfied, where the logical operation is an OR operation or an AND operation: Condition 1: the at least two existing VAD decision results are all 1;

Condition 2: the sum of the at least two existing VAD decision results is greater than a joint decision threshold, and the tonality signal flag is set to 1;

Condition 3: continuous_speech_num2 is greater than the threshold on the number of consecutive active-sound frames, the average full-band signal-to-noise ratio is greater than the signal-to-noise ratio threshold, and the tonality signal flag is set to 1.
13. A voice activity detection apparatus, comprising:

a joint decision module configured to obtain the final joint VAD decision result according to the number of consecutive active-sound frames, the average full-band signal-to-noise ratio, the tonality signal flag, and at least two existing VAD decision results.
14. The voice activity detection apparatus according to claim 13, further comprising a parameter obtaining module, the parameter obtaining module comprising:

a first parameter obtaining unit configured to obtain the subband signals and spectral amplitudes of the current frame; a second parameter obtaining unit configured to calculate values of the frame energy parameter, spectral centroid feature parameter, and time-domain stability feature parameter of the current frame from the subband signals;

a third parameter obtaining unit configured to calculate values of the spectral flatness feature parameter and the tonality feature parameter from the spectral amplitudes; and a fourth parameter obtaining unit configured to calculate the tonality signal flag according to the tonality feature parameter, the spectral centroid feature parameter, the time-domain stability feature parameter, and the spectral flatness feature parameter.
15. The voice activity detection apparatus according to claim 14, wherein the parameter obtaining module further comprises:

a fifth parameter obtaining unit configured to obtain the background noise energy estimated for the previous frame; and a sixth parameter obtaining unit configured to calculate the average full-band signal-to-noise ratio according to the background noise energy estimated for the previous frame and the frame energy parameter of the current frame.
16. The voice activity detection apparatus according to claim 14, wherein the parameter obtaining module further comprises:

a seventh parameter obtaining unit configured to determine that the number of consecutive active-sound frames is 0 when the current frame is the first frame,

and, when the current frame is the second frame or a later speech frame, to calculate the current number of consecutive active-sound frames continuous_speech_num2 from the previous joint VAD decision result:

when the joint VAD flag vad_flag is 1, continuous_speech_num2 is incremented by 1; when vad_flag is 0, continuous_speech_num2 is set to 0.
PCT/CN2014/077704 2013-08-30 2014-05-16 激活音检测方法和装置 WO2014177084A1 (zh)

Priority Applications (5)

Application Number Priority Date Filing Date Title
US14/915,246 US9978398B2 (en) 2013-08-30 2014-05-16 Voice activity detection method and device
JP2016537092A JP6412132B2 (ja) 2013-08-30 2014-05-16 音声活動検出方法及び装置
EP14791094.7A EP3040991B1 (en) 2013-08-30 2014-05-16 Voice activation detection method and device
PL14791094T PL3040991T3 (pl) 2013-08-30 2014-05-16 Sposób i urządzenie do wykrywania aktywacji głosowej
KR1020167005654A KR101831078B1 (ko) 2013-08-30 2014-05-16 보이스 활성화 탐지 방법 및 장치

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310390795.7A CN104424956B9 (zh) 2013-08-30 2013-08-30 激活音检测方法和装置
CN201310390795.7 2013-08-30

Publications (1)

Publication Number Publication Date
WO2014177084A1 true WO2014177084A1 (zh) 2014-11-06

Family

ID=51843162

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/077704 WO2014177084A1 (zh) 2013-08-30 2014-05-16 激活音检测方法和装置

Country Status (7)

Country Link
US (1) US9978398B2 (zh)
EP (1) EP3040991B1 (zh)
JP (1) JP6412132B2 (zh)
KR (1) KR101831078B1 (zh)
CN (1) CN104424956B9 (zh)
PL (1) PL3040991T3 (zh)
WO (1) WO2014177084A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015117410A1 (zh) * 2014-07-18 2015-08-13 中兴通讯股份有限公司 激活音检测的方法及装置
CN109801646A (zh) * 2019-01-31 2019-05-24 北京嘉楠捷思信息技术有限公司 一种基于融合特征的语音端点检测方法和装置
CN112908350A (zh) * 2021-01-29 2021-06-04 展讯通信(上海)有限公司 一种音频处理方法、通信装置、芯片及其模组设备
CN115862685A (zh) * 2023-02-27 2023-03-28 全时云商务服务股份有限公司 一种实时语音活动的检测方法、装置和电子设备

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102013111784B4 (de) * 2013-10-25 2019-11-14 Intel IP Corporation Audio processing devices and audio processing methods
US9953661B2 (en) * 2014-09-26 2018-04-24 Cirrus Logic Inc. Neural network voice activity detection employing running range normalization
CN106328169B (zh) * 2015-06-26 2018-12-11 中兴通讯股份有限公司 Method for acquiring a number of voice activity correction frames, and voice activity detection method and device
CN105654947B (zh) * 2015-12-30 2019-12-31 中国科学院自动化研究所 Method and system for acquiring road condition information from traffic broadcast speech
CN107305774B (zh) * 2016-04-22 2020-11-03 腾讯科技(深圳)有限公司 Voice detection method and device
US10755718B2 (en) * 2016-12-07 2020-08-25 Interactive Intelligence Group, Inc. System and method for neural network based speaker classification
IT201700044093A1 * 2017-04-21 2018-10-21 Telecom Italia Spa Speaker recognition method and system
CN107393559B (zh) * 2017-07-14 2021-05-18 深圳永顺智信息科技有限公司 Method and device for verifying voice detection results
CN109427345B (zh) * 2017-08-29 2022-12-02 杭州海康威视数字技术股份有限公司 Wind noise detection method, device and system
CN109859749A (zh) * 2017-11-30 2019-06-07 阿里巴巴集团控股有限公司 Voice signal recognition method and device
CN114999535A (zh) * 2018-10-15 2022-09-02 华为技术有限公司 Voice data processing method and device in an online translation process
CN111292758B (zh) * 2019-03-12 2022-10-25 展讯通信(上海)有限公司 Voice activity detection method and device, and readable storage medium
KR20200114019A (ko) 2019-03-27 2020-10-07 주식회사 공훈 Speaker identification method and device based on pitch information of voice
EP3800640A4 (en) * 2019-06-21 2021-09-29 Shenzhen Goodix Technology Co., Ltd. VOICE DETECTION METHOD, VOICE DETECTION DEVICE, VOICE PROCESSING CHIP AND ELECTRONIC DEVICE
US11823706B1 (en) * 2019-10-14 2023-11-21 Meta Platforms, Inc. Voice activity detection in audio signal
CN111739562B (zh) * 2020-07-22 2022-12-23 上海大学 Voice activity detection method based on data selectivity and Gaussian mixture models

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1473321A * 2000-09-09 2004-02-04 英特尔公司 Voice activity detector for integrated telephony processing
CN102044242A * 2009-10-15 2011-05-04 华为技术有限公司 Voice activity detection method, device and electronic equipment
CN102687196A * 2009-10-08 2012-09-19 西班牙电信公司 Method for detecting voice segments
CN102741918A * 2010-12-24 2012-10-17 华为技术有限公司 Method and apparatus for voice activity detection
CN103117067A * 2013-01-19 2013-05-22 渤海大学 Voice endpoint detection method under low signal-to-noise ratio
CN103180900A * 2010-10-25 2013-06-26 高通股份有限公司 System, method and apparatus for voice activity detection

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5884255A (en) 1996-07-16 1999-03-16 Coherent Communications Systems Corp. Speech detection system employing multiple determinants
JP4123835B2 (ja) 2002-06-13 2008-07-23 松下電器産業株式会社 Noise suppression device and noise suppression method
US7860718B2 (en) 2005-12-08 2010-12-28 Electronics And Telecommunications Research Institute Apparatus and method for speech segment detection and system for speech recognition
US8990073B2 (en) * 2007-06-22 2015-03-24 Voiceage Corporation Method and device for sound activity detection and sound signal classification
CN102044243B (zh) * 2009-10-15 2012-08-29 华为技术有限公司 Voice activity detection method and device, and encoder
CN102576528A (zh) 2009-10-19 2012-07-11 瑞典爱立信有限公司 Detector and method for voice activity detection
US8626498B2 (en) 2010-02-24 2014-01-07 Qualcomm Incorporated Voice activity detection based on plural voice activity detectors
CN102884575A (zh) * 2010-04-22 2013-01-16 高通股份有限公司 Voice activity detection
ES2665944T3 (es) * 2010-12-24 2018-04-30 Huawei Technologies Co., Ltd. Apparatus for performing voice activity detection
JP5737808B2 (ja) * 2011-08-31 2015-06-17 日本放送協会 Acoustic processing device and program therefor
US9111531B2 (en) 2012-01-13 2015-08-18 Qualcomm Incorporated Multiple coding mode signal classification
US9099098B2 (en) * 2012-01-20 2015-08-04 Qualcomm Incorporated Voice activity detection in presence of background noise


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3040991A4 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015117410A1 (zh) * 2014-07-18 2015-08-13 中兴通讯股份有限公司 Voice activity detection method and apparatus
US10339961B2 (en) 2014-07-18 2019-07-02 Zte Corporation Voice activity detection method and apparatus
CN109801646A (zh) * 2019-01-31 2019-05-24 北京嘉楠捷思信息技术有限公司 Voice endpoint detection method and device based on fused features
CN112908350A (zh) * 2021-01-29 2021-06-04 展讯通信(上海)有限公司 Audio processing method, communication device, chip, and module equipment thereof
CN112908350B (zh) * 2021-01-29 2022-08-26 展讯通信(上海)有限公司 Audio processing method, communication device, chip, and module equipment thereof
CN115862685A (zh) * 2023-02-27 2023-03-28 全时云商务服务股份有限公司 Real-time voice activity detection method, device and electronic equipment
CN115862685B (zh) * 2023-02-27 2023-09-15 全时云商务服务股份有限公司 Real-time voice activity detection method, device and electronic equipment

Also Published As

Publication number Publication date
EP3040991A4 (en) 2016-09-14
PL3040991T3 (pl) 2021-08-02
EP3040991A1 (en) 2016-07-06
CN104424956B (zh) 2018-09-21
KR101831078B1 (ko) 2018-04-04
US9978398B2 (en) 2018-05-22
JP2016529555A (ja) 2016-09-23
CN104424956A (zh) 2015-03-18
KR20160039677A (ko) 2016-04-11
JP6412132B2 (ja) 2018-10-24
EP3040991B1 (en) 2021-04-14
CN104424956B9 (zh) 2022-11-25
US20160203833A1 (en) 2016-07-14

Similar Documents

Publication Publication Date Title
WO2014177084A1 (zh) Voice activity detection method and device
CN105261375B (zh) Voice activity detection method and apparatus
CN103854662B (zh) Adaptive voice detection method based on multi-domain joint estimation
CN112992188B (zh) Method and device for adjusting the signal-to-noise ratio threshold in a voice activity detection (VAD) decision
CN102074245B (zh) Dual-microphone-based speech enhancement device and speech enhancement method
CN103026407B (zh) Bandwidth extender
US9959886B2 (en) Spectral comb voice activity detection
WO2016206273A1 (zh) Method for acquiring a number of voice activity correction frames, and voice activity detection method and device
US9437213B2 (en) Voice signal enhancement
Wu et al. A pitch-based method for the estimation of short reverberation time
CN109346106B (zh) Cepstral-domain pitch period estimation method based on sub-band signal-to-noise ratio weighting
Fortune et al. Speech classification for enhancing single channel blind dereverberation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14791094

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 14915246

Country of ref document: US

ENP Entry into the national phase

Ref document number: 2016537092

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 20167005654

Country of ref document: KR

Kind code of ref document: A

REEP Request for entry into the european phase

Ref document number: 2014791094

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2014791094

Country of ref document: EP