CN1623186A - Voice activity detector and validator for noisy environments

Info

Publication number: CN1623186A
Application number: CNA038026821A
Authority: CN
Inventors: 道格拉斯·拉尔夫·伊利; 霍利·路易斯·凯莱赫; 戴维·约翰·本杰明·皮尔斯
Original assignee: Motorola Inc
Current assignee: Motorola Mobility LLC; Google Technology Holdings LLC
Priority date: 2002-01-24
Filing date: 2003-01-10
Publication date: 2005-06-01
Anticipated expiration: 2023-01-10
Also published as: KR100976082B1; GB2384670A; GB0201585D0; GB2384670B; KR20040075959A; JP2010061151A; KR20090127182A; FI124869B; WO2003063138A1; JP2005516247A; FI20041013A; CN1307613C

Abstract

A communication unit (100) includes an audio processing unit (109) having a voice activity detection mechanism (130, 135). The voice activity detection mechanism (130, 135) measures an energy acceleration of a signal input to the communication unit (100) and determines whether said input signal is speech or noise, based on said measurement. A method of detecting voice and a method of deciding whether an input signal is voice or noise are also described. Using an energy acceleration based voice activity detector and validator, particularly for noisy environments, provides the advantages of noise robustness, fast response and independence of the level of input speech.

Description

Voice activity detector and validator for noisy environments

Technical field

The present invention relates to the detection of speech in noisy environments, commonly referred to as voice activity detection (VAD). The invention is applicable to, but not limited to, energy acceleration measurement of speech signals in a voice detection system.

Background

Many voice communication systems, such as the global system for mobile communications (GSM) cellular telephony standard and the Terrestrial Trunked Radio (TETRA) system for private mobile radio users, use audio processing units to encode and decode speech patterns. In such voice communication systems, a speech encoder converts the analog speech pattern into a digital format suitable for transmission. A speech decoder converts a received digital speech signal into an audible analog speech pattern.

Methods and apparatus for detecting voice activity are known in the art. A voice activity detector (VAD) operates on the assumption that speech is present in only part of an audio signal. This assumption is usually correct, since many audio signals contain intervals of only silence or background noise.

Voice activity detectors can be used for many purposes. These include suppressing overall transmission activity in a transmission system when no speech is present, potentially saving power and channel bandwidth. When the VAD detects that speech activity has resumed, transmission activity can be restarted.

A voice activity detector can also be used in conjunction with a speech storage device, to distinguish the parts of the audio that contain speech from the "speechless" parts. The parts containing speech are then stored in the storage device and the "speechless" parts are discarded.

Existing methods for detecting speech are based, at least in part, on detecting and estimating the power of the speech signal. The estimated power is compared with either a constant or an adaptive threshold to decide whether the signal is speech. The main advantage of these methods is their low complexity, which makes them suitable for implementations with low processing resources. The main disadvantage is that background noise can cause "speech" to be detected inadvertently when no speech is actually present. In addition, speech that is actually present may go undetected because it is obscured by background noise and therefore hard to detect.

Some methods for detecting voice activity are aimed at noisy mobile environments and are based on adaptive filtering of the speech signal. This reduces the noise content of the signal before the final decision is made. Because such a method is used with different speakers and in different environments, the spectrum and noise level may vary. The input filter and threshold are therefore usually adaptive, so that they track these variations.

Examples of these methods are given in the GSM specification 06.42 voice activity detectors (VAD) for the half-rate, full-rate and enhanced full-rate speech traffic channels, respectively. Another such method is the "Multi-Boundary Voice Activity Detection Algorithm" proposed in ITU G.729 Annex B. These methods are accurate in noisy environments but complex to implement.

All of these methods require an input speech signal. Some applications that employ speech decompression schemes need to perform speech detection during speech decompression.

European patent application No. EP-A-0785419, by Benyassine et al., relates to a method for voice activity detection comprising the following steps:

(i) extracting a predetermined set of parameters from the incoming speech signal for each frame, and

(ii) making a frame voicing decision on the incoming speech signal for each frame, according to a set of difference measures extracted from the predetermined set of parameters.

VADs in cellular systems are biased to ensure that the radio equipment, including the speech codec and RF circuitry, is activated when a party speaks, so that the speech is conveyed to the other party even in background noise and other impaired conditions. However, this results in data being transmitted at times when the party is not speaking. The cost of this approach is a slightly reduced battery life and slightly increased interference to co-channel users in other cells of the system. These are essentially second (or higher) order effects.

In these systems, the design of a duplex call does not free up any limited resource. Typically, the uplink and downlink, on different carriers, can both use the full bandwidth simultaneously and continuously.

It is known in the field of this invention that some voice activity or voice onset detectors (VAD/VOD) attempt to distinguish voiced speech by using speech characteristics such as harmonic structure (for example, through autocorrelation). In noise, however, these structural indicators may fail, either because the speech structure is corrupted or because the noise itself has structure. Such noise may be, for example, engine, tire or air-conditioning noise in a car. Finally, these methods are weak at detecting unvoiced speech.

The alternative is simply to use frame energy levels to detect speech. This is satisfactory for speech in high signal-to-noise ratio (SNR) conditions, where any threshold above the noise level can be set to indicate speech. However, this approach fails in many real noise conditions.

For non-normalized databases, or in real applications, the noise level in one set of examples may well be higher than the speech level in another set, which makes it impossible to set a single threshold. An existing approach to this problem is to take the mean of approximately the first 100 milliseconds of the utterance, assume that it represents noise, and so create a threshold specific to that utterance. Again, however, this is insufficient for non-stationary noise, where the noise may rapidly depart from the initial estimate, where the noise has a high variance, or where the first few frames in fact contain speech rather than the assumed noise.

A need therefore exists for an improved voice activity detector and validator for noisy environments, in which the above disadvantages may be alleviated.

Summary of the invention

According to a first aspect of the present invention, there is provided a communication unit as claimed in claim 1.

According to a second aspect of the present invention, there is provided a method of detecting a speech signal input to a communication unit, as claimed in claim 11.

According to a third aspect of the present invention, there is provided a method of deciding whether a signal input to a communication unit is speech or noise, as claimed in claim 14.

Further aspects of the invention are as described in the dependent claims.

In summary, the present invention aims to address the case of non-stationary noise of any amplitude by using an energy acceleration measure, in preference to an energy amplitude measure, to indicate the presence or absence of speech.

Brief description of the drawings

Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which:

Fig. 1 shows a block diagram of a communication unit adapted to perform the voice activity detection and validation of the preferred embodiments of the present invention;

Fig. 2 shows a flowchart of an energy acceleration based voice activity detector for noisy environments, in accordance with a preferred embodiment of the present invention;

Fig. 3 shows a flowchart of energy acceleration based voice activity validation for noisy environments, in accordance with a preferred embodiment of the present invention; and

Fig. 4 shows a buffer operation in accordance with a preferred embodiment of the present invention.

Description of preferred embodiments

Voiced speech has a relatively high energy acceleration value, because the onset of voiced speech depends on the vocal cords being either active (vibrating) or at rest. Similarly, the onset of unvoiced sounds (for example plosives) also has a high energy acceleration.

The inventors have recognized that, in a representational domain where speech features are distinct, such as the narrow-band power spectrum or the Mel spectrum, the resulting energy acceleration of speech is much higher than that of non-stationary noise. The only major exception is impulsive noise (for example clapping).

Therefore, in accordance with a preferred embodiment of the present invention, the inventors have found that, by concentrating on the energy in the frequency region likely to contain the fundamental pitch of the speech signal, speech can additionally be separated from these noises. In particular, the inventors propose using a non-structural feature of speech, namely the energy acceleration (or the acceleration of some measure reflecting the speech energy or a component of it).

In particular, a preferred application of the inventive concepts described herein is the distributed speech recognition (DSR) standard currently being defined by the European Telecommunications Standards Institute (ETSI): "Speech Processing, Transmission and Quality aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithm", ETSI ES 201 108 v1.1.2 (2000-04), April 2000.

Referring now to Fig. 1, there is shown a block diagram of a subscriber audio unit 100 adapted to support the inventive concepts of the preferred embodiments of the present invention.

The preferred embodiment of the present invention is described with reference to a wireless audio communication unit, for example one capable of offering DSR capabilities under the 3rd Generation Partnership Project (3GPP) standards for future cellular wireless communication systems. However, it is within the contemplation of the invention that the inventive concepts relating to voice activity detection and validation described herein are equally applicable to any electronic device that responds to speech signals and could benefit from improved voice activity detection circuitry.

As known in the art, the subscriber audio unit 100 contains an antenna 102, preferably coupled to a duplex filter, antenna switch or circulator 104, which provides isolation between the receive chain and the transmit chain within the subscriber audio unit 100.

The receiver chain includes receiver front-end circuitry 106 (effectively providing reception, filtering and intermediate or baseband frequency conversion). The front-end circuitry 106 is coupled to a signal processing function 108 (generally realized by a digital signal processor (DSP)). The signal processing function 108 performs signal demodulation, error correction and formatting. Data recovered by the signal processing function 108 is passed to an audio processing function 109, which formats the received signal in a suitable manner for output to an audio sounder/display 111.

In different embodiments of the invention, the signal processing function 108 and the audio processing function 109 may be provided within the same physical device. A controller 114 is arranged to control the information flow and operational state of the components of the subscriber unit 100.

As regards the transmit chain, this essentially includes an audio input device 120, coupled in series through the audio processing function 109, the signal processing function 108, transmitter/modulation circuitry 122 and a power amplifier 124. The processor 108, the transmitter/modulation circuitry 122 and the power amplifier 124 are operationally responsive to the controller. The power amplifier output is coupled to the duplex filter, antenna switch or circulator 104 and to the antenna 102, to radiate the final radio frequency signal.

In particular, the audio processing function 109 includes a voice activity (or voice onset) detection (VAD) function 130, operably coupled to a voice activity decision function 135. In accordance with a preferred embodiment of the present invention, the VAD function 130 and the voice activity decision function 135 are adapted to provide an improved voice detection and decision mechanism, the operation of which is further described with respect to Fig. 2 and Fig. 3. Notably, the voice activity detector function 130 comprises a detection stage consisting of three frame-by-frame measurements. These three measurements are based on:

(i) the whole spectrum;

(ii) a spectral sub-band; and

(iii) the spectral variance.

The voice activity decision function 135 then makes a decision based on a buffer of these measurements, analyzing them for speech likelihood. The final decision from the decision stage can be applied retrospectively to the earliest frame in the buffer.

In a preferred embodiment of the present invention, a timer/counter 118 is also adapted to perform the timing functions in the detection and decision processes of Fig. 2 and Fig. 3.

The signal processing function 108, audio processing function 109, VAD function 130 and voice activity decision function 135 can be implemented as distinct, operably coupled processing elements. Alternatively, one or more processors may be used to implement one or more of the corresponding processing operations. In a further alternative embodiment, the aforementioned functions may be implemented as a mix of hardware, software or firmware elements, using application-specific integrated circuits (ASICs) and/or processors, for example digital signal processors (DSPs).

Of course, the various elements within the subscriber audio unit 100 can be realized in discrete or integrated component form, so the ultimate structure is merely an arbitrary selection.

To this end, there are a number of ways of obtaining the energy acceleration indication used in the preferred embodiment of the present invention:

(i) The theoretically ideal method is to take the exact second derivative (double-differentiate) of the energy level over successive frames of the utterance, as shown in the prior published application US 6009391. The disadvantage of this method is that it may introduce delay, because several frames on either side of the frame under analysis need to be analyzed.

(ii) A zero-delay estimate of the energy acceleration can be obtained by comparing a short-term average with the instantaneous value, for example:

Using a frame average:

\tilde{A} = \frac{x_{t}}{(x_{t} + x_{t-1} + \cdots + x_{t-n}) / (n + 1)} \qquad [1]

Or using a rolling average:

\tilde{A} = \frac{x_{t}}{a x_{t} + b x_{t-1} + \cdots + k x_{t-n}} \qquad [2]

In each case, the method returns a value that can be interpreted as 'deceleration' < 1 < 'acceleration'. An empirical threshold on this value, and a denominator length, that best separate speech from noise can then be found.

The inventors have recognized that the preferred solution is to find a denominator that tracks non-stationary noise quickly but is too slow to track speech onsets. The suggested sequence of coefficient values for the rolling average is a = 0.2, b = 0.8 × a, c = 0.8 × b, and so on, which can be expressed simply as a recursion:

d_{t} = 0.2 x_{t} + 0.8 d_{t-1} \qquad [3]

Then:

A = x_{t} / d_{t} \qquad [4]
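
The two zero-delay estimates above can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the function names are hypothetical, the rolling average is assumed to be seeded with the first input, and the frame average is assumed to use however many earlier frames are available at the start of the signal.

```python
def frame_average_acceleration(energies, n=3):
    """Ratio of each input to the mean of itself and up to n previous inputs (equation [1])."""
    ratios = []
    for t, x in enumerate(energies):
        window = energies[max(0, t - n): t + 1]  # assumption: shorter window at the start
        mean = sum(window) / len(window)
        ratios.append(x / mean if mean > 0 else 1.0)
    return ratios


def rolling_acceleration(energies, alpha=0.2):
    """Ratio of each input to the recursive average d_t = alpha*x_t + (1-alpha)*d_{t-1}
    (equations [3] and [4])."""
    ratios = []
    d = None
    for x in energies:
        # Assumption: the average is seeded with the first input.
        d = x if d is None else alpha * x + (1.0 - alpha) * d
        ratios.append(x / d if d > 0 else 1.0)
    return ratios


if __name__ == "__main__":
    # A flat signal followed by a sudden rise: the ratio jumps above 1 at the onset.
    print(rolling_acceleration([1.0, 1.0, 1.0, 4.0]))
```

A ratio above 1 indicates rising energy and a ratio below 1 indicates falling energy, which matches the 'deceleration' < 1 < 'acceleration' reading given in the text.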

The preferred VAD and parameter initialization scheme in the detection stage is summarized in the flowchart of Fig. 2. In non-stationary noise, a long-term energy threshold is not a reliable indicator of speech. Similarly, in high-noise conditions, the structure of speech (for example harmonics) cannot be wholly relied upon as an indicator, because it may be corrupted by noise, or structured noise may confuse the detector. The preferred voice activity detector therefore uses a noise-robust feature of speech, namely the energy acceleration associated with voice onset.

Referring now to Fig. 2, a flowchart 200 of the preferred detection process is shown. As noted above, the process comprises frame-by-frame analysis. The preferred VAD mechanism involves a 'whole spectrum' measurement process. A frame counter is initially evaluated to determine whether it is less than 'N', which defines the number of buffered frames, as shown in step 205. As an example for the preferred embodiment, 'N' is set to 15, assuming a frame increment of, say, 10 milliseconds. If the frame counter is less than 'N' in step 205, the rolling average is updated with the result of an initial acceleration test, as in step 210. If the frame counter is not less than 'N' in step 205, step 210 is skipped.

A determination is then made as to whether the energy acceleration measure lies within one or more specified limits, as shown in step 235. If it does, the rolling average is updated with the result of a further energy acceleration test, as in step 240. If it does not, step 240 is skipped.

A determination is then made as to whether the energy acceleration measure is greater than a specified threshold, as shown in step 260. If it is, the frame is deemed to be a speech frame, as in step 265. If it is not, the frame is deemed to be a noise frame, as in step 270.

The frame counter is then incremented, as in step 275, and the process repeats from step 205.

As an enhancement to this process, a sub-region measurement process, shown in optional steps 215 and 245, can be performed instead of, or in addition to, the whole-spectrum measurement process. The particular sub-region of the spectrum is chosen as the one most likely to contain the fundamental pitch.

In the sub-region process, when the rolling average for the initial acceleration test is updated in the whole-spectrum measurement in step 210, a determination is made as to whether the energy acceleration measure is greater than a threshold, as shown in step 220. If it is, the initialization of other parameters is suspended, as shown in step 225. If it is not, the initialization of other parameters is updated, as in step 230. The process then returns to step 235, as shown.

After the determination in step 235 of whether the energy acceleration measure lies within one or more specified limits, a further preferred determination is made. The deceleration value is evaluated in step 250 to determine whether it is 'high' and, if so, the rolling average for the energy acceleration test is updated slowly, as shown in step 255. The process then returns to the whole-spectrum method at step 260.

In this way, the higher signal-to-noise ratio (SNR) of the sub-region detector makes it more noise robust. However, it is vulnerable to adverse microphone and speaker variations and to band-limited noise. This measure therefore cannot be relied upon in all circumstances. Accordingly, the preferred embodiment of the present invention incorporates the sub-region detector as a supplement to the whole-spectrum measure.

A further measurement process is preferably performed using the 'acceleration' of the variance of the values in, for example, the lower half of the spectrum of each frame. The variance measure detects structure in the lower half of the spectrum, which makes it highly sensitive to voiced speech. The variance measure follows the method of the sub-region process, with the lower half of the spectrum as the selected sub-region. It further complements the whole-spectrum measure, which is better able to detect unvoiced and plosive speech.

All three measures take their raw input from the spectral representation of the filter gain produced by the first stage of a double Wiener filter, as described in US patent application 09/427497, filed by Motorola Inc. with inventor Yan-Ming Chen. As noted above, each measure uses a different aspect of this data.

Specifically, the whole-spectrum detector uses the Mel-filtered spectral representation of the filter gain produced by the first stage of the known double Wiener filter. A single input value is obtained by squaring the sum of the Mel filter banks.

In a preferred embodiment of the present invention, the whole-spectrum detector applies the following process to every frame:

Step 1 initializes the noise estimate 'Tracker' as follows:

If Frame < 15 and Acceleration < 2.5,

then Tracker = MAX(Tracker, Input).

The energy acceleration measure prevents the Tracker from being updated if speech occurs within the lead-in time of 15 frames.

Step 2 updates the Tracker if the current input is similar to the noise estimate, as follows:

If Input < Tracker × UpperBound and

Input > Tracker × LowerBound,

then Tracker = a × Tracker + (1 - a) × Input.

Step 3 provides a fail-safe for those cases where there is speech, or uncharacteristically large noise, in the first few frames. It causes the resulting erroneously high noise estimate to decay. Step 3 is preferably performed as follows:

If Input < Tracker × Floor,

then Tracker = b × Tracker + (1 - b) × Input.

Step 4 returns a True speech determination if the current input exceeds 165% of the Tracker, as follows:

If Input > Tracker × Threshold,

then output True, otherwise output False.

The ratio of the instantaneous input to the short-term average Tracker is a function of the energy acceleration of successive inputs.

In the above:

a = 0.8 and b = 0.97;

UpperBound is 150% and LowerBound is 75%;

Floor is 50%; and

Threshold is 165%.
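
Steps 1 to 4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: the class and field names are hypothetical, the Tracker is assumed to start at zero, frames are assumed to be numbered from 1, and the lead-in 'Acceleration' is assumed to be the ratio of the instantaneous input to its running mean over the frames seen so far.

```python
from dataclasses import dataclass


@dataclass
class WholeSpectrumDetector:
    a: float = 0.8            # Step 2 smoothing factor
    b: float = 0.97           # Step 3 (fail-safe) smoothing factor
    upper: float = 1.5        # UpperBound, 150%
    lower: float = 0.75       # LowerBound, 75%
    floor: float = 0.5        # Floor, 50%
    threshold: float = 1.65   # Threshold, 165%
    lead_in: int = 15         # lead-in length in frames
    accel_limit: float = 2.5  # acceleration limit during lead-in
    tracker: float = 0.0      # assumption: noise estimate starts at zero
    running_mean: float = 0.0
    frame: int = 0

    def step(self, x: float) -> bool:
        """Process one frame's input value; return True for speech, False for noise."""
        self.frame += 1
        # Assumed acceleration estimate: instantaneous input over its running mean.
        self.running_mean = ((self.frame - 1) * self.running_mean + x) / self.frame
        accel = x / self.running_mean if self.running_mean > 0 else 1.0

        # Step 1: initialize the Tracker during lead-in unless the input accelerates.
        if self.frame < self.lead_in and accel < self.accel_limit:
            self.tracker = max(self.tracker, x)
        # Step 2: follow the input when it is close to the noise estimate.
        if self.tracker * self.lower < x < self.tracker * self.upper:
            self.tracker = self.a * self.tracker + (1 - self.a) * x
        # Step 3: slowly pull down an erroneously high noise estimate.
        if x < self.tracker * self.floor:
            self.tracker = self.b * self.tracker + (1 - self.b) * x
        # Step 4: speech if the input is well above the noise estimate.
        return x > self.tracker * self.threshold


if __name__ == "__main__":
    det = WholeSpectrumDetector()
    frames = [1.0] * 20 + [5.0]
    print([det.step(x) for x in frames])
```

Keeping the Tracker update and the decision in one `step` call mirrors the frame-by-frame structure of the text; an input between Floor and LowerBound, or above UpperBound, leaves the Tracker unchanged.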

Note that if the value is greater than UpperBound, or lies between LowerBound and Floor, no update takes place. In addition, as noted above, the energy acceleration of the input can be calculated by:

double-differentiating successive inputs, or estimating it by tracking the ratio of two rolling averages of the input.

Note that the ratio of a fast-adapting and a slow-adapting rolling average reflects the energy acceleration of successive inputs.

For example, the contribution rates used for the averages above were:

(i) 0 × average + 1 × input, and

(ii) ((Frame - 1) × average + 1 × input) / Frame,

which makes the energy acceleration measure increasingly sensitive over the first 15 frames.

This frequency sub-band detecting device preferably uses the average of second, third and the 4th Mel bank of filters that measure from ' entire spectrum '.Then, the processing below this detecting device has been used all frames in the manner as described below:

(i) input=p * current input+(1-p) * previous input;

If (ii) frame number＜15,

Pursuit gain=MAX (pursuit gain, input) then;

If (iii) input＜pursuit gain * upper limit and

Input＞pursuit gain * lower limit,

Then pursuit gain=a * pursuit gain+(1-a) * input;

If (iv) input＜pursuit gain * minimum,

Then pursuit gain=b * pursuit gain+(1-b) * input

If (v) input＞pursuit gain * thresholding,

Then export value of true, otherwise output value of false.

Wherein, in the subarea is measured:

p＝0.75。

Except equaling 3.25 thresholding, to measure for entire spectrum, all other parameters are all identical.
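
The sub-band process (i) to (v) can be sketched in Python as follows. This is a hedged illustration only: the function name and the layout of `mel_frames` (one list of Mel filter bank values per frame, lowest band first) are assumptions, 'PreviousInput' is assumed to mean the previous smoothed input, the Tracker is assumed to start at zero, and frames are assumed to be numbered from 1.

```python
def subband_vad(mel_frames, p=0.75, a=0.8, b=0.97, upper=1.5, lower=0.75,
                floor=0.5, threshold=3.25, lead_in=15):
    """Return one True/False speech decision per frame from Mel filter bank values."""
    tracker = 0.0    # assumption: noise estimate starts at zero
    previous = None  # previous smoothed input
    decisions = []
    for frame, bands in enumerate(mel_frames, start=1):
        # Average of the second, third and fourth Mel filter banks.
        current = sum(bands[1:4]) / 3.0
        # (i) smooth the input across frames.
        x = current if previous is None else p * current + (1 - p) * previous
        previous = x
        # (ii) initialize the Tracker during the lead-in frames.
        if frame < lead_in:
            tracker = max(tracker, x)
        # (iii) follow the input when it is close to the noise estimate.
        if tracker * lower < x < tracker * upper:
            tracker = a * tracker + (1 - a) * x
        # (iv) fail-safe: slowly pull down an estimate that is too high.
        if x < tracker * floor:
            tracker = b * tracker + (1 - b) * x
        # (v) speech if the input is well above the noise estimate.
        decisions.append(x > tracker * threshold)
    return decisions


if __name__ == "__main__":
    quiet = [[0.0, 1.0, 1.0, 1.0, 0.0]] * 20
    loud = [[0.0, 10.0, 10.0, 10.0, 0.0]]
    print(subband_vad(quiet + loud))
```

The higher Threshold of 3.25, compared with 1.65 for the whole-spectrum measure, means the sub-band detector only fires on a much larger rise above its noise estimate.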

For the spectral variance measure, the variance of the values in the lower-frequency half of the narrow-band spectral representation of the gain for each frame is used as the input. The detector then applies the same process as for the whole-spectrum measure.

The variance is calculated as:

\frac{1}{N} \sum_{i=0}^{N-1} w_{i}^{2} - \frac{\left(\sum_{i=0}^{N-1} w_{i}\right)^{2}}{N^{2}} \qquad [5]

where:

N = FFT length / 4, and

w_{i} are the values of the narrow-band spectral representation of the gain.
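
Equation [5] can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes that `gains` holds the narrow-band gain values ordered from low to high frequency, so that the first FFT length / 4 values make up the half of the spectrum used by the measure.

```python
def half_spectrum_variance(gains, fft_length):
    """Variance of the first N = fft_length / 4 gain values, as in equation [5]."""
    n = fft_length // 4
    if n == 0 or len(gains) < n:
        raise ValueError("need at least fft_length / 4 gain values")
    w = gains[:n]
    mean_of_squares = sum(v * v for v in w) / n
    square_of_sum = sum(w) ** 2
    return mean_of_squares - square_of_sum / (n * n)


if __name__ == "__main__":
    # A flat half-spectrum has no structure; an alternating one does.
    print(half_spectrum_variance([1.0, 1.0, 1.0, 1.0, 9.0, 9.0, 9.0, 9.0], 16))
    print(half_spectrum_variance([0.0, 2.0, 0.0, 2.0, 9.0, 9.0, 9.0, 9.0], 16))
```

The expression is the usual "mean of squares minus square of the mean" form of the population variance, so a flat set of gains gives zero and structured gains give a positive value.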

In accordance with a preferred embodiment of the present invention, the three measures detailed above are presented to a VAD decision algorithm, as shown in the flowchart of Fig. 3. Successive inputs are presented to a buffer, which provides contextual analysis. This introduces a frame delay equal to the buffer length minus one frame.

Referring now to Fig. 3, a flowchart 300 is shown of the acceleration based voice activity validation process for noisy environments, in accordance with a preferred embodiment of the present invention.

For an N = 7 frame buffer, the most recent True/False input is stored at position N in the data buffer, as shown in step 305. The decision logic applies a number of the following steps, and preferably each of them:

Step 1:

V_N = Measure 1 OR Measure 2 OR Measure 3

If any of the three measures returns a True speech indication, the input V_N is defined as True (T).

Step 2:

The algorithm searches for the longest continuous sequence of True values in the buffer, as in step 310. Thus, for example, for the sequence 'TTFTTTF', M equals 3.

Step 3:

If M ≥ S_P and T < L_S, T = L_S

where S_P equals the first threshold in step 315. If the longest sequence of True (T) speech values equals or exceeds the first threshold in step 315, that is, there are S_P = 3 or more consecutive True values, the buffer is judged to contain 'possible' speech. If it is determined in step 320 that a timer of at least this length is not already running, a short timer T (Time_1) of, say, L_S = 5 frames is started in step 325.

Step 4:

If M ≥ S_L and F > F_S, T = L_M, otherwise T = L_L

where S_L equals the second threshold in step 330. If there are S_L = 4 or more consecutive True values, the buffer is judged to contain 'likely' speech. If the current frame F is determined in step 335 to be outside an initial lead-in safety period F_S, a medium timer T of, say, L_M = 22 frames is started in step 340. Otherwise, a fail-safe long timer T of, say, L_L = 40 frames is used in step 345. This arrangement is used in case speech occurring early in the utterance has caused the VAD's initial noise estimate to be too high.

Step 5:

If M < S_P and T > 0, T--

If the process determines in step 350 that there are fewer than S_P = 3 consecutive True values, and in step 355 that the timer is greater than zero, the timer is decremented in step 360.

Step 6:

If T > 0, output True, otherwise output False

If the timer is greater than zero in step 365, the process outputs a True speech decision, as shown in step 370. Otherwise, if the timer is not greater than zero in step 365, the process outputs a 'noise' decision, as shown in step 375.

Step 7:

Frame++, shift the buffer left and return to Step 1.

In preparation for the next frame in step 380, the buffer is shifted left to accommodate the next input, as shown in Fig. 4. The output speech decision is applied to the frame being ejected from the buffer. The process is then repeated in step 305 for the next True/False input into the data buffer.
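
Steps 1 to 7 can be sketched in Python as follows. This is one possible reading of the decision logic, not the patent's implementation: the function names are hypothetical, the buffer is assumed to start full of False entries, Step 4 is read as setting the medium or long timer only when M ≥ S_L, and the default for the lead-in safety period `f_s` is an illustrative value that the text does not specify. The sketch does not attempt to reproduce the exact output sequence of the Fig. 4 example, whose hangover lengths depend on conventions the text leaves open.

```python
from collections import deque


def longest_true_run(flags):
    """Length M of the longest run of consecutive True values (Step 2)."""
    best = run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        best = max(best, run)
    return best


def vad_decisions(measure_flags, n=7, s_p=3, s_l=4, l_s=5, l_m=22, l_l=40, f_s=35):
    """measure_flags: one tuple of three True/False measures per frame.

    Returns one decision per input frame; each decision applies to the oldest
    frame in the buffer, i.e. the frame input n - 1 frames earlier.
    """
    buffer = deque([False] * n, maxlen=n)  # assumption: buffer starts all False
    timer = 0
    decisions = []
    for frame, flags in enumerate(measure_flags, start=1):
        # Steps 1 and 7: OR the three measures and push the result into position N;
        # the deque drops the oldest entry, which is the left shift.
        buffer.append(any(flags))
        m = longest_true_run(buffer)            # Step 2
        if m >= s_p and timer < l_s:            # Step 3: 'possible' speech
            timer = l_s
        if m >= s_l:                            # Step 4: 'likely' speech
            timer = l_m if frame > f_s else l_l
        if m < s_p and timer > 0:               # Step 5: hangover countdown
            timer -= 1
        decisions.append(timer > 0)             # Step 6
    return decisions


if __name__ == "__main__":
    # Only frames 6, 7 and 8 carry a True measure.
    frames = [(k in (6, 7, 8), False, False) for k in range(1, 23)]
    print(vad_decisions(frames))
```

With only three consecutive True inputs, only the short timer is ever started in this reading, so the True decisions extend a few frames either side of the detected speech and then stop.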

Alternative mechanisms for making a speech or noise decision based on the energy acceleration process described above are also within the contemplation of the invention. For example, the decision mechanism need not be based on one or more timers, and could instead be based entirely on whether one or more energy acceleration thresholds are exceeded.

Referring now to Fig. 4, an example of the buffer operation 400 in accordance with a preferred embodiment of the present invention is shown in more detail. Assume that the first threshold is set to three consecutive True values. At time 't' 410, assume that only the current input (frame #7) 425 and the previous input (frame #6) 420 are True. Therefore, when the buffer shifts, the first frame (frame #1) 415 will be labeled False.

At time 't+1' 430, a third True input (frame #8) 450 is received, adding to the two previous True inputs 440 and 445. Therefore, when the buffer shifts, the next output frame (frame #2) 435 will be labeled True.

Note that in the above decision process the only constraints are:

(i) Time_1 < Time_2 < Time_3, and

(ii) Threshold_1 < Threshold_2.

Assuming that only these three inputs (frame #6, frame #7 and frame #8) are True, the whole output sequence is:

Output: F  T  T  T  T  T  T  T  T  T  T

Frame:  1  2  3  4  5  6  7  8  9  10 11

Output: T  T  T  T  T  T  F  F  F  F  F

Frame:  12 13 14 15 16 17 18 19 20 21 22

Here, frames #2-#5 are labeled True because of the buffer lead-in function. Frames #6-#8 are labeled True as the actual position of the original True speech inputs. Frames #9-#12 are labeled True because of the buffer lead-out function. Frames #13-#18 are labeled True in response to the timer delay used. When all the frames of the utterance have been input, the buffer shifts out False entries (frames #19-#L_M) until it is empty.

It is within the contemplation of the invention that the buffer length and delay timers can be adjusted dynamically to meet the needs of the voice communication unit. Likewise, the preferred embodiment's use of a buffer length 'N' of 8 and a delay timer of 5 frames is for illustrative purposes only. Note, however, that the buffer length 'N' should always be defined such that N ≥ S_L.

In addition to serving as a VAD in its own right, the energy acceleration measurement performed in the method steps of Fig. 2 can be used to validate the initialization of other parameters, and this is also within the contemplation of the invention. For example, spectral subtraction schemes require an initial estimate of the noise, taken from the first ten frames (typically 100 milliseconds) of the speech signal. Even in stationary noise, events may occur that invalidate the initial estimate. Examples of such events include:

(a) Ramp-up of the signal:

For a variety of possible reasons, the start of a recording may 'ramp up' to its full value during the estimation period. Causes of such a ramp-up include: buffers filling in a digital system, capacitance in an analogue system, or a lead being connected. The effect of these events is to invalidate the estimate. The energy acceleration measurement can therefore be used to detect such a ramp-up and prevent this error.

(b) Glitches in the initial signal:

A 'glitch' commonly accompanies the action of the push-to-talk (PTT) button on the user's radio unit, where electrical contact is seldom made before the button's switch clicks back. As mentioned above, when such an event occurs, the energy acceleration measurement can be used to suspend the estimation process, as shown in step 225 of Fig. 2.

(c) Speech in the initial signal:

Another common event, particularly for PTT systems, is that the user starts speaking immediately upon pressing the PTT button. In this case, electrical contact is made after the speech has begun. The energy acceleration measurement can recognise this and suspend the noise-based initialization, as shown in step 225 of Fig. 2, or force the use of a fallback estimate.
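The validation described in (a) to (c) might be sketched as follows. This is only an illustration under stated assumptions: energy acceleration is approximated here by the second difference of the log frame energy, and the names `energy_acceleration` and `initial_noise_estimate`, together with the threshold of 1.0, are hypothetical rather than taken from the method of Fig. 2.

```python
import math


def energy_acceleration(frame_energies):
    """Second difference of log energy, one value per interior frame.

    Assumption: this discrete second derivative stands in for the
    energy acceleration measurement; the actual measure may differ.
    """
    log_e = [math.log(max(e, 1e-12)) for e in frame_energies]
    return [
        log_e[i + 1] - 2.0 * log_e[i] + log_e[i - 1]
        for i in range(1, len(log_e) - 1)
    ]


def initial_noise_estimate(frame_energies, accel_threshold=1.0):
    """Return the mean energy of the lead-in frames, or None if suspended.

    The estimate is suspended when any acceleration magnitude exceeds the
    (hypothetical) threshold, for example because of a glitch or because
    speech is already present in the lead-in frames.
    """
    accel = energy_acceleration(frame_energies)
    if any(abs(a) > accel_threshold for a in accel):
        return None  # caller may retry later or use a fallback estimate
    return sum(frame_energies) / len(frame_energies)


steady = [1.0] * 10                        # ten frames of stationary noise
glitch = [1.0] * 4 + [100.0] + [1.0] * 5   # PTT click in the fifth frame
print(initial_noise_estimate(steady))      # estimate accepted
print(initial_noise_estimate(glitch))      # estimate suspended
```

Returning `None` leaves the choice between retrying the estimate and using a fallback value to the caller, mirroring the two options mentioned in (c).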

In summary, a communication unit comprising an audio processing unit having a voice activity detection mechanism has been described. The voice activity detection mechanism provides an indication of the energy acceleration of a signal input to the communication unit and determines, based on that indication, whether the input signal is speech or noise.

In addition, a method of detecting a speech signal input to a communication unit has been described. The method comprises the steps of: indicating the acceleration of an input signal to the communication unit; and determining, based on the indicating step, whether the input signal is speech or noise.

In addition, a method of deciding whether a signal input to a communication unit is speech or noise has been described. The method comprises the step of deciding, based on the energy acceleration, whether the input signal is speech or noise, for example using an average or rolling average over a number of frames of the input signal.
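A rolling-average decision of the kind mentioned above could look like the following minimal sketch. It assumes that the quantity being averaged is a per-frame energy acceleration value; the function name `rolling_average_decisions`, the window of 3 frames and the threshold of 1.5 are hypothetical choices made for illustration.

```python
from collections import deque


def rolling_average_decisions(accel_values, window=3, threshold=1.5):
    """Label each frame as speech when the rolling mean of the most
    recent `window` acceleration values exceeds `threshold`."""
    recent = deque(maxlen=window)  # oldest value drops out automatically
    decisions = []
    for value in accel_values:
        recent.append(value)
        decisions.append(sum(recent) / len(recent) > threshold)
    return decisions


accel = [0.0, 0.0, 3.0, 3.0, 3.0, 0.0, 0.0]
print(rolling_average_decisions(accel))
# [False, False, False, True, True, True, False]
```

Averaging trades a short onset delay for robustness: a single high value does not trigger a speech decision, and the decision persists briefly after the values fall.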

It will therefore be appreciated that the energy-acceleration-based voice activity detector and validator for noisy environments described above provides the advantages of noise robustness and fast response. Because the preferred embodiment uses a measurement that depends on energy acceleration rather than an absolute measurement, the inventive concepts described herein can be applied to speech at any input level.

Whilst specific and preferred implementations of embodiments of the invention are described above, it should be understood that a person skilled in the art could readily apply variations and modifications that fall within the inventive concepts.

Thus, an improved voice activity detector and validator for noisy environments has been described, in which the above-mentioned disadvantages associated with prior art arrangements have been substantially alleviated.

Claims

1. A communication unit (100) comprising an audio processing unit (109) having a voice activity detection mechanism (130, 135), the communication unit (100) being characterised in that the voice activity detection mechanism (130, 135) measures an energy acceleration of a signal input to the communication unit (100) and determines, based on the measurement, whether the input signal is speech or noise.

2. A communication unit (100) according to claim 1, wherein the voice activity detection mechanism comprises a voice activity detector function (130) that performs frame-by-frame detection of speech on the signal input to the voice activity detection mechanism (130, 135).

3. A communication unit (100) according to claim 2, wherein the frame-by-frame detection comprises performing an energy acceleration measurement on the signal input to the voice activity detection mechanism (130, 135) over one or more of the following frequency ranges:

(i) the whole spectrum;

(ii) a sub-band of the spectrum; and

(iii) the spectral variance.

4. A communication unit (100) according to claim 3, wherein the voice activity detection mechanism comprises a voice activity decision function (135), operably coupled to the voice activity detector function (130), to decide whether the input signal is speech based on a buffering operation applied to one or more of said measurements.

5. A communication unit (100) according to claim 4, wherein the voice activity decision function (135) uses an average or rolling average over a plurality of frames of the input signal to decide whether the input signal is speech.

6. A communication unit (100) according to any of claims 2 to 5, wherein an input frame is deemed to be a speech frame (265) if the energy acceleration measurement yields an energy acceleration value greater than an energy acceleration threshold.

7. A communication unit (100) according to claim 6, wherein the decision (265) that an input frame is a speech frame can be applied retrospectively to preceding frames in a buffer of the input signal.

8. A communication unit (100) according to claim 6 or claim 7, wherein an input frame is deemed to be a speech frame (370) if, for a plurality of consecutive frames, the energy acceleration measurement yields an energy acceleration value greater than the energy acceleration threshold.

9. A communication unit (100) according to any of claims 3 to 8, when dependent on claim 3, wherein, if a sub-region of the input signal spectrum is selected, the selection is made on the basis that the sub-region is most likely to contain the fundamental pitch of the speech signal.

10. A communication unit (100) according to any preceding claim, wherein the voice activity detection mechanism (130, 135) uses the acceleration of an energy-related characteristic of speech to validate the initialization of parameters of other speech-related or noise-related measures, for example a spectral subtraction scheme.

11. A method of detecting a speech signal input to a communication unit, characterised by the steps of:

measuring an acceleration or change in the energy of an input signal to the communication unit; and

determining (315, 330, 350), based on the measuring step, whether the input signal is speech (370) or noise (375).

12. A method of detecting a speech signal according to claim 11, further characterised by the step of:

performing frame-by-frame detection of speech on the signal input to the communication unit.

13. A method of detecting a speech signal according to claim 12, wherein the frame-by-frame detection comprises the step of:

performing an energy acceleration measurement on the input signal over one or more of the following frequency ranges:

(i) the whole spectrum;

(ii) a sub-band of the spectrum; and

(iii) the spectral variance.

14. A method of deciding whether a signal input to a communication unit is speech or noise, preferably according to any of the preceding claims 11 to 13, the method being further characterised by the step of:

deciding (315, 330, 350), based on an energy acceleration or change in an energy measurement of the input signal, whether the input signal is speech (370) or noise (375), for example using an average or rolling average over a plurality of frames of the input signal.

15. A method of deciding whether a signal input to a communication unit is speech or noise according to claim 14, wherein the deciding step comprises:

determining that an input frame is a speech frame (265) if the energy acceleration measurement yields an energy acceleration value greater than an energy acceleration threshold; and

applying the determination retrospectively to preceding frames in the buffer of the input signal.