CN1623186A - Voice activity detector and validator for noisy environments - Google Patents

Voice activity detector and validator for noisy environments

Info

Publication number
CN1623186A
Authority
CN
China
Prior art keywords
voice
frame
acceleration
communication unit
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA038026821A
Other languages
Chinese (zh)
Other versions
CN1307613C (en
Inventor
道格拉斯·拉尔夫·伊利
霍利·路易斯·凯莱赫
戴维·约翰·本杰明·皮尔斯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Motorola Mobility LLC
Google Technology Holdings LLC
Original Assignee
Motorola Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Motorola Inc
Publication of CN1623186A
Application granted
Publication of CN1307613C
Anticipated expiration
Legal status: Expired - Lifetime (current)

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • G10L25/84Detection of presence or absence of voice signals for discriminating voice from noise
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • G10L2025/783Detection of presence or absence of voice signals based on threshold decision
    • G10L2025/786Adaptive threshold

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Mobile Radio Communication Systems (AREA)
  • Telephone Function (AREA)

Abstract

A communication unit (100) includes an audio processing unit (109) having a voice activity detection mechanism (130, 135). The voice activity detection mechanism (130, 135) measures an energy acceleration of a signal input to the communication unit (100) and determines whether said input signal is speech or noise, based on said measurement. A method of detecting voice and a method of deciding whether an input signal is voice or noise are also described. Using an energy acceleration based voice activity detector and validator, particularly for noisy environments, provides the advantages of noise robustness, fast response and independence of the level of input speech.

Description

Voice activity detector and validator for noisy environments
Technical field
The present invention relates to the detection of speech in noisy environments, commonly referred to as voice activity detection (VAD). The invention is applicable to, but not limited to, energy acceleration measurement (that is, measurement of the rate of acceleration of the energy) of speech signals in a speech detection system.
Background
Many voice communication systems, such as the Global System for Mobile communications (GSM) cellular telephony standard and the Terrestrial Trunked Radio (TETRA) system for private mobile radio users, use audio processing units to encode and decode speech patterns. In such systems, a speech encoder converts the analogue speech pattern into a digital format suitable for transmission, and a speech decoder converts a received digital speech signal back into an audible analogue speech pattern.
Methods and apparatus for detecting voice activity are known in the art. A voice activity detector (VAD) operates under the assumption that speech is present in only part of an audio signal. This assumption is usually correct, since many intervals of an audio signal contain only silence or background noise.
A voice activity detector can serve many purposes. One is to suppress overall transmission activity in a transmission system when no speech is present, potentially saving power and channel bandwidth. When the VAD detects that speech activity has resumed, transmission can be restarted.
A voice activity detector can also be used together with a voice storage device, to distinguish the portions of audio that contain speech from the 'speechless' portions. The portions containing speech are then stored in the storage device and the 'speechless' portions are discarded.
Existing methods for detecting speech are based, at least in part, on detecting and estimating the power of the speech signal. The estimated power is compared with a constant or an adaptive threshold to decide whether the signal is speech. The main advantage of these methods is their low complexity, which makes them suitable for implementations with few processing resources. The main disadvantage is that background noise can inadvertently cause 'speech' to be detected when none is actually present. Conversely, speech that is actually present may go undetected because background noise makes it hard to distinguish.
Some methods for detecting voice activity are aimed at noisy mobile environments and are based on adaptive filtering of the speech signal. This reduces the noise content of the signal before the final decision. Because such a method is used with different speakers and in different environments, the spectrum and the noise level may vary. The input filter and the thresholds are therefore usually adaptive, so as to track these variations.
Examples of these methods are given in the GSM standard 06.42 voice activity detectors (VAD) for the half rate, full rate and enhanced full rate speech traffic channels respectively. Another such method is the "Multi-Boundary Voice Activity Detection Algorithm" proposed in ITU Recommendation G.729 Annex B. These methods are very accurate in noisy environments but complex to implement.
All of these methods require an input speech signal. Some applications that employ a speech decompression scheme need to perform speech detection during decompression.
European patent application EP-A-0785419, by Benyassine et al., relates to a method for voice activity detection that comprises the following steps:
(i) extracting a predetermined set of parameters from the incoming speech signal for each frame, and
(ii) making a frame voicing decision on the incoming speech signal for each frame according to a set of difference measures extracted from the predetermined set of parameters.
VADs in cellular systems are biased so as to ensure that the radio equipment, including the speech codec and RF circuitry, is activated whenever one party speaks, so that the speech is sent to the other party even in background noise and other impairing conditions. This, however, causes data to be transmitted at times when the party is not speaking. The cost of this approach is a slightly reduced battery life and slightly increased interference to co-channel users in other cells of the system. These are essentially second (or higher) order effects.
In these systems, no limited resource constrains the design of a duplex call. The uplink and the downlink, usually on different carriers, can both use the whole bandwidth simultaneously.
In the field of this invention it is known that some voice activity or voice onset detectors (VAD/VOD) attempt to identify voiced speech from characteristics of speech such as harmonic structure (for example, by auto-correlation). In noise, however, these structural indicators can fail, either because the speech structure is corrupted or because the noise itself has structure. In a car, for example, such noise may come from the engine, the tyres or the air conditioning. Finally, these methods are weak at detecting unvoiced speech.
An alternative is simply to use the frame energy level to detect speech. This is satisfactory for speech in high signal-to-noise ratio (SNR) conditions, where any threshold set above the noise level indicates speech. However, this approach fails in many practical noise conditions.
For non-normalised databases, or in real applications, the noise level in one set of examples may well be higher than the speech level in another set, which makes it impossible to set a fixed threshold. An existing way round this problem is to take the mean of roughly the first 100 milliseconds of an utterance, assume that it represents noise, and so create a threshold specific to that utterance. Even this, however, is inadequate for non-stationary noise, where the noise may depart rapidly from the initial estimate, where the noise has high variance, or where the first few frames actually contain speech rather than the assumed noise.
There is therefore a need for an improved voice activity detector and validator for noisy environments, in which the above disadvantages are alleviated.
Summary of the invention
According to a first aspect of the present invention, there is provided a communication unit as claimed in claim 1.
According to a second aspect of the present invention, there is provided a method of detecting a speech signal input to a communication unit, as claimed in claim 11.
According to a third aspect of the present invention, there is provided a method of determining whether a signal input to a communication unit is speech or noise, as claimed in claim 14.
Further aspects of the invention are as described in the dependent claims.
In summary, the invention aims to address non-stationary noise of any amplitude by using an energy acceleration measurement (preferably of the energy magnitude) to indicate the presence or absence of speech.
Brief description of the drawings
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which:
Fig. 1 shows a block diagram of a communication unit adapted to perform the voice activity detection and validation of the preferred embodiments of the present invention;
Fig. 2 shows a flowchart of an energy acceleration based voice activity detector for noisy environments, according to a preferred embodiment of the invention;
Fig. 3 shows a flowchart of energy acceleration based voice activity validation for noisy environments, according to a preferred embodiment of the invention; and
Fig. 4 shows a buffer operation according to a preferred embodiment of the invention.
Description of the preferred embodiments
Voiced speech has a relatively high energy acceleration value, because the onset of voiced speech depends on the vocal cords being either in vibration or static. Similarly, the onset of unvoiced sounds (for example plosives) also has a high energy acceleration.
The inventors have recognised that, in a representation domain in which speech features are prominent, such as the narrowband power spectrum or the Mel spectrum, the resulting energy acceleration of speech is much higher than that of non-stationary noise. The only major exception is impulsive noise (for example, clapping).
Accordingly, in a preferred embodiment of the invention, the inventors have found that speech can be further separated from such noise by concentrating on the energy in the frequency region likely to contain the fundamental pitch of the speech signal. Specifically, the inventors propose using a non-structural feature of speech, namely energy acceleration (or the acceleration of some measure that reflects the speech energy or a component of it).
In particular, a preferred application of the inventive concepts described herein is the Distributed Speech Recognition (DSR) standard currently being defined by the European Telecommunications Standards Institute (ETSI): "Speech Processing, Transmission and Quality aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithm", ETSI ES 201 108 v1.1.2 (2000-04), April 2000.
Referring now to Fig. 1, there is shown a block diagram of an audio subscriber unit 100 adapted to support the inventive concepts of the preferred embodiments of the invention.
The preferred embodiment is described with reference to a wireless audio communication unit, for example one capable of providing DSR functionality when operating under the 3rd Generation Partnership Project (3GPP) standards for future cellular wireless communication systems. However, the inventive concepts relating to voice activity detection and validation described herein apply equally to any electronic device that responds to speech signals and could benefit from improved voice activity detection circuitry, and this is within the scope of the invention.
As known in the art, the audio subscriber unit 100 contains an antenna 102, preferably coupled to a duplex filter, antenna switch or circulator 104 that provides isolation between the receive chain and the transmit chain within the audio subscriber unit 100.
The receiver chain includes receiver front-end circuitry 106 (effectively providing reception, filtering and intermediate or base-band frequency conversion). The front-end circuitry 106 is coupled to a signal processing function 108, generally realised by a digital signal processor (DSP). The signal processing function 108 performs signal demodulation, error correction and formatting. Data recovered by the signal processing function 108 is passed to an audio processing function 109, which formats the received signal in a suitable manner for sending to an audio sounder/display 111.
In different embodiments of the invention, the signal processing function 108 and the audio processing function 109 may be provided within the same physical device. A controller 114 is arranged to control the information flow and the operational state of the elements of the subscriber unit 100.
As regards the transmit chain, this essentially includes an audio input device 120 coupled in series through the audio processing function 109, the signal processing function 108, transmitter/modulation circuitry 122 and a power amplifier 124. The processor 108, transmitter/modulation circuitry 122 and power amplifier 124 are operationally responsive to the controller. The power amplifier output is coupled to the duplex filter, antenna switch or circulator 104 and the antenna 102, to radiate the final radio frequency signal.
In particular, the audio processing function 109 includes a voice activity (or voice onset) detection (VAD) function 130 operably coupled to a voice activity decision function 135. According to a preferred embodiment of the invention, the VAD function 130 and the voice activity decision function 135 are adapted to provide an improved voice detection and decision mechanism, whose operation is described further with reference to Fig. 2 and Fig. 3. Note that the voice activity detector function 130 comprises a detection stage made up of three frame-by-frame measurements. These three measurements are taken on:
(i) the whole spectrum;
(ii) a spectral sub-region (sub-band); and
(iii) the spectral variance.
The voice activity decision function 135 then makes its decision from a buffer of these measurements, which it analyses for speech likelihood. The final decision of the decision stage can be applied retrospectively to the earliest frame in the buffer.
In a preferred embodiment of the invention, a timer/counter 118 is also adapted to perform the timing functions in the detection and decision processes of Fig. 2 and Fig. 3.
The signal processing function 108, audio processing function 109, VAD function 130 and voice activity decision function 135 may be implemented as distinct, operably coupled processing elements. Alternatively, one or more processors may be used to implement one or more of the corresponding processing operations. In a further alternative embodiment, these functions may be implemented as a mixture of hardware, software or firmware elements, using application specific integrated circuits (ASICs) and/or processors, for example digital signal processors (DSPs).
Of course, the various components within the audio subscriber unit 100 can be realised in discrete or integrated component form, the ultimate structure being a matter of choice.
There are several ways to obtain the energy acceleration indication used in the preferred embodiment of the invention:
(i) The theoretically ideal method is to double-differentiate the energy level exactly across successive frames of the utterance, as shown in the earlier published US 6009391. The drawback of this method is that it may introduce delay, because several frames on either side of the frame under analysis are needed.
(ii) A zero-delay estimate of the energy acceleration can be obtained by comparing the instantaneous value with a short-term mean, for example:
Using a frame average:
$$\tilde{A} = \frac{x_t}{(x_t + x_{t-1} + \cdots + x_{t-n})/(n+1)} \qquad [1]$$
Or using a rolling average:
$$\tilde{A} = \frac{x_t}{a x_t + b x_{t-1} + \cdots + k x_{t-n}} \qquad [2]$$
In each case the method returns a value that can be interpreted as 'deceleration' < 1 < 'acceleration'. Empirical values for the threshold and for the denominator length that best discriminate speech from noise can then be found.
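Equation [1] can be illustrated with a short Python sketch. The function name `acceleration_frame_average`, the sample energies and the handling of a zero mean are illustrative assumptions and are not taken from the described embodiment.

```python
def acceleration_frame_average(energies, n):
    """Equation [1]: newest frame energy divided by the mean of the last n+1 energies.

    `energies` is a list of per-frame energy values, newest last.
    If fewer than n+1 frames are available, the mean is taken over what exists
    (an assumption made for this sketch).
    """
    window = energies[-(n + 1):]
    mean = sum(window) / len(window)
    # Returning 1.0 for an all-zero window is an assumption, meaning "no change".
    return energies[-1] / mean if mean > 0 else 1.0


steady = [4.0, 4.0, 4.0, 4.0]    # stationary level
onset = [4.0, 4.0, 4.0, 16.0]    # sudden rise in the newest frame
print(acceleration_frame_average(steady, 3))  # 1.0: neither acceleration nor deceleration
print(acceleration_frame_average(onset, 3))   # 16 / 7, about 2.29: acceleration
```

A value above 1 indicates that the newest frame is rising relative to its recent history, and a value below 1 indicates that it is falling.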
The inventors have recognised that the preferred solution is to find a denominator that tracks non-stationary noise quickly, yet is too slow to track speech onsets. A suggested sequence of weights for the rolling average is a = 0.2, b = 0.8 × a, c = 0.8 × b, and so on, which can be expressed simply as a recursion:
$$d_t = 0.2\,x_t + 0.8\,d_{t-1} \qquad [3]$$
Then:
$$A = x_t / d_t \qquad [4]$$
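The recursion of equations [3] and [4] can be sketched in Python as follows. Unrolling the recursion gives the weights 0.2, 0.16, 0.128 and so on, which match a, b and c above. The function name `rolling_acceleration` and the choice to seed the rolling average with the first sample are assumptions made for this sketch.

```python
def rolling_acceleration(energies, alpha=0.2):
    """Return A_t = x_t / d_t for each frame, with d_t = alpha*x_t + (1-alpha)*d_{t-1}."""
    d = None
    ratios = []
    for x in energies:
        # Seeding d with the first sample is an assumption; the source does not
        # state an initial value for the recursion.
        d = x if d is None else alpha * x + (1.0 - alpha) * d
        ratios.append(x / d if d > 0 else 1.0)
    return ratios


# A steady level gives ratios of 1; a sudden tenfold rise gives a ratio well above 1.
print(rolling_acceleration([1.0, 1.0, 1.0, 10.0]))
```

Because the denominator contains only 20% of the newest frame, a speech onset raises the numerator much faster than the denominator, which is what makes the ratio respond to acceleration.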
The preferred VAD and parameter initialisation scheme of the detection stage is summarised in the flowchart of Fig. 2. In non-stationary noise, a long-term energy threshold is not a reliable indicator of speech. Similarly, in high-noise conditions the structure of speech (for example its harmonics) cannot be wholly relied upon as an indicator, because it may be corrupted by the noise, or structured noise may confuse the detector. The preferred voice activity detector therefore uses a noise-robust feature of speech, namely the energy acceleration associated with speech onset.
Referring now to Fig. 2, a flowchart 200 of the preferred detection process is shown. As noted above, the process performs a frame-by-frame analysis. The preferred VAD mechanism involves the 'whole spectrum' measurement process. First, the frame counter is tested to determine whether it is less than 'N', which defines the number of buffered frames, as shown in step 205. As an example in the preferred embodiment, 'N' is set to 15, assuming a frame increment of, for example, 10 milliseconds. If the frame counter is less than 'N' in step 205, the rolling mean is updated subject to the initial acceleration test, as in step 210. If the frame counter is not less than 'N' in step 205, step 210 is skipped.
Next, a determination is made as to whether the energy acceleration measure lies within one or more specified limits, as shown in step 235. If it does, the rolling mean is updated with the result of the further energy acceleration test, as in step 240. If it does not, step 240 is skipped.
A determination is then made as to whether the energy acceleration measure exceeds a specified threshold, as shown in step 260. If it does, the frame is deemed to be a speech frame, as in step 265. If it does not, the frame is deemed to be a noise frame, as in step 270.
The frame counter is then incremented, as in step 275, and the process repeats from step 205.
As a refinement of this process, a sub-region measurement process, shown by the optional steps 215 and 245, can be carried out instead of, or in addition to, the whole spectrum measurement process. The particular sub-region of the spectrum is chosen as the one most likely to contain the fundamental pitch.
In the sub-region process, when the rolling mean is updated with the initial acceleration test in step 210 of the whole spectrum measure, a determination is made as to whether the energy acceleration measure exceeds a threshold, as shown in step 220. If it does, the initialisation of the other parameters is suspended, as shown in step 225. If it does not, the initialisation of the other parameters is updated, as in step 230. The process then returns to step 235, as shown.
A further preferred determination is made after the determination in step 235 as to whether the energy acceleration measure lies within the one or more specified limits. The deceleration value is evaluated in step 250 to determine whether it is 'high' and, if so, the rolling average used for the energy acceleration test is updated slowly, as shown in step 255. The process then returns to the whole spectrum method at step 260.
In this way, the higher signal-to-noise ratio (SNR) of the sub-region detector makes it more noise-robust. It is, however, vulnerable to unfavourable microphone and speaker variations and to band-limited noise, so this measure should not be relied upon in all environments. The preferred embodiment of the invention therefore incorporates the sub-region detector to augment the whole spectrum measure.
A further measurement process is preferably carried out using the 'acceleration' of, for example, the variance of the values in the lower half of the spectrum of each frame. The variance measure detects structure in the lower half of the spectrum, which makes it highly sensitive to voiced speech. The variance measure follows the method of the sub-region process, with the lower half of the spectrum as the particular sub-region selected. The variance measure further complements the whole spectrum method, which is better able to detect unvoiced and plosive speech.
All three measures take their raw input from a spectral representation of the filter gain generated by the first stage of a double Wiener filter, as described in US patent application US 09/427497, with Motorola Inc. as applicant and Yan-Ming Chen as inventor. As described above, each measure uses a different aspect of this data.
Specifically, the whole spectrum detector uses the Mel-filtered spectral representation of the filter gain generated by the first stage of the known double Wiener filter. A single input value is obtained by squaring the sum of the Mel filter banks.
In a preferred embodiment of the invention, the whole spectrum detector applies the following process to every frame:
Step 1 initialises the noise estimate Tracker as follows:
If Frame < 15 and Acceleration < 2.5,
then Tracker = MAX(Tracker, Input).
If speech occurs within the lead-in time of 15 frames, the energy acceleration measure prevents the Tracker from being updated.
Step 2 updates the Tracker if the current input is similar to the noise estimate, as follows:
If Input < Tracker × UpperBound and
Input > Tracker × LowerBound,
then Tracker = a × Tracker + (1 - a) × Input.
Step 3 provides a fail-safe mechanism for cases in which speech, or an uncharacteristically large noise content, is present in the first few frames. It causes the resulting erroneously high noise estimate to decay. Step 3 is preferably carried out as follows:
If Input < Tracker × Floor,
then Tracker = b × Tracker + (1 - b) × Input.
Step 4 returns a 'true' speech decision if the current input is more than 165% of the Tracker, as follows:
If Input > Tracker × Threshold,
then output 'true', otherwise output 'false'.
The ratio of the instantaneous input to the short-term mean Tracker is a function of the energy acceleration of successive inputs.
In the above:
a = 0.8 and b = 0.97;
UpperBound is 150% and LowerBound is 75%;
Floor is 50%; and
Threshold is 165%.
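Steps 1 to 4 can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the class name `WholeSpectrumDetector` is hypothetical, frames are counted from 1, and the acceleration used in step 1 is taken as the input divided by a cumulative mean of the lead-in frames.

```python
class WholeSpectrumDetector:
    """Noise-tracking threshold detector following steps 1 to 4 (a sketch)."""

    def __init__(self, a=0.8, b=0.97, upper=1.5, lower=0.75, floor=0.5,
                 threshold=1.65, init_frames=15, init_accel_limit=2.5):
        self.a, self.b = a, b
        self.upper, self.lower, self.floor = upper, lower, floor
        self.threshold = threshold
        self.init_frames = init_frames
        self.init_accel_limit = init_accel_limit
        self.tracker = 0.0   # noise estimate ("Tracker")
        self.mean = 0.0      # cumulative mean used only during the lead-in (assumption)
        self.frame = 0

    def process(self, value):
        """Feed one per-frame input value; return True for speech, False for noise."""
        self.frame += 1
        # Step 1: initialise the tracker unless the input is accelerating sharply.
        if self.frame < self.init_frames:
            self.mean = ((self.frame - 1) * self.mean + value) / self.frame
            accel = value / self.mean if self.mean > 0 else 1.0
            if accel < self.init_accel_limit:
                self.tracker = max(self.tracker, value)
        # Step 2: follow the input while it stays close to the noise estimate.
        if self.tracker * self.lower < value < self.tracker * self.upper:
            self.tracker = self.a * self.tracker + (1 - self.a) * value
        # Step 3: fail-safe, slowly pull down an over-estimated noise level.
        if value < self.tracker * self.floor:
            self.tracker = self.b * self.tracker + (1 - self.b) * value
        # Step 4: speech if the input is well above the noise estimate.
        return value > self.tracker * self.threshold


detector = WholeSpectrumDetector()
noise_flags = [detector.process(1.0) for _ in range(20)]  # steady noise: all False
print(detector.process(3.0))                              # True: above 165% of the tracker
```

Note that an input above UpperBound, or between Floor and LowerBound, leaves the Tracker unchanged in this sketch, so a burst of speech does not raise the noise estimate.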
Note that if the input is greater than UpperBound, or lies between LowerBound and Floor, no update is made. In addition, as noted above, the energy acceleration of the input can be calculated as follows:
By double-differentiating successive inputs, or by estimating it from the ratio of two rolling averages that track the input.
Note that the ratio of a fast-adapting and a slow-adapting rolling average reflects the energy acceleration of successive inputs.
For example, the contribution rates used for these averages above are:
(i) 0 × mean + 1 × input, and
(ii) ((Frame - 1) × mean + 1 × input) / Frame,
which makes the energy acceleration measure increasingly sensitive over the first 15 frames.
The sub-band detector preferably uses the mean of the second, third and fourth Mel filter banks derived from the 'whole spectrum' measure. The detector then applies the following process to every frame:
(i) Input = p × CurrentInput + (1 - p) × PreviousInput;
(ii) If Frame < 15,
then Tracker = MAX(Tracker, Input);
(iii) If Input < Tracker × UpperBound and
Input > Tracker × LowerBound,
then Tracker = a × Tracker + (1 - a) × Input;
(iv) If Input < Tracker × Floor,
then Tracker = b × Tracker + (1 - b) × Input;
(v) If Input > Tracker × Threshold,
then output 'true', otherwise output 'false'.
In the sub-region measure:
p = 0.75.
All other parameters are the same as for the whole spectrum measure, except for the Threshold, which equals 3.25.
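The sub-band steps (i) to (v) can be sketched in Python as below. The function name `subband_decisions` and the list-of-lists input format are hypothetical. Two readings are assumed here because the text does not settle them: PreviousInput is taken to be the previous smoothed input, and frames are counted from 1.

```python
def subband_decisions(mel_frames, p=0.75, a=0.8, b=0.97, upper=1.5,
                      lower=0.75, floor=0.5, threshold=3.25, init_frames=15):
    """Return one True/False speech flag per frame of Mel filter bank values."""
    tracker = 0.0
    previous = None
    decisions = []
    for frame, mel in enumerate(mel_frames, start=1):
        # Mean of the second, third and fourth Mel banks (indices 1..3).
        current = sum(mel[1:4]) / 3.0
        # (i) smooth the input; the first frame has no predecessor (assumption).
        value = current if previous is None else p * current + (1 - p) * previous
        previous = value
        # (ii) lead-in initialisation of the noise estimate.
        if frame < init_frames:
            tracker = max(tracker, value)
        # (iii) follow inputs that are close to the noise estimate.
        if tracker * lower < value < tracker * upper:
            tracker = a * tracker + (1 - a) * value
        # (iv) fail-safe decay of an over-estimated noise level.
        if value < tracker * floor:
            tracker = b * tracker + (1 - b) * value
        # (v) decision against the sub-band threshold of 3.25.
        decisions.append(value > tracker * threshold)
    return decisions


noise = [0.0, 1.0, 1.0, 1.0, 0.0]
loud = [0.0, 10.0, 10.0, 10.0, 0.0]
print(subband_decisions([noise] * 20 + [loud])[-1])  # True
```

The higher threshold than in the whole spectrum measure means that only a strong rise in the selected banks is flagged.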
For the spectral variance measure, the variance of the values in the lower-half frequencies of the narrowband spectral representation of the gain for each frame is used as the input. The detector then applies the same process as the whole spectrum measure.
The variance is calculated as:
$$\frac{1}{N}\sum_{i=0}^{N-1} w_i^2 - \left(\sum_{i=0}^{N-1} w_i\right)^2 / N^2 \qquad [5]$$
where:
N = FFT length / 4, and
w_i are the values of the narrowband spectral representation of the gain.
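Equation [5] is the mean of the squares minus the square of the mean, taken over the first N spectral values. A minimal Python sketch follows; the function name `lower_half_variance` and the assumption that `gain_spectrum` lists the gain values from the lowest frequency bin upwards are illustrative only.

```python
def lower_half_variance(gain_spectrum, fft_length):
    """Equation [5]: variance of the first N = fft_length / 4 gain values."""
    n = fft_length // 4
    w = gain_spectrum[:n]
    if len(w) < n:
        raise ValueError("gain_spectrum must hold at least fft_length / 4 values")
    mean_of_squares = sum(v * v for v in w) / n
    square_of_mean = (sum(w) ** 2) / (n * n)
    return mean_of_squares - square_of_mean


flat = [0.5] * 64            # no structure in the spectrum
structured = [0.0, 1.0] * 32 # alternating values, strong structure
print(lower_half_variance(flat, 256))        # 0.0
print(lower_half_variance(structured, 256))  # 0.25
```

A flat gain spectrum gives zero variance, while a spectrum with pronounced peaks and troughs, such as the harmonics of voiced speech, gives a larger value.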
According to a preferred embodiment of the invention, the three measures detailed above are presented to the VAD decision algorithm, as shown in the flowchart of Fig. 3. Successive inputs are presented to a buffer, which provides contextual analysis. This introduces a frame delay equal to the buffer length minus one frame.
Referring now to Fig. 3, a flowchart 300 is shown of the acceleration-based voice activity validation process for noisy environments, according to a preferred embodiment of the invention.
For an N = 7 frame buffer, the most recent true/false input is stored at position N of the data buffer, as shown in step 305. The decision logic applies a number of the following steps, and preferably each of them:
Step 1:
V N=measure 1 or measure 2 or measure 3
If any one in these three measurements returned true voice indication, then import V NBe defined as value of true (T).
Step 2:
Figure A0380268200172
The longest continuous sequence of value of true value in this algorithm search impact damper is as step 310.Therefore, for example, for sequence ' TTFTTTF ', M equals 3.
Step 3:
If M 〉=S PAnd T<L S, T=L S
Wherein, S PBe equal to first thresholding in the step 315.If the maximum length sequence of true (T) speech value equals or exceeds first thresholding, i.e. S in step 315 P=3 or how continuous value of true value, then impact damper is judged as and comprises the voice of ' may (possible) '.If in step 320, determine also not have (or surpassing), then in step 325, start for example L SThe short timer T (time _ 1) of=5 frames.
Step 4:
If M 〉=S LAnd F>F S, T=L M, otherwise T=L L
Wherein, S LEqual second thresholding in the step 330.If there is S L=4 or how continuous value of true value, judge once more that then impact damper comprises the voice of ' may (likely) '.If be in initial importing F safety period as determined present frame F in the step 335 SOutside, then in step 340, start for example L MThe middle timer T of=22 frames.Otherwise, in step 345, use for example L LThe long timer T of the fault secure of=40 frames.Voice in language use this layout can make the initial noise of VAD overvalued when occurring in early days.
Step 5:
If M < S_P and T > 0, then T--
If the process determines in step 350 that there are fewer than S_P = 3 consecutive true values, and in step 355 that the timer is greater than zero, the timer is decremented in step 360.
Step 6:
If T > 0, output true; otherwise output false
If the timer is greater than zero in step 365, the process outputs a true 'speech' decision, as shown in step 370. Otherwise, if the timer is not greater than zero in step 365, the process outputs a 'noise' decision, as shown in step 375.
Step 7:
Frame++, shift the buffer left and return to step 1.
The next frame is prepared in step 380: the buffer is shifted left to accommodate the next input, as shown in Fig. 4. The output speech decision is applied to the frame that leaves the buffer. The process then repeats from step 305 for the next true/false input entering the data buffer.
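Steps 1 to 7 can be sketched in Python as follows. This is only a minimal sketch of a literal reading of the steps, not the implementation of the preferred embodiment. The function names, the padding used to flush the buffer, and the value f_s = 10 for the lead-in period F_S (for which the text gives no number) are assumptions made for illustration. The order of the timer update relative to the output follows the order in which the steps are listed, so the exact hangover length it produces depends on that reading.

```python
from collections import deque


def longest_true_run(values):
    """Step 2: length M of the longest run of consecutive True values."""
    best = run = 0
    for v in values:
        run = run + 1 if v else 0
        best = max(best, run)
    return best


def vad_decisions(inputs, n=7, s_p=3, s_l=4, l_s=5, l_m=22, l_l=40, f_s=10):
    """Return one speech (True) / noise (False) decision per input frame.

    inputs: per-frame booleans V_N (step 1: OR of the three measurements).
    f_s is a hypothetical lead-in period; the text gives no value for F_S.
    """
    buffer = deque([False] * n, maxlen=n)  # position N is the right-hand end
    timer = 0
    emitted = []
    # Assumption: pad with N-1 False inputs so the last real frames are flushed.
    padded = [bool(v) for v in inputs] + [False] * (n - 1)
    for frame, v in enumerate(padded, start=1):
        buffer.append(v)                   # shift left and store the newest input
        m = longest_true_run(buffer)       # step 2
        if m >= s_p and timer < l_s:       # step 3: 'possible' speech
            timer = l_s
        if m >= s_l:                       # step 4: 'likely' speech
            timer = l_m if frame > f_s else l_l
        if m < s_p and timer > 0:          # step 5: count the timer down
            timer -= 1
        emitted.append(timer > 0)          # step 6
    # The decision made when frame k arrives applies to frame k - (N - 1).
    return emitted[n - 1:]


# Usage: 22 frames in which only frames #6, #7 and #8 carry a true input.
votes = [False] * 22
for k in (6, 7, 8):
    votes[k - 1] = True
decisions = vad_decisions(votes)
```

Dropping the first N - 1 emitted values aligns each decision with the frame that leaves the buffer, which is where the delay of buffer length minus one frame comes from.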
Alternative mechanisms for making the speech or noise decision from the energy acceleration processing described above are also within the contemplation of the invention. For example, the decision mechanism need not be based on one or more timers, and may instead decide solely on whether one or more energy acceleration thresholds are exceeded.
With reference now to Fig. 4, an example of the buffer operation 400 according to a preferred embodiment of the invention is shown in more detail. Assume that the first threshold is set to three consecutive true values. At time 't' 410, assume that only the current input (frame #7) 425 and the previous input (frame #6) 420 are true. Therefore, when the buffer shifts, the first frame (frame #1) 415 will be marked false.
At time 't+1' 430, a third true input (frame #8) 450 is received, adding to the two previous true inputs 440 and 445. Therefore, when the buffer shifts, the next output frame (frame #2) 435 will be marked true.
It should be noted that the only constraints in the decision process described above are:
(i) time_1 < time_2 < time_3, and
(ii) threshold_1 < threshold_2.
Assuming that only these three inputs (frame #6, frame #7 and frame #8) are true, the complete output sequence is:
F  T  T  T  T  T  T  T  T  T  T
1  2  3  4  5  6  7  8  9  10 11
T  T  T  T  T  T  F  F  F  F  F
12 13 14 15 16 17 18 19 20 21 22
Here, frames #2 to #5 are marked true because of the buffer lead-in function. Frames #6 to #8 are marked true as the location of the actual initial true speech inputs. Frames #9 to #12 are marked true because of the buffer lead-out function. Frames #13 to #18 are marked true in response to the delay timer used. Once all frames in the utterance have been input, the buffer shifts out false entries (frames #19 to #L_M) until it is empty.
The buffer length and the delay timers may be adjusted dynamically to suit the needs of the voice communication unit, and this is also within the contemplation of the invention. Likewise, the buffer length 'N' of 8 and the 5-frame delay timer of the preferred embodiment are for illustrative purposes only. It should be noted, however, that the buffer length 'N' should always be defined such that N ≥ S_L.
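The stated constraints lend themselves to a simple configuration check. The sketch below is illustrative only: the class and field names are hypothetical, the mapping of time_1, time_2 and time_3 to L_S, L_M and L_L and of threshold_1 and threshold_2 to S_P and S_L is a reading of the description, and the default N = 7 is taken from the step-by-step description of Fig. 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VadConfig:
    """Hypothetical container for the tunable values named in the text."""
    buffer_len: int = 7    # N
    threshold_1: int = 3   # S_P, 'possible' speech
    threshold_2: int = 4   # S_L, 'likely' speech
    time_1: int = 5        # L_S, short timer (frames)
    time_2: int = 22       # L_M, medium timer (frames)
    time_3: int = 40       # L_L, failsafe long timer (frames)

    def problems(self):
        """Return a list of violated constraints (empty if the config is valid)."""
        issues = []
        if not (self.time_1 < self.time_2 < self.time_3):
            issues.append("timers must satisfy time_1 < time_2 < time_3")
        if not (self.threshold_1 < self.threshold_2):
            issues.append("thresholds must satisfy threshold_1 < threshold_2")
        if self.buffer_len < self.threshold_2:
            issues.append("buffer length N must be at least S_L")
        return issues


# Usage: check a dynamically adjusted configuration before applying it.
candidate = VadConfig(buffer_len=8, time_1=5)
issues = candidate.problems()
```

Returning a list of problems rather than raising on the first one is a design choice that suits dynamic adjustment, where a caller may prefer to keep the previous settings and log every violated constraint.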
In addition to serving as a VAD in its own right, the energy acceleration measurements performed in the method steps of Fig. 2 can be used to validate the initialization of other parameters, and this is also within the contemplation of the invention. For example, spectral subtraction schemes require an initial estimate of the noise from the first ten frames (typically 100 milliseconds) of the speech. Even in stationary noise, certain events can occur that invalidate the initial estimate. Examples of such events include:
(a) Ramp-up of the signal:
For a variety of possible reasons, the start of a recording may 'ramp up' to its full value during the estimation period. Possible causes of ramp-up include buffer filling in digital systems, capacitance in analog systems, or a lead being connected. The effect of these events is to invalidate the estimate. The energy acceleration measurement can therefore be used to detect such ramp-up and prevent this error.
(b) Glitch in the initial signal:
A common 'glitch' accompanies the full actuation of the push-to-talk (PTT) button on the user's radio unit, where electrical contact occurs slightly before the button strikes the back of the switch. As described above, when such an event occurs the energy acceleration measurement can be used to suspend the estimation process, as shown in step 225 of Fig. 2.
(c) Speech in the initial signal:
Another common event, particularly in PTT systems, is the user starting to speak immediately on pressing the PTT button, so that speech begins right after electrical contact is made. The energy acceleration measurement can identify this and suspend the noise-based initialization, as shown in step 225 of Fig. 2, or force the use of a failsafe estimate.
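The gating idea can be illustrated with a short Python sketch. It is not the measurement of Fig. 2: the passage above does not restate the formula, so the sketch assumes, purely for illustration, that the acceleration measure is the second difference of successive log frame energies. The function names and the threshold of 1.0 are hypothetical, and an abrupt event such as a glitch or a speech onset is the case this particular measure catches.

```python
import math


def log_energy(frame):
    """Log of the frame's sum of squares (a small floor avoids log(0))."""
    return math.log(sum(x * x for x in frame) + 1e-12)


def energy_acceleration(log_energies):
    """Assumed measure: second difference of successive log energies."""
    return [
        log_energies[i] - 2.0 * log_energies[i - 1] + log_energies[i - 2]
        for i in range(2, len(log_energies))
    ]


def initial_noise_estimate(frames, init_frames=10, accel_threshold=1.0):
    """Mean log energy of the first frames, or None if the estimate is suspended."""
    head = frames[:init_frames]
    if not head:
        return None
    energies = [log_energy(f) for f in head]
    if any(abs(a) > accel_threshold for a in energy_acceleration(energies)):
        return None  # abrupt energy change seen: do not trust this estimate
    return sum(energies) / len(energies)


# Usage: ten quiet frames of 80 samples, then the same with a glitch in frame 5.
steady = [[0.01] * 80 for _ in range(10)]
glitchy = [list(f) for f in steady]
glitchy[4] = [1.0] * 80
clean_estimate = initial_noise_estimate(steady)
suspended = initial_noise_estimate(glitchy)
```

A caller that receives None could retry on later frames or fall back to a failsafe estimate, matching the two options described for case (c).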
In summary, a communication unit comprising an audio processing unit having a voice activity detection mechanism has been described. The voice activity detection mechanism provides an indication of the energy acceleration of a signal input to the communication unit and determines, based on that indication, whether the input signal is speech or noise.
In addition, a method of detecting a speech signal input to a communication unit has been described. The method comprises the steps of: indicating the acceleration of an input signal input to the communication unit; and determining, based on the indicating step, whether the input signal is speech or noise.
Furthermore, a method of deciding whether a signal input to a communication unit is speech or noise has been described. The method comprises the step of deciding, based on energy acceleration, whether the input signal is speech or noise, for example using an average or rolling average over a number of frames of the input signal.
It will therefore be appreciated that the energy-acceleration-based voice activity detector and validator for noisy environments described above provides the advantages of noise robustness and fast response. Because the preferred embodiment uses measurements that depend on energy acceleration rather than on absolute measurements, the inventive concepts described herein can be applied to speech at any input level.
Whilst specific and preferred implementations of embodiments of the invention are described above, it is clear that a person skilled in the art could readily apply variations and modifications of the inventive concepts that fall within the scope of the invention.
Thus, an improved voice activity detector and validator for noisy environments has been described, in which the aforementioned disadvantages associated with prior art arrangements have been substantially alleviated.
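The level independence can be made concrete with a short derivation. It rests on an assumption that is not taken from the text: that the acceleration measure a_n is the second difference of the log frame energy E_n. Under that assumption, a constant gain g on the input cancels out exactly.

```latex
% Assumption (for illustration only): acceleration as a second difference
% of log frame energy E_n.
a_n = \log E_n - 2\log E_{n-1} + \log E_{n-2}

% A constant gain g on the input scales every frame energy by g^2:
E_n' = g^2 E_n
\quad\Longrightarrow\quad
a_n' = a_n + (1 - 2 + 1)\,2\log g = a_n
```

An absolute energy threshold, by contrast, would need to be rescaled by g^2 for every change of input level.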

Claims (15)

1. A communication unit (100) comprising an audio processing unit (109) having a voice activity detection mechanism (130, 135), the communication unit (100) characterized in that the voice activity detection mechanism (130, 135) measures the energy acceleration of a signal input to the communication unit (100) and determines, based on the measurement, whether the input signal is speech or noise.
2. The communication unit (100) of claim 1, wherein the voice activity detection mechanism comprises a voice activity detector function (130) that performs frame-by-frame detection of speech on the signal input to the voice activity detection mechanism (130, 135).
3. The communication unit (100) of claim 2, wherein the frame-by-frame detection comprises performing an energy acceleration measurement on the signal input to the voice activity detection mechanism (130, 135) over one or more of the following:
(i) the whole spectrum;
(ii) a sub-band of the spectrum; and
(iii) the spectral variance.
4. The communication unit (100) of claim 3, wherein the voice activity detection mechanism comprises a voice activity decision function (135), operably coupled to the voice activity detector function (130), to decide whether the input signal is speech based on a buffering operation on one or more of the measurements.
5. The communication unit (100) of claim 4, wherein the voice activity decision function (135) uses an average or rolling average over a number of frames of the input signal to decide whether the input signal is speech.
6. The communication unit (100) of any of claims 2 to 5, wherein an input frame is deemed to be a speech frame (265) if the energy acceleration measurement yields an energy acceleration value greater than an energy acceleration threshold.
7. The communication unit (100) of claim 6, wherein the decision (265) that an input frame is a speech frame is applied retrospectively to preceding frames in a buffer of the input signal.
8. The communication unit (100) of claim 6 or claim 7, wherein an input frame is deemed to be a speech frame (370) if the energy acceleration measurement yields an energy acceleration value greater than the energy acceleration threshold for a number of consecutive frames.
9. The communication unit (100) of any of claims 3 to 8 when dependent on claim 3, wherein, if a sub-region of the input signal spectrum is selected, the selection is made on the basis that the sub-region is most likely to contain the fundamental pitch of the speech signal.
10. The communication unit (100) of any preceding claim, wherein the voice activity detection mechanism (130, 135) uses the acceleration of an energy-related characteristic of speech to validate the parameter initialization of other speech-related or noise-related measures, for example a spectral subtraction scheme.
11. A method of detecting a speech signal input to a communication unit, characterized by the steps of:
measuring an acceleration or change in the energy of an input signal input to the communication unit; and
determining (315, 330, 350), based on the measuring step, whether the input signal is speech (370) or noise (375).
12. The speech signal detection method of claim 11, further characterized by the step of:
performing frame-by-frame detection of speech on the signal input to the communication unit.
13. The speech signal detection method of claim 12, wherein the frame-by-frame detection comprises the step of:
performing an energy acceleration measurement on the input signal over one or more of the following:
(i) the whole spectrum;
(ii) a sub-band of the spectrum; and
(iii) the spectral variance.
14. A method of deciding whether a signal input to a communication unit is speech or noise, preferably according to any of preceding claims 11 to 13, the method characterized by the step of:
deciding (315, 330, 350) whether the input signal is speech (370) or noise (375) based on an acceleration or change in an energy measurement of the input signal, for example using an average or rolling average over a number of frames of the input signal.
15. The method of deciding whether a signal input to a communication unit is speech or noise of claim 14, wherein the deciding step comprises:
determining that an input frame is a speech frame (265) if the energy acceleration measurement yields an energy acceleration value greater than an energy acceleration threshold; and
applying the determination retrospectively to preceding frames in the buffer of the input signal.
CNB038026821A 2002-01-24 2003-01-10 Voice activity detector and validator for noisy environments Expired - Lifetime CN1307613C (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB0201585.7 2002-01-24
GB0201585A GB2384670B (en) 2002-01-24 2002-01-24 Voice activity detector and validator for noisy environments

Publications (2)

Publication Number Publication Date
CN1623186A true CN1623186A (en) 2005-06-01
CN1307613C CN1307613C (en) 2007-03-28

Family

ID=9929648

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB038026821A Expired - Lifetime CN1307613C (en) 2002-01-24 2003-01-10 Voice activity detector and validator for noisy environments

Country Status (6)

Country Link
JP (2) JP2005516247A (en)
KR (2) KR100976082B1 (en)
CN (1) CN1307613C (en)
FI (1) FI124869B (en)
GB (1) GB2384670B (en)
WO (1) WO2003063138A1 (en)

Also Published As

Publication number Publication date
KR100976082B1 (en) 2010-08-16
GB2384670A (en) 2003-07-30
GB0201585D0 (en) 2002-03-13
GB2384670B (en) 2004-02-18
KR20040075959A (en) 2004-08-30
JP2010061151A (en) 2010-03-18
KR20090127182A (en) 2009-12-09
FI124869B (en) 2015-02-27
WO2003063138A1 (en) 2003-07-31
JP2005516247A (en) 2005-06-02
FI20041013A (en) 2004-09-22
CN1307613C (en) 2007-03-28

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: MOTOROLA MOBILE CO., LTD.

Free format text: FORMER OWNER: MOTOROLA INC.

Effective date: 20110113

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20110113

Address after: Illinois State

Patentee after: MOTOROLA MOBILITY, Inc.

Address before: Illinois, USA

Patentee before: Motorola, Inc.

C41 Transfer of patent application or patent right or utility model
C56 Change in the name or address of the patentee
CP01 Change in the name or title of a patent holder

Address after: Illinois State

Patentee after: MOTOROLA MOBILITY LLC

Address before: Illinois State

Patentee before: MOTOROLA MOBILITY, Inc.

TR01 Transfer of patent right

Effective date of registration: 20160516

Address after: California, USA

Patentee after: Google Technology Holdings LLC

Address before: Illinois State

Patentee before: MOTOROLA MOBILITY LLC

CX01 Expiry of patent term

Granted publication date: 20070328

CX01 Expiry of patent term