WO2011010604A1 - Speech signal interval estimation device, speech signal interval estimation method, program therefor, and recording medium - Google Patents
Speech signal interval estimation device, speech signal interval estimation method, program therefor, and recording medium
- Publication number
- WO2011010604A1 (PCT/JP2010/061999)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech
- probability
- output
- voice
- gmm
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
Definitions
- the present invention relates to a speech signal interval estimation device and a speech signal interval estimation method for estimating the interval in which a speech signal exists within a signal containing multiple acoustic signals, to a program for executing the method on a computer, and to a recording medium on which the program is recorded.
- FIG. 22 shows a functional configuration for implementing the conventional speech signal section estimation method disclosed in Non-Patent Document 1 as a conventional speech signal section estimation apparatus 900, and its operation will be briefly described.
- the speech signal section estimation device 900 includes an acoustic signal analysis unit 90, a speech / non-speech state probability ratio calculation unit 95, and a speech signal section estimation unit 96.
- the acoustic signal analysis unit 90 further includes an acoustic feature extraction unit 91, a probability estimation unit 92, a parameter storage unit 93, and a GMM (Gaussian Mixture Model: mixture normal distribution model) storage unit 94.
- the parameter storage unit 93 includes an initial noise probability model estimation buffer 930 and a noise probability model estimation buffer 931.
- the GMM storage unit 94 includes a silent GMM storage unit 940 and a clean speech GMM storage unit 941 that store a previously generated silent GMM and a clean speech GMM, respectively.
- the acoustic feature extraction unit 91 extracts the acoustic feature O_t from the acoustic digital signal A_t, which contains a speech signal and a noise signal. For example, a log mel spectrum or a cepstrum can be used as the acoustic feature.
- the probability estimation unit 92 generates a non-speech GMM and a speech GMM adapted to the noise environment from the silence GMM and the clean speech GMM, and calculates, for the input acoustic feature O_t, the non-speech output probabilities of all normal distributions in the non-speech GMM and the speech output probabilities of all normal distributions in the speech GMM.
- the speech / non-speech state probability ratio calculation unit 95 calculates a speech / non-speech state probability ratio using the non-speech output probability and the speech output probability.
- the speech signal interval estimation unit 96 determines from the speech/non-speech state probability ratio whether the input acoustic signal is in a speech state or a non-speech state, and outputs, for example, only the speech-state portions as signal D_S.
- the conventional speech signal interval estimation method estimates the speech interval using all probability distribution models in the GMM. All probabilistic models are used because they were all considered important. This idea is disclosed, for example, in Non-Patent Document 2 as a method for detecting a speech signal interval and suppressing noise. The idea of using all probability distributions is also apparent from the following equation (1), which calculates the filter gain of the noise suppression filter shown in Non-Patent Document 2.
- b_k(O_t,j) is the output probability of the kth normal distribution, and K represents the total number of distributions.
- the present invention has been made in view of such problems. Recent research results have shown that not all probability distributions need to be used for speech signal interval detection and noise suppression. Therefore, the present invention provides a speech signal interval estimation device, a speech signal interval estimation method, a program for executing them on a computer, and a recording medium, which speed up processing by not using unnecessary distributions in the probability model (GMM).
- the speech signal section estimation device includes an acoustic signal analysis section and a section estimation information generation section.
- the acoustic signal analysis unit receives an acoustic digital signal including a speech signal and a noise signal, and uses a silent GMM and a clean speech GMM for each frame of the acoustic digital signal to generate a non-speech GMM and a speech GMM adapted to a noise environment.
- the non-speech output probability and the speech output probability of the remaining normal distribution are calculated by removing one or more normal distributions having the smallest output probability from the respective GMMs.
- the interval estimation information generation unit calculates a speech/non-speech state probability ratio based on a state transition model of the speech and non-speech states, using the non-speech and speech output probabilities, and generates and outputs speech interval estimation information based on the calculated probability ratio.
- the speech signal interval estimation device with noise suppression function of the present invention has the configuration of the speech signal interval estimation device described above, and further includes a noise suppression unit that takes as input the probability ratio output by the speech/non-speech state probability ratio calculation unit and the output probabilities output by the acoustic signal analysis unit, generates a noise suppression filter, and suppresses the noise in the acoustic digital signal.
- the acoustic signal analysis unit generates, for each frame, non-speech and speech probability models adapted to the noise environment from the silence GMM and the clean speech GMM, and calculates the output probabilities of only the required distributions.
- the speech signal interval is determined using only those output probabilities. Therefore, the processing can be faster than in the conventional speech signal interval estimation device, which uses all probability models.
- the speech signal section estimation device with noise suppression function of the present invention adds a noise suppression unit to the speech signal section estimation device of the present invention to suppress noise of the input speech signal.
- Brief description of the drawings (several captions were truncated in the source):
- FIG. 1: functional configuration example of the speech signal interval estimation device.
- Two figures showing the functional configuration example of the probability model parameter estimation/probability calculation unit 11 (in two parts), and figures showing the corresponding operation flows.
- A figure showing an example distribution of probability values: A is the distribution of the sorted output probabilities w_Sort,t,0,k' over the normal distributions k of the non-speech GMM, and B is the corresponding distribution w_Sort,t,1,k' for the speech GMM.
- A figure showing the state transition model of the speech and non-speech states.
- Figures showing functional configuration examples and operation flows of further speech signal interval estimation devices.
- A figure showing an experimental result: A is the acoustic input signal waveform, and B is the signal waveform of the noise-suppressed output.
- FIG. 1 shows a functional configuration example of the speech signal section estimation device 100 of the present invention.
- the operation flow is shown in FIG.
- the speech signal section estimation device 100 includes an acoustic signal analysis unit 10, a speech / non-speech state probability ratio calculation unit 95, a speech signal section estimation unit 96, and a control unit 20.
- the acoustic signal analysis unit 10 includes an acoustic feature amount extraction unit 91, a probability model parameter estimation / probability calculation unit 11, a GMM storage unit 94, and a parameter storage unit 93.
- the GMM storage unit 94 includes a silent GMM storage unit 940 and a clean speech GMM storage unit 941.
- the parameter storage unit 93 includes an initial noise probability model estimation buffer 930 and a noise probability model estimation buffer 931.
- the speech / non-speech state probability ratio calculation unit 95 and the speech signal section estimation unit 96 constitute a section estimation information generation unit 9.
- the acoustic signal A_t, the input to the speech signal interval estimation device 100, is an acoustic digital signal obtained by sampling an analog acoustic signal containing a speech signal and a noise signal, for example at a sampling frequency of 8 kHz. t denotes the frame number.
- an A / D converter that converts an analog acoustic signal into an acoustic digital signal is omitted.
- the audio signal section estimation device 100 is realized by a predetermined program being read into a computer constituted by, for example, a ROM, a RAM, a CPU, etc., and the CPU executing the program.
- the speech signal section estimation apparatus 100 differs from the conventional speech signal section estimation apparatus 900 shown in FIG. 22 only in the configuration and operation of a part of the acoustic signal analysis unit 10.
- the probability model parameter estimation/probability calculation unit 11 in the acoustic signal analysis unit 10 generates, for each frame, a non-speech GMM and a speech GMM adapted to the noise environment, and is characterized in that it calculates the output probabilities of only the normal distributions required from each GMM.
- the acoustic signal analysis unit 10 takes the discrete-valued acoustic digital signal A_t as input, divides it into frames of, for example, 200 samples (25 ms) each, and, using the silence GMM and clean speech GMM stored in the GMM storage unit 94, generates a non-speech GMM and a speech GMM adapted to the noise environment and calculates the non-speech and speech output probabilities of only the normal distributions required from each GMM (step S10).
- the acoustic feature extraction unit 91 applies a fast Fourier transform to the acoustic digital signal A_t of frame t, which contains the speech and noise signals, then applies a 12-dimensional mel filter bank analysis to calculate and output the 12-dimensional log mel spectrum vector O_t = {O_t,0, O_t,1, ..., O_t,11} (the acoustic feature at frame t).
- the subscript numbers 0, 1, ..., 11 denote the vector element number l (el).
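The feature extraction above (FFT followed by a 12-dimensional mel filter bank and a logarithm) can be sketched as follows. This is a minimal illustration, not the patent's exact implementation: the mel scale constants, filter edge placement, and FFT length are common defaults assumed here.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrum(frame, fs=8000, n_mels=12, n_fft=256):
    """Return a 12-dimensional log mel spectrum O_t for one frame."""
    spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2            # power spectrum
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    # Triangular mel filter bank spanning 0 Hz .. fs/2.
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_mels + 2))
    fbank = np.zeros((n_mels, len(freqs)))
    for m in range(n_mels):
        lo, c, hi = edges[m], edges[m + 1], edges[m + 2]
        up = (freqs - lo) / (c - lo)
        down = (hi - freqs) / (hi - c)
        fbank[m] = np.clip(np.minimum(up, down), 0.0, None)
    return np.log(fbank @ spec + 1e-10)                      # O_t (12 values)

frame = np.sin(2 * np.pi * 440 * np.arange(200) / 8000)      # one 25 ms frame at 8 kHz
O_t = log_mel_spectrum(frame)
```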
- the probability model parameter estimation / probability calculation unit 11 estimates a noise probability model parameter by applying a parallel nonlinear Kalman filter to the log mel spectrum input for each frame.
- the parallel nonlinear Kalman filter will be described later.
- the silent GMM storage unit 940 and the clean speech GMM storage unit 941 of the GMM storage unit 94 store the previously generated silent GMM and clean speech GMM, respectively.
- k denotes the index of each normal distribution.
- the total number K of each distribution is 32, for example. The value of K is determined by the balance between accuracy and processing speed.
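As a rough illustration of how an acoustic feature is scored against a GMM with K = 32 normal distributions, the per-distribution output probabilities can be sketched as below. All parameter values are hypothetical; diagonal covariances are assumed, as is common for log mel features.

```python
import numpy as np

rng = np.random.default_rng(0)
K, L = 32, 12                       # number of distributions, feature dimension

# Hypothetical GMM parameters (in practice: the noise-adapted speech or non-speech GMM).
w = np.full(K, 1.0 / K)             # mixture weights w_{j,k}
mu = rng.normal(size=(K, L))        # means
var = np.full((K, L), 1.0)          # diagonal variances

def component_probs(o, w, mu, var):
    """Output probability w_k * N(o; mu_k, var_k) of each normal distribution k."""
    diff2 = (o - mu) ** 2 / var
    log_n = -0.5 * (np.sum(np.log(2 * np.pi * var), axis=1) + np.sum(diff2, axis=1))
    return w * np.exp(log_n)

o = rng.normal(size=L)              # a hypothetical acoustic feature O_t
b_k = component_probs(o, w, mu, var)   # per-distribution output probabilities
b = b_k.sum()                          # GMM output probability b_j(O_t)
```

Increasing K improves modeling accuracy at the cost of per-frame computation, which is exactly the trade-off the invention addresses by evaluating only the required distributions.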
- the speech / non-speech state probability ratio calculation unit 95 calculates a speech / non-speech state probability ratio based on the state transition model of the speech state / non-speech state using those output probabilities (step S95).
- the speech signal interval estimation unit 96 compares the speech/non-speech state probability ratio with a threshold to determine whether the acoustic signal of each frame is in a speech state or a non-speech state, and, for example, cuts out and outputs the speech-state frames as the interval estimation information D_ES (step S96).
- the speech signal interval estimation unit 96 may output a signal indicating the speech-state and non-speech-state intervals of the acoustic signal, alone or together with the input acoustic signal, as the speech signal interval information; it may set the amplitude of the non-speech intervals of the acoustic signal to 0 based on that signal before output; or it may remove the non-speech intervals (together with timing information) before output. That is, the interval estimation information generation unit 9, consisting of the speech/non-speech state probability ratio calculation unit 95 and the speech signal interval estimation unit 96, generates and outputs information about the speech signal intervals (speech signal interval information).
- control unit 20 controls the operation of each unit of the speech signal section estimation apparatus 100.
- the acoustic signal analysis unit 10 calculates the output probability of only the required normal distribution. Then, for example, only the acoustic signal of the frame determined as the voice state based on the output probability is output as the section estimation information D ES . Therefore, since speech segment detection is performed using only a necessary probability model, the processing can be speeded up.
- the probability model parameter estimation/probability calculation unit 11 will now be described in more detail.
- the probability model parameter estimation / probability calculation unit 11 includes a frame determination processing unit 110, an initial noise probability model estimation processing unit 111, a parameter prediction processing unit 112, a parameter update processing unit 113, and a probability model parameter generation estimation processing unit 114.
- the frame determination processing unit 110 receives the acoustic feature O_t from the acoustic feature extraction unit 91 and stores it in the initial noise probability model estimation buffer 930 (step S930).
- the frame determination processing unit 110 instructs the parameter prediction processing unit 112 to read the noise probability model parameter estimates of the previous frame, N̂_t-1,l and σ̂_N,t-1,l, from the noise probability model estimation buffer 931 (step S931).
- the parameter prediction processing unit 112 predicts the noise probability model parameters of the current frame t from the initial noise probability model parameters N^init_l, σ^init_N,l, or from the previous frame's estimates N̂_t-1,l and σ̂_N,t-1,l, by the random walk process shown in equations (4) and (5) (step S112).
- N^pred_t,l and σ^pred_N,t,l are the predicted values of the noise probability model parameters at frame t, and ε is a small value such as 0.0010, for example.
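A random walk prediction of this kind is typically the identity on the mean with a small inflation of the variance. The following is a sketch under that assumption (equations (4) and (5) are not reproduced in this text, so the exact form here is illustrative):

```python
import numpy as np

def predict_random_walk(n_prev, sigma_prev, eps=0.0010):
    """Random-walk prediction of the noise model parameters for frame t:
    the mean estimate is carried over, and the variance grows by a small
    constant eps to reflect uncertainty about slow noise changes."""
    n_pred = n_prev.copy()            # N^pred_{t,l} = N-hat_{t-1,l}
    sigma_pred = sigma_prev + eps     # sigma^pred_{N,t,l} = sigma-hat_{N,t-1,l} + eps
    return n_pred, sigma_pred

n_prev = np.zeros(12)                 # hypothetical previous-frame estimates
sigma_prev = np.full(12, 0.1)
n_pred, sigma_pred = predict_random_walk(n_prev, sigma_prev)
```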
- the parameter update processing unit 113 uses the noise probability model parameters N^pred_t,l and σ^pred_N,t,l predicted for the current frame t, the acoustic feature O_t, and the parameters of each GMM in the GMM storage unit 94 to update N^pred_t,l and σ^pred_N,t,l (step S113).
- the update process is performed by applying the nonlinear Kalman filter shown in equations (8) to (13) for each frame.
- Equation (12) and (13) are the parameters of the normal distribution updated here.
- This nonlinear Kalman filter is a conventional technique.
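Equations (8) to (13) are not reproduced in this text, so as a stand-in the following sketches one generic scalar Kalman-style measurement update, which is the building block such a nonlinear (linearized) filter applies per frame and per distribution. The observation gradient h and observation variance here are hypothetical placeholders for the linearized mismatch model.

```python
def kalman_update(n_pred, sigma_pred, o, h, sigma_obs):
    """One scalar Kalman-style measurement update (simplified stand-in for
    equations (8)-(13)): h is the linearized observation gradient and
    sigma_obs the observation variance contributed by the GMM component."""
    innovation = o - h * n_pred             # prediction residual
    s = h * sigma_pred * h + sigma_obs      # innovation variance
    gain = sigma_pred * h / s               # Kalman gain
    n_upd = n_pred + gain * innovation      # updated mean
    sigma_upd = (1.0 - gain * h) * sigma_pred   # updated variance
    return n_upd, sigma_upd

n_upd, sigma_upd = kalman_update(n_pred=0.0, sigma_pred=1.0, o=1.0, h=1.0, sigma_obs=1.0)
```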
- the probability model parameter generation estimation processing unit 114 takes as input the normal distribution parameters μ_S,j,k,l and Σ_S,j,k,l stored in the GMM storage unit 94 and the normal distribution parameters N̂_t,j,k,l and σ̂_N,t,j,k,l updated by the parameter update processing unit 113, and generates a non-speech GMM (noise + silence) and a speech GMM (noise + clean speech) adapted to the noise environment at frame t (step S114).
- the non-voice GMM and the voice GMM are obtained by the following equations.
- ⁇ is the mean and ⁇ is the variance.
- the output probability calculation processing unit 115 obtains the output probability of the acoustic feature O_t under each of the non-speech GMM and the speech GMM generated by the probability model parameter generation estimation processing unit 114, using the following equations.
- the output probability b_1,j,k(O_t) of each normal distribution k is calculated by equation (17). The subscript 1 of the symbol b distinguishes it from the output probability of the second acoustic signal analysis unit in Embodiment 2, described later.
- the output probability calculation processing unit 115 normalizes the output probability of each normal distribution k by the output probability b_1,j(O_t) of its GMM to obtain w_O,t,j,k, calculated by equation (19), and outputs it.
- the probability weight calculation processing unit 116 then weights the output probabilities b_1,j(O_t) of the non-speech GMM and the speech GMM (step S116).
- FIG. 6 shows a processing flow of the probability weight calculation processing unit 116.
- FIG. 7 illustrates a method for obtaining the normal distribution index SortIdx t, j, k ′ after the rearrangement.
- FIG. 7A shows a normalized output probability w O, t, j, k before sorting and an index k of the normal distribution before sorting.
- FIG. 7B shows the post-sort normalized output probability w Sort, t, j, k ′ and the corresponding distribution index SortIdx t, j, k ′ after sorting in descending order. In this way, the normal distributions are arranged in descending order of normalized output probabilities (step S1160).
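The normalization of equation (19) and the descending sort of FIG. 7 can be sketched as follows, with hypothetical probability values; `sort_idx` plays the role of SortIdx_t,j,k' and `w_sort` that of w_Sort,t,j,k'.

```python
import numpy as np

b_k = np.array([0.05, 0.40, 0.25, 0.30])   # hypothetical per-distribution output probabilities
w_O = b_k / b_k.sum()                      # normalized output probabilities w_{O,t,j,k} (eq. 19)

sort_idx = np.argsort(w_O)[::-1]           # SortIdx_{t,j,k'}: distribution indices, descending
w_sort = w_O[sort_idx]                     # w_{Sort,t,j,k'}: sorted normalized probabilities
```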
- the sorted output probability w_Sort,t,0,k' of the non-speech GMM changes only gradually with k'; the total range of change across the normal distributions k is small.
- the horizontal axis of FIG. 8A is the index k' of the normal distribution, and the vertical axis is the sorted output probability w_Sort,t,0,k'. As described above, the curve declines gently as the index k' increases.
- the characteristic of this change in output probability is expressed using the kurtosis (a fourth-order statistic), a parameter representing the peakedness of the curve.
- the kurtosis Kurt t, j of the output probability w Sort, t, j, k ′ after sorting can be calculated by Expression (20).
- equation (21) represents the average value of the output probabilities w Sort, t, j, k ′ after sorting of all normal distributions, but this value is the same as the average value before sorting.
- the numerator in equation (20) is the sum, over all output probabilities, of the fourth power of the difference between the sorted output probability w_Sort,t,j,k' and the mean; this value is also the same as the corresponding fourth-power sum before sorting.
- the mean of the sum of squares represented by the equation (22), that is, the variance is the same as the corresponding value before sorting.
- the kurtosis Kurt t, j obtained by the equation (20) represents the degree of dispersion of output probabilities of all normal distributions before and after sorting.
- the degree of dispersion of the output probabilities need not be limited to the definition of equation (20); various definitions based on power sums of the output probabilities and the mean are possible. Since the probability weight calculation processing unit 116 must place a large weight on the GMM whose curve is gentle, i.e. whose kurtosis is small, it computes, in the weight normalization process, the probability weight w_Kurt,t,j obtained by normalizing the inverse of the kurtosis Kurt_t,j, as shown in equation (23) (step S1162).
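The kurtosis-based weighting can be sketched as below, with hypothetical sorted probabilities for the non-speech (j=0) and speech (j=1) GMMs. The exact moment normalization of equation (20) is assumed here to be the fourth central moment divided by the squared variance, the standard definition.

```python
import numpy as np

def kurtosis(w):
    """Kurtosis of the sorted output probabilities (cf. equation (20)):
    fourth central moment divided by the squared variance."""
    mean = w.mean()
    var = np.mean((w - mean) ** 2)
    return np.mean((w - mean) ** 4) / (var ** 2)

# Hypothetical sorted probabilities of the non-speech and speech GMMs.
w0 = np.array([0.30, 0.28, 0.22, 0.20])   # gentle decline -> low kurtosis
w1 = np.array([0.90, 0.05, 0.03, 0.02])   # sharp peak    -> high kurtosis
kurt = np.array([kurtosis(w0), kurtosis(w1)])

# Cf. equation (23): normalize the inverse kurtosis so that the gently
# declining (less peaked) GMM receives the larger weight.
w_kurt = (1.0 / kurt) / np.sum(1.0 / kurt)
```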
- the probability weight calculation processing unit 116 outputs the probability weight w_Kurt,t,j, the sorted output probabilities w_Sort,t,j,k', and the corresponding normal distribution indices SortIdx_t,j,k' to the necessary distribution determination processing unit 117.
- the necessary distribution determination processing unit 117 removes a normal distribution having a small value of the sorted output probability w Sort, t, j, k ′ and extracts only a normal distribution having a sufficiently large value.
- the processing flow is shown in FIG. 9. First, the sorted output probabilities w_Sort,t,j,k', arranged in descending order, are added sequentially to obtain a cumulative value (step S1170). Next, the number R_t,j of distributions at which the cumulative value first reaches a predetermined value X (0 < X < 1) is obtained by equation (24).
- for example, with X = 0.9, the correspondence distribution index at which the cumulative value of the sorted output probabilities w_Sort,t,j,k' reaches 0.9 is determined (step S1171).
- if w_Sort,t,j,1 + w_Sort,t,j,2 + w_Sort,t,j,3 ≥ X = 0.9, then R_t,j = 3 and the corresponding distribution indices SortIdx_t,j,1 to SortIdx_t,j,3 are selected.
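This cumulative selection of the top distributions (steps S1170-S1171) and the mixture weight renormalization of equation (25) can be sketched as follows, with hypothetical values:

```python
import numpy as np

def select_required(w_sort, sort_idx, X=0.9):
    """Keep only the top distributions whose cumulative sorted output
    probability first reaches X (cf. equation (24)); the rest are removed."""
    cum = np.cumsum(w_sort)
    r = int(np.searchsorted(cum, X) + 1)    # R_{t,j}: number of kept distributions
    return sort_idx[:r], r

w_sort = np.array([0.5, 0.3, 0.15, 0.05])   # hypothetical sorted probabilities
sort_idx = np.array([2, 0, 3, 1])           # hypothetical SortIdx_{t,j,k'}
kept, r = select_required(w_sort, sort_idx)

# Cf. equation (25): renormalize the mixture weights of the kept distributions.
w_mix = np.full(4, 0.25)
w_renorm = w_mix[kept] / w_mix[kept].sum()
```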
- using only the selected normal distributions SortIdx_t,j,1 to SortIdx_t,j,R_t,j, the output probabilities b_1,j(O_t) of the non-speech GMM and the speech GMM are recalculated.
- the mixture weights w_j,k (k ∈ SortIdx_t,j,k'), which are GMM parameters, are renormalized by equation (25).
- in step S1173, the recalculated output probability b_1,j(O_t) is weighted by the probability weight w_Kurt,t,j according to equation (28).
- the first weighted average processing unit 118 takes the normal distribution parameters N̂_t,j,k,l and σ̂_N,t,j,k,l updated by the parameter update processing unit 113 and, by weighted averaging with the sorted output probabilities w_Sort,t,j,k' obtained by the probability weight calculation processing unit 116, obtains the noise parameter estimates N̂_t,j,l and σ̂_N,t,j,l corresponding to the non-speech GMM and the speech GMM.
- the weighted average is calculated by the following formula.
- the noise parameter estimates N̂_t,j,l and σ̂_N,t,j,l obtained by the first weighted average processing unit 118 are input to the second weighted average processing unit 119, where weighted averages are taken according to equations (31) and (32), respectively, to obtain the noise parameter estimates N̂_t,l and σ̂_N,t,l at frame t, which are used for estimating the noise parameters of the next frame.
- the noise parameter estimation results ⁇ N t, l and ⁇ ⁇ N, t, l obtained by the second weighted average processing unit 119 are stored in the noise probability model estimation buffer 931.
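The weighted averaging performed by units 118 and 119 reduces per-distribution estimates to a single per-feature estimate. A minimal sketch, with hypothetical parameter values and weights (the actual weights are the sorted output probabilities of the kept distributions):

```python
import numpy as np

def weighted_average(params, weights):
    """Weighted average of per-distribution noise parameter estimates,
    with the weights normalized to sum to one."""
    weights = np.asarray(weights) / np.sum(weights)
    return weights @ params                  # sum_k w_k * param_k, per feature dim

# Hypothetical per-distribution noise mean estimates (3 distributions, 2 feature dims).
n_upd = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])
w = [0.5, 0.3, 0.2]
n_bar = weighted_average(n_upd, w)           # averaged estimate, e.g. N-hat_{t,j,l}
```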
- the probability model parameter estimation/probability calculation unit 11 performs the processing described above and outputs the speech/non-speech probabilities at frame t, b_w,1,0(O_t) and b_w,1,1(O_t), as the output parameters of the acoustic signal analysis unit 10 to the speech/non-speech state probability ratio calculation unit 95.
- FIG. 10 shows a functional configuration example of the voice / non-voice state probability ratio calculation unit 95.
- the voice / non-voice state probability ratio calculation unit 95 includes a probability calculation unit 950 and a parameter storage unit 951.
- the speech/non-speech state probability ratio calculation unit 95 takes the speech/non-speech probabilities b_w,1,0(O_t) and b_w,1,1(O_t) as input and calculates the speech-state/non-speech-state probability ratio based on the state transition model of the speech and non-speech states, represented by the finite state machine in FIG. 11.
- the finite state machine is a speech-state/non-speech-state transition model consisting of a non-speech state H_0, a speech state H_1, and state transition probabilities a_i,j (a_0,0 to a_1,1).
- i is the state number of the state transition source
- j is the state number of the state transition destination.
- the parameter storage unit 951 includes a probability ratio calculation buffer 951a and a state transition probability table 951b.
- the state transition probability table 951b holds the values of the state transition probabilities a_0,0 to a_1,1 between the non-speech state H_0 and the speech state H_1.
- State number 0 indicates a non-speech state
- state number 1 indicates a speech state
- probability calculating section 950 calculates the ratio L (t) between the speech state probability and the non-speech state probability using equation (33).
- Expression (35) is developed as the following expression by a recursive expression (first-order Markov process) considering the state of the past frame.
- the processing flow of the voice / non-voice state probability ratio calculation unit 95 is shown in FIG.
- the forward probability ⁇ t, j is calculated according to this operation flow.
- the probability calculation unit 950 extracts the state transition probability a i, j from the state transition probability table 951b, and calculates the forward probability ⁇ t, j of the frame t according to the equation (37) (step S951). Then, the probability calculation unit 950 further calculates the probability ratio L (t) using the equation (38), and stores the forward probability ⁇ t, j in the probability ratio calculation buffer 951a (step S952).
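The first-order-Markov forward recursion and the probability ratio can be sketched as follows. The transition probabilities and per-frame output probabilities below are hypothetical; the per-step rescaling is a standard numerical safeguard, not necessarily part of equations (37)-(38).

```python
import numpy as np

def forward_step(alpha_prev, a, b):
    """One step of the forward recursion (cf. equation (37)):
    alpha_t[j] = b[j] * sum_i alpha_{t-1}[i] * a[i, j]."""
    return b * (alpha_prev @ a)

# Hypothetical state transition probabilities a_{i,j} (state 0: non-speech, 1: speech).
a = np.array([[0.9, 0.1],
              [0.1, 0.9]])
alpha = np.array([0.5, 0.5])                  # initial forward probabilities
for b_t in [np.array([0.2, 0.8]), np.array([0.1, 0.9])]:   # per-frame output probabilities
    alpha = forward_step(alpha, a, b_t)
    alpha /= alpha.sum()                      # rescale to avoid underflow

L_t = alpha[1] / alpha[0]                     # probability ratio L(t) (cf. equation (38))
```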
- FIG. 13 shows a functional configuration example of the audio signal section estimation unit 96.
- the speech signal interval estimation unit 96 comprises a threshold processing unit 960 and a speech signal interval shaping unit 961, and, taking the speech-state/non-speech-state probability ratio L(t) as input, determines whether frame t of the acoustic signal A_t belongs to the speech state or the non-speech state.
- if L(t) is equal to or greater than a threshold TH, the threshold processing unit 960 determines that frame t belongs to the speech state and outputs 1; if it is less than TH, it determines that frame t belongs to the non-speech state and outputs 0.
- the value of the threshold TH may be determined as a fixed value in advance or may be set adaptively according to the characteristics of the acoustic signal.
- the voice signal section shaping unit 961 performs error correction by performing shaping processing on the voice section estimation result obtained by the threshold processing unit 960.
- the error correction is determined as a voice section when a frame regarded as voice by the threshold processing unit 960 continues for a predetermined number of frames or more, for example, five or more frames.
- a frame that is regarded as non-speech is determined as a non-speech segment if it continues for a predetermined number or more.
- the predetermined number of frames may be made configurable, for example through variables such as a duration-frame count for detecting speech intervals and an N-duration-frame count for detecting non-speech intervals.
- when a short non-speech interval is detected inside a speech interval, the corresponding interval may be regarded as speech if the duration of the non-speech interval is at most a predetermined number of pause frames.
- by providing the speech signal interval shaping unit 961, speech and non-speech intervals spanning only a few frames are not generated, which stabilizes the interval detection.
- a signal representing the speech segment and the non-speech segment detected in this way is output as segment estimation information D ES .
- If necessary, the detected speech sections themselves may be output as the section estimation information D ES .
- Alternatively, the acoustic signal A t in which the amplitude of every sample in each detected non-speech section is set to 0 may be output as the section estimation information D ES , or the detected speech sections may be extracted from the acoustic signal and output as the section estimation information D ES .
- The processing by the speech signal section shaping unit 961 may also be omitted, in which case the estimation result of the threshold processing unit 960 is output directly as D ES .
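The thresholding and shaping steps above can be sketched as follows. This is an illustrative reading of units 960 and 961: the threshold, the speech-duration count, and the pause-frame count are example values chosen here, not values fixed by the patent.

```python
def estimate_sections(ratios, th=1.0, min_speech=5, max_pause=8):
    """Threshold the probability ratios L(t), then shape the result.

    ratios: per-frame speech/non-speech state probability ratios L(t).
    th, min_speech, max_pause are illustrative parameter values.
    """
    # Threshold processing unit 960: 1 = speech, 0 = non-speech.
    raw = [1 if r >= th else 0 for r in ratios]

    # Shaping unit 961, step 1: keep only speech runs of >= min_speech frames.
    shaped = [0] * len(raw)
    i = 0
    while i < len(raw):
        if raw[i] == 1:
            j = i
            while j < len(raw) and raw[j] == 1:
                j += 1
            if j - i >= min_speech:
                for k in range(i, j):
                    shaped[k] = 1
            i = j
        else:
            i += 1

    # Step 2: bridge short non-speech gaps (<= max_pause frames) that
    # lie inside a speech section.
    i = 0
    while i < len(shaped):
        if shaped[i] == 0:
            j = i
            while j < len(shaped) and shaped[j] == 0:
                j += 1
            inside = i > 0 and j < len(shaped)  # gap bounded by speech
            if inside and j - i <= max_pause:
                for k in range(i, j):
                    shaped[k] = 1
            i = j
        else:
            i += 1
    return shaped
```

With these parameters, a 2-frame dip between two long speech runs is bridged, while a trailing non-speech tail and isolated short speech bursts are suppressed.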
- FIG. 14 shows a functional configuration example of the speech signal section estimation device 200 of the present invention.
- Compared with the speech signal section estimation device 100, the speech signal section estimation device 200 additionally includes a signal averaging unit 50 that averages the acoustic digital signals A t,ch of a plurality of channels for each frame, and a second acoustic signal analysis unit 60 that obtains a speech probability and a non-speech probability using the periodic component power and the aperiodic component power.
- It further differs in that the speech / non-speech state probability ratio calculation unit 95′ of the section estimation information generation unit 9 calculates the speech state / non-speech state probability ratio L (t) using the output of the second acoustic signal analysis unit 60 as well. The operation of these differing parts is described below.
- The signal averaging unit 50 first cuts out the acoustic signal inputs of a plurality of channels as frames while moving the start point along the time axis by a fixed time width. For example, 200 sample points (25 ms) of the acoustic signal A t,ch sampled at a sampling frequency of 8 kHz are cut out for each channel while moving the start point by 80 sample points (10 ms). For the cut-out, a Hamming window w (n) such as that of the following equation (39) is used (step S50).
- Len represents the number of sample points in the cut-out frame waveform; in this example Len = 200.
- The acoustic signals A t,ch,n are then averaged over the channels for each corresponding sample n according to equation (40), and an averaged signal A t,n , which is a monaural signal, is output (step S51).
- The signal averaging process (step S51) may be omitted, for example when the input acoustic signal has only one channel.
- Including the signal averaging unit 50 significantly reduces the amount of memory required when processing multi-channel input acoustic signals.
- Instead of averaging the input acoustic signals themselves as in equation (40), the power spectrum of the input acoustic signal of each channel may be computed with a Fourier transform and the power spectra averaged over the channels, and the average power spectrum may be output.
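A minimal sketch of the frame cut-out and channel averaging (steps S50 and S51), assuming equation (39) is the standard Hamming window and equation (40) is a plain per-sample mean over channels; both are assumptions, since the equations are not reproduced in the text above.

```python
import math

def hamming(length):
    # Standard Hamming window; assumed form of equation (39).
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def frame_and_average(channels, frame_len=200, shift=80):
    """Cut each channel into windowed frames and average over channels.

    channels: list of equal-length sample lists, one per channel.
    Returns one monaural (channel-averaged) windowed frame per start point.
    """
    w = hamming(frame_len)
    n_ch = len(channels)
    n_samples = len(channels[0])
    frames = []
    for start in range(0, n_samples - frame_len + 1, shift):
        # Average the channels sample by sample, then apply the window.
        avg = [sum(ch[start + n] for ch in channels) * w[n] / n_ch
               for n in range(frame_len)]
        frames.append(avg)
    return frames
```

With frame_len=200 and shift=80 this reproduces the 25 ms / 10 ms framing at 8 kHz described in the text.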
- FIG. 16 shows a functional configuration example of the second acoustic signal analysis unit 60.
- the operation flow is shown in FIG.
- The second acoustic signal analysis unit 60 includes a discrete Fourier transform unit 61, a power calculation unit 62, a fundamental frequency estimation unit 63, a periodic component power calculation unit 64, a subtraction unit 65, a division unit 66, and a probability calculation unit 67.
- The discrete Fourier transform unit 61 applies a discrete Fourier transform to the averaged signal A t,n to convert it from a time-domain signal into a frequency-domain frequency spectrum (step S61).
- The frequency spectrum X t (k) of the averaged signal A t,n is obtained by equation (41).
- Here k denotes one of M discrete points dividing the sampling frequency equally; M is, for example, 256.
- the power calculation unit 62 calculates the average power ⁇ t of the average signal At , n from the frequency spectrum X t (k) output from the discrete Fourier transform unit 61 according to the equation (42) (step S62).
- The fundamental frequency estimation unit 63 takes as input the average power ρ t output from the power calculation unit 62 and the frequency spectrum X t (k) output from the discrete Fourier transform unit 61, and estimates the fundamental frequency f0 t of the averaged signal A t,n according to equation (43) (step S63).
- f0 t is the bin number of the frequency spectrum corresponding to the estimated fundamental frequency.
- argmax (*) is a function that outputs the g maximizing the expression inside (*).
- v t denotes the integer part of M / g.
- The g at which c t (g), a coefficient of the autocorrelation function, becomes maximum is searched for within a fixed range of g, for example 16 ≤ g ≤ 160, corresponding to 50 Hz to 500 Hz at a sampling frequency of 8 kHz.
- The resulting g represents the period length of the most dominant periodic component of the input signal within the search range.
- If the input signal is a single perfectly periodic signal, for example a sine wave, the obtained g takes the value corresponding to its period length.
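Equation (43) itself is not reproduced above, but the described search — finding the g in 16 ≤ g ≤ 160 that maximizes an autocorrelation-style coefficient c t (g) — can be sketched with a time-domain normalized autocorrelation as a stand-in (the patent's c t (g) is computed via the spectrum, so this is an assumed equivalent, not the exact formula):

```python
def estimate_period(frame, g_min=16, g_max=160):
    """Return the lag g in [g_min, g_max] with the largest normalized
    autocorrelation; a time-domain stand-in for maximizing c_t(g)."""
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return g_min  # degenerate all-zero frame
    best_g, best_c = g_min, float("-inf")
    for g in range(g_min, g_max + 1):
        # Correlation of the frame with itself shifted by g samples.
        c = sum(frame[n] * frame[n - g] for n in range(g, len(frame)))
        c /= energy  # normalize so values are comparable across lags
        if c > best_c:
            best_g, best_c = g, c
    return best_g
```

For a perfectly periodic input (a pulse train or sine wave of period 40 samples, i.e. 200 Hz at 8 kHz), the maximum lands on the true period length, matching the behavior stated in the text.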
- The periodic component power calculation unit 64 takes as input the frequency spectrum X t (k) output from the discrete Fourier transform unit 61, the average power ρ t of the averaged signal A t,n output from the power calculation unit 62, and the fundamental frequency f0 t output from the fundamental frequency estimation unit 63, and estimates the power ^ρ p t of the periodic component of the averaged signal A t,n by equation (45) (step S64).
- f0 t is the bin number of the frequency spectrum corresponding to the estimated fundamental frequency
- v t is a function representing the integer part of M / g.
- The power ^ρ p t of the periodic component can also be estimated without using the frequency spectrum.
- The ^ρ p t obtained in this way may be used as the output of the periodic component power calculation unit 64.
- The subtraction unit 65 subtracts the periodic component power ^ρ p t output from the periodic component power calculation unit 64 from the power ρ t output from the power calculation unit 62 according to equation (48), thereby estimating the power ^ρ a t of the aperiodic component other than the periodic component (step S65).
- In the above description, the periodic component power ^ρ p t was determined first and the aperiodic component power ^ρ a t was then derived from it.
- Conversely, the aperiodic component power may be determined first, and the subtraction unit 65 may then determine the periodic component power ^ρ p t .
- The division unit 66 takes the ratio of the two powers by equation (51) and outputs it (step S66).
- The probability calculation unit 67 receives the ratio value output by the division unit 66 as input, and calculates the probabilities b 2,j (ρ t ) that the averaged signal belongs to the non-speech state and to the speech state (the speech / non-speech probabilities) by the following equation (step S67).
- C 0 and C 1 are the constant terms of the normal distributions, that is, normalizing coefficients chosen so that the integral of the exp term equals 1.
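The chain of steps S65 to S67 can be sketched as follows. This is a hedged illustration only: the subtraction and the ratio follow the text (equations (48) and (51)), but the variable fed to the normal densities (here the log power ratio) and the means and variances mu/sigma are illustrative assumptions, not the patent's unshown equations.

```python
import math

def speech_nonspeech_probs(rho, rho_p,
                           mu=(-1.0, 1.0), sigma=(1.0, 1.0)):
    """Sketch of steps S65-S67: aperiodic power by subtraction, power
    ratio, and two normal-density scores b_{2,j}.

    j = 0 is the non-speech state, j = 1 is the speech state.
    mu and sigma are illustrative parameters, not the patent's values.
    """
    rho_a = max(rho - rho_p, 1e-12)   # subtraction unit 65 (eq. (48))
    ratio = rho_p / rho_a             # division unit 66 (eq. (51))
    x = math.log(max(ratio, 1e-12))   # assumed operating variable
    b = []
    for j in range(2):
        # C_j: the normalizing constant making the density integrate to 1.
        c = 1.0 / (math.sqrt(2 * math.pi) * sigma[j])
        b.append(c * math.exp(-((x - mu[j]) ** 2) / (2 * sigma[j] ** 2)))
    return b  # [b_{2,0}(rho_t), b_{2,1}(rho_t)]
```

A strongly periodic frame (periodic power dominating) scores higher on the speech density, and a weakly periodic frame on the non-speech density.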
- In the first embodiment, in order to calculate the speech / non-speech state probability ratio L (t) by equation (38), the forward probability α t,j was obtained by equation (37) from the speech / non-speech probability b w,1,j (O t ) output from the acoustic signal analysis unit 10.
- The speech / non-speech state probability ratio calculation unit 95′ of the second embodiment also calculates the speech / non-speech state probability ratio L (t) by equation (38).
- It differs from the speech / non-speech state probability ratio calculation unit 95 of the first embodiment in that the forward probability α t,j is computed by equation (54), using the value obtained by multiplying the speech / non-speech probability b w,1,j (O t ) by the speech / non-speech probability b 2,j (ρ t ) output from the second acoustic signal analysis unit 60. The other operations are the same.
- Since the speech signal section estimation device 200 of the second embodiment also takes into account speech / non-speech probabilities based on the periodic component power and the aperiodic component power, it can estimate speech signal sections more accurately than the speech signal section estimation device 100.
- FIG. 18 shows a functional configuration example of the speech signal section estimation device 300 with a noise suppression function of the present invention.
- The speech signal section estimation device 300 with a noise suppression function adds the configuration of a noise suppression unit 70 to the speech signal section estimation device 100, and outputs a signal in which the noise contained in the speech section signal is suppressed.
- The noise suppression unit 70 takes as input the acoustic signal A t , the speech / non-speech probability b w,1,j (O t ) output from the acoustic signal analysis unit 10, and the speech state / non-speech state probability ratio L (t) output from the speech / non-speech state probability ratio calculation unit 20, and suppresses the noise contained in the acoustic signal A t .
- FIG. 19 shows a functional configuration example of the noise suppression unit 70.
- The noise suppression unit 70 includes a silence filter coefficient generation unit 71, a speech filter coefficient generation unit 72, a filter coefficient integration unit 73, and a noise suppression filter application unit 74.
- The silence filter coefficient generation unit 71 and the speech filter coefficient generation unit 72 respectively generate filter coefficients Filter t,0,l and Filter t,1,l for extracting the silence component or the speech component, using the speech / non-speech GMM parameters adapted to the noise environment in frame t calculated in the acoustic signal analysis unit 10 and the corresponding distribution indices SortIdx t,j,1 through SortIdx t,j,(Rt,j) .
- The filter coefficient integration unit 73 takes the speech state / non-speech state probability ratio L (t) as input and integrates the filter coefficients Filter t,0,l and Filter t,1,l obtained by the silence filter coefficient generation unit 71 and the speech filter coefficient generation unit 72 into the final noise suppression filter coefficient Filter t,l by the following equation.
- L (t) is obtained by the following equation.
- the noise suppression filter application unit 74 converts the noise suppression filter coefficient Filter t, l obtained by the filter coefficient integration unit 73 into an impulse response filter coefficient filter t, n using the following equation.
- MelDCT m, n is a discrete cosine transform (DCT) coefficient weighted with a mel frequency.
- The calculation method of MelDCT m,n is described, for example, in the reference "ETSI ES 202 050 v.1.1.4, Speech processing, Transmission and Quality aspects (STQ), Advanced Distributed Speech Recognition; Front-end feature extraction algorithm; Compression algorithms," November 2005, pp. 18-19, "5.1.9 Mel IDCT," and is therefore not repeated here.
- the multi-channel noise suppression speech st , ch, n is obtained by convolving the impulse response filter t, n with the multi-channel input acoustic signal At , ch, n as shown in the following equation.
- The noise-suppressed signal s t,ch,n is the output signal of the noise suppression filter application unit 74.
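A sketch of the filter integration (unit 73) and the filter application (unit 74). The integration equation is not reproduced in the text above, so the mapping of L (t) to a 0-to-1 speech weight is an assumption; the convolution matches the description of applying the impulse response filter t,n to each input channel.

```python
def integrate_filters(filter_silence, filter_speech, L_t):
    """Blend silence and speech filter coefficients per band.

    A plausible reading of the integration in unit 73: the blend
    weight is derived from the probability ratio L(t); the patent's
    exact equation is not reproduced here.
    """
    w = L_t / (1.0 + L_t)  # map the ratio to a 0..1 speech weight (assumed)
    return [(1.0 - w) * f0 + w * f1
            for f0, f1 in zip(filter_silence, filter_speech)]

def apply_fir(signal, impulse_response):
    """Convolve the impulse-response filter with one input channel
    (noise suppression filter application unit 74)."""
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for n, h in enumerate(impulse_response):
            if t - n >= 0:
                acc += h * signal[t - n]
        out.append(acc)
    return out
```

When L (t) is large the integrated filter approaches the speech filter, and when L (t) is near zero it approaches the silence filter, which is the qualitative behavior the text describes.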
- A speech signal section estimation device with a noise suppression function can also be configured by adding the noise suppression unit 70 to the speech signal section estimation device 200.
- That is, although the speech signal section estimation device 300 with the noise suppression function described above was configured by adding the noise suppression unit 70 to the speech signal section estimation device 100, the noise suppression unit 70 may instead be added to the speech signal section estimation device 200.
- An experiment was conducted to evaluate the speech signal section detection performance of the speech signal section estimation device of this invention; the database used was CENSREC-1-C, which is designed for the evaluation of speech signal section detection.
- CENSREC-1-C contains two types of data: artificially created simulation data and real data recorded in real environments. In this experiment, the evaluation used the real data in order to investigate the effects of speech quality degradation, such as noise and utterance variation, in real environments.
- CENSREC-1-C is described in the reference "CENSREC-1-C: Construction of an evaluation framework for voice activity detection under noise, IPSJ SIG Technical Report, SLP-63-1, pp. 1-6, Oct. 2006."
- The real data of CENSREC-1-C were recorded in two environments: a student cafeteria and a street.
- The recordings were made under two signal-to-noise ratio (SNR) conditions for each environment: High SNR (noise level around 60 dB(A)) and Low SNR. Here (A) denotes the A-weighting measurement characteristic.
- Each file records the voice of a single speaker uttering strings of 1 to 12 consecutive digits 8 to 10 times at intervals of about 2 seconds, and four files per speaker were recorded in each environment. The number of speakers is 10 (5 of each gender), of which the data of 9 speakers, excluding one male, were used for evaluation.
- Each signal is a monaural signal discretely sampled at a sampling frequency of 8 kHz and a quantization bit number of 16 bits.
- the acoustic signal analysis process and the second acoustic signal analysis process were applied to this acoustic signal by setting the time length of one frame to 25 ms (200 sample points) and moving the start point of the frame every 10 ms (80 sample points).
- The parameter used in the parameter prediction processing unit 112 for obtaining the predicted noise probability model parameters of the current frame was set to 0.001.
- The threshold value X of the necessary distribution determination processing unit 117 was set to 0.9, and the state transition probabilities a 0,0 , a 0,1 , a 1,0 , a 1,1 were set to 0.8, 0.2, 0.9, and 0.1, respectively.
- The threshold TH of the threshold processing unit 960 (FIG. 13) was set adaptively, with its coefficient set to 0.0.
- The performance was evaluated using the section detection correct rate (Correct rate) and the section detection accuracy (Accuracy) given by the following formulas.
- Here N is the total number of utterance sections, N c is the number of correctly detected utterance sections, and N f is the number of falsely detected utterance sections.
- The section detection correct rate (Correct rate) measures how many of the utterance sections can be detected, while the section detection accuracy (Accuracy) measures how well the utterance sections can be detected without excess or deficiency.
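The two measures can be computed as follows, assuming the standard CENSREC-1-C definitions (Correct rate = 100·N c / N, Accuracy = 100·N c / (N c + N f )), since the formulas themselves are not reproduced above:

```python
def correct_rate(n_correct, n_total):
    """Section detection correct rate: fraction of the N utterance
    sections that were detected (assumed CENSREC-1-C definition)."""
    return 100.0 * n_correct / n_total

def accuracy(n_correct, n_false):
    """Section detection accuracy: correct detections among all
    detections, so false alarms lower the score."""
    return 100.0 * n_correct / (n_correct + n_false)
```

For example, detecting 9 of 10 utterance sections with 1 false alarm gives a Correct rate of 90.0% and an Accuracy of 90.0%.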
- FIG. 20 shows the evaluation results.
- A1 and A2 in FIG. 20 are the baselines defined in the CENSREC-1-C database, B1 and B2 are the results of the method disclosed in Non-Patent Document 2, and C1 and C2 are the results of this invention.
- While the average section detection correct rate (Correct rate) of Non-Patent Document 2 is 90.43%, this invention improves it by 1.6% to 92.03%.
- The average section detection accuracy (Accuracy) is improved by 4.72% compared with Non-Patent Document 2.
- FIG. 21B shows a signal waveform of the noise suppression output obtained by the speech signal interval estimation device of the present invention.
- FIG. 21A shows an acoustic input signal waveform.
- In this invention, the processing time is shortened by estimating the speech signal sections using only the probability models of the required distributions, and weighting the output probabilities with the probability weights w Kurt,t,j obtained by the probability weight calculation processing unit 116 emphasizes the difference between the non-speech GMM output probability and the speech GMM output probability, thereby improving the discrimination between non-speech and speech.
- In the embodiments above, the method of predicting the parameters of the current frame from the estimation result of the previous frame by a random walk process was described, but an autoregressive (linear prediction) method or the like may be used instead.
- In that case, the final noise model parameter estimation performance can be improved depending on the order of the autoregressive coefficients.
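The difference between the random-walk prediction used in the embodiments and the suggested autoregressive alternative can be sketched as follows; the AR coefficients are assumed to be given externally, and the random-walk variance term is omitted for brevity:

```python
def predict_random_walk(prev_estimate):
    """Random-walk prediction: the current frame's noise model
    parameter is predicted to equal the previous frame's estimate
    (uncertainty grows via a separate variance term, omitted here)."""
    return prev_estimate

def predict_ar(history, coeffs):
    """Autoregressive (linear prediction) alternative: predict from
    the last len(coeffs) estimates, most recent first coefficient.
    The text notes performance depends on the coefficient order."""
    return sum(a * x for a, x in zip(coeffs, reversed(history)))
```

Note that the random walk is the special case of an order-1 AR predictor with coefficient 1.0.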
- Further, another probability model such as an HMM (Hidden Markov Model) may be used as the probability model of the acoustic signal.
- When the processing means in the above apparatuses are realized by a computer, the processing contents of the functions that each apparatus should have are described by a program. By executing this program on the computer, the processing means of each apparatus are realized on the computer.
- the program describing the processing contents can be recorded on a computer-readable recording medium.
- As the computer-readable recording medium, any recording medium such as a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory may be used.
- Specifically, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disk; an MO (Magneto Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.
- This program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a recording device of a server computer and transferring it from the server computer to other computers via a network.
- Each processing means may be configured by executing a predetermined program on a computer, or at least part of the processing contents may be realized in hardware.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Telephonic Communication Services (AREA)
Abstract
Description
Next, a more specific functional configuration example of the probability model parameter estimation / probability calculation unit 11 will be shown and described in more detail.
FIGS. 3 and 4 show a more specific functional configuration example of the probability model parameter estimation / probability calculation unit 11 divided into two parts, and FIG. 5 shows its operation flow. The probability model parameter estimation / probability calculation unit 11 includes a frame determination processing unit 110, an initial noise probability model estimation processing unit 111, a parameter prediction processing unit 112, a parameter update processing unit 113, a probability model parameter generation / estimation processing unit 114, an output probability calculation unit 115, a probability weight calculation processing unit 116, a necessary distribution determination processing unit 117, a first weighted averaging processing unit 118, and a second weighted averaging processing unit 119.
The probability model parameter estimation / probability calculation unit 11 performs the processing described above and outputs the speech / non-speech probabilities b w,1,0 (O t ) and b w,1,1 (O t ) in frame t to the speech / non-speech state probability ratio calculation unit 95 as the output parameters of the acoustic signal analysis unit 10.
FIG. 10 shows a functional configuration example of the speech / non-speech state probability ratio calculation unit 95. The speech / non-speech state probability ratio calculation unit 95 includes a probability calculation unit 950 and a parameter storage unit 951.
FIG. 13 shows a functional configuration example of the speech signal section estimation unit 96. The speech signal section estimation unit 96 includes a threshold processing unit 960 and a speech signal section shaping unit 961, and takes the speech state / non-speech state probability ratio L (t) as input to determine whether frame t of the acoustic signal A t belongs to the speech state or to the non-speech state.
FIG. 15 shows the operation flow of the signal averaging unit 50. The signal averaging unit 50 first cuts out the acoustic signal inputs of a plurality of channels as frames of fixed time length while moving the start point along the time axis by a fixed time width. For example, 200 sample points (25 ms) of the acoustic signal A t,ch sampled at a sampling frequency of 8 kHz are cut out for each channel while moving the start point by 80 sample points (10 ms). For the cut-out, a Hamming window w (n) such as that of the following equation (39) is used (step S50).
FIG. 16 shows a functional configuration example of the second acoustic signal analysis unit 60, and FIG. 17 shows its operation flow. The second acoustic signal analysis unit 60 includes a discrete Fourier transform unit 61, a power calculation unit 62, a fundamental frequency estimation unit 63, a periodic component power calculation unit 64, a subtraction unit 65, a division unit 66, and a probability calculation unit 67.
The power calculation unit 62 calculates the average power ρ t of the averaged signal A t,n from the frequency spectrum X t (k) output from the discrete Fourier transform unit 61 by equation (42) (step S62).
The subtraction unit 65 subtracts the periodic component power ^ρ p t output from the periodic component power calculation unit 64 from the power ρ t output from the power calculation unit 62 by equation (48), thereby estimating the power ^ρ a t of the aperiodic component other than the periodic component (step S65).
In the first embodiment, in order to calculate the speech / non-speech state probability ratio L (t) by equation (38), the forward probability α t,j was obtained by equation (37) using the speech / non-speech probability b w,1,j (O t ) output from the acoustic signal analysis unit 10. The speech / non-speech state probability ratio calculation unit 95′ of the second embodiment differs from the speech / non-speech state probability ratio calculation unit 95 of the first embodiment in that, to calculate L (t) by equation (38), it computes the forward probability α t,j by equation (54) using the value obtained by multiplying the speech / non-speech probability b w,1,j (O t ) output from the acoustic signal analysis unit 10 by the speech / non-speech probability b 2,j (ρ t ) output from the second acoustic signal analysis unit 60. The other operations are the same.
An experiment was conducted to evaluate the speech signal section detection performance of the speech signal section estimation device of this invention. The experimental conditions were as follows. The database used was CENSREC-1-C, which is designed for the evaluation of speech signal section detection. CENSREC-1-C contains two types of data, artificially created simulation data and real data recorded in real environments; in this experiment, the evaluation used the real data in order to investigate the effects of speech quality degradation, such as noise and utterance variation, in real environments. CENSREC-1-C is described in the reference "CENSREC-1-C: Construction of an evaluation framework for voice activity detection under noise, IPSJ SIG Technical Report, SLP-63-1, pp. 1-6, Oct. 2006."
The performance was evaluated using the section detection correct rate (Correct rate) and the section detection accuracy (Accuracy) given by the following equations.
Claims (16)
- An audio signal section estimation apparatus comprising: an acoustic signal analysis unit that receives as input an acoustic digital signal containing a speech signal and a noise signal, generates, for each frame of the acoustic digital signal, a non-speech GMM and a speech GMM adapted to the noise environment using a pre-generated silence Gaussian mixture model (hereinafter a Gaussian mixture model is referred to as a GMM) and a clean-speech GMM, and calculates the non-speech output probabilities and speech output probabilities of the normal distributions remaining after excluding, from each GMM, one or more normal distributions having the smallest output probabilities; and
a section estimation information generation unit that calculates a speech / non-speech state probability ratio based on a speech-state / non-speech-state state transition model using the non-speech output probabilities and speech output probabilities, generates information on speech sections based on the calculated probability ratio, and outputs the information as speech section estimation information.
- The audio signal section estimation apparatus according to claim 1, wherein the acoustic signal analysis unit comprises:
an initial noise probability model estimation processing unit that estimates initial noise probability model parameters;
a parameter prediction processing unit that predicts the noise probability model parameters of the current frame from the estimation result of the noise probability model parameters of the previous frame by a random walk process;
a parameter update processing unit that receives the noise probability model parameters of the current frame as input and updates the parameters of all normal distributions contained in the silence GMM and the clean-speech GMM;
a probability model parameter generation / estimation processing unit that generates a non-speech GMM and a speech GMM adapted to the noise environment in the current frame using the updated normal distribution parameters and the parameters of the plurality of normal distributions of the silence GMM and the clean-speech GMM;
an output probability calculation processing unit that calculates the output probability of each normal distribution contained in the frame GMMs;
a probability weight calculation processing unit that parameterizes the degree of dispersion of the output probabilities of the normal distributions with a higher-order statistic and calculates probability weights for weighting the output probabilities of the normal distributions of the non-speech state and of the speech state;
a necessary distribution determination processing unit that removes normal distributions whose output probability values are minute and extracts only normal distributions having sufficiently large output probabilities;
a first weighted averaging processing unit that computes a weighted average of the noise probability model parameters of the current frame predicted by the parameter prediction processing unit, using the probability weights calculated by the probability weight calculation unit; and
a second weighted averaging processing unit that computes a weighted average of the noise probability model parameters averaged by the first weighted averaging processing unit, only over the normal distributions extracted by the necessary distribution determination processing unit.
- The audio signal section estimation apparatus according to claim 1, wherein the acoustic signal analysis unit includes
a probability weight calculation processing unit that calculates the degree of dispersion of the non-speech output probabilities and the speech output probabilities and calculates probability weights for correcting the non-speech output probabilities and the speech output probabilities so that the smaller the degree of dispersion, the larger the output probability of the corresponding normal distribution. - The audio signal section estimation apparatus according to claim 1, wherein the speech signal analysis unit includes a necessary distribution determination processing unit that sequentially computes cumulative sums of the output probabilities in descending order and determines the normal distributions whose output probabilities yield a cumulative sum exceeding a predetermined value to be the one or more normal distributions with the smallest output probabilities to be removed.
- The audio signal section estimation apparatus according to claim 1, further comprising:
a signal averaging unit that averages the acoustic digital signals of a plurality of channels for each frame; and
a second acoustic signal analysis unit that obtains a speech probability and a non-speech probability using the periodic component power and the aperiodic component power,
wherein the section estimation information generation unit multiplies the corresponding speech probabilities and non-speech probabilities output by the acoustic signal analysis unit and the second acoustic signal analysis unit, and calculates the speech / non-speech state probability ratio using the multiplication results. - The audio signal section estimation apparatus according to any one of claims 1 to 5, wherein the section estimation information generation unit comprises:
a speech / non-speech state probability ratio calculation unit that calculates the speech / non-speech state probability ratio; and
a speech signal section estimation unit that determines from the speech / non-speech state probability ratio whether the acoustic signal of the frame is in the speech state or the non-speech state, and generates the speech section estimation information based on the determination result. - The audio signal section estimation apparatus according to any one of claims 1 to 5, further comprising
a noise suppression unit that generates a noise suppression filter using, as input, the probability ratio output by the section estimation information generation unit and the output probabilities output by the acoustic signal analysis unit, and suppresses the noise of the acoustic digital signal. - An audio signal section estimation method comprising: an acoustic signal analysis step of receiving as input an acoustic digital signal containing a speech signal and a noise signal, generating, for each frame of the acoustic digital signal, probability models of a non-speech GMM and a speech GMM adapted to the noise environment using a pre-generated silence Gaussian mixture model (hereinafter a Gaussian mixture model is referred to as a GMM) and a clean-speech GMM, and calculating the non-speech output probabilities and speech output probabilities of the normal distributions remaining after excluding, from each GMM, one or more normal distributions having the smallest output probabilities; and
a section estimation information generation step of calculating a probability ratio based on a speech-state / non-speech-state state transition model using the non-speech output probabilities and speech output probabilities, generating information on speech sections based on the calculated probability ratio, and outputting the information as speech section estimation information.
- The audio signal section estimation method according to claim 8, wherein the acoustic signal analysis step comprises:
an initial noise probability model estimation processing step of estimating initial noise probability model parameters;
a parameter prediction processing step of predicting the noise probability model parameters of the current frame from the estimation result of the noise probability model parameters of the previous frame by a random walk process;
a parameter update processing step of receiving the noise probability model parameters of the current frame as input and updating the parameters of all normal distributions contained in the silence GMM and the clean-speech GMM;
a probability model parameter generation / estimation processing step of generating a non-speech GMM and a speech GMM adapted to the noise environment in the current frame using the updated normal distribution parameters and the parameters of the plurality of normal distributions of the silence GMM and the clean-speech GMM;
an output probability calculation processing step of calculating the output probability of each normal distribution contained in the frame GMMs;
a probability weight calculation step of parameterizing the degree of dispersion of the output probabilities of the normal distributions with a higher-order statistic and calculating probability weights for weighting the output probabilities of the normal distributions of the non-speech state and of the speech state;
a necessary distribution determination processing step of removing normal distributions whose output probability values are minute and extracting only normal distributions having sufficiently large output probabilities;
a first weighted averaging processing step of computing a weighted average of the noise probability model parameters of the current frame predicted by the parameter prediction processing unit, using the probability weights calculated by the probability weight calculation unit; and
a second weighted averaging processing step of computing a weighted average of the noise probability model parameters averaged by the first weighted averaging processing unit, only over the normal distributions extracted by the necessary distribution determination processing unit.
を含む。 - 請求項8に記載の音声信号区間推定方法において、上記音響信号分析過程は、
上記非音声出力確率及び音声出力確率の散らばり度合いを計算し、散らばり度合いが小さいほど、当該正規分布の出力確率が大きくなるように上記非音声出力確率及び音声出力確率を補正する過程を含む。 - 請求項8記載の音声信号区間推定方法において、上記音声信号分析過程は、上記出力確率の大きい順に順次累積和を算出し、所定値を越える累積和を与える出力確率の正規分布を除去すべき上記最も小さい出力確率の1つ以上の正規分布と決定する過程を含む。
- The audio signal section estimation method according to claim 8, further comprising:
a signal averaging step in which a signal averaging unit averages the acoustic digital signals of a plurality of channels for each frame; and
a second acoustic signal analysis step of obtaining a speech probability and a non-speech probability using the periodic component power and the aperiodic component power,
wherein the section estimation information generation step is a step of multiplying the corresponding speech probabilities and non-speech probabilities output by the acoustic signal analysis unit and the second acoustic signal analysis unit and calculating the speech / non-speech state probability ratio using the multiplication results. - The audio signal section estimation method according to any one of claims 8 to 12, wherein the section estimation information generation step includes a speech / non-speech state probability ratio calculation step of calculating a probability ratio based on a speech-state / non-speech-state state transition model using the output probabilities of the necessary distributions, and a speech signal section estimation step in which a speech signal section estimation unit determines from the probability ratio whether the acoustic signal of the frame is in the speech state or the non-speech state and generates the speech section estimation information based on the determination result.
- The audio signal section estimation method according to any one of claims 8 to 12, further comprising
a noise suppression step of generating a noise suppression filter using, as input, the probability ratio output by the section estimation information generation step and the output probabilities output by the acoustic signal analysis unit, and suppressing the noise of the acoustic digital signal. - A program for causing a computer to function as the apparatus described in claim 1.
- A recording medium on which a program for causing a computer to function as the apparatus described in claim 1 is recorded.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/384,917 US9208780B2 (en) | 2009-07-21 | 2010-07-15 | Audio signal section estimating apparatus, audio signal section estimating method, and recording medium |
CN201080032747.5A CN102473412B (zh) | 2009-07-21 | 2010-07-15 | 语音信号区间估计装置与方法 |
JP2011523623A JP5411936B2 (ja) | 2009-07-21 | 2010-07-15 | 音声信号区間推定装置と音声信号区間推定方法及びそのプログラムと記録媒体 |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2009-169788 | 2009-07-21 | ||
JP2009169788 | 2009-07-21 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2011010604A1 true WO2011010604A1 (ja) | 2011-01-27 |
Family
ID=43499077
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2010/061999 WO2011010604A1 (ja) | 2009-07-21 | 2010-07-15 | 音声信号区間推定装置と音声信号区間推定方法及びそのプログラムと記録媒体 |
Country Status (4)
Country | Link |
---|---|
US (1) | US9208780B2 (ja) |
JP (1) | JP5411936B2 (ja) |
CN (1) | CN102473412B (ja) |
WO (1) | WO2011010604A1 (ja) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2013007975A (ja) * | 2011-06-27 | 2013-01-10 | Nippon Telegr & Teleph Corp <Ntt> | 雑音抑圧装置、方法及びプログラム |
CN107134277A (zh) * | 2017-06-15 | 2017-09-05 | 深圳市潮流网络技术有限公司 | 一种基于gmm模型的语音激活检测方法 |
CN112967738A (zh) * | 2021-02-01 | 2021-06-15 | 腾讯音乐娱乐科技(深圳)有限公司 | 人声检测方法、装置及电子设备和计算机可读存储介质 |
Families Citing this family (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140100847A1 (en) * | 2011-07-05 | 2014-04-10 | Mitsubishi Electric Corporation | Voice recognition device and navigation device |
KR101247652B1 (ko) * | 2011-08-30 | 2013-04-01 | 광주과학기술원 | 잡음 제거 장치 및 방법 |
CN103903629B (zh) * | 2012-12-28 | 2017-02-15 | 联芯科技有限公司 | 基于隐马尔科夫链模型的噪声估计方法和装置 |
JP6169849B2 (ja) * | 2013-01-15 | 2017-07-26 | 本田技研工業株式会社 | 音響処理装置 |
US9886968B2 (en) * | 2013-03-04 | 2018-02-06 | Synaptics Incorporated | Robust speech boundary detection system and method |
CN103646649B (zh) * | 2013-12-30 | 2016-04-13 | 中国科学院自动化研究所 | 一种高效的语音检测方法 |
US10134425B1 (en) * | 2015-06-29 | 2018-11-20 | Amazon Technologies, Inc. | Direction-based speech endpointing |
JP6501259B2 (ja) * | 2015-08-04 | 2019-04-17 | 本田技研工業株式会社 | 音声処理装置及び音声処理方法 |
US9754607B2 (en) | 2015-08-26 | 2017-09-05 | Apple Inc. | Acoustic scene interpretation systems and related methods |
US9792907B2 (en) | 2015-11-24 | 2017-10-17 | Intel IP Corporation | Low resource key phrase detection for wake on voice |
CN108292508B (zh) * | 2015-12-02 | 2021-11-23 | 日本电信电话株式会社 | 空间相关矩阵估计装置、空间相关矩阵估计方法和记录介质 |
US10964329B2 (en) * | 2016-07-11 | 2021-03-30 | FTR Labs Pty Ltd | Method and system for automatically diarising a sound recording |
JP6703460B2 (ja) * | 2016-08-25 | 2020-06-03 | 本田技研工業株式会社 | 音声処理装置、音声処理方法及び音声処理プログラム |
US10650621B1 (en) | 2016-09-13 | 2020-05-12 | Iocurrents, Inc. | Interfacing with a vehicular controller area network |
CN106653047A (zh) * | 2016-12-16 | 2017-05-10 | 广州视源电子科技股份有限公司 | 一种音频数据的自动增益控制方法与装置 |
US10339935B2 (en) * | 2017-06-19 | 2019-07-02 | Intel Corporation | Context-aware enrollment for text independent speaker recognition |
CN107393559B (zh) * | 2017-07-14 | 2021-05-18 | 深圳永顺智信息科技有限公司 | 检校语音检测结果的方法及装置 |
CN108175436A (zh) * | 2017-12-28 | 2018-06-19 | 北京航空航天大学 | 一种肠鸣音智能自动识别方法 |
JP6725186B2 (ja) * | 2018-02-20 | 2020-07-15 | 三菱電機株式会社 | 学習装置、音声区間検出装置および音声区間検出方法 |
US11276390B2 (en) * | 2018-03-22 | 2022-03-15 | Casio Computer Co., Ltd. | Audio interval detection apparatus, method, and recording medium to eliminate a specified interval that does not represent speech based on a divided phoneme |
US10714122B2 (en) * | 2018-06-06 | 2020-07-14 | Intel Corporation | Speech classification of audio for wake on voice |
US20190044860A1 (en) * | 2018-06-18 | 2019-02-07 | Intel Corporation | Technologies for providing adaptive polling of packet queues |
US10650807B2 (en) | 2018-09-18 | 2020-05-12 | Intel Corporation | Method and system of neural network keyphrase detection |
US11955138B2 (en) * | 2019-03-15 | 2024-04-09 | Advanced Micro Devices, Inc. | Detecting voice regions in a non-stationary noisy environment |
US11127394B2 (en) | 2019-03-29 | 2021-09-21 | Intel Corporation | Method and system of high accuracy keyphrase detection for low resource devices |
CN110265012A (zh) * | 2019-06-19 | 2019-09-20 | 泉州师范学院 | 基于开源硬件可交互智能语音家居控制装置及控制方法 |
CN111722696B (zh) * | 2020-06-17 | 2021-11-05 | 思必驰科技股份有限公司 | 用于低功耗设备的语音数据处理方法和装置 |
CN112435691B (zh) * | 2020-10-12 | 2024-03-12 | 珠海亿智电子科技有限公司 | 在线语音端点检测后处理方法、装置、设备及存储介质 |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2001343984A (ja) * | 2000-05-30 | 2001-12-14 | Matsushita Electric Ind Co Ltd | 有音/無音判定装置、音声復号化装置及び音声復号化方法 |
JP2002132289A (ja) * | 2000-10-23 | 2002-05-09 | Seiko Epson Corp | 音声認識方法および音声認識処理プログラムを記録した記録媒体ならびに音声認識装置 |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5950158A (en) * | 1997-07-30 | 1999-09-07 | Nynex Science And Technology, Inc. | Methods and apparatus for decreasing the size of pattern recognition models by pruning low-scoring models from generated sets of models |
US6631348B1 (en) * | 2000-08-08 | 2003-10-07 | Intel Corporation | Dynamic speech recognition pattern switching for enhanced speech recognition accuracy |
US7305132B2 (en) * | 2003-11-19 | 2007-12-04 | Mitsubishi Electric Research Laboratories, Inc. | Classification in likelihood spaces |
US7725314B2 (en) * | 2004-02-16 | 2010-05-25 | Microsoft Corporation | Method and apparatus for constructing a speech filter using estimates of clean speech and noise |
US7509259B2 (en) * | 2004-12-21 | 2009-03-24 | Motorola, Inc. | Method of refining statistical pattern recognition models and statistical pattern recognizers |
GB2426166B (en) * | 2005-05-09 | 2007-10-17 | Toshiba Res Europ Ltd | Voice activity detection apparatus and method |
-
2010
- 2010-07-15 JP JP2011523623A patent/JP5411936B2/ja active Active
- 2010-07-15 US US13/384,917 patent/US9208780B2/en active Active
- 2010-07-15 WO PCT/JP2010/061999 patent/WO2011010604A1/ja active Application Filing
- 2010-07-15 CN CN201080032747.5A patent/CN102473412B/zh active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2001343984A (ja) * | 2000-05-30 | 2001-12-14 | Matsushita Electric Ind Co Ltd | 有音/無音判定装置、音声復号化装置及び音声復号化方法 |
JP2002132289A (ja) * | 2000-10-23 | 2002-05-09 | Seiko Epson Corp | 音声認識方法および音声認識処理プログラムを記録した記録媒体ならびに音声認識装置 |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2013007975A (ja) * | 2011-06-27 | 2013-01-10 | Nippon Telegr & Teleph Corp <Ntt> | 雑音抑圧装置、方法及びプログラム |
CN107134277A (zh) * | 2017-06-15 | 2017-09-05 | 深圳市潮流网络技术有限公司 | 一种基于gmm模型的语音激活检测方法 |
CN112967738A (zh) * | 2021-02-01 | 2021-06-15 | 腾讯音乐娱乐科技(深圳)有限公司 | 人声检测方法、装置及电子设备和计算机可读存储介质 |
Also Published As
Publication number | Publication date |
---|---|
JPWO2011010604A1 (ja) | 2012-12-27 |
US20120173234A1 (en) | 2012-07-05 |
JP5411936B2 (ja) | 2014-02-12 |
CN102473412B (zh) | 2014-06-11 |
CN102473412A (zh) | 2012-05-23 |
US9208780B2 (en) | 2015-12-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5411936B2 (ja) | 音声信号区間推定装置と音声信号区間推定方法及びそのプログラムと記録媒体 | |
JP4586577B2 (ja) | 外乱成分抑圧装置、コンピュータプログラム、及び音声認識システム | |
JP5949553B2 (ja) | 音声認識装置、音声認識方法、および音声認識プログラム | |
JP4856662B2 (ja) | 雑音除去装置、その方法、そのプログラム及び記録媒体 | |
JPWO2009078093A1 (ja) | 非音声区間検出方法及び非音声区間検出装置 | |
JP6004792B2 (ja) | 音響処理装置、音響処理方法、及び音響処理プログラム | |
US7552049B2 (en) | Noise adaptation system of speech model, noise adaptation method, and noise adaptation program for speech recognition | |
JP2006215564A (ja) | 自動音声認識システムにおける単語精度予測方法、及び装置 | |
JP2007279517A (ja) | 音源分離装置、音源分離装置用のプログラム及び音源分離方法 | |
JP5339426B2 (ja) | ケプストラムノイズ減算を用いた音声認識システム及び方法 | |
JP4673828B2 (ja) | 音声信号区間推定装置、その方法、そのプログラム及び記録媒体 | |
KR100784456B1 (ko) | Gmm을 이용한 음질향상 시스템 | |
JP5191500B2 (ja) | 雑音抑圧フィルタ算出方法と、その装置と、プログラム | |
JP4755555B2 (ja) | 音声信号区間推定方法、及びその装置とそのプログラムとその記憶媒体 | |
JP4413175B2 (ja) | 非定常雑音判別方法、その装置、そのプログラム及びその記録媒体 | |
JPH10133688A (ja) | 音声認識装置 | |
JP4691079B2 (ja) | 音声信号区間推定装置、方法、プログラムおよびこれを記録した記録媒体 | |
JP4690973B2 (ja) | 信号区間推定装置、方法、プログラム及びその記録媒体 | |
Sadeghi et al. | The effect of different acoustic noise on speech signal formant frequency location | |
JP5457999B2 (ja) | 雑音抑圧装置とその方法とプログラム | |
JP6653687B2 (ja) | 音響信号処理装置、方法及びプログラム | |
JP6599408B2 (ja) | 音響信号処理装置、方法及びプログラム | |
Ondusko et al. | Blind signal-to-noise ratio estimation of speech based on vector quantizer classifiers and decision level fusion | |
JP2019028301A (ja) | 音響信号処理装置、方法及びプログラム | |
JP4653673B2 (ja) | 信号判定装置、信号判定方法、信号判定プログラムおよび記録媒体 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WWE | Wipo information: entry into national phase |
Ref document number: 201080032747.5 Country of ref document: CN |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 10802224 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2011523623 Country of ref document: JP |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
WWE | Wipo information: entry into national phase |
Ref document number: 13384917 Country of ref document: US |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 10802224 Country of ref document: EP Kind code of ref document: A1 |