CN1716380A - Audio frequency splitting method for changing detection based on decision tree and speaking person - Google Patents


Info

Publication number
CN1716380A
CN1716380A CNA2005100508645A CN200510050864A
Authority
CN
China
Prior art keywords
audio
section
frame
zero
speaker
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2005100508645A
Other languages
Chinese (zh)
Other versions
CN100505040C (en)
Inventor
吴朝晖
赵民德
孟晓楠
李红
厉蒋
姜旭锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CNB2005100508645A priority Critical patent/CN100505040C/en
Publication of CN1716380A publication Critical patent/CN1716380A/en
Application granted granted Critical
Publication of CN100505040C publication Critical patent/CN100505040C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

This audio segmentation method based on a decision tree and speaker change detection first applies adaptive silence detection to find the silences in the audio and coarsely segments the audio signal at those silences; it then applies mutation detection to refine the segmentation and uses a decision tree to classify the resulting audio segments into speech and non-speech; finally it detects speaker change points within the speech segments to obtain the final segmentation result. The present invention performs speech detection by combining the two methods of silence detection and mutation detection and by adopting a speech/non-speech decision tree, which raises the accuracy of speech detection, and it performs speaker change detection only within speech segments, which saves computation time.

Description

Audio segmentation method based on decision tree and speaker change detection
Technical field
The present invention relates to signal processing and pattern recognition, and in particular to an audio segmentation method based on a decision tree and speaker change detection.
Background art
Speaker retrieval technology uses signal processing and pattern recognition methods to retrieve a specific speaker from a large collection of audio documents. Speaker retrieval must solve two problems: who is speaking, and when. The question of who is speaking is usually answered by speaker recognition technology; the question of when requires audio segmentation.
Commonly used segmentation methods include segmentation based on the Bayesian information criterion (BIC) and segmentation based on the KL2 distance. The BIC-based method decides whether to split by computing the Bayesian scores of two hypotheses: "the features of the two audio sections obey a single Gaussian distribution" and "the features of the two audio sections obey two separate Gaussian distributions". However, BIC is usually limited to splitting between speakers and lacks robustness to irregularly distributed features such as noise. In addition, the BIC method runs slowly, which hinders real-time processing.
The KL2-based segmentation method compares the KL2 distance of MFCC features against an empirical threshold to determine speaker changes. However, the speech sections used to compute the distance come from a sliding fixed-length window, which makes the distance values unreliable.
Summary of the invention
The present invention overcomes the above defects of the existing technology by providing an audio segmentation method based on a decision tree and speaker change detection. By detecting speech and speaker changes, it segments the audio into speech segments belonging to different people, for use in the audio segmentation stage of speaker retrieval.
The technical solution adopted by the present invention to solve the technical problem is as follows: first, adaptive silence detection finds the silences in the audio, and the audio is coarsely segmented at those silences; then mutation detection refines the segmentation, and a decision tree classifies the resulting audio fragments as speech or non-speech; finally, speaker change points are detected between speech fragments, and the final segmentation result is obtained from the speaker change points.
The technical solution can be further refined. The silence detection divides the audio into frames, computes the energy of each frame, and identifies silence by an adaptive energy threshold and a time threshold. The mutation detection locates change points by computing the distance between the distributions of energy and zero-crossing rate. The decision tree is a set of pre-trained decision rules; the section features of an audio fragment are tested against the corresponding rules in turn, and the final leaf of the tree determines the fragment type. The speaker change detection compares the distance between adjacent speech segments with an adaptive threshold to determine the speaker change points.
The beneficial effect of the present invention is that it combines the two methods of silence detection and mutation detection, and adopts a speech/non-speech decision tree for speech detection, exploiting the advantages of each to improve the accuracy of speech detection. Speaker change detection is then performed only between speech snippets, which saves computation time compared with clustering algorithms that generally must compute pairwise distances.
Description of drawings
Fig. 1 is the topology of the speech/non-speech classification decision tree of the present invention;
Fig. 2 is the flow chart of the method of the present invention.
Embodiment
The invention is described further below with reference to the drawings and an embodiment. This audio segmentation method based on a decision tree and speaker change detection is divided into six steps:
Step 1: Audio preprocessing
Audio preprocessing consists of four parts: sampling and quantization, zero-drift removal, pre-emphasis, and windowing.
1. Sampling and quantization
A) Filter the audio signal with a sharp anti-aliasing filter so that its Nyquist frequency F_N is 4 kHz;
B) Set the audio sampling rate F = 2F_N;
C) Sample the audio signal s_a(t) periodically to obtain the amplitude sequence of the digital audio signal s(n) = s_a(n/F);
D) Quantize s(n) by pulse code modulation (PCM) to obtain the quantized amplitude sequence s′(n).
2. Zero-drift removal
A) Compute the mean of the quantized amplitude sequence;
B) Subtract the mean from each amplitude to obtain the zero-mean amplitude sequence s″(n).
3. Pre-emphasis
A) Set the pre-emphasis factor α in the digital filter transfer function H(z) = 1 − αz⁻¹; α is taken as 1 or a value slightly smaller than 1;
B) Pass s″(n) through the digital filter to obtain an amplitude sequence s(n) whose high-, mid-, and low-frequency amplitudes are balanced.
4. Windowing
A) Compute the frame length N (32 ms) and frame shift T (10 ms) of the audio frames, in samples, satisfying
$$\frac{N}{F} = 0.032, \qquad \frac{T}{F} = 0.010$$
where F is the audio sampling rate in Hz;
B) With frame length N and frame shift T, divide s(n) into a series of audio frames F_m, each containing N audio signal samples;
C) Compute the Hamming window function:
$$\omega(n) = 0.54 - 0.46\cos\frac{2\pi n}{N-1}, \quad n = 0, 1, \ldots, N-1$$
D) Apply the Hamming window to each audio frame F_m:
$$\omega(n) \times F_m(n) \Rightarrow \{F'_m(n) \mid n = 0, 1, \ldots, N-1\}$$
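For illustration only, the preprocessing chain of this step can be sketched in Python as follows; the function name, the pre-emphasis default α = 0.97, and the use of numpy are assumptions of the example, not details fixed by the patent.

```python
# Minimal sketch of Step 1 (zero-drift removal, pre-emphasis, framing,
# Hamming windowing), assuming an input already low-pass filtered and
# sampled at F = 8000 Hz.
import numpy as np

def preprocess(samples, fs=8000, alpha=0.97):
    s = np.asarray(samples, dtype=np.float64)
    s = s - s.mean()                              # zero-drift removal
    s = np.append(s[0], s[1:] - alpha * s[:-1])   # pre-emphasis H(z) = 1 - alpha*z^-1
    N = int(0.032 * fs)                           # frame length: 32 ms
    T = int(0.010 * fs)                           # frame shift: 10 ms
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))  # Hamming
    n_frames = 1 + (len(s) - N) // T
    return np.stack([s[m * T:m * T + N] * window for m in range(n_frames)])

frames = preprocess(np.random.randn(16000))       # two seconds of test signal
print(frames.shape)                               # (197, 256)
```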
Step 2: Feature extraction
Feature extraction on each audio frame comprises the extraction of energy, zero-crossing rate, and Mel-frequency cepstral coefficients (MFCC).
1. Extraction of energy:
$$E = \sum_{n=1}^{N} s^2(n)$$
2. Extraction of the zero-crossing rate:
$$Zcr = \frac{1}{2(N-1)} \sum_{n=1}^{N-1} \left| \operatorname{sgn}\big(s(n+1)\big) - \operatorname{sgn}\big(s(n)\big) \right|$$
3. Extraction of the MFCC:
A) Set the order p of the Mel cepstral coefficients;
B) Apply the fast Fourier transform (FFT) to turn the time-domain signal s(n) into the frequency-domain signal X(k);
C) Compute the Mel-domain scale:
$$M_i = \frac{i}{p} \times 2595 \log_{10}\!\left(1 + \frac{8000/2.0}{700.0}\right), \quad i = 0, 1, 2, \ldots, p$$
D) Compute the corresponding frequency-domain scale:
$$f_i = 700 \times \left(e^{\frac{M_i \ln 10}{2595}} - 1\right), \quad i = 0, 1, 2, \ldots, p$$
E) Compute the logarithmic energy spectrum on each Mel-domain channel φ_j:
$$E_j = \sum_{k=0}^{K/2-1} \varphi_j(k)\,|X(k)|^2, \quad \text{where} \quad \sum_{k=0}^{K/2-1} \varphi_j(k) = 1;$$
F) Apply the discrete cosine transform (DCT).
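A compact sketch of this feature extraction is given below. It follows the formulas above for energy and zero-crossing rate; for the MFCC it assumes rectangular mel channels with unit total weight and order p = 12, since the patent does not fix the filter shape or order.

```python
# Sketch of Step 2: per-frame energy, zero-crossing rate, and a simplified
# MFCC. Rectangular mel channels and p = 12 are assumptions of the example.
import numpy as np

def energy(frame):
    return np.sum(frame ** 2)

def zero_crossing_rate(frame):
    return np.sum(np.abs(np.diff(np.sign(frame)))) / (2 * (len(frame) - 1))

def mfcc(frame, fs=8000, p=12):
    X = np.fft.rfft(frame)                          # FFT: s(n) -> X(k)
    power = np.abs(X) ** 2
    M = np.arange(p + 1) / p * 2595 * np.log10(1 + (fs / 2.0) / 700.0)
    f = 700 * (10 ** (M / 2595) - 1)                # frequency-domain scale f_i
    edges = np.floor(f / (fs / 2) * (len(power) - 1)).astype(int)
    logE = np.empty(p)
    for j in range(p):                              # log energy per mel channel;
        lo, hi = edges[j], max(edges[j] + 1, edges[j + 1])
        logE[j] = np.log(power[lo:hi].mean() + 1e-12)  # mean = unit channel weight
    k = np.arange(p)
    dct = np.cos(np.pi / p * np.outer(k, k + 0.5))  # DCT-II of the log energies
    return dct @ logE
```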
Step 3: Silence detection
1. Energy threshold computation
Because audio energy differs greatly across environments, detecting silence with a single fixed energy threshold has significant limitations; the relative relation between speech energy and silence energy, however, is stable, so an adaptive threshold can be computed:
Threshold(E) = min(E) + 0.3 × [mean(E) − min(E)]
where Threshold(E) is the adaptive energy threshold, min(E) is the minimum frame energy, and mean(E) is the mean frame energy.
2. Silent segment detection
Compare the energy of each audio frame with the energy threshold Threshold(E); frames below the threshold are silent frames, and consecutive silent frames form a silent segment.
3. Zero-crossing rate threshold computation
Threshold(Zcr) = 0.5 × mean(Zcr_i), i ∈ {i | E_i < Threshold(E)}
where Threshold(Zcr) is the adaptive zero-crossing rate threshold and mean(Zcr_i) is the mean zero-crossing rate of the silent frames.
4. Zero-crossing rate correction of silent segments
Starting from both ends of each silent segment, check the zero-crossing rate of each frame in turn; a frame above the threshold is regarded as the start of a syllable or a trailing unvoiced sound and is shifted out of the silent segment.
5. Smoothing
Silent segments shorter than 10 frames (0.1 second) are regarded as short pauses within continuous speech and are discarded.
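The silence detection of this step can be sketched as follows, assuming per-frame energy and zero-crossing-rate arrays E and Zcr from Step 2; the run-collection logic and function names are illustrative, not the patent's.

```python
# Sketch of Step 3: adaptive thresholds, silent-segment extraction,
# zero-crossing-rate correction, and smoothing.
import numpy as np

def detect_silent_segments(E, Zcr, min_frames=10):
    thr_e = E.min() + 0.3 * (E.mean() - E.min())      # Threshold(E)
    silent = E < thr_e
    thr_z = 0.5 * Zcr[silent].mean()                  # Threshold(Zcr)
    # collect runs of consecutive silent frames as [start, end) segments
    segments, start = [], None
    for i, flag in enumerate(np.append(silent, False)):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append([start, i])
            start = None
    kept = []
    for seg in segments:
        # correction: shift high-ZCR frames (syllable onsets, trailing
        # unvoiced sounds) out of both ends of the silent segment
        while seg[0] < seg[1] and Zcr[seg[0]] > thr_z:
            seg[0] += 1
        while seg[1] > seg[0] and Zcr[seg[1] - 1] > thr_z:
            seg[1] -= 1
        if seg[1] - seg[0] >= min_frames:             # smoothing: drop short pauses
            kept.append(tuple(seg))
    return kept
```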
Step 4: Speech segmentation
After silence detection, the audio signal is divided into alternating silent and non-silent segments. To avoid interference from noise, non-silent segments longer than 10 seconds need to be segmented further.
1. Parameter estimation of the energy and zero-crossing rate distributions
Within each non-silent segment that needs further segmentation, using a window length of 50 frames and a window step of 10 frames, estimate the parameters of the χ²-type distribution of the energy and of the zero-crossing rate over the 50 frames in each window:
$$p(x) = \frac{x^a e^{-bx}\, b^{a+1}}{\Gamma(a+1)}, \quad x \ge 0$$
where a = μ²/σ² − 1, b = μ/σ², μ is the mean, and σ² is the variance.
2. Distance calculation
The distance for each window is defined as follows:
$$D(i) = 1 - \frac{\Gamma\!\left(\frac{a_{i-1}+a_{i+1}}{2}+1\right)}{\sqrt{\Gamma(a_{i-1}+1)\,\Gamma(a_{i+1}+1)}} \cdot \frac{2^{\frac{a_{i-1}+a_{i+1}}{2}+1}\; b_{i-1}^{\frac{a_{i+1}+1}{2}}\; b_{i+1}^{\frac{a_{i-1}+1}{2}}}{(b_{i-1}+b_{i+1})^{\frac{a_{i-1}+a_{i+1}}{2}+1}}$$
where a_{i−1}, a_{i+1}, b_{i−1}, b_{i+1} are the parameters a, b of the preceding and following windows.
3. Mutation detection
Within each window where the distance D(i) attains a local maximum, compute the same distance again for every frame; the frame with the largest distance is taken as the split point.
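The window distance and the mutation search can be sketched as below, under a Bhattacharyya-style reading of the distance formula above; scipy.special.gammaln is used for numerical stability, and the names are illustrative.

```python
# Sketch of Step 4: fit a gamma-type density to each 50-frame window
# (a = mu^2/sigma^2 - 1, b = mu/sigma^2) and compute the distance D(i)
# between the windows before and after each position, in log space.
import numpy as np
from scipy.special import gammaln

def fit_ab(x):
    mu, var = x.mean(), x.var()
    return mu * mu / var - 1.0, mu / var

def window_distance(a1, b1, a2, b2):
    m = (a1 + a2) / 2 + 1
    log_c = (gammaln(m) - 0.5 * (gammaln(a1 + 1) + gammaln(a2 + 1))
             + m * np.log(2)
             + 0.5 * ((a2 + 1) * np.log(b1) + (a1 + 1) * np.log(b2))
             - m * np.log(b1 + b2))
    return 1.0 - np.exp(log_c)

def mutation_distances(E, win=50, step=10):
    """D(i) for each interior window position of the feature sequence E."""
    params = [fit_ab(E[s:s + win]) for s in range(0, len(E) - win + 1, step)]
    return [window_distance(*params[i - 1], *params[i + 1])
            for i in range(1, len(params) - 1)]
```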
Step 5: Speech/non-speech classification
We classify each non-silent segment as speech or non-speech with a trained decision tree. Each node of the decision tree corresponds to one section feature. The topology of the decision tree is shown in Fig. 1.
The selected section features are as follows (a classification sketch follows the list):
A) High zero-crossing-rate ratio HZCRR:
$$HZCRR = \frac{1}{2N} \sum_{n=0}^{N-1} \left[\operatorname{sgn}\!\left(Zcr(n) - 1.5 \times \overline{Zcr}\right) + 1\right]$$
where the sum runs over the N frames of the section and the bar denotes the section mean. The HZCRR distribution of speech segments is centered higher than that of noise, music, etc. Sections with HZCRR above f_h are regarded as speech segments.
B) Low-energy ratio LRMSR:
$$LRMSR = \frac{1}{2N} \sum_{n=0}^{N-1} \left[\operatorname{sgn}\!\left(0.5 \times \overline{E} - E(n)\right) + 1\right]$$
The LRMSR distribution of speech segments is centered lower than that of noise, music, etc. Sections with LRMSR below f_l are regarded as speech segments.
C) Fundamental frequency MeanF
The fundamental frequency of an audio section can be estimated from the zero-crossing rate:
MeanF = max[Zcr(n)] × F/N
where F is the sampling frequency (8000 Hz) and N is the frame length (32 ms). The fundamental frequency of speech is distributed more narrowly than that of non-speech, so sections with fundamental frequency above f_f are regarded as non-speech segments.
D) Zero-crossing-free interval count NZCRR
NZCRR is defined as the number of intervals of frames with zero zero-crossing rate inside the section; consecutive zero-crossing-free frames are counted only once.
Speech generally contains no zero-crossing-free intervals, so sections with NZCRR below f_n are regarded as speech segments.
E) Energy variance VarRMS
VarRMS is defined as the variance of the energy within the section.
The energy variation of speech is much smaller than that of music etc., so sections with variance below f_v are regarded as speech segments.
F) The values of f_h, f_l, f_f, f_n, and f_v are all obtained by decision-tree training. The training corpus comprises about 30 minutes of speech and about 30 minutes of noise: the speech was recorded by 20 speakers (10 male, 10 female) in an office environment; the noise comprises 5 minutes of white noise, 5 minutes of Gaussian noise, 10 minutes of music, and 10 minutes of ambient noise.
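The classification sketch announced above follows. The feature computations mirror the definitions A) to E); the threshold defaults and the order of the tests are placeholders, since the trained values of f_h, f_l, f_f, f_n, f_v and the exact node topology of Fig. 1 are not reproduced here.

```python
# Sketch of Step 5: compute the five section features over the per-frame
# E and Zcr arrays of one non-silent section, then apply a cascade of
# threshold tests. Threshold defaults are placeholders, not trained values.
import numpy as np

def section_features(E, Zcr, fs=8000, frame_len=256):
    hzcrr = np.mean(np.sign(Zcr - 1.5 * Zcr.mean()) + 1) / 2
    lrmsr = np.mean(np.sign(0.5 * E.mean() - E) + 1) / 2
    mean_f = Zcr.max() * fs / frame_len              # ZCR-based pitch estimate
    z = Zcr == 0                                     # zero-crossing-free frames
    nzcrr = int(z[0]) + int(np.sum(z[1:] & ~z[:-1])) # count each run once
    var_rms = E.var()
    return hzcrr, lrmsr, mean_f, nzcrr, var_rms

def classify_section(E, Zcr, f_h=0.1, f_l=0.2, f_f=500.0, f_n=3, f_v=0.5):
    hzcrr, lrmsr, mean_f, nzcrr, var_rms = section_features(E, Zcr)
    if mean_f > f_f:                                 # wide pitch range: non-speech
        return "non-speech"
    if hzcrr > f_h and lrmsr < f_l and nzcrr < f_n and var_rms < f_v:
        return "speech"
    return "non-speech"
```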
Step 6: Speaker change detection
Each speaker's speech features form a specific distribution in feature space, and this distribution can describe the speaker's individuality. Different speakers' distributions differ, so speaker changes can be detected from the similarity between feature distributions. Here we use the T² distance to compute the MFCC feature distance between speech segments.
1. T² distance calculation
To detect speaker changes, the T² distance between every two adjacent speech segments must be computed. The T² distance is defined as follows:
$$T^2 = \frac{ab}{a+b} (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2)$$
where a and b are the lengths of the two segments, μ₁ and μ₂ are the means of the MFCC within each segment, and Σ is the common (pooled) covariance matrix.
2. Adaptive threshold computation
Comparing the T² distance with a threshold detects whether a speaker change exists. The adaptive threshold is computed as:
T = μ + λσ
where μ is the mean of the distances, σ is their standard deviation, and λ is a penalty coefficient, set here to −1.5.
3. Merging
If the distance between two speech segments is below the threshold, the two segments are regarded as belonging to the same speaker and can be merged into one. If silence lies between the two segments, that silence is merged as well. If non-speech lies between the two segments, they are not merged; this prevents interference from noise.
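The speaker-change test of this step can be sketched as follows, assuming MFCC matrices for two adjacent speech segments; the pooled estimate of the common covariance matrix and the function names are assumptions of the example.

```python
# Sketch of Step 6: Hotelling-style T^2 distance between adjacent speech
# segments, with the adaptive threshold T = mu + lambda*sigma and
# lambda = -1.5 as in the text.
import numpy as np

def t2_distance(X1, X2):
    """X1, X2: (frames, dim) MFCC matrices of two adjacent segments."""
    a, b = len(X1), len(X2)
    d = X1.mean(axis=0) - X2.mean(axis=0)
    S = ((a - 1) * np.cov(X1, rowvar=False) +         # pooled ("common")
         (b - 1) * np.cov(X2, rowvar=False)) / (a + b - 2)  # covariance
    return a * b / (a + b) * d @ np.linalg.solve(S, d)

def speaker_change_points(distances, lam=-1.5):
    d = np.asarray(distances)
    threshold = d.mean() + lam * d.std()              # adaptive threshold
    return np.flatnonzero(d >= threshold)             # change where d >= T
```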
Experimental results:
This method was tested on the 1997 Mandarin Broadcast News Speech Corpus (Hub4-NE). The corpus contains news broadcasts from CCTV, KAZN, and VOA, about 40 hours in total, of which about 10 hours are music or noise.
On the same corpus we also ran the segmentation method based on the Bayesian information criterion and the one based on the KL2 distance, for comparison with this method. Both of these methods search directly for speaker changes with the MFCC speaker features between fixed-length (1 second) windows.
The BIC-based method compares the likelihoods and the parameter counts of the parameter estimates under two hypotheses. Hypothesis 1: the two windows belong to the same speaker, and the features obey a single Gaussian distribution. Hypothesis 2: the two windows belong to different speakers, and the features obey two separate Gaussian distributions. If the Bayesian score of hypothesis 2 (the likelihood minus the parameter-count penalty) is higher, a speaker change is declared.
The KL2 distance is a method used for speaker segmentation: the KL2 distance between the speaker features of two speech sections is computed and compared with a threshold to detect a speaker change.
We evaluate the results of the segmentation algorithms in five respects:
1) Cut-point false detection rate: the proportion of detected cut points that are wrong;
2) Cut-point miss rate: the proportion of actual cut points that are not detected;
3) Pure speech ratio: the proportion of the total actual speech length covered by detected pure speech segments;
4) Speech segment recall: the proportion of actual speech segments that are detected;
5) Retrieval equal error rate: the value at which the false rejection rate equals the false acceptance rate in speaker retrieval.
A pure speech segment is defined as a speech segment that contains only one speaker's voice; a segment containing noise or several speakers' voices is impure. The pure speech ratio is the proportion of the total speech length taken up by pure speech segments. The speech segment recall is the proportion of pure speech segments that have a detected counterpart. These two indexes measure the effect of segmentation on speaker retrieval better than, and complement, the false detection and miss rates. The retrieval equal error rate is the equal error rate of a speaker retrieval experiment run on the segmentation results; this index measures the final effect of the segmentation algorithm. A sketch of the first two measures follows.
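For illustration, the first two measures can be computed as in this sketch; the 0.5-second matching tolerance is an assumption, since the text does not state how detected and actual cut points are paired.

```python
# Sketch of the cut-point false detection rate and miss rate, assuming a
# detected point matches an actual one when within `tol` seconds of it.
def cut_point_rates(actual, detected, tol=0.5):
    missed = [r for r in actual
              if all(abs(r - h) > tol for h in detected)]
    false_pts = [h for h in detected
                 if all(abs(r - h) > tol for r in actual)]
    false_rate = len(false_pts) / len(detected) if detected else 0.0
    miss_rate = len(missed) / len(actual) if actual else 0.0
    return false_rate, miss_rate
```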
The experimental results are as follows:

Algorithm      False detection rate   Miss rate   Pure speech ratio   Recall    Equal error rate
BIC            25.87%                 13.37%      72.39%              85.42%    15.91%
KL2            25.50%                 14.42%      71.69%              83.72%    25.84%
This method    26.48%                 10.47%      80.40%              89.29%    12.94%
The running time of each method is as follows:

Algorithm      Processing time (s)   Speed (min of audio / s)
BIC            2190                  1.08
KL2            1331                  1.78
This method    883                   2.69
The experimental machine was configured with an AMD Athlon(tm) XP 2500+ CPU and 512 MB of DDR400 memory.
The experimental results show that, compared with the BIC and KL2 methods, this segmentation method has the shortest running time and recovers the most and the longest pure speech; it can therefore best improve the performance of speaker retrieval, giving it the lowest equal error rate.

Claims (8)

1. An audio segmentation method based on a decision tree and speaker change detection, characterized in that: first, adaptive silence detection is used to find the silences in the audio, and the audio is coarsely segmented at those silences; then mutation detection is used to refine the segmentation, and a decision tree classifies the resulting audio fragments as speech or non-speech; finally, speaker change points are detected between speech fragments, and the final segmentation result is obtained from the speaker change points.
2. The audio segmentation method based on a decision tree and speaker change detection according to claim 1, characterized by comprising the steps of:
1) Audio preprocessing: the audio preprocessing consists of four parts: sampling and quantization, zero-drift removal, pre-emphasis, and windowing;
2) Audio feature extraction: feature extraction on each audio frame comprises the extraction of energy, zero-crossing rate, and Mel cepstral coefficients;
3) Silence detection: after dividing the audio into frames, compute the energy of each frame, and identify silence by an adaptive energy threshold and a time threshold;
4) Speech segmentation: after silence detection, the audio signal is divided into alternating silent and non-silent segments, and non-silent segments longer than 10 seconds are segmented further; the mutation detection locates change points by computing the distance between the distributions of energy and zero-crossing rate;
5) Speech/non-speech classification: classify each non-silent segment as speech or non-speech with a trained decision tree, each node of which corresponds to one section feature; the decision tree is a set of pre-trained decision rules, the section features of an audio fragment are tested against the corresponding rules in turn, and the final leaf of the tree determines the fragment type;
6) Speaker change detection: detect speaker changes from the similarity between feature distributions, that is, compare the distance between adjacent speech segments with an adaptive threshold to determine the speaker change points.
3. The audio segmentation method based on a decision tree and speaker change detection according to claim 2, characterized in that the concrete steps of the audio preprocessing are:
1) Sampling and quantization:
A) Filter the audio signal with a sharp anti-aliasing filter so that its Nyquist frequency F_N is 4 kHz;
B) Set the audio sampling rate F = 2F_N;
C) Sample the audio signal s_a(t) periodically to obtain the amplitude sequence of the digital audio signal s(n) = s_a(n/F);
D) Quantize s(n) by pulse code modulation (PCM) to obtain the quantized amplitude sequence s′(n);
2) Zero-drift removal:
A) Compute the mean of the quantized amplitude sequence;
B) Subtract the mean from each amplitude to obtain the zero-mean amplitude sequence s″(n);
3) Pre-emphasis:
A) Set the pre-emphasis factor α in the digital filter transfer function H(z) = 1 − αz⁻¹; α is taken as 1 or a value slightly smaller than 1;
B) Pass s″(n) through the digital filter to obtain an amplitude sequence s(n) whose high-, mid-, and low-frequency amplitudes are balanced;
4) Windowing:
A) Compute the frame length N (32 ms) and frame shift T (10 ms) of the audio frames, in samples, satisfying
$$\frac{N}{F} = 0.032, \qquad \frac{T}{F} = 0.010$$
where F is the audio sampling rate in Hz;
B) With frame length N and frame shift T, divide s(n) into a series of audio frames F_m, each containing N audio signal samples;
C) Compute the Hamming window function:
$$\omega(n) = 0.54 - 0.46\cos\frac{2\pi n}{N-1}, \quad n = 0, 1, \ldots, N-1$$
D) Apply the Hamming window to each audio frame F_m:
$$\omega(n) \times F_m(n) \Rightarrow \{F'_m(n) \mid n = 0, 1, \ldots, N-1\}.$$
4. The audio segmentation method based on a decision tree and speaker change detection according to claim 2, characterized in that the concrete steps of the audio feature extraction are:
1) Extraction of energy:
$$E = \sum_{n=1}^{N} s^2(n)$$
2) Extraction of the zero-crossing rate:
$$Zcr = \frac{1}{2(N-1)} \sum_{n=1}^{N-1} \left| \operatorname{sgn}\big(s(n+1)\big) - \operatorname{sgn}\big(s(n)\big) \right|$$
3) Extraction of the Mel cepstral coefficients (MFCC):
A) Set the order p of the Mel cepstral coefficients;
B) Apply the fast Fourier transform (FFT) to turn the time-domain signal s(n) into the frequency-domain signal X(k);
C) Compute the Mel-domain scale:
$$M_i = \frac{i}{p} \times 2595 \log_{10}\!\left(1 + \frac{8000/2.0}{700.0}\right), \quad i = 0, 1, 2, \ldots, p$$
D) Compute the corresponding frequency-domain scale:
$$f_i = 700 \times \left(e^{\frac{M_i \ln 10}{2595}} - 1\right), \quad i = 0, 1, 2, \ldots, p$$
E) Compute the logarithmic energy spectrum on each Mel-domain channel φ_j:
$$E_j = \sum_{k=0}^{K/2-1} \varphi_j(k)\,|X(k)|^2, \quad \text{where} \quad \sum_{k=0}^{K/2-1} \varphi_j(k) = 1;$$
F) Apply the discrete cosine transform (DCT).
5. The audio segmentation method based on a decision tree and speaker change detection according to claim 2, characterized in that the concrete steps of the silence detection are:
A. Compute the adaptive energy threshold:
Threshold(E) = min(E) + 0.3 × [mean(E) − min(E)]
where Threshold(E) is the adaptive energy threshold, min(E) is the minimum frame energy, and mean(E) is the mean frame energy;
B. Silent segment detection: compare the energy of each audio frame with the energy threshold Threshold(E); frames below the threshold are silent frames, and consecutive silent frames form a silent segment;
C. Zero-crossing rate threshold computation:
Threshold(Zcr) = 0.5 × mean(Zcr_i), i ∈ {i | E_i < Threshold(E)}
where Threshold(Zcr) is the adaptive zero-crossing rate threshold and mean(Zcr_i) is the mean zero-crossing rate of the silent frames;
D. Zero-crossing rate correction of silent segments: starting from both ends of each silent segment, check the zero-crossing rate of each frame in turn; a frame above the threshold is regarded as the start of a syllable or a trailing unvoiced sound and is shifted out of the silent segment;
E. Smoothing: silent segments shorter than 10 frames, i.e. 0.1 second, are regarded as short pauses within continuous speech and are discarded.
6. The audio segmentation method based on a decision tree and speaker change detection according to claim 2, characterized in that the concrete steps of the speech segmentation are:
A. Parameter estimation of the energy and zero-crossing rate distributions: within each non-silent segment that needs further segmentation, using a window length of 50 frames and a window step of 10 frames, estimate the parameters of the χ²-type distribution of the energy and of the zero-crossing rate over the 50 frames in each window:
$$p(x) = \frac{x^a e^{-bx}\, b^{a+1}}{\Gamma(a+1)}, \quad x \ge 0$$
where a = μ²/σ² − 1, b = μ/σ², μ is the mean, and σ² is the variance;
B. Distance calculation: the distance for each window is defined as follows:
$$D(i) = 1 - \frac{\Gamma\!\left(\frac{a_{i-1}+a_{i+1}}{2}+1\right)}{\sqrt{\Gamma(a_{i-1}+1)\,\Gamma(a_{i+1}+1)}} \cdot \frac{2^{\frac{a_{i-1}+a_{i+1}}{2}+1}\; b_{i-1}^{\frac{a_{i+1}+1}{2}}\; b_{i+1}^{\frac{a_{i-1}+1}{2}}}{(b_{i-1}+b_{i+1})^{\frac{a_{i-1}+a_{i+1}}{2}+1}}$$
where a_{i−1}, a_{i+1}, b_{i−1}, b_{i+1} are the parameters a, b of the preceding and following windows;
C. Mutation detection: within each window where the distance D(i) attains a local maximum, compute the same distance again for every frame; the frame with the largest distance is taken as the split point.
7. The audio segmentation method based on a decision tree and speaker change detection according to claim 2, characterized in that the section features selected in step 5) are as follows:
A) High zero-crossing-rate ratio HZCRR:
$$HZCRR = \frac{1}{2N} \sum_{n=0}^{N-1} \left[\operatorname{sgn}\!\left(Zcr(n) - 1.5 \times \overline{Zcr}\right) + 1\right]$$
the HZCRR distribution of speech segments is centered higher than that of noise, music, etc., and sections with HZCRR above f_h are regarded as speech segments;
B) Low-energy ratio LRMSR:
$$LRMSR = \frac{1}{2N} \sum_{n=0}^{N-1} \left[\operatorname{sgn}\!\left(0.5 \times \overline{E} - E(n)\right) + 1\right]$$
the LRMSR distribution of speech segments is centered lower than that of noise, music, etc., and sections with LRMSR below f_l are regarded as speech segments;
C) Fundamental frequency MeanF:
the fundamental frequency of an audio section is estimated from the zero-crossing rate:
MeanF = max[Zcr(n)] × F/N
where F is the sampling frequency (8000 Hz) and N is the frame length (32 ms); sections with fundamental frequency above f_f are regarded as non-speech segments;
D) Zero-crossing-free interval count NZCRR:
NZCRR is defined as the number of intervals of frames with zero zero-crossing rate inside the section, consecutive zero-crossing-free frames being counted only once; sections with NZCRR below f_n are regarded as speech segments;
E) Energy variance VarRMS:
VarRMS is defined as the variance of the energy within the section; sections with variance below f_v are regarded as speech segments;
F) The values of f_h, f_l, f_f, f_n, and f_v are all obtained by decision-tree training.
8. The audio segmentation method based on a decision tree and speaker change detection according to claim 2, characterized in that in step 6) the T² distance is used to compute the MFCC feature distance between speech segments;
1) The T² distance is defined as follows:
$$T^2 = \frac{ab}{a+b} (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2)$$
where a and b are the lengths of the two segments, μ₁ and μ₂ are the means of the MFCC within each segment, and Σ is the common (pooled) covariance matrix;
2) Adaptive threshold computation:
comparing the T² distance with a threshold detects whether a speaker change exists, and the adaptive threshold is computed as:
T = μ + λσ
where μ is the mean of the distances, σ is their standard deviation, and λ is a penalty coefficient;
3) Merging:
if the distance between two speech segments is below the threshold, the two segments are regarded as belonging to the same speaker and can be merged into one; if silence lies between the two segments, that silence is merged as well; if non-speech lies between the two segments, they are not merged.
CNB2005100508645A 2005-07-26 2005-07-26 Audio frequency splitting method for changing detection based on decision tree and speaking person Expired - Fee Related CN100505040C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2005100508645A CN100505040C (en) 2005-07-26 2005-07-26 Audio frequency splitting method for changing detection based on decision tree and speaking person


Publications (2)

Publication Number Publication Date
CN1716380A true CN1716380A (en) 2006-01-04
CN100505040C CN100505040C (en) 2009-06-24

Family

ID=35822152

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2005100508645A Expired - Fee Related CN100505040C (en) 2005-07-26 2005-07-26 Audio frequency splitting method for changing detection based on decision tree and speaking person

Country Status (1)

Country Link
CN (1) CN100505040C (en)

Cited By (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101399039B (en) * 2007-09-30 2011-05-11 华为技术有限公司 Method and device for determining non-noise audio signal classification
CN102543080A (en) * 2010-12-24 2012-07-04 索尼公司 Audio editing system and audio editing method
CN102543080B (en) * 2010-12-24 2016-12-14 索尼公司 audio editing system and audio editing method
CN102543063A (en) * 2011-12-07 2012-07-04 华南理工大学 Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers
CN103165127A (en) * 2011-12-15 2013-06-19 佳能株式会社 Sound segmentation equipment, sound segmentation method and sound detecting system
CN103165127B (en) * 2011-12-15 2015-07-22 佳能株式会社 Sound segmentation equipment, sound segmentation method and sound detecting system
CN103310787A (en) * 2012-03-07 2013-09-18 嘉兴学院 Abnormal sound rapid-detection method for building security
CN104143342B (en) * 2013-05-15 2016-08-17 腾讯科技(深圳)有限公司 A kind of pure and impure sound decision method, device and speech synthesis system
CN104143342A (en) * 2013-05-15 2014-11-12 腾讯科技(深圳)有限公司 Voiceless sound and voiced sound judging method and device and voice synthesizing system
WO2014183411A1 (en) * 2013-05-15 2014-11-20 Tencent Technology (Shenzhen) Company Limited Method, apparatus and speech synthesis system for classifying unvoiced and voiced sound
CN103405217A (en) * 2013-07-08 2013-11-27 上海昭鸣投资管理有限责任公司 System and method for multi-dimensional measurement of dysarthria based on real-time articulation modeling technology
CN103405217B (en) * 2013-07-08 2015-01-14 泰亿格电子(上海)有限公司 System and method for multi-dimensional measurement of dysarthria based on real-time articulation modeling technology
CN104347068A (en) * 2013-08-08 2015-02-11 索尼公司 Audio signal processing device, audio signal processing method and monitoring system
CN103796145B (en) * 2014-01-26 2017-01-11 深圳市微纳集成电路与***应用研究院 Auditory sense threshold value determining method and device and hearing aid
CN103996399A (en) * 2014-04-21 2014-08-20 深圳市北科瑞声科技有限公司 Voice detection method and system
CN103996399B (en) * 2014-04-21 2017-07-28 深圳市北科瑞声科技股份有限公司 Speech detection method and system
CN105825870A (en) * 2016-03-14 2016-08-03 江苏时间环三维科技有限公司 Voice instruction data obtaining method and device
CN105825870B (en) * 2016-03-14 2019-04-02 江苏时间环三维科技有限公司 A kind of voice command data acquisition methods and device
CN106548782A (en) * 2016-10-31 2017-03-29 维沃移动通信有限公司 The processing method and mobile terminal of acoustical signal
CN106504773A (en) * 2016-11-08 2017-03-15 上海贝生医疗设备有限公司 A kind of wearable device and voice and activities monitoring system
CN106782506A (en) * 2016-11-23 2017-05-31 语联网(武汉)信息技术有限公司 A kind of method that recorded audio is divided into section
WO2018113243A1 (en) * 2016-12-19 2018-06-28 平安科技(深圳)有限公司 Speech segmentation method, device and apparatus, and computer storage medium
CN108021675B (en) * 2017-12-07 2021-11-09 北京慧听科技有限公司 Automatic segmentation and alignment method for multi-equipment recording
CN108021675A (en) * 2017-12-07 2018-05-11 北京慧听科技有限公司 A kind of automatic segmentation alignment schemes of more equipment recording
CN108538312A (en) * 2018-04-28 2018-09-14 华中师范大学 Digital audio based on bayesian information criterion distorts a method for automatic positioning
WO2019227547A1 (en) * 2018-05-31 2019-12-05 平安科技(深圳)有限公司 Voice segmenting method and apparatus, and computer device and storage medium
CN109036386A (en) * 2018-09-14 2018-12-18 北京网众共创科技有限公司 A kind of method of speech processing and device
CN109389999A (en) * 2018-09-28 2019-02-26 北京亿幕信息技术有限公司 A kind of high performance audio-video is made pauses in reading unpunctuated ancient writings method and system automatically
CN109389999B (en) * 2018-09-28 2020-12-11 北京亿幕信息技术有限公司 High-performance audio and video automatic sentence-breaking method and system
CN109712641A (en) * 2018-12-24 2019-05-03 重庆第二师范学院 A kind of processing method of audio classification and segmentation based on support vector machines
CN109766929A (en) * 2018-12-24 2019-05-17 重庆第二师范学院 A kind of audio frequency classification method and system based on SVM
CN111883169B (en) * 2019-12-12 2021-11-23 马上消费金融股份有限公司 Audio file cutting position processing method and device
CN111883169A (en) * 2019-12-12 2020-11-03 马上消费金融股份有限公司 Audio file cutting position processing method and device
CN111312219A (en) * 2020-01-16 2020-06-19 上海携程国际旅行社有限公司 Telephone recording marking method, system, storage medium and electronic equipment
CN111312219B (en) * 2020-01-16 2023-11-28 上海携程国际旅行社有限公司 Telephone recording labeling method, system, storage medium and electronic equipment
CN111508498A (en) * 2020-04-09 2020-08-07 携程计算机技术(上海)有限公司 Conversational speech recognition method, system, electronic device and storage medium
CN111508498B (en) * 2020-04-09 2024-01-30 携程计算机技术(上海)有限公司 Conversational speech recognition method, conversational speech recognition system, electronic device, and storage medium
CN112614515A (en) * 2020-12-18 2021-04-06 广州虎牙科技有限公司 Audio processing method and device, electronic equipment and storage medium
CN112614515B (en) * 2020-12-18 2023-11-21 广州虎牙科技有限公司 Audio processing method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN100505040C (en) 2009-06-24

Similar Documents

Publication Publication Date Title
CN1716380A (en) Audio frequency splitting method for changing detection based on decision tree and speaking person
CN1758331A (en) Quick audio-frequency separating method based on tonic frequency
Likitha et al. Speech based human emotion recognition using MFCC
CN105161093B (en) A kind of method and system judging speaker's number
US9842608B2 (en) Automatic selective gain control of audio data for speech recognition
US9364669B2 (en) Automated method of classifying and suppressing noise in hearing devices
CN1758332A (en) Speaker recognition method based on MFCC linear emotion compensation
CN1787076A (en) Method for distinguishing speek person based on hybrid supporting vector machine
CN1787075A (en) Method for distinguishing speek speek person by supporting vector machine model basedon inserted GMM core
CN1773605A (en) Sound end detecting method for sound identifying system
CN1920947A (en) Voice/music detector for audio frequency coding with low bit ratio
CN1841500A (en) Method and apparatus for resisting noise based on adaptive nonlinear spectral subtraction
Ryant et al. Highly accurate mandarin tone classification in the absence of pitch information
CN1308911C (en) Method and system for identifying status of speaker
CN1758263A (en) Multi-model ID recognition method based on scoring difference weight compromised
JP5050698B2 (en) Voice processing apparatus and program
Sharma et al. Automatic identification of silence, unvoiced and voiced chunks in speech
CN101030374A (en) Method and apparatus for extracting base sound period
Abdullah et al. A discrete wavelet transform-based voice activity detection and noise classification with sub-band selection
Aibinu et al. Evaluating the effect of voice activity detection in isolated Yoruba word recognition system
CN1296887C (en) Training method for embedded automatic sound identification system
Mondal et al. Speech activity detection using time-frequency auditory spectral pattern
Sorin et al. The ETSI extended distributed speech recognition (DSR) standards: client side processing and tonal language recognition evaluation
Tu et al. Towards improving statistical model based voice activity detection
Tomchuk Spectral masking in MFCC calculation for noisy speech

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090624

Termination date: 20170726